Site Reliability Engineering

19 Apr 2017 - Build

When a digital product is launched to the public, two of the major concerns are performance and availability.

Is the product available? Can it accept change rapidly, responding to opportunities and threats? Does it scale to meet demand?

Google introduced the concept of Site Reliability Engineering (SRE) to tackle many of these questions. Whilst SRE is hard to define, Ben Treynor from Google describes it as:

What happens when a software engineer is tasked with what used
to be called operations.

The benefits to your product are easier to understand. SRE unlocks the ability to build and run robust, reliable technology platforms and services, growing your online business whilst increasing (and preserving) revenue.

Within 33 Bondi, we delegate as much responsibility as possible to Cloud providers, removing the need for dedicated teams managing traditional IT infrastructures and systems, with the sweet spot being SaaS based services (an important mechanism to unlock quality at speed).

For products requiring flexibility, an IaaS deployment may be unavoidable. When the base atomic units become storage, network and compute, an SRE team may be required. At 33 Bondi, we have developers on call who can respond to system failures (incidents) and recover, whilst monitoring performance to ensure the product operates within defined SLAs.
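
To make "operating within defined SLAs" concrete, here is a minimal sketch of the arithmetic behind an availability target. The function names, the 30-day period and the 99.9% figure are illustrative assumptions, not a description of any particular SLA or of tooling used at 33 Bondi.

```python
def allowed_downtime_minutes(sla_target: float, period_days: int = 30) -> float:
    """Return the minutes of downtime an availability target permits.

    sla_target is a fraction, e.g. 0.999 for "three nines" (assumed example).
    period_days is the measurement window (30 days is an assumption).
    """
    if not 0.0 < sla_target <= 1.0:
        raise ValueError("sla_target must be in the range (0, 1]")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - sla_target)


def within_sla(downtime_minutes: float, sla_target: float,
               period_days: int = 30) -> bool:
    """Return True if the observed downtime is within the permitted budget."""
    return downtime_minutes <= allowed_downtime_minutes(sla_target, period_days)
```

Under these assumptions, a 99.9% target over 30 days permits roughly 43 minutes of downtime, so a single 30-minute incident stays within the SLA whilst a 60-minute incident breaches it.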

After each incident, we conduct a review as part of our process for continuous improvement, ensuring that not only is your product getting better, but we are too.

  • Author name
    Jonas Allen