2024-01-31 16:23:01 -05:00

aliases: site reliability engineering
author: Jason Thistlethwaite
lastRevised: 2021-12-18

Site Reliability Engineering, aka SRE

A practice that originated at Google in 2003, SRE is a methodology for ensuring the reliability and performance of interoperating systems.

Much of this article is the personal opinion of our current CEO and co-founder, Jason.

Why

We see droves of people prepping for the end of the world in ways that probably won't help them, only to die in car crashes because they don't wear seatbelts. During COVID, we saw people panic-buy household supplies they didn't need, then lose their jobs.

Humans struggle to be honest with themselves and one another about how well prepared they truly are for disasters, and about how probable those disasters are.

SRE as a practice, although it comes from software engineering, works to identify and implement reliable solutions to probable disasters.

What

SRE as a practice focuses on the following four things:

  • Automating or eliminating repetitive work (also known as Toil) wherever doing so is cost-effective.

  • Avoiding the pursuit of more reliability than is strictly necessary. Defining what's necessary is a practice in itself.

  • Systems design with a bias toward reduction of risks to availability, latency, and efficiency.

  • Observability, meaning the ability to ask arbitrary questions about your system without having to know ahead of time what you would want to ask.
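As a rough illustration of the "cost-effective to automate" test in the first item above, the sketch below compares the one-off cost of building an automation against the time it saves per run. The function names, the hours-based cost model, and the one-year horizon are assumptions made for this example; they are not a formula the SRE practice prescribes.

```python
import math


def break_even_runs(build_hours: float,
                    manual_hours_per_run: float,
                    automated_hours_per_run: float = 0.0) -> float:
    """Number of runs after which the automation has paid for itself.

    Returns math.inf when automating saves no time per run.
    """
    saved_per_run = manual_hours_per_run - automated_hours_per_run
    if saved_per_run <= 0:
        return math.inf
    return build_hours / saved_per_run


def worth_automating(build_hours: float,
                     manual_hours_per_run: float,
                     runs_per_year: float,
                     horizon_years: float = 1.0,
                     automated_hours_per_run: float = 0.0) -> bool:
    """True if the task is expected to run often enough, within the
    chosen horizon, to reach the break-even point."""
    expected_runs = runs_per_year * horizon_years
    needed_runs = break_even_runs(
        build_hours, manual_hours_per_run, automated_hours_per_run
    )
    return expected_runs >= needed_runs


if __name__ == "__main__":
    # Hypothetical task: 30 minutes by hand, 8 hours to automate.
    print(break_even_runs(8, 0.5))
    print(worth_automating(8, 0.5, runs_per_year=52))  # weekly task
    print(worth_automating(8, 0.5, runs_per_year=4))   # quarterly task
```

A model this simple ignores maintenance of the automation and the cost of mistakes made by hand, so treat the result as a starting point for discussion rather than a verdict.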

How

SRE was designed for software systems, but like many such practices, it can be adapted for other uses. A proper SRE implementation includes the following aspects:

  1. Toil management, as the implementation of the first principle above.

  2. Defining and measuring reliability goals, such as SLIs, SLOs, and error budgets.

  3. Designing for and implementing observability.

  4. Defining, testing, and running an incident management process.

  5. Capacity planning.

  6. Change and release management.

  7. Chaos Engineering.
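To make item 2 more concrete, here is a minimal sketch of how an SLI, an SLO, and an error budget relate to one another. The function names, the event-counting SLI, and the 99.9% target are all assumptions chosen for illustration; real reliability goals depend on the system being measured.

```python
def sli(good_events: int, total_events: int) -> float:
    """Service level indicator: the fraction of events that were good."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget(slo_target: float, total_events: int) -> float:
    """Number of events allowed to fail while still meeting the SLO."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float,
                     good_events: int,
                     total_events: int) -> float:
    """Error budget left after subtracting the failures seen so far.

    A negative value means the SLO has been missed for this window.
    """
    bad_events = total_events - good_events
    return error_budget(slo_target, total_events) - bad_events


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 500 of which failed,
    # measured against an assumed SLO of 99.9% success.
    total, good, target = 1_000_000, 999_500, 0.999
    print(f"SLI: {sli(good, total):.4%}")
    print(f"Budget: {error_budget(target, total):.0f} failed requests")
    print(f"Remaining: {budget_remaining(target, good, total):.0f}")
```

The design choice worth noting is that the budget is expressed in the same unit as the failures (here, requests), which makes it easy to see how much room is left before the goal is missed.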

Toil Management

Toil management is the practice of regularly identifying Toil in the business and ensuring there is steady progress in reducing it. LDR implements this in a few different ways: