Site Reliability Engineering book

Google's foundational book on running production systems — SLOs, error budgets, monitoring, incident response, postmortems.

Beyer, Jones, Petoff, Murphy (eds.) · 2016 · platform

why it matters

Codified the SRE practice that most modern operations teams now use as a baseline. Even if you don't work at Google scale, the vocabulary (SLI/SLO/SLA, error budget, toil, golden signals) is now the lingua franca of running production. Free online.

key ideas

SLO + error budget = how to negotiate reliability investment with product teams: spend the budget on release velocity while it lasts, and slow or freeze launches once it is exhausted
Toil is manual, repetitive, automatable work that scales linearly as the service grows; the engineering goal is to automate it away (the book caps it at 50% of an SRE's time), not just do it faster
Four golden signals: latency, traffic, errors, saturation — minimum viable monitoring
Blameless postmortems: optimize for organizational learning, not for assigning fault
Addressing cascading failures via deadlines and timeouts, retries with budgets, load shedding, graceful degradation
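
The SLO and error budget idea above reduces to simple arithmetic. A minimal sketch, assuming a request-based availability SLO; the function name `error_budget`, its result fields, and the freeze-when-exhausted policy are illustrative assumptions, not something the book prescribes in this form:

```python
def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget status for a request-based availability SLO.

    slo is a fraction such as 0.999 (meaning 99.9% of requests should succeed).
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the number of requests allowed to fail in the window.
    allowed = (1.0 - slo) * total_requests
    remaining = allowed - failed_requests
    return {
        "allowed_failures": allowed,
        "remaining": remaining,
        "consumed_fraction": failed_requests / allowed,
        # Hypothetical policy: stop risky launches once the budget is spent.
        "release_freeze": remaining <= 0,
    }


if __name__ == "__main__":
    status = error_budget(slo=0.999, total_requests=1_000_000, failed_requests=250)
    print(status)
```

With a 99.9% SLO over 1,000,000 requests, roughly 1,000 failures are allowed, so 250 failures consume about a quarter of the budget and leave room to keep shipping.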

memorable framings

'Hope is not a strategy.'
100% reliability is the wrong target: for any given service, aim for the level beyond which the next 9 isn't worth its cost
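
To make the cost of "the next 9" concrete, here is a rough sketch of how much downtime each availability level allows. It assumes a 30-day window and time-based availability; the function name `allowed_downtime_minutes` is hypothetical:

```python
def allowed_downtime_minutes(nines: int, window_days: int = 30) -> float:
    """Return the downtime (in minutes) permitted by an availability of N nines.

    nines=3 means 99.9% availability, nines=4 means 99.99%, and so on.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** (-nines)      # e.g. 0.001 for three nines
    window_minutes = window_days * 24 * 60
    return unavailability * window_minutes


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} nines: {allowed_downtime_minutes(n):.2f} min per 30 days")
```

Each extra 9 cuts the allowance by a factor of ten: three nines leaves about 43 minutes per 30 days, four nines about 4 minutes, which is why the marginal 9 gets expensive quickly.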

who should read it

Anyone responsible for production. Skim the org-specific chapters; read the operational ones carefully. The companion volume, The Site Reliability Workbook, is more practical and worth pairing with this.

references:

SRE Book (free online)