Site Reliability Engineering book

Google's foundational book on running production systems — SLOs, error budgets, monitoring, incident response, postmortems.

Beyer, Jones, Petoff, Murphy (eds.) · 2016 · platform

why it matters

Codified the SRE practice that most modern operations teams now use as a baseline. Even if you don't work at Google scale, the vocabulary (SLI/SLO/SLA, error budget, toil, golden signals) is now the lingua franca of running production. Free online.

key ideas

  • SLO + error budget = how to negotiate reliability investment with product teams: spend the budget on velocity, save it when reliability erodes
  • Toil is manual, repetitive, automatable work that scales linearly with your service; the engineering goal is to eliminate it, not just do it faster
  • Four golden signals: latency, traffic, errors, saturation — minimum viable monitoring
  • Blameless postmortems: optimize for organizational learning, not for assigning fault
  • Containing cascading failures via timeouts, retries with budgets, load shedding, and graceful degradation
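
To make "retries with budgets" concrete, here is a minimal sketch in Python. It is an illustration, not code from the book: the names `RetryBudget` and `call_with_budget`, the 10% ratio, and the three-attempt cap are all assumptions chosen for the example. The idea is that a client retries only while retries remain a small share of its total traffic, so a struggling backend is not buried under amplified load.

```python
from dataclasses import dataclass


@dataclass
class RetryBudget:
    """Allow retries only while they stay under a fixed share of all requests.

    Hypothetical helper for illustration; the default ratio is an assumption.
    """
    max_retry_ratio: float = 0.1  # illustrative value, tune per service
    requests: int = 0             # first attempts seen so far
    retries: int = 0              # retries granted so far

    def record_request(self) -> None:
        self.requests += 1

    def try_acquire_retry(self) -> bool:
        # Grant the retry only if it keeps retries within the budget.
        if self.requests == 0:
            return False
        if (self.retries + 1) / self.requests > self.max_retry_ratio:
            return False
        self.retries += 1
        return True


def call_with_budget(op, budget: RetryBudget, max_attempts: int = 3):
    """Run op(); retry on exception only while attempts and budget allow."""
    budget.record_request()
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except Exception:
            # Give up when the per-request cap or the shared budget is spent.
            if attempt == max_attempts or not budget.try_acquire_retry():
                raise
```

One budget object would typically be shared across all calls from a client to a given backend, so that the ratio reflects overall traffic rather than a single request. When the backend is healthy, occasional failures get retried; when most calls fail, the budget runs out quickly and errors surface instead of multiplying load.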

memorable framings

  • 'Hope is not a strategy.'
  • 100% reliability is the wrong target: for any given service, the right target is the point past which the next 9 costs more than it is worth
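
The arithmetic behind error budgets and "the next 9" is simple enough to sketch. The Python below is an illustrative calculation, not something taken from the book: the function names and the 30-day window are assumptions. The error budget is one minus the SLO, and each additional 9 cuts the allowed unavailability by a factor of ten.

```python
def error_budget(slo: float) -> float:
    """Fraction of requests (or time) allowed to fail under the SLO."""
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage the budget permits over the window."""
    return error_budget(slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Share of the error budget still unspent (negative once overspent)."""
    allowed_failures = error_budget(slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Each extra 9 shrinks the permitted downtime tenfold.
for slo in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{slo:.3%} -> {allowed_downtime_minutes(slo):.2f} min per 30 days")
```

Under these assumptions, a 99.9% SLO leaves roughly 43 minutes of downtime per 30 days, and 99.99% leaves a little over 4. A service with a 99.9% SLO that has failed 250 of 1,000,000 requests has spent about a quarter of its budget, which is the number a team would weigh when deciding whether to ship or to stabilize.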

who should read it

Anyone responsible for production. Skim the org-specific chapters; read the operational ones carefully. The companion volume, The Site Reliability Workbook, is more practical and worth pairing with this.
