Site Reliability Engineering book
Google's foundational book on running production systems — SLOs, error budgets, monitoring, incident response, postmortems.
why it matters
Codified the SRE practice that most modern operations teams use as a baseline. Even if you don't work at Google scale, the vocabulary (SLI/SLO/SLA, error budget, toil, golden signals) is now the lingua franca of running production. The full text is free to read online.
key ideas
- SLO + error budget = how to negotiate reliability investment with product teams: while budget remains, spend it on release velocity; when reliability erodes and the budget runs low, slow down and put the effort into reliability
- Toil is manual, repetitive, automatable work that scales linearly with your service; the engineering goal is to eliminate it, not just do it faster
- Four golden signals: latency, traffic, errors, saturation — minimum viable monitoring
- Blameless postmortems: optimize for organizational learning, not for assigning fault
- Preventing cascading failures: timeouts, retries with budgets, load shedding, graceful degradation
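
The error budget idea in the first bullet reduces to simple arithmetic, sketched below in Python. This is an illustration, not code from the book: the function name `error_budget_status`, the request-based SLO, and the "keep shipping while budget remains" rule are all assumptions made for the example.

```python
def error_budget_status(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget consumption for one SLO window (illustrative only)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests <= 0 or failed_requests < 0:
        raise ValueError("total_requests must be positive, failed_requests non-negative")

    # The budget is the number of failed requests the SLO tolerates in this window.
    allowed_failures = (1.0 - slo) * total_requests
    remaining = allowed_failures - failed_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining": remaining,
        "consumed_fraction": failed_requests / allowed_failures,
        # Assumed policy for this sketch: ship freely only while budget remains.
        "can_spend_on_velocity": remaining > 0,
    }


# A 99.9% SLO over one million requests tolerates about 1,000 failures.
status = error_budget_status(slo=0.999, total_requests=1_000_000, failed_requests=400)
print(status)
```

With 400 failures against a budget of roughly 1,000, about 40% of the budget is consumed, so under the assumed policy the team can keep releasing. At 1,500 failures the budget is overspent and the same check returns False.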
memorable framings
- 'Hope is not a strategy.'
- 100% reliability is the wrong target: for any given service, the right target is the point past which the next 9 costs more than it is worth
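
To make the cost of "the next 9" concrete, here is a small Python sketch that converts a number of nines into allowed downtime. The 30-day window and the name `allowed_downtime_minutes` are assumptions chosen for illustration; the book does not prescribe this code.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # assumed window: 30 days = 43,200 minutes


def allowed_downtime_minutes(nines: int, window_minutes: float = MINUTES_PER_30_DAYS) -> float:
    """Downtime permitted by an availability target of `nines` nines (2 means 99%)."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    # Unavailability for N nines is 10^-N, e.g. 0.001 for 99.9%.
    return window_minutes * 10.0 ** -nines


for nines in range(2, 6):
    availability_pct = 100 * (1 - 10.0 ** -nines)
    minutes = allowed_downtime_minutes(nines)
    print(f"{availability_pct:.3f}% -> {minutes:.2f} minutes per 30 days")
```

Each extra 9 cuts the allowance by a factor of ten: three nines leaves about 43 minutes per 30 days, four nines about 4 minutes. That shrinking margin is why the next 9 eventually costs more than it returns.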
who should read it
Anyone responsible for production. Skim the org-specific chapters; read the operational ones carefully. The companion volume, The Site Reliability Workbook, is more practical and worth pairing with this.