cascading failure pattern

One service fails, the services calling it back up, their callers back up, and the whole graph stops. The first failure isn't the problem — the lack of isolation is.

One service fails, the services calling it back up, their callers back up, and the whole graph stops. The first failure isn't the problem — the lack of isolation is.

symptoms

multiple unrelated services alerting at once
queues filling up across the graph
saturation spreading upstream from one root cause
everything red on the dashboard

causes

no timeouts or unbounded retries
shared resource contention
no bulkheads between callers
synchronous deep call chains

fixes

timeouts on every call
circuit breakers
bulkheads / pool isolation
load shedding at the edge
backpressure

you might say

the failure cascaded through everything
one bad pod took out the cluster
everything caught fire at once

related

aliases: cascade failure, domino failure

topics: failure-modes, distributed-systems

references:

SRE Book — Addressing Cascading Failures