cascading failure pattern
One service fails, the services calling it back up, their callers back up, and the whole graph stops. The first failure isn't the problem — the lack of isolation is.
One service fails, the services calling it back up, their callers back up, and the whole graph stops. The first failure isn't the problem — the lack of isolation is.
symptoms
- multiple unrelated services alerting at once
- queues filling up across the graph
- saturation spreading upstream from one root cause
- everything red on the dashboard
causes
- no timeouts or unbounded retries
- shared resource contention
- no bulkheads between callers
- synchronous deep call chains
fixes
- timeouts on every call
- circuit breakers
- bulkheads / pool isolation
- load shedding at the edge
- backpressure
you might say
- the failure cascaded through everything
- one bad pod took out the cluster
- everything caught fire at once