recall

← recall

cascading failure pattern

One service fails, the services calling it back up, their callers back up, and the whole graph stops. The first failure isn't the problem — the lack of isolation is.

One service fails, the services calling it back up, their callers back up, and the whole graph stops. The first failure isn't the problem — the lack of isolation is.

symptoms

  • multiple unrelated services alerting at once
  • queues filling up across the graph
  • saturation spreading upstream from one root cause
  • everything red on the dashboard

causes

  • no timeouts or unbounded retries
  • shared resource contention
  • no bulkheads between callers
  • synchronous deep call chains

fixes

  • timeouts on every call
  • circuit breakers
  • bulkheads / pool isolation
  • load shedding at the edge
  • backpressure

you might say

  • the failure cascaded through everything
  • one bad pod took out the cluster
  • everything caught fire at once

related

aliases: cascade failure, domino failure

topics: failure-modes, distributed-systems

references: