metastable failure pattern

The system runs fine until a small trigger pushes it into an overloaded state, and it stays there even after the trigger is gone. Returning load to its normal level does not bring recovery, because the overload now sustains itself; you have to break the feedback cycle.

symptoms

  • error rate stays high after load returns to normal
  • draining or restart is the only fix
  • retries / queues sustain the overload themselves

causes

  • positive feedback from retries during overload
  • warm caches replaced by cold ones under churn
  • queues growing faster than drain
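
The retry feedback loop can be sketched with a toy simulation. Everything here is an illustrative assumption rather than a measured model: the function name simulate, the numbers, and especially the rule that goodput falls to capacity**2 / offered once offered load exceeds capacity. The sketch runs a short traffic spike and then compares where the offered load ends up with and without retries.

```python
def simulate(retry_fraction, steps=60, capacity=100.0, base_load=80.0,
             spike_load=200.0, spike_start=10, spike_end=13):
    """Return the offered load per step for a toy retry-feedback model."""
    retries = 0.0
    offered_history = []
    for t in range(steps):
        # The trigger: a short spike of new requests, then back to normal.
        in_spike = spike_start <= t < spike_end
        new_requests = spike_load if in_spike else base_load
        offered = new_requests + retries
        if offered <= capacity:
            served = offered
        else:
            # Toy assumption: past capacity, work is wasted on requests
            # that time out, so goodput shrinks as offered load grows.
            served = capacity * capacity / offered
        failed = offered - served
        # A fraction of failed requests comes back as retries next step.
        retries = failed * retry_fraction
        offered_history.append(offered)
    return offered_history


with_retries = simulate(retry_fraction=0.9)
without_retries = simulate(retry_fraction=0.0)
print(f"final offered load with retries:    {with_retries[-1]:.0f}")
print(f"final offered load without retries: {without_retries[-1]:.0f}")
```

In this model the base load of 80 sits below the capacity of 100 in both runs. Without retries the system returns to 80 as soon as the spike ends; with retries the offered load stays above capacity long after the spike, which is the stuck state described above.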

fixes

  • aggressive load shedding
  • retry budgets / token buckets
  • exponential backoff with jitter
  • circuit breakers that stay open until truly recovered
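
Two of the fixes above, retry budgets and exponential backoff with jitter, can be sketched together. This is a minimal sketch under stated assumptions, not a reference implementation: RetryBudget, backoff_delay, call_with_retries, and all default values are hypothetical names and numbers chosen for illustration. The budget is a token bucket that earns a fraction of a token per first-attempt request, so retries stay a bounded share of normal traffic, and the delay uses the "full jitter" variant (uniform between zero and the capped exponential).

```python
import random
import time


class RetryBudget:
    """Token bucket that caps retries at a fraction of regular traffic."""

    def __init__(self, ratio=0.1, max_tokens=10.0):
        self.ratio = ratio            # tokens earned per first-attempt request
        self.max_tokens = max_tokens  # cap so idle periods cannot bank retries
        self.tokens = max_tokens

    def record_request(self):
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_spend(self):
        """Take one token for a retry; return False if the budget is empty."""
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random):
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_retries(operation, budget, max_attempts=4, sleep=time.sleep):
    """Call operation(), retrying only while the shared budget allows it."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not budget.try_spend():
                raise  # give up instead of adding load to a struggling system
            sleep(backoff_delay(attempt))
```

Sharing one RetryBudget across all callers of a dependency is the point of the design: during an outage the bucket empties quickly and most failures are surfaced immediately, so retries cannot multiply the load the way they do in the feedback loop listed under causes.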

you might say

  • it won't recover even though traffic dropped
  • we had to drain it to get it back
  • it's stuck in a bad state

related

aliases: metastability, stuck overload

topics: failure-modes, distributed-systems