metastable failure pattern

The system is fine, then a small trigger pushes it into an overloaded state, and it stays there even after the trigger is gone. Bringing load back to normal doesn't restore it; you have to break the feedback loop that sustains the overload.

symptoms

error rate stays high after load returns to normal
draining or restarting is the only fix
retries / queues sustain the overload themselves

causes

positive feedback from retries during overload
warm caches replaced by cold ones under churn
queues growing faster than they drain
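
The retry and queue causes above can be illustrated with a toy simulation. This is a minimal sketch, not a model taken from the referenced paper: the function simulate and all of its parameters (base_rate, capacity, timeout, max_retries, and the spike settings) are hypothetical, and it assumes a single FIFO server with a fixed service rate whose clients time out and retry while their stale requests stay queued.

```python
from collections import defaultdict


def simulate(ticks=200, base_rate=80, capacity=100, timeout=2,
             max_retries=3, spike_at=10, spike_len=5, spike_rate=400):
    """Toy FIFO server; returns one (tick, offered, backlog, doomed) per tick."""
    backlog = 0  # requests queued at the server, including stale ones
    # due[t][k]: number of attempts with retry index k that arrive at tick t
    due = defaultdict(lambda: [0] * (max_retries + 1))
    history = []
    for t in range(ticks):
        in_spike = spike_at <= t < spike_at + spike_len
        arrivals = due.pop(t, [0] * (max_retries + 1))
        arrivals[0] += spike_rate if in_spike else base_rate
        # FIFO with a fixed service rate: the wait is set by the backlog ahead.
        doomed = backlog / capacity > timeout
        if doomed:
            # Clients give up after `timeout` ticks and retry, but the stale
            # request stays queued and still consumes server capacity.
            for attempt, count in enumerate(arrivals):
                if attempt < max_retries:
                    due[t + timeout][attempt + 1] += count
        offered = sum(arrivals)
        backlog = max(0, backlog + offered - capacity)
        history.append((t, offered, backlog, doomed))
    return history


with_retries = simulate(max_retries=3)
no_retries = simulate(max_retries=0)
print("with retries, last tick:", with_retries[-1])
print("no retries,   last tick:", no_retries[-1])
```

Under these assumed numbers the base load (80 per tick) is below capacity (100 per tick), so without retries the backlog left by the spike drains on its own. With retries, every timed-out request adds more attempts, offered load stays above capacity long after the spike ends, and the backlog never drains.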

fixes

aggressive load shedding
retry budgets / token buckets
exponential backoff with jitter
circuit breakers that stay open until the backend has demonstrably recovered
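
Two of the fixes above, a retry budget and exponential backoff with jitter, can be sketched together on the client side. This is an illustrative sketch under stated assumptions, not a reference implementation: RetryBudget, backoff_delay, call_with_retries, and the default values (ratio, max_tokens, base, cap, max_attempts) are all hypothetical names and numbers chosen for the example. The budget is a token bucket in which each first attempt earns a fraction of a token and each retry spends a whole one, so retries can only add a bounded share of extra load.

```python
import random
import time


class RetryBudget:
    """Token bucket: first attempts earn `ratio` tokens, each retry spends one."""

    def __init__(self, ratio=0.1, max_tokens=10.0):
        self.ratio = ratio            # assumed: about 10% extra load from retries
        self.max_tokens = max_tokens  # cap on saved-up retry credit
        self.tokens = max_tokens      # start full so early failures can retry

    def record_request(self):
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_spend(self):
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def backoff_delay(attempt, base=0.1, cap=10.0):
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_retries(op, budget, max_attempts=4, sleep=time.sleep):
    """Call op(); retry on failure only while the shared budget allows it."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            out_of_attempts = attempt == max_attempts - 1
            if out_of_attempts or not budget.try_spend():
                raise  # fail fast instead of adding load to a struggling backend
            sleep(backoff_delay(attempt))
```

One budget would typically be shared by all calls from a client to the same backend. When the backend is healthy the bucket stays full and occasional failures are retried; during sustained overload the bucket empties and most calls fail after a single attempt, which removes the retry feedback loop. The jitter spreads the remaining retries out so they do not arrive in synchronized waves.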

you might say

it won't recover even though traffic dropped
we had to drain it to get it back
it's stuck in a bad state

related

aliases: metastability, stuck overload

topics: failure-modes, distributed-systems

references:

Metastable Failures in Distributed Systems (Bronson et al., HotOS 2021)