metastable failure pattern
The system is fine, then a small trigger pushes it into an overloaded state, and it stays there even after the trigger is gone. Bringing load back to normal doesn't recover it; you have to break the feedback cycle that sustains the overload.
symptoms
- error rate stays high after load returns to normal
- draining or restarting is the only fix
- retries / queues sustain the overload themselves
causes
- positive feedback from retries during overload
- warm caches replaced by cold ones under churn
- queues growing faster than they drain
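
A toy simulation can make the retry feedback loop concrete. This is a minimal sketch under made-up assumptions, not a model of any real system: one queue, a fixed capacity per tick, and clients that each add a fixed number of retries once queueing delay exceeds their timeout. The function name, parameters, and numbers are all hypothetical. Cold caches are not modeled.

```python
def simulate(ticks, capacity=100, fresh=80, spike_ticks=range(5, 8),
             spike_extra=400, timeout=1.0, retries_per_request=2):
    """Return the queue depth after each tick for a toy single-queue service.

    All parameters are illustrative assumptions:
      capacity            requests the service can finish per tick
      fresh               new requests arriving per tick (below capacity)
      spike_ticks/extra   the short-lived trigger
      timeout             queueing delay (in ticks) after which clients give up
      retries_per_request extra attempts each client sends once it times out
    """
    queue = 0
    overloaded = False  # True once queueing delay exceeds the client timeout
    depths = []
    for t in range(ticks):
        arrivals = fresh
        if t in spike_ticks:
            arrivals += spike_extra                   # the trigger
        if overloaded:
            arrivals += fresh * retries_per_request   # timed-out clients retry
        queue += arrivals
        queue -= min(queue, capacity)                 # serve up to capacity
        overloaded = queue / capacity > timeout
        depths.append(queue)
    return depths


with_retries = simulate(80)                        # retries on
no_retries = simulate(80, retries_per_request=0)   # same spike, no retries

# The spike ends after tick 7 in both runs.
print("with retries, tick 7 vs tick 79:", with_retries[7], with_retries[79])
print("no retries,   tick 7 vs tick 79:", no_retries[7], no_retries[79])
```

With retries on, the queue keeps growing after the spike is over, because retry traffic alone exceeds capacity. With retries off, the same spike drains away on its own. In this toy model the retries, not the trigger, are what hold the system in the bad state.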
fixes
- aggressive load shedding
- retry budgets / token buckets
- exponential backoff with jitter
- circuit breakers that stay open until truly recovered
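
Two of the fixes above, a retry budget built on a token bucket and exponential backoff with jitter, can be sketched roughly as follows. This is an illustration only: `RetryBudget`, `backoff_delay`, and `call_with_retries` are hypothetical names, the default values are arbitrary, and the jitter shown is the "full jitter" variant (a uniform delay between zero and the capped exponential). Load shedding and circuit breakers are not shown.

```python
import random
import time


class RetryBudget:
    """Token bucket: each first attempt deposits `ratio` tokens, each retry spends one."""

    def __init__(self, ratio=0.1, max_tokens=10.0):
        self.ratio = ratio            # retries allowed per first attempt (assumed value)
        self.max_tokens = max_tokens  # cap, so an idle period can't bank unlimited retries
        self.tokens = 0.0

    def record_request(self):
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_spend(self):
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random):
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_retries(op, budget, max_attempts=3, sleep=time.sleep):
    """Call op(); retry on failure only while the shared budget has tokens."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:  # a real client would retry only retryable errors
            out_of_attempts = attempt == max_attempts - 1
            if out_of_attempts or not budget.try_spend():
                raise      # fail fast instead of adding load
            sleep(backoff_delay(attempt))
```

The budget is meant to be shared across all callers of the same backend, so during an outage total retry traffic stays near `ratio` times the normal request rate instead of multiplying it. The jitter spreads out the retries that do happen so they don't arrive in synchronized waves.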
you might say
- it won't recover even though traffic dropped
- we had to drain it to get it back
- it's stuck in a bad state