recall

← recall

Metastable Failures (paper) conceptpaper

Bronson, Aghayev, Charapko, Zhu (HotOS), 2021.

Bronson, Aghayev, Charapko, Zhu (HotOS), 2021. Names and characterizes a failure mode the industry kept rediscovering without a label: a system in steady state hits a temporary trigger, enters a degraded state, and stays there even after the trigger goes away. The cycle is sustained by the system's own behavior — retries, queue growth, cache thrashing. Distinct from cascading failure (which propagates) and overload (which goes away when load drops). The paper gave the industry vocabulary for incidents that 'should have recovered but didn't,' and for prescribing real fixes (retry budgets, hedged requests, circuit breakers that stay open until truly recovered).

see also

aliases: Metastable Failures in Distributed Systems

topics: distributed-systems, failure-modes, resilience

references: