recall

← recall

Gray Failure (paper) conceptpaper

Huang et al.

Huang et al. (Microsoft Research), 2017. Names the failure mode where a component is partially broken — degraded but not down — and the system's failure detector says it's healthy. Common causes: slow disk, memory pressure, network jitter that's worse than 'down' because it confuses retries. The paper argues that distributed systems need to treat 'is it healthy?' as a much harder question than 'is it responding?' Influenced the design of modern health checks, hedged requests, and outlier detection in load balancers.

see also

aliases: Gray Failure: The Achilles Heel of Cloud-Scale Systems

topics: distributed-systems, failure-modes, observability

references: