recall

← recall

retry amplification pattern

A downstream service slows down. Callers retry. Their callers retry. The retries multiply at every layer until the original service is buried in N times its real load — exactly when it can least handle it.

A downstream service slows down. Callers retry. Their callers retry. The retries multiply at every layer until the original service is buried in N times its real load — exactly when it can least handle it.

symptoms

  • incoming RPS suddenly spikes during a partial outage
  • load on a degraded service goes UP after errors start
  • fixing the root cause doesn't immediately recover the service

causes

  • unbounded retries at every layer
  • no retry budgets
  • synchronized backoff with no jitter
  • client-side retry on top of library retry on top of platform retry

fixes

  • retry budgets (cap retries as a fraction of requests)
  • exponential backoff with jitter
  • retry only at the top layer
  • circuit breakers

you might say

  • the retries are killing it
  • we're DDoSing ourselves
  • retry storm

related

aliases: retry storm

topics: failure-modes, resilience

references: