retry amplification pattern
A downstream service slows down. Callers retry. Their callers retry. The retries multiply at every layer until the original service is buried in N times its real load — exactly when it can least handle it.
A downstream service slows down. Callers retry. Their callers retry. The retries multiply at every layer until the original service is buried in N times its real load — exactly when it can least handle it.
symptoms
- incoming RPS suddenly spikes during a partial outage
- load on a degraded service goes UP after errors start
- fixing the root cause doesn't immediately recover the service
causes
- unbounded retries at every layer
- no retry budgets
- synchronized backoff with no jitter
- client-side retry on top of library retry on top of platform retry
fixes
- retry budgets (cap retries as a fraction of requests)
- exponential backoff with jitter
- retry only at the top layer
- circuit breakers
you might say
- the retries are killing it
- we're DDoSing ourselves
- retry storm