Retry with Exponential Backoff Explained: Handling Transient Failures Gracefully

How retry with exponential backoff works — jitter, max retries, idempotency requirements, and why naive retries cause thundering herd failures.

retry, exponential-backoff, resilience, distributed-systems, fault-tolerance

Retry with Exponential Backoff

Retry with exponential backoff is a fault-tolerance strategy where failed operations are retried after progressively longer wait intervals, giving overloaded or temporarily unavailable services time to recover without being overwhelmed by retry storms.

What It Really Means

In distributed systems, transient failures are normal. A network packet gets dropped. A database connection times out. A downstream service restarts during a deployment. These failures are temporary — retrying the same request a few seconds later often succeeds.

Naive retrying — retry immediately, as fast as possible — makes things worse. If a service is overloaded and 1,000 clients retry simultaneously, those 1,000 retries add even more load, preventing recovery. This is the thundering herd problem: the retries themselves become the cause of the continued failure.

Exponential backoff solves this by increasing the wait time between retries exponentially: wait 1 second, then 2, then 4, then 8. This gives the failing service progressively more time to recover. Adding random jitter (randomizing the wait time within a range) prevents all clients from retrying at the same moment, further spreading the load.

How It Works in Practice

The Algorithm
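
In outline:

  1. Attempt the operation.
  2. On a retryable failure, compute the next delay: delay = min(cap, base * 2^attempt).
  3. Apply jitter to the delay (see the next two sections).
  4. Sleep for the delay, then retry. After a fixed maximum number of attempts, stop and surface the error to the caller.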

Why Jitter Is Essential
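
Suppose 1,000 clients fail at the same instant. With plain exponential backoff they all retry together at 1 second, then 2, then 4, hitting the recovering service with synchronized waves of load that can knock it over again. Jitter randomizes each client's wait, so those same 1,000 retries arrive spread across the interval instead of as a spike.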

Types of Jitter

Full jitter: wait = random(0, base * 2^attempt) — Most effective at spreading retries. AWS recommends this.

Equal jitter: wait = (base * 2^attempt) / 2 + random(0, (base * 2^attempt) / 2) — Guarantees a minimum wait time.

Decorrelated jitter: wait = random(base, previous_wait * 3) — Each wait is based on the previous wait, not the attempt number. Produces good spread without synchronized waves.
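
The three variants differ only in how the random draw is taken. A minimal sketch of all three (the function names are illustrative, and a cap argument is added so waits stay bounded, which goes beyond the bare formulas above):

```python
import random

def full_jitter(base: float, attempt: int, cap: float = 30.0) -> float:
    # Entire delay is random: [0, min(cap, base * 2^attempt)]
    return random.uniform(0, min(cap, base * 2 ** attempt))

def equal_jitter(base: float, attempt: int, cap: float = 30.0) -> float:
    # Half fixed, half random: guarantees a minimum wait
    exp = min(cap, base * 2 ** attempt)
    return exp / 2 + random.uniform(0, exp / 2)

def decorrelated_jitter(base: float, previous_wait: float, cap: float = 30.0) -> float:
    # Next wait depends on the previous wait, not the attempt number;
    # seed previous_wait with base on the first call
    return min(cap, random.uniform(base, previous_wait * 3))
```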

Real System Example: Payment Processing
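
Consider a checkout service calling a payment provider. A charge request times out, and the client cannot tell whether the charge went through. Blindly retrying could bill the customer twice, so the service sends every attempt with the same idempotency key; the provider uses the key to deduplicate, returning the original result instead of charging again. With that safety in place, the service retries with exponential backoff and jitter, caps the retry budget at a handful of attempts, and routes anything still failing to a queue for later processing or manual review. A sketch of the retry call, using the retry_with_backoff helper defined under Implementation below (the client API shown is hypothetical):

```python
import uuid

def charge_with_retry(client, amount_cents: int):
    # One key per logical charge: reused across retries, never across charges
    idempotency_key = str(uuid.uuid4())
    return retry_with_backoff(
        lambda: client.charge(amount=amount_cents, idempotency_key=idempotency_key)
    )
```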

Implementation

Python retry with exponential backoff and full jitter — a minimal sketch; the constants and the RetryableError type are illustrative, not from a specific library:

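```python
import random
import time

BASE_DELAY = 1.0    # seconds before the first retry
MAX_DELAY = 30.0    # cap so waits never grow unbounded
MAX_RETRIES = 5     # total retry budget

class RetryableError(Exception):
    """A transient failure worth retrying (timeout, 429, 502/503/504)."""

def retry_with_backoff(operation):
    """Run `operation`, retrying transient failures with full jitter."""
    for attempt in range(MAX_RETRIES + 1):
        try:
            return operation()
        except RetryableError:
            if attempt == MAX_RETRIES:
                raise  # retry budget exhausted; surface the failure
            # Full jitter: sleep a random time in [0, min(cap, base * 2^attempt)]
            delay = random.uniform(0, min(MAX_DELAY, BASE_DELAY * 2 ** attempt))
            time.sleep(delay)
```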

When NOT to retry — non-retryable errors. A minimal sketch of status-code classification; the sets mirror the when/when-not lists under Trade-offs below:

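```python
# Transient failures: safe to retry with backoff
RETRYABLE_STATUSES = {429, 502, 503, 504}
# Client errors: the request itself is wrong, so retrying cannot help
NON_RETRYABLE_STATUSES = {400, 401, 403, 404, 422}

def should_retry(status_code: int) -> bool:
    """Retry only failures that are plausibly transient."""
    if status_code in NON_RETRYABLE_STATUSES:
        return False  # fix the request instead of repeating it
    return status_code in RETRYABLE_STATUSES
```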

Trade-offs

Benefits:

  • Handles transient failures transparently — callers do not need to handle retries manually
  • Exponential growth gives failing services time to recover
  • Jitter prevents thundering herd — retries spread out over time
  • Simple to implement and reason about

Costs:

  • Increases end-to-end latency (user waits through retries)
  • Retries amplify load if not capped — a 5-retry budget means up to 6x the traffic during outages (the original attempt plus five retries)
  • Requires idempotent operations — unsafe to retry non-idempotent writes without idempotency keys
  • Max retry delay may exceed user patience or request timeout

When to retry:

  • Network timeouts, connection resets, DNS failures
  • HTTP 429 (rate limited), 502/503/504 (server errors)
  • Database connection pool exhaustion
  • Message queue delivery failures

When NOT to retry:

  • HTTP 400/401/403/404/422 (client errors — the request is wrong)
  • Business logic failures (insufficient funds, validation errors)
  • Non-idempotent operations without idempotency keys
  • When the circuit breaker is open (the dependency is known to be down)

Common Misconceptions

  • "Retry immediately, then back off" — Even the first retry should have a small delay. Immediate retry after a failure often hits the same problem (connection in reset state, server still restarting).
  • "Exponential backoff without jitter is fine" — Without jitter, all clients retry at the same exponential intervals, creating periodic load spikes. Always add jitter.
  • "More retries are always better" — Each retry multiplies load on the failing service. 3-5 retries with exponential backoff is usually sufficient. Beyond that, use a circuit breaker.
  • "Retries are safe for all operations" — Retrying a non-idempotent operation (e.g., POST /charge without an idempotency key) can charge the customer multiple times. Only retry operations that are safe to repeat.
  • "Exponential backoff solves rate limiting" — Backoff helps, but respect the Retry-After header when present. Some APIs tell you exactly when to retry.

How This Appears in Interviews

  1. "How do you handle a flaky third-party API?" — Retry with exponential backoff and jitter. Circuit breaker for prolonged failures. Fallback response when possible.
  2. "Your retries are making an outage worse. Why?" — No jitter (thundering herd), too many retries (amplified load), or retrying non-transient errors (400s).
  3. "How do you ensure exactly-once processing with retries?" — You cannot guarantee exactly-once delivery. Use at-least-once delivery with idempotent consumers. Idempotency keys for payment APIs.
  4. "Design a reliable webhook delivery system" — Retry with exponential backoff, cap at 24 hours, store delivery status, allow manual retry. This is how Stripe, GitHub, and Twilio deliver webhooks.

Related Concepts
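
  • Circuit breaker — stops retrying entirely when a dependency is known to be down
  • Idempotency keys — make otherwise unsafe operations safe to retry
  • Thundering herd — the failure mode that jitter exists to prevent
  • Rate limiting — the source of the HTTP 429s that backoff must respect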
