Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why retry with exponential backoff — and why jitter?

Retrying on failure sounds simple until you ship it at scale. Hammer the server and you make outages worse; back off but synchronize, and you accidentally rebuild the herd. Backoff is the timing rule; jitter is the part that keeps it from biting itself.

Networking intro Apr 29, 2026

Why it exists

You call an API. It returns 503, or times out, or your TCP connection resets. The naive thing is to retry immediately. Do that in a loop and you’ve built a denial-of-service attack against a service that was already struggling. You make the outage worse. You arrive at the bad moment with friends.

So engineers learned a softer rule: wait, then retry, and wait longer each time you fail. This is exponential backoff — the wait doubles (or grows by some constant factor) on each failure. The intuition is that you don’t know how long the bad condition will last, and exponential growth covers a huge range of timescales — milliseconds to minutes — without you having to guess in advance.

That’s half the answer. The other half is the part that surprises people the first time they meet it: if every client retries on the same exponential schedule, you’ve replaced a stampede with a metronome. All the clients that failed at second 0 retry at second 1. The ones that fail there retry at second 3. The ones that fail there retry at second 7. The server sees neat periodic spikes that, if anything, are easier to overload than random arrivals would be — because every spike is a near-simultaneous burst.

The fix is jitter — randomization on the wait. Each client picks its retry time from a window, not a point. The thundering herd flattens into something the server can actually serve.

Why it matters now

If you ship anything that talks to an external service in 2026, you ship backoff. The list of places it shows up has only grown.

The cost of getting it wrong is no longer “my script is annoying.” It’s “my agent fleet just synchronized into a wave that took down the upstream service for everyone.”

The short answer

retry policy = exponential backoff + jitter + a stopping rule

Three pieces. Exponential backoff spaces the retries out so a slow or flapping service has time to recover. Jitter breaks the synchronization between independent clients so their retries don’t pile up at the same instant. A stopping rule — max retries, max total wait, or both — keeps the loop from becoming an infinite background apology. Drop any of the three and you’ve built a footgun.
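If it helps to see those three pieces as knobs, here is a minimal sketch of a policy object in Python. The names and defaults are illustrative, not any particular library’s API:

from dataclasses import dataclass

@dataclass
class RetryPolicy:
    base: float = 0.1          # first wait, in seconds (the backoff part)
    factor: float = 2.0        # grow the wait by this after each failure
    cap: float = 30.0          # ceiling on any single wait
    jitter: bool = True        # randomize each wait within its window
    max_attempts: int = 5      # stopping rule: give up after this many tries
    max_elapsed: float = 60.0  # stopping rule: total time budget, in seconds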

How it works

1. The exponential part

Pick a base delay (say 100 ms) and a factor (usually 2). The wait after the n-th failed attempt is base × factor^(n-1), capped at some ceiling so it doesn’t drift into “retry next Tuesday” territory:

attempt 1: fail, wait 100 ms
attempt 2: fail, wait 200 ms
attempt 3: fail, wait 400 ms
attempt 4: fail, wait 800 ms
...
attempt k: fail, wait min(base × 2^(k-1), cap)

Why exponential and not linear? Linear backoff tends to spend most of the retry budget on small early waits and never reaches an interval long enough to ride out a real multi-second outage. Exponential covers many orders of magnitude with few retries: six attempts at base 100 ms reaches several seconds; ten reaches a minute or two. That’s the right shape for “I don’t know if this is a 50 ms blip or a 30 second deploy.”
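In code, the rule is one line. A minimal Python sketch, with illustrative defaults rather than anyone’s recommended values:

def backoff_delay(failures, base=0.1, factor=2.0, cap=30.0):
    # Wait, in seconds, after the k-th consecutive failure: base * factor^(k-1), capped.
    return min(base * factor ** (failures - 1), cap)

# backoff_delay(1) -> 0.1, backoff_delay(4) -> 0.8, backoff_delay(10) -> 30.0 (hits the cap)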

2. The jitter part — and why “full jitter” is the surprising winner

Plain backoff schedules every client onto the same ladder. Add jitter and each client picks a random delay from a window. The interesting question is which window.

The variants Marc Brooker compared in the canonical AWS Architecture Blog post on this (March 4, 2015, “Exponential Backoff And Jitter”):

No Jitter: wait = min(base × 2^(k-1), cap), the plain ladder above
Full Jitter: wait = random between 0 and min(base × 2^(k-1), cap)
Equal Jitter: window = min(base × 2^(k-1), cap); wait = window/2 + random between 0 and window/2
Decorrelated Jitter: wait = min(cap, random between base and 3 × previous wait)

Brooker’s simulations on a synthetic workload found Full Jitter reduced server load substantially compared to no jitter, and that Decorrelated Jitter finished the work in slightly fewer total calls than Full Jitter on his benchmark. The headline result — “spread the retries out randomly across the exponential window” — is the part that’s stuck. The post nudged the broader ecosystem toward jittered retries; whether any specific AWS SDK uses Full vs Decorrelated today varies by SDK and version, and I won’t claim more than that without checking the source.

The shape worth memorizing: exponential window, uniform random within it. Pick a variant, document it, and move on.
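For concreteness, here is how those variants are usually written, as a Python sketch rather than a transcription of the post’s pseudocode; the function names and defaults are mine:

import random

def full_jitter(failures, base=0.1, factor=2.0, cap=30.0):
    # Uniform over the whole exponential window: [0, min(base * factor^(k-1), cap)].
    return random.uniform(0, min(base * factor ** (failures - 1), cap))

def equal_jitter(failures, base=0.1, factor=2.0, cap=30.0):
    # Half the window guaranteed, half random.
    window = min(base * factor ** (failures - 1), cap)
    return window / 2 + random.uniform(0, window / 2)

def decorrelated_jitter(previous_wait, base=0.1, cap=30.0):
    # Draw each wait from [base, 3 * previous wait]; feed the result back in next time.
    return min(cap, random.uniform(base, previous_wait * 3))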

3. The stopping rule

Backoff without a stopping rule is a quiet disaster. Two common shapes:

A retry cap: give up after N attempts, whatever the clock says.
A deadline: give up once total elapsed time, waits included, exceeds the budget the caller actually has.

Both have a subtler companion: the per-attempt timeout has to be shorter than the deadline. A retry loop where each attempt blocks for 30 seconds with a 30-second total budget gives you exactly one try. The classic “we have retries, why didn’t they help?” postmortem ends here more often than it should.
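Putting the three pieces together: a sketch of a loop with both stopping rules and a budget-aware per-attempt timeout. do_request, is_retryable, and all the parameter names are placeholders, not a real library’s API:

import random
import time

class RetriesExhausted(Exception):
    pass

def call_with_retries(do_request, is_retryable,
                      base=0.1, factor=2.0, cap=10.0,
                      max_attempts=6, budget=30.0, per_attempt=5.0):
    # do_request(timeout=...) performs one attempt; is_retryable(exc) decides
    # whether a failure is worth retrying (a 503 or a timeout, not a 400).
    start = time.monotonic()
    last_exc = None
    for attempt in range(1, max_attempts + 1):       # stopping rule 1: attempt cap
        remaining = budget - (time.monotonic() - start)
        if remaining <= 0:                           # stopping rule 2: total time budget
            break
        try:
            # The per-attempt timeout never exceeds what is left of the budget.
            return do_request(timeout=min(per_attempt, remaining))
        except Exception as exc:
            if not is_retryable(exc):
                raise
            last_exc = exc
        # Full-jitter backoff, clipped so the sleep cannot outlive the budget either.
        delay = random.uniform(0, min(base * factor ** (attempt - 1), cap))
        remaining = budget - (time.monotonic() - start)
        time.sleep(max(0.0, min(delay, remaining)))
    raise RetriesExhausted() from last_exc

The detail worth noticing is that both the per-attempt timeout and the jittered sleep are clipped to whatever remains of the budget, so the loop can never outlive its deadline.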

4. The seams nobody puts in the diagram

Going deeper