Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why circuit breakers exist

Backoff makes a single retry polite. But when a downstream is plainly down, every caller in your fleet generously retrying it is the actual problem. A circuit breaker is the small piece that says: stop calling for a while — the answer isn't going to change in the next 50 ms.

Systems intro · Apr 29, 2026

Why it exists

You’ve already shipped retries with backoff and jitter. A downstream service gets sick. Each of your callers fails, waits, retries, fails again, waits longer, retries. Polite. Reasonable.

Now multiply by ten thousand processes. The “sick” service is being held underwater by a tide of retries from a fleet that was carefully told to be nice about it. Each individual caller is well-behaved; the aggregate is a sustained denial-of-service against a service that’s trying to recover. The slower the downstream gets, the more requests pile up in the caller’s threadpool or event loop, and the slower the caller gets — which often takes the caller down with it. This failure shape has a name: cascading failure.

Backoff is local. Each call decides independently when to try again. There is no component in the system whose job it is to notice that the answer is clearly “down” and stop asking on behalf of the rest. That’s the gap a circuit breaker fills. It sits in front of the call site, watches the recent error rate, and when failure crosses a threshold, it stops making the call at all for a window. Requests in that window fail fast — locally, without touching the network — so the downstream gets quiet air and the caller doesn’t pile up work it can’t finish.

The metaphor is the household kind: an electrical circuit breaker doesn’t fix a short circuit. It refuses to keep delivering current into one, which is what stops the wiring from catching fire. Same instinct here: when something downstream is wrong, stop pushing into it.

Why it matters now

Anywhere a service has many callers and at least one downstream that can be slow, breakers show up:
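- in service meshes, where the sidecar proxy (Envoy, for instance) ships circuit breaking and outlier detection per upstream cluster
- in resilience libraries that wrap client calls: Resilience4j on the JVM, Polly in .NET, opossum in Node
- in hand-rolled clients for flaky third-party APIs, where a breaker is often the first piece of structure added after the retry loop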

The cost of getting it wrong is the cost of the cascading-failure postmortem: you discover that the primary service didn’t really go down — its dependency did, and your service held the door open for the fire to walk through.

The short answer

circuit breaker = error counter + state machine + cooldown window

A breaker watches the recent results of a call. When errors cross a threshold, it opens — meaning subsequent calls fail immediately without going to the network. After a cooldown, it goes half-open, letting a small probe through. If the probe succeeds, it closes and normal traffic resumes; if it fails, the cooldown starts again. Three states, one counter, one timer. The whole pattern.
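In code, the whole pattern fits on a page. Here is a minimal sketch in TypeScript; the names (CircuitBreaker, failureThreshold, cooldownMs) are illustrative rather than any particular library’s API, and it counts consecutive failures instead of an error rate to stay short:

type State = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreaker {
  private state: State = "CLOSED";
  private failures = 0;   // the error counter
  private openedAt = 0;   // timestamp of the last trip

  constructor(
    private failureThreshold = 5,   // consecutive failures before tripping
    private cooldownMs = 10_000,    // how long to stay open
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "OPEN") {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error("circuit open");   // fail fast: no network involved
      }
      this.state = "HALF_OPEN";            // cooldown elapsed: allow a probe
    }
    try {
      const result = await fn();
      this.failures = 0;
      this.state = "CLOSED";               // success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === "HALF_OPEN" || this.failures >= this.failureThreshold) {
        this.state = "OPEN";               // trip (or re-trip), restart the cooldown
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}

// Usage: const breaker = new CircuitBreaker();
//        await breaker.call(() => fetch("https://payments.internal/charge"));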

How it works

The three states

CLOSED ── error rate exceeds threshold ──▶ OPEN
  ▲                                        │   ▲
  │                       cooldown elapses │   │ probe fails
  │                                        ▼   │
  └──── probe succeeds ───────────── HALF_OPEN ┘

The half-open state is the part most homemade implementations get wrong. Without it, you have two bad options. Close automatically after the cooldown and you flap: open, cooldown, closed, instant flood, re-open. Stretch the cooldown to be safe and you stay open long after the downstream has recovered, because closing outright never gives you evidence that it is actually back. Letting one request through and gating the verdict on it is the cheapest experiment that answers the question.
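One wrinkle the sketch above glosses over: the instant the state flips to half-open, every concurrent caller would sail through, recreating the flood. A common refinement, again with illustrative names, admits exactly one probe and keeps failing everyone else fast until its verdict is in:

// Inside the CircuitBreaker class from the sketch above:
private probeInFlight = false;

// Decide locally, with no network involved, whether this call may proceed.
private admit(): boolean {
  if (this.state === "OPEN") {
    if (Date.now() - this.openedAt < this.cooldownMs) return false;  // still cooling down
    this.state = "HALF_OPEN";                                        // time for a probe
  }
  if (this.state === "HALF_OPEN") {
    if (this.probeInFlight) return false;   // a probe is already out; keep failing fast
    this.probeInFlight = true;              // this call becomes the probe
  }
  return true;
}

// Called with the probe's result.
private settleProbe(ok: boolean): void {
  this.probeInFlight = false;
  if (ok) {
    this.failures = 0;
    this.state = "CLOSED";
  } else {
    this.state = "OPEN";
    this.openedAt = Date.now();
  }
}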

What “error” means

The breaker only works if it counts the right things. Two calibrations matter:
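- Which failures count. Timeouts, connection errors, and 5xx responses say something about the downstream’s health. A 4xx means this caller sent a bad request; tripping the breaker on those punishes every caller for one caller’s bug.
- Whether slow counts as failed. A downstream that answers in 30 seconds is often worse than one that refuses outright, because it ties up the caller’s concurrency while everyone waits. Many breakers treat latency beyond a budget as an error for exactly this reason.

A sketch of both calibrations, assuming an HTTP client that either throws (timeout, connection refused) or resolves with a status code. The BreakerStats hooks and the budget value are assumptions here, not a real library’s API:

const LATENCY_BUDGET_MS = 2_000;   // slower than this counts as a failure

interface BreakerStats {
  recordSuccess(): void;
  recordFailure(): void;
}

async function recordedCall(breaker: BreakerStats, fn: () => Promise<Response>) {
  const start = Date.now();
  try {
    const res = await fn();
    const tooSlow = Date.now() - start > LATENCY_BUDGET_MS;
    if (res.status >= 500 || tooSlow) {
      breaker.recordFailure();   // downstream-health signals trip the breaker
    } else {
      breaker.recordSuccess();   // 2xx, 3xx, and 4xx all say "downstream is fine"
    }
    return res;
  } catch (err) {
    breaker.recordFailure();     // timeouts and connection errors always count
    throw err;
  }
}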

Where it sits

A breaker is a per-destination object, not a per-call one. You need one breaker per downstream you want to protect — typically per (service, endpoint) or per (service, region) tuple. Sharing one breaker across unrelated dependencies opens the door to either over-tripping (one bad backend opens the whole world) or under-tripping (a healthy backend’s traffic disguises a sick one’s failures).

In a service mesh, the breaker lives in the sidecar, keyed by upstream cluster. In application code, it’s typically a singleton per logical client (paymentsClient.breaker, inventoryClient.breaker).
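In application code, the registry can be as small as a map keyed by destination. A sketch assuming the CircuitBreaker from earlier; the key shape is a choice, not a rule:

const breakers = new Map<string, CircuitBreaker>();

// One breaker per (service, endpoint); payments and inventory trip independently.
function breakerFor(service: string, endpoint: string): CircuitBreaker {
  const key = `${service}:${endpoint}`;
  let breaker = breakers.get(key);
  if (!breaker) {
    breaker = new CircuitBreaker();
    breakers.set(key, breaker);
  }
  return breaker;
}

// Usage (chargeCard / reserveStock are hypothetical client calls):
//   await breakerFor("payments", "/charge").call(() => chargeCard(order));
//   await breakerFor("inventory", "/reserve").call(() => reserveStock(order));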

Fallbacks: the part the diagram doesn’t show

A breaker that just throws a “circuit open” error is half a feature. The other half is what the caller does instead. Common shapes:
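- Serve stale. Return the last cached value, flagged as possibly out of date.
- Degrade. Render the page without the recommendations strip; skip the optional enrichment.
- Default. Return a safe constant: an empty list, a conservative estimate.
- Queue. Accept the write, park it, replay it when the circuit closes.
- Shed. Fail fast with a clear error the upstream caller can handle, which still beats a slow timeout.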

The fallback is application-specific, which is why most generic libraries make you provide it. The breaker handles the when; you handle the what.
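A thin wrapper makes that division of labor concrete. A sketch, assuming the CircuitBreaker and breakerFor from earlier; the pricing names in the usage comment are hypothetical:

async function withFallback<T>(
  breaker: CircuitBreaker,
  attempt: () => Promise<T>,
  fallback: (err: unknown) => T | Promise<T>,
): Promise<T> {
  try {
    return await breaker.call(attempt);   // the breaker decides when to refuse
  } catch (err) {
    return fallback(err);                 // the application decides what happens instead
  }
}

// Usage, e.g. serving a cached quote when the pricing service is unreachable:
//   const price = await withFallback(
//     breakerFor("pricing", "/quote"),
//     () => pricingClient.quote(sku),
//     () => priceCache.get(sku) ?? DEFAULT_PRICE,
//   );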

Show the seams

Going deeper