Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why does speculative decoding exist?

A small fast model guesses, a big slow model checks. Somehow you get the big model's exact output, faster. The trick isn't cleverness — it's that your GPU was already sitting idle.

AI & ML · intermediate · Apr 29, 2026

Why it exists

Here is the thing that should bother you about LLM inference once you’ve stared at it for a while.

Generating one token of output forces the GPU to read every weight of the model out of memory. A 70B-parameter model in 16-bit precision is ~140 GB of weights. To produce the next token, the GPU reads all 140 GB again. And again. Token by token. The arithmetic involved per token is small; the reading is what kills you. This is the memory-bandwidth wall, and it’s why your single-stream inference is slow even on a card that brags about petaflops.

Here’s the part that should really bother you: while all that reading is happening, the GPU’s compute units are mostly idle. You paid for petaflops. On dense, low-batch decode you’re using a small fraction of them. The hardware is begging you to give it more arithmetic to do per byte read.
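
Put numbers on it and the imbalance is hard to unsee. A back-of-envelope sketch, using round figures in the ballpark of a modern datacenter GPU (the ~3 TB/s and ~1 PFLOP/s below are assumptions for illustration, not measurements):

```python
# Rough ceiling on single-stream decode speed when every new token
# requires streaming all the weights out of HBM once.
params = 70e9               # 70B-parameter model (illustrative)
bytes_per_param = 2         # fp16 weights
hbm_bandwidth = 3.0e12      # ~3 TB/s HBM, round-number assumption
peak_fp16_flops = 1.0e15    # ~1 PFLOP/s dense fp16, round-number assumption

weight_bytes = params * bytes_per_param        # ~140 GB per decode step
tokens_per_s = hbm_bandwidth / weight_bytes    # bandwidth-bound ceiling

flops_per_token = 2 * params                   # ~1 multiply-add per weight
compute_busy = flops_per_token * tokens_per_s / peak_fp16_flops

print(f"bandwidth-bound ceiling: ~{tokens_per_s:.0f} tokens/s")
print(f"fraction of peak flops used: ~{compute_busy:.1%}")
```

On those assumptions you top out around 20 tokens per second while using well under one percent of the card’s arithmetic.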

Speculative decoding is the move that takes the hardware up on that offer.

Why it matters now

Most serious inference stacks support it: vLLM, TensorRT-LLM, llama.cpp, and several hosted providers. For agent workloads — long, decode-heavy chains of tool calls and reasoning — it’s one of the few ways to cut wall-clock latency without dropping to a smaller model and eating the quality loss.

It also matters because it composes (mostly — real stacks have feature-by-feature gaps) with the other tricks. MoE, KV-cache reuse, prompt caching — speculative decoding stacks on top of them.

The short answer

speculative decoding = small draft model + big model verifies in parallel

A cheap “draft” model guesses the next K tokens. The big “target” model then runs one forward pass that scores all K guesses simultaneously, accepts the longest prefix it agrees with, and produces one bonus token of its own. If the draft is right most of the time, you get several tokens per expensive forward pass instead of one — and the output distribution is provably the same as what the big model would have produced alone.

How it works

The mechanism rests on one asymmetry that most people miss the first time they hear about this technique: scoring K tokens in parallel costs the big model almost the same as scoring one.

That’s because, in the memory-bound regime, the cost of one decode step is dominated by streaming the weights through the memory hierarchy, not by the matmuls themselves. Once a weight tile is loaded into the GPU’s caches it can be reused against a length-1 input or a length-K input at roughly the same wall-clock cost, until K gets large enough that you finally become compute-bound. So the big model can verify “did I agree with all K of these guesses?” in basically one forward pass.
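
A roofline-style sketch makes the asymmetry visible: the bytes you read are dominated by the weights and barely change with K, while the flops scale linearly with K. Same illustrative numbers as above, and it deliberately ignores attention and KV-cache traffic:

```python
# Lower bound on one target forward pass over k new tokens: you pay either
# the time to stream the weights or the time to do the math, whichever is larger.
params = 70e9
bytes_per_param = 2
hbm_bandwidth = 3.0e12      # ~3 TB/s (assumption, as above)
peak_fp16_flops = 1.0e15    # ~1 PFLOP/s (assumption, as above)

def step_time(k: int) -> float:
    mem_time = params * bytes_per_param / hbm_bandwidth   # read weights once
    compute_time = 2 * params * k / peak_fp16_flops       # ~2 FLOPs/param/token
    return max(mem_time, compute_time)

for k in (1, 4, 8, 64, 1024):
    print(f"k={k:5d}: {step_time(k) * 1e3:6.1f} ms per pass")
```

On these assumptions a pass over 4 or 8 drafted tokens costs the same ~47 ms as a pass over one; the compute term doesn’t take over until K is in the hundreds.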

The loop (a runnable sketch follows the list):

  1. Draft. A small model (say a 1B parameter sibling of your 70B target) generates K candidate tokens autoregressively. This is cheap because the small model is small.
  2. Verify. The target model runs one forward pass on the prompt + the K drafted tokens. This gives you the target model’s probability distribution at each of those K positions (and at the position right after the last drafted token), in parallel.
  3. Accept or reject. Walk the K positions left to right. At each one, compare what the target says to what the draft proposed, using a specific accept/reject rule (see below). Accept the longest prefix that passes the check.
  4. Bonus token. Wherever you stop, you get one extra token essentially for free. If all K were accepted, you sample from the target’s distribution at the K+1-th position (which step 2 already gave you). If you rejected at position j, you instead sample from the corrected distribution (p_target − p_draft)+ at position j — same forward pass, no extra big-model work. Then loop.

The accept/reject rule is what makes the output distribution the same as running the target model alone. For greedy decoding it’s easy: accept the draft token iff it’s also the target’s argmax. For sampling it’s a clever trick from the Leviathan et al. paper: accept with probability min(1, p_target(x) / p_draft(x)), and on rejection sample from (p_target − p_draft)+ normalized. The joint distribution over generated sequences is provably the target’s.
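
That rule as a standalone function, assuming p_target and p_draft are the two models’ full distributions at one position and x is the drafted token id:

```python
import numpy as np

def accept_or_correct(x, p_target, p_draft, rng):
    """Leviathan-style check for one drafted token x: accept with probability
    min(1, p_target[x] / p_draft[x]); on rejection, return a token drawn from
    the residual distribution (p_target - p_draft)+ renormalized.
    p_draft[x] > 0 is guaranteed because the draft actually sampled x."""
    if rng.random() < min(1.0, p_target[x] / p_draft[x]):
        return True, x
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return False, int(rng.choice(len(residual), p=residual))
```

Dropped into step 3 above in place of the argmax check (with the draft sampling rather than taking argmax), this gives the distribution-preserving variant; on rejection it already hands back the corrected token, so step 4 needs no extra work.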

That last bit is the part that surprises engineers. You’re not approximating the big model. You’re not trading quality for speed. The output distribution is identical to what the target would have produced alone. (Implementations can still differ by tiny amounts due to floating-point numerics — the guarantee is at the math level, not at the bit level.) The draft model is just a guess source; the target retains full veto power.

Where the speedup comes from

Two ways to think about it, both useful: per pass, one expensive sweep through the target’s weights now yields several tokens instead of one; per byte, the verify step spends arithmetic the memory-bound GPU had sitting idle anyway. Both framings live or die on how often the draft is right.

The two failure modes are symmetric. If the draft is too bad, acceptance rates collapse, and you’re paying for the small model’s forward passes plus the big model’s, with little to show. If the draft is too good (e.g. a 7B drafting for a 70B), the small model itself is now slow, and the cost of running it eats into the savings. Picking a draft is a Goldilocks problem.

A real-world acceptance rate of 60–80% on natural prose is typical with a well-matched draft, giving end-to-end speedups of roughly 2–3× on memory-bound inference. Reported numbers vary widely by workload — code, structured output, and chat all behave differently — and I don’t have a single benchmark I’d cite as canonical, so treat “2–3×” as a rough order of magnitude rather than a guarantee.
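
Leviathan et al. also give a simple analytical model of the gain, under the assumption that each drafted token is accepted independently with probability alpha. With K drafted tokens per round and a draft step costing a fraction c of a target step, it looks like this (the c = 0.05 is my own illustrative guess for a well-matched draft, not a measured number):

```python
def expected_tokens_per_pass(alpha: float, K: int) -> float:
    """E[tokens generated per target forward pass] when each drafted token is
    accepted i.i.d. with probability alpha (bonus/correction token included)."""
    if alpha == 1.0:
        return K + 1
    return (1 - alpha ** (K + 1)) / (1 - alpha)

def speedup(alpha: float, K: int, c: float) -> float:
    """Wall-clock speedup over plain decoding; c = cost of one draft step
    relative to one target step, assuming verification costs about one target step."""
    return expected_tokens_per_pass(alpha, K) / (c * K + 1)

for alpha in (0.6, 0.7, 0.8):
    print(f"alpha={alpha}: ~{speedup(alpha, K=4, c=0.05):.1f}x")
```

Plugging in the 60–80% acceptance range lands squarely in that 2–3× ballpark, and pushing c up (a bigger, slower draft) or alpha down (a worse one) reproduces the Goldilocks problem from the previous paragraph.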

Where it stops working

Speculative decoding helps memory-bound inference. If you’re already compute-bound — large batch sizes serving many users in parallel, or very small models where the weights aren’t the bottleneck — there are no idle flops to recover, and the math stops being free. This is why hosted providers sometimes turn it on for low-batch / low-traffic regimes and off when traffic is heavy. The exact crossover point depends on the model, the hardware, and the batch size; I don’t have a clean public number for where modern inference servers flip the switch.
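
You can see the squeeze in the same roofline sketch by adding a batch of B concurrent streams: the weights are read once per step no matter how large B is, so arithmetic per weight byte grows with B, and once B times K approaches the hardware’s flops-per-byte ratio there are no idle flops left for verification to borrow (same illustrative constants as before; real serving adds KV-cache traffic and shifts the numbers):

```python
params = 70e9
bytes_per_param = 2
hbm_bandwidth = 3.0e12      # ~3 TB/s (assumption)
peak_fp16_flops = 1.0e15    # ~1 PFLOP/s (assumption)

ridge = peak_fp16_flops / hbm_bandwidth     # ~333 FLOPs per byte of weights

def flops_per_weight_byte(batch: int, k: int) -> float:
    """Arithmetic intensity of one decode/verify step: batch streams, k tokens each."""
    return (2 * params * batch * k) / (params * bytes_per_param)

for batch in (1, 8, 64, 512):
    print(f"batch={batch:4d}, K=4: {flops_per_weight_byte(batch, 4):6.0f} "
          f"flops/byte (ridge ~{ridge:.0f})")
```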

Going deeper