Why does speculative decoding exist?
A small fast model guesses, a big slow model checks. Somehow you get the big model's exact output, faster. The trick isn't cleverness — it's that your GPU was already sitting idle.
Why it exists
Here is the thing that should bother you about LLM inference once you’ve stared at it for a while.
Generating one token of output forces the GPU to read every weight of the model out of memory. A 70B-parameter model in 16-bit precision is ~140 GB of weights. To produce the next token, the GPU reads all 140 GB again. And again. Token by token. The arithmetic involved per token is small; the reading is what kills you. This is the memory-bandwidth wall, and it’s why your single-stream inference is slow even on a card that brags about petaflops.
Here’s the part that should really bother you: while all that reading is happening, the GPU’s compute units are mostly idle. You paid for petaflops. On dense, low-batch decode you’re using a small fraction of them. The hardware is begging you to give it more arithmetic to do per byte read.
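To put rough numbers on it, a back-of-envelope sketch (the ~3.3 TB/s bandwidth figure is an assumed H100-class value, not a measurement):

```python
# Back-of-envelope floor on single-stream decode speed for a 70B fp16 model.
# The bandwidth number is an assumption (roughly H100-class HBM); swap in
# your own hardware's figure.
params = 70e9
weight_bytes = params * 2          # fp16: 2 bytes per parameter, ~140 GB
hbm_bandwidth = 3.3e12             # bytes per second, assumed

# Each decode step must stream every weight at least once, so this is a
# lower bound on time per token at batch size 1.
time_per_token = weight_bytes / hbm_bandwidth
print(f"~{time_per_token * 1e3:.0f} ms per token, "
      f"~{1 / time_per_token:.0f} tokens/s ceiling")   # ~42 ms, ~24 tok/s
```

Roughly 42 ms per token, a ceiling of about 24 tokens/s, no matter how many teraflops the card advertises.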
Speculative decoding is the move that takes the hardware up on that offer.
Why it matters now
Most serious inference stacks support it: vLLM, TensorRT-LLM, llama.cpp, and several hosted providers. For agent workloads — long, decode-heavy chains of tool calls and reasoning — it’s one of the few ways to cut wall-clock latency without dropping to a smaller model and eating the quality loss.
It also matters because it composes (mostly — real stacks have feature-by-feature gaps) with the other tricks. MoE, KV-cache reuse, prompt caching — speculative decoding stacks on top of them.
The short answer
speculative decoding = small draft model + big model verifies in parallel
A cheap “draft” model guesses the next K tokens. The big “target” model then runs one forward pass that scores all K guesses simultaneously, accepts the longest prefix it agrees with, and produces one bonus token of its own. If the draft is right most of the time, you get several tokens per expensive forward pass instead of one — and the output distribution is provably the same as what the big model would have produced alone.
How it works
The mechanism rests on one asymmetry that most people miss the first time they hear about this technique: scoring K tokens in parallel costs the big model almost the same as scoring one.
That’s because, in the memory-bound regime, the cost of one decode step is dominated by streaming the weights through the memory hierarchy, not by the matmuls themselves. Once a weight tile is loaded into the GPU’s caches it can be reused against a length-1 input or a length-K input at roughly the same wall-clock cost, until K gets large enough that you finally become compute-bound. So the big model can verify “did I agree with all K of these guesses?” in basically one forward pass.
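A rough way to see where that crossover sits, using assumed accelerator numbers (about 1e15 dense fp16 FLOP/s and 3.3e12 bytes/s of memory bandwidth, loosely H100-class) and counting roughly 2 FLOPs per parameter per position:

```python
# Compare compute time against weight-streaming time for one forward pass
# over K positions of a 70B fp16 model. The peak-FLOPs and bandwidth
# figures are assumptions, not measurements.
peak_flops = 1.0e15        # dense fp16 FLOP/s, assumed
hbm_bandwidth = 3.3e12     # bytes/s, assumed
params = 70e9

for k in (1, 4, 8, 64, 256):
    compute_time = (2 * params * k) / peak_flops   # ~2 FLOPs per param per position
    memory_time = (2 * params) / hbm_bandwidth     # weights streamed once per pass
    bound = "compute" if compute_time > memory_time else "memory"
    print(f"K={k:>3}: compute {compute_time*1e3:5.1f} ms, "
          f"memory {memory_time*1e3:5.1f} ms -> {bound}-bound")
```

With these numbers the pass stays memory-bound until K reaches the hundreds, which is why verifying a handful of drafted tokens is effectively free.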
The loop:
- Draft. A small model (say a 1B parameter sibling of your 70B target) generates K candidate tokens autoregressively. This is cheap because the small model is small.
- Verify. The target model runs one forward pass on the prompt + the K drafted tokens. This gives you the target model’s probability distribution at each of those K positions, in parallel.
- Accept or reject. Walk the K positions left to right. At each one, compare what the target says to what the draft proposed, using a specific accept/reject rule (see below). Accept the longest prefix that passes the check.
- Bonus token. Wherever you stop, you get one extra token essentially for free. If all K were accepted, you sample from the target’s distribution at the (K+1)-th position (which step 2 already gave you). If you rejected at position j, you instead sample from the corrected distribution (p_target − p_draft)+ at position j — same forward pass, no extra big-model work. Then loop.
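Putting the four steps together: a minimal sketch of one speculative step, assuming greedy decoding and hypothetical `draft_model` / `target_model` callables that map a token-id tensor of shape [1, T] to logits of shape [1, T, vocab]. It is illustrative rather than any particular library's API, and it skips KV caching entirely.

```python
import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, ids: torch.Tensor, k: int = 4) -> torch.Tensor:
    t = ids.shape[-1]

    # 1. Draft: the small model proposes K tokens autoregressively (cheap).
    seq = ids
    for _ in range(k):
        nxt = draft_model(seq)[:, -1, :].argmax(-1, keepdim=True)
        seq = torch.cat([seq, nxt], dim=-1)
    proposed = seq[:, t:]                                        # [1, K] drafted tokens

    # 2. Verify: one target forward pass scores all K drafted positions at once.
    logits = target_model(seq)                                   # [1, T + K, vocab]
    target_pick = logits[:, t - 1 : t + k - 1, :].argmax(-1)     # target's choice at each drafted slot

    # 3. Accept the longest prefix where the target agrees with the draft.
    agree = (target_pick == proposed).long()[0]
    n_accepted = int(agree.cumprod(dim=0).sum())

    # 4. Bonus token: the target's own next token after the accepted prefix,
    #    read from the same forward pass.
    bonus = logits[:, t + n_accepted - 1, :].argmax(-1, keepdim=True)
    return torch.cat([ids, proposed[:, :n_accepted], bonus], dim=-1)
```

Every expensive target pass emits between 1 and K+1 tokens; the sampling variant replaces the argmax comparison with the accept/reject rule described next.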
The accept/reject rule is what makes the output distribution the same as running the target model alone. For greedy decoding it’s easy: accept the draft token iff it’s also the target’s argmax. For sampling it’s a clever trick from the Leviathan et al. paper: accept with probability min(1, p_target(x) / p_draft(x)), and on rejection sample from (p_target − p_draft)+ normalized. The joint distribution over generated sequences is provably the target’s.
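For the sampling case, here is a sketch of that rule at a single drafted position, assuming `p_target` and `p_draft` are full-vocabulary probability vectors at that position (the names are illustrative):

```python
import torch

def accept_or_resample(p_target: torch.Tensor, p_draft: torch.Tensor, token: int):
    """Return (accepted, token) for one drafted position."""
    # Accept the draft's token with probability min(1, p_target / p_draft).
    accept_prob = torch.clamp(p_target[token] / p_draft[token], max=1.0)
    if torch.rand(()) < accept_prob:
        return True, token
    # On rejection, sample from the residual (p_target - p_draft)+, renormalized.
    residual = torch.clamp(p_target - p_draft, min=0.0)
    residual = residual / residual.sum()
    return False, int(torch.multinomial(residual, 1).item())
```

Walking the drafted positions left to right with this rule, and falling back to the target's own distribution for the bonus token, is what the unbiasedness proof covers.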
That last bit is the part that surprises engineers. You’re not approximating the big model. You’re not trading quality for speed. The output distribution is identical to what the target would have produced alone. (Implementations can still differ by tiny amounts due to floating-point numerics — the guarantee is at the math level, not at the bit level.) The draft model is just a guess source; the target retains full veto power.
Where the speedup comes from
Two ways to think about it, both useful:
- Compute side: you’ve converted a memory-bound workload into a slightly more compute-bound one. Each expensive forward pass over the big model now produces somewhere between 1 and K+1 tokens of output instead of exactly 1. If the draft is accepted ~70% of the time and K=4, the expected output per big-model step is around 3 tokens (the arithmetic is worked through just after this list). That’s an upper bound on per-step gain — the end-to-end speedup is smaller after you pay for running the draft.
- Hardware side: those previously idle compute units now have work to do — verifying the K drafted positions in parallel. You stopped wasting flops.
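The "around 3 tokens" figure in the compute-side bullet comes from the expected-length formula in the Leviathan et al. paper: if each drafted token is accepted independently with probability α, a verification pass over K drafted tokens emits (1 − α^(K+1)) / (1 − α) tokens in expectation, counting the bonus token. With the example numbers:

```python
# Expected tokens emitted per target forward pass, assuming each drafted token
# is accepted independently with probability alpha (Leviathan et al.'s formula).
alpha, k = 0.7, 4
expected = (1 - alpha ** (k + 1)) / (1 - alpha)
print(f"{expected:.2f} tokens per verification pass")   # ~2.77, i.e. "around 3"
```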
The two failure modes are symmetric. If the draft is too bad, acceptance rates collapse, and you’re paying for the small model’s forward passes plus the big model’s, with little to show. If the draft is too good (e.g. a 7B drafting for a 70B), the small model itself is now slow, and the cost of running it eats into the savings. Picking a draft is a Goldilocks problem.
A real-world acceptance rate of 60–80% on natural prose is typical with a well-matched draft, giving end-to-end speedups of roughly 2–3× on memory-bound inference. Reported numbers vary widely by workload — code, structured output, and chat all behave differently — and I don’t have a single benchmark I’d cite as canonical, so treat “2–3×” as a rough order of magnitude rather than a guarantee.
Where it stops working
Speculative decoding helps memory-bound inference. If you’re already compute-bound — large batch sizes serving many users in parallel, or very small models where the weights aren’t the bottleneck — there are no idle flops to recover, and the math stops being free. This is why hosted providers sometimes turn it on for low-batch / low-traffic regimes and off when traffic is heavy. The exact crossover point depends on the model, the hardware, and the batch size; I don’t have a clean public number for where modern inference servers flip the switch.
Famous related terms
- Draft model — draft model = small LLM + same tokenizer as target — a smaller sibling of the target whose job is just to propose. Often a distilled version of the target, or a much smaller model from the same family.
- Medusa — adds extra “decoding heads” to the target model itself so it predicts several future tokens at once; no separate draft model.
- EAGLE — drafts using a small autoregressive head fed by the target’s own hidden states. Different mechanism from Medusa, similar goal.
- Self-speculative decoding — uses earlier layers of the target as the “draft,” with later layers as the verifier. One model, two roles.
- Lookahead decoding — generates draft tokens via Jacobi iteration over the target’s own forward passes, no draft model required. Different mechanism, similar goal.
- Prompt / KV caching — prompt cache = KV-cache + reuse across requests — orthogonal trick, also goes after the cost of re-reading work the model already did. Stacks with speculative decoding.
- Continuous batching — continuous batching = dynamic batch + per-token scheduling — the other big inference-serving win. It improves utilization and pushes the system toward compute-bound at high concurrency. When it dominates, speculative decoding’s value shrinks.
Going deeper
- Leviathan, Kalman, Matias — Fast Inference from Transformers via Speculative Decoding (arXiv 2022, ICML 2023) — the original paper, with the unbiased sampling proof.
- Chen, Borgeaud, Irving, Lespiau, Sifre, Jumper — Accelerating Large Language Model Decoding with Speculative Sampling (DeepMind, 2023) — concurrent independent work; same idea, different presentation.
- Looking back at speculative decoding — Google Research blog — the authors’ own retrospective on what surprised them about adoption.
- NVIDIA — An Introduction to Speculative Decoding for Reducing Latency in AI Inference — practitioner walkthrough with diagrams.