Why does continuous batching exist?
Static batching works fine for image classifiers and breaks immediately for LLMs. The problem isn't the batch — it's that generation lengths vary, and the slowest sequence holds the GPU hostage.
Why it exists
Here is the way batching is supposed to work, drilled into anyone who has ever served a deep-learning model. You collect N requests, stack them into a tensor, hand the tensor to the GPU, get N outputs back, return them. One weight load amortized across N inputs. The fixed cost of dragging the model through memory gets divided by N; throughput goes up almost linearly until something else fills up. For an image classifier that produces one logit vector per image, this is a great deal.
For an LLM, this is a disaster. And the reason is mundane enough to be embarrassing: different requests generate different numbers of tokens.
Picture eight users all hitting the same model. One asks for a one-word answer. One asks for a 2,000-token essay. The other six want something in between. If you batch them with static batching — pick the eight, run them together, return when they’re all done — you have just told the user who wanted “yes” that they get to wait for the essay-writer to finish. The GPU keeps decoding the batch, iteration after iteration, even after seven of the eight sequences have produced their EOS token and have nothing left to do. Their slots in the batch are still “computing” — multiplying weights against padding — because that’s how fixed-shape tensor execution works.
So you’ve taken the latency of every fast request and pinned it to the slowest one in the batch. New requests arriving in the meantime queue up behind the whole batch, even though half its slots are doing fake work.
Continuous batching exists because someone noticed that this isn’t a small loss. On real workloads with realistic length distributions, it’s one of the largest sources of wasted GPU time in a generation pipeline.
Why it matters now
Every major LLM inference stack does some form of this. vLLM, TensorRT-LLM, TGI, SGLang, llama.cpp’s server mode — the schedulers differ in detail, but the core idea is the same: don’t wait for a batch to finish; reshuffle the batch every iteration.
It matters now because LLM serving economics live and die on aggregate throughput. Hosted providers price per million tokens. Their margin is the gap between what they bill you and what their GPUs cost per token produced. Continuous batching is one of the handful of optimizations that move that gap meaningfully. The Anyscale write-up reports up to 23× throughput with vLLM over naive/static batching while also reducing p50 latency; in the same comparison TGI’s continuous batching shows roughly 8× over naive. The original Orca paper reports up to 36.9× throughput at the same latency vs NVIDIA FasterTransformer on a GPT-3 175B workload. The multiplier you see in practice depends heavily on workload, baseline, model, and hardware — long, varied generation lengths give the biggest wins; short uniform ones give you almost nothing — so treat any single number as a data point, not a guarantee.
It also matters because it shapes the public APIs you use. Streaming responses, request-level cancellation, and mixing long and short prompts in the same deployment all benefit substantially from per-iteration scheduling. They don’t strictly require it, but without it the behavior of “how does this request feel?” depends on whoever else happened to be in your batch.
The short answer
continuous batching = iteration-level scheduling + per-token batch reshuffling
Instead of treating a batch as a fixed group of requests that runs to completion, treat each forward pass through the model as the unit of scheduling. After every iteration, look at what’s in the batch: any sequence that finished gets evicted, freeing its slot; any waiting request can be slotted in for the next iteration. The batch is a revolving door, not a fixed cohort. The GPU spends its cycles on sequences that still have tokens to produce, instead of grinding through padding for ones that already emitted their EOS.
How it works
Static batching is structured like this:
- Wait for a batch of N requests.
- Run the prompts through the model (prefill).
- Decode tokens autoregressively, one iteration per token, for the whole batch, until every sequence has hit EOS or the max-length limit.
- Return all N responses.
The pain is step 3. Suppose request A wants 50 tokens and request B wants 500. After 50 iterations, A is done — but the loop keeps running for 450 more iterations, and A’s slot has to keep producing something (usually just a no-op masked-out forward pass) because the tensor has fixed shape. A’s user now waits through 450 iterations of work they have no stake in before their response comes back, and any new request that arrives during those 450 iterations has to wait until the whole batch finishes before it can even start.
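A minimal sketch of that loop makes the waste visible. The callables (`prefill`, `decode_step`) and the request fields are placeholders standing in for a real engine's forward passes, not any particular API:

```python
# Sketch of static batching's decode loop (step 3). `prefill` and `decode_step`
# are placeholders for a real engine's forward passes.
def static_batch_generate(requests, prefill, decode_step, eos_id, max_new_tokens):
    states = [prefill(r.prompt) for r in requests]     # one fixed cohort, one KV cache each
    outputs = [[] for _ in requests]
    finished = [False] * len(requests)

    for _ in range(max_new_tokens):
        if all(finished):
            break
        # The whole batch runs every iteration. Finished slots still occupy the
        # tensor; a real engine masks or pads them, but the work is spent anyway.
        next_tokens = decode_step(states)
        for i, tok in enumerate(next_tokens):
            if finished[i]:
                continue
            outputs[i].append(tok)
            if tok == eos_id:
                finished[i] = True                     # done, but stuck in the batch

    # Nothing is returned until the slowest sequence finishes.
    return outputs
```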
Continuous batching restructures step 3 (a minimal scheduler sketch follows the list):
- After each iteration of the decode loop, the scheduler looks at the running batch.
- Any sequence that just emitted EOS is removed; its response gets streamed back to its user immediately, and its slot becomes free.
- Any waiting request can be promoted into a free slot. (In some schedulers, a fresh request gets its prefill done first, possibly interleaved with decode steps from the existing batch — this is where the implementations diverge.)
- Run the next iteration on the new batch composition. Loop.
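Here is roughly what that loop looks like as code. The helper names (`prefill`, `decode_step`, `free_kv`), the sequence methods, and the slot bookkeeping are illustrative assumptions, not any particular engine's API:

```python
from collections import deque

# Sketch of an iteration-level scheduler. The engine internals and the
# sequence object (`emit`, `at_max_length`) are stand-ins for illustration.
def serve(waiting: deque, prefill, decode_step, free_kv, eos_id, max_batch_size):
    running = []                                    # sequences currently decoding
    while waiting or running:
        # 1. Promote waiting requests into free slots (their prefill runs here).
        while waiting and len(running) < max_batch_size:
            running.append(prefill(waiting.popleft()))

        # 2. One decode iteration over whatever the batch contains right now.
        new_tokens = decode_step(running)           # one new token per running sequence

        # 3. Evict finished sequences immediately; their slots free up for
        #    step 1 of the very next iteration.
        still_running = []
        for seq, tok in zip(running, new_tokens):
            seq.emit(tok)                           # stream the token back to its user
            if tok == eos_id or seq.at_max_length():
                free_kv(seq)                        # reclaim the sequence's KV cache
            else:
                still_running.append(seq)
        running = still_running
```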
This is iteration-level scheduling. The original term comes from the 2022 Orca paper at OSDI (Yu, Jeong, Kim, Kim, Chun — Orca: A Distributed Serving System for Transformer-Based Generative Models), which introduced both this idea and a complementary one called selective batching — the observation that some operators in a transformer layer (like attention, where each request has its own KV cache) can’t be trivially batched across sequences with different lengths, so you batch the parts you can (the matrix multiplies for the linear projections) and run the parts you can’t (attention) per-sequence. Continuous batching is the catchier name that stuck.
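A simplified way to see the selective-batching split in code: run the length-agnostic matmuls over all tokens from all sequences at once, and drop to per-sequence loops only for attention. This sketch is single-head, PyTorch-style, with no KV cache; the function and its weight arguments are illustrative and say nothing about Orca's actual kernels.

```python
import torch

def selective_layer(hidden_per_seq, wq, wk, wv, wo):
    """One attention block in the selective-batching style (simplified:
    single head, no KV cache), over sequences of different lengths."""
    lengths = [h.shape[0] for h in hidden_per_seq]

    # Batchable part: linear projections don't care which sequence a token
    # belongs to, so flatten every token into one (total_tokens, d) matmul.
    flat = torch.cat(hidden_per_seq, dim=0)
    q, k, v = flat @ wq, flat @ wk, flat @ wv

    # Non-batchable part: attention mixes tokens *within* a sequence, and each
    # sequence has its own length and state, so run it per sequence.
    outs = []
    for qs, ks, vs in zip(q.split(lengths), k.split(lengths), v.split(lengths)):
        scores = (qs @ ks.T) / ks.shape[-1] ** 0.5
        mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()   # causal mask
        attn = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
        outs.append(attn @ vs)

    # Back to the batchable world for the output projection.
    return (torch.cat(outs, dim=0) @ wo).split(lengths)
```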
Why this isn’t just “smaller batches”
The first time you hear the pitch — “evict finished sequences, slot in new ones” — you might think it’s just dynamic batch sizing. It is more than that. The non-obvious part is that prefill and decode have wildly different costs and arithmetic profiles, and continuous batching has to make a call every iteration about how to mix them.
A new request arriving has a long prompt to process: maybe hundreds or thousands of tokens, all in parallel. That’s a prefill step, and it’s heavy and compute-bound — the GPU finally gets to use its tensor cores near peak because each weight is reused across all the prompt tokens.
A request that’s been around for a while is in the decode phase: one new token at a time, against a growing KV cache. Decode is memory-bandwidth-bound; the compute units sit half-idle while weights stream from HBM.
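A back-of-the-envelope calculation makes the asymmetry concrete. The sketch below just counts FLOPs against bytes of weights streamed per forward pass; the model size and precision are illustrative assumptions, and it ignores KV-cache reads, activations, and attention FLOPs.

```python
# Arithmetic intensity (FLOPs per byte of weights moved) for one forward pass
# of a dense model. Illustrative numbers only.
P = 70e9                 # parameter count (e.g. a 70B dense model)
BYTES_PER_WEIGHT = 2     # fp16/bf16

def flops_per_weight_byte(tokens_in_pass: int) -> float:
    flops = 2 * P * tokens_in_pass          # ~2 FLOPs per weight per token
    weight_bytes = P * BYTES_PER_WEIGHT     # every weight streamed once per pass
    return flops / weight_bytes

# Roughly: a modern datacenter GPU needs on the order of 100+ FLOPs per byte
# of memory traffic before compute, rather than bandwidth, is the bottleneck.
print(flops_per_weight_byte(2048))   # prefill of a 2048-token prompt -> 2048.0 (compute-bound)
print(flops_per_weight_byte(1))      # decoding one token for one sequence -> 1.0 (bandwidth-bound)
print(flops_per_weight_byte(32))     # decode with 32 sequences batched -> 32.0 (better, still bandwidth-bound)
```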
If you naively let prefill and decode share an iteration, the prefill of a long prompt will dominate the wall-clock of that iteration, and every already-running sequence sees a latency spike on the token they were mid-decoding. This is the classic “prefill stalls decode” problem. Modern schedulers handle it in a few ways:
- Chunked prefill — split a long prompt’s prefill into pieces small enough to fit alongside ongoing decodes without dominating the iteration (see the budget sketch after this list). The technique was named and analyzed in the SARATHI paper (Agrawal et al., 2023); vLLM and others have adopted variants.
- Disaggregated prefill/decode — run the two phases on different GPU pools entirely, paying the cost of shipping the KV cache between them, in exchange for keeping the two phases from stepping on each other’s latency. DistServe (OSDI 2024) is the cleanest reference; it’s a serving pattern in its own right rather than a continuous-batching detail.
- Priority and admission control — sometimes you’d rather hold a new request for one extra iteration than spike the latency of 30 ongoing decodes.
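As a concrete illustration of the first option, here is a sketch of a per-iteration token budget for chunked prefill. The budget value, the field names, and the decode-first priority are assumptions for illustration, not a specific scheduler's policy.

```python
# Each iteration gets at most TOKEN_BUDGET tokens of work: running decodes are
# admitted first (one token each), and whatever budget remains goes to the next
# chunk of a pending prefill. Values and fields are illustrative.
TOKEN_BUDGET = 512

def plan_iteration(running_decodes, pending_prefills):
    budget = TOKEN_BUDGET
    plan = []

    # Decodes first: each costs one token and is the latency-sensitive part.
    for seq in running_decodes:
        if budget == 0:
            break
        plan.append(("decode", seq, 1))
        budget -= 1

    # Spend the remainder on prefill chunks, so a 4,000-token prompt is spread
    # over several iterations instead of stalling this one.
    for req in pending_prefills:
        if budget == 0:
            break
        chunk = min(budget, req.remaining_prompt_tokens)
        plan.append(("prefill", req, chunk))
        budget -= chunk

    return plan
```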
So “continuous batching” in 2026 isn’t a single algorithm; it’s a family of schedulers built around iteration-level granularity, all wrestling with the same prefill/decode tension. The Orca paper laid the groundwork; vLLM’s PagedAttention paper (Kwon et al., SOSP 2023) pushed the practical ceiling much higher by attacking KV-cache fragmentation — once you start reshuffling sequences in and out of slots, fragmentation eats real memory, and the max batch size you can sustain shrinks with it.
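To see why paging helps there, here is a toy block-pool sketch in the spirit of PagedAttention; the block size, naming, and structure are illustrative, and vLLM's real allocator is considerably more involved.

```python
# Toy KV block pool: cache memory is carved into fixed-size blocks, each
# sequence holds a block table rather than one contiguous slab, and eviction
# returns its blocks to a shared free list with no fragmentation.
class KVBlockPool:
    def __init__(self, num_blocks: int, block_tokens: int = 16):
        self.block_tokens = block_tokens
        self.free = list(range(num_blocks))      # indices of unused blocks
        self.tables = {}                         # seq_id -> list of block indices

    def append_token(self, seq_id: str, pos: int):
        # A new block is only needed every `block_tokens` tokens.
        if pos % self.block_tokens == 0:
            if not self.free:
                raise MemoryError("KV pool exhausted; scheduler must preempt or wait")
            self.tables.setdefault(seq_id, []).append(self.free.pop())

    def release(self, seq_id: str):
        # Eviction at an iteration boundary: all blocks go straight back to the pool.
        self.free.extend(self.tables.pop(seq_id, []))
```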
What you actually get
The headline number people quote is “throughput at iso-latency” — how many tokens per second the system can deliver across all users while keeping each user’s per-token latency under some target. Continuous batching wins on this axis for two reasons:
- Less padding waste. Slots that would have idled on already-finished sequences get reused for new work instead.
- No head-of-line blocking. A short request doesn’t have to wait for a long one to finish before it gets served.
The wins are biggest when generation lengths are variable and arrival rates are bursty — i.e. when static batching’s worst-case behavior gets triggered constantly. On a workload where every request happens to want exactly the same number of tokens and arrives in lockstep, continuous batching reduces to static batching, and the gap closes.
Where the seams show
A few honest caveats:
- Per-request latency variance gets weirder, not necessarily smaller. Your tokens-per-second now depends on the iteration-by-iteration composition of the batch, which depends on what every other user is doing. Tail latency improves on average; predictability for any one request can get harder to reason about.
- It interacts with everything. Speculative decoding, prompt caching, KV-cache eviction, MoE expert balancing — all of them now have to be implemented to play with a batch whose membership changes every iteration. A surprising amount of inference-engine engineering is “make feature X coexist with continuous batching.” The reason feature parity across engines feels patchy is largely this.
- The headline numbers are workload- and baseline-specific. Anyscale’s 23× was vLLM vs naive/static batching; Orca’s 36.9× was vs FasterTransformer on GPT-3 175B at iso-latency. The right way to read these: “the looser the comparison baseline, the bigger the multiplier reported.” Don’t quote the headline number without a caveat.
Famous related terms
- Iteration-level scheduling — iteration-level scheduling = treat each forward pass as the scheduling unit + reshuffle the batch each step — the mechanism underneath continuous batching; the name from the Orca paper.
- Selective batching — selective batching = batch the layers you can + run per-sequence the layers you can’t — the trick that makes iteration-level scheduling work despite per-sequence attention state.
- Static batching — static batching = pick N requests + run them together until they all finish — the obvious approach; the one continuous batching replaces.
- PagedAttention — PagedAttention = OS-style paging applied to the KV cache — reduces the memory fragmentation that would otherwise cap how much continuous batching can buy you at scale. See vLLM.
- Chunked prefill — chunked prefill = split long prompt prefills into pieces + interleave with ongoing decodes — keeps a single big new request from spiking everyone else’s per-token latency.
- Disaggregated prefill/decode — disaggregated serving = prefill GPUs + decode GPUs + shipped KV cache between them — physical separation of the two phases when even chunked prefill isn’t isolation enough.
- KV cache — KV cache = per-request stored attention K/V tensors + reused on every decode step — the per-request state that makes scheduling messy: every running sequence carries its own growing chunk of GPU memory, and when you evict a sequence you have to reclaim it.
Going deeper
- Yu, Jeong, Kim, Kim, Chun — Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022). Paper. The original; introduces both iteration-level scheduling and selective batching, with the 36.9×-vs-FasterTransformer headline.
- Kwon, Li, Zhuang, Sheng, Zheng, Yu, Gonzalez, Zhang, Stoica — Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023). arXiv. The vLLM paper; the part that made continuous batching production-grade.
- Anyscale — How continuous batching enables 23x throughput in LLM inference while also reducing p50 latency. Blog post. The clearest practitioner explanation, with charts.
- NVIDIA — Mastering LLM Techniques: Inference Optimization. Blog post. Good context on prefill vs decode and how batching interacts with both.