Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why does continuous batching exist?

Static batching works fine for image classifiers and breaks immediately for LLMs. The problem isn't the batch — it's that generation lengths vary, and the slowest sequence holds the GPU hostage.

AI & ML intermediate Apr 29, 2026

Why it exists

Here is the way batching is supposed to work, drilled into anyone who has ever served a deep-learning model. You collect N requests, stack them into a tensor, hand the tensor to the GPU, get N outputs back, return them. One weight load amortized across N inputs. The fixed cost of dragging the model through memory gets divided by N; throughput goes up almost linearly until something else fills up. For an image classifier that produces one logit vector per image, this is a great deal.

For an LLM, this is a disaster. And the reason is mundane enough to be embarrassing: different requests generate different numbers of tokens.

Picture eight users all hitting the same model. One asks for a one-word answer. One asks for a 2,000-token essay. The other six want something in between. If you batch them with static batching — pick the eight, run them together, return when they’re all done — you have just told the user who wanted “yes” that they get to wait for the essay-writer to finish. The GPU keeps decoding the batch, iteration after iteration, even after seven of the eight sequences have produced their EOS token and have nothing left to do. Their slots in the batch are still “computing” — multiplying weights against padding — because that’s how fixed-shape tensor execution works.

So you’ve taken the latency of every fast request and pinned it to the slowest one in the batch. New requests arriving in the meantime queue up behind the whole batch, even though half its slots are doing fake work.

Continuous batching exists because someone noticed that this isn’t a small loss. On real workloads with realistic length distributions, it’s one of the largest sources of wasted GPU time in a generation pipeline.

Why it matters now

Every major LLM inference stack does some form of this. vLLM, TensorRT-LLM, TGI, SGLang, llama.cpp’s server mode — the schedulers differ in detail, but the core idea is the same: don’t wait for a batch to finish; reshuffle the batch every iteration.

It matters now because LLM serving economics live and die on aggregate throughput. Hosted providers price per million tokens. Their margin is the gap between what they bill you and what their GPUs cost per token produced. Continuous batching is one of the handful of optimizations that move that gap meaningfully. The Anyscale write-up reports up to 23× throughput with vLLM over naive/static batching while also reducing p50 latency; in the same comparison TGI’s continuous batching shows roughly 8× over naive. The original Orca paper reports up to 36.9× throughput at the same latency vs NVIDIA FasterTransformer on a GPT-3 175B workload. The multiplier you see in practice depends heavily on workload, baseline, model, and hardware — long, varied generation lengths give the biggest wins; short uniform ones give you almost nothing — so treat any single number as a data point, not a guarantee.

It also matters because it shapes the public APIs you use. Streaming responses, request-level cancellation, and mixing long and short prompts in the same deployment all benefit substantially from per-iteration scheduling. They don’t strictly require it, but without it the behavior of “how does this request feel?” depends on whoever else happened to be in your batch.

The short answer

continuous batching = iteration-level scheduling + per-token batch reshuffling

Instead of treating a batch as a fixed group of requests that runs to completion, treat each forward pass through the model as the unit of scheduling. After every iteration, look at what’s in the batch: any sequence that finished gets evicted, freeing its slot; any waiting request can be slotted in for the next iteration. The batch is a revolving door, not a fixed cohort. The GPU spends its cycles on sequences that still have tokens to produce, instead of grinding through padding for ones that already emitted their EOS.

How it works

Static batching is structured like this:

  1. Wait for a batch of N requests.
  2. Run the prompts through the model (prefill).
  3. Decode tokens autoregressively, one iteration per token, for the whole batch, until every sequence has hit EOS or the max-length limit.
  4. Return all N responses.

The pain is step 3. Suppose request A wants 50 tokens and request B wants 500. After 50 iterations, A is done — but the loop keeps running for 450 more iterations, and A’s slot has to keep producing something (usually just a no-op masked-out forward pass) because the tensor has fixed shape. A’s finished response sits undelivered for those 450 iterations of wall-clock time, and any new request that arrives during them has to wait until the whole batch finishes before it can even start.
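The A-and-B example is easy to put in numbers. A quick sketch (the accounting here is a toy model, counting one slot-iteration per sequence per decode step):

```python
# Static batching runs every slot for as many iterations as the longest
# request in the batch, padding included.
gen_lengths = {"A": 50, "B": 500}   # tokens each request actually generates

iterations = max(gen_lengths.values())            # batch runs 500 decode iterations
slot_iterations = iterations * len(gen_lengths)   # 1000 slot-iterations of GPU work
useful = sum(gen_lengths.values())                # 550 of them produce a real token
wasted = slot_iterations - useful                 # 450 multiply weights against padding

print(f"wasted fraction: {wasted / slot_iterations:.0%}")  # 45%
```

Nearly half the decode work in this two-request batch is padding, and the skew only gets worse as the batch grows and length variance rises.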

Continuous batching restructures step 3:

  1. After each iteration of the decode loop, the scheduler looks at the running batch.
  2. Any sequence that just emitted EOS is removed; its response gets streamed back to its user immediately, and its slot becomes free.
  3. Any waiting request can be promoted into a free slot. (In some schedulers, a fresh request gets its prefill done first, possibly interleaved with decode steps from the existing batch — this is where the implementations diverge.)
  4. Run the next iteration on the new batch composition. Loop.
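The four steps above can be sketched as a scheduler loop. Everything here is a stand-in — `Request`, `model_step`, and the slot accounting are illustrative, not any real engine’s API:

```python
from collections import deque

class Request:
    def __init__(self, rid, tokens_wanted):
        self.rid = rid
        self.remaining = tokens_wanted  # decode steps left before EOS (toy stand-in)

def model_step(batch):
    # Stand-in for one fused forward pass over every running sequence.
    for req in batch:
        req.remaining -= 1  # "emit one token"

def serve(requests, max_batch_size=4):
    waiting = deque(requests)
    running, finished = [], []
    while waiting or running:
        # Promote waiting requests into free slots before the next iteration.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        model_step(running)
        # Evict sequences that just finished; their slots free up immediately.
        finished.extend(r.rid for r in running if r.remaining == 0)
        running = [r for r in running if r.remaining > 0]
    return finished  # completion order, not arrival order

reqs = [Request(i, n) for i, n in enumerate([1, 3, 2, 5, 4])]
print(serve(reqs, max_batch_size=2))  # [0, 1, 2, 4, 3]
```

Note that the batch composition changes between iterations: short requests exit early and later arrivals backfill their slots, which is exactly the revolving-door behavior the text describes.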

This is iteration-level scheduling. The original term comes from the 2022 Orca paper at OSDI (Yu, Jeong, Kim, Kim, Chun — Orca: A Distributed Serving System for Transformer-Based Generative Models), which introduced both this idea and a complementary one called selective batching — the observation that some operators in a transformer layer (like attention, where each request has its own KV cache) can’t be trivially batched across sequences with different lengths, so you batch the parts you can (the matrix multiplies for the linear projections) and run the parts you can’t (attention) per-sequence. Continuous batching is the catchier name that stuck.
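Selective batching is worth a sketch of its own. The shapes and the single-layer structure below are illustrative (real layers add heads, masking, and caching), but the split is the real idea: the linear projections are fused across all sequences regardless of length, while attention runs per sequence:

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
W_qkv = rng.standard_normal((d, 3 * d))  # toy fused QKV projection weights

def layer(token_batches):
    """token_batches: list of (seq_len_i, d) arrays with ragged lengths."""
    # Linear projections don't care which sequence a token came from:
    # flatten all tokens into one matmul, batched across ALL sequences.
    flat = np.concatenate(token_batches)          # (sum of lengths, d)
    qkv = flat @ W_qkv                            # one fused projection
    outs, start = [], 0
    # Attention needs each sequence's own Q/K/V, so it runs per sequence.
    for t in token_batches:
        q, k, v = np.split(qkv[start:start + len(t)], 3, axis=-1)
        scores = (q @ k.T) / np.sqrt(d)
        scores -= scores.max(-1, keepdims=True)   # stabilized softmax
        probs = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
        outs.append(probs @ v)
        start += len(t)
    return outs

outs = layer([np.ones((3, d)), np.ones((5, d))])
print([o.shape for o in outs])  # [(3, 8), (5, 8)]
```

The matmul-heavy parts get full batching efficiency; only the length-sensitive attention pays the per-sequence cost.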

Why this isn’t just “smaller batches”

The first time you hear the pitch — “evict finished sequences, slot in new ones” — you might think it’s just dynamic batch sizing. It is more than that. The non-obvious part is that prefill and decode have wildly different costs and arithmetic profiles, and continuous batching has to make a call every iteration about how to mix them.

A new request arriving has a long prompt to process: maybe hundreds or thousands of tokens, all in parallel. That’s a prefill step, and it’s heavy and compute-bound — the GPU finally gets to use its tensor cores near peak because each weight is reused across all the prompt tokens.

A request that’s been around for a while is in the decode phase: one new token at a time, against a growing KV cache. Decode is memory-bandwidth-bound; the compute units sit half-idle while weights stream from HBM.
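The compute-bound/bandwidth-bound split falls out of a back-of-envelope arithmetic-intensity calculation. The numbers below are round, illustrative figures for a ~7B-parameter fp16 model, not measurements:

```python
# Arithmetic intensity = FLOPs performed per byte moved from memory.
params = 7e9
bytes_per_param = 2            # fp16 weights
flops_per_token = 2 * params   # one multiply-add per weight per token (matmul-dominated)
weight_bytes = params * bytes_per_param

def intensity(tokens_in_pass):
    # Weights are read once per forward pass and reused by every token in it.
    return flops_per_token * tokens_in_pass / weight_bytes

print(f"decode (1 token/pass):       {intensity(1):.0f} FLOP/byte")     # ~1
print(f"prefill (1024-token prompt): {intensity(1024):.0f} FLOP/byte")  # ~1024
```

Modern GPUs need on the order of hundreds of FLOPs per byte to saturate their tensor cores, so decode at ~1 FLOP/byte is bandwidth-bound almost by definition, while a long prefill clears the bar easily.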

If you naively let prefill and decode share an iteration, the prefill of a long prompt will dominate the wall-clock of that iteration, and every already-running sequence sees a latency spike on the token they were mid-decoding. This is the classic prefill stalls decode problem. Modern schedulers handle it in a few ways:

  1. Prefill-prioritized scheduling: when new requests arrive, run an iteration or two of prefill only, accepting a brief decode stall in exchange for admitting work quickly. This is roughly the Orca and early-vLLM posture.
  2. Chunked prefill: split a long prompt into fixed-size chunks and process one chunk per iteration alongside the running decode tokens, capping how long any single iteration can take. This is the Sarathi-Serve idea, since adopted widely.
  3. Prefill/decode disaggregation: run prefill and decode on separate GPU pools and ship the KV cache between them, so the two phases never contend at all. This is the DistServe-style design.
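One widely used mitigation is chunked prefill: cap the total tokens any iteration may process, and let long prompts consume that budget a slice at a time alongside the running decodes. A minimal sketch, where the budget, the function name, and the plan format are all illustrative assumptions:

```python
TOKEN_BUDGET = 512  # max tokens per forward pass (illustrative, not any engine's default)

def plan_iteration(decode_seqs, prefill_queue):
    """Pick this iteration's work: every running decode gets its one token,
    then as much pending prefill as fits in the remaining budget."""
    plan = [("decode", seq_id, 1) for seq_id in decode_seqs]
    budget = TOKEN_BUDGET - len(decode_seqs)
    for seq_id, remaining_prompt in prefill_queue:
        if budget <= 0:
            break
        chunk = min(remaining_prompt, budget)  # long prompts get sliced across iterations
        plan.append(("prefill", seq_id, chunk))
        budget -= chunk
    return plan

# Three running decodes plus a 2000-token prompt: the prompt gets only
# the 509 tokens of budget left after the decodes, not the whole thing.
print(plan_iteration(["a", "b", "c"], [("d", 2000), ("e", 100)]))
```

Because no iteration ever exceeds the budget, the worst-case stall a decoding user can see is bounded, at the cost of prefills taking several iterations to complete.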

So “continuous batching” in 2026 isn’t a single algorithm; it’s a family of schedulers built around iteration-level granularity, all wrestling with the same prefill/decode tension. The Orca paper laid the groundwork; vLLM’s PagedAttention paper (Kwon et al., SOSP 2023) pushed the practical ceiling much higher by attacking KV-cache fragmentation — once you start reshuffling sequences in and out of slots, fragmentation eats real memory, and the max batch size you can sustain shrinks with it.

What you actually get

The headline number people quote is “throughput at iso-latency” — how many tokens per second the system can deliver across all users while keeping each user’s per-token latency under some target. Continuous batching wins on this axis for two reasons:

  1. No slot ever grinds through padding: every iteration’s work goes to sequences that still have a token to produce, so the same hardware emits more real tokens per second.
  2. New requests start as soon as a slot frees instead of queueing behind an entire batch, so time-to-first-token stops depending on the longest generation in flight.

The wins are biggest when generation lengths are variable and arrival rates are bursty — i.e. when static batching’s worst-case behavior gets triggered constantly. On a workload where every request happens to want exactly the same number of tokens and arrives in lockstep, continuous batching reduces to static batching, and the gap closes.

Where the seams show

A few honest caveats:

  1. Per-token latency stops being uniform. A running request can see jitter whenever a new prompt’s prefill shares its iteration — the prefill/decode tension from the previous section never fully goes away, it just gets managed.
  2. Batch composition changes every step, so memory pressure does too. When the KV cache fills, the scheduler has to preempt, swap, or recompute sequences, and tail latency shows it.
  3. The published multipliers (23×, 36.9×) are baseline- and workload-dependent. Measure on your own traffic before budgeting around any of them.

Going deeper