Why does PagedAttention exist?
Naive KV-cache allocation reserves a contiguous slab for the worst-case sequence length, then watches 60–80% of it sit unused. PagedAttention asks: what if we treated GPU memory the way an operating system treats RAM?
Why it exists
Picture how an inference engine in 2022 served a request. A user shows up asking for, say, “up to 2,048 tokens” of generation. The engine has to decide how much GPU memory to set aside for that request’s KV cache — the chunk of state that grows by one entry per generated token and is read on every subsequent step. The engine doesn’t know in advance how many tokens will actually be produced; the user might stop at 12. So the safe move is to allocate the worst case: a contiguous slab big enough for 2,048 tokens, right now, before generation even starts.
Multiply that by every concurrent request and you get the situation the vLLM team measured: existing systems were wasting 60–80% of the HBM they had set aside for KV caches. Some of it was internal fragmentation — slots reserved for tokens the model never generated. Some was external — the engine knows there are 200 MB free somewhere across all the gaps between live caches, but no single contiguous 200 MB block, so a new request has to wait or be rejected. Both failure modes have the same cause: KV caches were allocated as one contiguous tensor per sequence, sized for the worst case, and contiguous worst-case allocation does not survive a workload of varied generation lengths.
The fix turns out to be a 50-year-old idea from operating systems.
Why it matters now
LLM serving is overwhelmingly KV-cache-bound when you push the batch size up. Continuous batching already wins back the GPU cycles wasted on padding; the next bottleneck is whether you can fit enough sequences in memory at once for that scheduler to have something useful to schedule. Reduce KV-cache waste from 60–80% to under 4% (the figure the vLLM team reports) and you fit materially more concurrent sequences on the same hardware — enough, in the paper’s measurements, to translate into 2–4× higher throughput at matched latency vs the contemporary baselines (FasterTransformer, Orca).
That is why many modern inference stacks have either adopted PagedAttention directly (vLLM, which originated it) or shipped a close variant of block-based KV-cache management (TensorRT-LLM, SGLang, TGI). It also unlocks a class of features that look unrelated until you see the mechanism: prefix caching across requests, parallel sampling without duplicating the prompt’s KV state, beam search that doesn’t quadratically blow up memory. All of those are paging tricks in disguise.
The short answer
PagedAttention = KV cache stored in fixed-size non-contiguous blocks + page table mapping logical positions to physical blocks
Instead of giving each sequence one contiguous slab of GPU memory for its KV cache, chop the cache into small fixed-size blocks (16 tokens in the original vLLM paper; current vLLM treats block_size as platform/backend-dependent rather than a hard default) scattered anywhere in HBM. Keep a per-sequence table that maps “logical token positions 0–15” to “physical block #427”, “positions 16–31” to “physical block #88”, and so on. The attention kernel reads through this table the way a CPU reads through a page table.
Allocation now happens one block at a time, on demand, instead of worst-case up front. Fragmentation collapses, and a bunch of useful sharing patterns become almost free.
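To make the mapping concrete, here is a minimal sketch in plain Python. The names (block_table, physical_slot) are illustrative only, not vLLM's API; it just shows how a logical token position resolves to a physical block and an offset.

```python
BLOCK_SIZE = 16  # tokens per block, as in the original paper

# One sequence's block table: logical block index -> physical block id.
# The physical blocks can sit anywhere in HBM, in any order.
block_table = [427, 88, 12]

def physical_slot(t: int) -> tuple[int, int]:
    """Resolve logical token position t to (physical block id, offset within block)."""
    return block_table[t // BLOCK_SIZE], t % BLOCK_SIZE

print(physical_slot(0))   # (427, 0)  -> token 0 lives in block #427
print(physical_slot(20))  # (88, 4)   -> token 20 lives at offset 4 of block #88
```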
How it works
The trick is borrowed almost verbatim from how operating systems manage RAM. A process doesn’t get a contiguous block of physical memory; it gets a virtual address space, and the OS maintains a page table that maps virtual pages to wherever they happen to live in physical RAM. Two processes can share a page (e.g. a shared library) by pointing their page tables at the same physical frame.
PagedAttention applies that scheme to the KV cache:
- The KV cache for a sequence is conceptually a flat array indexed by token position 0, 1, 2, ….
- Physically, that array is split into blocks of a fixed token count — 16 in the original vLLM paper (current vLLM picks the size per backend). Each block lives somewhere in HBM, not necessarily next to the previous one.
- Each running sequence has a small block table: a list of physical block IDs in logical order. To attend to token t, the kernel looks up block_table[t // 16], then offsets by t % 16 inside that block.
- Allocation is one block at a time, when a sequence’s current last block fills up. There is no reservation for tokens not yet generated.
Internal fragmentation is now bounded by block_size − 1 tokens per sequence — at most 15 tokens of waste in the last partial block. External fragmentation effectively disappears, because every allocation is the same size, so the free list never gets stuck holding unusable scraps.
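A sketch of that allocation policy, with hypothetical class names rather than vLLM's internals: every block comes off one free list, and a sequence only gets a new block when its last one fills.

```python
BLOCK_SIZE = 16

class BlockAllocator:
    """Hands out fixed-size KV blocks from a single free list.
    Every allocation is the same size, so the free list never fragments."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks: queue or preempt the request")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)

class Sequence:
    """Tracks one request's growing KV cache, one block at a time."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new block is allocated only when the last one is full;
        # nothing is reserved for tokens that haven't been generated yet.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1
        # Waste is at most BLOCK_SIZE - 1 unused slots in the final block.
```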
Why the OS analogy actually works here
You might reasonably ask: paging in operating systems exists because processes have unpredictable, sparse memory access patterns and need isolation. LLM attention reads a sequence’s KV cache fully and densely on every step. Why does the same mechanism help?
It helps because the useful part of OS paging here isn’t the “isolation” part or the “trap on miss” part. It’s the decoupling of logical address from physical layout. Once you have that, two properties fall out:
- Allocate-on-write. No worst-case reservation; cache grows one block at a time.
- Cheap sharing. If two sequences share a prefix — same prompt, same first k tokens — they can point their block tables at the same physical blocks for that prefix. No copy. When one of them later diverges and writes into a shared block, the system does copy-on-write at block granularity: allocate one fresh block, copy that single shared block’s contents in, redirect just that sequence’s block-table entry. Earlier shared blocks stay shared.
The second property is where PagedAttention pays for itself a second time. Parallel sampling of n completions from one prompt used to mean n full copies of the prompt’s KV cache. With PagedAttention, all n candidates share the prompt blocks by reference; only the divergent suffixes consume new memory. Beam search, which the vLLM paper benchmarks specifically, gets an even bigger lift because beams share long prefixes among themselves, not just with the prompt — the paper reports vLLM’s improvement over Orca going from 1.3× in basic sampling to 2.3× in beam search at width 6 on OPT-13B with the Alpaca dataset.
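Block-granularity copy-on-write is only a few lines once blocks carry reference counts. This sketch (again illustrative names, not vLLM's code) forks a sequence for parallel sampling by sharing its block table, then copies exactly one block the first time a fork writes into a shared one.

```python
BLOCK_SIZE = 16

class PagedKV:
    """Reference-counted KV blocks with block-granularity copy-on-write."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refcount: dict[int, int] = {}  # physical block id -> sequences using it
        self.data: dict[int, list] = {}     # physical block id -> stored K/V entries

    def _alloc(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        self.data[block] = []
        return block

    def fork(self, block_table: list[int]) -> list[int]:
        """Share a prefix by reference: bump refcounts, copy no KV data."""
        for block in block_table:
            self.refcount[block] += 1
        return list(block_table)

    def append(self, block_table: list[int], num_tokens: int, kv_entry) -> None:
        """Write one token's K/V, copying a shared block only when it is written to."""
        if num_tokens % BLOCK_SIZE == 0:
            block_table.append(self._alloc())
        last = block_table[-1]
        if self.refcount[last] > 1:
            # Copy-on-write: duplicate just this block and redirect only this
            # sequence's table entry; earlier shared blocks stay shared.
            self.refcount[last] -= 1
            fresh = self._alloc()
            self.data[fresh] = list(self.data[last])
            block_table[-1] = fresh
            last = fresh
        self.data[last].append(kv_entry)

# Parallel sampling: n candidates start by sharing the prompt's blocks by
# reference, e.g. candidate_tables = [kv.fork(prompt_table) for _ in range(n)].
```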
What it costs
Paging is not free. Every attention computation now has an extra indirection: instead of keys[t], it’s physical_blocks[block_table[t // 16]][t % 16]. The vLLM team wrote custom CUDA kernels (and rewrote them several times since the 2023 paper) so that this indirection happens at the warp level without serializing memory loads. On the hardware they tested, the per-token cost was small enough to be dominated by the throughput gains from fitting more sequences in the batch — but it is a real cost, and a from-scratch implementation that doesn’t fuse the lookup into the attention kernel will give some of it back.
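Here is what that indirection looks like when it is not fused: a naive single-head decode step in NumPy that first gathers the sequence's keys and values through the block table, then attends. Function and array names are hypothetical; a fused kernel does the same address arithmetic inside the attention loop instead of materializing the gather.

```python
import numpy as np

BLOCK_SIZE = 16
HEAD_DIM = 128

def paged_attention_decode(q, key_blocks, value_blocks, block_table, seq_len):
    """One decode step for one head.
    q: (HEAD_DIM,); key_blocks / value_blocks: (num_physical_blocks, BLOCK_SIZE, HEAD_DIM)."""
    # The extra indirection: walk the block table to gather this sequence's
    # keys/values from wherever their blocks landed, then attend as usual.
    keys = np.concatenate([key_blocks[b] for b in block_table])[:seq_len]
    values = np.concatenate([value_blocks[b] for b in block_table])[:seq_len]
    scores = keys @ q / np.sqrt(HEAD_DIM)   # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values                 # (HEAD_DIM,)
```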
There’s also a tuning knob: block size. Larger blocks mean less table-walking overhead but more internal fragmentation per sequence and worse prefix-cache hit rates (because two requests have to share whole blocks, not partial ones, to benefit from caching). Smaller blocks invert that trade. vLLM’s choice of 16 is a working default, not a universal optimum.
Where the seams show
A few honest caveats:
- The “60–80% wasted” figure is from the original vLLM paper’s measurements on the systems it compared against in 2023 (notably earlier FasterTransformer/Orca configurations on specific workloads). The exact percentage is workload- and baseline-dependent. The direction — naive contiguous allocation is bad, especially with varied generation lengths — is robust.
- PagedAttention isn’t the only way. You can also attack KV-cache waste with eviction policies, KV compression/quantization, or sliding windows. Paging targets the allocation problem specifically; it stacks with those rather than replacing them.
- The kernel engineering matters more than the idea. A page table is a few-line concept; the reason vLLM specifically dominated for a while was the fused attention kernel that read through that table efficiently. A clone that gets the algorithm right but the kernel wrong will lose to a well-tuned non-paged baseline.
- It interacts with everything. Quantization, speculative decoding, multi-LoRA serving, MoE routing, prefix caching across requests — each one has to be made paging-aware. A non-trivial fraction of inference-engine engineering since 2023 has been “make feature X coexist with paging.”
Famous related terms
- Virtual memory / paging (the OS feature) — paging = virtual addresses + page table + physical frames — the original; PagedAttention is paging applied to the KV cache instead of process memory.
- KV cache — KV cache = stored attention keys and values + reused on every decode step — the thing PagedAttention is managing.
- Continuous batching — continuous batching = iteration-level scheduling + per-token batch reshuffling — the partner technique; without paging, fragmentation caps how much continuous batching can buy you.
- Copy-on-write — CoW = share until someone writes + copy the page on write — the OS trick that lets parallel sampling and beam search share prompt blocks for free.
- Prefix caching — prefix caching = remember KV blocks for shared prompt prefixes + reuse them across requests — falls out almost for free once the cache is paged. See why prompt caching exists.
- Block size — the tuning knob; small means better sharing and less internal fragmentation, large means less indirection overhead. The original vLLM paper used 16 tokens.
Going deeper
- Kwon, Li, Zhuang, Sheng, Zheng, Yu, Gonzalez, Zhang, Stoica — Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023). arXiv · ACM. The original paper; the 60–80% waste figure and the OS-paging analogy come from here.
- vLLM team — vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention. Blog post. The plain-English version, with the throughput charts.
- vLLM docs — Paged Attention. Page. The historical design doc for the original paper-era kernel — the page itself notes it no longer reflects the code shipping in vLLM today, but it’s still the cleanest walk-through of the block-table mechanism.
- vLLM docs — Automatic Prefix Caching. Page. The “free sharing” property, made into a user-facing feature.