Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why does PagedAttention exist?

Naive KV-cache allocation reserves a contiguous slab for the worst-case sequence length, then watches 60–80% of it sit unused. PagedAttention asks: what if we treated GPU memory the way an operating system treats RAM?

AI & ML · intermediate · Apr 30, 2026

Why it exists

Picture how an inference engine in 2022 served a request. A user shows up asking for, say, “up to 2,048 tokens” of generation. The engine has to decide how much GPU memory to set aside for that request’s KV cache — the chunk of state that grows by one entry per generated token and is read on every subsequent step. The engine doesn’t know in advance how many tokens will actually be produced; the user might stop at 12. So the safe move is to allocate the worst case: a contiguous slab big enough for 2,048 tokens, right now, before generation even starts.

Multiply that by every concurrent request and you get the situation the vLLM team measured: existing systems were wasting 60–80% of the HBM they had set aside for KV caches. Some of it was internal fragmentation — slots reserved for tokens the model never generated. Some was external — the engine knows there are 200 MB free somewhere across all the gaps between live caches, but no single contiguous 200 MB block, so a new request has to wait or be rejected. Both failure modes have the same cause: KV caches were allocated as one contiguous tensor per sequence, sized for the worst case, and contiguous worst-case allocation does not survive a workload of varied generation lengths.

The fix turns out to be a 50-year-old idea from operating systems.

Why it matters now

LLM serving is overwhelmingly KV-cache-bound when you push the batch size up. Continuous batching already wins back the GPU cycles wasted on padding; the next bottleneck is whether you can fit enough sequences in memory at once for that scheduler to have something useful to schedule. Reduce KV-cache waste from 60–80% to under 4% (the figure the vLLM team reports) and you fit materially more concurrent sequences on the same hardware — enough, in the paper’s measurements, to translate into 2–4× higher throughput at matched latency vs the contemporary baselines (FasterTransformer, Orca).

That is why many modern inference stacks have either adopted PagedAttention directly (vLLM, which originated it) or shipped a close variant of block-based KV-cache management (TensorRT-LLM, SGLang, TGI). It also unlocks a class of features that look unrelated until you see the mechanism: prefix caching across requests, parallel sampling without duplicating the prompt’s KV state, beam search that doesn’t quadratically blow up memory. All of those are paging tricks in disguise.

The short answer

PagedAttention = KV cache stored in fixed-size non-contiguous blocks + page table mapping logical positions to physical blocks

Instead of giving each sequence one contiguous slab of GPU memory for its KV cache, chop the cache into small fixed-size blocks (16 tokens in the original vLLM paper; current vLLM treats block_size as platform/backend-dependent rather than a hard default) scattered anywhere in HBM. Keep a per-sequence table that maps “logical token positions 0–15” to “physical block #427”, “positions 16–31” to “physical block #88”, and so on. The attention kernel reads through this table the way a CPU reads through a page table. Allocation now happens one block at a time, on demand, instead of worst-case up front. Fragmentation collapses, and a bunch of useful sharing patterns become almost free.
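The bookkeeping described above can be sketched in a few lines. This is a toy model, not vLLM's actual API — `BlockTable`, `append_token`, and the free-list allocator are all illustrative names, and the 16-token block size is the paper's default:

```python
BLOCK_SIZE = 16  # per the original vLLM paper; real deployments may differ

class BlockTable:
    """Per-sequence map from logical token positions to physical blocks."""

    def __init__(self):
        self.block_ids = []   # logical block i -> physical block id
        self.num_tokens = 0

    def append_token(self, free_list):
        # Allocate a new physical block only when crossing a block boundary:
        # allocation is on-demand, one block at a time, not worst-case up front.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_ids.append(free_list.pop())
        self.num_tokens += 1

    def lookup(self, t):
        # Logical position t -> (physical block id, offset within block)
        return self.block_ids[t // BLOCK_SIZE], t % BLOCK_SIZE

free_blocks = list(range(1000))  # toy free list of physical block ids
table = BlockTable()
for _ in range(40):              # generate 40 tokens
    table.append_token(free_blocks)
# 40 tokens occupy ceil(40/16) = 3 physical blocks, scattered anywhere
```

The point of the sketch is the `lookup` step: the sequence stays logically contiguous while the physical blocks land wherever the free list happened to have space.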

How it works

The trick is borrowed almost verbatim from how operating systems manage RAM. A process doesn’t get a contiguous block of physical memory; it gets a virtual address space, and the OS maintains a page table that maps virtual pages to wherever they happen to live in physical RAM. Two processes can share a page (e.g. a shared library) by pointing their page tables at the same physical frame.

PagedAttention applies that scheme to the KV cache:

  1. The cache is chopped into fixed-size blocks (16 tokens each in the paper), placed wherever free space exists in HBM.
  2. Each sequence gets a block table mapping logical block indices to physical block IDs — the analogue of a per-process page table.
  3. The attention kernel resolves every read through that table, so the sequence stays logically contiguous while its physical layout is scattered.
  4. New blocks are allocated on demand, one at a time, as the sequence grows past a block boundary.

Internal fragmentation is now bounded by block_size − 1 tokens per sequence — with 16-token blocks, at most 15 tokens of waste in the last partial block. External fragmentation effectively disappears, because every allocation is the same size, so the free list never gets stuck holding unusable scraps.
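A toy calculation makes the difference concrete. Assume a request that reserves room for 2,048 tokens but actually generates 300 (both numbers illustrative):

```python
MAX_TOKENS = 2048   # worst-case reservation
BLOCK_SIZE = 16     # paper's block size
generated = 300     # tokens actually produced

# Contiguous worst-case allocation: the entire unused tail is stranded.
naive_waste = MAX_TOKENS - generated                 # 1748 token slots

# Paged allocation: waste is confined to the last partially filled block.
blocks_used = -(-generated // BLOCK_SIZE)            # ceil division -> 19 blocks
paged_waste = blocks_used * BLOCK_SIZE - generated   # 4 token slots
```

Here the naive scheme strands 1,748 token slots of KV state; paging strands 4 — and the paged figure can never exceed block_size − 1, regardless of how the generation lengths are distributed.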

Why the OS analogy actually works here

You might reasonably ask: paging in operating systems exists because processes have unpredictable, sparse memory access patterns and need isolation. LLM attention reads a sequence’s KV cache fully and densely on every step. Why does the same mechanism help?

It helps because the useful part of OS paging here isn’t the “isolation” part or the “trap on miss” part. It’s the decoupling of logical address from physical layout. Once you have that, two properties fall out:

  1. Allocate-on-write. No worst-case reservation; cache grows one block at a time.
  2. Cheap sharing. If two sequences share a prefix — same prompt, same first k tokens — they can point their block tables at the same physical blocks for that prefix. No copy. When one of them later diverges and writes into a shared block, the system does copy-on-write at block granularity: allocate one fresh block, copy that single shared block’s contents in, redirect just that sequence’s block-table entry. Earlier shared blocks stay shared.
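The sharing-plus-copy-on-write mechanics can be sketched with a refcount per physical block. Everything here is illustrative — `BlockPool`, `fork`, and `write` are hypothetical names, not vLLM's interfaces:

```python
class BlockPool:
    """Toy allocator with per-block reference counts."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def alloc(self):
        b = self.free.pop()
        self.refcount[b] = 1
        return b

    def fork(self, block_ids):
        # Share a prefix by reference: bump refcounts, copy nothing.
        for b in block_ids:
            self.refcount[b] += 1
        return list(block_ids)

    def write(self, seq_blocks, i, store):
        # Copy-on-write at block granularity: copy only if block i is shared.
        b = seq_blocks[i]
        if self.refcount[b] > 1:
            new = self.alloc()
            store[new] = store[b].copy()   # copy one block, not the prefix
            self.refcount[b] -= 1
            seq_blocks[i] = new
        return seq_blocks[i]

pool = BlockPool(8)
store = {}
prompt = [pool.alloc(), pool.alloc()]   # two prompt blocks
for b in prompt:
    store[b] = [0] * 16                 # dummy KV contents

sample_a = pool.fork(prompt)            # two samples share the prompt...
sample_b = pool.fork(prompt)            # ...at zero memory cost

pool.write(sample_b, 1, store)          # sample_b diverges in block 1
# Block 0 is still shared by all three tables; only block 1 was copied.
```

Note what the divergence costs: one block's worth of copying, not the whole prefix. The earlier shared blocks never move.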

The second property is where PagedAttention pays for itself a second time. Parallel sampling of n completions from one prompt used to mean n full copies of the prompt’s KV cache. With PagedAttention, all n candidates share the prompt blocks by reference; only the divergent suffixes consume new memory. Beam search, which the vLLM paper benchmarks specifically, gets an even bigger lift because beams share long prefixes among themselves, not just with the prompt — the paper reports vLLM’s improvement over Orca going from 1.3× in basic sampling to 2.3× in beam search at width 6 on OPT-13B with the Alpaca dataset.

What it costs

Paging is not free. Every attention computation now has an extra indirection: instead of keys[t], it’s physical_blocks[block_table[t // 16]][t % 16]. The vLLM team wrote custom CUDA kernels (and rewrote them several times since the 2023 paper) so that this indirection happens at the warp level without serializing memory loads. On the hardware they tested, the per-token cost was small enough to be dominated by the throughput gains from fitting more sequences in the batch — but it is a real cost, and a from-scratch implementation that doesn’t fuse the lookup into the attention kernel will give some of it back.
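What that indirection looks like, stripped of the CUDA details, is a gather through the block table. A NumPy sketch (shapes and names illustrative; the real kernel fuses this lookup into the attention computation rather than materializing a contiguous copy):

```python
import numpy as np

BLOCK_SIZE = 16
HEAD_DIM = 8
NUM_PHYS_BLOCKS = 32

rng = np.random.default_rng(0)
# Physical key cache: a pool of blocks scattered anywhere in memory.
key_blocks = rng.standard_normal((NUM_PHYS_BLOCKS, BLOCK_SIZE, HEAD_DIM))

# A 40-token sequence whose logical blocks live at arbitrary physical ids.
block_table = np.array([27, 3, 14])
seq_len = 40

def gather_key(t):
    # keys[t] becomes key_blocks[block_table[t // 16]][t % 16]
    return key_blocks[block_table[t // BLOCK_SIZE], t % BLOCK_SIZE]

# Contiguous logical view of the keys, reassembled through the table.
keys = np.stack([gather_key(t) for t in range(seq_len)])
assert keys.shape == (seq_len, HEAD_DIM)
```

Doing this gather as a separate pass, as above, is exactly the naive implementation the paragraph warns about: it pays the indirection in extra memory traffic. The fused-kernel version resolves the same `block_table[t // 16]` lookup inside the attention loop so the loads stay coalesced.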

There’s also a tuning knob: block size. Larger blocks mean less table-walking overhead but more internal fragmentation per sequence and worse prefix-cache hit rates (because two requests have to share whole blocks, not partial ones, to benefit from caching). Smaller blocks invert that trade. vLLM’s choice of 16 is a working default, not a universal optimum.

Where the seams show

A few honest caveats:

Going deeper