Why is the KV cache a thing?
The model has to read your whole prompt every time it picks a token. Why doesn't it choke? Because of a quiet trick almost nobody mentions in the docs.
Why it exists
Here’s the thing that should bother you the first time you really look at how an LLM generates text.
A model takes your prompt and produces one next token. To produce the second next token, it has to look at the prompt plus that first generated token. To produce the third, it looks at the prompt plus the first two generated tokens. And so on. Every step extends the input by one token and runs the whole forward pass again.
If you take that literally, generating a 500-token answer to a 2,000-token prompt means doing 500 forward passes over sequences of length 2,001, 2,002, 2,003, … 2,500. Each of those forward passes is, naively, quadratic in the sequence length because of attention. That should be unusably slow. Open a chat interface, watch tokens stream in at 50–100 a second on a sequence that’s already thousands of tokens long, and you should be asking: how?
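To put numbers on "unusably slow", here's a back-of-the-envelope tally of attention score computations for exactly that scenario, comparing full recomputation with reusing the work from earlier steps (the trick the next paragraph names). It ignores the causal mask and everything outside attention; it's only meant to show the shape of the cost.

```python
# Rough tally of attention score computations (query-key pairs) for the
# 2,000-token prompt / 500-token answer example above. Back-of-the-envelope
# only: ignores the causal mask and everything outside attention.
prompt_len, gen_len = 2000, 500

# Naive: every step reruns attention over the whole sequence so far.
naive = sum(n * n for n in range(prompt_len + 1, prompt_len + gen_len + 1))

# Reusing past work: one full pass over the prompt, then each new token
# only scores itself against everything before it.
reused = prompt_len * prompt_len + sum(range(prompt_len + 1, prompt_len + gen_len + 1))

print(f"naive:  {naive:,}")   # ~2.5 billion pairs
print(f"reused: {reused:,}")  # ~5 million pairs
```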
The answer is the KV cache. It is the single most important inference-time optimization in modern LLM serving, and somehow it almost never makes it into the explainer posts. It’s why your tokens stream instead of stalling. It’s also why “context window” has a memory cost, why long prompts get expensive in a non-obvious way, and why your GPU runs out of VRAM before it runs out of compute.
Why it matters now
The KV cache is load-bearing in a way most people don’t notice until something breaks:
- Throughput and latency in production. Without it, generating long outputs from long prompts is roughly cubic in length. With it, each new token is roughly linear. Every serving system you’ve used — vLLM, TensorRT-LLM, llama.cpp, hosted Claude, hosted GPT — relies on it.
- VRAM as the real bottleneck. People assume “context window” is a software limit. It’s mostly a memory limit. The KV cache is the thing taking up that memory. Doubling your context roughly doubles cache size per request, and a server holds one cache per concurrent request.
- Why prompt caching APIs exist. When OpenAI, Anthropic, and others ship “prompt caching” features that drop the cost of repeated prefixes, what they’re actually doing is keeping the KV cache for a prefix on the GPU between requests instead of recomputing it. The user-visible price drop is the cache miss you didn’t have to pay.
- Why speculative decoding works. It exploits the fact that verifying a batch of guesses uses the KV cache the same way generating one token does — so checking ten guesses is barely more expensive than generating one.
If you’re building anything that calls an LLM at scale, the cost model in your head should have “KV cache” in it.
The short answer
KV cache = a per-layer store of the key and value vectors for every token already in the sequence, reused so attention only has to compute K and V for the new token
A transformer’s attention layer, for each token, computes three vectors: Q (query), K (key), V (value). Q is what this token is asking about; K and V are what every token offers up to be attended to. The crucial asymmetry: when you generate token N+1, the K and V for tokens 1…N don’t change. They were already computed last step. The KV cache just keeps them around so you don’t redo the work.
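A toy check of that asymmetry, with made-up embeddings and a random matrix standing in for W_K:

```python
import numpy as np

# Toy check: K for a token depends only on that token's embedding and W_K,
# so extending the sequence leaves the old keys untouched.
# All values here are made up; W_K stands in for a trained projection.
rng = np.random.default_rng(0)
d = 8
W_K = rng.standard_normal((d, d))

x_prefix = rng.standard_normal((3, d))                           # embeddings for tokens 1..3
x_extended = np.vstack([x_prefix, rng.standard_normal((1, d))])  # same tokens, plus token 4

K_prefix = x_prefix @ W_K
K_extended = x_extended @ W_K

print(np.allclose(K_prefix, K_extended[:3]))  # True: keys for tokens 1..3 didn't move
```

The same holds for V; and because later tokens never change what earlier positions see (more on that below), those cached rows stay valid through every layer of the stack.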
How it works
Picture transformer inference as a loop:
prompt -> forward pass -> next token
prompt + token1 -> forward pass -> token2
prompt + token1 + token2 -> forward pass -> token3
...
Inside each forward pass, every attention layer does, for each input token, something like:
Q_i = x_i · W_Q
K_i = x_i · W_K
V_i = x_i · W_V
attention_i = softmax(Q_i · K_all^T / sqrt(d)) · V_all
Two things to notice:
- `K_i` and `V_i` only depend on the token at position `i` and the model’s weights. They don’t depend on what comes after. So once computed for a given token, they’re frozen for the rest of generation.
- To compute attention for the new token, you need the full `K_all` and `V_all` — the keys and values for every position so far. But you already computed almost all of them on previous steps.
The KV cache is the obvious move: keep K and V for every token, every layer, in GPU memory. On each new generation step:
- Run the forward pass on only the one new token.
- Compute its `Q`, `K`, `V`.
- Append the new `K` and `V` to the cache.
- Compute attention using the new `Q` against the full cached `K` and `V`.
That changes the per-step cost from “redo everything” to “do one token’s worth of compute, plus a single attention read against the cache.” The quadratic blow-up is gone for the generation phase.
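Here's that loop as a deliberately tiny sketch: one layer, one head, plain NumPy, random weights, and a fake stand-in for everything else the model does. It isn't how any real engine is written; it only shows where the cache lives, when it gets appended to, and when it gets read.

```python
import numpy as np

# Stripped-down sketch of cached decoding: one layer, one head, random weights,
# and a fake stand-in for "the rest of the model".
rng = np.random.default_rng(0)
d = 16
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    scores = q @ K.T / np.sqrt(d)   # new token's query against every cached key
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

# Prefill: compute K and V for every prompt token, plus the attention output
# for the last prompt position (the one that predicts the first new token).
# A real prefill computes outputs for every position, since deeper layers need them.
prompt = rng.standard_normal((5, d))             # stand-in embeddings for a 5-token prompt
K_cache = prompt @ W_K
V_cache = prompt @ W_V
x = attend(prompt[-1] @ W_Q, K_cache, V_cache)   # fake: stands in for the first new token's embedding

# Decode: each step computes Q/K/V for one new token, appends K and V, reads the cache.
for _ in range(3):
    q, k, v = x @ W_Q, x @ W_K, x @ W_V
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    x = attend(q, K_cache, V_cache)              # fake: stands in for the next token's embedding

print(K_cache.shape)                             # (8, 16): 5 prompt tokens + 3 generated
```

Notice that nothing sequence-length-sized is ever recomputed inside the loop; the only thing that grows is the cache, and it is only appended to and read.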
There are two distinct phases people sometimes conflate:
- Prefill — the first forward pass over the whole prompt. This is expensive (roughly quadratic in prompt length) because nothing is cached yet. But it happens once.
- Decode — every subsequent step, generating one token at a time. This is the cheap, cache-using phase.
That split is why “time to first token” and “tokens per second after the first” are reported separately. Prefill cost lives in the first; cache amortization lives in the second.
What it costs
The cache size, per request, is roughly:
2 (K and V) × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element
For a model in the 70B-parameter range with a long context, this lands in the multiple-gigabytes-per-request territory. Multiply by however many requests you’re serving concurrently. This is why batching, paging (PagedAttention), quantization of the cache, and architectures like grouped-query attention exist. Half the work in modern inference engines is squeezing this thing.
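As a sanity check on "multiple gigabytes", here's the formula with assumed numbers for a 70B-class model that uses GQA (80 layers, 8 KV heads, head dimension 128, fp16 cache). These are illustrative values, not the spec of any particular deployment:

```python
# The formula above with assumed 70B-class numbers (80 layers, 8 KV heads after
# GQA, head_dim 128, fp16 cache). Illustrative values, not any specific model.
num_layers, num_kv_heads, head_dim = 80, 8, 128
bytes_per_element = 2          # fp16 / bf16
seq_len = 32_768

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
print(bytes_per_token // 1024, "KiB per token")            # 320 KiB
print(round(bytes_per_token * seq_len / 2**30, 1), "GiB")  # 10.0 GiB for one 32k request
```

Roughly 320 KiB per token, so a single 32k-token request holds about 10 GiB of cache before you've served a second user.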
Where it gets subtle
- It only works because attention is causal. Each token only attends to earlier tokens, so adding token N+1 doesn’t change anything about how tokens 1…N attended to each other. Bidirectional models (like classic BERT) can’t cache this way during inference because every token’s representation depends on the whole sequence in both directions.
- Cache correctness is fragile. Anything that changes the past — editing a previous token, inserting a system message after the fact, swapping model weights — invalidates the cache. This is one reason “edit your message” in a chat UI is implemented as starting a new generation, not patching mid-stream.
- Prompt caching across requests is the same trick at a different scope. If two requests share a long prefix (a system prompt, a long document), the KV cache for that prefix can be reused across them. The provider charges less because they did less work. The mechanism is the same; only the lifetime of the cache changed. (There's a stripped-down sketch of this right after this list.)
- I’m soft on exact numbers. Specific KV cache sizes per model and the precise per-request VRAM footprint depend on architecture details (number of KV heads after GQA, head dimension, dtype, whether the cache is quantized) that aren’t always public for hosted models. Treat the size formula above as the shape of the cost, not a promise about any particular deployment.
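For the prefix-reuse point above, a deliberately naive sketch of the idea. The names are made up, and real systems work at a finer grain, with eviction and block-level sharing, but the billing story is exactly this shape:

```python
# Deliberately naive sketch of the prefix-reuse idea: made-up names, no eviction,
# no partial-prefix matching, nothing like a real provider's implementation.
prefix_cache = {}   # tuple of prefix token ids -> whatever prefill produced (K/V per layer)

def kv_for_prefix(prefix_token_ids, prefill_fn):
    key = tuple(prefix_token_ids)
    if key not in prefix_cache:
        prefix_cache[key] = prefill_fn(prefix_token_ids)   # cache miss: pay for prefill once
    return prefix_cache[key]                               # cache hit: prefill already paid for
```

Two requests that share the same long system prompt hit the same key; the second one skips straight to decode, which is the discount the pricing page is describing.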
The thing to take away: the KV cache is what turns “decoding is quadratic” into “decoding is linear, with a memory bill.” That memory bill is now the defining constraint on how long, how concurrent, and how cheap LLM serving can be.
Famous related terms
- Prefill vs. decode — `prefill = one big pass over the prompt; decode = one-token passes using the cache`. The two phases have different bottlenecks (compute vs. memory bandwidth), which is why some inference engines schedule them separately.
- PagedAttention — `PagedAttention ≈ virtual memory for the KV cache` — chops the cache into fixed-size blocks so requests can grow without contiguous allocations. The core trick behind vLLM.
- Grouped-query attention (GQA) — `GQA = multiple query heads sharing one KV head` — directly shrinks KV cache size. Why most newer open models use it.
- Multi-query attention (MQA) — `MQA = all query heads share a single KV head`. The aggressive end of the same idea; trades quality for cache size.
- Speculative decoding — `speculative decoding = small model proposes, big model verifies in parallel`. Hits a sweet spot only because verification reuses the KV cache.
- Prompt caching — `prompt caching = persist a prefix's KV cache between requests`. The user-facing version of the same optimization.
- LLM — the thing whose attention layers this is all happening inside.
- Tokenization — defines the unit the cache is indexed by; longer tokens mean fewer cache entries for the same text.
Going deeper
- Attention Is All You Need (Vaswani et al., 2017) — the original transformer paper. The KV cache isn’t named there, but the causal-mask property that makes it possible is.
- Efficient Memory Management for Large Language Model Serving with PagedAttention (Kwon et al., 2023) — the vLLM paper. The clearest practical treatment of why the KV cache is the bottleneck and what to do about it.
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (Ainslie et al., 2023) — why modern models cut KV heads.
- Any modern inference-engine codebase (vLLM, llama.cpp, TensorRT-LLM): grep for `kv_cache` and follow the data structures. Five minutes of reading real code teaches more than any blog post, this one included.