
Why is the KV cache a thing?

The model has to read your whole prompt every time it picks a token. Why doesn't it choke? Because of a quiet trick almost nobody mentions in the docs.

AI & ML · intermediate · Apr 29, 2026

Why it exists

Here’s the thing that should bother you the first time you really look at how an LLM generates text.

A model takes your prompt and produces one next token. To produce the second next token, it has to look at the prompt plus that first generated token. To produce the third, it looks at the prompt plus the first two generated tokens. And so on. Every step extends the input by one token and runs the whole forward pass again.

If you take that literally, generating a 500-token answer to a 2,000-token prompt means doing 500 forward passes over sequences of length 2,001, 2,002, 2,003, … 2,500. Each of those forward passes is, naively, quadratic in the sequence length because of attention. That should be unusably slow. Open a chat interface, watch tokens stream in at 50–100 a second on a sequence that’s already thousands of tokens long, and you should be asking: how?
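A quick back-of-the-envelope makes the gap vivid. This counts only attention's query-key scores per layer as a stand-in for work; it ignores the MLPs and everything else, and the exact numbers are illustrative:

# Rough proxy for attention work: query-key scores per layer.
# With a causal mask, a full pass over n tokens scores about n*(n+1)/2 pairs;
# a cached decode step scores only the new token's query against n keys.
prompt_len, gen_len = 2_000, 500
lengths = range(prompt_len + 1, prompt_len + gen_len + 1)
naive = sum(n * (n + 1) // 2 for n in lengths)       # redo attention over everything, every step
cached = sum(n for n in lengths)                     # one new query per step
print(f"{naive:,} vs {cached:,}")                    # ~1.3 billion vs ~1.1 million scores
print(f"{naive / cached:.0f}x less attention work")  # roughly 1,100x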

The answer is the KV cache. It is the single most important inference-time optimization in modern LLM serving, and somehow it almost never makes it into the explainer posts. It’s why your tokens stream instead of stalling. It’s also why “context window” has a memory cost, why long prompts get expensive in a non-obvious way, and why your GPU runs out of VRAM before it runs out of compute.

Why it matters now

KV cache is load-bearing in a way most people don’t notice until something breaks: it determines how long a context you can actually afford, how many requests fit on one GPU at once, and how cheap each generated token can be.

If you’re building anything that calls an LLM at scale, the cost model in your head should have “KV cache” in it.

The short answer

KV cache = a per-layer store of the key and value vectors for every token already in the sequence, reused so each step only has to compute Q, K, and V for the new token instead of recomputing K and V for everything that came before it.

A transformer’s attention layer, for each token, computes three vectors: Q (query), K (key), V (value). Q is what this token is asking about; K and V are what every token offers up to be attended to. The crucial asymmetry: when you generate token N+1, the K and V for tokens 1…N don’t change. They were already computed last step. The KV cache just keeps them around so you don’t redo the work.
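A minimal numpy sketch of that asymmetry, treating the per-position inputs as given and using toy dimensions and random weights (nothing here corresponds to a real model):

import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 16, 8
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))

x_5 = rng.normal(size=(5, d_model))                    # inputs for 5 tokens
x_6 = np.vstack([x_5, rng.normal(size=(1, d_model))])  # same 5 tokens plus one new one

# The first five keys and values are bit-for-bit the same either way:
# appending a token changes nothing about what was already computed.
assert np.allclose(x_5 @ W_K, (x_6 @ W_K)[:5])
assert np.allclose(x_5 @ W_V, (x_6 @ W_V)[:5])

The same property holds layer by layer in a real model, because causal attention never lets position i look at anything after it, so its hidden state never changes once computed.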

How it works

Picture transformer inference as a loop:

prompt -> forward pass -> next token
prompt + token1 -> forward pass -> token2
prompt + token1 + token2 -> forward pass -> token3
...

Inside each forward pass, every attention layer does, for each input token, something like:

Q_i = x_i · W_Q
K_i = x_i · W_K
V_i = x_i · W_V
attention_i = softmax(Q_i · K_all / sqrt(d)) · V_all

Two things to notice:

  1. K_i and V_i depend only on the hidden state at position i and the model’s weights. Because attention is causal, nothing that comes after position i can change that hidden state, so once computed for a given token they’re frozen for the rest of generation.
  2. To compute attention for the new token, you need the full K_all and V_all — the keys and values for every position so far. But you already computed almost all of them on previous steps.

The KV cache is the obvious move: keep K and V for every token, every layer, in GPU memory. On each new generation step:

  1. Compute Q, K, and V for the new token only.
  2. Append the new K and V to the per-layer caches.
  3. Attend the new token’s Q against the full cached K and V to get its output.

That changes the per-step cost from “redo everything” to “do one token’s worth of compute, plus a single attention read against the cache.” The quadratic blow-up is gone for the generation phase.
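Here’s a toy numpy version of one cached decode step for a single attention head. The dimensions, weights, and the decode_step function are all made up for illustration; real implementations are batched, multi-layer, multi-head, and fused, but the data flow is this:

import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 16, 8
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))

def decode_step(x_new, k_cache, v_cache):
    """One generation step: project the new token, grow the cache, attend."""
    q = x_new @ W_Q                              # query for the new token only
    k_cache = np.vstack([k_cache, x_new @ W_K])  # append the new key
    v_cache = np.vstack([v_cache, x_new @ W_V])  # append the new value
    scores = k_cache @ q / np.sqrt(d_head)       # one row of attention scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over every cached position
    return weights @ v_cache, k_cache, v_cache   # output plus the grown caches

# Feed tokens one at a time; the cache grows by one row per step.
k_cache, v_cache = np.zeros((0, d_head)), np.zeros((0, d_head))
for x_new in rng.normal(size=(6, d_model)):      # six fake token states
    out, k_cache, v_cache = decode_step(x_new, k_cache, v_cache)
print(k_cache.shape)                             # (6, 8): one key per token so far

Note that no causal mask shows up: the new token is allowed to attend to everything already in the cache, which is exactly its causal context.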

There are two distinct phases people sometimes conflate:

  1. Prefill: the whole prompt is processed in one highly parallel pass, and the cache is filled with K and V for every prompt token.
  2. Decode: tokens are generated one at a time, each step computing Q, K, and V only for the newest token and reading everything else from the cache.

That split is why “time to first token” and “tokens per second after the first” are reported separately. Prefill cost lives in the first; cache amortization lives in the second.
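Continuing the toy sketch above (same made-up W_K, W_V, rng, and dimensions), prefill is just the batched version of the same projections, run once over the whole prompt:

# Prefill: one batched pass over the prompt populates the cache
# before any token is generated.
prompt_states = rng.normal(size=(2000, d_model))  # stand-in for the prompt's per-token inputs
k_cache = prompt_states @ W_K                     # (2000, d_head), computed once
v_cache = prompt_states @ W_V
# Time to first token is dominated by these big matmuls; tokens per second
# afterwards is dominated by decode_step-style reads against the cache.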

What it costs

The cache size, per request, is roughly:

2 (K and V) × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element

For a model in the 70B-parameter range with a long context, this lands in the multiple-gigabytes-per-request territory. Multiply by however many requests you’re serving concurrently. This is why batching, paging (PagedAttention), quantization of the cache, and architectures like grouped-query attention exist. Half the work in modern inference engines is squeezing this thing.
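To make that concrete, here’s the formula as code with an illustrative configuration. The numbers are assumptions in the ballpark of a 70B-class model with grouped-query attention, not any particular model’s published specs:

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_element):
    # 2 because both K and V are stored for every layer, KV head, and position.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_element

# Illustrative config: 80 layers, 8 KV heads (grouped-query attention),
# head_dim 128, fp16 cache entries (2 bytes), 32,000-token sequence.
per_request = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                             seq_len=32_000, bytes_per_element=2)
print(f"{per_request / 1e9:.1f} GB per request")  # about 10.5 GB

Quantize the cache entries to 8-bit or cut the number of KV heads and the bill shrinks proportionally, which is exactly why cache quantization and grouped-query attention earn their keep.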

Where it gets subtle

The thing to take away: the KV cache is what turns “decoding is quadratic” into “decoding is linear, with a memory bill.” That memory bill is now the defining constraint on how long, how concurrent, and how cheap LLM serving can be.

Going deeper