Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why grouped-query attention exists

Multi-head attention is a memory-bandwidth disaster at decode time. GQA keeps most of the quality and throws away most of the bandwidth bill.

AI & ML intermediate Apr 30, 2026

Why it exists

If you stare at where time actually goes during LLM autoregressive decoding — generating tokens one at a time after the prompt is processed — the answer is unintuitive: the matrix multiplies aren’t the bottleneck. Reading the cached keys and values out of VRAM is. Every generated token has to re-read the entire KV cache from HBM for every layer, for every head.

Standard multi-head attention makes that bill enormous. With 64 heads, you store and re-load 64 separate K tensors and 64 separate V tensors per layer per token. That’s the memory-bandwidth wall the industry kept hitting once context windows grew.

Grouped-query attention is the flinch. It says: keep all 64 query heads — those are the cheap, expressive part — but share keys and values across groups of them. A small tweak to the attention block, the same overall transformer recipe, almost the same quality, a fraction of the KV traffic.

Why it matters now

Open the config of almost any modern open-weight model — Llama 3, Mistral, Gemma, Qwen — and you’ll see num_attention_heads and num_key_value_heads as separate fields, with the second one smaller. That’s GQA. In 2026 it’s the default for any model that’s meant to be served seriously, and it’s the reason a 70B model can serve long contexts on hardware that wouldn’t have managed two years ago.
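
You can check this straight from the config object. A quick sketch; the repo id here is Llama 3 8B's, which is a gated repo, so it assumes you've accepted the license and are authenticated:

```python
from transformers import AutoConfig

# Llama 3 8B is gated: accepting the license and logging in
# (e.g. via huggingface-cli) is assumed here.
cfg = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B")
print(cfg.num_attention_heads)   # 32 query heads
print(cfg.num_key_value_heads)   # 8 KV heads -> groups of 4 query heads per KV head
```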

Without it, you’re choosing between three bad options: small context, expensive inference, or a quality cliff (full multi-query attention).

The short answer

GQA = multi-head queries + shared key/value heads in groups

You keep the full set of query heads, but you tell groups of them to share a single K head and a single V head. With 64 query heads and 8 KV heads, every group of 8 queries reads the same K and V. The KV cache shrinks by 8×, and the KV-cache portion of decode traffic shrinks by the same factor. (Total bytes per token also include the model weights, which don’t change — so the wall-clock speedup is smaller than 8×.)
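
To put rough numbers on it, here's a back-of-envelope sketch using Llama 2 70B's published shape (80 layers, 64 query heads, head dimension 128, 8 KV heads) and an fp16 cache; exact figures shift with quantization and implementation details:

```python
# Rough KV-cache sizing, assuming Llama 2 70B's published config:
# 80 layers, 64 attention heads, head_dim 128, 8 KV heads, fp16 cache.
layers, heads, kv_heads, head_dim, bytes_per = 80, 64, 8, 128, 2

def kv_bytes_per_token(n_kv_heads):
    # K and V tensors, per layer, summed over all layers, for one token
    return 2 * n_kv_heads * head_dim * bytes_per * layers

mha = kv_bytes_per_token(heads)      # ~2.5 MiB per token
gqa = kv_bytes_per_token(kv_heads)   # ~0.31 MiB per token

ctx = 4096
print(f"MHA cache @ {ctx} tokens: {mha * ctx / 2**30:.1f} GiB")   # ~10 GiB
print(f"GQA cache @ {ctx} tokens: {gqa * ctx / 2**30:.2f} GiB")   # ~1.25 GiB
```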

How it works

The mechanism is almost embarrassingly simple — that’s part of why it caught on so fast.

In multi-head attention with H heads, each head h has its own Q_h, K_h, V_h projections. The KV cache stores H copies of K and V per token per layer. At decode time, generating one token means streaming all of that out of HBM.

In multi-query attention (Shazeer, 2019, Fast Transformer Decoding: One Write-Head is All You Need), there’s just one K and one V per layer, shared across all query heads. The KV cache is H times smaller. Shazeer’s own paper reports only minor quality loss; the GQA paper motivates its own existence partly by appealing to MQA’s quality-and-stability trade-offs. I’d treat “MQA always degrades quality noticeably” as folklore rather than an established result.

GQA (Ainslie et al., 2023, EMNLP) splits the difference. Pick G groups, where 1 ≤ G ≤ H. Each group has its own K and V; all the query heads in that group share them. G = H recovers multi-head attention; G = 1 recovers multi-query. Llama 2 70B uses H = 64, G = 8 (the paper says the 34B and 70B variants use GQA; the exact 64/8 split is from the released config). Llama 3 ships GQA across all sizes, but the head ratios vary by size — Llama 3 8B is 32/8, for instance.
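
Here's a minimal PyTorch sketch of the head-sharing itself (function names are mine; no causal mask, no cache plumbing). Naive implementations materialize the repeated K/V like this, while fused kernels index the shared heads directly:

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v, n_kv_heads):
    """Grouped-query attention for one layer, decode-style shapes.
    q: (batch, n_heads,    seq_q, head_dim)
    k: (batch, n_kv_heads, seq_k, head_dim)
    v: (batch, n_kv_heads, seq_k, head_dim)
    n_heads must be a multiple of n_kv_heads; each group of
    n_heads // n_kv_heads query heads shares one K head and one V head.
    """
    n_heads = q.shape[1]
    group = n_heads // n_kv_heads
    # Broadcast each KV head to all the query heads in its group.
    k = k.repeat_interleave(group, dim=1)   # (batch, n_heads, seq_k, head_dim)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

# n_kv_heads == n_heads recovers multi-head attention;
# n_kv_heads == 1 recovers multi-query attention.
b, H, G, d, t = 1, 64, 8, 128, 16
q = torch.randn(b, H, 1, d)    # one new query token at decode time
k = torch.randn(b, G, t, d)    # cached keys: only G heads ever stored
v = torch.randn(b, G, t, d)
out = gqa_attention(q, k, v, n_kv_heads=G)   # (1, 64, 1, 128)
```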

The paper’s other contribution is the recipe for converting an existing multi-head checkpoint to GQA without retraining from scratch — they call it “uptraining.” Mean-pool the K heads inside each group, mean-pool the V heads inside each group, then continue pretraining for a small fraction (around 5%) of the original pretraining compute, on the same data recipe. The uptrained model lands close to the original multi-head quality with multi-query-class inference speed.
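
A sketch of that conversion for a single K or V projection, assuming the common layout where head outputs are stacked contiguously along the weight's first dimension (the function name and the layout assumption are mine, not the paper's):

```python
import torch

def pool_kv_heads(w, n_heads, n_kv_heads, head_dim):
    """Mean-pool a K or V projection from n_heads down to n_kv_heads.
    Assumes w has shape (n_heads * head_dim, d_model) with heads stacked
    contiguously along the output dimension.
    """
    d_model = w.shape[1]
    group = n_heads // n_kv_heads
    w = w.view(n_kv_heads, group, head_dim, d_model)
    return w.mean(dim=1).reshape(n_kv_heads * head_dim, d_model)

# e.g. 64 K heads pooled down to 8; after this, continue pretraining
# ("uptraining") for roughly 5% of the original compute.
w_k = torch.randn(64 * 128, 8192)
w_k_gqa = pool_kv_heads(w_k, n_heads=64, n_kv_heads=8, head_dim=128)  # (1024, 8192)
```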

Why does sharing K and V work but sharing Q wouldn’t? Intuition, not proof: the queries are what each token uses to ask its own question of the past, so they need to stay diverse. Keys and values are the answers the past offers — and apparently those are redundant enough across heads that you can compress them hard. This is the standard hand-wave; I haven’t seen a clean theoretical account of why the asymmetry holds, only empirical evidence that it does.

The seam worth noticing: GQA helps decode throughput much more than it helps prefill (processing the prompt). Decode is usually memory-bandwidth-bound; prefill is usually compute-bound, with K and V computed in parallel across the prompt. So the win is concentrated on long-output, batched serving — exactly the regime production inference cares about.
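
One way to see the asymmetry is arithmetic intensity. A rough sketch, assuming fp16 weights, a 70B-parameter dense model at batch size 1, the ~10 GiB multi-head cache from earlier, and ignoring attention FLOPs:

```python
# Back-of-envelope arithmetic intensity (FLOPs per byte moved from HBM).
params = 70e9
weight_bytes = params * 2   # fp16

def intensity(tokens_processed, kv_bytes_read):
    flops = 2 * params * tokens_processed        # ~2 FLOPs per weight per token
    bytes_moved = weight_bytes + kv_bytes_read   # weights are read once per pass
    return flops / bytes_moved

decode = intensity(1, 10 * 2**30)   # one new token, whole MHA cache re-read
prefill = intensity(4096, 0)        # the prompt in one pass, attention kept on-chip
print(f"decode ~{decode:.1f} FLOP/byte, prefill ~{prefill:.0f} FLOP/byte")
# Modern accelerators need on the order of a few hundred FLOPs per byte to stay
# compute-bound, so decode sits deep in bandwidth-bound territory; prefill does not.
```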

Going deeper