Why grouped-query attention exists
Multi-head attention is a memory-bandwidth disaster at decode time. GQA keeps most of the quality and throws away most of the bandwidth bill.
Why it exists
If you stare at where time actually goes during LLM autoregressive decoding — generating tokens one at a time after the prompt is processed — the answer is unintuitive: the matrix multiplies aren’t the bottleneck. Reading the cached keys and values out of VRAM is. Every generated token has to re-read the entire KV cache from HBM for every layer, for every head.
Standard multi-head attention makes that bill enormous. With 64 heads, you store and re-load 64 separate K tensors and 64 separate V tensors per layer per token. That’s the memory-bandwidth wall the industry kept hitting once context windows grew.
Grouped-query attention is the flinch. It says: keep all 64 query heads — those are the cheap, expressive part — but share keys and values across groups of them. A small tweak to the attention block, the same overall transformer recipe, almost the same quality, a fraction of the KV traffic.
Why it matters now
Open the config of almost any modern open-weight model — Llama 3, Mistral, Gemma, Qwen — and you’ll see num_attention_heads and num_key_value_heads as separate fields, with the second one smaller. That’s GQA. It’s the default for serious inference-time models in 2026, and it’s the reason a 70B model can serve long contexts on hardware that wouldn’t have managed two years ago.
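For concreteness, here is the relevant slice of a Hugging Face-style config.json, shown as a Python dict for brevity. The 64/8 split matches the released Llama 2 70B config mentioned later; other models use different ratios, so check the file for the model you're actually running.

```python
# Illustrative excerpt of a Hugging Face-style config.json (values from the
# released Llama 2 70B config; treat as an example, not a universal default).
config_excerpt = {
    "num_attention_heads": 64,   # query heads (H)
    "num_key_value_heads": 8,    # K/V heads (G); H / G = 8 query heads per group
}
```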
Without it, you’re choosing between three bad options: small context, expensive inference, or a quality cliff (full multi-query attention).
The short answer
GQA = multi-head queries + shared key/value heads in groups
You keep the full set of query heads, but you tell groups of them to share a single K head and a single V head. With 64 query heads and 8 KV heads, every group of 8 queries reads the same K and V. The KV cache shrinks by 8×, and the KV-cache portion of decode traffic shrinks by the same factor. (Total bytes per token also include the model weights, which don’t change — so the wall-clock speedup is smaller than 8×.)
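A back-of-envelope sketch makes the trade-off concrete. The shape below assumes a Llama-2-70B-like model (80 layers, head_dim 128, fp16 weights and cache) at batch size 1 with 8K cached tokens; the exact numbers are illustrative.

```python
# Back-of-envelope decode traffic, assuming a Llama-2-70B-like shape:
# 80 layers, head_dim 128, fp16 weights and cache, batch size 1, 8K cached tokens.
layers, head_dim, bytes_per_el = 80, 128, 2
context_tokens = 8192
weight_bytes = 70e9 * bytes_per_el                    # ~70B params in fp16

def kv_cache_bytes(kv_heads):
    # K and V: one head_dim vector per KV head, per layer, per cached token
    return 2 * layers * kv_heads * head_dim * bytes_per_el * context_tokens

for name, kv_heads in [("MHA (64 KV heads)", 64), ("GQA (8 KV heads)", 8)]:
    cache = kv_cache_bytes(kv_heads)
    per_token = weight_bytes + cache                  # bytes streamed to emit one token
    print(f"{name}: cache {cache/1e9:5.1f} GB, per-token traffic {per_token/1e9:5.1f} GB")

# The cache shrinks 8x (about 21 GB -> 2.7 GB here), but the per-token total only
# drops from ~161 GB to ~143 GB at batch 1, because the ~140 GB of weights still
# streams every step. At larger batch sizes the weight reads amortize across the
# batch while KV reads don't, so the relative saving from GQA grows.
```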
How it works
The mechanism is almost embarrassingly simple — that’s part of why it caught on so fast.
In multi-head attention with H heads, each head h has its own Q_h, K_h, V_h projections. The KV cache stores H copies of K and V per token per layer. At decode time, generating one token means streaming all of that out of HBM.
In multi-query attention (Shazeer, 2019, Fast Transformer Decoding: One Write-Head is All You Need), there’s just one K and one V per layer, shared across all query heads. The KV cache is H times smaller. Shazeer’s own paper reports only minor quality loss; the GQA paper motivates its own existence partly by appealing to MQA’s quality-and-stability trade-offs. I’d treat “MQA always degrades quality noticeably” as folklore rather than an established result.
GQA (Ainslie et al., 2023, EMNLP) splits the difference. Pick G groups, where 1 ≤ G ≤ H. Each group has its own K and V; all the query heads in that group share them. G = H recovers multi-head attention; G = 1 recovers multi-query. Llama 2 70B uses H = 64, G = 8 (the paper says the 34B and 70B variants use GQA; the exact 64/8 split is from the released config). Llama 3 ships GQA across all sizes, but the head ratios vary by size — Llama 3 8B is 32/8, for instance.
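A minimal sketch of the mechanism for a single decode step, in PyTorch. The shape layout and function name are mine, not any particular library's API; setting G = H or G = 1 in the same code recovers MHA or MQA.

```python
# Minimal grouped-query attention for one decode step (sketch, not a library API).
import torch
import torch.nn.functional as F

def gqa_decode_step(q, k_cache, v_cache):
    # q:       [batch, H, 1, d]   the new token's query, H query heads
    # k_cache: [batch, G, T, d]   cached keys,   G KV heads, T past tokens
    # v_cache: [batch, G, T, d]   cached values
    B, H, _, d = q.shape
    G = k_cache.shape[1]
    assert H % G == 0
    # Each group of H // G query heads shares one KV head. For clarity we expand
    # K/V to H heads; production kernels index the shared heads directly instead
    # of materializing copies.
    k = k_cache.repeat_interleave(H // G, dim=1)      # [B, H, T, d]
    v = v_cache.repeat_interleave(H // G, dim=1)      # [B, H, T, d]
    scores = (q @ k.transpose(-2, -1)) / d**0.5       # [B, H, 1, T]
    return F.softmax(scores, dim=-1) @ v              # [B, H, 1, d]

# Llama-2-70B-like shapes: 64 query heads sharing 8 KV heads, head_dim 128.
out = gqa_decode_step(torch.randn(1, 64, 1, 128),
                      torch.randn(1, 8, 512, 128),
                      torch.randn(1, 8, 512, 128))
print(out.shape)  # torch.Size([1, 64, 1, 128])
```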
The paper’s other contribution is the recipe for converting an existing multi-head checkpoint to GQA without retraining from scratch — they call it “uptraining.” Mean-pool the K heads inside each group, mean-pool the V heads inside each group, then continue pretraining for a small fraction (around 5%) of the original pretraining compute, on the same data recipe. The uptrained model lands close to the original multi-head quality with multi-query-class inference speed.
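The mean-pooling initialization itself is a one-liner once you know the weight layout. The sketch below assumes the K (or V) projection matrix stores its output rows head-by-head, which is the common convention but varies across codebases; the uptraining step that follows is ordinary continued pretraining and isn't shown.

```python
# Sketch of the GQA paper's checkpoint-conversion init: mean-pool the per-head
# K (or V) projection weights inside each group. Row layout is an assumption.
import torch

def mean_pool_kv_heads(w_kv, n_heads, n_kv_heads, head_dim):
    # w_kv: [n_heads * head_dim, hidden], output rows grouped head-by-head
    group = n_heads // n_kv_heads
    w = w_kv.view(n_kv_heads, group, head_dim, -1).mean(dim=1)  # average heads per group
    return w.reshape(n_kv_heads * head_dim, -1)                 # [n_kv_heads * head_dim, hidden]

w_k_mha = torch.randn(64 * 128, 8192)   # Llama-2-70B-like K projection (illustrative)
w_k_gqa = mean_pool_kv_heads(w_k_mha, n_heads=64, n_kv_heads=8, head_dim=128)
print(w_k_gqa.shape)  # torch.Size([1024, 8192])
```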
Why does sharing K and V work but sharing Q wouldn’t? Intuition, not proof: the queries are what each token uses to ask its own question of the past, so they need to stay diverse. Keys and values are the answers the past offers — and apparently those are redundant enough across heads that you can compress them hard. This is the standard hand-wave; I haven’t seen a clean theoretical account of why the asymmetry holds, only empirical evidence that it does.
The seam worth noticing: GQA helps decode throughput much more than it helps prefill (processing the prompt). Decode is usually memory-bandwidth-bound; prefill is usually compute-bound, with K and V computed in parallel across the prompt. So the win is concentrated on long-output, batched serving — exactly the regime production inference cares about.
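A rough arithmetic-intensity calculation shows why the two phases hit different walls; the numbers below are illustrative, not measurements.

```python
# Arithmetic intensity (FLOPs per byte of weights read) for one [n, n] fp16
# weight matrix. Prefill multiplies it by T prompt tokens at once; decode by
# a single token per step. Numbers are illustrative.
n, T = 8192, 4096
bytes_per_el = 2
weight_bytes = n * n * bytes_per_el
prefill_flops = 2 * T * n * n        # one matmul over the whole prompt
decode_flops = 2 * 1 * n * n         # one matmul per generated token
print("prefill FLOPs/byte:", prefill_flops / weight_bytes)   # ~T  -> compute-bound
print("decode  FLOPs/byte:", decode_flops / weight_bytes)    # ~1  -> bandwidth-bound
# A modern accelerator sustains roughly hundreds of FLOPs per byte of HBM
# bandwidth, so decode can't keep the compute units busy; cutting KV traffic
# with GQA attacks exactly that bound.
```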
Famous related terms
- Multi-query attention (MQA) — MQA = multi-head queries + 1 shared K/V — the 2019 extreme that GQA generalizes; faster than GQA, with a quality trade-off whose size is debated.
- Multi-head attention (MHA) — MHA = H independent (Q, K, V) heads — the original “Attention Is All You Need” recipe; high quality, expensive at inference.
- KV cache — KV cache = past keys and values, reused per token — the data structure GQA is shrinking.
- Memory bandwidth — bandwidth = bytes/sec from HBM to compute — the actual quantity GQA is conserving.
- Multi-head latent attention (MLA) — MLA = MHA + low-rank latent K/V cache + decoupled RoPE channel. DeepSeek’s later variant that compresses K/V into a low-rank latent rather than sharing heads. Different idea, same goal: shrink the cache.
Going deeper
- Shazeer, Fast Transformer Decoding: One Write-Head is All You Need (2019) — arXiv:1911.02150. The MQA paper.
- Ainslie, Lee-Thorp, de Jong, Zemlyanskiy, Lebrón, Sanghai, GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (EMNLP 2023) — arXiv:2305.13245. The original GQA paper, including the uptraining recipe.
- The Llama 2 paper (arXiv:2307.09288) is one of the early high-profile open-weight models to ship with GQA in its larger variants.