What is attention (in transformers)?
Every token in a sequence gets to peek at every other token and decide which ones matter. That trick is the engine inside every modern LLM.
Why it exists
Before attention, the dominant way to model a sequence was a recurrent neural network: read one token, update a hidden state, read the next, update again. By the end you had a single fixed-size vector that was supposed to summarize everything that had happened.
That worked for short sequences and melted on long ones. If the answer to “what does it refer to?” lives forty tokens back, the model has to carry that fact forward through forty hidden-state updates without losing it. Information decays. Long-range dependencies get lossy in a way you can’t fix by making the network bigger.
Bahdanau, Cho, and Bengio’s 2014 paper Neural Machine Translation by Jointly Learning to Align and Translate introduced the fix: instead of squeezing the source sentence into one vector, let the decoder look back at every encoder state and decide which ones matter for the word it’s about to produce. The mechanism was called attention. Three years later, Vaswani et al.’s Attention Is All You Need (2017) threw out the recurrence entirely — if attention is doing the work, why have the RNN at all? Stack attention layers with feed-forward networks between them and you get the transformer.
Why it matters now
Attention is the load-bearing operation inside almost every modern LLM, and increasingly inside vision and multimodal models too. When people say “the transformer revolution,” what they really mean is “everyone figured out attention scales.”
It’s also why GPU memory is the bottleneck for inference. Every token looking at every other token costs roughly N² compute at sequence length N. Most of modern serving complexity exists to manage that one fact — see why attention is quadratic.
The short answer
attention = soft, content-addressed lookup over a sequence of tokens
Each token produces a query asking what it’s looking for. Every token also offers a key describing what it has and a value it will contribute if matched. The token’s new representation is a weighted sum of all the values, where the weights come from how well its query matches each key. Everything — what to ask, what to advertise, what to return — is learned.
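One way to read that: a Python dict is a hard content-addressed lookup, and attention is the soft version. A tiny sketch with made-up numbers (nothing here comes from a real model; the arrays are purely illustrative):

```python
import numpy as np

# Hard lookup: a key either matches or it doesn't; exactly one entry comes back.
table = {"cat": 1.0, "mat": 2.0}
print(table["mat"])

# Soft lookup: every entry contributes, weighted by how well the query matches its key.
keys = np.array([[1.0, 0.0], [0.0, 1.0]])   # what each token advertises
values = np.array([[10.0], [20.0]])         # what each token contributes
query = np.array([0.1, 2.0])                # what this token is looking for
scores = keys @ query / np.sqrt(2)          # match score per key
weights = np.exp(scores) / np.exp(scores).sum()  # softmax: weights sum to 1
print(weights @ values)                     # weighted blend, pulled toward values[1]
```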
How it works
Every token plays three roles at once. Given a token embedding, the model multiplies it by three learned weight matrices to produce:
- Query (Q) — “what am I looking for?”
- Key (K) — “what do I have to offer?”
- Value (V) — “what will I contribute if you pick me?”
To compute the new representation for token i, take its query and compare it against every token’s key (its own included) using a dot product. That gives a raw score per token. Divide by √d, where d is the key dimension (the scaled dot-product trick: without it the scores grow with dimension and saturate the softmax), pass the scores through a softmax so they sum to one, and use those weights to take a weighted average of the values:
output_i = Σ_j softmax_j(Q_i · K_j / √d) · V_j    (softmax taken over j, for each fixed i)
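In code the whole thing is a few lines. A minimal NumPy sketch of the formula above, assuming one unbatched sequence, a single head, and no masking; variable names and shapes are illustrative:

```python
import numpy as np

def attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over one sequence X of shape (N, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v             # each token plays all three roles
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (N, N): query i against key j
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over j: each row sums to 1
    return weights @ V                              # row i = weighted average of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                         # 5 tokens, model dim 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = attention(X, W_q, W_k, W_v)                   # (5, 8): one new representation per token
```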
A toy example. In “the cat sat on the mat because it was warm,” does it refer to the cat or the mat? When the model computes attention for it, the query ends up scoring high against the key for mat (warmth fits a mat better than a cat), and the weighted sum pulls mat’s value into the new representation of it. Nobody told the model to do this. The pattern emerged because, during training, getting that kind of resolution right made next-token prediction better.
A few wrinkles worth knowing:
- A head is one Q/K/V projection plus its softmax-weighted sum. Multi-head attention runs several in parallel with different projections and concatenates the outputs — different heads can pick up different relations.
- Causal (masked) attention is what decoder-only LLMs use: when predicting token i, mask out positions > i so the model can’t peek at the future. Otherwise training is trivial — copy the next token. (Both the mask and the multi-head split are sketched in code after this list.)
- Self vs. cross. Q, K, V from the same sequence is self-attention; queries from one stream and K/V from another is cross-attention.
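Here is the sketch promised above: the same computation with a causal mask, plus the multi-head pattern of running several smaller attentions in parallel and concatenating the results. Head counts and dimensions are made up, and the output projection that real transformers apply after the concatenation is omitted:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Masked (causal) attention: position i may only attend to positions j <= i."""
    N, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                      # (N, N) raw scores
    mask = np.triu(np.ones((N, N), dtype=bool), k=1)   # True strictly above the diagonal
    scores = np.where(mask, -np.inf, scores)           # future positions score -inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax gives masked slots weight 0
    return weights @ V

# Multi-head = several smaller attentions in parallel, outputs concatenated.
rng = np.random.default_rng(0)
N, n_heads, d_head = 6, 2, 4
Q, K, V = (rng.normal(size=(n_heads, N, d_head)) for _ in range(3))
heads = [causal_attention(Q[h], K[h], V[h]) for h in range(n_heads)]
out = np.concatenate(heads, axis=-1)   # (N, n_heads * d_head); a learned projection follows in practice
```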
The “attention” name is evocative but a bit misleading. It’s not a model of human attention — humans don’t softmax over their visual field. It’s a soft lookup table whose weights concentrate on a few entries when the model is confident and spread out when it isn’t. The metaphor stuck because the visualized weights look like focus.
Famous related terms
- Self-attention —
  self-attention = attention where Q, K, V all come from the same sequence. The default inside an LLM block.
- Cross-attention —
  cross-attention = Q from one stream + K, V from another. How a decoder reads an encoder, or a text decoder reads image features.
- Multi-head attention —
  MHA = N parallel attention heads, concatenated. See why MLA replaced MHA for what came next.
- KV cache —
  KV cache = stored K and V tensors for past tokens, reused on each decode step. Turns per-step decode from O(N²) to O(N). (Sketched in code after this list.)
- Quadratic cost —
  attention cost ∝ N² · d. Every pair of tokens gets a score; the N×N score matrix is intrinsic to the operation.
- Flash attention —
  FlashAttention = exact softmax attention + tiling so the N×N matrix never hits HBM. Same FLOPs, far fewer memory reads.
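To make the KV cache entry concrete, here is a minimal decode-step sketch (pure NumPy, illustrative names, no batching): each new token computes one query, appends its key and value to the cache, and attends over everything cached so far, so per-step work grows linearly with the tokens generated.

```python
import numpy as np

def decode_step(x_new, W_q, W_k, W_v, k_cache, v_cache):
    """One autoregressive decode step with a KV cache.
    x_new: embedding of the newest token, shape (d_model,)."""
    q = x_new @ W_q                              # only the new token needs a query
    k_cache = np.vstack([k_cache, x_new @ W_k])  # append the new key...
    v_cache = np.vstack([v_cache, x_new @ W_v])  # ...and the new value
    scores = k_cache @ q / np.sqrt(q.shape[-1])  # (t,) scores against all cached tokens
    w = np.exp(scores - scores.max())
    w /= w.sum()                                 # softmax over cached positions
    return w @ v_cache, k_cache, v_cache         # O(t) work per step instead of O(t²)

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
k_cache = np.empty((0, d))
v_cache = np.empty((0, d))
for x in rng.normal(size=(4, d)):                # decode four tokens, one at a time
    out, k_cache, v_cache = decode_step(x, W_q, W_k, W_v, k_cache, v_cache)
```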
Going deeper
- Bahdanau, Cho, Bengio, Neural Machine Translation by Jointly Learning to Align and Translate (2014) — attention’s introduction, in the encoder-decoder NMT setting.
- Vaswani et al., Attention Is All You Need (2017) — the transformer paper. Section 3.2 is the attention math, in about a page.
- Jay Alammar, The Illustrated Transformer — the visual walkthrough most practitioners learned this from.
A note on what I’m sure of: the Q/K/V mechanism, the role of softmax, the asymptotic cost, and the historical sequence (Bahdanau 2014 → Vaswani 2017) are all well-established. Why multi-head specifically works — what the heads end up specializing in, whether interpretable “syntax head” / “coreference head” stories generalize — is more debated than a clean post like this can convey. Treat the head-specialization intuition as a sketch, not a proven claim.