Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

What is attention (in transformers)?

Every token in a sequence gets to peek at every other token and decide which ones matter. That trick is the engine inside every modern LLM.

AI & ML intro Apr 30, 2026

Why it exists

Before attention, the dominant way to model a sequence was a recurrent neural network: read one token, update a hidden state, read the next, update again. By the end you had a single fixed-size vector that was supposed to summarize everything that had happened.

That worked for short sequences and melted on long ones. If the answer to “what does it refer to?” lives forty tokens back, the model has to carry that fact forward through forty hidden-state updates without losing it. Information decays. Long-range dependencies get lossy in a way you can’t fix by making the network bigger.

Bahdanau, Cho, and Bengio’s 2014 paper Neural Machine Translation by Jointly Learning to Align and Translate introduced the fix: instead of squeezing the source sentence into one vector, let the decoder look back at every encoder state and decide which ones matter for the word it’s about to produce. The mechanism was called attention. Three years later, Vaswani et al.’s Attention Is All You Need (2017) threw out the recurrence entirely — if attention is doing the work, why have the RNN at all? Stack attention layers with feed-forward networks between them and you get the transformer.

Why it matters now

Attention is the load-bearing operation inside almost every modern LLM, and increasingly inside vision and multimodal models too. When people say “the transformer revolution,” what they really mean is “everyone figured out attention scales.”

It’s also why GPU memory is the bottleneck for inference. Every token looking at every other token costs roughly N² compute at sequence length N. Most of modern serving complexity exists to manage that one fact — see why attention is quadratic.
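A back-of-envelope sketch makes the N² point concrete. The numbers below count only the fp32 attention-score matrix for a single head in a single layer (no batching, no KV cache), so they are illustrative rather than a real serving budget:

```python
# One score per (query, key) pair: the score matrix alone is n x n floats.
for n in (1_000, 10_000, 100_000):
    floats = n * n
    gb = floats * 4 / 1e9  # 4 bytes per fp32 score
    print(f"n={n:>7,}: {gb:.3f} GB of fp32 scores")
```

Going from a 10k-token context to a 100k-token context makes this one matrix 100× larger, which is why long-context serving leans so hard on tricks that avoid materializing it.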

The short answer

attention = soft, content-addressed lookup over a sequence of tokens

Each token produces a query asking what it’s looking for. Every token also offers a key describing what it has and a value it will contribute if matched. The token’s new representation is a weighted sum of all the values, where the weights come from how well its query matches each key. Everything — what to ask, what to advertise, what to return — is learned.

How it works

Every token plays three roles at once. Given a token embedding, the model multiplies it by three learned weight matrices to produce:

- a query (Q) — what this token is looking for,
- a key (K) — what this token advertises about itself,
- a value (V) — what this token will contribute if matched.

To compute the new representation for token i, take its query and compare it against every token’s key (including its own) using a dot product. That gives one raw score per token. Divide by √d, where d is the key dimension (the scaled dot-product trick keeps the scores from blowing up as d grows), pass the scores through a softmax so they sum to one, and use those weights to take a weighted average of the values:

output_i = Σ_j softmax(Q_i · K_j / √d) · V_j
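The whole computation fits in a few lines of numpy. This is a minimal single-head sketch — the matrix shapes and random inputs are illustrative, and real implementations add batching, multiple heads, and masking:

```python
import numpy as np

def attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product attention over token embeddings X of shape (n, d)."""
    # Each token plays all three roles: project once per role.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # (n, n): query i dotted with key j
    # Row-wise softmax: each token's weights over all tokens sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                     # weighted average of values

rng = np.random.default_rng(0)
n, d = 5, 8                                # 5 tokens, 8-dim embeddings (arbitrary)
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out = attention(X, W_q, W_k, W_v)
print(out.shape)  # (5, 8): one new representation per token
```

Note how the output has the same shape as the input — attention replaces each token’s representation rather than producing something of a different size, which is what lets layers stack.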

A toy example. In “the cat sat on the mat because it was warm,” does it refer to the cat or the mat? When the model computes attention for it, the query ends up scoring high against the key for mat (warmth fits a mat better than a cat), and the weighted sum pulls mat’s value into it’s new representation — into its new representation. Nobody told the model to do this. The pattern emerged because, during training, getting that kind of resolution right made next-token prediction better.

A few wrinkles worth knowing:

The “attention” name is evocative but a bit misleading. It’s not a model of human attention — humans don’t softmax over their visual field. It’s a soft lookup table whose weights concentrate on a few entries when the model is confident and spread out when it isn’t. The metaphor stuck because the visualized weights look like focus.
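You can see the concentrate-vs-spread behavior directly. The score vectors below are made up, standing in for one token’s raw query-key scores in a confident case and an uncertain one:

```python
import numpy as np

def softmax(x):
    # Subtracting the max is the standard numerical-stability trick.
    e = np.exp(x - x.max())
    return e / e.sum()

confident = softmax(np.array([8.0, 1.0, 0.5, 0.2]))  # one score dominates
uncertain = softmax(np.array([1.1, 1.0, 0.9, 1.0]))  # scores nearly tied
print(confident.round(3))  # almost all weight lands on the first entry
print(uncertain.round(3))  # weight spreads nearly evenly across entries
```

Same mechanism, no switch flipped: sharp weights and diffuse weights are both just softmax applied to more or less separated scores.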

Going deeper

A note on what I’m sure of: the Q/K/V mechanism, the role of softmax, the asymptotic cost, and the historical sequence (Bahdanau 2014 → Vaswani 2017) are all well-established. Why multi-head specifically works — what the heads end up specializing in, whether interpretable “syntax head” / “coreference head” stories generalize — is more debated than a clean post like this can convey. Treat the head-specialization intuition as a sketch, not a proven claim.