Why do attention sinks exist?
Trained transformers funnel a startling fraction of their attention onto the very first token — a token that's usually semantically meaningless. The pattern looks like a bug, behaves like a feature, and falls out cleanly from one constraint in the softmax.
Why it exists
If you stare at the attention maps of a trained LLM, you find something embarrassing. Across many layers and many heads, a huge slice of the attention probability — sometimes the majority — lands on the first token of the sequence. Not on a token that’s topically relevant. Not on the most recent token. On position 0, which is usually a beginning-of-sequence marker like <s> or <|begin_of_text|>, or in a chat template a piece of system-prompt boilerplate. Something the model couldn’t possibly need to “look at” to predict the next word.
This pattern was named attention sinks by Xiao et al. in the StreamingLLM paper (ICLR 2024). They didn’t go looking for it as a curiosity; they bumped into it while trying to do something else. They wanted to run an LLM forever — feed it an unbounded stream of tokens and keep generating — using a sliding-window KV cache that drops the oldest tokens. The obvious thing. And it broke catastrophically: the moment the very first token of the sequence slid out of the window, perplexity exploded. The model wasn’t just slightly worse without that token. It became incoherent.
That is a strange failure. The first token of a long conversation has no business being load-bearing thousands of steps later. So what was the model actually using it for?
Why it matters now
Three reasons attention sinks moved from curiosity to “thing inference engineers have to know about”:
- Streaming and long-context generation. Any system that wants to evict old KV-cache entries to bound memory has to either preserve the sink tokens or accept that quality will collapse. StreamingLLM’s recipe — keep the first few tokens forever, slide a window over the rest — works precisely because it keeps the sinks alive.
- Quantization and pruning. The sink positions tend to host massive activations — numerically huge values that wreck quantization schemes that assume roughly Gaussian distributions. If you don’t special-case them, low-bit quantization eats them and the model degrades.
- Architectural fixes. Recent open-weights releases ship with the sink mechanism built into the architecture rather than emerging by accident. OpenAI’s gpt-oss (released August 2025) adds a learned per-head bias logit that sits in the softmax denominator — an explicit “park your unused attention here” slot — based on the same underlying observation.
The phenomenon also has a neat tie to a 2023 blog post by Evan Miller,
Attention Is Off By One, which proposed almost exactly this fix
(softmax₁, with an extra +1 in the denominator) on theoretical
grounds, before the StreamingLLM paper showed how badly real models
need it.
The short answer
attention sink = a token that absorbs leftover attention weight + a softmax that forces every head to put that weight somewhere
Softmax outputs always sum to 1. So every attention head, on every token, on every layer, must spend its full unit of attention on something — even when it has nothing useful to attend to. Trained models discover that the cheapest place to dump that surplus is a position that’s reliably present in every sequence and reliably uninformative: the first token. Sinks are the model’s “no-op” hack, forced into existence by the sum-to-one constraint.
How it works
Start from the attention formula. For a query $q_t$ at position $t$ and keys $k_0, \dots, k_t$:
weights = softmax(q_t · K^T / sqrt(d))
output = weights · V
Softmax means weights[i] = exp(score[i]) / Σ_j exp(score[j]). The
weights are non-negative and sum to exactly 1.
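To make the constraint concrete, here’s a minimal NumPy sketch of one query attending over a handful of cached keys (shapes and numbers are made up; this is just the formula, not any particular model’s code):

```python
import numpy as np

def attend(q_t, K, V):
    """Scaled dot-product attention for one query (illustrative only)."""
    d = q_t.shape[-1]
    scores = K @ q_t / np.sqrt(d)             # one score per cached key
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()                  # non-negative, sums to exactly 1
    return weights @ V, weights

rng = np.random.default_rng(0)
q_t = rng.standard_normal(64)                 # query at position t
K = rng.standard_normal((10, 64))             # keys for positions 0..t
V = rng.standard_normal((10, 64))             # values for positions 0..t
out, w = attend(q_t, K, V)
print(w.sum())                                # 1.0 — a full unit of attention, always
```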
Now imagine you’re a particular attention head, and on this particular token, you genuinely have nothing to contribute. Maybe you’re a head that specializes in “find the matching open-paren” and there are no parens around. You’d like to abstain — output a zero vector, contribute nothing to the residual stream — but softmax will not let you. You have to put your full unit of probability mass somewhere. Whichever key you score highest wins, even if all your scores are tiny.
So during pretraining, gradient descent finds a stable trick: pick a token that is (a) reliably present in every sequence, (b) reliably boring, and (c) easy to identify. The first token fits all three. Concentrate attention there when there’s nothing else to say. The value vector at that position can be driven toward something close to zero (or at least uninformative), and the model effectively gets its “abstain” back. This is the sink.
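A toy illustration of that trick (the numbers are made up, not measured from any real model): give the query nothing real to match, make position 0 trivially easy to find, and set its value vector to zero. The head’s weights still sum to 1, but its output collapses toward the zero vector — an abstention in everything but name.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
q_t = rng.standard_normal(d)

K = rng.standard_normal((10, d)) * 0.01   # no real key matches this query...
V = rng.standard_normal((10, d))
K[0] = q_t                                # ...but the sink at position 0 is easy to find,
V[0] = np.zeros(d)                        # and its value vector carries ~nothing

scores = K @ q_t / np.sqrt(d)
w = np.exp(scores - scores.max())
w /= w.sum()                              # still sums to 1 — no abstaining allowed

print(w[0])                               # nearly all of the head's attention -> the sink
print(np.linalg.norm(w @ V))              # head output is close to the zero vector
```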
A few details worth showing the seams on:
- It’s not always position 0. The Xiao et al. paper tested keeping just one initial token and found one wasn’t enough — they needed about four. Their conjectured reason: the models they studied weren’t pretrained with a consistent first token across all training documents, so the model learned to use several early positions as sinks rather than just one. Models pretrained with a guaranteed start-of-sequence token may concentrate on a single sink instead. This is empirical; the exact number is model-specific.
- Sinks correlate with massive activations. Subsequent work observed that the hidden-state norms at sink positions can be orders of magnitude larger than at other positions. The standard reading (see Sun et al. 2024, “Massive Activations in Large Language Models”) is that these large activations are how the model encodes “this is the sink” robustly enough that all the heads can find it. I’d call this the dominant interpretation rather than settled fact — the mechanistic picture is still being filled in.
- The fix is mechanical. StreamingLLM doesn’t retrain anything. It just changes the KV-cache eviction policy: keep the first k tokens (they used 4) pinned forever, and slide a window over everything else. With sinks preserved, perplexity stays flat as the model generates millions of tokens; without them, it diverges within a few thousand.
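The whole policy is a few lines of bookkeeping. Here’s a minimal sketch of a sink-aware eviction policy in the StreamingLLM spirit (class and field names are mine, not the paper’s or any library’s API; real implementations also re-index positions relative to the cache, which I’ve omitted):

```python
from collections import deque

class SinkKVCache:
    """Pin the first `n_sink` tokens forever; slide a window over the rest."""

    def __init__(self, n_sink=4, window=1024):
        self.n_sink = n_sink
        self.window = window
        self.sinks = []          # (key, value) entries for the first n_sink tokens
        self.recent = deque()    # (key, value) entries in the sliding window

    def append(self, k, v):
        if len(self.sinks) < self.n_sink:
            self.sinks.append((k, v))        # sink entries are never evicted
        else:
            self.recent.append((k, v))
            if len(self.recent) > self.window:
                self.recent.popleft()        # evict the oldest non-sink entry

    def entries(self):
        # What attention sees at this decode step: sinks + the recent window.
        return self.sinks + list(self.recent)
```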
The “off by one” framing
Evan Miller’s 2023 post made the same point from the algebra side
without naming attention sinks. His proposal — softmax₁(x)_i = exp(x_i) / (1 + Σ_j exp(x_j)) —
adds a phantom 1 to the denominator. The phantom term effectively
gives every head a free “attend to nothing” option that doesn’t
correspond to any real token. If all real scores are very negative,
the phantom dominates, the real weights all go to nearly zero, and the
head genuinely abstains. The output for that head is then close to
the zero vector.
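A side-by-side on made-up scores makes the difference plain: when every score is very negative, softmax still hands out a full unit of attention, while softmax₁ lets nearly all of it vanish.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax1(x):
    # The phantom +1 is an implicit extra logit fixed at 0, tied to no real token.
    e = np.exp(x)                  # fine here: the scores are small and negative
    return e / (1.0 + e.sum())

scores = np.array([-8.0, -9.0, -7.5])   # "nothing here is worth attending to"
print(softmax(scores))    # ~[0.33, 0.12, 0.55] — mass forced onto some token anyway
print(softmax1(scores))   # all weights < 1e-3 — the head genuinely abstains
```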
gpt-oss ships a per-head learnable version of this idea: each head
has its own bias logit appended to the attention scores, and that
logit is trained jointly with the rest of the model. If a head wants
to “park” a lot of attention on the bias slot, it learns a high
value; if it always wants to attend to real tokens, it learns a low
one. This is, in effect, the architectural answer to “why did the
model invent attention sinks?”: because we never gave it a legitimate
way to abstain. Now we do.
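A sketch of how I read that mechanism (single head, simplified, names mine; the actual gpt-oss implementation may differ in its details): append the learned bias logit to the score vector before the softmax, then drop its share when mixing values, so whatever mass lands on the bias slot contributes nothing to the output.

```python
import numpy as np

def attn_with_sink_bias(q_t, K, V, sink_logit):
    """One head, one query. `sink_logit` stands in for a learned scalar
    (trained jointly with the model); here it's just a number we pass in."""
    d = q_t.shape[-1]
    scores = K @ q_t / np.sqrt(d)
    logits = np.append(scores, sink_logit)    # extra slot tied to no real token
    w = np.exp(logits - logits.max())
    w /= w.sum()                              # softmax over real tokens + sink slot
    return w[:-1] @ V                         # the sink slot's mass is discarded

rng = np.random.default_rng(0)
q_t = rng.standard_normal(64) * 0.01          # weak query: nothing it wants to find
K, V = rng.standard_normal((10, 64)), rng.standard_normal((10, 64))

print(np.linalg.norm(attn_with_sink_bias(q_t, K, V, sink_logit=5.0)))   # small: mass parked on the bias slot
print(np.linalg.norm(attn_with_sink_bias(q_t, K, V, sink_logit=-5.0)))  # larger: behaves like plain softmax attention
```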
I haven’t seen a clean apples-to-apples public benchmark establishing how much the learned-bias approach beats a comparable model trained without it. The intuition is strong; the quantified case is, as far as I can tell, still being assembled.
What still puzzles people
A few honest gaps:
- Why does the model need several sink tokens rather than one, even when sequences begin with a consistent BOS marker? Xiao et al. conjecture it’s pretraining-data dependent; that hasn’t been pinned down rigorously across model families.
- Sinks behave differently across heads, layers, and positions. There isn’t yet a unified mechanistic story that explains which heads dump to the sink and when. This is active research as of 2025.
- The connection between attention sinks and outlier features in quantization is empirically tight but not fully explained. Models that quantize well at 4 bits often do so partly because the sink positions are kept in higher precision; models that quantize poorly often have unusually large sink activations. The causal direction is contested.
Famous related terms
- Softmax — softmax(x)_i = exp(x_i) / Σ_j exp(x_j) — the sum-to-one constraint that creates the problem in the first place.
- softmax₁ / “Attention Is Off By One” — softmax₁ = softmax + phantom 1 in the denominator — Evan Miller’s proposed fix; lets a head abstain.
- StreamingLLM — StreamingLLM = sliding-window KV cache + pinned first-k tokens — Xiao et al.’s recipe for unbounded-length generation without retraining.
- Massive activations — massive activations = hidden states with anomalously huge norms — empirically co-located with sink positions; not yet fully explained.
- KV cache — KV cache = stored attention keys and values + reused on every decode step — the structure whose eviction policy made the sink visible.
- PagedAttention — PagedAttention = block-based KV cache + page table — orthogonal to sinks, but real systems have to combine the two: pin the sink blocks, page everything else.
Going deeper
- Xiao, Tian, Chen, Han, Lewis — Efficient Streaming Language Models with Attention Sinks (ICLR 2024). arXiv:2309.17453 · code. The paper that named the phenomenon and gave the streaming fix.
- Evan Miller — Attention Is Off By One (July 2023). evanmiller.org. The blog post that proposed softmax₁ from first principles, before the StreamingLLM measurements.
- Sun, Chen, Liu, Kolter — Massive Activations in Large Language Models (2024). arXiv:2402.17762. The standard reference for the giant-activations side of the story.
- Cancedda — Why do LLMs attend to the first token? (2025). arXiv:2504.02732. A more recent attempt at a mechanistic account.
- OpenAI — gpt-oss model card (August 2025). PDF. Mentions the learned per-head sink bias as part of the architecture.