Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why do attention sinks exist?

Trained transformers funnel a startling fraction of their attention onto the very first token — a token that's usually semantically meaningless. The pattern looks like a bug, behaves like a feature, and falls out cleanly from one constraint in the softmax.

AI & ML intermediate Apr 30, 2026

Why it exists

If you stare at the attention maps of a trained LLM, you find something embarrassing. Across many layers and many heads, a huge slice of the attention probability — sometimes the majority — lands on the first token of the sequence. Not on a token that’s topically relevant. Not on the most recent token. On position 0, which is usually a beginning-of-sequence marker like <s> or <|begin_of_text|>, or in a chat template a piece of system-prompt boilerplate. Something the model couldn’t possibly need to “look at” to predict the next word.

This pattern was named attention sinks by Xiao et al. in the StreamingLLM paper (ICLR 2024). They didn’t go looking for it as a curiosity; they bumped into it while trying to do something else. They wanted to run an LLM forever — feed it an unbounded stream of tokens and keep generating — using a sliding-window KV cache that drops the oldest tokens. The obvious thing. And it broke catastrophically: the moment the very first token of the sequence slid out of the window, perplexity exploded. The model wasn’t just slightly worse without that token. It became incoherent.

That is a strange failure. The first token of a long conversation has no business being load-bearing thousands of steps later. So what was the model actually using it for?

Why it matters now

Three reasons attention sinks moved from curiosity to “thing inference engineers have to know about”:

  1. Streaming and long-context generation. Any system that wants to evict old KV-cache entries to bound memory has to either preserve the sink tokens or accept that quality will collapse. StreamingLLM’s recipe — keep the first few tokens forever, slide a window over the rest — works precisely because it keeps the sinks alive (a minimal sketch of the recipe follows this list).
  2. Quantization and pruning. The sink positions tend to host massive activations — numerically huge values that wreck quantization schemes that assume roughly Gaussian distributions. If you don’t special-case them, low-bit quantization eats them and the model degrades.
  3. Architectural fixes. Recent open-weights releases ship with the sink mechanism built into the architecture rather than emerging by accident. OpenAI’s gpt-oss (released August 2025) adds a learned per-head bias logit that sits in the softmax denominator — an explicit “park your unused attention here” slot — based on the same underlying observation.
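
Here is the promised sketch of that eviction rule. It is my own minimal reconstruction, not StreamingLLM's API: keys and values stand for one head's KV cache as (seq_len, d) numpy arrays, and the defaults are illustrative.

import numpy as np

def evict(keys, values, n_sink=4, window=1024):
    """Keep the first n_sink positions forever plus a sliding window of the
    most recent positions; drop everything in between."""
    seq_len = keys.shape[0]
    if seq_len <= n_sink + window:
        return keys, values                    # nothing to drop yet
    keep = np.concatenate([np.arange(n_sink),
                           np.arange(seq_len - window, seq_len)])
    return keys[keep], values[keep]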

The phenomenon also has a neat tie to a 2023 blog post by Evan Miller, Attention Is Off By One, which proposed almost exactly this fix (softmax₁, with an extra +1 in the denominator) on theoretical grounds, before the StreamingLLM paper showed how badly real models need it.

The short answer

attention sink = a token that absorbs leftover attention weight + because softmax forces every head to pick somewhere

Softmax outputs always sum to 1. So every attention head, on every token, on every layer, must spend its full unit of attention on something — even when it has nothing useful to attend to. Trained models discover that the cheapest place to dump that surplus is a position that’s reliably present in every sequence and reliably uninformative: the first token. Sinks are the model’s “no-op” hack, forced into existence by the sum-to-one constraint.

How it works

Start from the attention formula. For a query $q_t$ at position $t$ and keys $k_0, \dots, k_t$:

weights = softmax(q_t · K^T / sqrt(d))
output  = weights · V

Softmax means weights[i] = exp(score[i]) / Σ_j exp(score[j]). The weights are non-negative and sum to exactly 1.
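
As a concrete sketch, here is the same computation for a single query in numpy (the function name and shapes are mine):

import numpy as np

def attend(q, K, V):
    # q: (d,) query at position t; K: (t+1, d) keys; V: (t+1, d_v) values
    scores = K @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())   # shift by the max for numerical stability
    weights = weights / weights.sum()         # non-negative, sums to exactly 1
    return weights @ V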

Now imagine you’re a particular attention head, and on this particular token, you genuinely have nothing to contribute. Maybe you’re a head that specializes in “find the matching open-paren” and there are no parens around. You’d like to abstain — output a zero vector, contribute nothing to the residual stream — but softmax will not let you. You have to put your full unit of probability mass somewhere. Whichever key you score highest wins, even if all your scores are tiny.

So during pretraining, gradient descent finds a stable trick: pick a token that is (a) reliably present in every sequence, (b) reliably boring, and (c) easy to identify. The first token fits all three. Concentrate attention there when there’s nothing else to say. The value vector at that position can be driven toward something close to zero (or at least uninformative), and the model effectively gets its “abstain” back. This is the sink.
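
A toy illustration of that trick, with made-up numbers rather than values from any real model:

import numpy as np

# Position 0 plays the sink: it gets a slightly-less-negative score
# and carries a near-zero value vector.
scores = np.array([-1.0, -9.0, -8.5, -9.2])
weights = np.exp(scores) / np.exp(scores).sum()
# weights ≈ [0.999, 0.0003, 0.0006, 0.0003]: still sums to 1, almost all of it on the sink

V = np.array([[ 0.001, -0.002],   # sink position: value vector near zero
              [ 1.2,    0.7  ],
              [-0.4,    2.1  ],
              [ 0.9,   -1.3  ]])
output = weights @ V              # ≈ [0.001, -0.001]: close to the zero vector, an effective no-op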

A few details worth showing the seams on:

The “off by one” framing

Evan Miller’s 2023 post made the same point from the algebra side without naming attention sinks. His proposal — softmax₁(x)_i = exp(x_i) / (1 + Σ_j exp(x_j)) — adds a phantom 1 to the denominator. The phantom term effectively gives every head a free “attend to nothing” option that doesn’t correspond to any real token. If all real scores are very negative, the phantom dominates, the real weights all go to nearly zero, and the head genuinely abstains. The output for that head is then close to the zero vector.
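
In code, the difference from ordinary softmax is one term in the denominator (a sketch; the usual max-subtraction stabilization is skipped for clarity):

import numpy as np

def softmax(x):
    e = np.exp(x)
    return e / e.sum()

def softmax1(x):
    e = np.exp(x)
    return e / (1.0 + e.sum())    # the phantom "+1": a free attend-to-nothing slot

scores = np.array([-9.0, -8.5, -9.2, -8.8])   # a head with nothing useful to attend to
softmax(scores).sum()    # 1.0: the full unit of attention must land on real tokens
softmax1(scores).sum()   # ≈ 0.0006: almost everything goes to the phantom, and the head abstains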

gpt-oss ships a per-head learnable version of this idea: each head has its own bias logit appended to the attention scores, and that logit is trained jointly with the rest of the model. If a head wants to “park” a lot of attention on the bias slot, it learns a high value; if it always wants to attend to real tokens, it learns a low one. This is, in effect, the architectural answer to “why did the model invent attention sinks?”: because we never gave it a legitimate way to abstain. Now we do.
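
Per head, the mechanism looks roughly like the sketch below. This is my reconstruction of the idea as described, not gpt-oss's actual code, and sink_logit is a name I made up:

import numpy as np

def attend_with_sink(q, K, V, sink_logit):
    # sink_logit: a learned scalar for this head, trained along with the rest of the model
    scores = K @ q / np.sqrt(q.shape[0])
    scores = np.append(scores, sink_logit)    # extra logit with no key or value behind it
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    return weights[:-1] @ V                   # whatever mass lands on the sink slot simply vanishes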

I haven’t seen a clean apples-to-apples public benchmark establishing how much the learned-bias approach beats a comparable model trained without it. The intuition is strong; the quantified case is, as far as I can tell, still being assembled.

What still puzzles people

A few honest gaps:

Going deeper