
Why do positional encodings exist?

A transformer cannot tell 'dog bites man' from 'man bites dog' on its own. The attention math is symmetric in token order until you bolt on a position signal. Every modern LLM bolts one on, and which scheme you pick shapes long-context behavior more than people realize.


Why it exists

There’s a small, embarrassing fact about the transformer that the original paper had to fix in a footnote-shaped way: the model has no idea what order its input tokens came in.

If you take a sentence, shuffle the tokens, and feed both versions through pure self-attention, the math comes out the same up to a permutation of the outputs. Attention is a weighted sum over a set of (key, value) pairs. Sets have no order. “dog bites man” and “man bites dog” produce the same set, so the same attention scores, so the same internal representations, just rearranged. That property has a name: permutation equivariance. It’s exactly what you don’t want for language, where order is most of the meaning.

The original Attention Is All You Need paper (Vaswani et al., 2017) handled this by adding a fixed sinusoidal vector to each token’s input embedding: one vector per position, the same one every time, baked from sines and cosines at geometrically spaced frequencies. Token embedding plus position vector goes into the model. That’s the entire trick.

That early choice has aged into a much bigger story. Modern LLMs almost universally don’t use the original sinusoidal scheme anymore. Most have switched to RoPE, which works by rotating the query and key vectors inside the attention layer rather than adding anything to the embeddings. The reason matters, and it’s why “long context” was hard in a way nobody quite expected.

Why it matters now

Three things make positional encoding a load-bearing choice today, not a footnote:

1. Context windows have grown from a few thousand tokens to hundreds of thousands, and nearly every long-context extension method is a modification of the positional scheme, not of attention itself.
2. Models are routinely run past the sequence lengths they were trained on, and how gracefully they degrade there is largely a property of the positional encoding.
3. The field has converged on one scheme, RoPE, so the specific quirks of rotary encodings now shape how most deployed models behave at length.

The choice of positional encoding is one of those design decisions where the wrong answer doesn’t fail loudly. It just makes the model quietly worse at long-range work.

The short answer

positional encoding = a position-dependent signal injected into Q/K (or input embeddings) so attention can tell tokens apart by where they are

Self-attention by itself sees a bag of tokens, not a sequence. To get sequence behavior you have to inject “you are token #i” somewhere in the pipeline before the attention scores get computed. The original transformer added a fixed sinusoid to each input embedding. Modern LLMs (LLaMA, Mistral, GPT-NeoX, PaLM, Gemma, Qwen2) instead rotate the query and key vectors inside attention — that’s RoPE — because it makes the positional part of the score depend only on the relative distance between two tokens, which generalizes and composes much better.

How it works

Start with the failure mode, because the rest follows from it.

What “permutation equivariant” actually buys you

A single attention head computes, for each token i:

output_i = Σ_j  softmax_j( q_i · k_j / √d ) · v_j

Notice what this depends on: the content of q_i, and the contents of all k_j and v_j. There is nothing here about where token j sits in the sequence. If you shuffle the tokens, you shuffle the (k, v) pairs, but the set of pairs is unchanged — so for any given query, the scores it computes are unchanged (modulo which output slot you read out).

For a vision model on patches that’s sometimes fine; spatial position can be encoded in other ways. For language, it’s catastrophic. “the cat sat on the mat” and “mat the on sat cat the” are the same multiset of tokens.
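
To see it concretely, here is a minimal NumPy sketch (my own construction, not from any paper): single-head self-attention with no positional signal, where shuffling the input rows just shuffles the output rows the same way.

import numpy as np

def attention(q, k, v):
    # q, k, v: (seq_len, d). One head, no positional signal anywhere.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))              # 6 tokens, d = 8
perm = rng.permutation(6)

out = attention(x, x, x)                 # self-attention: q = k = v = x
out_shuffled = attention(x[perm], x[perm], x[perm])

# Permuting the inputs permutes the outputs identically: equivariance.
assert np.allclose(out[perm], out_shuffled)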

You have to break the symmetry. The question is where you break it, and how.

Option 1: Add position to the input (sinusoidal, learned absolute)

The original transformer’s move: precompute a vector PE_i for each position i, and add it to the token embedding before layer 1.

x_i = embedding(token_i) + PE_i

Vaswani et al. defined PE_i using sines and cosines:

PE_{i, 2k}     = sin(i / 10000^{2k/d})
PE_{i, 2k+1}   = cos(i / 10000^{2k/d})

The 10000 is just a chosen base; the shape is geometric: low dimensions oscillate fast, high dimensions oscillate slowly. The reason for sines and cosines specifically (rather than learning the position vectors) was, in their own words, that it “may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.” A nice hope. It mostly didn’t pan out — sinusoidal models also don’t extrapolate cleanly past their training length in practice — but the construction stuck for years.
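
Here is what that table looks like in code, as a minimal NumPy sketch of the formulas above (the function name is mine):

import numpy as np

def sinusoidal_pe(max_len, d, base=10000.0):
    # PE[i, 2k] = sin(i / base^(2k/d)); PE[i, 2k+1] = cos(i / base^(2k/d))
    positions = np.arange(max_len)[:, None]        # (max_len, 1)
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,) geometric frequencies
    angles = positions * inv_freq                  # (max_len, d/2)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Used as: x_i = embedding(token_i) + pe[i], before layer 1.
pe = sinusoidal_pe(max_len=512, d=64)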

BERT and many follow-ups used learned absolute position embeddings instead: a lookup table from position index to a learned vector, just like the token embedding table. Same shape, simpler, no extrapolation property at all (positions past training length are literally untrained vectors).
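
The learned-absolute flavor is even simpler; a sketch (the table is randomly initialized here, trained in practice):

import numpy as np

rng = np.random.default_rng(0)
max_len, d = 512, 64

# One trainable vector per position index, same shape as a token
# embedding table. Rows past max_len simply don't exist, which is
# why there's no extrapolation property at all.
pos_table = rng.normal(scale=0.02, size=(max_len, d))

def add_learned_pe(x):     # x: (seq_len, d) token embeddings
    return x + pos_table[: x.shape[0]]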

The intuitive worry with both flavors of “add to input”: the position information enters once, at the bottom, and has to survive through every layer’s mixing. By layer 24 the model has been shuffling these representations around for a long time. Whether the signal gets meaningfully attenuated isn’t something I have a clean source for — treat it as motivation for what comes next, not as established fact. What is established is that absolute schemes encode absolute position, when what attention usually wants is relative position — token i caring about “the token three to my left” rather than “the token at index 47.”

Option 2: Rotate Q and K (RoPE)

RoPE — proposed in Su et al.’s RoFormer paper (Su, Lu, Pan, Murtadha, Wen, Liu, 2021; arXiv 2104.09864) — does something cleaner. Instead of adding to the embedding, it rotates the query and key vectors themselves, by an amount that depends on the position.

The key fact, written compactly: split each head’s d-dimensional Q and K into d/2 pairs of components. For each pair, treat it as a 2D vector and rotate it by an angle m · θ_k, where m is the position and θ_k is a per-pair frequency (using the same 10000-base geometric progression as the original sinusoidal scheme). Different pairs rotate at different rates. The whole operation is a 2×2 rotation applied independently to each pair.

The magic: the dot product of a rotated query at position m with a rotated key at position n depends on the content of q and k plus the relative offset m − n, not on the absolute m and n separately. This is a mathematical fact about rotations: R(α) · R(β)ᵀ = R(α − β). Translate the whole sequence by 100 tokens and the attention pattern stays identical — exactly the property attention usually wants.
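
Here is a minimal NumPy sketch of both the rotation and the relative-offset property (this pairs adjacent dimensions; real implementations often use a half-split layout instead, but the math is the same):

import numpy as np

def rope(x, pos, base=10000.0):
    # x: (d,) query or key vector for one token sitting at position `pos`.
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # per-pair frequencies
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                   # d/2 pairs of components
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin             # 2x2 rotation per pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# Shifting both positions by 100 leaves the score untouched:
# only the offset m - n matters.
assert np.isclose(rope(q, 7) @ rope(k, 3), rope(q, 107) @ rope(k, 103))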

Other useful properties of RoPE:

- It’s applied inside every attention layer, so position information is re-injected at each layer instead of entering once at the bottom and having to survive the whole stack.
- It adds no learned parameters and leaves the value vectors untouched; only Q and K are rotated.
- The RoFormer paper shows the construction gives attention scores a long-term decay with relative distance, and that it composes with linear-attention variants.

LLaMA (1, 2, 3), Mistral, Gemma, GPT-NeoX, GPT-J, Qwen2, and most other modern open-weights models use RoPE; PaLM (closed-weights) does too. It is, as of 2026, the default.

Option 3: Bias the attention scores directly (ALiBi)

A third route: don’t touch Q or K at all. Just bias the attention score for token i attending to token j by −|i − j| · m_h, where m_h is a small per-head slope. Closer tokens get a smaller penalty; farther tokens get a larger one. This is ALiBi, from Press, Smith, and Lewis (arXiv 2108.12409, ICLR 2022).
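
A minimal sketch of the bias matrix, using the paper's slope recipe for power-of-two head counts (everything else about attention stays untouched):

import numpy as np

def alibi_bias(seq_len, n_heads):
    # Slopes: geometric sequence starting at 2^(-8/n_heads), per the paper.
    # For 8 heads that's 1/2, 1/4, ..., 1/256.
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    distance = (i - j).clip(min=0)        # causal: only j <= i matters
    # bias[h, i, j] = -|i - j| * m_h, added to scores before softmax
    return -slopes[:, None, None] * distance

bias = alibi_bias(seq_len=8, n_heads=4)   # shape (4, 8, 8)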

ALiBi’s pitch in the paper was specifically about extrapolation: train at length 1024, test at 2048+, with no fine-tuning. The headline result was a 1.3B model trained on 1024 tokens that extrapolated to 2048 with the same perplexity as a sinusoidal model trained at 2048, while training 11% faster and using 11% less memory. ALiBi has been used in some real models (e.g. MPT, BLOOM), but as of early 2026 the field has converged hard on RoPE for frontier-class models. My read on why: RoPE’s relative-position-aware Q/K rotations feel more like a general mechanism, where ALiBi is a hand-designed bias term that happens to work for distance-decay specifically. I don’t have a clean public source attributing the convergence to a single reason, so this is a take, not a fact.

Option 4: No positional encoding at all (NoPE)

This one is the most surprising. Haviv et al. (arXiv 2203.16634, 2022) showed that causal (decoder-only) transformers with no positional encoding at all are still competitive with explicit-PE ones at language modeling, across multiple datasets and model sizes. The intuition the paper develops is that the causal mask itself — token i can only attend to tokens 1…i — leaks position information: a token at position 5 has 5 things to attend to; a token at position 50 has 50. (Their probing experiments suggest the model picks up an implicit notion of absolute position; the exact mechanism is more subtle than a literal “count.”)
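
A toy illustration of that leak (my construction, not the paper's experiment): make every unmasked score identical, so the only information left is how many tokens each position can see.

import numpy as np

v = np.arange(1.0, 6.0)[:, None]            # 5 tokens, 1-dim values
n = v.shape[0]
scores = np.where(np.tril(np.ones((n, n), dtype=bool)), 0.0, -np.inf)
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)          # uniform over each prefix

# Output at position i is the mean of the first i+1 values:
# position-dependent, despite zero positional encoding.
print((w @ v).ravel())                      # [1.  1.5 2.  2.5 3. ]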

Kazemnejad et al. (The Impact of Positional Encoding on Length Generalization in Transformers, arXiv 2305.19466, 2023) studied NoPE more formally, proving it can in principle represent both absolute and relative position, and found that on their algorithmic tasks NoPE actually length-generalized better than ALiBi, RoPE, or absolute embeddings. Even so, explicit positional encodings remain universal in production models, so this is more of a “nice to know the model could in principle figure it out” than a real production choice.

It’s still useful as a sanity check on what position encodings are actually doing. They’re not creating positional information out of thin air; they’re giving the model a more direct, less learning-required handle on position than it would otherwise have to discover.

Why long-context is mostly a RoPE-extension story

If you train a RoPE model on sequences up to 4k tokens, the rotation angles m · θ_k for the slowest-rotating pairs sweep through some specific range of angles for m ∈ [0, 4k]. At inference time, if you suddenly feed it a 32k-token sequence, those slow-rotating pairs see angles 8× larger than anything in training. The attention math doesn’t crash, but the model has never seen that part of the rotation phase space. Behavior degrades.

The fixes are all variants of “rescale θ so the angles stay in the trained range”; a sketch of the first two follows the list:

- Position Interpolation (Chen et al., arXiv 2306.15595): scale every position index by train_len / target_len, so a 32k-token context sweeps the same angle range a 4k context did during training, at the cost of squeezing positions closer together.
- NTK-aware scaling: raise the 10000 base instead of shrinking positions, which stretches the slow, low-frequency pairs much more than the fast ones.
- YaRN (Peng et al., arXiv 2309.00071): interpolate different frequency bands by different amounts, combined with a temperature adjustment on attention.
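
A sketch of those first two knobs, reusing the rotation-angle computation from the RoPE snippet above (names and scale factors are mine):

import numpy as np

def rope_angles(pos, d, base=10000.0, pos_scale=1.0):
    # Position Interpolation: shrink positions by train_len / target_len
    # so the slow pairs never leave the angle range seen in training.
    theta = base ** (-np.arange(0, d, 2) / d)
    return (pos * pos_scale) * theta

d, train_len, target_len = 128, 4096, 32768

# PI: position 32767 now sweeps the same angles position ~4096 did.
pi = rope_angles(32767, d, pos_scale=train_len / target_len)

# NTK-aware scaling keeps positions as-is and raises the base instead,
# commonly by base * k**(d / (d - 2)) for a length factor k, which
# stretches the slow (low-frequency) pairs far more than the fast ones.
k = target_len / train_len
ntk = rope_angles(32767, d, base=10000.0 * k ** (d / (d - 2)))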

All of these are RoPE-specific. They’re not adjustments to attention, or to the KV cache, or to the model weights — they’re knobs on the rotation schedule. That’s the deepest sense in which the choice of positional encoding shapes what the long-context regime even looks like.

The seams worth seeing

The thing to take away: positional encoding isn’t a clever optimization or a flourish. It’s the bare minimum surgery that turns attention from a set operation into a sequence operation. Every choice — sinusoidal, learned, RoPE, ALiBi, none — is a different bet about which facts about position the model should get for free, and which ones it should have to learn from data.
