
Why do positional encodings exist?

A transformer cannot tell 'dog bites man' from 'man bites dog' on its own. The attention math is symmetric in token order until you bolt on a position signal. Every modern LLM bolts one on, and which scheme you pick shapes long-context behavior more than people realize.


Why it exists

There’s a small, embarrassing fact about the transformer that the original paper had to fix in a footnote-shaped way: the model has no idea what order its input tokens came in.

If you take a sentence, shuffle the tokens, and feed both versions through pure self-attention, the math comes out the same up to a permutation of the outputs. Attention is a weighted sum over a set of (key, value) pairs. Sets have no order. “dog bites man” and “man bites dog” produce the same set, so the same attention scores, so the same internal representations, just rearranged. That property has a name: permutation equivariance. It’s exactly what you don’t want for language, where order is most of the meaning.

The original Attention Is All You Need paper (Vaswani et al., 2017) handled this by adding a fixed sinusoidal vector to each token’s input embedding: one vector per position, the same one every time, baked from sines and cosines at geometrically spaced frequencies. Token embedding plus position vector goes into the model. That’s the entire trick.

That early choice has aged into a much bigger story. Modern LLMs almost universally don’t use the original sinusoidal scheme anymore. Most have switched to RoPE, which works by rotating the query and key vectors inside the attention layer rather than adding anything to the embeddings. The reason matters, and it’s why “long context” was hard in a way nobody quite expected.

Why it matters now

Three things make positional encoding a load-bearing choice today, not a footnote:

1. Context windows have grown from a few thousand tokens to hundreds of thousands, and nearly every long-context extension method is a modification of the positional scheme, not of attention itself.
2. Models are routinely run past the sequence lengths they were trained on, and how gracefully they degrade there is largely a property of the positional encoding.
3. The field has converged on one scheme, RoPE, so the specific quirks of rotary encodings now shape how most deployed models behave at length.

The choice of positional encoding is one of those design decisions where the wrong answer doesn’t fail loudly. It just makes the model quietly worse at long-range work.

The short answer

positional encoding = a position-dependent signal injected into Q/K (or input embeddings) so attention can tell tokens apart by where they are

Self-attention by itself sees a bag of tokens, not a sequence. To get sequence behavior you have to inject “you are token #i” somewhere in the pipeline before the attention scores get computed. The original transformer added a fixed sinusoid to each input embedding. Modern LLMs (LLaMA, Mistral, GPT-NeoX, PaLM, Gemma, Qwen2) instead rotate the query and key vectors inside attention — that’s RoPE — because it makes the positional part of the score depend only on the relative distance between two tokens, which generalizes and composes much better.

How it works

Start with the failure mode, because the rest follows from it.

What “permutation equivariant” actually buys you

A single attention head computes, for each token i:

output_i = Σ_j  softmax_j( q_i · k_j / √d ) · v_j

Notice what this depends on: the content of q_i, and the contents of all k_j and v_j. There is nothing here about where token j sits in the sequence. If you shuffle the tokens, you shuffle the (k, v) pairs, but the set of pairs is unchanged — so for any given query, the scores it computes are unchanged (modulo which output slot you read out).

For a vision model on patches that’s sometimes fine; spatial position can be encoded in other ways. For language, it’s catastrophic. “the cat sat on the mat” and “mat the on sat cat the” are the same multiset of tokens.
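
To see it concretely, here is a minimal NumPy sketch (my own construction, not from any paper): single-head self-attention with no positional signal, where shuffling the input rows just shuffles the output rows the same way.

import numpy as np

def attention(q, k, v):
    # q, k, v: (seq_len, d). One head, no positional signal anywhere.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))              # 6 tokens, d = 8
perm = rng.permutation(6)

out = attention(x, x, x)                 # self-attention: q = k = v = x
out_shuffled = attention(x[perm], x[perm], x[perm])

# Permuting the inputs permutes the outputs identically: equivariance.
assert np.allclose(out[perm], out_shuffled)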

You have to break the symmetry. The question is where you break it, and how.

Option 1: Add position to the input (sinusoidal, learned absolute)

The original transformer’s move: precompute a vector PE_i for each position i, and add it to the token embedding before layer 1.

x_i = embedding(token_i) + PE_i

Vaswani et al. defined PE_i using sines and cosines:

PE_{i, 2k}     = sin(i / 10000^{2k/d})
PE_{i, 2k+1}   = cos(i / 10000^{2k/d})

The 10000 is just a chosen base; the shape is geometric: low dimensions oscillate fast, high dimensions oscillate slowly. The reason for sines and cosines specifically (rather than learning the position vectors) was, in their own words, that it “may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.” A nice hope. It mostly didn’t pan out — sinusoidal models also don’t extrapolate cleanly past their training length in practice — but the construction stuck for years.
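
Here is what that table looks like in code, as a minimal NumPy sketch of the formulas above (the function name is mine):

import numpy as np

def sinusoidal_pe(max_len, d, base=10000.0):
    # PE[i, 2k] = sin(i / base^(2k/d)); PE[i, 2k+1] = cos(i / base^(2k/d))
    positions = np.arange(max_len)[:, None]        # (max_len, 1)
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,) geometric frequencies
    angles = positions * inv_freq                  # (max_len, d/2)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Used as: x_i = embedding(token_i) + pe[i], before layer 1.
pe = sinusoidal_pe(max_len=512, d=64)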

BERT and many follow-ups used learned absolute position embeddings instead: a lookup table from position index to a learned vector, just like the token embedding table. Same shape, simpler, no extrapolation property at all (positions past training length are literally untrained vectors).
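
The learned-absolute flavor is even simpler; a sketch (the table is randomly initialized here, trained in practice):

import numpy as np

rng = np.random.default_rng(0)
max_len, d = 512, 64

# One trainable vector per position index, same shape as a token
# embedding table. Rows past max_len simply don't exist, which is
# why there's no extrapolation property at all.
pos_table = rng.normal(scale=0.02, size=(max_len, d))

def add_learned_pe(x):     # x: (seq_len, d) token embeddings
    return x + pos_table[: x.shape[0]]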

The intuitive worry with both flavors of “add to input”: the position information enters once, at the bottom, and has to survive through every layer’s mixing. By layer 24 the model has been shuffling these representations around for a long time. Whether the signal gets meaningfully attenuated isn’t something I have a clean source for — treat it as motivation for what comes next, not as established fact. What is established is that absolute schemes encode absolute position, when what attention usually wants is relative position — token i caring about “the token three to my left” rather than “the token at index 47.”

Option 2: Rotate Q and K (RoPE)

RoPE — proposed in Su et al.’s RoFormer paper (Su, Lu, Pan, Murtadha, Wen, Liu, 2021; arXiv 2104.09864) — does something cleaner. Instead of adding to the embedding, it rotates the query and key vectors themselves, by an amount that depends on the position.

The key fact, written compactly: split each head’s d-dimensional Q and K into d/2 pairs of components. For each pair, treat it as a 2D vector and rotate it by an angle m · θ_k, where m is the position and θ_k is a per-pair frequency (using the same 10000-base geometric progression as the original sinusoidal scheme). Different pairs rotate at different rates. The whole operation is a 2×2 rotation applied independently to each pair.

The magic: the dot product of a rotated query at position m with a rotated key at position n depends on the content of q and k plus the relative offset m − n, not on the absolute m and n separately. This is a mathematical fact about rotations: R(α) · R(β)ᵀ = R(α − β). Translate the whole sequence by 100 tokens and the attention pattern stays identical — exactly the property attention usually wants.
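
Here is a minimal NumPy sketch of both the rotation and the relative-offset property (this pairs adjacent dimensions; real implementations often use a half-split layout instead, but the math is the same):

import numpy as np

def rope(x, pos, base=10000.0):
    # x: (d,) query or key vector for one token sitting at position `pos`.
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # per-pair frequencies
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                   # d/2 pairs of components
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin             # 2x2 rotation per pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# Shifting both positions by 100 leaves the score untouched:
# only the offset m - n matters.
assert np.isclose(rope(q, 7) @ rope(k, 3), rope(q, 107) @ rope(k, 103))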

Other useful properties of RoPE:

- It’s applied inside every attention layer, so position information is re-injected at each layer instead of entering once at the bottom and having to survive the whole stack.
- It adds no learned parameters and leaves the value vectors untouched; only Q and K are rotated.
- The RoFormer paper shows the construction gives attention scores a long-term decay with relative distance, and that it composes with linear-attention variants.

LLaMA (1, 2, 3), Mistral, Gemma, GPT-NeoX, GPT-J, Qwen2, and most other modern open-weights models use RoPE; PaLM (closed-weights) does too. It is, as of 2026, the default.

Option 3: Bias the attention scores directly (ALiBi)

A third route: don’t touch Q or K at all. Just bias the attention score for token i attending to token j by −|i − j| · m_h, where m_h is a small per-head slope. Closer tokens get a smaller penalty; farther tokens get a larger one. This is ALiBi, from Press, Smith, and Lewis (arXiv 2108.12409, ICLR 2022).
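
A minimal sketch of the bias matrix, using the paper's slope recipe for power-of-two head counts (everything else about attention stays untouched):

import numpy as np

def alibi_bias(seq_len, n_heads):
    # Slopes: geometric sequence starting at 2^(-8/n_heads), per the paper.
    # For 8 heads that's 1/2, 1/4, ..., 1/256.
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    distance = (i - j).clip(min=0)        # causal: only j <= i matters
    # bias[h, i, j] = -|i - j| * m_h, added to scores before softmax
    return -slopes[:, None, None] * distance

bias = alibi_bias(seq_len=8, n_heads=4)   # shape (4, 8, 8)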

ALiBi’s pitch in the paper was specifically about extrapolation: train at length 1024, test at 2048+, with no fine-tuning. The headline result was a 1.3B model trained on 1024 tokens that extrapolated to 2048 with the same perplexity as a sinusoidal model trained at 2048, while training 11% faster and using 11% less memory. ALiBi has been used in some real models (e.g. MPT, BLOOM), but as of early 2026 the field has converged hard on RoPE for frontier-class models. My read on why: RoPE’s relative-position-aware Q/K rotations feel more like a general mechanism, where ALiBi is a hand-designed bias term that happens to work for distance-decay specifically. I don’t have a clean public source attributing the convergence to a single reason, so this is a take, not a fact.

Option 4: No positional encoding at all (NoPE)

This one is the most surprising. Haviv et al. (arXiv 2203.16634, 2022) showed that causal (decoder-only) transformers with no positional encoding at all are still competitive with explicit-PE ones at language modeling, across multiple datasets and model sizes. The intuition the paper develops is that the causal mask itself — token i can only attend to tokens 1…i — leaks position information: a token at position 5 has 5 things to attend to; a token at position 50 has 50. (Their probing experiments suggest the model picks up an implicit notion of absolute position; the exact mechanism is more subtle than a literal “count.”)
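
A toy illustration of that leak (my construction, not the paper's experiment): make every unmasked score identical, so the only information left is how many tokens each position can see.

import numpy as np

v = np.arange(1.0, 6.0)[:, None]            # 5 tokens, 1-dim values
n = v.shape[0]
scores = np.where(np.tril(np.ones((n, n), dtype=bool)), 0.0, -np.inf)
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)          # uniform over each prefix

# Output at position i is the mean of the first i+1 values:
# position-dependent, despite zero positional encoding.
print((w @ v).ravel())                      # [1.  1.5 2.  2.5 3. ]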

Kazemnejad et al. (The Impact of Positional Encoding on Length Generalization in Transformers, arXiv 2305.19466, 2023) studied NoPE more formally, proving it can in principle represent both absolute and relative position, and found that on their algorithmic tasks NoPE actually length-generalized better than ALiBi, RoPE, or absolute embeddings. Even so, explicit positional encodings remain universal in production models, so this is more of a “nice to know the model could in principle figure it out” than a real production choice.

It’s still useful as a sanity check on what position encodings are actually doing. They’re not creating positional information out of thin air; they’re giving the model a more direct, less learning-required handle on position than it would otherwise have to discover.

Why long-context is mostly a RoPE-extension story

If you train a RoPE model on sequences up to 4k tokens, the rotation angles m · θ_k for the slowest-rotating pairs sweep through some specific range of angles for m ∈ [0, 4k]. At inference time, if you suddenly feed it a 32k-token sequence, those slow-rotating pairs see angles 8× larger than anything in training. The attention math doesn’t crash, but the model has never seen that part of the rotation phase space. Behavior degrades.

The fixes are all variants of “rescale θ so the angles stay in the trained range”; a sketch of the first two follows the list:

- Position Interpolation (Chen et al., arXiv 2306.15595): scale every position index by train_len / target_len, so a 32k-token context sweeps the same angle range a 4k context did during training, at the cost of squeezing positions closer together.
- NTK-aware scaling: raise the 10000 base instead of shrinking positions, which stretches the slow, low-frequency pairs much more than the fast ones.
- YaRN (Peng et al., arXiv 2309.00071): interpolate different frequency bands by different amounts, combined with a temperature adjustment on attention.
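
A sketch of those first two knobs, reusing the rotation-angle computation from the RoPE snippet above (names and scale factors are mine):

import numpy as np

def rope_angles(pos, d, base=10000.0, pos_scale=1.0):
    # Position Interpolation: shrink positions by train_len / target_len
    # so the slow pairs never leave the angle range seen in training.
    theta = base ** (-np.arange(0, d, 2) / d)
    return (pos * pos_scale) * theta

d, train_len, target_len = 128, 4096, 32768

# PI: position 32767 now sweeps the same angles position ~4096 did.
pi = rope_angles(32767, d, pos_scale=train_len / target_len)

# NTK-aware scaling keeps positions as-is and raises the base instead,
# commonly by base * k**(d / (d - 2)) for a length factor k, which
# stretches the slow (low-frequency) pairs far more than the fast ones.
k = target_len / train_len
ntk = rope_angles(32767, d, base=10000.0 * k ** (d / (d - 2)))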

All of these are RoPE-specific. They’re not adjustments to attention, or to the KV cache, or to the model weights — they’re knobs on the rotation schedule. That’s the deepest sense in which the choice of positional encoding shapes what the long-context regime even looks like.

The seams worth seeing

The thing to take away: positional encoding isn’t a clever optimization or a flourish. It’s the bare minimum surgery that turns attention from a set operation into a sequence operation. Every choice — sinusoidal, learned, RoPE, ALiBi, none — is a different bet about which facts about position the model should get for free, and which ones it should have to learn from data.
