
Why RoPE replaced sinusoidal positional encoding

The original transformer added a fixed sine/cosine vector to each token. Almost no frontier model does that anymore. RoPE rotates queries and keys instead — and that one structural change is what made long context tractable.

AI & ML · Intermediate · Apr 30, 2026

Why it exists

“The cat sat on the mat” and “The mat sat on the cat” contain exactly the same six words. Their meaning is completely different. A transformer’s attention layer, by default, can’t tell them apart — it sees a bag of words, not an order. So every transformer has to bolt some notion of “position” onto its tokens before it can mean anything. The way you bolt it on turns out to matter a lot, especially when you ask the model to read a 1-million-token document. The original transformer added a fixed sine/cosine pattern to each token. RoPE replaced that with a rotation. That single change is what unlocked long context.

The original transformer had a positional-encoding problem and a positional-encoding answer, and the answer turned out to be wrong in a way that only became obvious later.

Vaswani et al.’s Attention Is All You Need (2017) added a fixed vector — sines and cosines at geometrically spaced frequencies — to each token’s input embedding. Position 0 got one vector, position 1 got another, and so on. The vector was glued onto the token before it ever entered the attention stack. The intuition was elegant: different frequencies encode different positional scales, and any relative offset between two positions can in principle be recovered as a linear function of those sines and cosines.
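For concreteness, here is a minimal sketch of that scheme in NumPy. The base of 10000 and the sine/cosine interleaving follow the paper; the function name and variable names are mine, and d_model is assumed even:

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int, base: float = 10000.0) -> np.ndarray:
    """Fixed sine/cosine position vectors as in Vaswani et al. (2017).

    Row p is the vector added to the embedding of the token at position p.
    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model/2) -> 2i
    angles = positions / (base ** (dims / d_model))    # geometric frequency ladder
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions: cosine
    return pe

# Added once, at the input:
# x = token_embeddings + sinusoidal_positions(seq_len, d_model)
```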

In practice, the scheme worked well enough to ship the original transformer, then was steadily replaced. Learned absolute embeddings (BERT-style) took over for a while. Relative position biases (T5-style) and later ALiBi went in a different direction. And then in 2021 Jianlin Su and coauthors published RoPE in the RoFormer paper. By 2022–2024 essentially every major frontier LLM family had adopted it: GPT-NeoX, PaLM, LLaMA 1/2/3, Mistral, Falcon, Gemma, Qwen. Most of those were born with RoPE rather than switched — the lineage just chose RoPE over sinusoidal from the start.

The reason isn’t that sinusoidal “didn’t work.” It’s that adding position to the embedding is the wrong mathematical operation if what you actually want is for attention scores between two tokens to depend on their relative distance, not their absolute indices. RoPE encodes position as a rotation of the query and key vectors inside attention, and that rotation falls out of the dot product as a function of the offset between positions. The math wants this; sinusoidal addition was a workaround.

Why it matters now

The switch isn’t just academic — it shows up in three places engineers hit constantly.

The short answer

RoPE = rotate Q and K by an angle proportional to position, before the dot product

Where sinusoidal positional encoding adds a position vector to each token embedding once at the input, RoPE rotates the query and key vectors at every attention layer by an angle that scales with the token’s position index. Because rotations preserve magnitude and compose by addition of angles, the positional contribution to the q·k score between a query at position m and a key at position n depends only on the difference m − n. Position becomes a property of the interaction between two tokens, not a property baked into one of them.
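Here’s a quick numerical check of that claim on a single 2-D pair — toy values throughout, nothing model-specific:

```python
import math

def rotate(v, angle):
    """Rotate a 2-D vector by the given angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (0.3, -1.2), (0.8, 0.5)   # toy query/key pair
theta = 0.07                      # one RoPE frequency

# Same offset (m - n = 7) at two different absolute positions:
s1 = dot(rotate(q, 10 * theta), rotate(k, 3 * theta))      # m = 10,  n = 3
s2 = dot(rotate(q, 107 * theta), rotate(k, 100 * theta))   # m = 107, n = 100
print(abs(s1 - s2) < 1e-9)  # True: the score depends only on m - n
```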

How it works

Pick a query vector q at position m and a key vector k at position n. RoPE splits each vector into 2-dimensional pairs and treats each pair as a point in the plane. For each pair i, it picks a frequency θᵢ (from a geometric series, the same flavor of frequency ladder the sinusoidal scheme used) and rotates the query pair by angle m·θᵢ and the key pair by n·θᵢ.

Because rotations in 2D compose by adding angles, the dot product of the two rotated pairs is:

rotated_q · rotated_k = |q||k| cos((m − n)·θᵢ + original_angle_between_q_and_k)

The absolute positions m and n drop out. Only their difference survives. Stack many such 2D pairs at different θᵢ and you’ve encoded position as a rotation pattern that the attention dot product naturally translates into a relative-distance signal. No addition, no learned table, no separate bias — just a structural choice about where in the embedding space “position” lives.
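Here’s a compact sketch of the whole mechanism in NumPy. The pairing of dimensions (2i, 2i+1) and the base of 10000 follow the RoFormer formulation; the function name and the sanity check at the end are mine:

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate a query/key vector x (even length d) by position-dependent angles.

    Dimensions (2i, 2i+1) form a plane rotated by pos * theta_i,
    with theta_i = base^(-2i/d) — the same geometric frequency ladder
    sinusoidal encoding uses.
    """
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # (d/2,) frequencies
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                   # split into 2-D pairs
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin             # rotate each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)

# Attention score for offset 5, computed at two different absolute positions:
score_a = rope(q, 12) @ rope(k, 7)
score_b = rope(q, 912) @ rope(k, 907)
print(np.allclose(score_a, score_b))  # True — only m - n matters
```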

A few consequences fall out of this:

The honest seam: I have not seen a clean ablation that isolates RoPE versus sinusoidal at modern scale, controlling for everything else. The case for RoPE in the literature is partly mathematical (the relative-position property), partly empirical (RoFormer’s own results, then a cascade of large-model adoptions), and partly path-dependent — once LLaMA shipped with it, the open-source ecosystem standardized. I don’t have a source for a controlled head-to-head at, say, 70B parameters and 32k context. If anyone does, the result would be genuinely interesting.

Going deeper