Why LayerNorm (and RMSNorm) exist
Every transformer block has a normalization step. Pull it out and training falls apart in the first thousand steps. Why is this tiny operation load-bearing?
Why it exists
Open any transformer implementation. Inside every block, sandwiched between the attention and the feed-forward, there is a tiny operation that looks like an afterthought: take the activations, subtract their mean, divide by their standard deviation, multiply by a learned scale. Five lines of code. No moving parts. Nobody talks about it.
Pull it out and, at modern depth and learning rate, the model usually doesn’t train. The loss explodes in the first thousand steps, or plateaus at the cross-entropy of “guess uniformly.” (You can sometimes rescue an un-normalized transformer with very careful initialization and residual scaling, but that’s its own line of research, not the default.) This is the kind of failure where, if you’ve never seen it, you assume your data loader is broken.
So the question is: why does a deep network — billions of parameters, trillions of training tokens, attention heads doing all the linguistically interesting work — collapse without a five-line rescaling step?
The short version is that gradients in deep networks are a chain of multiplications, and chains of multiplications go to zero or infinity unless something keeps them tame. Normalization is what keeps them tame. Everything else in the architecture assumes it.
Why it matters now
If you’re building, fine-tuning, or even just reasoning about modern LLMs, the choice of normalization shows up in places you’d rather not have to think about:
- Architecture diffs across model families. Llama, Mistral, Gemma, Qwen, DeepSeek, T5, and PaLM use RMSNorm; GPT-2 / GPT-3, and most public OpenAI checkpoints, use classical LayerNorm. When you port weights or compare papers, this is one of the first “is this the same shape” checks.
- Pre-LN vs. Post-LN. Where the norm sits inside the residual block strongly affects whether the model can be trained without a learning-rate warm-up. Pre-LN dominates recent large LLMs; the original 2017 transformer was Post-LN, and arguably that difference contributed to a few years of mysterious training crashes before the community settled on Pre-LN as the default.
- Quantization and inference kernels. Norm layers are tiny in parameter count but they are sequential dependencies in the forward pass — every fused-attention kernel and every quantization scheme has to deal with them explicitly. RMSNorm’s popularity is partly because it has one fewer reduction to implement on a GPU.
- Debugging training runs. “Loss is NaN at step 800” is, more often than not, an interaction between the norm, the residual stream, and the learning-rate schedule. You can’t reason about it without knowing what the norm is doing.
The norm is small, boring, and load-bearing. It’s the structural beam in the wall — easy to forget about until you remove it.
The short answer
LayerNorm = (per-token: subtract mean, divide by std) + learned scale and shift
RMSNorm = (per-token: divide by RMS) + learned scale, no shift
A normalization layer rescales each token’s activation vector so its magnitude is roughly fixed, regardless of how the previous layers chose to amplify it. That fixed magnitude is what makes the gradient chain through a hundred-layer network behave. RMSNorm is the same idea with the mean-subtraction step removed, after Zhang and Sennrich (2019) showed empirically that the centering wasn’t doing useful work.
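To make the two definitions concrete, here is a minimal sketch in PyTorch (the function names and the eps value are mine; real implementations fuse all of this into a single kernel):

```python
import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics are per token: reduce over the hidden dimension only.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps) * gamma + beta

def rms_norm(x, gamma, eps=1e-5):
    # Same shape of operation, minus the centering: one reduction, no shift.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms * gamma

x = torch.randn(2, 4, 8)                     # (batch, tokens, hidden)
out = layer_norm(x, torch.ones(8), torch.zeros(8))
print(out.mean(dim=-1).abs().max())          # ~0: every token re-centered
print(out.std(dim=-1, unbiased=False))       # ~1: every token re-scaled
```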
How it works
Three things to get right: what normalization actually does to the math, why the layer-flavored version was needed for transformers in the first place, and why RMSNorm is the modern default.
What it does to the gradient
A deep network is a chain of matrix multiplications, nonlinearities, and adds. The gradient at the bottom is the product of all the Jacobians along the way. If the typical magnitude of those Jacobians is greater than 1, the product blows up. If it’s less than 1, the product shrinks to zero. Either way you stop learning.
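You can watch this happen in isolation. The toy below (my construction, not from any paper) pushes a vector through 100 random linear layers; the per-layer gain compounds exponentially, which is exactly what the Jacobian product does to gradients:

```python
import torch

torch.manual_seed(0)
d, depth = 256, 100

for gain in (0.9, 1.0, 1.1):
    h = torch.randn(d)
    for _ in range(depth):
        # Random layer scaled so it multiplies the norm by ~gain on average.
        W = torch.randn(d, d) * (gain / d**0.5)
        h = W @ h
    print(f"gain={gain}: |h| after {depth} layers = {h.norm():.2e}")
# 0.9**100 ≈ 3e-5 and 1.1**100 ≈ 1e4: tiny per-layer gain errors compound.
```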
You can try to control this by careful initialization (Xavier, He, etc.) and by clipping gradients. Those help. They are not enough at the depth and learning rate transformers want to train at.
Normalization is the brute-force fix. After every layer (or twice per block, in a transformer), you reach in and rescale the activations so that their magnitude is fixed by construction. The network can no longer drift into a regime where one block multiplies by 100 and the next divides by 100. Activations stay in a roughly controlled range across depth, which tends to keep the chain of Jacobians from blowing up or collapsing — not as a guarantee, but as a strong empirical regularity.
There’s a learned scale (and, in LayerNorm, a learned shift) so the network can still choose a non-unit magnitude where it’s useful — it just has to ask for it explicitly through a parameter, instead of getting it accidentally through compounded weight scales.
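Extending the toy from above: insert a bare divide-by-RMS after each layer and the drift disappears, whatever the per-layer gain. (This is the norm stripped of its learned parameters; a sketch, not a full RMSNorm.)

```python
import torch

torch.manual_seed(0)
d, depth = 256, 100

def run(gain, normalize):
    h = torch.randn(d)
    for _ in range(depth):
        h = (torch.randn(d, d) * (gain / d**0.5)) @ h
        if normalize:
            h = h / h.pow(2).mean().sqrt()   # pin RMS to 1 by construction
    return h.norm().item()

for gain in (0.9, 1.1):
    print(f"gain={gain}: no norm {run(gain, False):.2e}, "
          f"with norm {run(gain, True):.2e}")  # with norm: always ~sqrt(d)=16
```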
Why “Layer” Norm and not Batch Norm
The 2016 LayerNorm paper (Ba, Kiros, Hinton) was a response to a specific limit of Batch Normalization: BatchNorm normalizes each feature across the batch, which means the statistics depend on which other examples happen to be in the mini-batch. That’s fine for image classifiers with fixed-size batches in training. It’s a problem for sequence models, where you have variable lengths, padding, and — at inference — autoregressive generation that produces tokens one at a time. There is no “mini-batch of one token” that gives you meaningful BatchNorm statistics.
LayerNorm sidesteps this by normalizing per token, across the hidden dimension. Each token’s activation vector is normalized using only its own statistics. Training and inference behave identically. Batch size doesn’t matter. Padded positions don’t pollute anything. This is the property transformers need and it’s why the original transformer (and most that followed) reached for LayerNorm rather than BatchNorm, despite BatchNorm being older and more famous. Some later models switched to RMSNorm; almost none went back to BatchNorm.
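A quick way to see the difference in axes, using PyTorch’s built-in nn.LayerNorm against hand-rolled batch statistics:

```python
import torch

torch.manual_seed(0)
d = 8
token = torch.randn(1, d)                        # one token's activations
batch_a = torch.cat([token, torch.randn(3, d)])  # same token, two different
batch_b = torch.cat([token, torch.randn(3, d)])  # sets of batchmates

# LayerNorm: per-token statistics. The token's output ignores the batch.
ln = torch.nn.LayerNorm(d)
print(torch.allclose(ln(batch_a)[0], ln(batch_b)[0]))   # True

# BatchNorm-style training statistics: per-feature, across the batch.
def batch_stats(x, eps=1e-5):
    return (x - x.mean(dim=0)) / (x.std(dim=0, unbiased=False) + eps)

# The same token comes out different depending on who shares its batch.
print(torch.allclose(batch_stats(batch_a)[0], batch_stats(batch_b)[0]))  # False
```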
Pre-LN vs. Post-LN — the placement that matters
The original “Attention Is All You Need” transformer placed LayerNorm after the residual add: LN(x + Sublayer(x)). This is Post-LN. It worked, but in practice it required a careful learning-rate warm-up that ramped from near zero over thousands of steps; without warm-up, training tended to diverge. Nobody could quite say why.
Xiong et al. (2020), On Layer Normalization in the Transformer Architecture, gave the answer. In Post-LN, the gradients near the output layers are large at initialization, so a learning rate that is sane for the early layers is a learning-rate-from-hell for the layers near the top. Warm-up was a workaround: start with a learning rate so tiny that nothing diverges, then ramp once the network has shaped itself.
Move the norm inside the residual block — x + Sublayer(LN(x)), which is Pre-LN — and the gradients are well-behaved at initialization across all depths. You can often drop warm-up entirely, or shrink it dramatically. You can train deeper models. You can use higher learning rates.
This is one of those changes where the diff is one line and the practical consequence is large — Pre-LN is one of several enabling factors (alongside better optimizers, gradient clipping, and init schemes) that make training very deep, very large transformers tractable. Pre-LN is the dominant choice in recent large LLMs.
Why RMSNorm replaced LayerNorm in many modern LLMs
LayerNorm does two things to each token vector: subtract the mean (centering) and divide by the standard deviation (scaling). Zhang and Sennrich (2019) asked: do we actually need the centering?
Their hypothesis: the re-scaling invariance is what stabilizes training. The re-centering invariance is mostly cosmetic. They removed the mean subtraction, kept only the divide-by-RMS, and called it RMSNorm.
LayerNorm(x) = (x - mean(x)) / std(x) * gamma + beta
RMSNorm(x) = x / sqrt(mean(x^2) + eps) * gamma
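The relationship is easy to check numerically: on a zero-mean vector (and with beta = 0) the two coincide, because std(x) equals RMS(x) exactly when mean(x) = 0:

```python
import torch

x = torch.randn(1024)
x = x - x.mean()                                   # force a zero-mean input

ln_core  = (x - x.mean()) / x.std(unbiased=False)  # LayerNorm, gamma=1, beta=0
rms_core = x / x.pow(2).mean().sqrt()              # RMSNorm, gamma=1

print(torch.allclose(ln_core, rms_core, atol=1e-6))  # True
```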
Empirically, in their paper and in many subsequent reports, models trained with RMSNorm reach roughly the same loss as LayerNorm — sometimes a hair better, sometimes a hair worse, broadly comparable. But RMSNorm is cheaper: one fewer reduction over the hidden dimension, no shift parameter, simpler GPU kernels. The reported speedups in the original paper were 7%-64% depending on model and hardware, which is real money at LLM training scale.
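The parameter-count half of that saving is easy to verify (nn.RMSNorm ships in recent PyTorch releases); the kernel half, one reduction instead of two, shows up in profiles rather than in parameter counts:

```python
import torch.nn as nn

d = 4096
ln = nn.LayerNorm(d)    # gamma and beta: 2*d parameters
rms = nn.RMSNorm(d)     # gamma only: d parameters
print(sum(p.numel() for p in ln.parameters()))    # 8192
print(sum(p.numel() for p in rms.parameters()))   # 4096
```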
This is why Llama, Mistral, Gemma, Qwen, DeepSeek, and T5 all use RMSNorm; PaLM is also commonly described as using RMSNorm via its T5 lineage, though the original PaLM paper’s normalization wording is less explicit than the later open-model reports. GPT-2 and GPT-3 used LayerNorm because they predate the widespread RMSNorm result; I don’t have public confirmation of GPT-4-and-beyond’s normalization choice, so don’t quote me on the closed models.
The honest gaps
A few things are not fully understood, and the literature is still arguing:
- Why mean-centering doesn’t matter. Zhang and Sennrich showed empirically that you can drop it. The theoretical story for why re-centering is dispensable is still being filled in. Recent work (e.g. Re-Introducing LayerNorm: Geometric Meaning, Irreversibility and a Comparative Study with RMSNorm, 2024) revisits this with geometric arguments. There isn’t a single textbook proof.
- Whether normalization is strictly necessary or just very convenient. There’s a small but persistent line of work on norm-free transformers — careful init, residual scaling, no explicit norm. They sometimes match LayerNorm baselines on small models. They have not, so far, replaced LayerNorm/RMSNorm in production LLMs. My read is “norms are an empirically dominant local optimum, not a mathematical requirement,” but the question is genuinely open.
- The exact gradient story. “Normalization keeps gradients tame” is the right shape of the answer, but the precise mechanism by which Pre-LN gradients become uniform across depth involves some nontrivial linear-algebra accounting that’s easier to verify experimentally than to derive cleanly from first principles. The Xiong 2020 paper does the derivation; it’s not a one-liner.
The thing to walk away with: normalization is the part of the transformer that exists to keep the rest of the transformer trainable. It’s not where the model stores knowledge. It’s not what makes the architecture expressive. It’s the structural support that lets a hundred residual blocks chain together without the gradient landscape going to hell. Modern transformers picked Pre-LN for the gradient reason, and many picked RMSNorm because the mean-subtraction turned out to be free to remove.
Famous related terms
- LayerNorm — per-token (subtract mean + divide by std) + learned scale + learned shift. The 2016 default. Still used in GPT-style models.
- RMSNorm — per-token divide by RMS + learned scale. LayerNorm with the centering removed. The default in Llama, Mistral, Gemma, Qwen, DeepSeek, T5, PaLM.
- BatchNorm — per-feature normalize across the batch + learned affine. The original normalization, from 2015 (Ioffe & Szegedy). Dominant in CNNs, ill-suited to sequence models because batch statistics couple examples together.
- Pre-LN — norm goes inside the residual block: x + Sublayer(LN(x)). The placement that often lets you skip or shrink learning-rate warm-up and train deep transformers stably.
- Post-LN — norm goes after the residual add: LN(x + Sublayer(x)). The 2017 original. Works, but typically requires warm-up; uncommon in big modern LLMs.
- Softmax — softmax(x) = exp(x) / Σ exp(x). Also a normalization, of a different flavor: it normalizes a vector into a probability distribution. Unrelated mechanism, related vibe.
- DyT (Dynamic Tanh) — a recent (2025) proposal to replace normalization layers with a learned tanh. Interesting, not yet standard. Worth watching, not betting on.
Going deeper
- Layer Normalization (Ba, Kiros, Hinton, 2016) — the original paper, written in the RNN era, where the case against BatchNorm is cleanest.
- Root Mean Square Layer Normalization (Zhang & Sennrich, 2019) — the RMSNorm paper. Short, empirical, and has aged well.
- On Layer Normalization in the Transformer Architecture (Xiong et al., 2020) — the gradient-analysis paper that explains why Pre-LN works and Post-LN needs warm-up. The single best read for understanding why placement matters.
- Understanding and Improving Layer Normalization (Xu et al., 2019) — a follow-up that picks at the centering-vs-scaling question from a different angle.
- Sebastian Raschka’s “Why do many modern LLMs use RMSNorm instead of LayerNorm?” — a clean, illustrated walkthrough if you want a visual intuition.