Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why LayerNorm (and RMSNorm) exist

Every transformer block has a normalization step. Pull it out and training falls apart in the first thousand steps. Why is this tiny operation load-bearing?

AI & ML · intermediate · Apr 29, 2026

Why it exists

Open any transformer implementation. Inside every block, sandwiched between the attention and the feed-forward, there is a tiny operation that looks like an afterthought: take the activations, subtract their mean, divide by their standard deviation, multiply by a learned scale. Five lines of code. No moving parts. Nobody talks about it.

Pull it out and at modern depth and learning rate the model usually doesn’t train. The loss explodes in the first thousand steps, or plateaus at the cross-entropy of “guess uniformly.” (You can sometimes rescue an un-normalized transformer with very careful initialization and residual scaling, but that’s its own line of research, not the default.) This is the kind of failure where, if you’ve never seen it, you assume your data loader is broken.

So the question is: why does a deep network — billions of parameters, trillions of training tokens, attention heads doing all the linguistically interesting work — collapse without a five-line rescaling step?

The short version is that gradients in deep networks are a chain of multiplications, and chains of multiplications go to zero or infinity unless something keeps them tame. Normalization is what keeps them tame. Everything else in the architecture assumes it.

Why it matters now

If you’re building, fine-tuning, or even just reasoning about modern LLMs, the choice of normalization shows up in places you’d rather not have to think about. Where the norm sits (Pre-LN vs. Post-LN) decides whether you need a long learning-rate warm-up and how deep you can train. Which norm you use (LayerNorm vs. RMSNorm) decides your kernel cost and whether your code matches the open-weights checkpoint you’re adapting.

The norm is small, boring, and load-bearing. It’s the structural beam in the wall — easy to forget about until you remove it.

The short answer

LayerNorm = (per-token: subtract mean, divide by std) + learned scale and shift

RMSNorm = (per-token: divide by RMS) + learned scale, no shift

A normalization layer rescales each token’s activation vector so its magnitude is roughly fixed, regardless of how the previous layers chose to amplify it. That fixed magnitude is what makes the gradient chain through a hundred-layer network behave. RMSNorm is the same idea with the mean-subtraction step removed, after Zhang and Sennrich (2019) showed empirically that the centering wasn’t doing useful work.
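Both recipes fit in a few lines. Here is an illustrative NumPy sketch (not any particular framework's kernel; the shapes, `eps` value, and variable names are assumptions for the example):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Per token (last axis): subtract the mean, divide by the std,
    # then apply the learned scale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

def rms_norm(x, gamma, eps=1e-5):
    # Same idea with the centering removed: divide by the root-mean-square.
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

d = 8
rng = np.random.default_rng(0)
x = rng.standard_normal((4, d)) * 3.0 + 1.0  # "tokens" with drifted stats
gamma, beta = np.ones(d), np.zeros(d)

print(layer_norm(x, gamma, beta).std(axis=-1))              # ~1 per token
print(np.sqrt((rms_norm(x, gamma) ** 2).mean(axis=-1)))     # ~1 per token
```

However the previous layers scaled `x`, each token comes out with a magnitude near 1; the learned `gamma` (and `beta`, for LayerNorm) is the only way to get anything else.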

How it works

Three things to get right: what normalization actually does to the math, why the layer-flavored version was needed for transformers in the first place, and why RMSNorm is the modern default.

What it does to the gradient

A deep network is a chain of matrix multiplications, nonlinearities, and adds. The gradient at the bottom is the product of all the Jacobians along the way. If the typical magnitude of those Jacobians is greater than 1, the product blows up. If it’s less than 1, the product shrinks to zero. Either way you stop learning.
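The arithmetic of that compounding is worth seeing once. Treating each layer as contributing a single "Jacobian magnitude" is a toy model, but it shows how little bias away from 1.0 it takes:

```python
# Toy model of the gradient chain: the backward pass multiplies
# one factor per layer. A 10% bias compounds exponentially.
depth = 100
grow, shrink = 1.1, 0.9

print(grow ** depth)    # ~13,780 — gradients explode
print(shrink ** depth)  # ~2.7e-5 — gradients vanish
```

At 100 layers, "slightly more than 1" and "slightly less than 1" are four orders of magnitude apart in each direction.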

You can try to control this by careful initialization (Xavier, He, etc.) and by clipping gradients. Those help. They are not enough at the depth and learning rate transformers want to train at.

Normalization is the brute-force fix. After every layer (or twice per block, in a transformer), you reach in and rescale the activations so that their magnitude is fixed by construction. The network can no longer drift into a regime where one block multiplies by 100 and the next divides by 100. Activations stay in a roughly controlled range across depth, which tends to keep the chain of Jacobians from blowing up or collapsing — not as a guarantee, but as a strong empirical regularization.
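A small forward-pass experiment makes the contrast concrete. This sketch (dimensions, depth, and the 1.1 gain are arbitrary choices for illustration) runs the same stack of linear layers with and without a norm after each one:

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 64, 50

def rms_norm(x, eps=1e-5):
    # Pin the vector's root-mean-square to ~1, whatever its input scale.
    return x / np.sqrt(np.mean(x**2) + eps)

# Linear layers with an effective gain of ~1.1 each — the kind of mild
# miscalibration that initialization approximates away but can't eliminate.
Ws = [rng.standard_normal((d, d)) * (1.1 / np.sqrt(d)) for _ in range(depth)]

x_raw = x_norm = rng.standard_normal(d)
for W in Ws:
    x_raw = W @ x_raw              # drift compounds: roughly 1.1**50 growth
    x_norm = rms_norm(W @ x_norm)  # magnitude re-pinned after every layer

print(np.linalg.norm(x_raw))   # large
print(np.linalg.norm(x_norm))  # ~sqrt(d), i.e. ~8
```

The un-normalized activations grow by orders of magnitude over 50 layers; the normalized ones sit at the same scale at every depth by construction.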

There’s a learned scale (and, in LayerNorm, a learned shift) so the network can still choose a non-unit magnitude where it’s useful — it just has to ask for it explicitly through a parameter, instead of getting it accidentally through compounded weight scales.

Why “Layer” Norm and not Batch Norm

The 2016 LayerNorm paper (Ba, Kiros, Hinton) was a response to a specific limit of Batch Normalization: BatchNorm normalizes each feature across the batch, which means the statistics depend on which other examples happen to be in the mini-batch. That’s fine for image classifiers with fixed-size batches in training. It’s a problem for sequence models, where you have variable lengths, padding, and — at inference — autoregressive generation that produces tokens one at a time. There is no “mini-batch of one token” that gives you meaningful BatchNorm statistics.

LayerNorm sidesteps this by normalizing per token, across the hidden dimension. Each token’s activation vector is normalized using only its own statistics. Training and inference behave identically. Batch size doesn’t matter. Padded positions don’t pollute anything. This is the property transformers need and it’s why the original transformer (and most that followed) reached for LayerNorm rather than BatchNorm, despite BatchNorm being older and more famous. Some later models switched to RMSNorm; almost none went back to BatchNorm.
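The whole difference is which axis the statistics are computed over. A sketch (shapes are illustrative; real BatchNorm also keeps running statistics for inference, omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq, d = 2, 5, 8
x = rng.standard_normal((batch, seq, d))

# LayerNorm: statistics over the hidden dimension, one token at a time.
ln = (x - x.mean(-1, keepdims=True)) / x.std(-1, keepdims=True)

# BatchNorm (training mode): statistics over the batch (and positions here),
# so each token's output depends on what else is in the mini-batch.
bn = (x - x.mean((0, 1), keepdims=True)) / x.std((0, 1), keepdims=True)

# LayerNorm of one token in isolation matches the batched result.
# There is no analogous single-token BatchNorm: the std of one value is 0.
single = x[0, 0]
ln_single = (single - single.mean()) / single.std()
print(np.allclose(ln_single, ln[0, 0]))  # True
```

That last property is exactly what autoregressive generation needs: the token produced at step t can be normalized without reference to any batch.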

Pre-LN vs. Post-LN — the placement that matters

The original “Attention Is All You Need” transformer placed LayerNorm after the residual add: x + Sublayer(x) then norm. This is Post-LN. It worked, but in practice it required a careful learning-rate warm-up that ramped from near-zero over thousands of steps; without warm-up, training tended to diverge. Nobody could quite say why.

Xiong et al. (2020), On Layer Normalization in the Transformer Architecture, gave the answer. In Post-LN, the gradients near the output layer are large at initialization, which means a sane learning rate at depth 100 looks like a learning-rate-from-hell at depth 1. Warm-up was a workaround: start with such a tiny learning rate that nothing diverges, then ramp once the network has shaped itself.

Move the norm inside the residual block — x + Sublayer(LN(x)), which is Pre-LN — and the gradients are well-behaved at initialization across all depths. You can often drop warm-up entirely, or shrink it dramatically. You can train deeper models. You can use higher learning rates.

This is one of those changes where the diff is one line and the practical consequence is large — Pre-LN is one of several enabling factors (alongside better optimizers, gradient clipping, and init schemes) that make training very deep, very large transformers tractable. Pre-LN is the dominant choice in recent large LLMs.
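The one-line diff, as a structural sketch (RMSNorm stands in for the norm here, and `sublayer` is a stand-in for attention or the MLP, not a real implementation):

```python
import numpy as np

def rms_norm(x, eps=1e-5):
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def sublayer(x, W):
    # Placeholder for attention or the feed-forward: any learned transform.
    return np.tanh(x @ W)

def post_ln_block(x, W):
    # Original transformer: normalize AFTER the residual add.
    # The residual stream itself gets rescaled at every block.
    return rms_norm(x + sublayer(x, W))

def pre_ln_block(x, W):
    # Modern default: normalize the sublayer's INPUT. The residual
    # stream is never rescaled, leaving a clean identity gradient path.
    return x + sublayer(rms_norm(x), W)
```

In Pre-LN, the gradient can flow from the loss to any block through nothing but additions, which is why its scale stays comparable across depths at initialization.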

Why RMSNorm replaced LayerNorm in many modern LLMs

LayerNorm does two things to each token vector: subtract the mean (centering) and divide by the standard deviation (scaling). Zhang and Sennrich (2019) asked: do we actually need the centering?

Their hypothesis: the re-scaling invariance is what stabilizes training. The re-centering invariance is mostly cosmetic. They removed the mean subtraction, kept only the divide-by-RMS, and called it RMSNorm.

LayerNorm(x) = (x - mean(x)) / std(x) * gamma + beta
RMSNorm(x)   = x / sqrt(mean(x^2) + eps) * gamma

Empirically, in their paper and in many subsequent reports, models trained with RMSNorm reach roughly the same loss as LayerNorm — sometimes a hair better, sometimes a hair worse, broadly comparable. But RMSNorm is cheaper: one fewer reduction over the hidden dimension, no shift parameter, simpler GPU kernels. The reported speedups in the original paper were 7%-64% depending on model and hardware, which is real money at LLM training scale.

This is why Llama, Mistral, Gemma, Qwen, DeepSeek, and T5 all use RMSNorm; PaLM is also commonly described as using RMSNorm via its T5 lineage, though the original PaLM paper’s normalization wording is less explicit than the later open-model reports. GPT-2 and GPT-3 used LayerNorm because they predate the widespread RMSNorm result; I don’t have public confirmation of GPT-4-and-beyond’s normalization choice, so don’t quote me on the closed models.

The honest gaps

A few things are not fully understood, and the literature is still arguing. There is no complete theory of why pinning activation magnitudes keeps the full chain of Jacobians well-conditioned — the mechanism is argued about, the evidence is largely empirical. And the finding that the centering step is removable is an observation with a hypothesis attached (re-scaling invariance does the work, re-centering is cosmetic), not a proof.

The thing to walk away with: normalization is the part of the transformer that exists to keep the rest of the transformer trainable. It’s not where the model stores knowledge. It’s not what makes the architecture expressive. It’s the structural support that lets a hundred residual blocks chain together without the gradient landscape going to hell. Modern transformers picked Pre-LN for the gradient reason, and many picked RMSNorm because the mean-subtraction turned out to be free to remove.

Going deeper