Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why Adam beat plain SGD for LLMs

Vision models are mostly trained with SGD + momentum. Transformers are almost always trained with Adam or AdamW. Why did one optimizer win one regime and lose the other?

AI & ML · intermediate · Apr 29, 2026

Why it exists

If you read a CNN-era vision paper, the optimizer is almost always the same: SGD with momentum. ResNets, the original ImageNet results, the workhorse 2010s computer-vision recipes — SGD + momentum, with a hand-tuned learning-rate schedule. It’s boring in the good way: cheap, well understood, generalizes well. (Modern ViT-style vision recipes have largely moved to AdamW, but the contrast still holds for the architecture-by-architecture story.)

Now open a paper that trains a transformer. Pick almost any one. The optimizer is Adam or its cousin AdamW. Llama, GPT-style pretraining, fine-tuning, the open-source recipes — all Adam-flavored. Almost nobody trains a serious LLM with plain SGD — recent work has started revisiting that, but it’s the exception, not the default.

That should feel weird. SGD is the simpler algorithm. Adam stores two extra fp32 buffers per parameter — for a 70B-parameter model, ~560 GB of optimizer state, roughly 2× the size of the fp32 weights themselves. Why is the expensive, more complicated optimizer the one that won the regime where compute is most precious?
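Back-of-envelope, using the numbers above:

```python
params = 70e9                                  # 70B parameters
fp32_bytes = 4
weights_gb = params * fp32_bytes / 1e9         # ≈ 280 GB of fp32 weights
adam_state_gb = 2 * params * fp32_bytes / 1e9  # m and v buffers ≈ 560 GB
print(weights_gb, adam_state_gb)
```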

The answer isn’t “Adam is just better.” Adam is better at the specific shape of the loss landscape transformers produce on text. That qualifier is the whole story.

Why it matters now

Optimizer choice looks like a footnote until you start paying for it.

The short answer

Adam = SGD + per-parameter learning rate that adapts to each parameter's gradient history

SGD uses one global learning rate for every parameter. Adam keeps a running estimate of how big each parameter’s gradient typically is, and shrinks the step for parameters whose gradients are large or noisy while letting parameters with small gradients take bigger steps. That single change — making the effective step size per-parameter and adaptive — is the whole difference. For loss landscapes where different parameters live on wildly different scales (like transformers), it turns out to matter a lot.

How it works

Plain gradient descent is one rule:

w ← w − η · g

where g is the gradient of the loss with respect to w and η is the learning rate. Same η for every parameter in the model.

Adam tracks two extra running averages per parameter: m, an exponential moving average of the gradient, and v, an exponential moving average of the squared gradient. After a bias correction that turns them into m̂ and v̂, the update is:

w ← w − η · m̂ / (√v̂ + ε)

The √v̂ denominator is the trick. Parameters whose gradients are large in magnitude (whether from bad scaling, sharp curvature, or one frequent training signal) get their effective step shrunk. Parameters with small, consistent gradients keep their full η. It’s a per-parameter rescaling that emerges automatically from the training data — no human has to set it.
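Spelled out as code, the two rules look like this: a minimal NumPy sketch with the usual β₁, β₂, ε defaults, not any particular framework’s implementation.

```python
import numpy as np

def sgd_step(w, g, lr):
    # Same learning rate for every coordinate of w.
    return w - lr * g

def adam_step(w, g, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    # Bias correction: m and v start at zero, so early estimates run small.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter rescaling: coordinates with large typical gradients get a
    # smaller effective step, coordinates with small ones keep close to lr.
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# One step on gradients spanning four orders of magnitude: every coordinate
# moves by roughly lr, because m_hat / sqrt(v_hat) ≈ sign(g) at t = 1.
w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
w, m, v = adam_step(w, np.array([10.0, 0.1, 0.001]), m, v, t=1, lr=0.1)
print(w)
```

In a real trainer the same per-coordinate arithmetic runs over every tensor in the model; this is just the scalar-level view.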

That sounds like a small detail. For transformers on text, it isn’t.

Why this matters more for transformers than for CNNs

There are at least three lines of explanation in the literature, and they aren’t mutually exclusive. None is fully settled — treat this whole section as a working synthesis, not a verdict.

1. The Hessian is “block heterogeneous.”

The Hessian of a transformer’s loss has very different curvature in different parameter blocks: the attention weights, the MLP weights, the embedding table, and the LayerNorm scales all live on different scales. A single global learning rate is forced to compromise: small enough to stay stable on the sharpest block, which leaves the flatter blocks learning far too slowly. Adam is coordinate-wise adaptive, but in practice that ends up giving different parameter blocks usefully different effective scales. “Why Transformers Need Adam: A Hessian Perspective” (Zhang et al., NeurIPS 2024, arXiv 2402.16788) argues that block-wise Hessian heterogeneity is a key reason SGD struggles on transformers and that this heterogeneity is much milder in CNNs. They argue a cause; I wouldn’t read it as the settled cause.
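A toy illustration of that compromise (not the paper’s experiment): a two-coordinate quadratic where one made-up “block” is 100× sharper than the other. The curvatures and learning rates are invented numbers, chosen only to make the effect visible.

```python
import numpy as np

# Loss L(w) = 0.5 * (100 * w0**2 + 1 * w1**2): one sharp block, one flat block.
curv = np.array([100.0, 1.0])

def grad(w):
    return curv * w

w_sgd = np.array([1.0, 1.0])
w_adam = np.array([1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
beta1, beta2, eps = 0.9, 0.999, 1e-8

# SGD's single lr must stay below 2/100 to keep the sharp coordinate stable,
# so the flat coordinate crawls. Adam can run a much larger base lr.
lr_sgd, lr_adam = 0.015, 0.05

for t in range(1, 51):
    w_sgd = w_sgd - lr_sgd * grad(w_sgd)

    g = grad(w_adam)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    w_adam = w_adam - lr_adam * (m / (1 - beta1 ** t)) / (np.sqrt(v / (1 - beta2 ** t)) + eps)

print("SGD :", w_sgd)    # sharp coordinate ≈ 0, flat coordinate still ≈ 0.5
print("Adam:", w_adam)   # both coordinates well on their way to 0
```

The sharp coordinate pins SGD’s learning rate, so the flat one barely moves; Adam’s per-coordinate rescaling lets both make progress at once.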

2. Token frequencies are heavy-tailed.

Natural-language tokens follow something Zipf-shaped: a small set of very common tokens (“the”, the comma, the period, “of”) and a long tail of rare ones. If you train with SGD’s single global learning rate, the gradient signal at the output layer is dominated by frequent tokens. Rare tokens make tiny contributions to the average gradient, so under plain SGD the loss on rare-token classes goes down much more slowly than on frequent ones. Adam divides by the per-parameter √v, which is small exactly for the parameters that rarely receive a gradient — so rare-token directions get amplified instead of drowned out. Kunstner et al.’s “Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models” (NeurIPS 2024, arXiv 2402.19449) makes this argument with a deliberately designed empirical study.
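A cartoon of that mechanism, not Kunstner et al.’s setup: take the averaged gradient on the output-layer bias of a softmax model that currently predicts uniformly, with Zipf-shaped class frequencies, and compare the per-class SGD step with an Adam-style step where v is approximated as g². Vocabulary size and learning rate are arbitrary.

```python
import numpy as np

V = 1000                                   # toy vocabulary size
freq = 1.0 / np.arange(1, V + 1)           # Zipf-shaped token frequencies
freq /= freq.sum()

# Averaged cross-entropy gradient w.r.t. the output-layer bias for a model
# that predicts uniformly: dL/db_c ≈ p_model(c) - p_data(c) = 1/V - freq[c].
g = 1.0 / V - freq

lr = 0.1
sgd_step_size = lr * np.abs(g)                              # tracks class frequency
adam_step_size = lr * np.abs(g) / (np.sqrt(g ** 2) + 1e-8)  # ≈ lr for every class

print("SGD step, most frequent class:", sgd_step_size[0])       # ~1e-2
print("SGD step, median-rank class  :", sgd_step_size[V // 2])  # ~1e-4
print("Adam-ish step, same classes  :", adam_step_size[0], adam_step_size[V // 2])
```

Under SGD the per-class step size tracks how often a class appears; under the Adam-style normalization every class moves at roughly the base learning rate.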

3. SGD’s update directions are too sharp.

A complementary line of work looks at the directional sharpness of the update — how much the loss curves along the direction you’re stepping in. Pan & Li’s “Toward Understanding Why Adam Converges Faster Than SGD for Transformers” (arXiv 2306.00204) argues that SGD’s update steps land in much sharper directions than Adam’s on transformers, that this is driven by a few coordinates with poorly behaved curvature, and that Adam’s coordinate-wise scaling is essentially a directional-sharpness reduction. They show that adding coordinate-wise gradient clipping to SGD recovers much of Adam’s behavior, which is suggestive of where the gap actually lives.
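For intuition, coordinate-wise clipping on top of SGD is only a few lines; the clip threshold below is a made-up value, not one from the paper.

```python
import numpy as np

def clipped_sgd_step(w, g, lr=0.1, clip=1e-2):
    # Clip each coordinate of the gradient separately (not the global norm),
    # then take a plain SGD step. The few coordinates with huge gradients in
    # sharply curved directions get capped; everything else is untouched.
    return w - lr * np.clip(g, -clip, clip)
```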

The seams

Where the textbook story gets less clean:

The compressed lesson: the optimizer that wins is the one that matches the shape of your loss landscape. Vision and language produce different shapes — different curvature, different gradient distributions, different noise — and the same optimizer doesn’t dominate both.

Going deeper