Why Adam beat plain SGD for LLMs
Vision models are mostly trained with SGD + momentum. Transformers are almost always trained with Adam or AdamW. Why did one optimizer win one regime and lose the other?
Why it exists
If you read a CNN-era vision paper, the optimizer is almost always the same: SGD with momentum. ResNets, the original ImageNet results, the workhorse 2010s computer-vision recipes — SGD + momentum, with a hand-tuned learning-rate schedule. It’s boring in the good way: cheap, well understood, generalizes well. (Modern ViT-style vision recipes have largely moved to AdamW, but the contrast still holds for the architecture-by-architecture story.)
Now open a paper that trains a transformer. Pick almost any one. The optimizer is Adam or its cousin AdamW. Llama, GPT-style pretraining, fine-tuning, the open-source recipes — all Adam-flavored. Almost nobody trains a serious LLM with plain SGD — recent work has started revisiting that, but it’s the exception, not the default.
That should feel weird. SGD is the simpler algorithm. Adam stores two extra fp32 buffers per parameter — for a 70B-parameter model, ~560 GB of optimizer state, roughly 2× the size of the fp32 weights themselves. Why is the expensive, more complicated optimizer the one that won the regime where compute is most precious?
The answer isn’t “Adam is just better.” Adam is better at the specific shape of the loss landscape transformers produce on text. That qualifier is the whole story.
Why it matters now
Optimizer choice looks like a footnote until you start paying for it.
- VRAM budgets are dominated by optimizer state. During training you store weights, gradients, and Adam’s two moment buffers m and v. In fp32 that’s 8 bytes per parameter just for the optimizer state, on top of the 4 bytes for the weights and 4 for the gradients — so optimizer state alone is roughly the size of the model again. This is a big part of why memory-efficient optimizers like Adafactor and 8-bit Adam exist (and why systems-level tricks like ZeRO/FSDP shard the state across devices). A back-of-the-envelope sketch follows this list.
- Recipes don’t transfer. A schedule that works on a CNN does not trivially port to a transformer. If you don’t know why the field switched, you’ll waste runs trying to make SGD “just work” on a language model and conclude your code is broken.
- The frontier is moving. Recent work (Kunstner et al. 2024; follow-ups in 2025) has pushed back on the idea that Adam is necessarily better for LLMs — at small batch sizes, with the right tweaks, plain SGD can keep up. So the “why” isn’t settled folklore; it’s an active research question, and the answer matters for anyone trying to design the next generation of optimizers.
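To put numbers on the VRAM bullet, here is the back-of-the-envelope arithmetic as a tiny script. It only counts weights, gradients, and Adam’s two moment buffers in fp32 (no activations, no mixed-precision master-copy subtleties), and the function name is just illustrative.

```python
def adam_training_memory_gb(n_params: float) -> dict:
    """Rough fp32 memory accounting for training with Adam (illustrative only).

    Counts weights, gradients, and Adam's moment buffers m and v; ignores
    activations and mixed-precision tricks. Uses decimal GB (1e9 bytes)
    to match the figures quoted in the text.
    """
    gb = 1e9
    return {
        "weights_gb": 4 * n_params / gb,          # fp32 weights
        "gradients_gb": 4 * n_params / gb,        # fp32 gradients
        "optimizer_state_gb": 8 * n_params / gb,  # m and v, 4 bytes each
    }

print(adam_training_memory_gb(70e9))
# {'weights_gb': 280.0, 'gradients_gb': 280.0, 'optimizer_state_gb': 560.0}
```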
The short answer
Adam = SGD + per-parameter learning rate that adapts to each parameter's gradient history
SGD uses one global learning rate for every parameter. Adam keeps a running estimate of how big each parameter’s gradient typically is, and shrinks the step for parameters whose gradients are large or noisy while letting parameters with small gradients take bigger steps. That single change — making the effective step size per-parameter and adaptive — is the whole difference. For loss landscapes where different parameters live on wildly different scales (like transformers), it turns out to matter a lot.
How it works
Plain gradient descent is one rule:
w ← w − η · g
where g is the gradient of the loss with respect to w and η is the
learning rate. Same η for every parameter in the model.
Adam tracks two extra running averages per parameter — m (a running mean of the gradient) and v (a running mean of the squared gradient) — and applies:
w ← w − η · m̂ / (√v̂ + ε)
The √v̂ denominator is the trick. Parameters whose gradients are large in
magnitude (whether from bad scaling, sharp curvature, or one frequent
training signal) get their effective step shrunk. Parameters with small,
consistent gradients keep their full η. It’s a per-parameter rescaling
that emerges automatically from the training data — no human has to set it.
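As a concrete reference point, here is a minimal NumPy sketch of both update rules. The hyperparameter defaults are the conventional ones from the Adam paper; this skips weight decay, clipping, and everything else a production optimizer handles.

```python
import numpy as np

def sgd_step(w, g, lr=0.1):
    """Plain gradient descent: one global learning rate for every parameter."""
    return w - lr * g

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: the step is rescaled per parameter by 1/sqrt(v_hat)."""
    m = beta1 * m + (1 - beta1) * g          # running mean of the gradient
    v = beta2 * v + (1 - beta2) * g ** 2     # running mean of the squared gradient
    m_hat = m / (1 - beta1 ** t)             # correct the bias from the zero init
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```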
That sounds like a small detail. For transformers on text, it isn’t.
Why this matters more for transformers than for CNNs
There are at least three lines of explanation in the literature, and they aren’t mutually exclusive. None is fully settled — treat this whole section as a working synthesis, not a verdict.
1. The Hessian is “block heterogeneous.”
The Hessian of a transformer’s loss has very different curvature in different parameter blocks: the attention weights, the MLP weights, the embedding table, and the LayerNorm scales all live on different scales. A single global learning rate is forced to compromise: small enough for the sharpest block, which is wasteful for the flat ones. Adam is coordinate-wise adaptive, but in practice that ends up giving different parameter blocks usefully different effective scales. “Why Transformers Need Adam: A Hessian Perspective” (Zhang et al., NeurIPS 2024, arXiv 2402.16788) argues that block-wise Hessian heterogeneity is a key reason SGD struggles on transformers and that this heterogeneity is much milder in CNNs. They argue a cause; I wouldn’t read it as the settled cause.
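A toy version of that compromise (my illustration, not the paper’s experiment): two coordinates standing in for two parameter blocks, one with 10,000× the curvature of the other. Gradient descent’s single learning rate has to respect the sharp block, so the flat block barely moves; an Adam-style per-coordinate rescaling steps both at a sensible rate.

```python
import numpy as np

# Two coordinates standing in for two parameter "blocks" with very
# different curvature: f(w) = 0.5 * (100 * w0**2 + 0.01 * w1**2).
curvature = np.array([100.0, 0.01])

def grad(w):
    return curvature * w

def descend(adaptive, lr, steps=200, beta2=0.999, eps=1e-8):
    w = np.array([1.0, 1.0])
    v = np.zeros(2)
    for t in range(1, steps + 1):
        g = grad(w)
        if adaptive:
            # Adam-style: divide each coordinate by its gradient's running RMS.
            v = beta2 * v + (1 - beta2) * g ** 2
            w = w - lr * g / (np.sqrt(v / (1 - beta2 ** t)) + eps)
        else:
            w = w - lr * g
    return w

# Gradient descent is only stable when lr < 2 / (sharpest curvature) = 0.02,
# so the sharp block dictates the single global learning rate.
print("GD        :", descend(adaptive=False, lr=0.018))  # flat coordinate barely moves
print("Adam-style:", descend(adaptive=True,  lr=0.1))    # both coordinates shrink toward 0
```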
2. Token frequencies are heavy-tailed.
Natural-language tokens follow something Zipf-shaped: a small set of very
common tokens (“the”, “,”, “.”, “of”) and a long tail of rare ones. If you
train with SGD’s single global learning rate, the gradient signal at the
output layer is dominated by frequent tokens. Rare tokens make tiny
contributions to the average gradient, so under plain SGD the loss on
rare-token classes goes down much more slowly than on frequent ones. Adam
divides by the per-parameter √v, which is small exactly for the
parameters that rarely receive a gradient — so rare-token directions get
amplified instead of drowned out. Kunstner et al.’s “Heavy-Tailed Class
Imbalance and Why Adam Outperforms Gradient Descent on Language Models”
(NeurIPS 2024, arXiv 2402.19449) makes this argument with a deliberately
designed empirical study.
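Here is a deliberately small simulation in the spirit of that paper (my own toy, not their code): a linear softmax classifier over Zipf-distributed classes, trained for the same number of steps with SGD and with Adam, reporting the loss separately on the frequent half and the rare half of the classes. The learning rates are arbitrary, untuned choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy, untuned illustration of heavy-tailed class imbalance: a linear softmax
# classifier over Zipf-distributed classes, trained with SGD vs. Adam.
n_classes, dim, n_steps, batch = 64, 32, 2000, 128
freq = 1.0 / np.arange(1, n_classes + 1)                  # Zipf-ish frequencies
freq /= freq.sum()
class_means = rng.normal(size=(n_classes, dim)) / np.sqrt(dim)

def sample(n):
    y = rng.choice(n_classes, size=n, p=freq)
    x = class_means[y] + 0.1 * rng.normal(size=(n, dim))
    return x, y

def loss_by_rarity(W):
    x, y = sample(20_000)
    logits = x @ W.T
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nll = -logp[np.arange(len(y)), y]
    rare = y >= n_classes // 2                             # the rarer half
    return nll[~rare].mean(), nll[rare].mean()

def train(optimizer, lr):
    W = np.zeros((n_classes, dim))
    m, v = np.zeros_like(W), np.zeros_like(W)
    for t in range(1, n_steps + 1):
        x, y = sample(batch)
        logits = x @ W.T
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        p[np.arange(batch), y] -= 1.0                      # dL/dlogits
        g = p.T @ x / batch                                # gradient w.r.t. W
        if optimizer == "sgd":
            W -= lr * g
        else:                                              # Adam
            m = 0.9 * m + 0.1 * g
            v = 0.999 * v + 0.001 * g ** 2
            m_hat, v_hat = m / (1 - 0.9 ** t), v / (1 - 0.999 ** t)
            W -= lr * m_hat / (np.sqrt(v_hat) + 1e-8)
    return W

for opt, lr in [("sgd", 0.5), ("adam", 0.01)]:
    frequent, rare = loss_by_rarity(train(opt, lr))
    print(f"{opt:>4}: loss on frequent classes {frequent:.3f}, on rare classes {rare:.3f}")
```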
3. SGD’s update directions are too sharp.
A complementary line of work looks at the directional sharpness of the update — how much the loss curves along the direction you’re stepping in. Pan & Li’s “Toward Understanding Why Adam Converges Faster Than SGD for Transformers” (arXiv 2306.00204) argues that SGD’s update steps land in much sharper directions than Adam’s on transformers, that this is driven by a few coordinates with poorly behaved curvature, and that Adam’s coordinate-wise scaling is essentially a directional-sharpness reduction. They show that adding coordinate-wise gradient clipping to SGD recovers much of Adam’s behavior, which is suggestive of where the gap actually lives.
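The intervention is easy to sketch. Below is SGD with coordinate-wise (element-wise) gradient clipping, next to the more familiar global-norm clipping for contrast; the thresholds are arbitrary placeholder values, not the paper’s.

```python
import numpy as np

def sgd_coordwise_clip_step(w, g, lr=0.1, clip=0.01):
    """SGD with per-coordinate (element-wise) gradient clipping.

    Each coordinate of the gradient is clamped to [-clip, clip] on its own,
    so a handful of coordinates with huge gradients cannot dominate the step,
    which is roughly the effect Adam's division by sqrt(v) has on them.
    The clip value here is an arbitrary placeholder.
    """
    return w - lr * np.clip(g, -clip, clip)

def sgd_global_clip_step(w, g, lr=0.1, max_norm=1.0):
    """Standard global-norm clipping, for contrast.

    The whole gradient vector is rescaled, so its direction (and the
    sharpness along that direction) is unchanged.
    """
    scale = min(1.0, max_norm / (np.linalg.norm(g) + 1e-12))
    return w - lr * scale * g
```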
The seams
Where the textbook story gets less clean:
- Recent results challenge “Adam is necessary.” Srećković, Geiping & Orvieto’s “Is your batch size the problem? Revisiting the Adam-SGD gap in language modeling” (arXiv 2506.12543, 2025) argues the gap shrinks dramatically — sometimes to zero — when you use small batch sizes, proper gradient clipping, and momentum. Their reading: a lot of what we attributed to Adam might really be about Adam’s interaction with the large-batch regime that LLM training happens to live in. This is a working paper, not settled folklore, and I haven’t independently re-verified the experiments — but it’s a good reason to treat any one-line explanation skeptically.
- AdamW vs Adam. AdamW (Loshchilov & Hutter, ICLR 2019) is just Adam with weight decay applied to the weights directly, not via the gradient. With Adam, “L2 regularization” and “weight decay” stop being equivalent (because the gradient is rescaled by √v), and AdamW fixes it. Most modern LLM recipes use AdamW; when papers say “Adam,” it’s often AdamW under the hood. A two-function sketch of the difference follows this list.
- Adam costs memory. Two extra fp32 buffers per parameter. For a 100B-parameter model that’s roughly 800 GB of optimizer state, which is why Adafactor and 8-bit Adam got invented and why ZeRO/FSDP-style systems shard the state across devices. The cost is real; the field pays it because in practice plain SGD on a transformer at modern LLM batch sizes converges materially slower for the same compute — but see the seam above; “in practice” is doing work in that sentence.
- It is not a free lunch on generalization. A long-running thread in the literature says Adam tends to find sharper minima than SGD and can generalize worse in some image-classification setups. That’s a big part of why CNNs stuck with SGD + momentum, and why the transformers-use-Adam pattern was non-obvious in advance.
- Why the field standardized so fast (my read, not consensus). The Adam paper (Kingma & Ba, ICLR 2015, arXiv 1412.6980) shipped well before transformers existed. Vaswani et al.’s “Attention Is All You Need” (2017) used Adam as the default optimizer, and the recipe worked well enough at scale that it became the obvious starting point for everything that followed. Some of what looks like a principled choice today is plausibly path dependence — which is another reason the recent “actually, SGD can work too” papers are worth paying attention to.
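For the AdamW seam above, here is a minimal sketch (my paraphrase of the Loshchilov & Hutter point, not their code) of where the decay term enters in each variant.

```python
import numpy as np

def adam_l2_step(w, g, m, v, t, lr=1e-3, wd=0.01,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with L2-style decay: the decay term is folded into the gradient,
    so it then gets rescaled by 1/sqrt(v_hat) like everything else."""
    g = g + wd * w                               # decay enters the gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(w, g, m, v, t, lr=1e-3, wd=0.01,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """AdamW: the decay is applied to the weights directly, outside the
    adaptive rescaling, so every parameter decays at the same relative rate."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    return w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w), m, v
```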
The compressed lesson: the optimizer that wins is the one that matches the shape of your loss landscape. Vision and language produce different shapes — different curvature, different gradient distributions, different noise — and the same optimizer doesn’t dominate both.
Famous related terms
- SGD — SGD = "step against the gradient" + a single global learning rate. The baseline. Cheap, generalizes well in vision; struggles to learn rare-token directions in language.
- Momentum — momentum = SGD + a running average of past gradients. Helps SGD push through flat regions and dampens noise. Doesn’t, on its own, fix the per-parameter scale problem in transformers.
- Adam — Adam = SGD + momentum on the gradient + momentum on the squared gradient + divide by √(squared gradient). The default LLM optimizer.
- AdamW — AdamW = Adam + weight decay applied to weights, not gradients. Restores the regularization Adam accidentally breaks. What most modern LLM recipes actually use.
- KV cache — different memory bottleneck (inference, not training), but the same theme: the cost of the algorithm shows up as VRAM.
- Adafactor / 8-bit Adam — Adafactor ≈ Adam with the squared-gradient state factorized to save memory. Lives where you can’t afford the two extra fp32 buffers per parameter. A sketch of the factorization follows this list.
- Scaling laws — every optimizer choice is implicitly inside the scaling-law constants. Switching optimizers can change them.
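And for the Adafactor entry, a rough sketch of the factorization idea (my simplification of the Shazeer & Stern scheme, using a single gradient snapshot where the real optimizer keeps exponential moving averages):

```python
import numpy as np

# For an (n, m) weight matrix, Adam's v buffer stores n*m floats. Adafactor
# keeps only row and column statistics (n + m floats) and reconstructs a
# rank-1 estimate when it needs one. Shapes here are arbitrary.
n, m = 1024, 4096
g = np.random.randn(n, m).astype(np.float32)   # one gradient for this matrix

v_full = g ** 2                                # what Adam would keep (as an EMA)

r = (g ** 2).sum(axis=1)                       # per-row sums,    shape (n,)
c = (g ** 2).sum(axis=0)                       # per-column sums, shape (m,)
v_factored = np.outer(r, c) / r.sum()          # rank-1 reconstruction of v

print("floats in Adam's v buffer:", v_full.size)       # 4,194,304
print("floats stored by Adafactor:", r.size + c.size)  # 5,120
```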
Going deeper
- Adam: A Method for Stochastic Optimization (Kingma & Ba, ICLR 2015, arXiv 1412.6980). The original. Short, and the algorithm is more obvious from the pseudocode than from any blog post.
- Decoupled Weight Decay Regularization (Loshchilov & Hutter, ICLR 2019, arXiv 1711.05101). The AdamW paper. Worth reading if only for the precise statement of why “weight decay” and “L2 regularization” diverge under adaptive gradients.
- Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models (Kunstner et al., NeurIPS 2024, arXiv 2402.19449). Cleanest empirical argument for the rare-token story.
- Why Transformers Need Adam: A Hessian Perspective (Zhang et al., NeurIPS 2024, arXiv 2402.16788). The block-heterogeneity argument.
- Toward Understanding Why Adam Converges Faster Than SGD for Transformers (Pan & Li, arXiv 2306.00204). The directional-sharpness argument, with coordinate-wise clipping as the controlled experiment.
- Is your batch size the problem? Revisiting the Adam-SGD gap in language modeling (Srećković, Geiping & Orvieto, arXiv 2506.12543, 2025). The “wait, maybe SGD is fine if you use it right” counterpoint. Read it alongside the others; the truth is task- and regime-specific.