Why dropout disappeared from modern LLMs
Dropout was the regularization workhorse of the deep-learning era. Frontier LLM pretraining quietly stopped using it. The reason isn't that dropout broke — it's that the problem dropout solved stopped being the problem.
Why it exists
Picture two ways to study for an exam. The first: re-read the same textbook five times until you’ve practically memorized the page numbers. The second: read fifty different textbooks once each. The first student crushes practice questions taken from that textbook and then bombs the real exam, because they learned the book, not the subject. The second student has never seen any single passage twice — and ends up understanding the material better. Dropout was invented for the first kind of student. Modern LLM pretraining looks like the second kind, and that’s most of the story.
In the 2010s, neural networks were trained on relatively small, repeated datasets, and they would happily overfit — fit the training set perfectly while failing on anything new. Srivastava, Hinton and coauthors published Dropout: A Simple Way to Prevent Neural Networks from Overfitting in JMLR in 2014, and the trick was as memorable as the name: during training, randomly zero out a fraction of activations on every forward pass. The network can’t lean on any single neuron, so it learns more distributed, robust features. It worked, it was cheap to implement, and it became the default — most canonical CNN recipes and almost every BERT-era transformer recipe shipped with dropout on. The original Attention Is All You Need (Vaswani et al., 2017) used P_drop = 0.1 in its base model (and 0.3 in one of the big variants); BERT (Devlin et al., 2018) used 0.1 in pretraining and fine-tuning.
Then the data and the models got much bigger, and the same researchers who’d shipped dropout in every paper quietly stopped turning it on.
Why it matters now
This isn’t just historical trivia — it’s a working example of how scaling changes which knobs matter:
- Recipes copied from BERT-era code mostly still set `dropout=0.1`. If you fine-tune or pretrain a small model with that default, you’re paying a regularization tax that may not buy you anything at your scale. The defaults outlived the regime they were designed for (see the config sketch after this list).
- Fine-tuning is a different regime from pretraining. When you fine-tune on a small task-specific dataset, you are the 2015 student re-reading one textbook. Dropout (and its cousins like attention dropout) often comes back on for fine-tuning and LoRA adapters even when pretraining ran without it.
- The mental model “more regularization = better generalization” stops pulling its weight at scale. When you’ve spent a Chinchilla-optimal compute budget, anything that adds noise to gradients without buying matching generalization eats into effective steps. Whether dropout specifically interacts badly with fused attention kernels or low-precision activations isn’t something I can cite cleanly; the more defensible point is just that the upside shrank as data scale grew, while the engineering surface area didn’t.
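To see the inherited default for yourself, here is a minimal check using the HuggingFace `transformers` library and the standard BERT config knobs (a sketch, not a recommendation either way):

```python
# Inspecting BERT-era dropout defaults (requires: pip install transformers).
from transformers import BertConfig

cfg = BertConfig.from_pretrained("bert-base-uncased")
print(cfg.hidden_dropout_prob)           # 0.1 -- the default most recipes inherit
print(cfg.attention_probs_dropout_prob)  # 0.1

# Turning the tax off for a small-scale experiment is one override per knob:
cfg_off = BertConfig.from_pretrained(
    "bert-base-uncased",
    hidden_dropout_prob=0.0,
    attention_probs_dropout_prob=0.0,
)
```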
The short answer
`dropout-free pretraining = scale + data diversity > stochastic regularization`
Dropout fights overfitting by injecting noise so the network can’t lean on individual features. At internet-scale pretraining, the model sees most tokens roughly once, and the data distribution is broad enough that the repeated-example overfitting pressure dropout was designed to fight is largely absent. (LLMs still memorize — verbatim recall of training data is a real, measured phenomenon — but that’s a different failure mode than the one dropout addresses.) Data scale and diversity do most of the regularization work for free, and the per-step noise dropout adds stops earning its keep.
How it works
To see why scale changes the equation, it helps to remember what dropout actually was doing.
Dropout sets a random subset of activations to zero on each training step (typically 10–50% in the 2010s). The standard story has two parts. (1) It approximates an exponentially large ensemble: each forward pass is a different sub-network, and at inference you “average” them by using the full network with rescaled activations. (2) It prevents co-adaptation: no neuron can rely on any specific other neuron being present, so features have to stand on their own.
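Concretely, the mechanism fits in a few lines. Here is a minimal sketch of inverted dropout, the variant `torch.nn.Dropout` and most modern frameworks implement; it folds the rescaling into training so that inference just runs the full network unchanged:

```python
import torch

def inverted_dropout(x: torch.Tensor, p: float, training: bool = True) -> torch.Tensor:
    # Zero each activation independently with probability p, then scale the
    # survivors by 1/(1-p) so the expected activation is unchanged. Doing the
    # rescale at train time (the "inverted" variant) is what lets inference
    # use the full network as-is.
    if not training or p == 0.0:
        return x  # inference: full network, no noise, no rescale
    mask = (torch.rand_like(x) >= p).to(x.dtype)
    return x * mask / (1.0 - p)

x = torch.ones(8)
print(inverted_dropout(x, p=0.5))                  # roughly half zeros, survivors become 2.0
print(inverted_dropout(x, p=0.5, training=False))  # identical to x
```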
Both stories assume the model is in a regime where the same training examples are seen many times and the network can latch onto incidental co-adaptations. That regime used to be the default — a CNN trained on ImageNet sees each image dozens of times across epochs. It isn’t anymore for frontier pretraining.
Frontier LLM pretraining looks different in three ways that matter:
- Single-pass-ish data. Pretraining runs are typically one epoch or close to it over trillions of tokens. LLaMA 1’s data mix, for example, uses most components for one epoch, with Wikipedia and books at roughly two. The model rarely sees the same exact sequence many times.
- The dataset itself does much of the regularizing. Web text, code, books, math, multilingual data — the distribution is broad enough that any feature the model learns has to pay rent across many domains. This is the field’s working mental model, not a proven mechanism, but it lines up with what open recipes converged on.
- Compute, not variance, is the binding constraint. Once you’re allocating tokens and parameters under a fixed compute budget (the Chinchilla framing), anything that injects noise into gradients without buying matching generalization eats into effective steps. At small scale, dropout’s noise pays for itself in better generalization. At large scale with diverse data, that bargain shifts.
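To make the third point concrete, here is a back-of-the-envelope sketch of the budgeting; the 20-tokens-per-parameter ratio is the common rule-of-thumb reading of Hoffmann et al. (2022), not an exact constant, and the FLOPs formula is the standard approximation:

```python
# Chinchilla-style compute-optimal budgeting, back-of-the-envelope.
# Assumptions: ~20 tokens per parameter and ~6 FLOPs per parameter per token.
TOKENS_PER_PARAM = 20
FLOPS_PER_PARAM_TOKEN = 6

def optimal_tokens(n_params: float) -> float:
    return TOKENS_PER_PARAM * n_params

def train_flops(n_params: float, n_tokens: float) -> float:
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens

params = 7e9  # a 7B-parameter model
tokens = optimal_tokens(params)
print(f"compute-optimal tokens: {tokens:.1e}")                       # 1.4e+11
print(f"approx training FLOPs:  {train_flops(params, tokens):.1e}")  # 5.9e+21
# Every gradient step comes out of this fixed budget, so per-step noise that
# doesn't buy matching generalization is effectively lost steps.
```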
What do open-weight technical reports actually say? The LLaMA 1 paper (Touvron et al., 2023) does not report using dropout in pretraining. Pythia’s released config sets attention and hidden dropout to 0 for pretraining. Other open-weight reports in the same era describe similar setups — low or zero dropout for pretraining, sometimes nonzero for downstream fine-tuning. The exact configurations of closed frontier models (GPT-4, Claude, Gemini) are not public, so I can’t tell you their dropout rates. What’s public is the trend in open-weight reports and the underlying argument: when the data is doing the regularizing, the noise injection isn’t earning its slot.
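If you want to check the Pythia claim yourself, the sketch below goes through the `transformers` port; the `hidden_dropout` / `attention_dropout` field names are my assumption for the `GPTNeoXConfig` class, so treat the released NeoX YAMLs in the Pythia repo as the authoritative source for the pretraining values:

```python
# Checking Pythia's released dropout settings via HuggingFace transformers.
# Assumption: GPTNeoXConfig exposes `hidden_dropout` and `attention_dropout`;
# if these names don't exist in your version, fall back to the Pythia repo's
# GPT-NeoX YAML configs.
from transformers import GPTNeoXConfig

cfg = GPTNeoXConfig.from_pretrained("EleutherAI/pythia-1.4b")
print(getattr(cfg, "hidden_dropout", "field not present"))     # expected: 0.0
print(getattr(cfg, "attention_dropout", "field not present"))  # expected: 0.0
```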
The honest seam: a 2025 ACL Findings paper (Liu et al.) studies dropout in single-epoch language-model pretraining at the BERT-base / Pythia 160M–1.4B scale and reports that dropout-free training is competitive or better in that regime — that’s the closest published ablation I’m aware of. But it isn’t 70B+ frontier scale; a clean head-to-head holding tokens, parameters, optimizer, and precision fixed at frontier scale isn’t something I’ve seen. The case in this post is partly mechanistic (the data-diversity argument), partly observational (open-weight recipes converged on low/zero dropout), and partly path-dependent. The safe claim is: dropout’s role shrank dramatically as pretraining scale grew, not that it has been formally proven harmful.
Famous related terms
- Weight decay — `weight decay = loss + λ·||weights||²` — the other classic regularizer. Unlike dropout, it survived the scale transition and is standard in modern LLM training. It penalizes weight magnitude rather than injecting activation noise, which is friendlier to large-batch optimization (a minimal sketch follows this list).
- Label smoothing — `label smoothing = one-hot target + small uniform mass` — softens the training target. Used in the original transformer; usage in modern LLM pretraining is mixed and not always disclosed.
- Data augmentation — `data augmentation = training set + label-preserving transformations` — the vision-world cousin of “more data fixes it.” LLMs effectively get this for free from the diversity of web text.
- Why scaling laws exist — `scaling laws ≈ loss falls predictably as parameters, data, and compute grow` — the broader story for why “make it bigger and feed it more data” reshaped which tricks matter.
- Why fine-tuning is cheap — `fine-tuning ≈ pretraining minus building representations from scratch` — the regime where dropout often comes back on, because the data is small and overfitting is real again.
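For the two survivors at the top of this list, the modern incarnations are one-liners in PyTorch; the specific values below are illustrative, in the range open-weight reports tend to use:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)  # stand-in for a real transformer

# Weight decay, decoupled as in AdamW: the penalty is applied directly to the
# weights each step rather than added to the loss as λ·||w||². A value of 0.1
# shows up in several open-weight LLM recipes.
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

# Label smoothing: move a little target mass off the one-hot class. The
# original transformer used 0.1; modern pretraining usage is mixed.
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)
```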
Going deeper
- Srivastava, Hinton et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014 — jmlr.org/papers/v15/srivastava14a.html. The original.
- Vaswani et al., Attention Is All You Need, 2017 — arxiv.org/abs/1706.03762. The transformer paper, which used dropout.
- Touvron et al., LLaMA: Open and Efficient Foundation Language Models, 2023 — arxiv.org/abs/2302.13971. One of the open-weight recipes where dropout’s diminished role in pretraining is visible.
- Hoffmann et al., Training Compute-Optimal Large Language Models (Chinchilla), 2022 — arxiv.org/abs/2203.15556. The compute/data-scaling reframing that pushed “more tokens per parameter” into mainstream practice.
- Liu et al., ACL 2025 Findings paper on dropout in single-epoch language-model pretraining (BERT-base / Pythia 160M–1.4B). Closest open-literature ablation I’ve seen on the dropout-vs-no-dropout question — search the ACL Findings 2025 listings for the exact citation.