Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why dropout disappeared from modern LLMs

Dropout was the regularization workhorse of the deep-learning era. Frontier LLM pretraining quietly stopped using it. The reason isn't that dropout broke — it's that the problem dropout solved stopped being the problem.

AI & ML · intermediate · May 2, 2026

Why it exists

Picture two ways to study for an exam. The first: re-read the same textbook five times until you’ve practically memorized the page numbers. The second: read fifty different textbooks once each. The first student crushes practice questions taken from that textbook and then bombs the real exam, because they learned the book, not the subject. The second student has never seen any single passage twice — and ends up understanding the material better. Dropout was invented for the first kind of student. Modern LLM pretraining looks like the second kind, and that’s most of the story.

In the 2010s, neural networks were trained on relatively small, repeated datasets, and they would happily overfit — fit the training set perfectly while failing on anything new. Srivastava, Hinton and coauthors published Dropout: A Simple Way to Prevent Neural Networks from Overfitting in JMLR in 2014, and the trick was as memorable as the name: during training, randomly zero out a fraction of activations on every forward pass. The network can’t lean on any single neuron, so it learns more distributed, robust features. It worked, it was cheap to implement, and it became the default — most canonical CNN recipes and almost every BERT-era transformer recipe shipped with dropout on. The original Attention Is All You Need (Vaswani et al., 2017) used Pdrop = 0.1 in its base model (and 0.3 in one of the big variants); BERT (Devlin et al., 2018) used 0.1 in pretraining and fine-tuning.

Then the data and the models got much bigger, and the same researchers who’d shipped dropout in every paper quietly stopped turning it on.

Why it matters now

This isn’t just historical trivia — it’s a working example of how scaling changes which knobs matter.

The short answer

dropout-free pretraining = scale + data diversity > stochastic regularization

Dropout fights overfitting by injecting noise so the network can’t lean on individual features. At internet-scale pretraining, the model sees most tokens roughly once, and the data distribution is broad enough that the repeated-example overfitting pressure dropout was designed to fight is largely absent. (LLMs still memorize — verbatim recall of training data is a real, measured phenomenon — but that’s a different failure mode than the one dropout addresses.) Data scale and diversity do most of the regularization work for free, and the per-step noise dropout adds stops earning its keep.

How it works

To see why scale changes the equation, it helps to remember what dropout actually was doing.

Dropout sets a random subset of activations to zero on each training step (typically 10–50% in the 2010s). The standard story has two parts. (1) It approximates an exponentially large ensemble: each forward pass is a different sub-network, and at inference you “average” them by using the full network with rescaled activations. (2) It prevents co-adaptation: no neuron can rely on any specific other neuron being present, so features have to stand on their own.
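The mechanics fit in a few lines. Here is a minimal sketch of inverted dropout — the variant most frameworks implement, where survivors are rescaled during training so inference can use the full network unchanged. The function name and list-based representation are illustrative, not any library’s API:

```python
import random

def dropout(xs, p_drop, training, rng):
    """Inverted dropout over a list of activations.

    During training, each activation is zeroed with probability p_drop and
    the survivors are scaled by 1/(1 - p_drop), so the expected value of
    each unit matches the full network used at inference time.
    """
    if not training or p_drop == 0.0:
        return list(xs)  # inference: full network, no rescaling needed
    keep = 1.0 - p_drop
    return [x / keep if rng.random() < keep else 0.0 for x in xs]

rng = random.Random(0)
acts = [1.0] * 100_000
out = dropout(acts, p_drop=0.1, training=True, rng=rng)
# Roughly 10% of entries are zeroed; survivors become 1/0.9,
# so the mean stays close to 1.0 in expectation.
```

The rescaling is the detail that makes the ensemble-averaging story work: at inference, the unmodified full network already computes the average sub-network’s expected output.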

Both stories assume the model is in a regime where the same training examples are seen many times and the network can latch onto incidental co-adaptations. That regime used to be the default — a CNN trained on ImageNet sees each image dozens of times across epochs. It isn’t anymore for frontier pretraining.
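The regime shift is easy to put in numbers. A back-of-envelope comparison — the figures here are illustrative assumptions (a typical 2010s ImageNet recipe and an assumed ~15T-token single-pass corpus), not measurements from any specific run:

```python
# Rough, illustrative numbers -- not from any specific training run.
imagenet_images = 1_280_000       # ImageNet-1k training set, approx.
cnn_epochs = 90                   # a common 2010s ImageNet recipe length
views_per_image = cnn_epochs      # every image is revisited each epoch

corpus_tokens = 15e12             # assumed web-scale corpus, ~15T tokens
tokens_trained_on = 15e12         # single pass: each token used about once
views_per_token = tokens_trained_on / corpus_tokens

print(views_per_image)            # 90 repeat exposures per example
print(views_per_token)            # ~1: most data is never seen twice
```

Ninety exposures per example is exactly the memorize-the-textbook regime dropout was built for; one exposure per token is the fifty-textbooks regime where it has little to push against.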

Frontier LLM pretraining operates in a different regime: training is roughly single-epoch, the data is orders of magnitude broader, and the recipes that came out of that regime reflect it.

What do open-weight technical reports actually say? The LLaMA 1 paper (Touvron et al., 2023) does not report using dropout in pretraining. Pythia’s released config sets attention and hidden dropout to 0 for pretraining. Other open-weight reports in the same era describe similar setups — low or zero dropout for pretraining, sometimes nonzero for downstream fine-tuning. The exact configurations of closed frontier models (GPT-4, Claude, Gemini) are not public, so I can’t tell you their dropout rates. What’s public is the trend in open-weight reports and the underlying argument: when the data is doing the regularizing, the noise injection isn’t earning its slot.
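The pattern the reports describe can be summarized as a config sketch. These field names and values are hypothetical, chosen to mirror the trend — they are not copied from any released model’s actual configuration file:

```python
# Hypothetical configs illustrating the trend in open-weight reports;
# field names and values are not from any real model's released files.
pretrain_config = {
    "hidden_dropout": 0.0,      # single-epoch, diverse data regularizes
    "attention_dropout": 0.0,   # noise injection isn't earning its slot
}
finetune_config = {
    "hidden_dropout": 0.1,      # small, repeated dataset: the 2010s
    "attention_dropout": 0.1,   # overfitting regime returns
}
```

The asymmetry is the point: fine-tuning on a small dataset for multiple epochs recreates the repeated-example regime, so dropout reappears exactly where its original assumptions hold again.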

The honest seam: a 2025 ACL Findings paper (Liu et al.) studies dropout in single-epoch language-model pretraining at the BERT-base / Pythia 160M–1.4B scale and reports that dropout-free training is competitive or better in that regime — that’s the closest published ablation I’m aware of. But it isn’t 70B+ frontier scale; a clean head-to-head holding tokens, parameters, optimizer, and precision fixed at frontier scale isn’t something I’ve seen. The case in this post is partly mechanistic (the data-diversity argument), partly observational (open-weight recipes converged on low/zero dropout), and partly path-dependent. The safe claim is: dropout’s role shrank dramatically as pretraining scale grew, not that it has been formally proven harmful.

Going deeper