Why do scaling laws exist?
Bigger model, more data, more compute — and the loss falls along a straight line on a log-log plot for seven orders of magnitude. Nobody fully knows why that line is so straight.
Why it exists
Most of machine learning before 2020 had the texture of craft. You’d pick an architecture, fiddle with regularization, anneal a learning rate, and hope you’d squeezed another point of accuracy out of the benchmark. There was no reason to believe that bigger models, more data, or more compute would predictably keep paying off — let alone that they’d pay off along a clean mathematical curve. People assumed there was some near-term ceiling. Surely at some point you’d hit diminishing returns; surely the model would start memorizing; surely something would break.
What Kaplan and collaborators at OpenAI showed in early 2020 was that, for transformer language models trained on next-token prediction, nothing breaks. Not for a long, long way. Plot test cross-entropy loss against model size, dataset size, or compute on log-log axes, and you get a straight line — over roughly six orders of magnitude in non-embedding parameters and around eight in adjusted compute. Architecture details (depth vs width, head count) barely matter inside a wide range. What matters is N (parameters), D (data), and C (compute). Pick any two and you can predict the loss.
That changed how the field thought about progress. You stopped asking “is there a clever trick that wins this benchmark?” and started asking “how much compute do I need to get loss X?” A lot of frontier-lab work shifted from research into budgeting.
The really uncomfortable part: nobody has a satisfying first-principles explanation for why the line is that straight. There are partial accounts (we’ll get to them), but the empirical fact came first, and a lot of the theory is still catching up.
Why it matters now
Scaling laws are the thing that turned LLM training from research into engineering. They matter to a working engineer for three concrete reasons:
- They tell you when to stop. If a 10× increase in compute will buy you 0.5 nats of loss reduction, and your downstream task only cares about a 0.1-nat improvement, you can decide to spend less. Or, more often, the other way around: scaling laws are the argument used to justify a $100M training run, because the math says it’ll work. (A back-of-the-envelope version of this calculation appears just after this list.)
- They’re why “data-optimal” became a word. The 2022 Chinchilla paper from DeepMind (Hoffmann et al.) re-did the scaling-law fit more carefully and concluded GPT-3-era models were undertrained: at a fixed compute budget, you should make the model smaller and feed it more data. The result was the rule of thumb “~20 tokens per parameter” for compute-optimal training. Most open frontier models since 2023 (Llama, Mistral, DeepSeek, and others) follow that recipe at minimum, often pushing well past it on the data axis.
- They’re why post-training got interesting. Once pretraining loss is a known function of compute, the only remaining levers are what data you train on and what you do after pretraining. That’s a big part of why RLHF, distillation, and reasoning post-training have eaten so much of the field’s attention since 2023.
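To make the budgeting arithmetic in the first bullet concrete, here is a minimal sketch assuming loss falls as a pure power law in compute (the form is written out in the next section). The exponent and baseline loss are illustrative numbers, not values from any paper.

```python
# Back-of-the-envelope: is a 10x compute increase worth it?
# Assumes a pure power law in compute, L(C) = (C_c / C)^alpha_C, so multiplying
# compute by k multiplies loss by k ** (-alpha_C).
# alpha_C = 0.05 and the 2.5-nat baseline are illustrative, not fitted values.

alpha_C = 0.05          # illustrative compute exponent
baseline_loss = 2.5     # nats/token at the current compute budget

for factor in (2, 10, 100):
    new_loss = baseline_loss * factor ** (-alpha_C)
    print(f"{factor:>4}x compute: {baseline_loss:.2f} -> {new_loss:.2f} nats "
          f"(saves {baseline_loss - new_loss:.2f})")
```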
If you’ve ever wondered why every frontier model release reads like an industrial logistics report (“trained on 15T tokens for X exaflop-days”) rather than a research paper, this is why. The recipe is mostly known. The hard part is operating it.
The short answer
scaling law = loss falls as a power law in (N, D, C) — straight line on log-log axes
If you train transformer language models with the next-token objective and you don’t bottleneck on data or model size, the cross-entropy loss L satisfies, approximately:
L(N) ≈ (N_c / N)^α_N when data isn’t the limit
L(D) ≈ (D_c / D)^α_D when model size isn’t the limit
L(C) ≈ (C_c / C)^α_C for compute-optimal training
where N is parameter count, D is training tokens, C is compute (FLOPs), and the exponents α are small positive numbers (Kaplan et al. report α_N ≈ 0.076 and α_D ≈ 0.095 for their setup; the exact values depend on tokenizer and vocab). The point isn’t the specific numbers — it’s that the relationship is a power law and it holds across many orders of magnitude. (Kaplan et al. 2020.)
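The “straight line on log-log axes” claim doubles as the practical recipe for fitting these laws: take logs of both sides and run a linear regression. Here is a minimal sketch; the (N, loss) points are invented stand-ins for a sweep of training runs, and numpy’s polyfit stands in for whatever fitting machinery you would actually use.

```python
import numpy as np

# Hypothetical (parameter count, test loss) pairs standing in for a sweep of
# training runs -- generated to look like a noisy power law, purely for illustration.
N = np.array([1e6, 3e6, 1e7, 3e7, 1e8, 3e8, 1e9])
loss = np.array([5.1, 4.7, 4.3, 4.0, 3.7, 3.45, 3.2])

# L(N) ~ (N_c / N)^alpha_N  =>  log L = alpha_N * log N_c - alpha_N * log N,
# i.e. a straight line in (log N, log L). Fit it with ordinary least squares.
slope, intercept = np.polyfit(np.log(N), np.log(loss), deg=1)
alpha_N = -slope
N_c = np.exp(intercept / alpha_N)

print(f"alpha_N ~= {alpha_N:.3f}, N_c ~= {N_c:.2e}")

# Extrapolate: predicted loss for a model 10x larger than the biggest run.
# (Extrapolation is exactly where the law is least guaranteed -- see below.)
N_big = 1e10
print(f"predicted L({N_big:.0e}) ~= {(N_c / N_big) ** alpha_N:.2f}")
```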
How it works
It helps to look at this in three pieces: what was actually measured, what people argue about, and what nobody knows yet.
What was measured
Kaplan et al. trained a swarm of decoder-only transformer language models — varying model size from ~768 to ~1.5B non-embedding parameters, varying data from ~22M to ~23B tokens, varying compute. For each run, they plotted the test cross-entropy loss. Three findings, in plain English:
- Loss is a power law in each axis. Plot loss vs N (with enough data and compute), you get a straight line on log-log. Same for D, same for C. No bend.
- Architecture details are second-order. Across a wide range of layer counts and aspect ratios, the shape of the model barely changes the curve. The deviations show up at extremes (very few layers, or very lopsided depth/width ratios). What matters is how many non-embedding parameters total.
- Sample efficiency improves with size. A bigger model needs proportionally fewer tokens to reach a given loss. This sounds counterintuitive; we’ll come back to it.
Then in 2022, Hoffmann et al. (“Chinchilla”) trained ~400 models from 70M to 16B parameters on 5B–500B tokens, with several methodological differences from Kaplan — including matching the cosine learning-rate schedule’s cycle length to the actual training duration. With those changes, their conclusion shifted: at a fixed compute budget, the compute-optimal mix is equal scaling of model and data — roughly 20 training tokens per parameter, not the much smaller ratios Kaplan’s law implied. They tested it by training a 70B model (“Chinchilla”) on 1.4T tokens, matching Gopher’s compute but using a smaller model with more data, and beat Gopher (280B params, 300B tokens) decisively. Why exactly the two scaling laws disagreed has since been studied carefully; later work (Pearce & Song 2024, Porian et al. 2024) traces the gap to a mix of factors — non-embedding-vs-total parameter conventions, warmup duration, last-layer compute accounting, and optimizer tuning across scales — and argues that learning-rate decay is not the dominant cause.
That’s the headline result. GPT-3 (175B params, 300B tokens — about 1.7 tokens/param) was, by Chinchilla’s lights, dramatically undertrained.
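To see what the Chinchilla recipe implies in practice, here is a rough sketch using two standard approximations: training compute C ≈ 6·N·D FLOPs for dense transformers, and the ~20 tokens-per-parameter rule of thumb. Both are rules of thumb, not the paper’s actual fit.

```python
# Given a training compute budget C (FLOPs), split it "Chinchilla-style":
#   training FLOPs  C ~ 6 * N * D   (standard approximation for dense transformers)
#   data/model mix  D ~ 20 * N      (rough compute-optimal ratio)
# => N ~ sqrt(C / 120), D = 20 * N. Rule-of-thumb arithmetic, not an exact fit.

def chinchilla_split(C_flops: float, tokens_per_param: float = 20.0):
    N = (C_flops / (6.0 * tokens_per_param)) ** 0.5
    D = tokens_per_param * N
    return N, D

# GPT-3's rough training budget: 6 * 175e9 params * 300e9 tokens ~ 3.15e23 FLOPs.
C_gpt3 = 6 * 175e9 * 300e9
N_opt, D_opt = chinchilla_split(C_gpt3)
print(f"budget {C_gpt3:.2e} FLOPs -> ~{N_opt/1e9:.0f}B params on ~{D_opt/1e12:.1f}T tokens")
# By this arithmetic the same budget "wants" roughly a 50B model on ~1T tokens,
# which is the sense in which GPT-3 (175B on 0.3T) was undertrained.
```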
What people argue about
Two big asterisks have been put on the Chinchilla story:
- The fit is less precise than it looks. Besiroglu, Erdil, Barnett, and You (Epoch AI, 2024) re-extracted Hoffmann et al.’s data and refit their parametric model. They found the original confidence intervals were implausibly tight (would require >600,000 experiments; the team likely ran fewer than 500) and that the fitted exponents shift meaningfully on re-estimation. The “20 tokens per param” rule isn’t wrong in spirit, but the claim that we know it precisely is overstated.
- In practice, frontier models train way past compute-optimal. Llama 3 8B and 70B were trained on ~15T tokens — about 1875 tokens/param for the 8B (≈94× the Chinchilla-optimal ratio) and about 214 tokens/param for the 70B (≈11×). Meta says the models were still improving log-linearly at the end of training. The standard interpretation of why a lab would do this is inference-aware scaling: compute-optimal minimizes training cost, but if you’re going to deploy the model at scale, you care about total cost over its lifetime. A smaller model trained on more data is roughly the same training cost but a much cheaper forever cost. (See Sardana et al. 2024 for the explicit framing.) Meta hasn’t said in those words why they did it, but the engineering case is straightforward and consistent with their stated emphasis on inference efficiency.
Both of these are seams in the textbook account, and worth holding in mind whenever someone cites “Chinchilla-optimal” as if it were a constant of nature.
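The inference-aware argument from the second bullet is easy to sketch numerically. Using the common approximations of ~6·N·D FLOPs for training and ~2·N FLOPs per token at inference, plus a hypothetical lifetime serving volume, two models with the same training budget can have very different total costs:

```python
# Compare two models with the SAME training budget but different (N, D) splits.
# Approximations: training ~ 6*N*D FLOPs, inference ~ 2*N FLOPs per token.
# The lifetime serving volume is hypothetical; the shape of the tradeoff is the point.

def total_flops(N, D, lifetime_tokens):
    train = 6.0 * N * D                     # rough pretraining cost
    serve = 2.0 * N * lifetime_tokens       # rough lifetime inference cost
    return train, serve

train_budget = 6 * 70e9 * 1.4e12            # Chinchilla-style: 70B params, 1.4T tokens
lifetime_tokens = 10e12                     # hypothetical: 10T tokens served in deployment

configs = {
    "compute-optimal (70B, 1.4T)": (70e9, 1.4e12),
    "over-trained    (20B, 4.9T)": (20e9, train_budget / (6 * 20e9)),
}

for name, (N, D) in configs.items():
    train, serve = total_flops(N, D, lifetime_tokens)
    print(f"{name}: train {train:.2e} + serve {serve:.2e} = {train + serve:.2e} FLOPs")

# Caveat: at equal training compute the smaller model lands at somewhat higher
# pretraining loss; the bet is that cheaper serving (and extra data) is worth it.
```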
What nobody fully knows
Here’s the thing that should make you suspicious in a productive way: we have no settled, mechanistic theory for why the loss curve is a power law. There are several proposed explanations and they’re not yet reconciled. Roughly:
- Data-manifold / dimension arguments. If natural-language data lives on a manifold of effective dimension d, and you’re trying to approximate a target distribution with a finite model, generic approximation arguments give power-law error in model size with exponent ~1/d. Sketches of this style appear in work like Bahri et al. on neural scaling.
- Heavy-tailed distribution arguments. Token frequencies in natural language follow Zipf-like power laws. The loss decomposes into “easy” common patterns and a long tail of rare ones; the tail is what asymptotic loss is paying for, and its statistics produce the observed exponents.
- Information-theoretic floors. There’s an irreducible-entropy term in the loss equation (the part you can’t drive to zero — natural language has real entropy), and the power law describes how fast you close the gap to that floor.
None of these is a complete account. The honest summary: we have a robust empirical fact (loss is a power law in N, D, C across many orders of magnitude), several plausible theoretical sketches that each capture part of it, and active disagreement about which sketch is the right one — or whether the right one has been proposed yet.
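The irreducible-floor idea has a concrete form: Hoffmann et al. fit loss with the parametric law L(N, D) = E + A/N^α + B/D^β, where E is the entropy floor you can’t train away. A minimal sketch follows, with constants roughly at the magnitudes of the published fit; treat them as illustrative, since the replication critique above is precisely about how well they’re pinned down.

```python
# Chinchilla-style parametric loss: an irreducible floor E plus two power-law
# terms, one paid down by parameters and one by data.
# Constants are roughly the magnitudes reported by Hoffmann et al. (2022);
# treat them as illustrative, not authoritative (see the replication caveat above).
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

# Even "infinite" N and D can't push loss below E -- that's the entropy floor.
print(f"70B / 1.4T tokens: {loss(70e9, 1.4e12):.3f} nats/token")
print(f"10x both:          {loss(700e9, 14e12):.3f} nats/token")
print(f"floor E:           {E:.2f} nats/token")
```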
Where the law breaks
It’s also worth being clear about where the straight line stops being a safe assumption:
- Outside the regime measured. The original Kaplan curves go up to ~1.5B params and ~23B tokens. Frontier models are 2–3 orders of magnitude larger on each axis. Most of the time the curve has held; that’s not a guarantee.
- On downstream tasks, not loss. Cross-entropy loss is a power law. Accuracy on benchmarks often isn’t — it sits flat for a while, then jumps. Schaeffer et al. 2023 argue that many of those jumps (“emergent abilities”) are metric artifacts — discontinuous metrics over a smooth underlying improvement — though they don’t claim every capability jump is illusory. Either way, “loss is smooth” doesn’t translate to “capabilities are smooth.” (A toy version of the metric-artifact effect is sketched just after this list.)
- When data quality changes. The scaling laws were fit on a fixed mix. Switch to higher-quality data (filtering, deduplication, code-heavy mixes, synthetic data) and the constants shift. This is a big part of why post-2023 frontier labs spend so much effort on data — moving the intercept of the line is just as valuable as moving along it.
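The metric-artifact point from the second bullet above is easy to reproduce in a toy setting: let per-token accuracy improve smoothly with scale, and score the task by exact match over a ten-token answer. All numbers below are invented; the shape is the point.

```python
import numpy as np

# Toy model of the Schaeffer et al. argument: a smooth underlying quantity can
# look "emergent" under a discontinuous metric. All numbers are invented.
scale = np.logspace(0, 6, 7)                 # pretend compute, arbitrary units
per_token_acc = 1 - 0.9 * scale ** -0.3      # smooth, power-law-ish improvement

k = 10                                        # task scored by exact match on a
exact_match = per_token_acc ** k              # 10-token answer (all-or-nothing)

for s, p, em in zip(scale, per_token_acc, exact_match):
    print(f"scale {s:>8.0f}: per-token acc {p:.3f} -> exact match {em:.3f}")
# Per-token accuracy creeps up smoothly; exact match sits near zero for a while
# and then appears to "switch on" -- the jump lives in the metric, not the model.
```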
Famous related terms
- Power law — power law: y = a · x^k — straight line on log-log axes. The shape of nearly every scaling result. The fact that loss curves are straight on log-log is itself the load-bearing observation, not a triviality.
- Compute-optimal training — compute-optimal = pick (N, D) to minimize L for fixed C — the Chinchilla framing. Roughly 20 tokens per parameter for the original setup, with the caveats above.
- Inference-optimal training — inference-optimal = pick (N, D) to minimize total cost including deployment — why Llama 3 trains far past compute-optimal. Smaller model + more data is cheaper to serve forever.
- Emergent abilities — capabilities that appear sharply with scale and that smooth loss curves don’t predict. How much of the apparent emergence is real and how much is the metric is contested; Schaeffer et al. 2023 make the case that many cited examples are artifacts of discontinuous metrics over smooth underlying progress.
- Chinchilla — Chinchilla = 70B params + 1.4T tokens (~20 tokens/param) — DeepMind’s compute-optimal model that beat the much larger Gopher. The poster child for “more data, smaller model.”
- Kaplan scaling laws vs Chinchilla scaling laws — same shape, different exponents, different ratio recommendations. Reconciliation work (Pearce & Song 2024, Porian et al. 2024) traces the discrepancy to a mix of parameter-counting conventions, optimizer tuning, warmup, and last-layer compute — not primarily learning-rate decay.
- Loss vs capability — the load-bearing distinction in any scaling-laws conversation. The law is for cross-entropy loss; what users care about is downstream behavior. The map between the two is not itself a clean power law.
Going deeper
- Scaling Laws for Neural Language Models — Kaplan et al., 2020. The original. Read sections 1–3 first; the appendices are where the curve fits live.
- Training Compute-Optimal Large Language Models — Hoffmann et al., 2022. The Chinchilla paper. Three different methodologies that converge on the same “scale N and D equally” conclusion.
- Chinchilla Scaling: A Replication Attempt — Besiroglu et al., 2024. The careful critique of Chinchilla’s parametric fits. Required reading before quoting “20 tokens per parameter” with confidence.
- Reconciling Kaplan and Chinchilla Scaling Laws — Pearce & Song, 2024, and Resolving Discrepancies in Compute-Optimal Scaling of Language Models — Porian et al., 2024. Two different post-mortems on why Kaplan’s law and Chinchilla’s law disagreed, and what the actual difference was.
- Are Emergent Abilities of Large Language Models a Mirage? — Schaeffer, Miranda, Koyejo, 2023. The “emergence is a metric artifact” paper. Useful counterweight to the smooth-scaling narrative.
- Explaining neural scaling laws — Bahri et al., one of the more developed theoretical accounts. Worth reading even if (especially if) you don’t fully buy it.