
Why do scaling laws exist?

Bigger model, more data, more compute — and the loss falls along a straight line on a log-log plot for seven orders of magnitude. Nobody fully knows why that line is so straight.

AI & ML · intermediate · Apr 29, 2026

Why it exists

Most of machine learning before 2020 had the texture of craft. You’d pick an architecture, fiddle with regularization, anneal a learning rate, and hope you’d squeezed another point of accuracy out of the benchmark. There was no reason to believe that bigger models, more data, or more compute would predictably keep paying off — let alone that they’d pay off along a clean mathematical curve. People assumed there was some near-term ceiling. Surely at some point you’d hit diminishing returns; surely the model would start memorizing; surely something would break.

What Kaplan and collaborators at OpenAI showed in early 2020 was that, for transformer language models trained on next-token prediction, nothing breaks. Not for a long, long way. Plot test cross-entropy loss against model size, dataset size, or compute on log-log axes, and you get a straight line — over roughly six orders of magnitude in non-embedding parameters and around eight in adjusted compute. Architecture details (depth vs width, head count) barely matter inside a wide range. What matters is N (parameters), D (data), and C (compute). Pick any two and you can predict the loss.

That changed how the field thought about progress. You stopped asking “is there a clever trick that wins this benchmark?” and started asking “how much compute do I need to get loss X?” A lot of frontier-lab work shifted from research into budgeting.

The really uncomfortable part: nobody has a satisfying first-principles explanation for why the line is that straight. There are partial accounts (we’ll get to them), but the empirical fact came first, and a lot of the theory is still catching up.

Why it matters now

Scaling laws are the thing that turned LLM training from research into engineering. They matter to a working engineer for three concrete reasons:

  1. Budgeting. You can predict, before spending the money, roughly what loss a given compute budget will buy. The size of a training run gets decided on a spreadsheet, not by trial and error.
  2. Allocation. For a fixed budget, the laws tell you how to split it between model size and training tokens. That is the substance of the Chinchilla result below.
  3. De-risking. Because the curves extrapolate so cleanly, design choices can be tested in cheap small-scale runs and projected forward to the big one.

If you’ve ever wondered why every frontier model release reads like an industrial logistics report (“trained on 15T tokens for X exaflop-days”) rather than a research paper, this is why. The recipe is mostly known. The hard part is operating it.

The short answer

scaling law = loss falls as a power law in (N, D, C) — straight line on log-log axes

If you train transformer language models with the next-token objective and you don’t bottleneck on data or model size, the cross-entropy loss L satisfies, approximately:

L(N) ≈ (Nc / N)^αN     when data isn't the limit
L(D) ≈ (Dc / D)^αD     when model size isn't the limit
L(C) ≈ (Cc / C)^αC     for compute-optimal training

where N is parameter count, D is training tokens, C is compute (flops), and the exponents α are small positive numbers (Kaplan et al. report αN ≈ 0.076 and αD ≈ 0.095 for their setup; the exact values depend on tokenizer and vocab). The point isn’t the specific numbers — it’s that the relationship is a power law and it holds across many orders of magnitude. (Kaplan et al. 2020.)
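
To see what those small exponents buy you, here is a minimal numerical sketch. It uses the Kaplan et al. exponent quoted above and, for the scale constant Nc, roughly the value reported in that paper (about 8.8e13 non-embedding parameters); the constant only shifts the line up or down, while the exponent sets its slope.

    # Power-law scaling means each 10x increase in N multiplies the loss by the
    # same constant factor, which is exactly what "straight line on log-log
    # axes" means. Exponent from Kaplan et al. as quoted above; N_C is roughly
    # the paper's fitted constant and only sets the curve's overall height.
    ALPHA_N = 0.076
    N_C = 8.8e13  # scale constant, in non-embedding parameters

    def loss_from_params(n_params: float) -> float:
        """Predicted test loss when data and compute are not the bottleneck."""
        return (N_C / n_params) ** ALPHA_N

    for n in (1e6, 1e7, 1e8, 1e9):
        factor = loss_from_params(10 * n) / loss_from_params(n)
        print(f"N = {n:.0e}: loss ~ {loss_from_params(n):.3f}, "
              f"10x more params multiplies loss by {factor:.3f}")
    # The factor is the same every decade: 10 ** -ALPHA_N ~ 0.84.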

How it works

It helps to look at this in three pieces: what was actually measured, what people argue about, and what nobody knows yet.

What was measured

Kaplan et al. trained a swarm of decoder-only transformer language models — varying model size from ~768 to ~1.5B non-embedding parameters, varying data from ~22M to ~23B tokens, varying compute. For each run, they plotted the test cross-entropy loss. Three findings, in plain English:

  1. Loss is a power law in each axis. Plot loss vs N (with enough data and compute), you get a straight line on log-log. Same for D, same for C. No bend.
  2. Architecture details are second-order. Across a wide range of layer counts and aspect ratios, the shape of the model barely changes the curve. The deviations show up at extremes (very few layers, or very lopsided depth/width ratios). What matters is how many non-embedding parameters total.
  3. Sample efficiency improves with size. A bigger model needs proportionally fewer tokens to reach a given loss. This sounds counterintuitive; we’ll come back to it.
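
To make finding 1 concrete, here is a small sketch of how such a fit works. The (N, loss) pairs below are made-up numbers that follow a power law by construction, not Kaplan's measurements; the point is that a power law is a straight line in log-log space, so an ordinary least-squares line fit on the logs recovers the exponent.

    import numpy as np

    # Synthetic (N, loss) pairs that follow a power law by construction, with a
    # little noise. Made-up numbers for illustration, not Kaplan et al.'s data.
    rng = np.random.default_rng(0)
    n_params = np.logspace(6, 9, num=8)                     # 1e6 .. 1e9 params
    loss = 6.0 * n_params ** (-0.076) * np.exp(rng.normal(0.0, 0.01, size=8))

    # A power law L = k * N^(-alpha) is a straight line in log-log space:
    #   log10 L = log10 k - alpha * log10 N
    # so a least-squares line fit on the logs recovers the exponent.
    slope, intercept = np.polyfit(np.log10(n_params), np.log10(loss), deg=1)
    print(f"fitted alpha = {-slope:.3f}")    # close to the 0.076 we baked in
    print(f"fitted k     = {10 ** intercept:.2f}")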

Then in 2022, Hoffmann et al. (“Chinchilla”) trained ~400 models from 70M to 16B parameters on 5B–500B tokens, with several methodological differences from Kaplan — including matching the cosine learning-rate schedule’s cycle length to the actual training duration. With those changes, their conclusion shifted: at a fixed compute budget, the compute-optimal mix is equal scaling of model and data — roughly 20 training tokens per parameter, not the much smaller ratios Kaplan’s law implied. They tested it by training a 70B model (“Chinchilla”) on 1.4T tokens, matching Gopher’s compute but using a smaller model with more data, and beat Gopher (280B params, 300B tokens) decisively. Why exactly the two scaling laws disagreed has since been studied carefully; later work (Pearce & Song 2024, Porian et al. 2024) traces the gap to a mix of factors — non-embedding-vs-total parameter conventions, warmup duration, last-layer compute accounting, and optimizer tuning across scales — and argues that learning-rate decay is not the dominant cause.

That’s the headline result. GPT-3 (175B params, 300B tokens — about 1.7 tokens/param) was, by Chinchilla’s lights, dramatically undertrained.
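
Here is a back-of-envelope sketch of what "compute-optimal" means in practice, using two standard approximations: training compute C ≈ 6·N·D flops for a dense transformer, and the roughly 20-tokens-per-parameter ratio quoted above. With D = 20·N the budget becomes C ≈ 120·N², so the optimal model size is N ≈ sqrt(C / 120). Feeding Chinchilla's own budget through this rule roughly recovers its 70B/1.4T split; feeding GPT-3's budget through it lands near 50B parameters on about a trillion tokens, which is the sense in which it was undertrained.

    import math

    def compute_optimal_split(flops_budget: float, tokens_per_param: float = 20.0):
        """Split a training-compute budget between parameters and tokens.

        Uses two rough approximations: C ~ 6*N*D flops for a dense transformer,
        and the ~20 tokens/parameter rule of thumb from the Chinchilla result.
        """
        n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
        n_tokens = tokens_per_param * n_params
        return n_params, n_tokens

    # Chinchilla's actual budget: 70B params * 1.4T tokens -> C ~ 5.9e23 flops.
    # Feeding that budget back in recovers roughly the same 70B / 1.4T split.
    n, d = compute_optimal_split(6.0 * 70e9 * 1.4e12)
    print(f"Chinchilla budget: N ~ {n:.2e} params, D ~ {d:.2e} tokens")

    # GPT-3's budget (175B params * 300B tokens) under the same rule comes out
    # to roughly 50B params on ~1T tokens: a smaller model on much more data.
    n, d = compute_optimal_split(6.0 * 175e9 * 300e9)
    print(f"GPT-3 budget:      N ~ {n:.2e} params, D ~ {d:.2e} tokens")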

What people argue about

Two big asterisks have been put on the Chinchilla story:

  1. The fit itself has been questioned. A 2024 replication attempt by Epoch AI (Besiroglu et al.) could not reproduce one of Hoffmann et al.'s three estimation approaches and found the original fit's confidence intervals implausibly narrow, though the headline ratio of roughly 20 tokens per parameter survives more or less intact.
  2. "Compute-optimal" only counts training compute. Once inference cost matters, it often pays to train a smaller model far past 20 tokens per parameter, which is why the Llama-family models are deliberately overtrained by Chinchilla's standard.

Both of these are seams in the textbook account, and worth holding in mind whenever someone cites “Chinchilla-optimal” as if it were a constant of nature.

What nobody fully knows

Here’s the thing that should make you suspicious in a productive way: we have no settled, mechanistic theory for why the loss curve is a power law. There are several proposed explanations and they’re not yet reconciled. Roughly:

  1. Data-manifold arguments. Sharma & Kaplan (2020) argue the exponent is set by the intrinsic dimension of the data manifold: the network is effectively doing piecewise regression on a d-dimensional surface, and the resolution improves as a power of model size.
  2. Spectral and random-feature models. If the data's feature spectrum itself falls off as a power law, solvable models of regression (Bahri et al. 2021; Maloney, Roberts & Sully 2022) predict power-law loss in both parameters and data, with exponents tied to the spectrum.
  3. The quantization hypothesis. Michaud et al. (2023) model capability as a long Zipfian tail of discrete "quanta" of knowledge; bigger models learn more of the tail, and summing over it produces a smooth power law in aggregate.

None of these is a complete account. The honest summary: we have a robust empirical fact (loss is a power law in N, D, C across many orders of magnitude), several plausible theoretical sketches that each capture part of it, and active disagreement about which sketch is the right one — or whether the right one has been proposed yet.

Where the law breaks

It’s also worth being clear about where the straight line actually bends or stops applying:

  1. The floor. Cross-entropy can't drop below the irreducible entropy of the data, so as loss approaches that floor the curve has to flatten; fits like Chinchilla's include an explicit irreducible term for exactly this reason.
  2. Running out of data. The laws assume fresh tokens. Once you start repeating data, returns decay; Muennighoff et al. (2023) found roughly four epochs of reuse come nearly free, after which the benefit falls off quickly.
  3. Downstream metrics. The smooth power law is in next-token loss. Accuracy on individual benchmarks can jump, plateau, or look "emergent" in ways the loss curve doesn't predict, which is its own ongoing argument.
  4. Changing the recipe. The fitted constants are specific to an architecture family, tokenizer, and data distribution. Change any of those and the exponents have to be refit; the form of the law seems robust, the numbers are not.

Going deeper