Why synthetic data works for modern LLM training
The supply of high-quality public text grew far more slowly than frontier training runs demanded, yet the models kept getting better. The new training signal didn't come from a fresh internet; it came from models writing for models, with filters in front.
Why it exists
If you have been paying attention to chatbots since 2023, something doesn’t quite add up. GPT-4 became GPT-4o became GPT-5. Claude 3 became 3.5 became 4.x. Llama 2 became 3 became 4. Every generation got visibly better at coding, at math, at long-form reasoning. But there hasn’t been a new internet to scrape — Reddit and StackOverflow and GitHub didn’t double in size every nine months. So where did the new training signal come from?
The short version is: a lot of it didn’t come from humans at all. The current frontier of LLM training leans heavily on text that other models wrote, filtered hard for the parts that actually teach. This is the thing called synthetic data, and it is one of the load-bearing ingredients of the 2024–2026 generation of models.
The framing comes from a 2022 paper by Villalobos and collaborators at Epoch AI, Will we run out of data? Limits of LLM scaling based on human-generated data (revised 2024). Their projection: at the rates labs were scaling pretraining sets, the stock of high-quality public human text would be exhausted somewhere between roughly 2026 and 2032. That is the data wall. The interesting question is what happened instead of hitting it — because models kept improving, and the wall didn’t visibly stop them.
Why it matters now
If you build on top of frontier models, almost everything you notice about their behavior past the base capability — the way they write code, the way they show their reasoning, the way they format math, the cleanliness of their chain-of-thought — is shaped by post-training on synthetic data. Meta’s The Llama 3 Herd of Models (2024) is the clearest public example: they explicitly describe generating synthetic data for code, math, reasoning, long-context, and tool use, and using it inside SFT and preference-pair generation.
For the closed labs — OpenAI, Anthropic, Google DeepMind — the exact training mixes for GPT-5, Claude 4.x, and Gemini are not public. We can see the shape of the recipe in the open papers and infer the rest, but I don’t have a reliable source for the precise synthetic-to-human ratios at the frontier, and anyone who claims a number is probably guessing. What is uncontroversial: synthetic data is no longer a side dish.
The short answer
synthetic data ≈ stronger model writes + filter for what teaches
You take a capable existing model (or a verifier, or a chain of both) and have it generate candidate training examples. Then you throw most of them away — keeping only the ones that pass some quality bar. The surviving examples become training data for the next model. The filter is doing as much work as the generator.
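In code, the loop is tiny. Here is a minimal sketch of the recipe as described above; `generate_candidates` and `passes_quality_bar` are placeholders for whatever generator and filter a given pipeline actually uses, not anyone's real API:

```python
import random


def generate_candidates(prompt: str, n: int) -> list[str]:
    """Placeholder for a capable generator model sampling n candidate examples."""
    # In a real pipeline this is a call to a teacher model, or to the model being improved.
    return [f"candidate answer {i} for: {prompt}" for i in range(n)]


def passes_quality_bar(example: str) -> bool:
    """Placeholder for the filter: a verifier, a stronger model, a rule, or a reward model."""
    # Real filters check correctness, format, length, or a preference score.
    return random.random() < 0.1  # most candidates get thrown away


def build_synthetic_dataset(prompts: list[str], samples_per_prompt: int = 32) -> list[dict]:
    dataset = []
    for prompt in prompts:
        candidates = generate_candidates(prompt, samples_per_prompt)
        survivors = [c for c in candidates if passes_quality_bar(c)]
        dataset.extend({"prompt": prompt, "completion": c} for c in survivors)
    return dataset  # this is the training set for the next model
```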
How it works
Three flavors of synthetic data dominate the open literature, and they shade into each other.
1. Curated generation from a stronger model
The cleanest example is Microsoft’s phi line. Textbooks Are All You Need (Gunasekar et al., 2023) trained a 1.3B-parameter code model, phi-1, on a mix of filtered web code and synthetic “textbook-style” Python content generated with GPT-3.5, plus synthetic exercises. Despite being orders of magnitude smaller than contemporaries, phi-1 reached 50.6% pass@1 on HumanEval — the headline result that made “textbook-quality synthetic data” a research direction rather than a curiosity.
The mechanism here is simple: GPT-3.5 already knows how to write a clean, well-commented Python tutorial. The web does not — the web is full of half-finished snippets, copy-pasted answers, and dead StackOverflow threads. Synthetic textbooks let phi-1 learn from concentrated, well-explained code instead of statistical sludge.
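Generation at this stage is mostly prompt engineering against a strong teacher. A sketch, with an illustrative prompt (the actual phi generation prompts are not public in full) and `call_teacher_model` standing in for whatever teacher API you use:

```python
# Illustrative prompt template; not taken from the phi papers.
TEXTBOOK_PROMPT = """Write a short textbook section that teaches {topic} in Python.
Include a clear explanation, one worked example with commented code,
and two exercises with solutions."""


def generate_textbook_section(topic: str, call_teacher_model) -> str:
    """call_teacher_model is a stand-in for any capable teacher model API."""
    return call_teacher_model(TEXTBOOK_PROMPT.format(topic=topic))


# Downstream filters (dedup, executing the embedded code, a quality classifier)
# decide which generated sections actually enter the training mix.
```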
The same recipe shows up in Llama 3’s post-training. The Llama 3 paper documents synthetic data pipelines for code (a “code expert” model generating SFT examples), for math, and for reasoning, with quality filters in front and preference-pair generation feeding DPO.
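The preference-pair half of that pipeline looks roughly like this. A sketch only: `generate` and `score` are placeholders, and the chosen/rejected format is the standard DPO input rather than a detail lifted from the Llama 3 paper:

```python
def build_preference_pairs(prompts: list[str], generate, score) -> list[dict]:
    """Sketch of turning sampled candidates into preference pairs for DPO.

    generate is a stand-in that samples several candidates per prompt;
    score is a stand-in for a reward model or verifier that ranks them.
    """
    pairs = []
    for prompt in prompts:
        candidates = generate(prompt)                       # several samples per prompt
        ranked = sorted(candidates, key=score, reverse=True)
        if len(ranked) >= 2 and score(ranked[0]) > score(ranked[-1]):
            pairs.append({"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]})
    return pairs  # chosen/rejected pairs are the standard DPO training format
```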
2. Distillation as synthetic data
A teacher model generating outputs that a student trains on is, mechanically, synthetic data — the teacher is the generator. The framing is just different: in distillation you care about compressing a big model into a small one; in synthetic-data work you care about producing examples that teach a behavior. The pipeline is the same. Most of the small open-weights models you have used in the last year were distilled this way, and the line between “distilled” and “trained on synthetic data” is mostly about which side of the conversation you are emphasizing.
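Seen as code, the overlap is obvious. A minimal sketch, with `teacher_generate` standing in for sampling from the larger model:

```python
def build_distillation_set(prompts: list[str], teacher_generate) -> list[dict]:
    """Distillation viewed as synthetic data: the teacher is the generator.

    teacher_generate is a stand-in for sampling from the larger model; the
    resulting (prompt, completion) pairs are ordinary SFT data for the student.
    """
    return [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]


# The student is then fine-tuned on this set exactly as it would be on
# human-written SFT data, which is why "distilled" and "trained on synthetic
# data" describe the same pipeline from different ends.
```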
3. Self-improvement on verifiable domains
This is the variant with the steepest growth curve, and it deserves its own post — why verifiable domains run away covers it in detail. The setup: in domains where you can automatically check whether an answer is correct (math problems with known answers, code with passing tests, formal proofs), the model can generate millions of attempts, keep only the ones that pass the check, and train on those.
This is why frontier reasoning models have improved on math and code so much faster than on, say, taste in poetry. The verifier is free. Rejection sampling turns “generate” into “generate-and-grade” — the grade is the filter that makes the synthetic data actually useful.
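A sketch of rejection sampling against a known-answer verifier, the simplest verifiable domain; `sample_solution` is a stand-in for sampling one worked solution from the model, and the data format is illustrative:

```python
def rejection_sample_math(problems: list[dict], sample_solution, attempts: int = 64) -> list[dict]:
    """Keep only attempts whose final answer matches the known answer.

    problems: dicts like {"question": ..., "answer": ...}
    sample_solution: stand-in for sampling a worked solution from the model;
                     assumed to return (reasoning_text, final_answer).
    """
    kept = []
    for problem in problems:
        for _ in range(attempts):
            reasoning, final_answer = sample_solution(problem["question"])
            if final_answer == problem["answer"]:  # the verifier: a known answer
                kept.append({"prompt": problem["question"], "completion": reasoning})
    return kept  # verified rollouts become SFT data or an RL training signal
```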
Why the filter is the magic
Naively, “train a model on its own output” sounds like it should be useless: the model can’t teach itself things it doesn’t already know. And the naive version is useless — sometimes worse than useless, see model collapse below. What changes the picture is that the filter isn’t the model itself. The filter is one of:
- a stronger model (teacher → student)
- a verifier (a checker, a unit test, a math grader)
- a rule (length, format, refusal pattern)
- a second model trained to predict human preference (the reward model from RLHF)
In every case there is information entering the training pipeline that didn’t come from the generator. The generator’s job is to produce a wide distribution of candidates; the filter’s job is to extract the parts that are actually correct or useful. The synthetic data is what survives the filter.
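A sketch of how those filter types compose in practice; every callable here (`run_tests`, `score`) is a placeholder for a real verifier or reward model, not an existing API:

```python
from typing import Callable

# A filter takes (prompt, candidate) and decides whether the candidate survives.
Filter = Callable[[str, str], bool]


def rule_filter(prompt: str, candidate: str) -> bool:
    """A cheap rule: length and refusal-pattern checks."""
    return 0 < len(candidate) < 4000 and not candidate.lower().startswith("i can't")


def make_verifier_filter(run_tests) -> Filter:
    """run_tests is a stand-in for executing a candidate against a test suite."""
    return lambda prompt, candidate: run_tests(candidate)


def make_reward_model_filter(score, threshold: float = 0.8) -> Filter:
    """score is a stand-in for a learned preference (reward) model."""
    return lambda prompt, candidate: score(prompt, candidate) >= threshold


def keep(prompt: str, candidate: str, filters: list[Filter]) -> bool:
    # The information entering training comes from these filters,
    # not from the generator that produced the candidate.
    return all(f(prompt, candidate) for f in filters)
```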
This is also why “synthetic data” and “distillation” and “RL on verifiable rewards” are deeply related — they are all the same trick wearing different hats: cheap candidate generation plus a more selective filter.
The seam: this is not a free lunch
If you train a model on its own outputs, and then train the next model on that model’s outputs, and so on, things go wrong. Shumailov et al., The Curse of Recursion: Training on Generated Data Makes Models Forget (2023), showed that recursive training on model-generated data causes model collapse: the tails of the distribution disappear, rare modes get lost, and the model converges to a narrower and narrower slice of what its ancestors knew.
The reason synthetic data works in practice — and doesn’t collapse frontier models — is that the recipe in production isn’t “model trains on its own output, repeat.” It’s:
- A stronger model generates for a weaker one (phi-1 from GPT-3.5; small open models from larger ones). Information flows downhill.
- A verifier filters (only passing solutions survive). The verifier is anchored in something real (a test suite, a known answer), not in the generator.
- Synthetic is mixed with human and web data, not used alone. The Llama 3 paper is explicit about this. The exact ratios aren’t public for the closed labs, but every public recipe I’m aware of mixes.
There are also subtler costs even when collapse is avoided: synthetic data inherits the teacher’s biases, its refusal patterns, its stylistic tics, and its blind spots. If the teacher is wrong in a systematic way, the student inherits the wrongness with high confidence. This is one of the reasons frontier post-training pipelines run many generators, multiple filters, and constant evaluation against held-out human data — to catch the drift before it bakes in.
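For the mixing point specifically, a sketch; the 30% synthetic fraction is purely illustrative, since the real ratios aren't public:

```python
import random


def build_training_mix(human_examples: list, synthetic_examples: list,
                       synthetic_fraction: float = 0.3, seed: int = 0) -> list:
    """Blend synthetic data into a human/web corpus rather than replacing it.

    The 0.3 fraction is illustrative only; frontier ratios are not public.
    """
    rng = random.Random(seed)
    n_synth = int(len(human_examples) * synthetic_fraction / (1 - synthetic_fraction))
    sampled = rng.sample(synthetic_examples, min(n_synth, len(synthetic_examples)))
    mix = list(human_examples) + sampled
    rng.shuffle(mix)
    return mix  # held-out human data stays outside the mix, for evaluation
```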
The standard account is that synthetic data extended the scaling era past where the data wall would have stopped it. Whether it can keep doing so indefinitely — whether the trick has another order of magnitude in it, or whether we’re already in diminishing returns — is genuinely contested, and not something I’d bet either way on.
Famous related terms
- Distillation — distillation = teacher model + student model + train student on teacher's outputs. Synthetic data with a compression goal.
- Rejection sampling — rejection sampling = generate many + score each + keep the winners. The filter half of synthetic data, especially powerful in verifiable domains.
- RLVR — RLVR = RL loop + reward from a verifier instead of a learned reward model. The training signal is, in effect, filtered synthetic rollouts.
- Model collapse — model collapse ≈ distribution narrowing under recursive self-training. The failure mode synthetic-data pipelines design around.
- Data wall — data wall ≈ projected exhaustion of high-quality public human text. The Villalobos et al. framing the rest of this post sits inside.
- Phi series (Microsoft) — small models trained explicitly on filtered + synthetic “textbook-quality” data; the canonical public example of the recipe.
Going deeper
- Villalobos et al., Will we run out of data? Limits of LLM scaling based on human-generated data (2022, revised 2024) — the data-wall framing.
- Gunasekar et al., Textbooks Are All You Need (2023) — phi-1, the cleanest public demonstration that synthetic textbook-style data trains capable code models.
- Meta AI, The Llama 3 Herd of Models (2024) — the most detailed public account of synthetic data inside a frontier post-training pipeline (code, math, reasoning, tool use, long context).
- Shumailov et al., The Curse of Recursion: Training on Generated Data Makes Models Forget (2023) — the model-collapse paper. Read this before assuming synthetic data is free.
- Liu et al., Best Practices and Lessons Learned on Synthetic Data for Language Models (2024, Google DeepMind) — survey-style overview of where synthetic data has worked and where it hasn’t.
What I’m confident about: the data-wall framing, that phi and Llama 3 used synthetic data in the documented ways, that model collapse is real under recursive self-training, and that the filter is doing most of the work. What I’m less confident about: the exact synthetic-to-human ratios in current frontier training mixes (GPT-5, Claude 4.x, Gemini). Those mixes aren’t public, and the trustworthy thing to say is that I don’t know.