Why model distillation exists
A small model trained on a big model's outputs often beats the same small model trained on the original labels. That isn't obvious — and why it works is the genuinely interesting part.
Why it exists
Pick a small open-weights model that punches above its size today — DeepSeek-R1-Distill-Qwen-7B is the cleanest example, but the broader pattern is visible across the “mini” tiers from frontier labs and many of the post-trained Llama-3.x 8B variants people actually run on a laptop. Peel back the training story for the cases where it’s public, and the same shape keeps appearing: at some stage of training, the small model learned from the outputs of a much larger model. That practice has a name: knowledge distillation.
The thing that should surprise you is that it works at all. The naive intuition is: a smaller network has less capacity, so the best you can hope for is “trains on the same data, ends up a bit worse.” Distillation ignores that bound. It says: take a giant teacher model, run it on a pile of inputs, record the whole probability distribution it produces (or, more recently, the actual generated text), and train a small student to imitate that. The student often ends up better than the same small architecture trained on the original ground-truth labels. That’s the trick — and it’s been re-discovered, generalized, and re-tooled three times across two decades.
The earliest version is Bucilă, Caruana and Niculescu-Mizil’s 2006 Model Compression paper, which trained small neural nets to mimic big ensembles by labelling a large unlabelled set with the ensemble’s predictions. The version most people mean today is Hinton, Vinyals and Dean’s 2015 Distilling the Knowledge in a Neural Network, which gave us the modern recipe: soft targets from a softened softmax. The version your inference bill cares about in 2026 is the one running quietly in production: small dense models distilled from frontier reasoning models on synthetic data.
Why it matters now
For an engineer shipping with LLMs, distillation is the single biggest reason “small enough to be cheap” and “good enough to be useful” overlap at all:
- The “mini” tiers you call are very likely distilled in some form. Closed-lab recipes aren’t public, so I’d treat that as strong inference rather than fact. The open ecosystem shows the pattern unambiguously: DeepSeek’s January 2025 R1 release shipped six dense distilled checkpoints — built on Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-14B, Qwen2.5-32B, Llama-3.1-8B, and Llama-3.3-70B-Instruct — fine-tuned on roughly 800k samples generated and curated with R1. The model card describes only this fine-tuning step for the distilled variants, not an additional RL stage. The 32B distilled checkpoint beat OpenAI’s o1-mini on AIME 2024, MATH-500, GPQA Diamond, and LiveCodeBench in the card’s own table. (DeepSeek-R1 model card)
- It’s how reasoning capability transfers downward. The expensive thing about a reasoning model is the RL run on a verifier. The cheap thing is fine-tuning a smaller base on the traces the big model already produced. Distillation is the bridge.
- It’s why synthetic data pipelines work for the labs that have a frontier model. A teacher is, in effect, a labelling machine for unlimited new examples.
- Latency and cost. A 7B dense student fits on a single consumer-class GPU; a frontier teacher (R1 is 671B total params, ~37B activated per token via mixture-of-experts) needs a serving cluster. The exact cost ratio depends on hardware, batch size, and serving stack — but it’s the difference between “runs on a laptop” and “rents a node.” If the student is good enough on your task, it changes the unit economics of the product.
The pragmatic version of the question, then, is “when can I get away with the small one?” — and to answer that you have to know what distillation actually transfers and what it doesn’t.
The short answer
distillation = train a small student on the teacher's full output distribution + (optionally) the original labels
Instead of teaching the student “the answer is class 7” with a one-hot label, you teach it “the teacher thought class 7 was 62% likely, class 3 was 18%, class 9 was 11%, …”. That extra structure — the shape of the teacher’s uncertainty — turns out to carry a lot of information that a hard label throws away.
For modern LLMs the same idea shows up in two flavours: matching the teacher’s per-token probability distribution (true distillation), or just training on text the teacher generated (often called distillation loosely; technically closer to “synthetic-data fine-tuning”). The two get conflated in casual usage, and the distinction matters in places.
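To make that concrete, here is a minimal numpy sketch. The four-class setup and all the numbers are made up for illustration; nothing comes from a real model.

```python
import numpy as np

# Illustrative 4-class example; all numbers are invented.
# Hard label: the dataset says class 0, full stop.
hard_label = np.array([1.0, 0.0, 0.0, 0.0])
# Soft target: the teacher's full distribution over the same classes.
soft_target = np.array([0.90, 0.07, 0.02, 0.01])

def cross_entropy(target, pred, eps=1e-12):
    """CE of a predicted distribution against a target distribution."""
    return -np.sum(target * np.log(pred + eps))

student_pred = np.array([0.70, 0.20, 0.08, 0.02])

# Against the hard label, only the first entry of student_pred matters.
print(cross_entropy(hard_label, student_pred))
# Against the soft target, the ratios among the "wrong" classes matter too.
print(cross_entropy(soft_target, student_pred))
```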
How it works
The original trick: soft targets
In a classifier, the standard training signal for an input is a one-hot vector: the correct class is 1.0, everything else is 0.0. Hinton’s 2015 paper observed something that, in retrospect, is obvious. The teacher’s softmax doesn’t just say “cat.” It says something like cat 0.90, dog 0.07, fox 0.02, truck 0.001. The ratio of “dog” to “truck” — both wrong — encodes real similarity information that the dataset’s hard label has erased. The phrase dark knowledge gets attached to this idea — it’s associated with Hinton from later talks rather than the 2015 paper itself, but it’s the label most people use.
To make those small numbers usable, you raise the softmax temperature T at both teacher and student during training:
p_i = exp(z_i / T) / Σ_j exp(z_j / T)
A higher T flattens the distribution and exposes the structure in the “wrong” classes. Hinton et al. report using temperatures from about 1 up to 20 in their experiments. The student is trained with a weighted sum of two losses: cross-entropy with the soft targets at high T, plus ordinary cross-entropy with the true labels at T = 1. After training, the student runs at T = 1 like any other model. (Hinton et al., 2015)
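In code, the recipe is a few lines. Below is a minimal PyTorch sketch of the combined loss, not the paper's own implementation; the function name and the default T and alpha values are illustrative choices you would tune per task.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hinton-style distillation: KL to the softened teacher + CE to hard labels.

    student_logits, teacher_logits: (batch, num_classes); labels: (batch,) ints.
    T and alpha are illustrative defaults, not values from the paper.
    """
    # Soft-target term at temperature T. Scaling by T*T keeps this term's
    # gradient magnitude roughly constant as T changes (noted in the paper).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-label term at T = 1: ordinary cross-entropy with the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```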
If you want one mental picture: the hard label tells the student what’s correct; the soft target tells it what’s almost correct. The “almost correct” map is the teaching.
Why a small model can match a big one (sometimes)
The capacity argument — “smaller networks must be worse” — is misleading. The real bottleneck for a small model is rarely raw parameter count; more often it’s finding the right function during training. A big model trained on noisy, finite labelled data has done the hard work of locating a good decision surface. Distilling it gives the student an effectively denser, smoother training signal: more nuanced labels, often on far more inputs (you can label as much unlabelled data as you can run inference on). The student is solving an easier optimization problem on a richer dataset.
This is exactly what Bucilă, Caruana and Niculescu-Mizil showed in 2006, before the deep-learning era: small neural nets trained to mimic the output of a large ensemble matched the ensemble’s accuracy while being, in their conclusion’s wording, roughly 1000× smaller and faster on average. The “more inputs, ensemble-labelled” part of their recipe is the part the modern LLM era leans on hardest. (Bucilă et al., 2006)
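The shape of that 2006 recipe fits in a short script. Here is a sketch with stand-in scikit-learn models and random data; note that Bucilă et al. matched the ensemble's continuous outputs (and generated their unlabelled pool with a scheme called MUNGE), while this simplification just uses hard pseudo-labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Small labelled set, large unlabelled pool (both synthetic here).
X_lab = rng.normal(size=(1_000, 20))
y_lab = (X_lab[:, 0] + X_lab[:, 1] > 0).astype(int)
X_pool = rng.normal(size=(50_000, 20))

# 1. Train the big teacher (here an ensemble) on the labelled data.
teacher = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_lab, y_lab)

# 2. Label the unlabelled pool with the teacher's predictions.
y_pseudo = teacher.predict(X_pool)

# 3. Train a much smaller student to mimic the teacher on the enlarged set.
student = MLPClassifier(hidden_layer_sizes=(16,), max_iter=200, random_state=0)
student.fit(X_pool, y_pseudo)
```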
Distillation in the LLM era: three things that get called the same name
Here’s where the term gets fuzzy. In current practice, “distillation” covers at least three setups:
- Logit / soft-target distillation. The classical Hinton recipe, adapted to LLMs: at every position, train the student to match the teacher’s full next-token distribution (or its top-k). Requires access to teacher logits. DistilBERT (Sanh et al., 2019) is a well-known example — about 40% smaller than BERT-base, retaining roughly 97% of its GLUE performance, around 60% faster at inference. (Sanh et al., 2019)
- Sequence / behavior distillation. Have the teacher generate text on a set of prompts, then fine-tune the student via ordinary supervised fine-tuning on those (prompt, teacher-output) pairs. This is what DeepSeek’s R1-Distill checkpoints used: ~800k reasoning traces from R1, plain SFT on the smaller bases, no RL. You only need the teacher’s samples, not its logits — which is why this works through a closed-API teacher. (A sketch of this loop follows below.)
- Self-distillation and synthetic-data pretraining. The teacher and student can be the same model architecture at different training stages, or the teacher can simply be a high-quality model used to generate filtered pretraining-style data for somebody else’s base. The line between “distillation” and “training on synthetic data” gets blurry here. I’d flag it as terminology drift in the field rather than a clean technical boundary.
The first kind is what the 2015 paper meant. The second kind is what the public DeepSeek-R1 example does. The third kind appears to be common practice in the current small-model ecosystem, but quantifying “how common” runs straight into the wall of closed-lab opacity.
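Here is the sequence-distillation loop flagged in the list above, in schematic form. Every name in it is a placeholder: teacher_generate stands in for whatever produces teacher samples, keep for the curation filter, sft_train for an ordinary fine-tuning harness.

```python
def distill_by_sampling(prompts, teacher_generate, keep, sft_train, student_base):
    """Sequence distillation: build an SFT set from teacher samples, then fine-tune.

    All five arguments are placeholders for your own stack; the point is the
    shape of the loop, not any particular API.
    """
    dataset = []
    for prompt in prompts:
        completion = teacher_generate(prompt)  # samples only; no logits needed,
                                               # so a closed-API teacher works
        if keep(prompt, completion):           # curation: verifiers, dedup, length caps
            dataset.append({"prompt": prompt, "completion": completion})
    # Ordinary supervised fine-tuning on (prompt, teacher output) pairs.
    return sft_train(student_base, dataset)
```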
Where the seams are
A few things worth knowing if you ever lean on a distilled model in production:
- The student inherits the teacher’s blind spots. Every confident mistake the teacher makes is now a high-quality training label. If the teacher is biased on a topic, the student will be too — often more so, because the student has less capacity to hold a contrarian view it didn’t see in training.
- Distillation does not magically transfer reasoning. The DeepSeek paper itself notes that pure SFT-on-traces gets the smaller dense models a long way, but the team explicitly contrasts this with the full RL pipeline used on R1 itself. Recipes that try to “distill RL policies” are an active research area; I don’t think there’s a clean consensus on which approach wins for small reasoning models.
- Confident copying of errors is the failure mode you can’t audit easily. When you train on human-labelled data, errors are noise; the model sees disagreement. When you train on a single teacher’s outputs, errors are signal.
- The legal and policy story varies by source. It’s not uniform. Meta’s Llama 3.1 license, for instance, explicitly permits using outputs to generate synthetic data and to distill into other models. Other providers’ terms restrict using their outputs to train competing models — OpenAI’s early-access terms include language of that shape — and enforcement and scope vary. Read the actual terms for the specific service and version you’re using; I don’t have a reliable read on how courts or regulators will eventually treat the broader question.
Famous related terms
- Soft targets —
soft target = teacher's full per-class probability vector (often softened with temperature T > 1). The original signal that made distillation work.
- Dark knowledge —
dark knowledge ≈ information hidden in the teacher's wrong-class probabilities — Hinton’s name for that structure. Not a separate technique; a vivid label for why soft targets help.
- Synthetic data fine-tuning —
SFT on synthetic = fine-tune student on (prompt, teacher-generated answer) pairs. The flavour that doesn’t need teacher logits — and the one most LLM-era “distillation” stories actually describe. See why fine-tuning is cheap.
- Quantization —
quantization ≈ same weights, fewer bits. A different compression axis: distillation changes the model’s function, quantization changes its representation. Usually composed.
- Pruning —
pruning = drop weights/heads that don't matter much + fine-tune to recover. The third leg of the model-compression stool.
- Small models —
small model = fewer parameters + same recipe + different deployment target — see why small models are getting good for the broader story this post is one mechanism inside.
Going deeper
- Bucilă, Caruana, Niculescu-Mizil, Model Compression (KDD 2006) — the paper that introduced the idea, pre-deep-learning. (PDF)
- Hinton, Vinyals, Dean, Distilling the Knowledge in a Neural Network (2015) — the soft-targets / temperature recipe everyone cites. (arXiv:1503.02531)
- Sanh, Debut, Chaumond, Wolf, DistilBERT (2019) — the canonical worked example of distilling a transformer. (arXiv:1910.01108)
- DeepSeek-AI, DeepSeek-R1 technical report and the R1-Distill model cards (2025) — the most detailed public account of distillation-from-a-reasoning-model in the current era. (model card)
What I’m confident about: the mechanical story (soft targets carry more information than hard labels; sequence distillation works through teacher samples alone) and the empirical pattern (today’s small frontier-quality models are mostly distilled, at least wherever the recipe is public). What I’m less confident about: the precise contribution of distillation vs. base data quality vs. architecture choices in any specific small-model success story. The closed labs don’t publish the breakdown, and reverse-engineering it from leaderboards is unreliable.