Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why model distillation exists

A small model trained on a big model's outputs often beats the same small model trained on the original labels. That isn't obvious — and the reason it works is the actually interesting part.

AI & ML · intermediate · Apr 29, 2026

Why it exists

Pick a small open-weights model that punches above its size today — DeepSeek-R1-Distill-Qwen-7B is the cleanest example, but the broader pattern is visible across the “mini” tiers from frontier labs and many of the post-trained Llama-3.x 8B variants people actually run on a laptop. Peel back the training story for the cases where it’s public, and the same shape keeps appearing: at some stage of training, the small model learned from the outputs of a much larger model. That practice has a name: knowledge distillation.

The thing that should surprise you is that it works at all. The naive intuition is: a smaller network has less capacity, so the best you can hope for is “trains on the same data, ends up a bit worse.” Distillation ignores that bound. It says: take a giant teacher model, run it on a pile of inputs, record the whole probability distribution it produces (or, more recently, the actual generated text), and train a small student to imitate that. The student often ends up better than the same small architecture trained on the original ground-truth labels. That’s the trick — and it’s been re-discovered, generalized, and re-tooled three times across a decade.

The earliest version is Bucilă, Caruana and Niculescu-Mizil’s 2006 Model Compression paper, which trained small neural nets to mimic big ensembles by labelling a large unlabelled set with the ensemble’s predictions. The version most people mean today is Hinton, Vinyals and Dean’s 2015 Distilling the Knowledge in a Neural Network, which gave us the modern recipe: soft targets from a softened softmax. The version your inference bill cares about in 2026 is the one running quietly in production: small dense models distilled from frontier reasoning models on synthetic data.

Why it matters now

For an engineer shipping with LLMs, distillation is the single biggest reason “small enough to be cheap” and “good enough to be useful” overlap at all.

The pragmatic version of the question, then, is “when can I get away with the small one?” — and to answer that you have to know what distillation actually transfers and what it doesn’t.

The short answer

distillation = train a small student on the teacher's full output distribution + (optionally) the original labels

Instead of teaching the student “the answer is class 7” with a one-hot label, you teach it “the teacher thought class 7 was 62% likely, class 3 was 18%, class 9 was 11%, …”. That extra structure — the shape of the teacher’s uncertainty — turns out to carry a lot of information that a hard label throws away.
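
You can see the effect with a toy calculation. Here's a minimal sketch in plain Python, using the illustrative numbers above plus a made-up “other” bucket so each distribution sums to 1. The hard label scores two very different students identically; the soft target tells them apart:

```python
import math

# Teacher's soft target from the example above, with a hypothetical
# "other" bucket added so the distribution sums to 1.
teacher = {"class7": 0.62, "class3": 0.18, "class9": 0.11, "other": 0.09}
hard    = {"class7": 1.0,  "class3": 0.0,  "class9": 0.0,  "other": 0.0}

def cross_entropy(target, predicted):
    # Standard cross-entropy; terms with target probability 0 contribute nothing.
    return -sum(p * math.log(predicted[k]) for k, p in target.items() if p > 0)

# Two students that are equally "wrong" by the hard label: both give the
# true class 0.50. One spreads its doubt the way the teacher does; the
# other bets heavily on a class the teacher all but ruled out.
plausible   = {"class7": 0.50, "class3": 0.30, "class9": 0.10, "other": 0.10}
implausible = {"class7": 0.50, "class3": 0.01, "class9": 0.01, "other": 0.48}

print(cross_entropy(hard, plausible), cross_entropy(hard, implausible))        # 0.69 vs 0.69
print(cross_entropy(teacher, plausible), cross_entropy(teacher, implausible))  # ~1.11 vs ~1.83
```

The hard label literally cannot distinguish the two mistakes; the soft target penalizes the implausible one much harder. That extra gradient signal is the whole game.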

For modern LLMs the same idea shows up in two flavours: matching the teacher’s per-token probability distribution (true distillation), or just training on text the teacher generated (often called distillation loosely; technically closer to “synthetic-data fine-tuning”). The two get conflated in casual usage, and the distinction matters in places.

How it works

The original trick: soft targets

In a classifier, the standard training signal for an input is a one-hot vector: the correct class is 1.0, everything else is 0.0. Hinton’s 2015 paper observed something that, in retrospect, is obvious. The teacher’s softmax doesn’t just say “cat.” It says something like cat 0.90, dog 0.07, fox 0.02, truck 0.001. The ratio of “dog” to “truck” — both wrong — encodes real similarity information that the dataset’s hard label has erased. The phrase dark knowledge gets attached to this idea — it’s associated with Hinton from later talks rather than the 2015 paper itself, but it’s the label most people use.

To make those small numbers usable, you raise the softmax temperature T at both teacher and student during training:

p_i = exp(z_i / T) / Σ_j exp(z_j / T)

A higher T flattens the distribution and exposes the structure in the “wrong” classes. Hinton et al. report using temperatures from about 1 up to 20 in their experiments. The student is trained with a weighted sum of two losses: cross-entropy with the soft targets at high T, plus ordinary cross-entropy with the true labels at T = 1. After training, the student runs at T = 1 like any other model. (Hinton et al., 2015)
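
Put into code, the whole recipe is a few lines. Here's a minimal PyTorch sketch for a plain classifier; the function name and the default T and alpha are illustrative choices, not values from the paper (the T² factor, though, is from the paper: soft-target gradients scale as 1/T², so multiplying by T² keeps the two loss terms in balance as T changes):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft-target term: match the teacher's softened distribution,
    # with both softmaxes run at temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student  = F.log_softmax(student_logits / T, dim=-1)
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean")
    # Hard-target term: ordinary cross-entropy with the true labels at T = 1.
    hard_loss = F.cross_entropy(student_logits, labels)
    # T**2 rescales the soft gradients so the balance doesn't shift with T.
    return alpha * (T ** 2) * soft_loss + (1 - alpha) * hard_loss
```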

If you want one mental picture: the hard label tells the student what’s correct; the soft target tells it what’s almost correct. The “almost correct” map is the teaching.

Why a small model can match a big one (sometimes)

The capacity argument — “smaller networks must be worse” — is misleading. The real bottleneck for a small model is rarely raw parameter count; more often it’s finding the right function during training. A big model trained on noisy, finite labelled data has done the hard work of locating a good decision surface. Distilling it gives the student an effectively denser, smoother training signal: more nuanced labels, often on far more inputs (you can label as much unlabelled data as you can run inference on). The student is solving an easier optimization problem on a richer dataset.

This is exactly what Bucilă, Caruana and Niculescu-Mizil showed in 2006, before the deep-learning era: small neural nets trained to mimic the output of a large ensemble matched the ensemble’s accuracy while being, in their conclusion’s wording, roughly 1000× smaller and faster on average. The “more inputs, ensemble-labelled” part of their recipe is the part the modern LLM era leans on hardest. (Bucilă et al., 2006)
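
In modern terms their recipe fits in a few lines. A sketch, with every name a placeholder rather than a real API:

```python
def compress(teacher, unlabeled_pool, fit_student):
    # The key move: the teacher can label as much unlabeled data as you
    # can afford to run inference on, far more than humans ever labeled.
    synthetic = [(x, teacher.predict_proba(x)) for x in unlabeled_pool]
    # The small model is trained to reproduce the teacher's outputs on
    # that enlarged set, not to fit the original hard labels.
    return fit_student(synthetic)
```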

Distillation in the LLM era: three things that get called the same name

Here’s where the term gets fuzzy. In current practice, “distillation” covers at least three setups (the first two are sketched in code right after this list):

  1. Logit / soft-target distillation. The classical Hinton recipe, adapted to LLMs: at every position, train the student to match the teacher’s full next-token distribution (or its top-k). Requires access to teacher logits. DistilBERT (Sanh et al., 2019) is a well-known example — about 40% smaller than BERT-base, retaining roughly 97% of its GLUE performance, around 60% faster at inference. (Sanh et al., 2019)
  2. Sequence / behavior distillation. Have the teacher generate text on a set of prompts, then fine-tune the student via ordinary supervised fine-tuning on those (prompt, teacher-output) pairs. This is what DeepSeek’s R1-Distill checkpoints used: ~800k reasoning traces from R1, plain SFT on the smaller bases, no RL. You only need the teacher’s samples, not its logits — which is why this works through a closed-API teacher.
  3. Self-distillation and synthetic-data pretraining. The teacher and student can be the same model architecture at different training stages, or the teacher can simply be a high-quality model used to generate filtered pretraining-style data for somebody else’s base. The line between “distillation” and “training on synthetic data” gets blurry here. I’d flag it as terminology drift in the field rather than a clean technical boundary.
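
Here's a minimal sketch of the first two setups, to make the contrast concrete. The loss is real PyTorch; `teacher_generate` is a placeholder for whatever sampling API the teacher exposes:

```python
import torch
import torch.nn.functional as F

def per_token_kd_loss(student_logits, teacher_logits, T=1.0):
    # Setup 1: match the teacher's next-token distribution at every position.
    # Both tensors are (batch, seq_len, vocab); needs white-box teacher access.
    # Padding/masking is omitted for brevity.
    log_student = F.log_softmax(student_logits / T, dim=-1).flatten(0, 1)
    teacher_probs = F.softmax(teacher_logits / T, dim=-1).flatten(0, 1)
    return F.kl_div(log_student, teacher_probs, reduction="batchmean") * (T ** 2)

def make_sft_pairs(teacher_generate, prompts):
    # Setup 2: all you need from the teacher is sampled text, so this works
    # through a closed API. The pairs then feed ordinary supervised fine-tuning.
    return [(prompt, teacher_generate(prompt)) for prompt in prompts]
```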

The first kind is what the 2015 paper meant. The second kind is what the public DeepSeek-R1 example does. The third kind appears to be common practice in the current small-model ecosystem, but quantifying “how common” runs straight into the wall of closed-lab opacity.

Where the seams are

A few things worth knowing if you ever lean on a distilled model in production:

  1. “Distilled” on a model card is ambiguous. It can name any of the three setups above, and only the first involves the teacher’s actual distributions; the common closed-API case is sequence distillation, which only ever sees what the teacher sampled.
  2. Imitation inherits mistakes. A student trained on teacher traces reproduces the teacher’s behaviour on those prompts, errors included; plain SFT has no step that corrects a wrong trace unless you filter the data first.
  3. Attribution is murky. How much of a given small model’s quality comes from distillation versus base-data quality versus architecture is rarely published, so benchmark on your own task rather than trusting the lineage.

Going deeper

What I’m confident about: the mechanical story (soft targets carry more information than hard labels; sequence distillation works through teacher samples alone) and the empirical pattern (today’s small frontier-quality models are mostly distilled). What I’m less confident about: the precise contribution of distillation vs. base data quality vs. architecture choices in any specific small-model success story. The closed labs don’t publish the breakdown, and reverse-engineering it from leaderboards is unreliable.