Why does in-context learning work?
You paste three examples into a prompt and the model suddenly does the task. Nothing got trained. So what just happened?
Why it exists
Here is the trick that, more than any single benchmark, made people believe LLMs were a different kind of thing:
Translate English to French.
sea otter => loutre de mer
peppermint => menthe poivrée
plush giraffe => ???
The model has never been told it is now a translator. No weights were updated. No fine-tuning happened. You just typed three lines into a text box. And the completion that comes out is “girafe en peluche.”
This is weird. The whole training pipeline was “predict the next token on a giant pile of internet text.” Nothing in that objective says “learn to follow few-shot examples.” But somewhere along the way, the model became something that, at inference time, behaves as if it picked up a new skill from the prompt itself.
That phenomenon got a name — in-context learning — in the GPT-3 paper (Brown et al., 2020), and the rest of the field spent the next several years trying to figure out what it actually is. The honest answer is that we still don’t fully know. There are good partial stories. None of them is settled.
I am writing this post because the curious-engineer question — the weights didn’t change, so what kind of “learning” is this? — is more interesting than any how-to about prompting.
Why it matters now
If in-context learning didn’t work, modern LLM products mostly wouldn’t exist as we know them.
- Prompt engineering is in-context learning by another name. Every “you are a helpful assistant who…” preamble is leaning on the model’s ability to specialize on the fly.
- Few-shot prompting is the first thing anyone tries before reaching for fine-tuning, and it usually works well enough that fine-tuning never happens.
- Retrieval-augmented generation (RAG) works because dropping retrieved passages into the prompt is enough to make the model condition on them — that conditioning is in-context learning, just with documents as the “examples.”
- Tool use, agents, structured output all rely on the same property: show the model the shape, and it will produce something of that shape.
The whole stack assumes a model that adapts from context. If you don’t know why that works, you can’t predict when it will fail.
The short answer
in-context learning = a model + a long enough prompt + the property that conditioning on examples in the prompt mimics having been trained on them
Nothing literally learns at inference. The weights are frozen. What changes is the model’s conditional distribution over the next token once it has been forced to attend to your examples. Because the training objective happened to make those conditional distributions behave a lot like “do the task the examples are doing,” it looks like the model picked up a skill. It didn’t. It was always able to do this; your prompt just woke up the right slice of behavior.
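To make the frozen-weights point concrete, here is a minimal sketch using GPT-2 through the Hugging Face transformers library. The model choice is mine, and GPT-2 is far too small to translate reliably; the only point is that the same frozen weights produce a visibly different next-token distribution once the examples are in the prompt.

```python
# A minimal sketch: "in-context learning" is conditioning a frozen model on a
# longer prompt. No weights change between the two calls below; only the
# prompt does, and with it the model's distribution over the next token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()  # inference only: no gradients, no updates

zero_shot = "plush giraffe =>"
few_shot = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "plush giraffe =>"
)

def next_token_distribution(prompt):
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits               # (1, seq_len, vocab_size)
    return torch.softmax(logits[0, -1], dim=-1)  # distribution for the *next* token

for name, prompt in [("zero-shot", zero_shot), ("few-shot", few_shot)]:
    probs = next_token_distribution(prompt)
    top = torch.topk(probs, 5)
    print(name, [tok.decode(int(i)) for i in top.indices])
```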
How it works
Two questions to keep separate, because they have different answers:
- At inference time, what is the model mechanically doing?
- Why did training on next-token prediction give it that ability in the first place?
What’s mechanically happening
Inside the transformer, your prompt becomes a sequence of token embeddings. Each layer’s attention mechanism lets later positions read from earlier ones. By the time the model is computing the distribution for the next token, every previous token in the prompt — including your three “sea otter => loutre de mer” examples — has had a chance to influence the internal state.
So “few-shot learning” is, at the level of the math, just a longer context window. There is no parameter update, no gradient, no separate “learning phase.” The same matrix multiplications that ran on token 1 run on token 5,000. The weights never know they’re doing translation. The activations do.
That reframing is useful: in-context learning is whatever the attention pattern does when it conditions on patterned context. It’s a property of inference, not a separate algorithm.
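If you want the mechanics in front of you, here is a toy single-head causal self-attention in NumPy. The dimensions and random weights are arbitrary stand-ins, not any real model's; the thing to notice is that the final position's output is just a weighted mix of everything before it, and that mixing is the entire channel through which your examples act.

```python
# A toy, single-head causal self-attention in NumPy. Later positions may read
# from earlier ones and never the reverse; that read is the only channel by
# which few-shot examples reach the prediction at the final position.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 8                       # say, 6 prompt tokens and model width 8
x = rng.normal(size=(seq_len, d))       # stand-in for token embeddings

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

scores = Q @ K.T / np.sqrt(d)
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf                  # causal mask: no position sees the future

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                       # each row: a weighted mix of earlier tokens' values

# How much the last position reads from each earlier token. No parameter
# changed; "conditioning on the examples" is exactly this mixing.
print(weights[-1].round(3))
```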
Why next-token training gave us this
This is the part that is not fully understood, and I want to be clear about that. Several partial accounts have real evidence behind them; none of them is “the answer.”
The “implicit Bayesian inference” story. Xie et al. (2022) and others argued that during pretraining the model sees countless little stretches of text that look like latent task followed by examples of that task — recipes, FAQs, code with docstrings, exam answers, translation pairs in parallel corpora. The model learns to infer “which task is this passage doing?” as a side-effect of next-token prediction, because guessing the task helps predict the next token. At inference, your few-shot prompt looks like one of those stretches; the model infers the task from your examples and generates accordingly. In this view, in-context learning isn’t a new capability — it’s the model doing the same task-inference it always did, with your prompt as the input.
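Here is a toy version of that story, to be clear an illustrative sketch of mine and not the model from Xie et al.: hold a prior over a handful of already-known tasks, score how well each one explains the in-context examples, and answer the query with the posterior-weighted winner. Nothing is learned from the examples; they only identify which existing task is in play.

```python
# An illustrative toy of the task-inference story (not Xie et al.'s model).
# The "model" already knows a few tasks; the demonstrations only pick one out.
TASKS = {
    "copy":    lambda s: s,
    "reverse": lambda s: s[::-1],
    "upper":   lambda s: s.upper(),
}
PRIOR = {name: 1 / len(TASKS) for name in TASKS}

def likelihood(task_fn, demos, eps=1e-3):
    """P(demonstrations | task): high when the task reproduces each output."""
    p = 1.0
    for x, y in demos:
        p *= (1 - eps) if task_fn(x) == y else eps
    return p

def predict(demos, query):
    post = {name: PRIOR[name] * likelihood(fn, demos) for name, fn in TASKS.items()}
    total = sum(post.values())
    post = {name: p / total for name, p in post.items()}
    best = max(post, key=post.get)          # infer the latent task...
    return TASKS[best](query), post         # ...then generate accordingly

demos = [("sea otter", "retto aes"), ("peppermint", "tnimreppep")]
print(predict(demos, "plush giraffe"))
# -> ('effarig hsulp', posterior concentrated on "reverse")
```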
The “induction head” story. Olsson et al. (2022, Anthropic) found small circuits inside transformers — pairs of attention heads they called induction heads — that implement a very specific behavior: if the pattern [A][B] appeared earlier in the context, and you now see [A] again, attend to and copy [B]. They showed these circuits form abruptly during training, and the moment they form coincides with the model getting much better at in-context learning on synthetic tasks. The claim is not that induction heads explain all in-context learning, but that they’re a concrete, mechanistic example of how next-token training can build a circuit that looks, from outside, like “learning from examples.”
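Stripped of the transformer internals, the behavior Olsson et al. describe is simple enough to write out directly. The sketch below is plain Python mimicking an induction head's input/output behavior, not the circuit itself:

```python
# Not a real circuit, just the behavior Olsson et al. attribute to induction
# heads, written as plain Python: if [A][B] appeared earlier and the context
# now ends in [A], predict [B].
def induction_predict(tokens):
    """Return the token that followed the most recent earlier occurrence of
    the current (last) token, or None if it never appeared before."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan the context backwards
        if tokens[i] == current:
            return tokens[i + 1]               # copy what came after it last time
    return None

# The context contains "sea otter"; seeing "sea" again, the circuit says "otter".
print(induction_predict(["sea", "otter", "=>", "loutre", "de", "mer", "sea"]))
```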
The “gradient descent in the forward pass” story. A line of work (Garg et al., 2022; von Oswald et al., 2022; Akyürek et al., 2022) showed that on simple tasks like linear regression, transformers given (input, output) pairs in their context produce predictions that closely match what one or a few steps of gradient descent on those pairs would produce. The provocative reading: the forward pass is implementing a tiny optimizer over the in-context examples. How far this generalizes from toy regression to real natural language is contested. It’s a beautiful mathematical result and an uncertain empirical claim.
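For a sense of the experimental setup (a sketch of the framing, with the transformer itself left out): the papers hand a model (input, output) pairs drawn from a random linear function and compare its in-context prediction for a query point against reference predictors, such as a single gradient-descent step and the full least-squares solution.

```python
# A sketch of the toy-regression setup, minus the transformer: in-context
# (x, y) pairs from a random linear function, plus the reference predictors a
# trained transformer's in-context answers are compared against.
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))            # the in-context inputs
y = X @ w_true                         # their labels
x_query = rng.normal(size=d)           # the query the model must answer

# One gradient-descent step on the squared loss, starting from w = 0:
# w <- w + lr * X^T (y - X w) / n, which at w = 0 is lr * X^T y / n.
lr = 0.05
w_gd = lr * X.T @ y / n

# Full least-squares fit on the same in-context pairs, for reference.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("1-step GD prediction :", x_query @ w_gd)
print("least-squares        :", x_query @ w_ols)
print("true value           :", x_query @ w_true)
```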
These stories are not mutually exclusive. They’re probably all partially right, on different tasks, at different scales. The honest summary is: we have several mechanistic hypotheses with supporting evidence, and we do not yet have a unified theory of why scaling a next-token predictor produces this behavior. Anyone who tells you otherwise with confidence is overselling.
Where the seams show
If you stress in-context learning, it cracks in instructive ways:
- Order matters. Lu et al. (2022) showed that just permuting the order of few-shot examples can swing accuracy from near-random to near-perfect on the same task. A “real” learning algorithm wouldn’t care about the order of its training set. This one does.
- The labels barely have to be right. Min et al. (2022) found that on many classification tasks, replacing the example labels with random labels barely hurt few-shot performance — what mattered was the format and the label space, not the input-label mapping. This is hard to square with “the model is learning the task from the examples.” It’s much easier to square with “the examples are telling the model which distribution of behavior to switch into.”
- It plateaus. More examples help, then stop helping, then sometimes hurt. A real learning algorithm would keep getting better with more data.
- It’s not durable. End the conversation, start a new one, and the “skill” is gone. Whatever happened was scoped to the activations of one forward pass.
These are not bugs to fix. They’re tells that whatever in-context learning is, it’s not the same kind of process as training. It’s a sibling, not a copy.
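If you want to poke at the first two seams yourself, the experiments are mostly prompt plumbing. Below is a sketch of the harness with the model call left out; the prompt format and labels are my own stand-ins, not the ones from Lu et al. or Min et al.

```python
# Prompt plumbing for the two stress tests; plug in whatever model and scoring
# you have on hand. The format and labels here are made up for illustration.
import itertools
import random

examples = [
    ("The movie was wonderful.", "positive"),
    ("I want my money back.", "negative"),
    ("A total waste of time.", "negative"),
    ("Best thing I've seen all year.", "positive"),
]

def build_prompt(demos, query):
    lines = [f"Review: {x}\nSentiment: {y}" for x, y in demos]
    return "\n\n".join(lines) + f"\n\nReview: {query}\nSentiment:"

query = "Surprisingly good."

# Order sensitivity (Lu et al., 2022): the same four demonstrations in all 24
# orders. Accuracy across these prompts swings far more than it should.
order_prompts = [build_prompt(p, query) for p in itertools.permutations(examples)]

# Label sensitivity (Min et al., 2022): random labels, same format and label
# space. Surprisingly often, performance barely moves.
random.seed(0)
random_label_demos = [(x, random.choice(["positive", "negative"])) for x, _ in examples]
random_label_prompt = build_prompt(random_label_demos, query)

print(len(order_prompts), "orderings to score")
print(random_label_prompt)
```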
Famous related terms
- Few-shot prompting — few-shot = task description + k worked examples + the new input. The most common shape of in-context learning in practice.
- Zero-shot prompting — zero-shot = task description + the new input, no examples. Relies on the model already having the task baked in from pretraining.
- Chain of thought — CoT ≈ in-context learning + 'show your work' as the demonstrated format. Probably its own post.
- Fine-tuning — fine-tuning = more training + your data + actual weight updates. The thing in-context learning lets you skip 80% of the time.
- Induction head — induction head = a 2-attention-head circuit that copies "what came after [A] last time" when it sees [A] again. The cleanest mechanistic example of an in-context-learning building block.
- Prompt engineering — prompt engineering = exploiting in-context learning by hand. The applied side of all of the above.
Going deeper
- Language Models are Few-Shot Learners (Brown et al., 2020) — the GPT-3 paper that named the phenomenon and forced everyone to take it seriously.
- An Explanation of In-context Learning as Implicit Bayesian Inference (Xie, Raghunathan, Liang, Ma; 2022) — the task-inference framing.
- In-context Learning and Induction Heads (Olsson et al., Anthropic, 2022) — the mechanistic interpretability angle.
- Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? (Min et al., 2022) — the paper where the labels turn out not to matter much, and you have to rethink what the examples are doing.
A note on what I’m sure of: the phenomenon (models behaving as if they learned from in-prompt examples, with no weight updates) is rock-solid and reproducible. The explanation is genuinely open research. I’ve tried to keep speculation and evidence on different sides of the line; if I’ve slid across it somewhere, that’s a defect of this post, not a discovery about the field.