Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why does in-context learning work?

You paste three examples into a prompt and the model suddenly does the task. Nothing got trained. So what just happened?

AI & ML · intermediate · Apr 29, 2026

Why it exists

Here is the trick that, more than any single benchmark, made people believe LLMs were a different kind of thing:

Translate English to French.
sea otter => loutre de mer
peppermint => menthe poivrée
plush giraffe => ???

The model has never been told it is now a translator. No weights were updated. No fine-tuning happened. You just typed three lines into a text box. And what comes out next is “girafe en peluche.”

This is weird. The whole training pipeline was “predict the next token on a giant pile of internet text.” Nothing in that objective says “learn to follow few-shot examples.” But somewhere along the way, the model became something that, at inference time, behaves as if it picked up a new skill from the prompt itself.

That phenomenon got a name — in-context learning — in the GPT-3 paper (Brown et al., 2020), and the rest of the field spent the next several years trying to figure out what it actually is. The honest answer is that we still don’t fully know. There are good partial stories. None of them is settled.

I am writing this post because the curious-engineer question — the weights didn’t change, so what kind of “learning” is this? — is more interesting than any how-to about prompting.

Why it matters now

If in-context learning didn’t work, modern LLM products mostly wouldn’t exist as we know them.

The whole stack assumes a model that adapts from context: system prompts, few-shot examples, retrieval-augmented generation, and tool-use scaffolds all work by pushing text into the prompt and expecting behavior to change. If you don’t know why that works, you can’t predict when it will fail.

The short answer

in-context learning = a model + a long enough prompt + the property that conditioning on examples in the prompt mimics having been trained on them

Nothing literally learns at inference. The weights are frozen. What changes is the model’s conditional distribution over the next token once it has been forced to attend to your examples. Because the training objective happened to make those conditional distributions behave a lot like “do the task the examples are doing,” it looks like the model picked up a skill. It didn’t. It was always able to do this; your prompt just woke up the right slice of behavior.
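
To make that concrete, here is a minimal sketch comparing the next-token distribution with and without the in-context examples. It assumes the Hugging Face transformers library and torch are installed, and uses gpt2 purely as a stand-in for any small causal LM; the point is not that gpt2 nails the translation, just that the only thing that differs between the two calls is the conditioning text.

# Compare the next-token distribution with and without in-context examples.
# gpt2 is just a stand-in for a small causal LM; nothing here updates weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()  # frozen weights throughout

few_shot = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "plush giraffe =>"
)
zero_shot = "Translate English to French.\nplush giraffe =>"

def next_token_distribution(prompt):
    # Same weights, same forward pass; only the conditioning text differs.
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # scores for the next token
    return torch.softmax(logits, dim=-1)

for name, prompt in [("zero-shot", zero_shot), ("few-shot", few_shot)]:
    probs = next_token_distribution(prompt)
    top = torch.topk(probs, 5)
    print(name, [tok.decode(int(i)) for i in top.indices])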

How it works

Two questions to keep separate, because they have different answers:

  1. At inference time, what is the model mechanically doing?
  2. Why did training on next-token prediction give it that ability in the first place?

What’s mechanically happening

Inside the transformer, your prompt becomes a sequence of token embeddings. Each layer’s attention mechanism lets later positions read from earlier ones. By the time the model is computing the distribution for the next token, every previous token in the prompt — including your three “sea otter => loutre de mer” examples — has had a chance to influence the internal state.

So “few-shot learning” is, at the level of the math, just more conditioning text inside the same context window. There is no parameter update, no gradient, no separate “learning phase.” The same matrix multiplications that ran on token 1 run on token 5,000. The weights never know they’re doing translation. The activations do.

That reframing is useful: in-context learning is whatever the attention pattern does when it conditions on patterned context. It’s a property of inference, not a separate algorithm.
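
If you want to see “later positions read from earlier ones” without a framework in the way, a single attention head is a few lines of NumPy. This is a toy with random weights, not a real model’s head; it only shows the shape of the computation: the last position ends up with a weighted read over every earlier position, example tokens included.

# Toy single-head causal self-attention in NumPy (random weights, purely
# illustrative). The last row of `weights` is the final token's read over
# every earlier position in the prompt.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 8, 16                       # pretend: 8 prompt tokens, 16-dim states
x = rng.normal(size=(seq_len, d))        # stand-in for token embeddings

W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / np.sqrt(d)
scores[np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)] = -np.inf  # causal mask

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over visible positions

out = weights @ V                        # each position: a weighted read of earlier values
print(weights[-1].round(2))              # how the last token spreads its attention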

Why next-token training gave us this

This is the part that is not fully understood, and I want to be clear about that. Several partial accounts have real evidence behind them; none of them is “the answer.”

The “implicit Bayesian inference” story. Xie et al. (2022) and others argued that during pretraining the model sees countless little stretches of text that look like latent task followed by examples of that task — recipes, FAQs, code with docstrings, exam answers, translation pairs in parallel corpora. The model learns to infer “which task is this passage doing?” as a side-effect of next-token prediction, because guessing the task helps predict the next token. At inference, your few-shot prompt looks like one of those stretches; the model infers the task from your examples and generates accordingly. In this view, in-context learning isn’t a new capability — it’s the model doing the same task-inference it always did, with your prompt as the input.
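
Here is the flavor of that story in a deliberately tiny form. The handful of “tasks,” the uniform prior, and the likelihoods below are all invented for illustration; the only claim is structural: a predictor that infers the latent task from the example pairs will complete the query the way the examples suggest.

# Toy "implicit Bayesian inference": infer a latent task from the in-context
# pairs, then answer the query under the inferred task. Tasks, prior and
# likelihoods are made up for illustration.
tasks = {
    "en->fr": {"sea otter": "loutre de mer", "peppermint": "menthe poivrée",
               "plush giraffe": "girafe en peluche"},
    "en->de": {"sea otter": "Seeotter", "peppermint": "Pfefferminze",
               "plush giraffe": "Plüschgiraffe"},
    "echo":   {"sea otter": "sea otter", "peppermint": "peppermint",
               "plush giraffe": "plush giraffe"},
}
prior = {name: 1.0 / len(tasks) for name in tasks}

def likelihood(task, x, y, eps=1e-3):
    # P(y | x, task): near 1 if the pair fits the task, near 0 otherwise.
    return 1.0 - eps if tasks[task].get(x) == y else eps

def task_posterior(examples):
    # P(task | examples) is proportional to P(task) * product of P(y_i | x_i, task)
    scores = {t: prior[t] for t in tasks}
    for x, y in examples:
        for t in tasks:
            scores[t] *= likelihood(t, x, y)
    z = sum(scores.values())
    return {t: s / z for t, s in scores.items()}

examples = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]
post = task_posterior(examples)
best = max(post, key=post.get)
print(post)                              # nearly all the mass lands on "en->fr"
print(tasks[best]["plush giraffe"])      # -> girafe en peluche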

The “induction head” story. Olsson et al. (2022, Anthropic) found small circuits inside transformers — pairs of attention heads they called induction heads — that implement a very specific behavior: if the pattern [A][B] appeared earlier in the context, and you now see [A] again, attend to and copy [B]. They showed these circuits form abruptly during training, and the moment they form coincides with the model getting much better at in-context learning on synthetic tasks. The claim is not that induction heads explain all in-context learning, but that they’re a concrete, mechanistic example of how next-token training can build a circuit that looks, from outside, like “learning from examples.”
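
The behavior itself is easy to write down. The sketch below is not how an attention head is implemented; it is just the match-and-copy rule Olsson et al. describe, expressed as a plain backward scan over the context.

# The match-and-copy rule: if [A][B] appeared earlier in the context and we
# now see [A] again, predict [B]. This is the behavior induction heads
# implement, not how they implement it.
def induction_guess(tokens):
    current = tokens[-1]
    # Scan backwards for an earlier occurrence of the current token,
    # then copy whatever came right after it.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

context = ["plush", "giraffe", "=>", "girafe", "en", "peluche", ".", "plush"]
print(induction_guess(context))  # -> "giraffe"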

The “gradient descent in the forward pass” story. A line of work (Garg et al., 2022; von Oswald et al., 2022; Akyürek et al., 2022) showed that on simple tasks like linear regression, transformers given (input, output) pairs in their context produce predictions that closely match what one or a few steps of gradient descent on those pairs would produce. The provocative reading: the forward pass is implementing a tiny optimizer over the in-context examples. How far this generalizes from toy regression to real natural language is contested. It’s a beautiful mathematical result and an uncertain empirical claim.
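
To see what “matches gradient descent” even means here, this is the baseline half of that toy setup: (x, y) pairs drawn from a random linear function standing in for the in-context examples, and the prediction that a few steps of gradient descent on those pairs would make for a new query point. The transformer half of the comparison is deliberately not in the sketch; this is only the yardstick those papers measure its outputs against.

# Baseline side of the in-context linear regression setup: given (x, y) pairs
# in "context", what would k steps of full-batch gradient descent (from w = 0)
# predict for a query x? The transformer being compared to this is not shown.
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 20
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))              # in-context inputs
y = X @ w_true                           # in-context targets (noise-free toy)
x_query = rng.normal(size=d)

def gd_prediction(X, y, x_query, steps, lr=0.05):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
        w -= lr * grad
    return x_query @ w

for steps in (1, 10, 100):
    print(steps, "GD steps:", round(gd_prediction(X, y, x_query, steps), 3))
print("least squares:", round(x_query @ np.linalg.lstsq(X, y, rcond=None)[0], 3))
print("ground truth :", round(x_query @ w_true, 3))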

These stories are not mutually exclusive. They’re probably all partially right, on different tasks, at different scales. The honest summary is: we have several mechanistic hypotheses with supporting evidence, and we do not yet have a unified theory of why scaling a next-token predictor produces this behavior. Anyone who tells you otherwise with confidence is overselling.

Where the seams show

If you stress in-context learning, it cracks in instructive ways: accuracy can swing with the order and formatting of the examples, models sometimes do nearly as well when the example labels are wrong, and performance drops when the prompt pattern looks unlike anything the pretraining data plausibly contained.

These are not bugs to fix. They’re tells that whatever in-context learning is, it’s not the same kind of process as training. It’s a sibling, not a copy.

Going deeper

A note on what I’m sure of: the phenomenon (models behaving as if they learned from in-prompt examples, with no weight updates) is rock-solid and reproducible. The explanation is genuinely open research. I’ve tried to keep speculation and evidence on different sides of the line; if I’ve slid across it somewhere, that’s a defect of this post, not a discovery about the field.