Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why does predicting the next token end up doing reasoning?

An LLM is trained on one objective: guess the next token. From that one task, you get translation, code, arithmetic, and arguments. Why is autocomplete this powerful?

AI & ML · intermediate · Apr 29, 2026

Why it exists

The first time you really sit with what an LLM does, the trick stops being impressive and starts being weird. The model has one job. Given a string of tokens, output a probability distribution over what the next token is. That’s it. No “understanding” objective. No “reasoning” objective. No “be helpful” objective at the pretraining stage. Just: which token comes next.
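The entire interface fits in a few lines: a function from a token prefix to a probability distribution over the vocabulary. This toy version just counts which tokens follow which in a training string — the names and the counting scheme are illustrative, not any real model's API:

```python
from collections import Counter, defaultdict

def train(tokens):
    """Count, for each one-token context, which tokens follow it."""
    follows = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        follows[prev][nxt] += 1
    return follows

def next_token_distribution(follows, prev):
    """The whole LLM interface: prefix in, probability distribution out."""
    counts = follows[prev]
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

model = train("the cat sat on the mat the cat ran".split())
print(next_token_distribution(model, "the"))  # {'cat': 0.666..., 'mat': 0.333...}
```

A real model conditions on thousands of tokens of context with billions of parameters instead of a count table, but the type signature is the same.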

And yet what falls out of that one objective is — depending on the day — working code, a passable translation between languages it was never explicitly taught to translate, an argument that holds together for a paragraph, arithmetic on numbers it has never seen in that exact form. Somewhere between “predict the next token” and “write a unit test that passes” there is a gap that, if you’ve never tried to close it, looks absurd.

The interesting question isn’t whether this works. We know it does. The question is why. Why is autocomplete, run hard enough, the same thing as thinking? And how much of “the same thing” is real and how much is us being fooled by fluent text?

Why it matters now

Engineers shipping LLM-powered features keep running into the seams of this question without naming it.

If your mental model is “the LLM has been taught to do tasks,” you’ll keep being surprised by what it’s good and bad at. If your mental model is “the LLM has been pressured into a representation of language good enough to predict the next token, and tasks fall out of that,” the surprises mostly stop.

The short answer

next-token prediction generalizes ≈ "to predict text well, you have to model what produced the text"

To get good at guessing the next token in arbitrary internet text, the model is forced — by the objective alone — to build internal machinery that approximates the things that generated the text: facts, syntax, arithmetic, code semantics, the stance of an author, the structure of an argument. The machinery is the byproduct. Tasks ride on top of it.

How it works

Start with what the loss is actually measuring. During pretraining, the model is shown enormous amounts of text and asked, for every position, “what’s the next token?” The training signal — cross-entropy loss — punishes it in proportion to how surprised it was by the right token. Lower loss means it was less surprised, on average, across everything in the training set.
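"Punished in proportion to surprise" is literal: cross-entropy at one position is just the negative log of the probability the model assigned to the token that actually came next. A minimal sketch (toy distributions, not real model outputs):

```python
import math

def cross_entropy(predicted_dist, actual_token):
    """Loss for one position: how surprised the model was, in nats."""
    p = predicted_dist.get(actual_token, 1e-12)  # tiny floor for unseen tokens
    return -math.log(p)

confident = {"mat": 0.9, "dog": 0.1}
unsure    = {"mat": 0.5, "dog": 0.5}

print(cross_entropy(confident, "mat"))  # ~0.105 — low surprise, low loss
print(cross_entropy(unsure, "mat"))     # ~0.693 — more surprise, more loss
print(cross_entropy(confident, "dog"))  # ~2.303 — confidently wrong is punished hardest
```

The training loss is this quantity averaged over every position in the corpus, which is why "less surprised on average, across everything" is the exact thing being optimized.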

Now think about what "everything in the training set" contains. To predict the next token well across all of it, the model has to do better than chance on:

- factual statements, where the next token depends on what is true
- syntax, where the next token depends on grammatical structure
- arithmetic worked out in text, where the next token is a digit fixed by a calculation
- code, where the next token depends on variable scope and language semantics
- arguments and authorial stance, where the next token depends on what position is being developed

There is no shortcut for any of these that doesn’t, at some level, model the thing being described. A model that has memorized text but has no notion of arithmetic can’t reliably continue novel arithmetic. A model that has no notion of variable scope can’t reliably continue novel code. The objective is “predict the next token,” but the only representations that actually drive the loss down across that whole corpus are ones that, in some compressed form, capture what produced the text. This framing — prediction is compression, compression requires modeling — is the one Ilya Sutskever has pointed at in talks when asked why this works at all. It’s intuition, not a proof, but it matches what we see: the more diverse and structured the data, the more structure the model is forced to internalize to keep predicting well.
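The prediction-is-compression link can be made concrete: an ideal code spends -log2 p bits per symbol, so a model that assigns higher probability to the actual text literally encodes it in fewer bits. A small sketch comparing a know-nothing model to one that has learned the character frequencies (toy text, toy models):

```python
import math
from collections import Counter

def bits_to_encode(text, prob_of):
    """An ideal code spends -log2 p(char) bits per character (Shannon)."""
    return sum(-math.log2(prob_of(ch)) for ch in text)

text = "the cat sat on the mat"
alphabet = sorted(set(text))

# Model A: knows nothing — uniform over the alphabet.
uniform = lambda ch: 1 / len(alphabet)

# Model B: has learned the character frequencies of this text.
counts = Counter(text)
freq = lambda ch: counts[ch] / len(text)

print(f"uniform model:   {bits_to_encode(text, uniform):.1f} bits")
print(f"frequency model: {bits_to_encode(text, freq):.1f} bits")
```

The frequency model always needs fewer bits, and the only way to do better still is to capture more of the structure that produced the text — bigrams, then syntax, then meaning. That is the compression ladder the pretraining objective climbs.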

Tasks as conditional continuations

Once you have a model that’s good at next-token prediction, you don’t “teach it tasks” — you arrange the prompt so the right continuation is the task’s answer. The prompt sets a context in which the most likely continuation, according to the patterns the model learned, is the thing you wanted.

This is why in-context learning works at all. The model isn’t learning a task in any usual sense; the prompt is steering an already-built distribution toward the slice that produces the right kind of continuation. RLHF and instruction tuning then re-shape that distribution further so the model treats user messages as task specs, but the engine underneath is still next-token prediction.
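"Tasks as continuations" can be shown with even a crude count-based model. Everything below is a toy (trigram counts over a four-line corpus, not a real LLM), but the mechanism is the same shape: nothing ever teaches the model "translation," yet a prompt arranged in the right pattern makes the answer the most likely continuation:

```python
from collections import Counter, defaultdict

def train_trigrams(corpus):
    """Context = previous two tokens. Just counting — no task labels anywhere."""
    follows = defaultdict(Counter)
    for line in corpus:
        toks = line.split()
        for a, b, c in zip(toks, toks[1:], toks[2:]):
            follows[(a, b)][c] += 1
    return follows

def most_likely_next(follows, prompt):
    context = tuple(prompt.split()[-2:])
    return follows[context].most_common(1)[0][0]

# "Pretraining" text that happens to contain a regular pattern.
corpus = [
    "english: cat french: chat",
    "english: dog french: chien",
    "english: cat french: chat",
]
model = train_trigrams(corpus)

# The "task" was never taught. The prompt is arranged so that the
# most likely continuation *is* the answer.
print(most_likely_next(model, "english: cat french:"))  # chat
print(most_likely_next(model, "english: dog french:"))  # chien
```

A real model's "pattern" is vastly richer, but the move is identical: the prompt selects a region of the learned distribution where the continuation you want is the probable one.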

Why scale is doing the heavy lifting

A small model trained on the same objective doesn’t get you working code. The reason — best as anyone can tell — is that the representations needed to predict diverse text well are expensive to learn. A 100M-parameter model has to make crude generalizations to fit its capacity; a 100B-parameter model can afford features for syntax and arithmetic and translation and a thousand other regularities in the data, and use each one when relevant.

This is the rough shape of scaling laws: loss falls smoothly with more compute and data. But specific capabilities — arithmetic past two digits, multi-step reasoning, following instructions — appear to switch on more abruptly at certain scales. Whether that abruptness is real or partly a measurement artifact (some of the “emergence” results have been re-analyzed and softened) is still actively debated. The honest version: average loss goes down smoothly, and a lot of capabilities ride on that, but the exact mapping from “loss” to “capability” is messier than the early emergence narrative suggested.
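The measurement-artifact point can be demonstrated in a few lines. Suppose per-token accuracy improves smoothly with scale, but the benchmark only scores an answer correct when all k tokens are right. The numbers below are synthetic, in the spirit of the "emergence as a metric artifact" re-analyses, not real model data:

```python
# Synthetic illustration: smooth per-token accuracy vs. abrupt exact-match.
scales = [1, 2, 4, 8, 16, 32, 64]  # pretend model sizes
per_token = [0.30, 0.45, 0.60, 0.72, 0.82, 0.90, 0.96]  # smooth improvement

k = 10  # the answer needs 10 tokens, all correct, scored 0/1
exact_match = [p ** k for p in per_token]

for s, p, em in zip(scales, per_token, exact_match):
    print(f"scale {s:>2}: per-token {p:.2f} -> exact-match {em:.4f}")
# Exact-match sits near zero for small models, then "switches on" late —
# even though the underlying quantity improved smoothly the whole way.
```

A smooth curve pushed through an all-or-nothing metric looks like a phase transition. That doesn't settle whether any given capability is genuinely abrupt, but it shows why benchmark curves alone can't.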

Where the story breaks down

The “to predict text you must model the world” framing is the right intuition, but taken too far it becomes wrong in load-bearing ways. The model is pressured to model the distribution of text about the world, not the world itself, so a confidently written falsehood is as predictable as a truth. Fluent continuation can look like reasoning while generalizing poorly just outside the patterns in the training data — which is the “fooled by fluent text” half of the opening question. And low average loss guarantees no particular capability at any particular scale.

The headline still holds: a single, almost embarrassingly simple objective, applied to enough text with enough capacity, ends up forcing the model to assemble most of what we’d recognize as linguistic and semi-conceptual structure. Tasks are then prompts that sample from that structure. That’s the trick.

Going deeper