Why does predicting the next token end up doing reasoning?
An LLM is trained on one objective: guess the next token. From that one task, you get translation, code, arithmetic, and arguments. Why is autocomplete this powerful?
Why it exists
The first time you really sit with what an LLM does, the trick stops being impressive and starts being weird. The model has one job. Given a string of tokens, output a probability distribution over what the next token is. That’s it. No “understanding” objective. No “reasoning” objective. No “be helpful” objective at the pretraining stage. Just: which token comes next.
And yet what falls out of that one objective is — depending on the day — working code, a passable translation between languages it was never explicitly taught to translate, an argument that holds together for a paragraph, arithmetic on numbers it has never seen in that exact form. Somewhere between “predict the next token” and “write a unit test that passes” there is a gap that, if you’ve never tried to close it, looks absurd.
The interesting question isn’t whether this works. We know it does. The question is why. Why is autocomplete, run hard enough, the same thing as thinking? And how much of “the same thing” is real and how much is us being fooled by fluent text?
Why it matters now
Engineers shipping LLM-powered features keep running into the seams of this question without naming it:
- Why does chain-of-thought prompting help at all? If the model “knows” the answer, why does asking for the steps change it? Because the model is a token predictor, and predicting “the answer is 47” in one shot is a different computation than predicting it after writing out the reasoning trace.
- Why are LLMs strong at things their trainers didn’t aim at, and brittle at things that look easier? Counting characters in a word is hard for them; explaining a 19th-century court ruling is easy. Both of these stop being mysterious once you remember what task the model was actually trained on.
- Why does emergence happen at scale? The objective doesn’t change as you add parameters and data. What changes is which patterns are cheap enough for the model to learn under that single objective.
If your mental model is “the LLM has been taught to do tasks,” you’ll keep being surprised by what it’s good and bad at. If your mental model is “the LLM has been pressured into a representation of language good enough to predict the next token, and tasks fall out of that,” the surprises mostly stop.
The short answer
`next-token prediction generalizes ≈ "to predict text well, you have to model what produced the text"`
To get good at guessing the next token in arbitrary internet text, the model is forced — by the objective alone — to build internal machinery that approximates the things that generated the text: facts, syntax, arithmetic, code semantics, the stance of an author, the structure of an argument. The machinery is the byproduct. Tasks ride on top of it.
How it works
Start with what the loss is actually measuring. During pretraining, the model is shown enormous amounts of text and asked, for every position, “what’s the next token?” The training signal — cross-entropy loss — punishes it in proportion to how surprised it was by the right token. Lower loss means it was less surprised, on average, across everything in the training set.
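In code, that per-token training signal looks something like this (a minimal PyTorch sketch; the framework choice and the toy numbers are mine, not from any real training run):

```python
import torch
import torch.nn.functional as F

# Toy setup: a "model" is anything that maps a context to logits over the
# vocabulary. Here the logits are random stand-ins.
vocab_size = 10
logits = torch.randn(1, vocab_size)  # model's scores for the next token
target = torch.tensor([7])           # the token that actually came next

# Cross-entropy loss = -log p(correct token): the more surprised the model
# was (the less probability it put on the right token), the bigger the loss.
loss = F.cross_entropy(logits, target)

probs = F.softmax(logits, dim=-1)
print(f"p(correct token) = {probs[0, 7]:.3f}")
print(f"loss = -log p    = {loss.item():.3f}")
```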
Now think about what “everything in the training set” contains. To predict the next token well across all of it, the model has to do better than chance on:
- Code, where the next token depends on syntax, scope, and what the function is supposed to do.
- Translations, where one language’s sentence is followed by another’s.
- Arithmetic strings, where `7 × 8 =` is followed by `56` more often than by `54`.
- Stories, where character names persist and Chekhov’s gun goes off in Act III.
- Reasoning chains, where each line is the consequence of the last.
There is no shortcut for any of these that doesn’t, at some level, model the thing being described. A model that has memorized text but has no notion of arithmetic can’t reliably continue novel arithmetic. A model that has no notion of variable scope can’t reliably continue novel code. The objective is “predict the next token,” but the only representations that actually drive the loss down across that whole corpus are ones that, in some compressed form, capture what produced the text. This framing — prediction is compression, compression requires modeling — is the one Ilya Sutskever has pointed at in talks when asked why this works at all. It’s intuition, not a proof, but it matches what we see: the more diverse and structured the data, the more structure the model is forced to internalize to keep predicting well.
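You can put numbers on that intuition. Cross-entropy charges the model `-log2 p` bits for the token that actually appears, so a model that has internalized multiplication pays far less on arithmetic strings than one that only knows “a two-digit number comes next.” The probabilities below are invented for illustration:

```python
import math

# Cost in bits of encoding the true next token under a model: -log2 p(token).
def bits(p: float) -> float:
    return -math.log2(p)

# A model that has internalized multiplication concentrates mass on 56.
print(f"p(56) = 0.90 -> {bits(0.90):.2f} bits")  # ~0.15 bits
# A model that only knows "a two-digit number follows" spreads mass evenly.
print(f"p(56) = 0.01 -> {bits(0.01):.2f} bits")  # ~6.64 bits
```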
Tasks as conditional continuations
Once you have a model that’s good at next-token prediction, you don’t “teach it tasks” — you arrange the prompt so the right continuation is the task’s answer. The prompt sets a context in which the most likely continuation, according to the patterns the model learned, is the thing you wanted.
- “Translate to French: Hello → ” biases the next tokens toward Bonjour because text in the training set that started this way tended to continue that way.
- “def is_prime(n):” biases the next tokens toward a Python primality check.
- “Q: What’s 137 + 248? A: Let’s think step by step.” biases toward a step-by-step derivation, which on average ends in the right answer more often than a one-shot guess.
This is why in-context learning works at all. The model isn’t learning a task in any usual sense; the prompt is steering an already-built distribution toward the slice that produces the right kind of continuation. RLHF and instruction tuning then re-shape that distribution further so the model treats user messages as task specs, but the engine underneath is still next-token prediction.
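A minimal sketch of that mechanism end to end, using Hugging Face `transformers` with `gpt2` as a stand-in (the model and prompt are my choices; `gpt2` is small, so the actual continuation may be rough, but the point is that there is no task API anywhere, just a prompt and a continuation):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# No "translate" endpoint exists. The prompt just makes the desired answer
# the likely continuation under the learned distribution.
prompt = "Translate English to French:\nsea otter -> loutre de mer\nhello -> "
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=5,
        do_sample=False,  # greedy: take the most likely next token each step
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:]))
```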
Why scale is doing the heavy lifting
A small model trained on the same objective doesn’t get you working code. The reason — as best anyone can tell — is that the representations needed to predict diverse text well are expensive to learn. A 100M-parameter model has to make crude generalizations to fit its capacity; a 100B-parameter model can afford features for syntax and arithmetic and translation and a thousand other regularities in the data, and use each one when relevant.
This is the rough shape of scaling laws: loss falls smoothly with more compute and data. But specific capabilities — arithmetic past two digits, multi-step reasoning, following instructions — appear to switch on more abruptly at certain scales. Whether that abruptness is real or partly a measurement artifact (some of the “emergence” results have been re-analyzed and softened) is still actively debated. The honest version: average loss goes down smoothly, and a lot of capabilities ride on that, but the exact mapping from “loss” to “capability” is messier than the early emergence narrative suggested.
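The smooth part is usually summarized as a power law in model size, roughly L(N) ≈ a · N^(−b). Here’s what that form looks like when fit in log-log space; the data points are invented for illustration (the real fits are in the Kaplan et al. paper cited below):

```python
import numpy as np

# Invented (parameter count, loss) points that roughly follow a power law.
N = np.array([1e7, 1e8, 1e9, 1e10, 1e11])     # model size in parameters
L = np.array([4.20, 3.60, 3.10, 2.70, 2.35])  # pretraining loss (made up)

# L = a * N^(-b) is linear in log-log space: log L = log a - b * log N.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
a, b = np.exp(intercept), -slope

print(f"fit: L(N) = {a:.2f} * N^(-{b:.3f})")
print(f"extrapolated loss at 1e12 params: {a * (1e12 ** -b):.2f}")
```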
Where the story breaks down
The “to predict text you must model the world” framing is the right intuition, but taken too far it becomes wrong in load-bearing ways:
- The model is modeling text about the world, not the world. If the internet is wrong about something in a consistent way, the model will be wrong with it. Hallucinated citations are a classic case: fluent text where citations go is a strong pattern; real citations are a weaker one.
- Some patterns that look like reasoning are pattern matching that happens to coincide with reasoning on the training distribution and comes apart off it. This is why benchmarks that perturb surface form (rename variables, change numbers) sometimes drop scores sharply.
- Tokenization leaks through. Counting characters in a word is hard because the model doesn’t see characters; it sees tokens. No amount of next-token training fixes a representation that hides the unit you’re being asked about. (A concrete demonstration follows this list.)
- “Why does this work as well as it does” is genuinely not fully understood. There’s no clean theorem saying next-token prediction on internet-scale text must yield code-writing assistants. We have scaling-law fits and post-hoc stories. The post-hoc stories are good. They are not a derivation.
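On the tokenization point specifically, here is what a real tokenizer does to a word (using OpenAI’s `tiktoken`; the exact splits depend on the encoding, so treat the output as illustrative):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("strawberry")

# The model's atomic unit is the token id, not the character. A question
# like "how many r's are in strawberry?" asks about units the model never
# directly observes.
print(tokens)
print([enc.decode_single_token_bytes(t) for t in tokens])
```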
The headline still holds: a single, almost embarrassingly simple objective, applied to enough text with enough capacity, ends up forcing the model to assemble most of what we’d recognize as linguistic and semi-conceptual structure. Tasks are then prompts that sample from that structure. That’s the trick.
Famous related terms
- LLM — `LLM = neural net + "predict the next token" objective at scale`. The thing this whole post is unpacking.
- In-context learning — `in-context learning = prompt + frozen weights + a continuation that happens to be the task answer`. The mechanism by which “tasks” exist at all.
- Chain-of-thought — `CoT = prompt the model to write its steps before its answer`. Works because next-token prediction over a derivation is a different (often better-behaved) computation than one-shot answer prediction.
- Scaling laws — `scaling laws ≈ loss falls predictably as you add parameters, data, and compute`. The empirical reason “more of the same objective” keeps producing better models.
- Emergence — `emergence = capability that appears sharply at scale`. A real-feeling pattern, with caveats; the average loss curve is smoother than the capability curves it carries.
- Pretraining vs fine-tuning — `pretraining = build representations from scratch; fine-tuning = adapt existing ones to a task`. Pretraining buys the representations; fine-tuning rents them for a specific task.
Going deeper
- Ilya Sutskever’s various talks on prediction-as-compression — the cleanest informal articulation of the “to predict well you must model” intuition. (I don’t have one canonical talk to point at; the argument shows up across several interviews and lectures.)
- Kaplan et al., “Scaling Laws for Neural Language Models” (2020) and Hoffmann et al., “Training Compute-Optimal Large Language Models” (Chinchilla, 2022) — the empirical backbone for “more of the same objective gets better in a predictable way.”
- Schaeffer, Miranda, Koyejo, “Are Emergent Abilities of Large Language Models a Mirage?” (NeurIPS 2023) — the skeptical re-reading of the emergence narrative. Worth reading alongside the original emergence papers, not instead of them.