Why does chain-of-thought prompting work?
Adding 'let's think step by step' to a prompt makes models measurably better at hard problems. Nobody fully agrees on why, and the wrong story will mislead you about how to use it.
Why it exists
Around 2022 a strange empirical fact got named. If you asked a sufficiently large LLM a multi-step word problem and demanded the answer directly, it often flubbed it. If you asked it the same problem and added something like “think step by step” — or showed it a few worked examples with the intermediate steps written out — accuracy jumped, sometimes by tens of points on the same benchmark. No new training. No new weights. Just more tokens between the question and the answer.
That fact was a problem for the mental model people had. The standard account of an LLM was “next-token predictor” — a giant lookup that, given the prompt, samples a likely continuation. Under that account, the content of the prompt should matter, but the length of the model’s own scratchwork shouldn’t. Yet here was a clean empirical handle: let the model write more before answering and it gets things right that it otherwise gets wrong.
Chain-of-thought (CoT) prompting exists because once that handle was discovered, it was too cheap and too useful to ignore. It also exists because it forced the field to confront a less comfortable observation: the model’s “reasoning” wasn’t a property of the model alone — it was a property of how much room you gave it to compute on the way to an answer.
Why it matters now
Every modern frontier model — and most of the cheap ones built on top of distilled versions — leans on this in some form:
- Reasoning models (the OpenAI o-series, DeepSeek-R1, Anthropic’s extended-thinking modes, Google’s Gemini Thinking variants, and others) are essentially “models that have been trained to produce a long, hidden chain of thought before the visible answer, and rewarded when the final answer is correct.” The first wave (o1, late 2024) made the category real; everything since has iterated on it. The category is the production form of the prompting trick.
- Tool-use loops and agents are CoT in disguise. The model writes a thought, calls a tool, reads the result, writes another thought. The visible structure is “agent harness,” but the engine driving it is the same intermediate-token machinery.
- Pricing. Reasoning tokens are billed. A model that “thinks” for 10k tokens before answering costs roughly 10k tokens more than one that doesn’t. CoT is now a line item, not a free trick.
- Evaluation. Benchmarks split into “with reasoning” and “without.” A score on GSM8K or a competition math set means almost nothing unless you know which budget the model was given.
The practical version of the question, for an engineer in 2026, isn’t “should I add ‘think step by step’ to my prompt?” — it’s “how much intermediate computation does this task actually need, and am I paying for it on purpose?”
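To make "paying for it on purpose" concrete, here is a back-of-the-envelope cost comparison. The per-million-token prices are placeholders, not any provider's actual rates; the one structural assumption is that reasoning tokens bill as output tokens, which is the common pattern on current APIs.

```python
# Back-of-the-envelope cost of reasoning tokens. Prices are made-up
# placeholders; the structural assumption is that reasoning tokens are
# billed at the output-token rate.
def query_cost(prompt_tokens, reasoning_tokens, answer_tokens,
               input_price_per_m=1.00, output_price_per_m=4.00):
    billed_output = reasoning_tokens + answer_tokens
    return (prompt_tokens * input_price_per_m
            + billed_output * output_price_per_m) / 1_000_000

direct = query_cost(prompt_tokens=500, reasoning_tokens=0, answer_tokens=50)
thinking = query_cost(prompt_tokens=500, reasoning_tokens=10_000, answer_tokens=50)
print(f"direct: ${direct:.4f}  with 10k thinking tokens: ${thinking:.4f}")
```

At these made-up rates the thinking query costs roughly sixty times the direct one, for the same question. That is the trade you want to be making deliberately, not by default.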
The short answer
chain-of-thought = let the model write intermediate tokens before the answer + condition each new token on all the previous ones
CoT works because a transformer’s “thinking” is the tokens it produces. Each new token is computed from the prompt plus everything the model has already written. If the answer is the very next token, the model has exactly one forward pass to get there. If the model is allowed to write a paragraph of working first, it has dozens or hundreds of forward passes, each one able to use the previous ones as scratch space. More tokens before the answer means more compute, more intermediate state, and more chances to recover from a bad guess.
That’s the mechanical story. The harder question — why does the scratch space help so much, given that the model wasn’t explicitly trained to use it? — is the part nobody has fully nailed down.
How it works
Three things are happening at once. It’s worth keeping them apart.
1. Compute per answer scales with output length
A transformer does a roughly fixed amount of work per token generated. If your task secretly requires composing several sub-results, and you demand the final answer in one token, you’ve capped the model at one forward pass of compute. Asking for the working out gives it more forward passes. There’s a line of theoretical work formalizing this — showing that transformers with a CoT scratchpad can solve problem classes that fixed-depth, no-scratchpad transformers provably cannot. I won’t pretend to know the exact formal results well enough to cite them precisely; the rough shape is “scratchpad tokens are not just nice, they expand what the architecture can express.”
If you only remember one thing: CoT buys the model more compute, and that compute is the resource it was short on.
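A crude sketch of that accounting, using the standard rough rule that one generated token is one forward pass costing about 2 × (parameter count) FLOPs — deliberately ignoring attention's context-length term, KV-cache overhead, and everything else that makes real accounting messy:

```python
# Rough rule of thumb: one generated token ~ one forward pass ~
# 2 * n_params FLOPs (ignoring attention's context-length term).
def decode_flops(n_params, generated_tokens):
    return 2 * n_params * generated_tokens

params = 7e9  # a hypothetical 7B-parameter model
direct = decode_flops(params, generated_tokens=1)      # answer in one token
with_cot = decode_flops(params, generated_tokens=300)  # 300-token scratchpad
print(f"{with_cot / direct:.0f}x more compute spent on the same question")
```

Under this deliberately crude model, a 300-token scratchpad is 300 forward passes where the direct answer got exactly one.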
2. Each token conditions on the ones before it
A transformer is autoregressive: token n is sampled from a distribution conditioned on tokens 1…n−1. When the model writes “first, let’s compute 17 × 6 = 102,” that string is now in the context for every subsequent token. The next sentence can use 102 as if it were given in the prompt. The model has, effectively, written itself a note.
This is why CoT can fail in fascinating ways. If the model writes a wrong intermediate step (“17 × 6 = 112”), every later step conditions on the wrong number. The error doesn’t just survive — it gets elaborated. CoT outputs are often confidently coherent stories built on top of an early arithmetic slip. That’s a feature of the mechanism, not a bug to be patched.
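A toy illustration of both points at once — self-conditioning and error propagation. Nothing here is a model; each "step" is a hand-written function that can only see what earlier steps wrote down, which is the property that matters:

```python
# Cartoon of self-conditioning: each step sees only the notes written
# so far, so one wrong intermediate poisons every later step.
def solve(scratchpad_steps):
    context = "Q: A box holds 17 rows of 6 apples. Half get eaten. How many remain?\n"
    notes = {}
    for step in scratchpad_steps:
        line = step(notes)      # the step reads earlier notes...
        context += line + "\n"  # ...and appends its own to the context
    return context, notes.get("answer")

def multiply(notes):
    notes["product"] = 17 * 6   # correct step: writes 102
    return f"First, 17 * 6 = {notes['product']}."

def bad_multiply(notes):
    notes["product"] = 112      # a single arithmetic slip
    return f"First, 17 * 6 = {notes['product']}."

def halve(notes):
    # Conditions on whatever the previous step wrote, right or wrong.
    notes["answer"] = notes["product"] // 2
    return f"Then half of {notes['product']} is {notes['answer']}."

_, good = solve([multiply, halve])      # 102 // 2 = 51
_, bad = solve([bad_multiply, halve])   # 112 // 2 = 56, coherently wrong
print(good, bad)
```

The `bad` trace reads just as fluently as the `good` one — every step after the slip is locally valid reasoning about a wrong number.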
3. The pretraining corpus is full of step-by-step reasoning
Textbook solutions, math-stack threads, code with comments, legal opinions, chess annotations — the internet is saturated with examples of “here’s the problem, here’s the working, here’s the answer.” A model that has read enough of that has learned, statistically, that the distribution of correct answers given a worked-out scratchpad is sharper than the distribution of correct answers given the bare question. When you prompt it to think step by step, you’re nudging it into the part of its training distribution where this pattern lives. That’s a meaningful slice of why a prompt — no weight changes — moves the needle.
The 2022 papers that put CoT on the map (Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” and Kojima et al., “Large Language Models are Zero-Shot Reasoners,” which introduced the now-iconic “Let’s think step by step” trigger) are the canonical references for the empirical effect. The “why does this really work” literature is messier, and I don’t think there’s consensus.
What changed with reasoning models
Plain CoT prompting was a free trick: add a phrase, get better results. Reasoning-model training is the same idea, but moved from prompt-time to train-time. Roughly: generate many CoTs per problem, score the final answers, and reinforce the model toward producing CoTs that lead to correct answers — often via reinforcement learning with verifiable rewards on math and code. The exact recipes behind o1, R1, and the extended-thinking modes are not fully public; the public DeepSeek-R1 paper is the most detailed open description I’m aware of, and even it leaves real gaps about the data mixture and reward shaping.
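The flavor of the loop can be sketched as rejection sampling against a verifier: sample several chains per problem, keep the ones whose final answer checks out, reinforce toward those. This is a cartoon with a fake "model" standing in for the sampler; the real recipes differ in ways that mostly aren't public:

```python
import random

# Cartoon of training with verifiable rewards: sample chains of thought,
# verify the final answers, keep the survivors as reinforcement targets.
# `sample_cot` is a stand-in, not a real model.
def sample_cot(problem, rng):
    # Pretend model: sometimes slips on the intermediate product.
    product = 17 * 6 if rng.random() < 0.6 else 112
    answer = product // 2
    return f"17*6={product}; half is {answer}", answer

def collect_training_data(problem, correct_answer, n_samples=16, seed=0):
    rng = random.Random(seed)
    samples = [sample_cot(problem, rng) for _ in range(n_samples)]
    # Verifiable reward: 1 if the final answer checks out, 0 otherwise.
    kept = [cot for cot, ans in samples if ans == correct_answer]
    return kept  # in a real pipeline, these become RL / fine-tuning targets

kept = collect_training_data("apples problem", correct_answer=51)
print(f"kept {len(kept)} of 16 sampled chains")
```

The key property this cartoon does share with the real thing: the reward only looks at the final answer, so the model is free to discover whatever scratchpad style gets it there.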
The visible consequence: the model emits a long internal monologue (often hidden from the user), and the answer that follows is markedly better on tasks that benefit from being worked through. The invisible cost: those tokens are real compute, and your latency and bill scale with them.
Where CoT misleads you
A few seams worth knowing:
- CoT is not a window into the model’s “real” reasoning. It’s a generated artifact, sampled from the same model. There’s a growing literature (Turpin et al. and others) showing that models can produce a fluent CoT that is not the actual cause of their answer — the answer correlates with biases in the prompt the CoT never mentions. Treat the trace as suggestive, not as a proof.
- It can hurt on easy tasks. Forcing a model to ramble through a trivial classification can introduce errors that wouldn’t have happened in a one-token answer. The “think step by step” reflex is a tool, not a constant.
- The improvement is not uniform across model sizes. The original CoT papers reported that small models barely benefited, or got worse, while large models benefited a lot — the so-called “emergence” of CoT ability. Whether “emergence” is a real discontinuity or an artifact of how we measure has been argued back and forth (see Schaeffer et al., “Are Emergent Abilities of Large Language Models a Mirage?”); I don’t think it’s settled.
Famous related terms
- Self-consistency — self-consistency = sample N chains of thought + majority vote on final answer. A cheap variance reducer on top of CoT.
- Tree-of-thoughts — ToT ≈ CoT + branching + a search procedure. Explore multiple reasoning paths and prune. More expensive, sometimes much better on planning-shaped tasks.
- Reasoning model — reasoning model = base LLM + RL training that rewards correct answers after long internal CoT. The production form of the trick.
- RLHF — the older “make the model behave” lever. Reasoning training is a sibling: same RL machinery, different reward.
- In-context learning — the broader phenomenon CoT lives inside: models adapt their behavior to patterns shown in the prompt, with no weight updates. See in-context learning.
- Test-time compute — test-time compute = work done per query at inference, not at training. CoT is the original way to spend it; sampling, search, and verifier-guided decoding are newer ways.
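Self-consistency is simple enough to sketch in full. The sampling side is stubbed out here — read each list entry as the final answer extracted from one independently sampled, high-temperature chain of thought:

```python
from collections import Counter

def self_consistency(final_answers):
    """Majority vote over the final answers of N sampled chains of thought."""
    winner, votes = Counter(final_answers).most_common(1)[0]
    return winner, votes / len(final_answers)

# Final answers from nine hypothetical sampled chains:
sampled = [51, 51, 56, 51, 51, 48, 51, 56, 51]
answer, agreement = self_consistency(sampled)
print(answer, agreement)  # 51 wins, with 6 of 9 chains agreeing
```

Note that the chains themselves are thrown away — only the final answers vote. That is what makes it a variance reducer rather than a search procedure.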
Going deeper
- Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (2022) — the paper that named the effect.
- Kojima et al., Large Language Models are Zero-Shot Reasoners (2022) — the “let’s think step by step” version, no examples needed.
- Turpin et al., Language Models Don’t Always Say What They Think (2023) — the most-cited demonstration that CoT traces can be unfaithful to the actual reasoning.
- DeepSeek-AI, DeepSeek-R1 technical report (2025) — the most detailed public description of training a reasoning model, useful even if you never plan to train one.
What I’m confident about: the mechanical story (more tokens = more compute = more intermediate state) and the empirical effect on hard multi-step tasks. What I’m less confident about: the precise mix of “extra compute,” “self-conditioning on intermediate results,” and “matching a training-distribution pattern” that explains how much CoT helps on any given task. If someone tells you they have that decomposition pinned down, ask for the citation.