Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why does chain-of-thought prompting work?

Adding 'let's think step by step' to a prompt makes models measurably better at hard problems. Nobody fully agrees on why, and the wrong story will mislead you about how to use it.

AI & ML · intermediate · Apr 29, 2026

Why it exists

Around 2022 a strange empirical fact got named. If you asked a sufficiently large LLM a multi-step word problem and demanded the answer directly, it often flubbed it. If you asked it the same problem and added something like “think step by step” — or showed it a few worked examples with the intermediate steps written out — accuracy jumped, sometimes by tens of points on the same benchmark. No new training. No new weights. Just more tokens between the question and the answer.

That fact was a problem for the mental model people had. The standard account of an LLM was “next-token predictor” — a giant lookup that, given the prompt, samples a likely continuation. Under that account, the content of the prompt should matter, but the length of the model’s own scratchwork shouldn’t. Yet here was a clean empirical handle: let the model write more before answering and it gets things right that it otherwise gets wrong.

Chain-of-thought (CoT) prompting exists because once that handle was discovered, it was too cheap and too useful to ignore. It also exists because it forced the field to confront a less comfortable observation: the model’s “reasoning” wasn’t a property of the model alone — it was a property of how much room you gave it to compute on the way to an answer.

Why it matters now

Every modern frontier model — and most of the cheap ones built on top of distilled versions — leans on this in some form: prompt-time CoT, hidden reasoning traces behind extended-thinking modes, or step-by-step data distilled into smaller models.

The practical version of the question, for an engineer in 2026, isn’t “should I add ‘think step by step’ to my prompt?” — it’s “how much intermediate computation does this task actually need, and am I paying for it on purpose?”

The short answer

chain-of-thought = let the model write intermediate tokens before the answer + condition each new token on all the previous ones

CoT works because a transformer’s “thinking” is the tokens it produces. Each new token is computed from the prompt plus everything the model has already written. If the answer is the very next token, the model has exactly one forward pass to get there. If the model is allowed to write a paragraph of working first, it has dozens or hundreds of forward passes, each one able to use the previous ones as scratch space. More tokens before the answer means more compute, more intermediate state, and more chances to recover from a bad guess.
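Here's the contrast in code. This is a minimal sketch with a placeholder `complete()` helper standing in for whatever completion API you actually call; the function, its behavior, and the prompts are illustrative, not a specific library's interface.

```python
# Placeholder client: swap in a real API call (OpenAI, Anthropic, etc.).
def complete(prompt: str, max_tokens: int = 256) -> str:
    return "<model output>"  # stub so the sketch runs end to end

question = (
    "A train leaves at 2:40pm and the trip takes 85 minutes. "
    "When does it arrive?"
)

# Direct: the answer has to be among the very next tokens, so the model
# gets roughly one forward pass per answer token.
direct = complete(f"{question}\nAnswer with only the time:", max_tokens=8)

# CoT: hundreds of forward passes, each conditioned on the working so far.
cot = complete(f"{question}\nLet's think step by step.", max_tokens=512)
```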

That’s the mechanical story. The harder question — why does the scratch space help so much, given that the model wasn’t explicitly trained to use it? — is the part nobody has fully nailed down.

How it works

Three things are happening at once. It’s worth keeping them apart.

1. Compute per answer scales with output length

A transformer does a roughly fixed amount of work per token generated. If your task secretly requires composing several sub-results, and you demand the final answer in one token, you’ve capped the model at one forward pass of compute. Asking for the working out gives it more forward passes. There’s a line of theoretical work formalizing this — showing that transformers with a CoT scratchpad can solve problem classes that fixed-depth, no-scratchpad transformers provably cannot. I won’t pretend to know the exact formal results well enough to cite them precisely; the rough shape is “scratchpad tokens are not just nice, they expand what the architecture can express.”
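To put a number on it, here's a back-of-envelope sketch using the common rule of thumb that a forward pass costs roughly 2 FLOPs per parameter per token. The model size and token counts are assumptions picked for illustration:

```python
# Rule of thumb: a forward pass costs ~2 FLOPs per parameter per token.
# (Illustrative; ignores attention's quadratic term and KV-cache details.)
PARAMS = 70e9  # assumed model size: 70B parameters

def generation_flops(n_output_tokens: int) -> float:
    return 2 * PARAMS * n_output_tokens

direct = generation_flops(1)    # answer demanded in a single token
cot = generation_flops(300)     # a paragraph of working first

print(f"direct: {direct:.1e} FLOPs, CoT: {cot:.1e}, ratio: {cot/direct:.0f}x")
# The CoT answer gets ~300x the forward-pass compute of the direct one.
```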

If you only remember one thing: CoT buys the model more compute, and that compute is the resource it was short on.

2. Each token conditions on the ones before it

A transformer is autoregressive: token n is sampled from a distribution conditioned on tokens 1…n−1. When the model writes “first, let’s compute 17 × 6 = 102,” that string is now in the context for every subsequent token. The next sentence can use 102 as if it were given in the prompt. The model has, effectively, written itself a note.
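That mechanism is easiest to see in the decode loop itself. A minimal greedy-decoding sketch, where `model` stands in for any decoder-only transformer that maps token IDs to logits:

```python
import torch

def decode(model, prompt_ids: list[int], max_new: int, eos_id: int) -> list[int]:
    """Greedy autoregressive decoding: token n is computed from tokens 1..n-1."""
    tokens = list(prompt_ids)
    for _ in range(max_new):
        # One full forward pass per new token. Everything written so far,
        # including the model's own intermediate steps, is in the input.
        logits = model(torch.tensor([tokens]))  # shape: (1, len(tokens), vocab)
        next_id = int(logits[0, -1].argmax())   # condition on the last position
        tokens.append(next_id)                  # the "note to self" lands here
        if next_id == eos_id:
            break
    return tokens
```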

This is why CoT can fail in fascinating ways. If the model writes a wrong intermediate step (“17 × 6 = 112”), every later step conditions on the wrong number. The error doesn’t just survive — it gets elaborated. CoT outputs are often confidently coherent stories built on top of an early arithmetic slip. That’s a feature of the mechanism, not a bug to be patched.
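A toy model makes the compounding concrete. Assume, crudely, that the chain is right only if every step is right, with per-step accuracy p; real chains aren't this independent, but the shape of the decay is the point:

```python
def chain_accuracy(p_step: float, n_steps: int) -> float:
    # Toy model: every step must be correct, because each step
    # conditions on the ones before it.
    return p_step ** n_steps

for n in (1, 5, 10, 20):
    print(n, round(chain_accuracy(0.95, n), 3))
# With 95% per-step accuracy: 0.95, 0.774, 0.599, 0.358
```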

3. The pretraining corpus is full of step-by-step reasoning

Textbook solutions, math-stack threads, code with comments, legal opinions, chess annotations — the internet is saturated with examples of “here’s the problem, here’s the working, here’s the answer.” A model that has read enough of that has learned, statistically, that the distribution of correct answers given a worked-out scratchpad is sharper than the distribution of correct answers given the bare question. When you prompt it to think step by step, you’re nudging it into the part of its training distribution where this pattern lives. That’s a meaningful slice of why a prompt — no weight changes — moves the needle.
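This is also the logic behind few-shot CoT: you prepend worked examples so the continuation lands in that part of the distribution. A sketch of the Wei et al.-style prompt shape, with an exemplar paraphrased from their paper:

```python
# Few-shot chain-of-thought: each exemplar shows question -> working -> answer,
# so the continuation for the new question follows the same pattern.
EXEMPLARS = [
    ("Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many now?",
     "Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11.",
     "11"),
]

def build_cot_prompt(question: str) -> str:
    parts = []
    for q, working, answer in EXEMPLARS:
        parts.append(f"Q: {q}\nA: {working} The answer is {answer}.")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```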

The 2022 papers that put CoT on the map (Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” and Kojima et al., “Large Language Models are Zero-Shot Reasoners,” which introduced the now-iconic “Let’s think step by step” trigger) are the canonical references for the empirical effect. The “why does this really work” literature is messier, and I don’t think there’s consensus.

What changed with reasoning models

Plain CoT prompting was a free trick: add a phrase, get better results. Reasoning-model training is the same idea, but moved from prompt-time to train-time. Roughly: generate many CoTs per problem, score the final answers, and reinforce the model toward producing CoTs that lead to correct answers — often via reinforcement learning with verifiable rewards on math and code. The exact recipes behind o1, R1, and the extended-thinking modes are not fully public; the public DeepSeek-R1 paper is the most detailed open description I’m aware of, and even it leaves real gaps about the data mixture and reward shaping.
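Schematically, the training loop looks like the sketch below. Every name in it is a placeholder, and the gaps the paragraph mentions apply here too; this is the publicly described shape, not anyone's actual recipe:

```python
# Schematic of RL with verifiable rewards for reasoning training.
# All functions are placeholders; real recipes (o1, R1) are not fully public.
def extract_answer(chain: str) -> str:
    return chain.strip().splitlines()[-1]  # placeholder answer extraction

def rlvr_step(model, problems, k_samples: int = 8):
    batch = []
    for problem in problems:
        # Sample several chains of thought per problem.
        chains = [model.sample(problem.prompt) for _ in range(k_samples)]
        # Score only the verifiable final answer (math checker, unit tests),
        # not the chain itself.
        rewards = [float(problem.verify(extract_answer(c))) for c in chains]
        batch.append((problem.prompt, chains, rewards))
    # Push the policy toward chains whose answers verified
    # (GRPO/PPO-style updates, per the public descriptions).
    model.reinforce(batch)
```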

The visible consequence: the model emits a long internal monologue (often hidden from the user), and the answer that follows is markedly better on tasks that benefit from being worked through. The invisible cost: those tokens are real compute, and your latency and bill scale with them.
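If you want to feel that cost, a two-line calculator with made-up numbers (prices and decode throughput vary widely by provider and model):

```python
# Illustrative numbers only; both figures are assumptions, not quotes.
PRICE_PER_OUTPUT_TOKEN = 15e-6   # assumed: $15 per million output tokens
TOKENS_PER_SECOND = 60           # assumed decode throughput

def cot_overhead(reasoning_tokens: int, answer_tokens: int):
    total = reasoning_tokens + answer_tokens
    return total * PRICE_PER_OUTPUT_TOKEN, total / TOKENS_PER_SECOND

cost, latency = cot_overhead(reasoning_tokens=4000, answer_tokens=200)
print(f"${cost:.3f} per call, ~{latency:.0f}s of decode time")
# Hidden reasoning tokens are billed and waited on like any others.
```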

Where CoT misleads you

A few seams worth knowing:

- The written chain is not a faithful log. The model can produce a plausible rationale and reach the answer some other way, so treating the trace as ground truth about what the model actually computed is exactly the wrong story this post warned about.
- Errors compound. As in the 17 × 6 example above, a wrong intermediate step gets conditioned on and elaborated, usually with unearned confidence.
- More tokens are not automatically better. You pay for every one in latency and money, and tasks that never needed multi-step composition don't improve just because the model wrote a longer preamble.

Going deeper

What I’m confident about: the mechanical story (more tokens = more compute = more intermediate state) and the empirical effect on hard multi-step tasks. What I’m less confident about: the precise mix of “extra compute,” “self-conditioning on intermediate results,” and “matching a training-distribution pattern” that explains how much CoT helps on any given task. If someone tells you they have that decomposition pinned down, ask for the citation.