Why reasoning models exist
Why we suddenly have a separate class of LLMs that 'think before answering' — and what changed to make spending compute at inference, not training, the new lever.
Why it exists
For most of the modern LLM era, there was one knob: make the model bigger, or train it on more data. You typed a question, and the model produced its answer in roughly the same number of tokens regardless of how hard the question was. A trivia question and an Olympiad math problem cost about the same.
That always felt wrong. Humans don’t think this way. A hard problem takes longer. You scribble. You backtrack. You try a thing, notice it’s not working, try something else. The amount of effort you spend scales with the difficulty of the problem.
In September 2024, OpenAI shipped o1-preview (the full o1 followed in December), a model that did something visibly different: before answering, it produced a long internal monologue — sometimes hundreds of tokens, sometimes tens of thousands — and the longer it was allowed to think, the better its answers got on hard problems. On the 2024 AIME math contest, GPT-4o scored about 9% pass@1; o1-preview reached ~45% and o1 reached ~74%, with majority-vote and reranking pushing higher still. Then DeepSeek-R1 shipped in January 2025 with an open paper and weights showing a recipe for how to train a model to behave this way. After that the category had a name — “reasoning model” — and most major frontier labs shipped one within the following year.
The reason this category exists is that the old knob (more pretraining) was getting expensive and slow, and a second knob — let the model spend more compute per query at inference — turned out to be a real, separate axis of improvement. Reasoning models are the productized form of that second knob.
Why it matters now
If you’re a software engineer in 2026, the practical consequences matter every time you pick a model:
- Two pricing modes, not one. A reasoning model bills you for output tokens you never see — the reasoning tokens in its hidden scratchpad. A “thinking” call can cost much more than the same prompt on a non-reasoning model — how much more depends on the provider, the model, and how long it decides to think. Sometimes that’s worth it; sometimes it isn’t. You need a feel for which.
- Latency is now a deliberate trade. Non-reasoning models stream the answer in seconds. A reasoning model can sit silent for tens of seconds — sometimes longer on hard problems — before the first user-visible token. Products typically surface this with a “thinking…” indicator because the silence otherwise looks broken.
- A new dial in your code. Reasoning APIs typically expose an effort setting — OpenAI’s reasoning models take reasoning.effort with low/medium/high, and most providers offer some form of token budget. The exact knobs vary by provider, and picking one well requires understanding what it actually does (a minimal sketch follows this list).
- Different failure modes. A reasoning model that gets the answer wrong doesn’t fail like a non-reasoning model. It fails after a long, confident-looking deliberation. The hallucination is now reasoned-into.
- Agents lean on them. Long-horizon tool-using agents are unusually sensitive to per-step quality. A reasoning model is often the cheapest way to buy that quality without a fine-tune.
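Here is roughly what that dial looks like in code. This is a minimal sketch assuming the OpenAI Python SDK’s Responses API shape; the model name and the exact parameter spelling are the parts to check against your provider’s docs, since other providers expose the same idea through different knobs (often a thinking-token budget rather than an effort level).

```python
# Minimal sketch of the effort dial, assuming the OpenAI Python SDK's
# Responses API. Model name and parameter spelling are examples; other
# providers expose the same idea as a thinking-token budget instead.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="o4-mini",                  # any reasoning-capable model
    reasoning={"effort": "high"},     # "low" | "medium" | "high"
    input="Prove there are infinitely many primes of the form 4k+1.",
)

print(response.output_text)           # the visible answer; the thinking stays hidden
```

The point isn’t the exact spelling; it’s that effort is now an argument you pass per request, which makes it a decision your code has to make.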
If you don’t have a model in your head for why this category exists, you’ll either pay for thinking you didn’t need, or skip it on the problems that actually require it.
The short answer
reasoning model = base LLM + RL training that rewards correct final answers after long internal chain-of-thought
A reasoning model is a normal language model that has been post-trained — usually with reinforcement learning against verifiable answers — to first emit a long internal scratchpad and only then emit the answer. At inference time, that scratchpad is where the extra compute goes. More thinking tokens, more compute spent per query, better answers on problems that benefit from search and self-checking.
How it works
To see why this is its own thing rather than “just chain-of-thought,” you have to look at three connected ideas: the limit reasoning models were invented to push on, the training trick that makes the long scratchpad reliable, and the new scaling curve that makes any of this worth doing.
The limit they push on
Pretraining scaling — bigger model + more tokens, the scaling laws result — has driven most LLM progress so far. But pretraining gives you the same compute budget per query regardless of difficulty. Whether you ask “what is 2+2” or “prove that there are infinitely many primes of the form 4k+1,” a vanilla LLM does roughly the same amount of work: one forward pass per output token, no looking back, no trying-and-discarding.
Chain-of-thought prompting already showed that letting the model write more tokens before the answer helped on hard problems. That’s a free lunch in one sense — no retraining — but it’s fragile. The model wasn’t trained to think carefully on a scratchpad; it was trained to predict next tokens. So its “thinking” mimics tutorials in the training data more than it actually searches.
The reasoning-model bet is: if you train a model so the reward signal flows from “did you get the right final answer” back into the scratchpad, it will learn to use the scratchpad for what scratchpads are for — backtracking, double-checking, trying multiple approaches. And the published results back this up. OpenAI’s announcement showed that o1’s performance keeps climbing with more test-time compute; the DeepSeek-R1 paper reports self-correction and verification behaviors emerging from RL alone in their R1-Zero variant — trained with no supervised reasoning traces at all. (The final R1 model adds back a small amount of cold-start data and several SFT/RL stages on top; R1-Zero is the cleaner statement of the emergent-reasoning claim.)
The training trick: verifiable rewards
The DeepSeek-R1 recipe is the most public version of this and worth internalizing because every other lab is doing some variant.
Standard RLHF uses a learned reward model that imitates human preferences — useful for “be helpful, be polite,” but a noisy and gameable signal for “is this proof correct.” DeepSeek’s paper describes a rule-based reward system with two parts: an accuracy reward (for math problems, force the model to put its final answer in a known format and check it with a symbolic verifier; for code, run the unit tests) plus a format reward for using the scratchpad correctly. No learned reward model, no human raters in the loop. The community has since latched onto the shorthand reinforcement learning with verifiable rewards (RLVR) for this family of techniques; the term isn’t from the R1 paper itself.
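To make that concrete, here is a toy version of the two-part reward. It is a sketch in the spirit of the paper, not DeepSeek’s code: the scratchpad tags match the R1 prompt template, but the exact-string accuracy check stands in for what is really a symbolic math verifier or a sandboxed test run, and the weights are invented.

```python
# Toy illustration of a rule-based ("verifiable") reward in the spirit of the
# R1 recipe: an accuracy reward from a deterministic check on the final answer,
# plus a format reward for using the scratchpad correctly. The weights and the
# string-match check are simplifications of the real verifiers.
import re

def reward(completion: str, reference_answer: str) -> float:
    score = 0.0

    # Format reward: did the model actually use the scratchpad structure?
    if re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", completion, re.DOTALL):
        score += 0.1

    # Accuracy reward: extract the final answer and check it deterministically.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match and match.group(1).strip() == reference_answer.strip():
        score += 1.0

    return score  # fed straight to the RL algorithm (GRPO, in DeepSeek's case)
```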
What “emerged” from this training, per the paper, was a set of behaviors nobody hand-wrote: the model started spending more tokens on harder problems, started saying things like “wait, let me reconsider,” started trying alternative approaches when the first one wasn’t working. This is the load-bearing claim of the reasoning-model story: long, useful scratchpads are an emergent property of training against a correctness signal, not something you can reliably get from prompting.
The standard caveat is that not every domain has a clean verifier. Math and code do. “Write a kind condolence email” doesn’t. This is one reason reasoning models tend to dominate math/code/STEM benchmarks much more than they dominate, say, creative-writing benchmarks. I don’t have a cleanly-sourced number for the size of that gap by domain — treat the direction as well-established and the magnitude as task-specific.
The new scaling curve
The conceptual reason this is a Big Deal, not a parlor trick, is the shape of the curve.
Pretraining scaling is the well-known story: more training compute buys you a roughly predictable improvement in loss. Reasoning models opened a second curve: at inference time, holding the model fixed, spending more compute per query also buys measurable improvement on hard tasks — at least up to a point. OpenAI’s original o1 chart showed AIME accuracy climbing roughly linearly against the log of test-time compute.
Two important caveats people skip past:
- Log axes flatter brute force. A linear-log chart can make expensive compute look like clean, predictable scaling. To go from 80% to 90% on a benchmark might cost 10x or 100x more inference compute than 70% to 80%. The line is straight; the underlying resource demand is exponential. Toby Ord’s Inference Scaling and the Log-x Chart (Jan 2025) is the cleanest argument that these charts are partly a visual rhetorical move. (A worked example follows this list.)
- Diminishing returns and ceilings exist. Recent work (e.g. arXiv:2502.12215) argues that some o1-like models stop improving past a certain thinking budget; they plateau or even degrade if forced to think longer. The “more compute = better” story is real but bounded.
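A quick back-of-envelope calculation makes the log-axis caveat concrete. The slope below is invented for illustration, not read off OpenAI’s chart; with this assumed slope, every additional ten accuracy points costs another 10x of inference compute.

```python
# Back-of-envelope: if accuracy rises linearly in log10(compute), each fixed
# gain in accuracy costs a fixed *multiple* of compute. Slope is assumed.
slope = 10.0          # accuracy points gained per 10x more test-time compute (assumed)
base_compute = 1.0    # arbitrary compute units at the starting accuracy

def compute_multiple(gain_points: float) -> float:
    """Compute multiple needed for a given accuracy gain under the linear-log model."""
    return base_compute * 10 ** (gain_points / slope)

print(compute_multiple(10))   # +10 points: 10x the compute
print(compute_multiple(20))   # +20 points: 100x the compute (the straight line hides this)
```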
Still, even with those caveats, there’s now a second scaling axis. You can buy quality at training time or at inference time. That changes how labs build models (smaller base + heavier RL run) and how engineers deploy them (route easy queries to a cheap model, hard ones to a reasoning model with a budget).
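The deployment half of that sentence tends to look something like the sketch below. The difficulty heuristic and the model names are placeholders I made up; in practice the routing decision is often a cheap classifier or a small LLM call rather than a keyword check.

```python
# Sketch of the routing pattern: cheap model for easy queries, reasoning model
# with a capped effort for hard ones. Heuristic and model names are placeholders.
def looks_hard(prompt: str) -> bool:
    hard_markers = ("prove", "debug", "optimize", "step by step", "why does")
    return len(prompt) > 500 or any(m in prompt.lower() for m in hard_markers)

def route(prompt: str) -> dict:
    if looks_hard(prompt):
        return {"model": "reasoning-model", "reasoning": {"effort": "medium"}}
    return {"model": "small-fast-model"}
```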
Where the seams show
A few things worth knowing if you’re shipping with these models:
- You don’t see the raw thinking. Providers deliberately hide the raw scratchpad. OpenAI ships a summarized version in the chat UI and bills the hidden tokens as output tokens via the API. The stated reasons mix safety monitoring with competitive moat; whichever you find more convincing, the practical effect is that you’re paying for tokens you can’t audit.
- Context window pressure. Reasoning tokens consume context window during generation. A reasoning model burning 20k thinking tokens has 20k fewer tokens of room for the rest of the response. OpenAI’s docs note the hidden tokens are discarded between turns, but within a single turn they compete with your prompt and the visible answer for the same budget. Long-context tasks plus heavy reasoning are uncomfortably close to the wall. (A budgeting sketch follows this list.)
- Distillation works. A common pattern is to use a frontier reasoning model to generate scratchpads, then fine-tune a smaller base model on those traces. The s1 paper (Muennighoff et al., Stanford / UW / AI2 / Contextual AI, 2025) reproduced strong test-time-scaling behavior by fine-tuning on just 1,000 curated reasoning traces and adding a simple “budget forcing” trick to control thinking length. My read is that this is suggestive — once a frontier model exists to generate traces, copying its reasoning shape into a smaller model is much cheaper than inventing it from scratch — but that’s interpretation, not the paper’s headline claim. (See the related post on distillation.)
- It’s not magic for everything. Reasoning models are roughly same-or-worse than their non-reasoning siblings on tasks that don’t benefit from deliberation: simple chat, summarization, classification, cheap function calls. Spending 30 seconds and $0.50 to answer “what time is it in Tokyo” is a misuse of the tool.
- The split is probably temporary. It’s not obvious that “reasoning model” will stay a separate SKU long-term. The natural end state is one model with a knob that decides per-query whether to think, and how long. The current split is partly product packaging, partly training-recipe maturity. I don’t know which way it’ll go and I haven’t seen a confident public prediction worth trusting.
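A rough budgeting sketch for the context-window point above. The numbers are made up, and the usage field names follow the OpenAI Responses API as I understand it; other providers report reasoning tokens differently, and some don’t report them at all.

```python
# Rough budgeting sketch. Numbers are assumptions; usage field names follow
# the OpenAI Responses API as I understand it, and other providers differ.
CONTEXT_WINDOW = 200_000      # assumed context limit for the model in question

def fits_in_context(prompt_tokens: int, max_output_tokens: int) -> bool:
    """max_output_tokens must cover BOTH hidden reasoning and the visible answer."""
    return prompt_tokens + max_output_tokens <= CONTEXT_WINDOW

print(fits_in_context(prompt_tokens=8_000, max_output_tokens=32_000))    # True
print(fits_in_context(prompt_tokens=180_000, max_output_tokens=32_000))  # False: no room left to think

# After a call, the usage object shows how much of the output was hidden thinking:
#   hidden  = response.usage.output_tokens_details.reasoning_tokens
#   visible = response.usage.output_tokens - hidden
```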
The compressed version: pretraining bought you a smarter model. Reasoning training buys you a model that knows when to keep working. The scratchpad is the place where compute lives at inference time, and training the model to use it well is the actual product.
Famous related terms
- Chain-of-thought — CoT = prompt the model to write its reasoning before the answer. The prompt-time ancestor; reasoning models bake the same idea into training.
- Test-time compute — test-time compute = work done per query at inference, not at training. The axis reasoning models scale on.
- RLVR (Reinforcement Learning with Verifiable Rewards) — RLVR = RL + a deterministic checker for the answer instead of a learned reward model. The training trick behind DeepSeek-R1; covered in detail in the R1 paper.
- RLHF — RLHF = SFT + reward model + RL loop. The older post-training recipe RLVR partially replaces for verifiable domains.
- Reasoning tokens — reasoning tokens = output tokens emitted into a hidden scratchpad before the visible answer. What you’re billed for and can’t read.
- Distillation — distillation = teacher model + student trained on teacher’s outputs. How reasoning behavior gets copied from a frontier model into a smaller, cheaper one.
Going deeper
- Learning to reason with LLMs (OpenAI, September 12, 2024) — the o1-preview launch post, with the AIME-vs-test-time-compute chart that kicked off the current wave.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek, arXiv:2501.12948) — the most detailed public description of training a reasoning model with pure RL on verifiable rewards.
- s1: Simple test-time scaling (Muennighoff et al., arXiv:2501.19393) — argues that good test-time-scaling behavior can be reproduced from a small curated set of reasoning traces, and proposes simple ways to control thinking length at inference.
- Revisiting the Test-Time Scaling of o1-like Models (arXiv:2502.12215) — pushes back on the “more thinking always helps” framing; reads well alongside the OpenAI/DeepSeek optimistic curves.
- Toby Ord, Inference Scaling and the Log-x Chart — a sharp reading of why log-x scaling charts can look more impressive than the underlying compute economics actually are.