Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why reasoning models exist

Why we suddenly have a separate class of LLMs that 'think before answering' — and what changed to make spending compute at inference, not training, the new lever.

AI & ML · intermediate · Apr 29, 2026

Why it exists

For most of the modern LLM era, there was one knob: scale, meaning a bigger model trained on more data. You typed a question and the model produced its answer in roughly the same number of tokens regardless of how hard the question was. A trivia question and an Olympiad math problem cost about the same.

That always felt wrong. Humans don’t think this way. A hard problem takes longer. You scribble. You backtrack. You try a thing, notice it’s not working, try something else. The amount of effort you spend scales with the difficulty of the problem.

In September 2024, OpenAI shipped o1-preview (the full o1 followed in December), a model that did something visibly different: before answering, it produced a long internal monologue — sometimes hundreds of tokens, sometimes tens of thousands — and the longer it was allowed to think, the better its answers got on hard problems. On the 2024 AIME math contest, GPT-4o scored about 9% pass@1; o1-preview reached ~45% and o1 reached ~74%, with majority-vote and reranking pushing higher still. Then DeepSeek-R1 shipped in January 2025 with an open paper and weights showing a recipe for how to train a model to behave this way. After that the category had a name — “reasoning model” — and most major frontier labs shipped one within the following year.

The reason this category exists is that the old knob (more pretraining) was getting expensive and slow, and a second knob — let the model spend more compute per query at inference — turned out to be a real, separate axis of improvement. Reasoning models are the productized form of that second knob.

Why it matters now

If you’re a software engineer in 2026, the practical consequences matter every time you pick a model: reasoning models cost more per query because you pay for the thinking tokens, they take longer to answer, and the uplift concentrates on hard, verifiable problems rather than spreading evenly across tasks.

If you don’t have a model in your head for why this category exists, you’ll either pay for thinking you didn’t need, or skip it on the problems that actually require it.

The short answer

reasoning model = base LLM + RL training that rewards correct final answers after long internal chain-of-thought

A reasoning model is a normal language model that has been post-trained — usually with reinforcement learning against verifiable answers — to first emit a long internal scratchpad and only then emit the answer. At inference time, that scratchpad is where the extra compute goes. More thinking tokens, more compute spent per query, better answers on problems that benefit from search and self-checking.
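To make the two-phase shape concrete, here is a minimal sketch of the inference loop, assuming a hypothetical `model.sample_next` / `model.eos` interface and `<think>` delimiters (DeepSeek-R1 really does use `<think>...</think>` tags; other labs use their own formats and mostly hide the raw scratchpad):

```python
def generate_with_scratchpad(model, prompt: str, max_think_tokens: int) -> str:
    """Minimal sketch of a reasoning model's inference-time contract."""
    text = prompt + "<think>\n"
    # Phase 1: the scratchpad. This is where the extra compute goes; the
    # model can backtrack, verify, and restart inside these tokens.
    for _ in range(max_think_tokens):
        text += model.sample_next(text)
        if text.endswith("</think>"):
            break
    else:
        text += "\n</think>\n"  # budget exhausted: force the answer phase
    # Phase 2: the visible answer, conditioned on the whole scratchpad.
    while not text.endswith(model.eos):
        text += model.sample_next(text)
    return text.split("</think>")[-1].removesuffix(model.eos).strip()
```

The one knob a caller controls is `max_think_tokens`: raising it is exactly the "spend more compute per query" lever.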

How it works

To see why this is its own thing rather than “just chain-of-thought,” you have to look at three connected ideas: the limit reasoning models were invented to push on, the training trick that makes the long scratchpad reliable, and the new scaling curve that makes any of this worth doing.

The limit they push on

Pretraining scaling — bigger model + more tokens, the scaling laws result — has driven most LLM progress so far. But pretraining gives you the same compute budget per query regardless of difficulty. Whether you ask “what is 2+2” or “prove that there are infinitely many primes of the form 4k+1,” a vanilla LLM does roughly the same amount of work: one forward pass per output token, no looking back, no trying-and-discarding.
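A back-of-envelope calculation makes the fixed-cost point visible. A standard approximation puts dense-model decode cost at about 2 FLOPs per parameter per generated token (ignoring attention and KV-cache overhead); the model size and token counts below are invented for illustration:

```python
params = 70e9  # hypothetical 70B-parameter dense model

# A vanilla LLM emits roughly similar-length answers to both of these and
# has no mechanism to decide the second one deserves more work.
for question, out_tokens in [("what is 2+2", 40), ("4k+1 primes proof", 60)]:
    flops = 2 * params * out_tokens  # ~2 FLOPs/param/token rule of thumb
    print(f"{question!r}: {out_tokens} tokens, ~{flops:.1e} FLOPs")
```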

Chain-of-thought prompting already showed that letting the model write more tokens before the answer helped on hard problems. That’s a free lunch in one sense — no retraining — but it’s fragile. The model wasn’t trained to think carefully on a scratchpad; it was trained to predict next tokens. So its “thinking” mimics tutorials in the training data more than it actually searches.

The reasoning-model bet is: if you train a model so the reward signal flows from “did you get the right final answer” back into the scratchpad, it will learn to use the scratchpad for what scratchpads are for — backtracking, double-checking, trying multiple approaches. And the published results back this up. OpenAI’s announcement showed that o1’s performance keeps climbing with more test-time compute; the DeepSeek-R1 paper reports self-correction and verification behaviors emerging from RL alone in their R1-Zero variant — trained with no supervised reasoning traces at all. (The final R1 model adds back a small amount of cold-start data and several SFT/RL stages on top; R1-Zero is the cleaner statement of the emergent-reasoning claim.)
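A toy objective shows why reward on the final answer still trains the scratchpad: in a REINFORCE-style update the reward multiplies the log-probability of every sampled token, thinking tokens included. This is the bare idea only; real recipes (PPO-family methods, GRPO in the DeepSeek-R1 paper) add baselines, clipping, and batching:

```python
import torch

def reinforce_loss(token_logprobs: torch.Tensor, reward: float) -> torch.Tensor:
    """Bare REINFORCE objective, for intuition only.

    `token_logprobs` holds the log-probability of every sampled token in the
    rollout, scratchpad and answer alike. The reward is computed only from
    the final answer, but the gradient touches every token that led to it,
    which is how "use the scratchpad well" gets reinforced.
    """
    return -(reward * token_logprobs).sum()

# Dummy usage: 512 sampled tokens, verifier said the answer was right.
loss = reinforce_loss(torch.full((512,), -1.2), reward=1.0)
```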

The training trick: verifiable rewards

The DeepSeek-R1 recipe is the most public version of this and worth internalizing because every other lab is doing some variant.

Standard RLHF uses a learned reward model that imitates human preferences — useful for “be helpful, be polite,” but a noisy and gameable signal for “is this proof correct.” DeepSeek’s paper describes a rule-based reward system with two parts: an accuracy reward (for math problems, force the model to put its final answer in a known format and check it with a symbolic verifier; for code, run the unit tests) plus a format reward for using the scratchpad correctly. No learned reward model, no human raters in the loop. The community has since latched onto the shorthand reinforcement learning with verifiable rewards (RLVR) for this family of techniques; the term isn’t from the R1 paper itself.
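A sketch of what such a rule-based reward can look like. The delimiters, the \boxed convention, and the weights are my paraphrase of the paper's description, not its implementation; a real accuracy reward would call a symbolic checker (or run unit tests, for code) instead of exact string match:

```python
import re

def rule_based_reward(completion: str, gold_answer: str) -> float:
    reward = 0.0
    # Format reward: the scratchpad was opened and closed correctly.
    if re.search(r"<think>.*?</think>", completion, re.DOTALL):
        reward += 0.1
    # Accuracy reward: final answer in a known, checkable format.
    match = re.search(r"\\boxed\{([^{}]*)\}", completion)
    if match and match.group(1).strip() == gold_answer.strip():
        reward += 1.0
    return reward

# No learned reward model, no human raters: just rules a program can check.
print(rule_based_reward("<think>2 + 2 = 4</think> \\boxed{4}", "4"))  # 1.1
```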

What “emerged” from this training, per the paper, was a set of behaviors nobody hand-wrote: the model started spending more tokens on harder problems, started saying things like “wait, let me reconsider,” started trying alternative approaches when the first one wasn’t working. This is the load-bearing claim of the reasoning-model story: long, useful scratchpads are an emergent property of training against a correctness signal, not something you can reliably get from prompting.

The standard caveat is that not every domain has a clean verifier. Math and code do. “Write a kind condolence email” doesn’t. This is one reason reasoning models tend to dominate math/code/STEM benchmarks far more than they dominate, say, creative-writing benchmarks. I don’t have a cleanly sourced number for the size of that gap by domain — treat the direction as well established and the magnitude as task-specific.

The new scaling curve

The conceptual reason this is a Big Deal, not a parlor trick, is the shape of the curve.

Pretraining scaling is the well-known story: more training compute buys you a roughly predictable improvement in loss. Reasoning models opened a second curve: at inference time, holding the model fixed, spending more compute per query also buys measurable improvement on hard tasks — at least up to a point. OpenAI’s original o1 chart showed AIME accuracy climbing roughly linearly against the log of test-time compute.
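With made-up coefficients, the reported shape implies the cost structure directly: if accuracy is roughly linear in log(compute), every equal step in accuracy demands a multiplicative step in thinking budget:

```python
import math

a, b = 0.10, 0.18  # hypothetical intercept and slope, not o1's real numbers

for budget in [1, 10, 100, 1_000]:  # relative test-time compute
    accuracy = a + b * math.log10(budget)
    print(f"{budget:>5}x compute -> ~{accuracy:.0%} accuracy")
```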

Two important caveats people skip past:

- The x-axis on that chart is logarithmic: each additional point of accuracy costs a multiplicative increase in thinking compute, and the returns flatten eventually.
- The curve was demonstrated on hard, verifiable benchmarks like AIME. It is not a promise that more thinking helps everywhere; on easy or unverifiable tasks the extra tokens can be pure cost.

Still, even with those caveats, there’s now a second scaling axis. You can buy quality at training time or at inference time. That changes how labs build models (smaller base + heavier RL run) and how engineers deploy them (route easy queries to a cheap model, hard ones to a reasoning model with a budget).
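The deployment half of that, in miniature. The model names, the difficulty score, and the `thinking_budget` knob are placeholders; real providers expose this differently (token budgets, effort levels), so read this as the shape of a router, not an API:

```python
def route(prompt: str, difficulty: float) -> dict:
    """Toy query router: spend inference compute only where it pays.

    `difficulty` would come from a cheap classifier or heuristics in a
    real system; thresholds and budgets here are arbitrary placeholders.
    """
    if difficulty < 0.3:
        return {"model": "cheap-fast-model", "thinking_budget": 0}
    if difficulty < 0.7:
        return {"model": "reasoning-model", "thinking_budget": 2_000}
    return {"model": "reasoning-model", "thinking_budget": 30_000}

print(route("what is 2+2", 0.05))        # no thinking tokens
print(route("prove 4k+1 primes", 0.95))  # big scratchpad budget
```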

Where the seams show

A few things worth knowing if you’re shipping with these models:

- You are billed for the thinking tokens, and latency grows with them, so query cost is no longer predictable from the visible answer length.
- The uplift concentrates in domains with verifiers (math, code, STEM); on open-ended tasks the same model can think for a long time and gain little.
- Providers differ in how much of the raw scratchpad they expose and in how the thinking budget is controlled, so treat those knobs as per-vendor details rather than a standard.

The compression is: pretraining bought you a smarter model. Reasoning training buys you a model that knows when to keep working. The scratchpad is the place where compute lives at inference time, and training the model to use it well is the actual product.

Going deeper