Why AI runs away in verifiable domains
AI is getting superhuman fastest at the things a computer can grade — math, code, formal proofs — and lagging behind on the things it can't. The reason isn't that those domains are "easier." It's that training has a feedback step, and feedback needs a verifier.
Why it exists
If you plot frontier-model capability over the last two years, the curves fan out by domain in a way that should bother you.
On the things a computer can grade — competition math, contest programming, agentic coding tasks with a passing test suite — performance has gone from “junior intern, sometimes” to “world-class, routinely” in well under two years. AIME 2024: GPT-4o landed around 12% pass@1; o1 reached the mid-70s; the next generation of reasoning models cleared that bar without straining. The original SWE-bench (real GitHub issues with held-out tests, released October 2023) saw best results in the low single digits; SWE-bench Verified, the cleaner subset that launched in August 2024, climbed past 70% in 2025 — Anthropic and OpenAI have both reported results in that range.
On the things a computer can’t grade — long-form writing that has to be good, judgement calls under genuine uncertainty, taste, knowing when an idea is bad before you ship it — there’s progress, but it’s the diffuse, debatable kind. People argue about whether new models are actually better writers or just more confident ones. There’s no AIME score for “wrote a memo your boss respected.”
The shape of this gap isn’t an accident, and it isn’t going to close on its own. The thing powering the runaway curves on the verifiable side — massive RLVR-style training on problems where a checker can score every attempt — requires a checker. Where a checker exists, you can run the loop millions of times. Where one doesn’t, you’re back to expensive, noisy human preference data, and that ceiling is much lower.
So the principle is: AI makes the fastest progress in domains where its output can be easily verified, because verification is what lets training scale.
Why it matters now
This isn’t a philosophy point — it’s one of the better predictors of where AI capability will and won’t lurch forward over the next year.
- For builders. If your product wraps a task whose output can be graded by a program — does the code compile, does the SQL return the right rows, did the form submit — bet on the model getting noticeably better every six months and design for that. If your product wraps a task that can only be evaluated by humans — judgement, taste, negotiation, emotional fit — expect drift, not leaps.
- For people picking what to learn. The skills with the biggest AI tailwind right now are the ones with the cleanest verifiers attached. The skills with the smallest tailwind are the ones whose whole value is “knowing what good looks like” in fuzzy domains. Both directions have implications; neither is obvious.
- For interpreting the discourse. A lot of “AI is taking off” / “AI has stalled” arguments are actually arguments about which benchmark family you trust. Math and code curves look like takeoff. Open-ended-task curves look like stall. Both can be true at once because the underlying mechanism only fires on one of them.
- For making sense of lab strategy. “Large-scale RL,” “environments,” and “verifiable rewards” have become standard vocabulary in lab announcements and hiring posts. Read those as: “we found a way to automatically score attempts at X, so we can do for X what was already done for math and code.” The frontier of the frontier is, increasingly, the frontier of what can be cheaply graded.
The short answer
AI progress in a domain ≈ model capacity × quality and volume of feedback signal you can put through it; verifiers are what make that signal cheap and abundant
A modern frontier model is pretrained on internet-scale text, then post-trained with a mixture of supervised fine-tuning and reinforcement-style methods. For the reasoning models specifically, the RL stage is where most of the visible jump on math and code benchmarks appears to come from. RL needs a reward. Where you have a deterministic checker — a unit test, a math verifier, a compiler, a type system, a Lean proof kernel — the reward is cheap, automatic, and much lower-noise than anything you could collect from humans, and you can run the loop at industrial scale. (It’s still a proxy: the checker scores what it scores, not what you ultimately want — see the seams below.) Where you don’t, you’re stuck paying humans (or an LLM judge) to compare outputs, which is slow, expensive, biased, and gameable. The capability gap between “verifiable domain” and “unverifiable domain” is, mostly, that gap in feedback economics.
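A minimal sketch of that asymmetry in code, assuming hypothetical `test_suite` and `reward_model` objects (neither is any lab's real API):

```python
# Sketch only: `test_suite` and `reward_model` are hypothetical stand-ins
# for a real sandboxed test runner and a learned preference model.

def verifier_reward(candidate_solution: str, test_suite) -> float:
    # Verifiable domain: a program scores the attempt. Cheap, deterministic,
    # low-noise, so the loop can run at industrial scale.
    return 1.0 if test_suite.passes(candidate_solution) else 0.0

def preference_reward(candidate_solution: str, reward_model) -> float:
    # Unverifiable domain: a model trained on human comparisons scores it.
    # Expensive to collect, noisy, and gameable: the RLHF ceiling.
    return reward_model.score(candidate_solution)
```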
How it works
To see why this is structural rather than a passing fad, follow the ingredients of a modern training run.
What “massive RL environment” actually means
When a lab says they built a “massive RL environment” for math or code, they mean roughly four things glued together (a runnable toy version of the whole loop follows the list):
- A problem generator. A pipeline that produces an effectively unlimited stream of tasks at the right difficulty — competition problems, synthetic variants, real GitHub issues, synthesized SQL queries against synthesized schemas. The generator’s job is to keep the model out of its comfort zone.
- A grader. A program that takes a candidate solution and returns a number. For math: did the boxed final answer match? For code: did the test suite pass in a sandbox? For formal proofs: did the kernel accept it? For agentic tasks: did the system end in the goal state? This is the load-bearing component. Everything else assumes it exists.
- A sandbox. Code has to actually run somewhere safe. Agentic environments need a fake browser, a fake shell, a fake filesystem, sometimes a fake database. Building these at the scale and reliability the training loop needs is its own non-trivial engineering project — it’s part of why “massive RL environment” is a moat, not a weekend project.
- The RL loop itself. Sample many candidate solutions per problem from the current model, score them with the grader, update the model toward the high-scoring ones (with a KL leash to a reference checkpoint so it doesn’t drift into gibberish). The DeepSeek-R1 paper is the most public worked example of verifier-driven reasoning RL at scale — it doesn’t describe a full software-engineering RL stack, but the problem-grader-sandbox-loop pattern around the math/code rewards is laid out in real detail. o1’s public-facing description is consistent with this broad shape, but OpenAI hasn’t published enough detail to confirm the recipe is the same.
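The promised toy version, with a random guesser standing in for the model and a print standing in for the gradient update; every name and number here is illustrative, not a lab recipe:

```python
import random

# Toy, runnable version of the four ingredients above. In a real run the
# "model" is a frontier policy and the update is a policy-gradient step
# with a KL leash to a reference checkpoint.

def generate_problem():
    # 1. Problem generator: unlimited stream of tasks with known answers.
    a, b = random.randint(2, 99), random.randint(2, 99)
    return f"What is {a} * {b}?", a * b

def grade(candidate: int, answer: int) -> float:
    # 2. Grader: deterministic, milliseconds, no label noise.
    return 1.0 if candidate == answer else 0.0

def policy_sample(problem: str) -> int:
    # Stand-in for the model; for code tasks, 3. a sandbox would safely
    # execute the model's output here before grading.
    return random.randint(4, 9801)

def rl_step(num_problems: int = 4, k: int = 8) -> None:
    # 4. The RL loop: sample k candidates per problem, score, update.
    for _ in range(num_problems):
        problem, answer = generate_problem()
        rewards = [grade(policy_sample(problem), answer) for _ in range(k)]
        # Real training would push the policy toward high-reward samples.
        print(problem, "-> best reward:", max(rewards))

rl_step()
```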
The reason this isn’t possible-but-hard for, say, “good condolence emails” is that the second ingredient, the grader, collapses. There’s no program that takes a draft email and returns a number you’d trust to gradient-descend on. You can build an LLM judge to fake it — and labs do — but now your ceiling is the judge, and the LLM you’re training will eventually learn to please the judge more than write good emails. (See the seams section below.)
Why verifiability scales and human preferences don’t
It’s worth being concrete about the asymmetry, because it’s bigger than people who haven’t worked on this assume.
A grader for math problems on a modern training cluster runs in milliseconds and costs almost nothing per call. You can score huge numbers of attempts per training run, on problems generated on the fly, with very low label noise — when the answer format is well specified, the answer is right or it isn’t.
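For concreteness, a minimal sketch of such a grader, assuming answers arrive in a `\boxed{...}` wrapper; real graders also normalize fractions, units, and equivalent forms:

```python
import re

# Minimal "boxed final answer" math grader. Assumes exact string match on
# the boxed contents; production graders do far more normalization.

def extract_boxed(completion: str) -> str | None:
    m = re.search(r"\\boxed\{([^}]*)\}", completion)
    return m.group(1).strip() if m else None

def grade_math(completion: str, gold: str) -> float:
    pred = extract_boxed(completion)
    return 1.0 if pred == gold.strip() else 0.0

assert grade_math(r"... so the answer is \boxed{42}.", "42") == 1.0
assert grade_math(r"I think it's \boxed{41}.", "42") == 0.0
```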
A human rater comparing two LLM outputs takes seconds to minutes, costs cents to dollars per comparison once you account for overhead and QA, and the signal is noisy: different raters disagree, the same rater disagrees with themselves on different days, raters get tired, raters have politics, raters can be subtly nudged by surface features like length and formatting. Public high-quality preference datasets are tiny compared with the number of verifier-scored rollouts a large training run can plausibly generate; the private datasets at frontier labs are bigger but still nowhere near the same order.
So when a domain has a verifier, the training signal can be many orders of magnitude cheaper and much less noisy than when it doesn’t. That ratio is the thing driving the runaway. It’s not that math is somehow philosophically more amenable to AI. It’s that you can run the training loop vastly more times for the same money.
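A back-of-envelope version of that ratio, with loudly assumed per-label costs (illustrative numbers, not measured figures from any lab):

```python
# All numbers are assumptions for illustration only.
verifier_cost_per_label = 0.0001   # dollars: milliseconds of compute
human_cost_per_label    = 1.00     # dollars: rater time + overhead + QA

budget = 1_000_000  # dollars of feedback budget
print(f"verifier labels: {budget / verifier_cost_per_label:,.0f}")  # 10,000,000,000
print(f"human labels:    {budget / human_cost_per_label:,.0f}")     #      1,000,000
# Four orders of magnitude in volume, before counting the noise gap.
```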
What this predicts about which domains “open up” next
The interesting move at the frontier is finding new verifiers — or, more precisely, dragging new domains into the verifiable column.
- Anything with a sandbox plus a goal state. Agentic browsing where success is “the booking confirmation page rendered.” Tool use where success is “the API call returned the expected payload.” SQL where success is “the query returned the labeled rows.” These are essentially reductions to the code-verification case.
- Anything with formal semantics. Lean and other proof assistants are the cleanest example — a proof either typechecks or it doesn’t. Formal-proof work is a natural fit for this mechanism for that exact reason.
- Anything where an external system already grades the work. CTF challenges that have flags. Kaggle-like ML tasks with held-out scoring. Trading strategies with a backtest. Any pre-existing scoreboard that an automated agent can submit to.
- Anything where you can manufacture the ground truth. Generate a problem and its answer together — render a 3D scene, ask the model to identify the geometry; corrupt a known-good codebase, ask the model to repair it; encrypt a known plaintext, ask the model to break a weak cipher. Whenever you can construct (problem, answer) pairs synthetically at scale, you’ve built a verifier. (A toy instance follows this list.)
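The promised toy instance, using the weak-cipher case from the last bullet: because the plaintext is known at generation time, the verifier is just an equality check. Everything here is illustrative; real pipelines do the same move for code, scenes, and SQL.

```python
import random
import string

# Manufacture (problem, answer) pairs by construction.

def make_cipher_task():
    plaintext = "".join(random.choices(string.ascii_lowercase, k=20))
    shift = random.randint(1, 25)
    ciphertext = "".join(
        chr((ord(c) - 97 + shift) % 26 + 97) for c in plaintext
    )
    problem = f"Recover the plaintext of this Caesar cipher: {ciphertext}"
    return problem, plaintext  # the answer is known the moment it's generated

def grade(candidate: str, answer: str) -> float:
    return 1.0 if candidate == answer else 0.0

problem, answer = make_cipher_task()
print(problem)
print("a correct submission scores:", grade(answer, answer))
```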
What this doesn’t predict opens up next: tasks where the only valid judge is “did this make a real human happier / more persuaded / better-informed in their actual life.” That’s a real judgement and a useful one, but it doesn’t fit cleanly into a training loop.
Where the seams show
A few honest caveats so this doesn’t read as triumphalist:
- Reward hacking is a real ceiling. When the verifier has any loophole, RL finds it. The folklore is full of agents that “passed the tests” by deleting the tests, or solved the math problem by printing the correct answer string from memory. Frontier labs spend non-trivial effort hardening their verifiers against this, and it’s an arms race within the same training run. (A concrete hardening sketch follows this list.)
- “Verifiable” smuggles in domain choices. The verifier scores a proxy for the thing you actually want. SWE-bench measures “did the unit tests pass,” not “did this fix maintain the codebase’s long-term health.” Models trained to maximize the proxy will tend to behave the way the proxy rewards, including in subtly bad ways the proxy doesn’t catch. Goodhart’s law doesn’t get suspended just because the verifier is automated.
- LLM-as-judge is a tempting shortcut, with a real cost. When there’s no clean verifier, labs sometimes use a stronger LLM as a grader. This works partway — but it imports the judge’s biases into the trained model and creates a clear path to mode collapse. My read is that LLM-as-judge is good for raising a floor on soft tasks, not for the runaway scaling that pure verifiers enable; I haven’t seen public evidence that strongly contradicts that.
- Generalization out of the verifier is the open question. A model trained heavily on verifiable math gets visibly better at verifiable math. Whether it gets better at real-world reasoning that resembles math is messier. There’s evidence both ways and no consensus I’d vouch for. The optimistic story is that the reasoning skills generalize broadly. The pessimistic story is that you get a benchmark-shaped model. The truth in 2026 is probably somewhere in between, and labs are actively measuring this; I don’t have a clean public number for the size of the transfer.
- The bottleneck moves. As verifiers get cheap, the scarce resource becomes good problems. A training run is gated less by “can we score it” and more by “do we have enough hard, novel problems for the model to fail at and learn from.” Problem generation has quietly become its own subfield.
- This isn’t the only axis of progress. Pretraining still matters; data quality still matters; tool use, context handling, and harness design still matter. The claim isn’t that verifiable RL is the only thing happening, just that it’s the single biggest reason the curves on math and code look the way they do, and that the same mechanism doesn’t exist for many other domains.
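The hardening sketch promised in the reward-hacking caveat above: one cheap defense is refusing to grade a patch that touched the test files. The paths, the `test_*.py` pattern, and the `pytest` invocation are illustrative assumptions, not any lab's actual harness.

```python
import hashlib
import pathlib
import subprocess

# Refuse to grade a patch that edited or deleted the tests: a cheap guard
# against the classic "passed the tests by deleting them" reward hack.

def file_hashes(repo: pathlib.Path, pattern: str = "test_*.py") -> dict:
    return {
        p.relative_to(repo): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in repo.rglob(pattern)
    }

def grade_patch(repo: pathlib.Path, hashes_before: dict) -> float:
    if file_hashes(repo) != hashes_before:
        return 0.0  # tests were modified: score zero, don't even run them
    result = subprocess.run(
        ["pytest", "-q"], cwd=repo, capture_output=True, timeout=300
    )
    return 1.0 if result.returncode == 0 else 0.0

# Usage: snapshot `hashes_before = file_hashes(repo)` before applying the
# model's patch, then call `grade_patch(repo, hashes_before)` after.
```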
The compression: where you can build a cheap, hard-to-game grader, you can train a model superhumanly. Where you can’t, you’re limited by how fast humans can label. That ratio is the engine; the domain-by-domain capability map is the exhaust.
Famous related terms
- RLVR — RL post-training + a deterministic checker for the answer. The training recipe this whole post is about; see Why reasoning models exist.
- RLHF — SFT + a reward model trained on human preferences + an RL loop. The version you fall back to when there’s no verifier; the ceiling case.
- Reward hacking — the model finds reward-function loopholes that score high but miss the intent. The main reason verifiers need to be hardened, not just written.
- Goodhart’s law — when a measure becomes a target, it stops being a good measure. The patron saint of every “we trained on the proxy and got the proxy” failure.
- The Bitter Lesson — Rich Sutton’s argument that, over the history of AI, methods that scale with compute and data beat methods that encode human cleverness. The verifiable-domain runaway is a fresh datapoint for that thesis: scalable feedback wins.
- Why LLM eval is hard — the inverse problem. If we can’t even evaluate models well in a domain, we definitely can’t train them well in it for the same reasons.
Going deeper
- DeepSeek-AI, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (2025) — the most detailed public description of training a frontier model with verifiable rewards on math and code.
- OpenAI, Learning to reason with LLMs (Sept 2024) — the o1-preview launch post; thinner on recipe, but established the public framing of test-time compute scaling.
- Rich Sutton, The Bitter Lesson (2019) — short, old, still the cleanest statement of why scalable methods win over clever ones.
- Lambert et al., on RLVR and the post-RLHF era — Nathan Lambert’s RLHF Book tracks the verifiable-rewards lineage as it diverges from preference-only RL.