
Why AI runs away in verifiable domains

AI is getting superhuman fastest at things a computer can grade — math, code, formal proofs — and dragging behind on things it can't. The reason isn't that those domains are 'easier.' It's that training has a feedback step, and feedback needs a verifier.


Why it exists

If you plot frontier-model capability over the last two years, the curves fan out by domain in a way that should bother you.

On the things a computer can grade — competition math, contest programming, agentic coding tasks with a passing test suite — performance has gone from “junior intern, sometimes” to “world-class, routinely” in well under two years. AIME 2024: GPT-4o landed around 9% pass@1; o1 reached the mid-70s; the next generation of reasoning models cleared that bar without straining. The original SWE-bench (real GitHub issues with held-out tests, released October 2023) had best results in the low single digits; SWE-bench Verified, the cleaner subset that launched in August 2024, climbed past 70% in 2025 — Anthropic and OpenAI have both reported results in that range.

On the things a computer can’t grade — long-form writing that has to be good, judgement calls under genuine uncertainty, taste, knowing when an idea is bad before you ship it — there’s progress, but it’s the diffuse, debatable kind. People argue about whether new models are actually better writers or just more confident ones. There’s no AIME score for “wrote a memo your boss respected.”

The shape of this gap isn’t an accident, and it isn’t going to close on its own. The thing that’s powering the runaway curves on the verifiable side — massive RLVR-style training on problems where a checker can score every attempt — requires a checker. Where a checker exists, you can run the loop millions of times. Where one doesn’t, you’re back to expensive, noisy human preference data, and that ceiling is much lower.

So the principle is: AI makes the fastest progress in domains where its output can be easily verified, because verification is what lets training scale.

Why it matters now

This isn’t a philosophy point — it’s one of the better predictors of where AI capability will and won’t lurch forward over the next year.

The short answer

AI progress in a domain ≈ model capacity × quality and volume of feedback signal you can put through it; verifiers are what make that signal cheap and abundant

A modern frontier model is pretrained on internet-scale text, then post-trained with a mixture of supervised fine-tuning and reinforcement-style methods. For the reasoning models specifically, the RL stage is where most of the visible jump on math and code benchmarks appears to come from. RL needs a reward. Where you have a deterministic checker — a unit test, a math verifier, a compiler, a type system, a Lean proof kernel — the reward is cheap, automatic, and much lower-noise than anything you could collect from humans, and you can run the loop at industrial scale. (It’s still a proxy: the checker scores what it scores, not what you ultimately want — see the seams below.) Where you don’t, you’re stuck paying humans (or an LLM judge) to compare outputs, which is slow, expensive, biased, and gameable. The capability gap between “verifiable domain” and “unverifiable domain” is, mostly, that gap in feedback economics.
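
To make “cheap, automatic, and much lower-noise” concrete, here’s a minimal sketch of a verifier-style reward for the math case. The boxed-answer convention and the function names are illustrative assumptions, not any lab’s published code:

```python
import re

def extract_boxed_answer(completion: str) -> str | None:
    """Pull the last \\boxed{...} out of a completion. Assumes the
    prompt told the model to box its final answer; that convention
    is an illustrative assumption, not a universal standard."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None

def math_reward(completion: str, reference_answer: str) -> float:
    """Deterministic checker: 1.0 on an exact match, else 0.0.
    Milliseconds per call, no human in the loop, near-zero label
    noise. This is the kind of signal RLVR scales on."""
    answer = extract_boxed_answer(completion)
    return 1.0 if answer == reference_answer.strip() else 0.0
```

So `math_reward("... so the answer is \\boxed{42}", "42")` returns 1.0, and anything else returns 0.0. The checker scores exactly one thing: final-answer match. That narrowness is what makes it cheap, and also what makes it a proxy.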

How it works

To see why this is structural rather than a passing fad, follow the ingredients of a modern training run.

What “massive RL environment” actually means

When a lab says they built a “massive RL environment” for math or code, they mean roughly four things glued together:

  1. A problem generator. A pipeline that produces an effectively unlimited stream of tasks at the right difficulty — competition problems, synthetic variants, real GitHub issues, synthesized SQL queries against synthesized schemas. The generator’s job is to keep the model out of its comfort zone.
  2. A grader. A program that takes a candidate solution and returns a number. For math: did the boxed final answer match? For code: did the test suite pass in a sandbox? For formal proofs: did the kernel accept it? For agentic tasks: did the system end in the goal state? This is the load-bearing component. Everything else assumes it exists.
  3. A sandbox. Code has to actually run somewhere safe. Agentic environments need a fake browser, a fake shell, a fake filesystem, sometimes a fake database. Building these at the scale and reliability the training loop needs is its own non-trivial engineering project — it’s part of why “massive RL environment” is a moat, not a weekend project.
  4. The RL loop itself. Sample many candidate solutions per problem from the current model, score them with the grader, update the model toward the high-scoring ones (with a KL leash to a reference checkpoint so it doesn’t drift into gibberish). The DeepSeek-R1 paper is the most public worked example of verifier-driven reasoning RL at scale — it doesn’t describe a full software-engineering RL stack, but the problem-grader-sandbox-loop pattern around the math/code rewards is laid out in real detail. o1’s public-facing description is consistent with this broad shape, but OpenAI hasn’t published enough detail to confirm the recipe is the same. (A toy sketch of steps 2 through 4 follows this list.)
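
Here’s a toy version of steps 2 through 4 glued together. Everything named here is a stand-in: `policy` for the model being trained, the subprocess timeout for a real sandbox, and the group-baseline update for the actual RL algorithm (DeepSeek-R1 used GRPO, which this loosely mimics):

```python
import statistics
import subprocess
import sys
import tempfile

def grade_in_sandbox(candidate_code: str, test_code: str, timeout_s: int = 10) -> float:
    """Steps 2 and 3 in miniature: run candidate + tests in a
    subprocess with a timeout. A real sandbox also isolates network,
    filesystem, and memory; the timeout is the toy stand-in."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout_s)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

def rl_step(policy, problems, k: int = 8):
    """Step 4 in miniature: sample k candidates per problem, grade
    them, and push the policy toward the above-average ones.
    `policy.sample` and `policy.update` are placeholders for real
    model and optimizer machinery; a real loop also applies a KL
    penalty against a reference checkpoint to prevent drift."""
    for problem in problems:
        candidates = [policy.sample(problem.prompt) for _ in range(k)]
        rewards = [grade_in_sandbox(c, problem.tests) for c in candidates]
        baseline = statistics.mean(rewards)           # group baseline, GRPO-style
        advantages = [r - baseline for r in rewards]  # above average => reinforce
        policy.update(problem.prompt, candidates, advantages)
```

The load-bearing property: nothing in this loop waits on a human. The grader’s verdict arrives in milliseconds, so the only limits on how many times you run it are compute and problem supply.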

The reason this isn’t possible-but-hard for, say, “good condolence emails” is that step 2 collapses. There’s no program that takes a draft email and returns a number you’d trust to gradient-descend on. You can build an LLM judge to fake it — and labs do — but now your ceiling is the judge, and the LLM you’re training will eventually learn to please the judge rather than to write good emails. (See the seams section below.)
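
For contrast, here’s what the reward slot looks like when the grader has to be an LLM judge. `judge.score` is a hypothetical call, not a real API; the point is structural:

```python
def judge_reward(draft_email: str, judge) -> float:
    """Reward from an LLM judge instead of a deterministic checker.
    `judge` is a hypothetical scoring model. The structural problem:
    you're now optimizing against the judge's quirks, so anything it
    over-weights (length, hedging, a confident tone) becomes a
    gradient direction. The ceiling is the judge."""
    prompt = ("Rate this condolence email from 0 to 10 for warmth, "
              "appropriateness, and sincerity.\n\n" + draft_email)
    return judge.score(prompt) / 10.0
```

Same loop shape, very different reward: slower, pricier per call, and pointed at a proxy that can itself be optimized against.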

Why verifiability scales and human preferences don’t

It’s worth being concrete about the asymmetry, because it’s bigger than people who haven’t worked on this assume.

A grader for math problems on a modern training cluster runs in milliseconds and costs almost nothing per call. You can score huge numbers of attempts per training run, on problems generated on the fly, with very low label noise — when the answer format is well specified, the answer is right or it isn’t.

A human rater comparing two LLM outputs takes seconds to minutes, costs cents to dollars per comparison once you account for overhead and QA, and the signal is noisy: different raters disagree, the same rater disagrees with themselves on different days, raters get tired, raters have politics, raters can be subtly nudged by surface features like length and formatting. Public high-quality preference datasets are tiny compared with the number of verifier-scored rollouts a large training run can plausibly generate; the private datasets at frontier labs are bigger but still nowhere near the same order.

So when a domain has a verifier, the training signal can be many orders of magnitude cheaper and much less noisy than when it doesn’t. That ratio is the thing driving the runaway. It’s not that math is somehow philosophically more amenable to AI. It’s that you can run the training loop vastly more times for the same money.
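
Rough numbers, to make “orders of magnitude” less hand-wavy. Every figure below is an assumption picked for illustration, not a measured cost:

```python
# Back-of-envelope feedback economics. All inputs are assumptions.
verifier_seconds = 0.005          # assume ~5 ms of compute per grader call
machine_cost_per_hour = 2.00      # assume $2/hour for the machine it runs on
verifier_cost = verifier_seconds / 3600 * machine_cost_per_hour
# ≈ $0.0000028 per verified attempt

human_cost_per_comparison = 0.50  # assume 50 cents per preference label,
                                  # including recruiting, QA, and overhead

ratio = human_cost_per_comparison / verifier_cost
print(f"~{ratio:,.0f}x cheaper per label")  # ~180,000x under these assumptions
```

And that’s before the noise gap: the verifier returns the same label every time, while two human raters routinely disagree with each other.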

What this predicts about which domains “open up” next

The interesting move at the frontier is finding new verifiers — or, more precisely, dragging new domains into the verifiable column.
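
One concrete shape of that move, using the synthesized-SQL example from the list above: once you can execute a candidate query against a synthetic database and compare result sets to a reference, text-to-SQL has been dragged into the verifiable column. A minimal sketch, where the schema, rows, and reference query are stand-ins for a real problem generator’s output:

```python
import sqlite3

def sql_reward(candidate_sql: str, reference_sql: str, setup_sql: str) -> float:
    """Verifier for text-to-SQL: build a throwaway in-memory database,
    run the reference and candidate queries, and compare result sets.
    In a real pipeline the schema, data, and reference query come
    from a synthetic generator; here they're just parameters."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)
        expected = sorted(conn.execute(reference_sql).fetchall())
        try:
            actual = sorted(conn.execute(candidate_sql).fetchall())
        except sqlite3.Error:
            return 0.0  # malformed SQL is just another zero-reward sample
        return 1.0 if actual == expected else 0.0
    finally:
        conn.close()
```
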

What this doesn’t predict will open up next: tasks where the only valid judge is “did this make a real human happier, more persuaded, or better-informed in their actual life?” That’s a real judgement and a useful one, but it doesn’t fit cleanly into a training loop.

Where the seams show

A few honest caveats so this doesn’t read as triumphalist:

The checker is a proxy. It scores what it scores, not what you ultimately want. A test suite can pass on code that’s wrong in ways the tests don’t cover, and a model trained hard against a grader will find whatever slack the grader leaves. “Hard to game” is an engineering target, not a given.

The judge ceiling is real. Where the grader has to be an LLM judge, the trained model’s quality is capped by the judge’s, and the judge’s biases (length, formatting, a confident tone) become training targets.

The compression: where you can build a cheap, hard-to-game grader, you can train a model to superhuman performance. Where you can’t, you’re limited by how fast humans can label. That ratio is the engine; the domain-by-domain capability map is the exhaust.
