Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why agents fall apart over long horizons

Your agent solves any single step beautifully. Run it for fifty steps and it falls off a cliff. The math behind that cliff is older than LLMs, but a newer twist makes it worse.

AI & ML · intermediate · Apr 29, 2026

Why it exists

The first time you build an agent, the failure mode is shocking. You watch a model that just one-shot-solved a hard coding question, that wrote your unit tests, that summarized a contract correctly — and then, asked to do thirty small steps in a row, go off the rails somewhere around step twelve. It edits the wrong file. It re-reads a doc it just read. It “fixes” working code. By step twenty, it’s confidently in a different problem than the one you asked about.

The natural reaction is “the model isn’t smart enough yet.” It’s the wrong reaction. You can take a frontier model with near-100% accuracy on a single step and still watch its end-to-end success rate collapse as you stack steps. Sinha et al. (2025), The Illusion of Diminishing Returns, make this concrete: per-step accuracy isn’t constant — it itself degrades as the trajectory gets longer, and even tiny per-step error rates compound multiplicatively across a long chain. A 99% per-step model has a hard time crossing 100 steps without an error somewhere; a 95% model is already in trouble at 14.

Long-horizon failure isn’t a separate kind of mistake the model makes. It’s a structural property of running a stochastic decoder in a loop.

Why it matters now

This is the bottleneck for almost every product labeled “agent” in 2026.

METR’s Measuring AI Ability to Complete Long Tasks (March 2025) put numbers on the trend: the length of tasks frontier agents can complete at 50% reliability has been doubling roughly every seven months over the last six years, and METR notes the trend may have accelerated in 2024. That’s a real trajectory, but it’s also the flip side of the same fact: the headline capability metric of the era is a length, not an IQ score, because length is exactly where these systems break.

If you’re building agents, the practical consequences are:

- Reliability is a budget: every extra step the model has to nail in a row multiplies your odds of failure.
- A quoted single-step accuracy means little until you know how long a chain it has to survive.
- The cheapest gains come from shortening chains and catching mistakes early, not from waiting for a smarter model.

The short answer

long-horizon failure ≈ chain-success arithmetic + a self-conditioning effect that makes per-step accuracy itself degrade as the chain grows

If each step fails independently with probability p, the chance of finishing a chain of n steps with no errors is roughly (1−p)ⁿ — that’s exponential decay in the length of the task even when single-step accuracy looks great. The newer, more uncomfortable finding is that per-step accuracy isn’t even constant: once a model’s own context contains its earlier mistakes, it starts making more of them. The chain isn’t just multiplying a fixed risk; the risk is also rising.

How it works

There are really two things stacked on top of each other. Keeping them separate is the whole point.

1. The boring multiplicative part

Treat a multi-step task as a sequence of independent gates and you get the “chain success” identity:

P(finish n steps cleanly) = Π P(step i succeeds)
                          ≈ (1 − p)ⁿ      if errors are i.i.d.

Plug in numbers and the cliff appears immediately:

per-step accuracy    10 steps    50 steps    100 steps
99%                  90%         60%         37%
95%                  60%         8%          0.6%
90%                  35%         0.5%        ~0%

Every improvement in per-step accuracy buys a multiplicative win in the length of task you can complete: at a fixed success target, the feasible chain length scales roughly with one over the per-step error rate, so halving the error rate roughly doubles the horizon. This is the same arithmetic that has always governed pipelines, manufacturing yield, and any system where you have to nail every step in a row. It just hits harder than people expect because intuitions about accuracy live in the single-question regime.

This part is not specific to LLMs. It’s why “the model is 95% accurate” is almost meaningless without “…over a chain of how long?”.
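
If you want to poke at the numbers yourself, the arithmetic fits in a few lines. This is a minimal sketch of the i.i.d. model above; chain_success and horizon_at are names I made up for this post, not anyone’s API:

import math

def chain_success(step_acc, n_steps):
    # P(all n steps succeed) under the i.i.d. assumption
    return step_acc ** n_steps

def horizon_at(step_acc, target=0.5):
    # longest chain that still finishes cleanly with probability >= target
    return math.floor(math.log(target) / math.log(step_acc))

print(chain_success(0.99, 100))   # ~0.37, the 37% cell in the table
print(horizon_at(0.95))           # 13, which is why a 95% model is "in trouble at 14"

Note how horizon_at moves: halving the per-step error rate roughly doubles the horizon at any fixed target, which is the sense in which accuracy gains buy length multiplicatively.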

2. The newer, weirder part: self-conditioning

If per-step error were truly independent and constant, scaling and fine-tuning would be enough — keep nudging p down and the cliff moves right. But Sinha et al. (2025) show something more interesting: as a trajectory grows, the model becomes more likely to make a mistake at each step, because its context now contains its previous mistakes. They call this self-conditioning: the model isn’t just sampling i.i.d. errors, it’s sampling conditioned on a transcript that includes its own bad outputs, and that transcript pulls the next sample further toward “this is the kind of thing I do.”

A few illustrative shapes this takes in practice (these are intuition, not measured findings from the paper):

- An agent misreads a file early on, and every later step treats the misreading as established fact.
- A malformed tool call lands in the transcript, and the malformed pattern starts reappearing because it now looks like “the kind of thing I do.”
- One unnecessary “fix” to working code begets follow-up fixes to the damage the model itself introduced.

The empirical claim from Sinha et al. that’s worth carrying around: this self-conditioning isn’t fully fixed by scaling the model. Bigger models have lower base error rates, but they still degrade across long trajectories from this mechanism. What does help, in their setup, is explicit thinking (CoT-style or reasoning-model behavior). My read on why — and this is my read, not a result from the paper — is that thinking lets the model re-examine the trajectory before committing the next step, instead of mechanically extending it.

(The paper’s headline phrasing, “thinking mitigates self-conditioning,” is one of the cleaner reasons to take reasoning models seriously beyond benchmark scores. I’m describing their setup, not claiming it generalizes to every agent harness.)
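
To feel the difference between a constant error rate and a rising one, here is a toy simulation. It is entirely my own construction, not the paper’s setup: each mistake already in the transcript bumps the error probability of the next step by a fixed amount.

import random

def error_rate_by_position(n_steps=50, base_err=0.02, boost=0.01, trials=20000):
    # toy self-conditioning: every past mistake in the transcript
    # raises the error probability of the next step by `boost`
    errors_at = [0] * n_steps
    for _ in range(trials):
        past_errors = 0
        for t in range(n_steps):
            p_err = min(1.0, base_err + boost * past_errors)
            if random.random() < p_err:
                errors_at[t] += 1
                past_errors += 1
    return [count / trials for count in errors_at]

rates = error_rate_by_position()
print(rates[0], rates[-1])   # per-step error rate climbs with depth, even though base_err never changed

Set boost=0 and this collapses back to the i.i.d. model from part 1; the gap between the two runs is the self-conditioning tax.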

Why naive harnesses make this worse

Most “agent loops” in the wild look like:

history = []    # append-only transcript; a bad step, once made, stays in context
while not done:
    thought, action = model(history)   # next step is conditioned on the entire transcript so far
    observation = run(action)          # execute the action in the environment
    history.append((thought, action, observation))

Two things are quietly hostile to long horizons here:

- Every iteration is a gate. The loop only succeeds if the model nails every step, so the (1 − p)ⁿ arithmetic applies in full.
- history is append-only. Every mistake stays in context forever, which is exactly the fuel self-conditioning runs on.

The interventions that work in practice all attack one of these:

- Decompose the task so each run is a shorter chain.
- Verify a step’s output before it’s allowed to condition the next one.
- Prune or summarize the transcript so old mistakes stop steering the context.
- Discard a contaminated trajectory and restart fresh instead of patching it in place.

None of these make the model smarter. They all reduce the number of steps the model has to nail in a row, or break the self-reinforcing-mistake loop. That’s the lever.
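
As one concrete illustration, here is the naive loop rewritten to attack both levers at once. This is hypothetical harness code, not any particular framework: model, run, and verify are assumed callables, and verify stands in for whatever cheap check your domain affords (tests pass, the diff touches the expected file, and so on).

def run_agent(task, model, run, verify, max_restarts=3):
    for _ in range(max_restarts):
        history = [task]                # fresh, uncontaminated context per attempt
        while True:
            thought, action = model(history)
            if action is None:          # convention: the model signals completion with no action
                return history
            observation = run(action)
            if not verify(action, observation):
                break                   # bad step caught before it conditions the next one;
                                        # discard the contaminated transcript and start over
            history.append((thought, action, observation))
    raise RuntimeError("no clean trajectory within the restart budget")

The restart is the important part: appending a correction to a contaminated transcript still leaves the mistake in context, which is exactly what self-conditioning feeds on.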

Where this argument has limits

A few honest caveats:

- The (1 − p)ⁿ model assumes errors are independent and fatal. Real agents sometimes notice and repair a bad step, so the true curve is less brutal than the table suggests.
- The self-conditioning results come from one experimental setup; as noted above, I’m not claiming they transfer unchanged to every agent harness.
- METR’s doubling trend is a measurement, not a law. Extrapolating it forward is a bet.

The takeaway is structural. If you build agents and you only remember one thing: length is the dominant axis of difficulty. Anything that makes the chain shorter, anything that catches a mistake before it conditions the next step, anything that lets you discard a contaminated trajectory and start fresh — that’s where the reliability gains are. Smarter models help. Shorter horizons help more.

Going deeper