Why agents fall apart over long horizons
Your agent solves any single step beautifully. Run it for fifty steps and it falls off a cliff. The math behind that cliff is older than LLMs, but a newer twist makes it worse.
Why it exists
The first time you build an agent, the failure mode is shocking. You watch a model that just one-shot-solved a hard coding question, that wrote your unit tests, that summarized a contract correctly — and then, asked to do thirty small steps in a row, go off the rails somewhere around step twelve. It edits the wrong file. It re-reads a doc it just read. It “fixes” working code. By step twenty, it’s confidently in a different problem than the one you asked about.
The natural reaction is “the model isn’t smart enough yet.” It’s the wrong reaction. You can take a frontier model with near-100% accuracy on a single step and still watch its end-to-end success rate collapse as you stack steps. Sinha et al. (2025), The Illusion of Diminishing Returns, make this concrete: per-step accuracy isn’t constant — it itself degrades as the trajectory gets longer, and even tiny per-step error rates compound multiplicatively across a long chain. A 99% per-step model fails somewhere in a 100-step chain more often than not; a 95% model is below a coin flip by step 14.
Long-horizon failure isn’t a separate kind of mistake the model makes. It’s a structural property of running a stochastic decoder in a loop.
Why it matters now
This is the bottleneck for almost every product labeled “agent” in 2026.
- Coding agents that pair with you for an hour have to make hundreds of small decisions without a human in the loop between them.
- Research / browsing agents click through dozens of pages and form one answer at the end.
- Computer-use agents — taking screenshots, moving a mouse, typing into an OS — burn through steps fast, and a single confused click can poison the next twenty.
METR’s Measuring AI Ability to Complete Long Tasks (March 2025) put numbers on the trend: the length of tasks frontier agents can complete at 50% reliability has been doubling roughly every seven months over the last six years, and METR notes the trend may have accelerated in 2024. That’s a real trajectory, but it’s also the flip side of the same fact: the headline capability metric of the era is a length, not an IQ score, because length is exactly where these systems break.
If you’re building agents, the practical consequences are:
- The interesting reliability work is at the harness, not the model. Retries, checkpoints, verifiers, scoped subtasks — that’s what turns a 95% model into a usable system.
- “It works on the demo task” tells you almost nothing. A 5-step scripted demo lives on the easy side of the cliff. Your real workload probably doesn’t.
- Throwing a smarter model at it helps less than you’d expect. The failure is partially structural — the model is conditioning on its own earlier mistakes — and scale doesn’t fully erase that. (More on this below.)
The short answer
long-horizon failure ≈ chain-success arithmetic + a self-conditioning effect that makes per-step accuracy itself degrade as the chain grows
If each step fails independently with probability p, the chance of finishing a chain of n steps with no errors is roughly (1−p)ⁿ — that’s exponential decay in the length of the task even when single-step accuracy looks great. The newer, more uncomfortable finding is that per-step accuracy isn’t even constant: once a model’s own context contains its earlier mistakes, it starts making more of them. The chain isn’t just multiplying a fixed risk; the risk is also rising.
How it works
There are really two things stacked on top of each other. Keeping them separate is the whole point.
1. The boring multiplicative part
Treat a multi-step task as a sequence of independent gates and you get the “chain success” identity:
P(finish n steps cleanly) = Π P(step i succeeds) ≈ (1 − p)ⁿ, if errors are i.i.d. with per-step failure probability p
Plug in numbers and the cliff appears immediately:
| per-step accuracy | 10 steps | 50 steps | 100 steps |
|---|---|---|---|
| 99% | 90% | 60% | 37% |
| 95% | 60% | 8% | 0.6% |
| 90% | 35% | 0.5% | ~0% |
Every percentage point of per-step accuracy buys you a multiplicative win in the length of task you can complete. This is the same arithmetic that has always governed pipelines, manufacturing yield, and any system where you have to nail every step in a row. It just hits harder than people expect because intuitions about accuracy live in the single-question regime.
This part is not specific to LLMs. It’s why “the model is 95% accurate” is almost meaningless without “…over a chain of how long?”.
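If you want to regenerate the table, or find the coin-flip horizon for your own per-step number, the arithmetic fits in a few lines of Python. This is pure arithmetic, nothing model-specific:

```python
import math

# Chain-success arithmetic: P(n clean steps) = acc**n, plus the horizon
# where success first drops below a coin flip. No model involved.
for acc in (0.99, 0.95, 0.90):
    survival = {n: acc ** n for n in (10, 50, 100)}
    n_half = math.log(0.5) / math.log(acc)   # solve acc**n = 0.5 for n
    row = ", ".join(f"{n} steps: {s:.1%}" for n, s in survival.items())
    print(f"per-step {acc:.0%} -> {row}; 50% horizon ≈ {n_half:.0f} steps")
```

Running it reproduces the table above and gives 50% horizons of roughly 69, 14, and 7 steps — which is where the “in trouble at 14” number comes from.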
2. The newer, weirder part: self-conditioning
If per-step error were truly independent and constant, scaling and fine-tuning would be enough — keep nudging p down and the cliff moves right. But Sinha et al. (2025) show something more interesting: as a trajectory grows, the model becomes more likely to make a mistake at each step, because its context now contains its previous mistakes. They call this self-conditioning: the model isn’t just sampling i.i.d. errors, it’s sampling conditioned on a transcript that includes its own bad outputs, and that transcript pulls the next sample further toward “this is the kind of thing I do.”
A few illustrative shapes this takes in practice (these are intuition, not measured findings from the paper):
- Hallucinated facts get reaffirmed. Once a wrong number lands in the working notes, the model tends to treat it as evidence and build on it.
- Bad tool calls beget more bad tool calls. A failed `grep` that returned nothing teaches the model — wrongly — that the file is empty, and it starts working from that conclusion.
- The agent’s tone shifts. After a few confused steps, you can often see the writing get more apologetic, more tentative, more prone to “let me try a different approach” loops that don’t actually change anything.
The empirical claim from Sinha et al. that’s worth carrying around: this self-conditioning isn’t fully fixed by scaling the model. Bigger models have lower base error rates, but they still degrade across long trajectories from this mechanism. What does help, in their setup, is explicit thinking (CoT-style or reasoning-model behavior). My read on why — and this is my read, not a result from the paper — is that thinking lets the model re-examine the trajectory before committing the next step, instead of mechanically extending it.
(The paper’s headline phrasing, “thinking mitigates self-conditioning,” is one of the cleaner reasons to take reasoning models seriously beyond benchmark scores. I’m describing their setup, not claiming it generalizes to every agent harness.)
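To separate the two effects, here’s a toy Monte Carlo of my own construction (not the paper’s setup; the drift parameter is invented). Each earlier error in a trajectory nudges the current step’s error probability upward, as a cartoon of self-conditioning:

```python
import random

def per_step_error_curve(n_steps, base_error, drift, trials=20_000):
    """Average observed error rate at each step index.

    Each earlier error in the trajectory raises the current step's error
    probability by `drift`. drift=0 recovers the i.i.d. baseline.
    """
    errors_at = [0] * n_steps
    for _ in range(trials):
        past_errors = 0
        for t in range(n_steps):
            p = min(1.0, base_error + drift * past_errors)
            if random.random() < p:
                errors_at[t] += 1
                past_errors += 1
    return [e / trials for e in errors_at]

iid     = per_step_error_curve(50, base_error=0.05, drift=0.0)
drifted = per_step_error_curve(50, base_error=0.05, drift=0.02)
print(f"step 1:  iid {iid[0]:.3f} vs self-conditioned {drifted[0]:.3f}")   # same
print(f"step 50: iid {iid[-1]:.3f} vs self-conditioned {drifted[-1]:.3f}") # diverged
```

Both regimes start at the same 5% per-step error; only the drifted one climbs, roughly tripling by step 50. That’s the shape of the finding, not its magnitude.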
Why naive harnesses make this worse
Most “agent loops” in the wild look like:
```python
while not done:
    thought, action = model(history)                 # conditions on the full transcript
    observation = run(action)                        # execute the tool call
    history.append((thought, action, observation))   # history only ever grows
```
Two things are quietly hostile to long horizons here:
- The history grows monotonically. Every mistake the model has ever made on this task is in the prompt for every subsequent step. That’s exactly the substrate self-conditioning eats.
- There’s no global state. The “memory” of the agent is the chat log. There’s no separate ledger of “what’s actually true so far,” “which subgoals are done,” “which constraints have been verified.” So every step has to re-derive the world from a transcript that gets noisier over time (see the sketch below).
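To make the second point concrete, here’s a minimal sketch of a ledger kept beside the transcript, so each step can be prompted from verified state plus a short recent window instead of the full noisy history. The class and field names are mine, not from any framework:

```python
from dataclasses import dataclass, field

@dataclass
class TaskLedger:
    facts: list[str] = field(default_factory=list)         # verified, not inferred
    done_subgoals: list[str] = field(default_factory=list)
    open_subgoals: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Compact, always-clean state block to prepend to each step's prompt."""
        return (
            "VERIFIED FACTS:\n" + "\n".join(f"- {f}" for f in self.facts)
            + "\nDONE:\n" + "\n".join(f"- {s}" for s in self.done_subgoals)
            + "\nTODO:\n" + "\n".join(f"- {s}" for s in self.open_subgoals)
        )

# The loop then conditions on ledger.render() plus history[-k:], rather
# than the full, error-laced transcript.
```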
The interventions that work in practice all attack one of these (a sketch combining several of them follows the list):
- Plan-then-execute. Pin a plan up front, treat each step as a scoped subtask with a clear success criterion, re-plan only when a step actually fails. This bounds how far one bad step can propagate.
- External verifiers / checkers. Run tests after a code edit, run a type-check, diff a file against expectations — anything that turns “the model thinks it succeeded” into a machine-checkable signal. Compounding error tolerates wishful thinking; a green check doesn’t.
- Scratchpad pruning / summarization. Periodically replace the long, error-laced history with a clean summary of “where we are.” This is the harness-side answer to self-conditioning: stop feeding the model its own mess.
- Hard reset on detected failure. If a verifier says step 12 broke, rolling back to step 11’s known-good state and retrying is much more reliable than asking the model to “fix it from here.” The state at step 12 is contaminated; the cleanest fix is to throw it out.
- Decompose into shorter horizons. Two 10-step subtasks chained by a hand-coded glue layer are qualitatively easier than one 20-step agent run. Length is the enemy; cutting length is the most reliable win.
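Here’s a minimal sketch combining three of these: scoped subtasks, an external verifier, and rollback-and-retry. `model`, `run`, `verify`, `snapshot`, and `restore` are assumed interfaces, not a real library:

```python
def execute_plan(plan, model, run, verify, snapshot, restore, max_retries=2):
    """Run each subtask as a short, independently verified chain."""
    for subtask in plan:
        checkpoint = snapshot()                 # known-good state before the step
        for _ in range(max_retries + 1):
            action = model(subtask)             # scoped prompt: one subtask,
                                                # not the full transcript
            observation = run(action)
            if verify(subtask, observation):    # machine-checkable success, not
                break                           # "the model thinks it worked"
            restore(checkpoint)                 # state is contaminated: discard it
        else:
            raise RuntimeError(f"subtask failed after retries: {subtask!r}")
```

The `for`/`else` makes the failure path explicit: if no attempt verifies, the harness stops rather than letting a contaminated trajectory keep running.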
None of these make the model smarter. They all reduce the number of steps the model has to nail in a row, or break the self-reinforcing-mistake loop. That’s the lever.
Where this argument has limits
A few honest caveats:
- The i.i.d. model is a simplification. Real per-step errors are correlated — some steps are intrinsically harder, some are tutorial-easy. The clean (1−p)ⁿ curve is a useful first approximation, not a measurement of any specific agent.
- “Self-conditioning” as a name is from a specific 2025 paper. The underlying observation — models doubling down on their own outputs — is older folklore (you’ve seen it in any chatbot that gets stuck in a loop). Sinha et al. give it a tighter operational definition and an experiment; I don’t have a clean comparative result against earlier framings of the same phenomenon.
- The METR doubling-time number is for a specific suite of tasks — HCAST, RE-Bench, SWAA, mostly software/research-shaped. How well it extrapolates to other domains (computer-use, embodied agents) is genuinely open. The doubling number is real; the universality is a read, not a measurement.
- A lot of “agent failure” in the wild isn’t pure long-horizon reasoning — it’s bad tool design, bad prompts, bad observability. Multi-agent systems also have their own failure mode taxonomy (Cemri et al., 2025, Why Do Multi-Agent LLM Systems Fail?). The compounding-error argument here is the most common single answer, not the only one.
The takeaway is structural. If you build agents and you only remember one thing: length is the dominant axis of difficulty. Anything that makes the chain shorter, anything that catches a mistake before it conditions the next step, anything that lets you discard a contaminated trajectory and start fresh — that’s where the reliability gains are. Smarter models help. Shorter horizons help more.
Famous related terms
- Self-conditioning — `self-conditioning = model sampling errors conditioned on its own past errors in context` — Sinha et al.’s name for the “why it gets worse, not just stays bad” effect.
- Time horizon (METR) — `time horizon = task duration at which the agent is X% reliable` — the metric METR popularized as the measure of agent capability.
- ReAct loop — `ReAct = Thought → Action → Observation, repeat` — the dominant agent-loop shape; flexible but accumulates history aggressively.
- Plan-and-execute — `plan-then-execute = upfront plan + per-step executor + re-planner on failure` — limits blast radius of any single bad step at the cost of flexibility.
- Reasoning models — `reasoning model = LLM + extended internal deliberation before answering` — in Sinha et al.’s setup, dampens self-conditioning.
- Agent harness — `agent harness = loop + tools + memory + verifiers around the model` — where most reliability work lives.
- Hallucination — `hallucination = confidently generated false content` — the per-step error type that, once in context, feeds self-conditioning.
Going deeper
- Sinha, Arun, Goel, Staab, Geiping, The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs, arXiv:2509.09677, 2025 (ICLR 2026). The self-conditioning paper. The setup is small and clean; the claims are unusually concrete for this corner of the field.
- METR, Measuring AI Ability to Complete Long Tasks, March 2025. The “doubling every ~7 months” piece. The methodology section is the interesting part; the headline number is downstream of it.
- Cemri et al., Why Do Multi-Agent LLM Systems Fail?, arXiv:2503.13657, 2025. A taxonomy of failure modes when you stack multiple agents together — different problem, related shape.
- Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models, 2022. An influential early template for interleaving reasoning and action — the Thought/Action/Observation pattern most agent harnesses still echo. Reading it now, with long-horizon hindsight, makes the failure modes obvious in a way they weren’t when the paper was new.