What is harness engineering?
Most of the work that turns a frontier model into a reliable product happens around the model, not inside it. Harness engineering is the name for that work.
Why it exists
Why does coding with Cursor feel completely different from coding with plain ChatGPT, even when both apps are pointed at the same underlying model? The model is identical. What’s different is everything around it: which tools the model can call, which files end up in its prompt, how the loop decides when to stop, what gets retried after a failure. That wrapping is the harness. A model alone is like a brain in a jar — smart, but with no eyes, no hands, no memory of yesterday. The harness is the body. “Harness engineering” is the (newly named) craft of building good bodies.
A weird thing happened over the last two years. The same handful of frontier models — three or four families, all trained by labs you can name — ended up inside dozens of very different-feeling products. One feels like a tireless pair programmer. One feels like a research assistant that browses the web patiently for an hour. One feels like a scatterbrained chatbot that forgets what you said three turns ago. Same brains, wildly different behavior.
The thing that varies isn’t the model. It’s the harness — the program that runs around the model, deciding what tools it has, what stays in its context, when to stop, what to retry, what to ask the user before doing. Harness engineering is the discipline of designing, measuring, and improving that program.
It exists as a named thing because the work turned out to be its own craft. You can’t do it well by being a good ML engineer; the model is a black box you’re prompting, not a thing you’re training. You can’t do it well by being a good systems engineer either; your “system” is non-deterministic and re-rolls the dice every run. The skill set is something else: part product design, part distributed-systems-with-an-unreliable-worker, part prompt craft, part eval design. Hence its own name.
Why it matters now
In 2026 the headline capabilities of the top three or four models are close enough that, for many product use cases, model choice is no longer the dominant variable in how good the product feels. The harness often is. This is a working thesis, not a measured claim — but it shows up in a few shapes worth taking seriously:
- The same model in two different harnesses can behave like two different products. A coding agent with carefully designed tools, context pruning, and a verifier loop will often out-ship a coding agent on a stronger model with a sloppy harness. Common folklore among teams who ship agents; not a measured result.
- Many user complaints are about harness, not model. “It forgot what I told it” — context management. “It deleted the wrong file” — permissions. “It looped forever” — stopping condition. “It hallucinated an API” — tool design and verification. The model is often not the proximate cause; it’s the thing the harness failed to corral.
- Reliability gains live here. The fastest way to make an unreliable agent more reliable is rarely “wait for the next model.” It’s tighter tools, shorter horizons, better verifiers, cleaner context. (See Why agents fall apart over long horizons for the structural reason.)
This is also why “prompt engineering” stopped being the whole story. Prompt engineering is one slice of harness engineering — the system-prompt slice. The rest of the harness — tools, loop shape, memory, permissions, eval — matters at least as much, often more.
The short answer
harness engineering = tool design + context engineering + loop & control flow + permissions + evals, all iterated against a non-deterministic worker
It’s the craft of building everything around a language model so that the combined system is useful, safe, and improvable. The model is the worker; the harness is the workplace, the manager, the safety officer, and the QA team.
How it works
Think of harness engineering as five intertwined sub-disciplines. They’re not separate teams; the interesting bugs almost always cross boundaries.
1. Tool design
Tools are the model’s interface to the world. Their names, descriptions, argument schemas, and — crucially — their error messages are part of the prompt for every subsequent step.
The unintuitive part: a model with great tools behaves like a smarter model. A model with bad tools behaves like a dumber one. Concretely:
- Names and descriptions matter as much as code. `read_file` vs. `fetch_path_contents` is not a stylistic choice — in practice the model often picks tools partly by lexical match against the task wording. Inconsistent vocabulary across tools tends to cost accuracy. (Operator heuristic, not a measured effect.)
- Error messages are teaching signals. If `edit_file` fails, the string it returns is what the model will read and condition on. “File not found” is fine; “File not found. Did you mean to create it? Use `write_file` for new files.” is better — it routes the model’s next action (sketched in the code after this list).
- Schema strictness is a design choice, not a default. Loose schemas let the model pass the wrong types; over-strict schemas burn turns on validation errors. The right level depends on how forgiving downstream code is.
- Tool surface area has a cost. Every tool you add competes for attention in the system prompt and broadens the space of things the model might wrongly try. Ten well-chosen tools usually beat fifty exhaustive ones.
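To make the tool layer concrete, here is a sketch in the JSON-schema function-calling shape most model APIs use. The tool name, descriptions, and error strings are illustrative, not taken from any particular product:

```python
import os

# A tool definition in the common JSON-schema function-calling shape.
# Everything here (name, description, argument docs) is prompt text
# the model conditions on at every step.
EDIT_FILE_TOOL = {
    "name": "edit_file",
    "description": ("Replace an exact substring in an existing file. "
                    "Fails if the file does not exist; use write_file for new files."),
    "parameters": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Path relative to the repo root."},
            "old_text": {"type": "string", "description": "Exact text to replace."},
            "new_text": {"type": "string", "description": "Replacement text."},
        },
        "required": ["path", "old_text", "new_text"],
    },
}

def edit_file(path: str, old_text: str, new_text: str) -> str:
    # The returned string is what the model reads next, so an error should
    # route the model's next action, not just report failure.
    if not os.path.exists(path):
        return (f"Error: {path} not found. Did you mean to create it? "
                "Use write_file for new files.")
    with open(path) as f:
        content = f.read()
    if old_text not in content:
        return ("Error: old_text not found in file. Re-read the file with "
                "read_file and retry with an exact match.")
    with open(path, "w") as f:
        f.write(content.replace(old_text, new_text, 1))
    return f"OK: replaced 1 occurrence in {path}."
```

Note that the failure branches do as much routing as the success branch: each one names the tool the model should reach for next.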
2. Context engineering
Models have a finite context window, and even within that window, attention isn’t uniform — material in the middle of long contexts is often used worse than material at the edges. So “what’s in the context, in what order, in what shape” is a real design problem.
The standard moves:
- Summarize old turns once they’re far enough back that detail no longer matters. The harness — not the model — decides when.
- Retrieve on demand instead of pre-stuffing. If the agent can call `read_file`, you don’t need to dump the whole repo into the prompt.
- Pin invariants at the top: the user’s actual goal, the constraints that must hold, the plan. These get re-read every step.
- Prune contaminated history when the agent has gone down a wrong path. Leaving every failed attempt in context is exactly the substrate that self-conditioning feeds on.
- Cache aggressively. If the system prompt and tool definitions don’t change between calls, prompt caching can shift a large fraction of each turn’s cost into a one-time charge — exact savings depend on the provider, the cache TTL, and how much of the prefix is stable. (See Why prompt caching exists.)
The hard part is that these moves trade off. Summarizing too eagerly loses the detail the next step needs; pruning too aggressively erases the reason a path was rejected, so the agent tries it again.
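A minimal sketch of what per-turn context assembly might look like, assuming the OpenAI-style message format and a `summarize` helper you supply yourself (often a cheap model call); none of these names come from a real library:

```python
# Hypothetical per-turn context assembly combining the standard moves:
# stable cacheable prefix, pinned invariants, summarized old turns,
# recent turns kept verbatim.

def build_context(system_prompt, tool_defs, goal, constraints, plan,
                  history, summarize, keep_recent=10):
    # Stable prefix first: a byte-identical system prompt and tool list
    # across calls is what makes prompt caching pay off.
    messages = [{"role": "system", "content": system_prompt + "\n\n" + tool_defs}]

    # Pinned invariants: re-read every step, never summarized away.
    messages.append({"role": "user", "content":
                     f"Goal: {goal}\nConstraints: {constraints}\nPlan: {plan}"})

    # Older turns collapse into a summary; recent turns stay verbatim.
    old, recent = history[:-keep_recent], history[-keep_recent:]
    if old:
        messages.append({"role": "user",
                         "content": "Summary of earlier work: " + summarize(old)})
    messages.extend(recent)
    return messages
```

The `keep_recent` knob is exactly the eager-vs-lazy summarization trade-off from the paragraph above, made into a parameter you can eval.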
3. Loop and control flow
The minimal agent loop — call model, run tools, append, repeat — is about ten lines of code. The interesting questions are everything that goes around it:
- When does the loop end? Model says “done”? A budget hit? A verifier passes? Realistic harnesses have several stopping conditions and pick the first one to trip.
- Plan first or improvise? Plan-then-execute bounds how far one bad step propagates, at the cost of flexibility. Pure ReAct-style improvisation is more flexible but compounds errors faster.
- One agent or several? Sometimes the right move is one big loop; sometimes it’s a coordinator that spawns a fresh sub-agent per subtask, each with its own clean context. Sub-agents are the closest thing harness engineering has to a “fork the process” primitive.
- What runs after the model speaks but before the user sees it? Linters, type-checkers, test runners, schema validators, policy checks. These are cheap and brutal — the agent thinks it’s done; the harness disagrees, and the loop continues.
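Putting the loop and its stopping conditions together: a sketch in which `call_model`, `run_tool`, and `verify` stand in for your model client, tool dispatcher, and post-model checks. None of these names come from a real library.

```python
# The minimal loop plus explicit stopping conditions; the first one to
# trip wins.

def run_agent(messages, call_model, run_tool, verify,
              max_turns=40, token_budget=200_000):
    spent = 0
    for _ in range(max_turns):                        # stop: turn budget
        reply = call_model(messages)
        spent += reply.tokens_used
        messages.append(reply.message)
        if spent > token_budget:                      # stop: token budget
            return messages, "budget_exceeded"
        if reply.tool_calls:
            for call in reply.tool_calls:             # model proposes, harness executes
                messages.append({"role": "tool", "content": run_tool(call)})
            continue
        problems = verify()                           # model says "done"; harness checks
        if not problems:                              # stop: verifier passes
            return messages, "done"
        messages.append({"role": "user",
                         "content": "Not done yet:\n" + "\n".join(problems)})
    return messages, "max_turns_hit"                  # stop: hard cap
```

The `verify()` call is the cheap-and-brutal step from the last bullet: the agent only exits the loop when the harness agrees it is finished.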
4. Permissions and safety rails
The model is allowed to propose anything. The harness decides what actually executes. Concretely:
- Read vs. write asymmetry. Most harnesses let the model freely read files, search, and inspect, but pause before writes, deletes, network calls to production, or anything that costs money.
- Allowlists over denylists. “These tools can run without asking” is more robust than “these tools require confirmation.” With a denylist, any tool you didn’t think to flag as dangerous defaults to running silently — that’s where unknown unknowns live. With an allowlist, the default for anything new is “ask,” which is the safer failure mode.
- Confirmation UX is part of the harness. A confirmation prompt that buries the relevant detail (which file? what diff?) gets rubber-stamped and provides no real safety. A clear one is the difference between a rail and a placebo.
This is also where the harness’s relationship to the user lives. “Auto mode” vs. “ask before each step” isn’t a UI toggle layered on top — it’s a fundamental knob in the harness itself.
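A sketch of the allowlist-plus-confirmation pattern; the tool names, the `confirm` callback, and the `auto_mode` knob are hypothetical, not from any shipped product:

```python
# Allowlist gate: anything not explicitly marked safe defaults to "ask",
# which is the safer failure mode for tools you didn't think about.

AUTO_APPROVED = {"read_file", "search", "list_dir"}   # read-only, cheap, reversible

def gate(tool_call, confirm, auto_mode=False):
    if tool_call.name in AUTO_APPROVED:
        return True
    if auto_mode:
        # "Auto mode" widens the allowlist rather than removing the gate;
        # many harnesses still hard-stop on production writes even here.
        return True
    # The confirmation must surface the relevant detail (which file? what
    # diff?) or it gets rubber-stamped and protects nothing.
    return confirm(f"Run {tool_call.name}?\nArgs: {tool_call.args}")
```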
5. Evals and observability
This is where harness engineering looks least like model training and most like its own thing.
A single run is rarely enough to tell whether a harness change helped. The agent is stochastic; one good run and one bad run on the same task tell you very little on their own. The honest workflow is closer to A/B testing than to unit tests:
- A frozen task suite, ideally tasks that take the agent more than a trivial number of steps so harness effects show up.
- Multiple runs per task per variant — a single sample is noise.
- Metrics that distinguish “got the right answer” from “got there cleanly” — token cost, tool-call count, wall-clock time, number of retries, human interventions per task.
- Traces, not just logs. When something goes wrong on step 27 of 40, the only debugging substrate is the full transcript: model inputs, outputs, tool results, decisions the harness made. Many teams who work on this seriously end up building or buying a trace-viewing UI fairly early; the alternative is staring at JSON.
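The workflow above, as a sketch. The `harness.run(task)` interface and the fields on its result object are assumptions made for illustration:

```python
# A/B-style harness eval: frozen task suite, several runs per task per
# variant, metrics beyond bare pass/fail.
import statistics

def evaluate(variants, tasks, runs_per_task=5):
    report = {}
    for name, harness in variants.items():
        results = [harness.run(task)
                   for task in tasks
                   for _ in range(runs_per_task)]     # one sample is noise
        report[name] = {
            "pass_rate": sum(r.passed for r in results) / len(results),
            "median_tool_calls": statistics.median(r.tool_calls for r in results),
            "median_cost_usd": statistics.median(r.cost_usd for r in results),
            "interventions_per_run": (sum(r.human_interventions for r in results)
                                      / len(results)),
        }
    return report
```

Comparing `report["baseline"]` against `report["candidate"]` across the whole suite is what replaces the single-run "looks better to me" judgment.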
A specific failure mode worth naming: eval drift on real tasks. The production workload changes faster than your eval suite, and your evals slowly stop reflecting it. The discipline is rotating in fresh tasks from real user traces — anonymized — at a steady cadence.
How the pieces interact
The reason these aren’t independent: a change in one layer often only pays off if another layer changes too.
- Adding a verifier (loop layer) is wasted if the harness can’t act on its signal — i.e., can’t roll back a contaminated state (context layer).
- A tool with great error messages (tool layer) is wasted if those errors get summarized away three turns later (context layer).
- Tighter permissions (safety layer) without clear confirmation UX (loop / UX layer) just train users to click through.
The mental model that works: harness engineering is iteration on a non-deterministic compound system, where every change has to be evaluated end-to-end because local improvements can degrade global behavior in surprising ways.
Where this framing has limits
A few honest caveats:
- “Harness engineering” is an emerging label. People have been doing this work since the first tool-using agents; the name has caught on in the last year or so, mostly via the agent-coding crowd. I don’t have a clean origin for who used it first; if you read this in a few years, the boundaries between “harness engineering,” “AI engineering,” and “agent design” may have settled differently.
- The model still matters. A weak model in a perfect harness is still a weak product. The claim is that, between today’s frontier models, the harness is the dominant variable — not that models are irrelevant.
- Some tasks are model-bound. A task that fails because the model genuinely can’t reason about the domain isn’t going to be saved by better tool design. Knowing which kind of failure you’re looking at is itself a harness-engineering skill.
- The discipline is young and the literature is thin. Most of what’s known about harness engineering is folklore inside teams that have shipped agents, blog posts, and source code of open agents. There isn’t a textbook yet, and a lot of strong-sounding claims (including some in this post) are working hypotheses, not measured results.
Famous related terms
- Agent harness — `agent = model + harness` — the noun this discipline is the verb of.
- Prompt engineering — `prompt engineering = one slice of harness engineering` — historically the whole story; today one sub-skill alongside tool design, context engineering, loop design, and eval.
- Context engineering — `context engineering = deciding what's in the model's context, in what shape, at what cost` — the sub-discipline most teams discover second.
- Tool / function calling — `tool use = model emits structured calls + harness executes them` — the protocol via which the harness offers the model things it can do.
- MCP (Model Context Protocol) — `MCP = an open protocol for exposing tools, resources, and prompts to AI applications` — lets you reuse harness work across products.
- Eval harness — `eval harness = task suite + runner + scoring` — overloaded term: in research, it’s the rig that scores models on a benchmark; in product work, the rig that scores your harness on your task suite. Same word, different artifact.
- Scaffolding — `scaffolding ≈ harness` — older word for roughly the same idea, mostly used in academic agent papers from 2023–2024.
Going deeper
- Anthropic, Building effective agents (December 2024) — practical patterns for tools, context, and control flow. One of the more influential pieces of writing on the discipline; not the only one, but a good entry point.
- The source code of any open coding agent — Aider, OpenHands, Continue, Cline. Reading one end-to-end is the fastest way to internalize what the harness actually is, where the seams are, and what each layer costs.
- Sinha et al., The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs (arXiv:2509.09677, 2025) — not a harness paper directly, but the cleanest recent statement of how long-horizon execution degrades and how self-conditioning amplifies it. Most of the harness moves in this post (verifiers, pruning, decomposition) are easier to motivate after reading it; the paper itself doesn’t claim to study harness design.
- Cemri et al., Why Do Multi-Agent LLM Systems Fail? (arXiv:2503.13657, 2025) — what changes when the harness is coordinating multiple agents instead of one.