Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

What is harness engineering?

Most of the work that turns a frontier model into a reliable product happens around the model, not inside it. Harness engineering is the name for that work.


Why it exists

Why does coding with Cursor feel completely different from coding with plain ChatGPT, even when both apps are pointed at the same Claude model? The model is identical. What’s different is everything around it: which tools the model can call, which files end up in its prompt, how the loop decides when to stop, what gets retried after a failure. That wrapping is the harness. A model alone is like a brain in a jar — smart, but with no eyes, no hands, no memory of yesterday. The harness is the body. “Harness engineering” is the (newly named) craft of building good bodies.

A weird thing happened over the last two years. The same handful of frontier models — three or four families, all trained by labs you can name — ended up inside dozens of very different-feeling products. One feels like a tireless pair programmer. One feels like a research assistant that browses the web patiently for an hour. One feels like a scatterbrained chatbot that forgets what you said three turns ago. Same brains, wildly different behavior.

The thing that varies isn’t the model. It’s the harness — the program that runs around the model, deciding what tools it has, what stays in its context, when to stop, what to retry, and what to ask the user before acting. Harness engineering is the discipline of designing, measuring, and improving that program.

It exists as a named thing because the work turned out to be its own craft. You can’t do it well by being a good ML engineer; the model is a black box you’re prompting, not a thing you’re training. You can’t do it well by being a good systems engineer either; your “system” is non-deterministic and re-rolls the dice every run. The skill set is something else: part product design, part distributed-systems-with-an-unreliable-worker, part prompt craft, part eval design. Hence its own name.

Why it matters now

In 2026 the headline capabilities of the top three or four models are close enough that, for many product use cases, model choice is no longer the dominant variable in how good the product feels. The harness often is. This is a working thesis, not a measured claim; the strongest evidence is the one this post opened with: the same few models, wearing different harnesses, produce wildly different-feeling products.

This is also why “prompt engineering” stopped being the whole story. Prompt engineering is one slice of harness engineering — the system-prompt slice. The rest of the harness — tools, loop shape, memory, permissions, eval — matters at least as much, often more.

The short answer

harness engineering = tool design + context engineering + loop & control flow + permissions + evals, all iterated against a non-deterministic worker

It’s the craft of building everything around a language model so that the combined system is useful, safe, and improvable. The model is the worker; the harness is the workplace, the manager, the safety officer, and the QA team.

How it works

Think of harness engineering as five intertwined sub-disciplines. They’re not separate teams; the interesting bugs almost always cross boundaries.

1. Tool design

Tools are the model’s interface to the world. Their names, descriptions, argument schemas, and — crucially — their error messages are part of the prompt for every subsequent step.

The unintuitive part: a model with great tools behaves like a smarter model, and a model with bad tools behaves like a dumber one. Concretely, a tool whose error message explains what went wrong and what to try next turns a failure into a course correction; a tool that returns an opaque stack trace turns it into a dead end.
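To make that concrete, here is a minimal sketch of a tool written with the model as its reader. The spec layout and helper names are illustrative, not any particular provider's API; the point is that the description and the failure paths are all prompt text.

```python
import pathlib
import re

# Illustrative tool spec: every string here lands in the model's prompt,
# so the description doubles as documentation the model actually reads.
SEARCH_TOOL = {
    "name": "search_code",
    "description": (
        "Search Python files under a directory for a regex. "
        "Returns up to 20 matches as 'path:line: text'. "
        "Prefer this over reading whole files."
    ),
    "parameters": {
        "pattern": "regex to search for (required)",
        "path": "directory to search, e.g. 'src/' (default '.')",
    },
}

def run_search_code(pattern: str, path: str = ".") -> str:
    try:
        regex = re.compile(pattern)
    except re.error as exc:
        # A good error message is a course correction, not a dead end:
        # name the problem, suggest the fix.
        return f"Invalid regex ({exc}). Escape special characters or use a plain string."
    hits = []
    for file in pathlib.Path(path).rglob("*.py"):
        for lineno, line in enumerate(file.read_text(errors="ignore").splitlines(), 1):
            if regex.search(line):
                hits.append(f"{file}:{lineno}: {line.strip()}")
                if len(hits) >= 20:
                    return "\n".join(hits)
    return "\n".join(hits) or f"No matches for {pattern!r} under {path!r}. Try a broader pattern."
```

Note that both failure paths return text the model can act on. That text is as much a part of the tool's design as the happy path.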

2. Context engineering

Models have a finite context window, and even within that window attention isn’t uniform: material in the middle of a long context is often used less reliably than material near the beginning or end. So “what’s in the context, in what order, in what shape” is a real design problem.

The standard moves: summarize older turns into compact notes, prune tool output and dead ends that no longer matter, and keep the material that matters most near the edges of the window.

The hard part is that these moves trade off. Summarizing too eagerly loses the detail the next step needs; pruning too aggressively erases the reason a path was rejected and the agent re-tries it.
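As a sketch of how those moves fit together, here is a toy compaction pass. `count_tokens` and `summarize` are crude stand-ins for whatever your stack provides; a real harness would use its tokenizer, and usually the model itself, for these jobs.

```python
def count_tokens(messages: list[dict]) -> int:
    # Crude stand-in: roughly four characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def summarize(messages: list[dict]) -> str:
    # Stand-in: a real harness would typically ask the model for this summary.
    return " / ".join(m["content"][:80] for m in messages)

def compact(messages: list[dict], budget: int = 100_000) -> list[dict]:
    """Keep the system prompt and recent turns verbatim; compress the middle."""
    if count_tokens(messages) <= budget:
        return messages
    head, middle, tail = messages[:1], messages[1:-10], messages[-10:]
    summary = {"role": "user",
               "content": "Compacted summary of earlier work: " + summarize(middle)}
    # The trade-off from above lives here: whatever `summarize` drops is gone,
    # including the reasons dead-end paths were rejected.
    return head + [summary] + tail
```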

3. Loop and control flow

The minimal agent loop — call model, run tools, append, repeat — is about ten lines of code (sketched below). The interesting questions are everything that goes around it: when to stop, what to retry after a failure, what to ask the user before acting, and how to cap a run that isn’t converging.
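Here is that core as a sketch, with a hypothetical `client.chat` and `run_tool` standing in for your model API and tool executor:

```python
def agent_loop(client, run_tool, task: str, max_steps: int = 30) -> str:
    # client.chat and run_tool are hypothetical stand-ins, not a real SDK.
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):              # hard cap: never loop forever
        reply = client.chat(messages)       # one model call
        messages.append(reply)
        if not reply.get("tool_calls"):     # no tools requested: we're done
            return reply["content"]
        for call in reply["tool_calls"]:
            result = run_tool(call)         # failures come back as text, not exceptions
            messages.append({"role": "tool", "content": result})
    return "Stopped: step budget exhausted."
```

Everything this post calls control flow lives in the details this sketch elides: what counts as done, which failures are worth retrying, and when the loop pauses to hand control back to the user.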

4. Permissions and safety rails

The model is allowed to propose anything. The harness decides what actually executes: which actions run automatically, which run inside a sandbox, and which wait for explicit user approval before touching the filesystem or the network.

This is also where the harness’s relationship to the user lives. “Auto mode” vs. “ask before each step” isn’t a UI toggle layered on top — it’s a fundamental knob in the harness itself.
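A sketch of such a gate, with an illustrative policy — the command lists are examples, and real harnesses use far richer rules than a match on the program name:

```python
import shlex

ALWAYS_ALLOW = {"ls", "cat", "grep", "rg"}   # read-only: run without asking
ASK_FIRST = {"rm", "curl", "pip", "git"}     # side effects: needs approval

def gate(command: str, auto_mode: bool) -> bool:
    """Decide whether a model-proposed shell command may execute."""
    program = shlex.split(command)[0]
    if program in ALWAYS_ALLOW:
        return True
    if program in ASK_FIRST:
        if auto_mode:                        # "auto mode" is literally this branch
            return True
        answer = input(f"Agent wants to run {command!r}. Allow? [y/N] ")
        return answer.strip().lower() == "y"
    return False                             # default deny: unknown programs never run
```

Notice that “auto mode” is a branch inside the gate, which is the sense in which it’s a knob in the harness itself rather than a UI toggle layered on top.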

5. Evals and observability

This is where harness engineering looks least like model training and most like its own thing.

A single run is rarely enough to tell whether a harness change helped. The agent is stochastic; one good run and one bad run on the same task tell you very little on their own. The honest workflow is closer to A/B testing than to unit tests: run the same suite of tasks many times with and without the change, compare success rates, and treat small differences as noise.
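A sketch of that workflow, with a hypothetical `run_task(harness, task) -> bool` that executes one attempt and reports pass or fail:

```python
def success_rate(run_task, harness, tasks: list[str], trials: int = 10) -> float:
    # Many trials per task, because any single run is mostly noise.
    results = [run_task(harness, task) for task in tasks for _ in range(trials)]
    return sum(results) / len(results)

def compare(run_task, old_harness, new_harness, tasks, trials=10, min_delta=0.05):
    old = success_rate(run_task, old_harness, tasks, trials)
    new = success_rate(run_task, new_harness, tasks, trials)
    # Only treat the change as real if it clears a noise threshold.
    verdict = "looks like an improvement" if new - old >= min_delta else "inconclusive"
    print(f"old={old:.2f} new={new:.2f} delta={new - old:+.2f} -> {verdict}")
```

The fixed `min_delta` is a stand-in for a proper significance test, but the shape is the point: changes are judged on rates over many runs, never on one run.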

A specific failure mode worth naming: eval drift on real tasks. The production workload changes faster than your eval suite, and your evals slowly stop reflecting it. The discipline is rotating in fresh tasks from real user traces — anonymized — at a steady cadence.

How the pieces interact

The reason these aren’t independent: a change in one layer often only pays off if another layer changes too. Give the model a powerful new tool and you may also need to prune its verbose output (context engineering), gate its side effects (permissions), and extend the eval suite so you notice when it misfires.

The mental model that works: harness engineering is iteration on a non-deterministic compound system, where every change has to be evaluated end-to-end because local improvements can degrade global behavior in surprising ways.

Where this framing has limits

A few honest caveats:

Going deeper