Why is evaluating an LLM so much harder than testing normal software?
Unit tests pass or fail. LLM outputs don't. The hard part isn't running the eval — it's deciding what 'correct' even means when there are a million right answers.
Why it exists
Anyone who has shipped a feature backed by an LLM has hit the same wall. You write a prompt. It looks great on the five examples you tried. You ship it. Two weeks later support is forwarding you outputs that are wrong in ways your five examples never hinted at. You tweak the prompt. The new version fixes those cases — and silently regresses three others you’d already considered solved.
Normal software has a clean answer to “is this change good?” — the test suite is green or it isn’t. With an LLM there is no green. Outputs are free-form text, the same input can produce different outputs, “correct” is a judgment call, and the model under test is a giant black box you didn’t train and can’t introspect. Every team building on top of these things ends up reinventing the same painful machinery: a synthetic eval set, a scoring function that mostly works, and a haunted feeling that they don’t really know if today’s model is better than yesterday’s.
This post is about why that machinery is so hard to build, not how to build it. The shape of the problem is what trips people up.
Why it matters now
Every team using agents, chatbots, summarizers, classifiers, or “AI features” of any kind faces this. Three things make it especially painful right now:
- Models change under you. A provider rolls out a new snapshot. Your prompt that worked yesterday produces subtly different outputs today. Without an eval, you can’t detect the regression — let alone localize it.
- Prompt changes are nonlocal. Editing one bullet in a system prompt can change behaviour on inputs that didn’t mention that bullet at all. There is no equivalent of “this function is now pure, so the diff is bounded.”
- The cost of being wrong is real. Hallucinations in customer-facing outputs aren’t a “fix it next sprint” bug; depending on the domain they’re a refund, a compliance incident, or worse.
So eval is the load-bearing thing that makes the rest of LLM engineering not-a-vibes-exercise. And it is much harder than it looks.
The short answer
LLM eval = test inputs + a grader you can defend + a metric that aggregates noisy outcomes
A regular test suite hides two pieces of that equation because they’re trivial: the grader is == and the aggregator is “all green.” An LLM eval forces you to build both pieces explicitly, and each one is its own research problem. That’s the whole reason it’s hard.
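To make that shape concrete, here is a minimal sketch of the equation in Python. The names are mine, not any particular framework’s; a unit-test suite is just the degenerate case at the bottom, with an exact-match grader and an all-green aggregator.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Example:
    input: str       # what you send to the system under test
    reference: str   # what "correct" looks like, if you can even write one down

Grader = Callable[[str, Example], float]      # output -> score in [0, 1]
Aggregator = Callable[[list[float]], float]   # many noisy scores -> one number you can track

def run_eval(system: Callable[[str], str],
             examples: list[Example],
             grade: Grader,
             aggregate: Aggregator) -> float:
    scores = [grade(system(ex.input), ex) for ex in examples]
    return aggregate(scores)

# A unit-test suite is the degenerate case: exact-match grader, all-green aggregator.
exact_match: Grader = lambda output, ex: float(output == ex.reference)
all_green: Aggregator = lambda scores: float(all(s == 1.0 for s in scores))
```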
How it works
Walk through what a passing test means in normal software:
- Run the function on a known input.
- Compare the output to a known expected output with ==.
- Repeat for many inputs. Aggregate: any failure → fail.
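For contrast, that whole loop fits in a few lines (toy example below); the grader and the aggregator are so trivial you never notice they exist.

```python
def parse_amount(raw: str) -> float:
    # Toy function under test.
    return float(raw.replace("$", "").replace(",", ""))

def test_parse_amount():
    # Grader: ==. Aggregator: any failed assert turns the suite red.
    for raw, expected in [("$1,200.50", 1200.50), ("0", 0.0), ("-3", -3.0)]:
        assert parse_amount(raw) == expected
```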
Now try to do the same for “summarize this support ticket”:
- There is no single expected output. A good summary can use different words, different ordering, different emphasis. Two correct summaries written by two humans will not be string-equal. So == doesn’t work, and neither does string similarity — a rewording can be near-identical to the reference and still wrong, or wildly different and still right.
- The output isn’t deterministic. Even at temperature 0, you can get different tokens across runs (see why temperature 0 isn’t deterministic). So a single run doesn’t tell you “the model fails on this input”; it tells you “this sample failed.” If you want a stable failure rate, you often need multiple samples per input — which multiplies cost.
- You need a grader. Something has to decide if an output is correct. The realistic options are all flawed:
- Exact-match / regex. Works only for narrow tasks (multiple choice, numeric answers, code that runs). Most real tasks aren’t this shape.
- Reference-based metrics (BLEU, ROUGE, embedding similarity). Cheap, but they reward looking like the reference more than being right. Embedding similarity captures some meaning; it routinely misses the task-relevant kind.
- LLM judges. Flexible, scale well, and can grade open-ended outputs. They also have known biases — preferring longer answers, preferring outputs that look like their own writing, and sometimes confidently mis-grading. They are not free of the same hallucination problem they’re meant to detect.
- Humans. The gold standard, but slow and expensive — and humans disagree with each other more than people expect.
Whichever grader you pick, you’ve added a second model-shaped thing that itself needs to be evaluated. “How do I know my judge is right?” is a real question with no clean answer.
- You need a dataset that represents production. This is where most eval suites quietly die. The five examples the developer tried are not a sample — they’re the inputs the developer can already think of. Real users hit edge cases the developer never imagined. Building an eval set that catches the long tail means harvesting real traffic, which raises privacy issues, requires labelling, and goes stale every time the product changes.
- The aggregate is a distribution, not a pass/fail. You end up with a summary like “94% of outputs are acceptable” (numbers illustrative, not measured). Whether today’s number beats yesterday’s depends on confidence intervals, on which slices regressed, and on whether the remaining failures got worse even if there are fewer of them. The sketch after this list shows one way those pieces fit together.
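Here is a rough sketch of what the sampling, judging, and aggregating end up looking like, assuming call_model and call_judge are stand-ins you wire up to your provider of choice (neither is a real API here). It also treats every sample as an independent trial, which is itself a simplification.

```python
import math
from typing import Callable

def eval_with_judge(
    inputs: list[str],
    call_model: Callable[[str], str],        # stand-in: returns one sampled completion
    call_judge: Callable[[str, str], bool],  # stand-in: judge model says acceptable or not
    samples_per_input: int = 5,
) -> tuple[float, float]:
    """Return (pass rate, 95% margin of error) over all graded samples."""
    verdicts: list[bool] = []
    for prompt in inputs:
        for _ in range(samples_per_input):               # non-determinism: sample repeatedly
            output = call_model(prompt)
            verdicts.append(call_judge(prompt, output))  # the judge is a model too; audit it

    n = len(verdicts)
    p = sum(verdicts) / n
    # Normal-approximation interval. It assumes independent samples, which
    # understates the uncertainty when failures cluster on particular inputs.
    margin = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, margin
```

“94% ± 1%” and “94% ± 8%” support very different decisions; the interval is part of the result, not a footnote.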
Now stack those problems together and you can see why LLM eval feels fractal. Every single step that was free in unit testing is its own project.
Where it gets especially weird
A few specific traps that hit teams over and over:
- Goodhart on the eval. Once you optimize a prompt against a fixed eval, you start solving for the eval rather than the underlying task. Held-out evals exist for exactly this reason, and people forget to keep them held out.
- Contamination. If your eval inputs are anywhere on the public internet, the model may have seen them during training and can “ace” them for reasons that don’t generalize. There’s no clean way to confirm contamination from outside the lab; the usual workaround is to generate fresh eval data privately, but I don’t have a survey to point at for how widely that’s actually done in industry.
- Tasks where the right answer is “I don’t know.” Eval sets often reward producing some answer. A model that hallucinates confidently on impossible questions can outscore one that correctly refuses; the sketch after this list shows one way to score refusal explicitly.
- Multi-step / agent eval. When the LLM is in a loop calling tools, there is no single output to grade. You’re grading a trajectory. Trajectory eval is its own research area and the public state of the art is rough — a lot of teams just look at success rate on end goals and don’t really know why a run failed.
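One way to stop rewarding confident answers on impossible questions is to make refusal an explicit outcome in the scoring rule. A minimal sketch follows; the marker list is deliberately naive, and the 0.3 partial credit is an arbitrary choice, not a recommendation.

```python
REFUSAL_MARKERS = ("i don't know", "i can't answer", "not enough information")

def is_refusal(answer: str) -> bool:
    # Naive stand-in; a real refusal detector would itself need evaluating.
    return any(marker in answer.lower() for marker in REFUSAL_MARKERS)

def score(answer: str, reference: str | None) -> float:
    """reference is None for questions with no answerable ground truth."""
    if reference is None:
        return 1.0 if is_refusal(answer) else 0.0        # impossible question: refusal is the win
    if is_refusal(answer):
        return 0.3                                       # unhelpful, but at least not a hallucination
    return 1.0 if answer.strip() == reference else 0.0   # swap in whatever grader you actually trust
```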
I want to mark a gap honestly: there is no settled, widely accepted methodology for evaluating open-ended LLM outputs at the time of this post. Public benchmarks exist, judge models exist, frameworks like Evals and others exist, but the question “is your model better than mine on my actual task” mostly does not have a turnkey answer. Anyone who tells you it does is selling something.
Famous related terms
- LLM-as-a-judge — LLM judge = one model graded by another — fast and flexible, biased in known ways. Not a free pass.
- Held-out eval — held-out eval = test set you don't tune against — the main defence against Goodharting your own benchmark. Useful only as long as it stays held out.
- Hallucination — hallucination = confident output, no grounding — eval is partly a defence against this leaking into prod.
- Temperature — temperature = how peaked the sampling distribution is — affects how many samples per input you need to estimate a rate.
- Goodhart’s law — Goodhart = "a measure that becomes a target stops being a good measure" — every eval suite eventually has to outrun this.
- Trajectory eval — trajectory eval = grading a sequence of tool calls and observations, not one answer — open research area for agents.
Going deeper
- Anthropic, OpenAI, and others publish eval methodology notes in their model cards and system cards — those are the most honest public accounts I know of how labs themselves grade their models, including the limitations.
- Stanford HELM and similar academic benchmarking projects — useful for seeing the shape of the problem at scale, less useful as a stand-in for evaluating your task.
- The “LLM-as-a-judge” line of papers (Zheng et al. and follow-ups) — a well-studied option among the imperfect ones, with several biases documented.
- Anything by people running eval in production at agent companies — the public writeups are still patchy, but the genre is growing fast and worth tracking.