Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why is evaluating an LLM so much harder than testing normal software?

Unit tests pass or fail. LLM outputs don't. The hard part isn't running the eval — it's deciding what 'correct' even means when there are a million right answers.

AI & ML · intermediate · Apr 29, 2026

Why it exists

Anyone who has shipped a feature backed by an LLM has hit the same wall. You write a prompt. It looks great on the five examples you tried. You ship it. Two weeks later support is forwarding you outputs that are wrong in ways your five examples never hinted at. You tweak the prompt. The new version fixes those cases — and silently regresses three others you’d already considered solved.

Normal software has a clean answer to “is this change good?” — the test suite is green or it isn’t. With an LLM there is no green. Outputs are free-form text, the same input can produce different outputs, “correct” is a judgment call, and the model under test is a giant black box you didn’t train and can’t introspect. Every team building on top of these things ends up reinventing the same painful machinery: a synthetic eval set, a scoring function that mostly works, and a haunted feeling that they don’t really know if today’s model is better than yesterday’s.

This post is about why that machinery is so hard to build, not how to build it. The shape of the problem is what trips people up.

Why it matters now

Every team using agents, chatbots, summarizers, classifiers, or “AI features” of any kind faces this. Three things make it especially painful right now:

  • Models change underneath you. A provider update or a model swap can shift behaviour with no code change on your side, so yesterday’s spot checks stop describing today’s system.
  • Eval sets go stale. Every product change shifts the input distribution, and a suite built on old traffic quietly measures a product that no longer exists.
  • There is no green fallback. Without explicit eval, “is this change good?” gets answered by whoever looked at the last five outputs.

So eval is the load-bearing thing that makes the rest of LLM engineering not-a-vibes-exercise. And it is much harder than it looks.

The short answer

LLM eval = test inputs + a grader you can defend + a metric that aggregates noisy outcomes

A regular test suite hides two pieces of that equation because they’re trivial: the grader is == and the aggregator is “all green.” An LLM eval forces you to build both pieces explicitly, and each one is its own research problem. That’s the whole reason it’s hard.
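
To make the equation concrete, here is its shape in a few lines of Python. This is a sketch, not a framework: call_model and grade are placeholder names for the system under test and whatever grader you end up defending.

    from typing import Callable

    def run_eval(
        inputs: list[str],
        call_model: Callable[[str], str],   # the system under test (placeholder)
        grade: Callable[[str, str], bool],  # the grader you must defend
    ) -> float:
        """Return the fraction of acceptable outputs: the aggregate metric."""
        passed = sum(grade(prompt, call_model(prompt)) for prompt in inputs)
        return passed / len(inputs)

A unit-test suite is the degenerate case of this function: grade is output == expected and the only acceptable return value is 1.0.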

How it works

Walk through what a passing test means in normal software:

  1. Run the function on a known input.
  2. Compare output to a known expected output with ==.
  3. Repeat for many inputs. Aggregate: any failure → fail.
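
For contrast, that entire procedure fits in a few lines; slugify here is an invented stand-in for any ordinary function:

    def slugify(title: str) -> str:
        return title.lower().replace(" ", "-")

    def test_slugify():
        cases = [("Hello World", "hello-world"), ("LLM Eval", "llm-eval")]
        for given, expected in cases:
            assert slugify(given) == expected  # step 2: the grader is ==
        # step 3: any AssertionError turns the whole suite red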

Now try to do the same for “summarize this support ticket”:

  1. There is no single expected output. A good summary can use different words, different ordering, different emphasis. Two correct summaries written by two humans will not be string-equal. So == doesn’t work, and neither does string-similarity — a rewording can be near-identical to the reference and still wrong, or wildly different and still right.

  2. The output isn’t deterministic. Even at temperature 0, you can get different tokens across runs (see why temperature 0 isn’t deterministic). So a single run doesn’t tell you “the model fails on this input”; it tells you “this sample failed.” If you want a stable failure rate, you often need multiple samples per input — which multiplies cost (a sampling sketch follows this list).

  3. You need a grader. Something has to decide if an output is correct. The realistic options are all flawed:

    • Exact-match / regex. Works only for narrow tasks (multiple choice, numeric answers, code that runs). Most real tasks aren’t this shape.
    • Reference-based metrics (BLEU, ROUGE, embedding similarity). Cheap, but they reward looking like the reference more than being right. Embedding similarity captures some meaning; it routinely misses the task-relevant kind.
    • LLM judges. Flexible, scale well, and can grade open-ended outputs. They also have known biases — preferring longer answers, preferring outputs that look like their own writing, and sometimes confidently mis-grading. They are not free of the same hallucination problem they’re meant to detect.
    • Humans. The gold standard, but slow and expensive, and human raters disagree with each other more than people expect.

    Whichever grader you pick, you’ve added a second model-shaped thing that itself needs to be evaluated. “How do I know my judge is right?” is a real question with no clean answer. (Two grader shapes are sketched after this list.)

  4. You need a dataset that represents production. This is where most eval suites quietly die. The five examples the developer tried are not a sample — they’re the inputs the developer can already think of. Real users hit edge cases the developer never imagined. Building an eval set that catches the long tail means harvesting real traffic, which raises privacy issues, requires labelling, and goes stale every time the product changes.

  5. The aggregate is a distribution, not a pass/fail. You end up with a summary like “94% of outputs are acceptable” (numbers illustrative, not measured). Whether today’s number beats yesterday’s depends on confidence intervals, on which slices regressed, and on whether the remaining failures got more severe even as they got rarer (an interval sketch follows this list).
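
The sampling sketch promised in item 2: estimate a per-input pass rate rather than trusting a single run. call_model and grade are the same placeholders as above, and k is the knob that multiplies your cost.

    def per_input_pass_rate(prompt, call_model, grade, k: int = 5) -> float:
        """Sample the model k times on one input and grade every sample."""
        passes = sum(grade(prompt, call_model(prompt)) for _ in range(k))
        return passes / k  # 3/5 reveals "flaky here"; a single run cannot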
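
The grader shapes from item 3, written as two interchangeable factories. judge_call is a placeholder for whatever judge model you wire up, and the PASS/FAIL rubric protocol is illustrative, not a standard.

    def exact_match(expected: str):
        """Narrow-task grader: only fits tasks with one canonical answer."""
        def grade(prompt: str, output: str) -> bool:
            return output.strip() == expected.strip()
        return grade

    def llm_judge(judge_call, rubric: str):
        """Flexible grader, and a second model-shaped thing to evaluate."""
        def grade(prompt: str, output: str) -> bool:
            verdict = judge_call(
                f"Rubric: {rubric}\n\nInput: {prompt}\n\n"
                f"Output: {output}\n\nAnswer PASS or FAIL."
            )
            return verdict.strip().upper().startswith("PASS")
        return grade

The usual answer to “how do I know my judge is right?” is a small set of human-labelled outputs you score the judge against: an eval for the eval.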
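
And the interval sketch from item 5. A Wilson score interval (ordinary statistics, nothing LLM-specific) turns a raw pass rate into a range you can honestly compare across days:

    import math

    def wilson_interval(passed: int, total: int, z: float = 1.96):
        """Approximate 95% confidence interval for a pass rate (z = 1.96)."""
        p = passed / total
        denom = 1 + z**2 / total
        center = (p + z**2 / (2 * total)) / denom
        half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
        return center - half, center + half

With 94 passes out of 100 this gives roughly (0.875, 0.972), which overlaps the interval for 92 out of 100: at this sample size, a two-point “improvement” can be pure noise.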

Now stack those problems together and you can see why LLM eval feels fractal. Every single step that was free in unit testing is its own project.

Where it gets especially weird

A few specific traps hit teams over and over: nondeterminism that masks real regressions (or invents fake ones), judges whose biases quietly redefine what “good” means, and eval sets that drift away from production until the numbers describe a product that no longer exists.

I want to mark a gap honestly: there is no settled, widely accepted methodology for evaluating open-ended LLM outputs at the time of this post. Public benchmarks exist, judge models exist, frameworks like OpenAI’s Evals and others exist, but the question “is your model better than mine on my actual task” mostly does not have a turnkey answer. Anyone who tells you it does is selling something.

Going deeper