Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why is structured output so hard?

You ask the model for JSON. Sometimes it gives you a trailing comma. Sometimes a markdown fence. Sometimes prose. Why is this still a problem?

AI & ML · intermediate · Apr 29, 2026

Why it exists

Here’s the thing that should feel weird the first time you hit it.

You ask an LLM to “respond with valid JSON matching this schema.” It mostly does. Then, on maybe one in fifty calls — or one in five if your schema is awkward — it gives you back something that is almost JSON: a stray trailing comma, an unescaped quote, a markdown code fence wrapped around the JSON, a chatty preamble like “Sure! Here’s the JSON:” before the actual object, or a key name that matches the vibe of your schema but not the spelling. Your parser explodes. Your pipeline retries. Your ops dashboard lights up.

This should be surprising. The model can write working code in fifteen languages. It can summarize a legal contract. It clearly knows what JSON is — JSON is everywhere in any plausible web-scale training corpus, even if no provider publishes exactly how much of it their models trained on. Why does this one specific task — “produce a string of bytes that satisfies a formal grammar” — fail in a way that “write a haiku about kubernetes” does not?

The answer lives in the gap between two things that look similar from the outside but aren’t: a model that has seen a lot of JSON and a model that can only emit valid JSON. By default the model is the first kind. Making it the second kind is harder than it looks, and the awkwardness of every “JSON mode,” “structured outputs,” “function calling,” and tool-use API you’ve used is downstream of that.

Why it matters now

Structured output is the seam between LLMs and everything else.

If you don’t have a feel for why this is hard, you’ll either over-trust prompt-only JSON (“works on my eval, fails in prod”) or over-engineer around a problem that has a clean solution at the inference layer.

The short answer

structured output = next-token sampling + a hard constraint that the whole emitted string parses against a grammar

A language model is a probability distribution over the next token. A schema is a hard yes/no constraint on the whole sequence of tokens. Those two objects don’t compose for free. Every approach to structured output is some way of forcing them to compose — either by hoping (“please return JSON”), by fixing it after the fact (parse, retry), or by changing what the model is allowed to sample at each step (constrained decoding). Each option is a different trade between reliability, speed, and how much of the schema the model actually understood.

How it works

To see why this is hard, you have to look at what the model is actually doing when it generates text.

The mismatch: distributions vs. grammars

At each step of generation, the model produces a probability distribution over its entire vocabulary — typically tens to hundreds of thousands of token IDs. Sampling picks one. The next step conditions on that pick. The model has no built-in notion of “I am currently inside a JSON string” or “the next character must be a closing brace.” It just has the conditional distribution that fell out of training on a giant pile of text.
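To make that concrete, here is a toy sketch of a single decoding step. The six-token vocabulary and the logits are invented; a real model scores a vastly larger vocabulary, but the mechanics are the same, and nothing in this loop knows anything about grammars.

```python
import numpy as np

# Toy sketch of one decoding step. The vocabulary and logits are made up;
# a real model scores tens to hundreds of thousands of token IDs each step.
vocab = ['{', '"name"', ':', ' "Ada"', '}', 'Sure!']
logits = np.array([1.2, 0.4, 2.5, 1.9, 0.3, 0.8])  # raw scores from the model

probs = np.exp(logits - logits.max())
probs /= probs.sum()                                # softmax: a distribution over the vocab

next_id = np.random.choice(len(vocab), p=probs)     # sampling picks exactly one token
print(vocab[next_id])
# The next step conditions on this pick and repeats. At no point does this
# loop know whether it is "inside a JSON string" -- that concept doesn't exist here.
```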

A schema, on the other hand, is a hard constraint. The output either parses or it doesn’t. There’s no “70% valid JSON.” A trailing comma turns a 5,000-character valid response into a 0-character valid response.
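The cliff edge is easy to see in two lines of Python: one stray comma and the parser rejects the entire response.

```python
import json

almost = '{"name": "Ada", "role": "admin",}'   # valid JSON except one trailing comma
json.loads(almost)
# json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes
```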

The model has learned, statistically, that JSON-looking prefixes tend to be followed by JSON-looking continuations. That’s good enough most of the time. It is not good enough all of the time, because “JSON-looking” is a soft prior and “valid JSON” is a hard property. Soft priors fail at the tails, and the tails are where production lives.

Three approaches, in increasing order of “actually works”

1. Prompt only — “please return JSON.”

You write the schema in the system prompt. You add “respond ONLY with JSON, no preamble, no markdown fence.” The model mostly complies. This is what every developer tries first.

What goes wrong: the failure modes are exactly the ones you’d predict from training data. Markdown code fences are extremely common in JSON examples on the web, so the model has a strong prior to wrap output in ```json ... ```. Tutorial-style preambles (“Here’s the result:”) are common too. And subtle invariants — every key from the schema is present, no extra keys, enums use the right casing — aren’t things the model can verify; it can only imitate.
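A small post-hoc checker makes those failure modes concrete. The expected keys and enum values below are hypothetical and this is a sketch rather than a real validator, but the checks mirror exactly the problems above: fences, preambles, and key or enum drift that a bare json.loads won't flag.

```python
import json
import re

EXPECTED_KEYS = {"name", "role"}     # hypothetical schema keys
ROLE_ENUM = {"admin", "viewer"}      # hypothetical enum values

def check(raw: str) -> list[str]:
    """Return a list of problems with a supposedly-JSON model response."""
    problems = []
    if raw.lstrip().startswith("```"):
        problems.append("wrapped in a markdown fence")
    elif not raw.lstrip().startswith(("{", "[")):
        problems.append("chatty preamble before the JSON")
    # Strip a fence, if any, so we can still inspect the payload underneath.
    payload = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        obj = json.loads(payload)
    except json.JSONDecodeError as e:
        problems.append(f"does not parse: {e.msg}")
        return problems
    if not isinstance(obj, dict):
        problems.append("top level is not an object")
        return problems
    missing = EXPECTED_KEYS - obj.keys()
    extra = obj.keys() - EXPECTED_KEYS
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if extra:
        problems.append(f"extra keys: {sorted(extra)}")
    if obj.get("role") not in ROLE_ENUM:
        problems.append(f"role {obj.get('role')!r} not in the allowed enum")
    return problems

print(check('{"Name": "Ada", "role": "Admin"}'))
# ["missing keys: ['name']", "extra keys: ['Name']", "role 'Admin' not in the allowed enum"]
```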

2. Retry on parse failure.

Run the model. Try to parse. If it fails, send the error back and ask it to fix. This is what most early LLM apps do, and it works surprisingly well — modern models are good at fixing their own JSON.

What goes wrong: every retry is another full call, with full prompt and full context. Latency doubles. Cost doubles. And there’s no upper bound on retries that’s both safe and cheap.
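Sketched out, the loop looks roughly like this. `call_model` is a hypothetical stand-in for whichever client you use; the thing to notice is that every trip through the loop re-sends the whole conversation.

```python
import json

def call_model(messages):
    """Hypothetical stand-in for whatever LLM client you're actually using."""
    raise NotImplementedError

def get_json(messages, max_retries=2):
    """Parse-or-retry: each failed parse costs another full model call."""
    for attempt in range(max_retries + 1):
        raw = call_model(messages)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as e:
            # Feed the parse error back and ask the model to repair its output.
            messages = messages + [
                {"role": "assistant", "content": raw},
                {"role": "user", "content": f"That wasn't valid JSON ({e.msg}). "
                                            "Reply with only the corrected JSON."},
            ]
    raise ValueError(f"no valid JSON after {max_retries + 1} calls")
```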

3. Constrained decoding — restrict what the model is allowed to sample.

This is the approach behind the strict, schema-guaranteeing structured-output features both OpenAI and Anthropic now document explicitly (OpenAI’s “Structured Outputs,” Anthropic’s strict tool use and structured-output modes). Not every JSON-ish mode is strict — older “JSON mode” features only promise parseable JSON, not schema conformance — but the strict ones are doing the same trick. At each generation step, before sampling, take the model’s distribution over the vocabulary and mask out every token that would make the output invalid under the schema. Renormalize. Sample from what’s left. Repeat.

The grammar — typically expressed as a regex or context-free grammar and compiled to a finite-state machine or pushdown automaton — tells you, given the prefix emitted so far, which tokens can still keep the output valid. If the prefix is `{"name": "`, then a token whose bytes fit somewhere inside a JSON string body is legal and a token that starts with `}` is not. You set the probability of every illegal token to zero.

The model still chooses which legal token, weighted by its own distribution. It still picks the name, the value, the wording. It just can’t go off the rails of the grammar.

This is why providers can promise “the output will parse.” They aren’t trusting the model; they’re forbidding everything else.
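Here is a toy sketch of that masking loop. The grammar, vocabulary, and logits are all invented, and the “grammar” is a tiny hand-rolled character automaton rather than a compiled schema — this is not any provider's implementation — but the mechanics match the description: at each step, zero out every token that would leave the grammar, renormalize, and sample from what's left.

```python
import numpy as np

TEMPLATE = '{"name": "'   # literal prefix the toy grammar requires
# Toy grammar: '{"name": "' + one or more ASCII letters + '"}'

def advance(state, ch):
    """Character-level automaton for the toy grammar; returns the next
    state, or None if the character would leave the grammar."""
    phase, pos = state
    if phase == "prefix":
        if pos < len(TEMPLATE) and ch == TEMPLATE[pos]:
            pos += 1
            return ("body", 0) if pos == len(TEMPLATE) else ("prefix", pos)
        return None
    if phase == "body":
        if ch.isalpha():
            return ("body", pos + 1)
        if ch == '"' and pos > 0:          # close the string (needs >= 1 letter)
            return ("close", 0)
        return None
    if phase == "close":
        return ("done", 0) if ch == "}" else None
    return None                            # "done": nothing further is legal

def legal(state, token):
    """Can this token be appended without leaving the grammar?"""
    for ch in token:
        state = advance(state, ch)
        if state is None:
            return False, None
    return True, state

# Made-up vocabulary and logits; a real decoder masks the full vocabulary.
vocab  = ['{"name": "', 'Ada', 'Sure! ', '"}', ',', '```json']
logits = np.array([2.0, 1.5, 3.0, 1.0, 0.5, 2.5])

state, out = ("prefix", 0), ""
while state[0] != "done":
    mask  = np.array([legal(state, t)[0] for t in vocab])
    probs = np.where(mask, np.exp(logits), 0.0)
    probs /= probs.sum()                   # renormalize over legal tokens only
    pick  = np.random.choice(len(vocab), p=probs)
    out  += vocab[pick]
    state = legal(state, vocab[pick])[1]

print(out)   # e.g. {"name": "Ada"} -- it parses no matter what the logits prefer
```

Note that the chatty “Sure! ” token has the highest logit in this toy setup; the mask simply never lets it through.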

Where it gets subtle

The thing to walk away with: structured output isn’t a prompt-engineering problem the model is bad at. It’s a fundamental mismatch between a probability distribution and a hard constraint. The only way to guarantee the format at decode time is to let the constraint veto the distribution at every step. Prompting and retries can get you most of the way, but they’re statistical, not categorical — and which one you reach for should depend on whether your pipeline treats parse failures as noise or as bugs.

Going deeper