Why is structured output so hard?
You ask the model for JSON. Sometimes it gives you a trailing comma. Sometimes a markdown fence. Sometimes prose. Why is this still a problem?
Why it exists
Here’s the thing that should feel weird the first time you hit it.
You ask an LLM to “respond with valid JSON matching this schema.” It mostly does. Then, on maybe one in fifty calls — or one in five if your schema is awkward — it gives you back something that is almost JSON: a stray trailing comma, an unescaped quote, a markdown code fence wrapped around the JSON, a chatty preamble like “Sure! Here’s the JSON:” before the actual object, or a key name that matches the vibe of your schema but not the spelling. Your parser explodes. Your pipeline retries. Your ops dashboard lights up.
This should be surprising. The model can write working code in fifteen languages. It can summarize a legal contract. It clearly knows what JSON is — JSON is everywhere in any plausible web-scale training corpus, even if no provider tells you the exact count. Why does this one specific task — “produce a string of bytes that satisfies a formal grammar” — fail in a way that “write a haiku about kubernetes” does not?
The answer lives in the gap between two things that look similar from the outside but aren’t: a model that has seen a lot of JSON and a model that can only emit valid JSON. By default the model is the first kind. Making it the second kind is harder than it looks, and the awkwardness of every “JSON mode,” “structured outputs,” “function calling,” and tool-use API you’ve used is downstream of that.
Why it matters now
Structured output is the seam between LLMs and everything else.
- Tool calling and agents. Every agent that calls a function, hits an API, or returns a result to a UI is doing structured output under the hood. An MCP tool call is a JSON object. A function call is a JSON object. If those fail to parse, the agent stalls.
- Pipelines that mix LLM and non-LLM code. “Extract the invoice total, the vendor, and the line items” is the model talking to a database. The schema is the contract. Schema drift means data drift.
- Cost of retries. Every “the parse failed, retry” round-trip is a full inference call. On a long-context request, that’s not free. The cheaper structured output is, the more you’re willing to use it.
- Why providers ship “JSON mode” and “Structured Outputs.” OpenAI, Anthropic, and the open-model serving stacks (vLLM, llama.cpp, etc.) all have features that promise the output will parse against a schema. They exist because the naive approach — ask nicely in the prompt — has a non-zero failure rate, and “non-zero” is unacceptable in a pipeline that runs millions of times.
If you don’t have a feel for why this is hard, you’ll either over-trust prompt-only JSON (“works on my eval, fails in prod”) or over-engineer around a problem that has a clean solution at the inference layer.
The short answer
structured output = next-token sampling + a hard constraint that the whole emitted string parses against a grammar
A language model is a probability distribution over the next token. A schema is a hard yes/no constraint on the whole sequence of tokens. Those two objects don’t compose for free. Every approach to structured output is some way of forcing them to compose — either by hoping (“please return JSON”), by fixing it after the fact (parse, retry), or by changing what the model is allowed to sample at each step (constrained decoding). Each option is a different trade between reliability, speed, and how much of the schema the model actually understood.
How it works
To see why this is hard, you have to look at what the model is actually doing when it generates text.
The mismatch: distributions vs. grammars
At each step of generation, the model produces a probability over its entire vocabulary — typically tens to hundreds of thousands of token IDs. Sampling picks one. The next step conditions on that pick. The model has no built-in notion of “I am currently inside a JSON string” or “the next character must be a closing brace.” It just has the conditional distribution that fell out of training on a giant pile of text.
A schema, on the other hand, is a hard constraint. The output either parses or it doesn’t. There’s no “70% valid JSON.” A trailing comma turns a 5,000-character valid response into a 0-character valid response.
The model has learned, statistically, that JSON-looking prefixes tend to be followed by JSON-looking continuations. That’s good enough most of the time. It is not good enough all of the time, because “JSON-looking” is a soft prior and “valid JSON” is a hard property. Soft priors fail at the tails, and the tails are where production lives.
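To make the mismatch concrete, here is a toy sketch in Python. The mini-vocabulary and the logit values are invented for illustration; the point is that the model’s side is a softmax that never assigns exactly zero probability to the bad continuation, while the schema’s side returns only pass or fail.

```python
import json
import math
import random

# Toy illustration of the mismatch. Imagine the emitted prefix so far is
# '{"total": 41.50'. The model's side is a softmax over a vocabulary: a soft
# preference in which every token keeps some probability mass.
vocab = ['}', ',}', ', "vendor"', ' USD']   # made-up mini-vocabulary
logits = [2.0, -1.5, 0.3, -3.0]             # made-up scores for "what comes next"

def sample(logits):
    weights = [math.exp(l) for l in logits]
    return random.choices(range(len(logits)), weights=weights, k=1)[0]

print(vocab[sample(logits)])  # usually '}', but ',}' is unlikely rather than impossible

# The schema's side is binary: the finished string parses or it doesn't.
def is_valid(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

print(is_valid('{"total": 41.50}'))   # True
print(is_valid('{"total": 41.50,}'))  # False: one trailing comma, zero validity
```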
Three approaches, in increasing order of “actually works”
1. Prompt only — “please return JSON.”
You write the schema in the system prompt. You add “respond ONLY with JSON, no preamble, no markdown fence.” The model mostly complies. This is what every developer tries first.
What goes wrong: the failure modes are exactly the ones you’d predict from training data. Markdown code fences are extremely common in JSON examples on the web, so the model has a strong prior to wrap output in ```json ... ```. Tutorial-style preambles (“Here’s the result:”) are common too. And subtle invariants — every key from the schema is present, no extra keys, enums use the right casing — aren’t things the model can verify; it can only imitate.
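To see those failure modes concretely, here is a hedged sketch: the three completions are invented examples of what a prompt-only request typically returns, and the regex cleanup is the kind of defensive scraping teams bolt on before moving to retries or constrained decoding. It is an illustration, not a recommendation.

```python
import json
import re

# Invented examples of what prompt-only "respond ONLY with JSON" actually returns.
raw_outputs = [
    '{"vendor": "Acme", "total": 120.0}',                            # the good case
    '```json\n{"vendor": "Acme", "total": 120.0}\n```',              # markdown fence
    'Sure! Here is the JSON:\n{"vendor": "Acme", "total": 120.0}',   # chatty preamble
]

def best_effort_parse(text: str) -> dict:
    """The defensive scraping most teams end up writing: strip code fences,
    grab the first {...} span, and hope the rest was noise."""
    text = re.sub(r"```(?:json)?", "", text)
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))

for raw in raw_outputs:
    print(best_effort_parse(raw))
```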
2. Retry on parse failure.
Run the model. Try to parse. If it fails, send the error back and ask it to fix. This is what most early LLM apps do, and it works surprisingly well — modern models are good at fixing their own JSON.
What goes wrong: every retry is another full call, with full prompt and full context. Latency doubles. Cost doubles. And there’s no upper bound on retries that’s both safe and cheap.
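A minimal sketch of that loop, assuming a placeholder call_model(prompt) -> str function standing in for whatever client you actually use:

```python
import json

MAX_RETRIES = 2  # every retry is another full-price, full-context inference call

def generate_json(prompt: str, call_model) -> dict:
    """Parse-or-retry loop. call_model(prompt) -> str is a placeholder for
    whatever client you use, not a real library call."""
    last_error = None
    for _ in range(1 + MAX_RETRIES):
        text = call_model(prompt)
        try:
            return json.loads(text)
        except json.JSONDecodeError as err:
            last_error = err
            # Send the parse error back and ask the model to repair its own output.
            prompt = (
                f"{prompt}\n\nYour previous reply was not valid JSON ({err}). "
                f"Reply again with ONLY the corrected JSON:\n{text}"
            )
    raise RuntimeError(f"still unparseable after {MAX_RETRIES} retries: {last_error}")
```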
3. Constrained decoding — restrict what the model is allowed to sample.
This is the approach behind the strict, schema-guaranteeing structured-output features both OpenAI and Anthropic now document explicitly (OpenAI’s “Structured Outputs,” Anthropic’s strict tool use and structured-output modes). Not every JSON-ish mode is strict — older “JSON mode” features only promise parseable JSON, not schema conformance — but the strict ones are doing the same trick. At each generation step, before sampling, take the model’s distribution over the vocabulary and mask out every token that would make the output invalid under the schema. Renormalize. Sample from what’s left. Repeat.
The grammar — typically expressed as a regex or context-free grammar and compiled to a finite-state machine or pushdown automaton — tells you, given the prefix emitted so far, which tokens can still keep the output valid. If the prefix is {"name": ", then a token whose bytes fit somewhere inside a JSON string body is legal and a token that starts with } is not. You set the probability of every illegal token to zero.
The model still chooses which legal token, weighted by its own distribution. It still picks the name, the value, the wording. It just can’t go off the rails of the grammar.
This is why providers can promise “the output will parse.” They aren’t trusting the model; they’re forbidding everything else.
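In code, one constrained decoding step looks roughly like the sketch below. The legal_token_ids(prefix) function is a placeholder for the grammar engine (the compiled finite-state machine or pushdown automaton); the sketch shows only the mask, renormalize, sample part.

```python
import math
import random

NEG_INF = float("-inf")

def constrained_step(logits: list[float], prefix: str, legal_token_ids) -> int:
    """One decoding step. legal_token_ids(prefix) -> set[int] is a stand-in for
    the grammar engine; this sketch only shows mask -> renormalize -> sample."""
    allowed = legal_token_ids(prefix)
    if not allowed:
        raise ValueError("grammar dead end: no token can keep the output valid")
    # Veto every illegal token by sending its logit to -inf.
    masked = [l if i in allowed else NEG_INF for i, l in enumerate(logits)]
    # Softmax over the survivors; the model's own preferences still decide
    # which legal token wins.
    peak = max(masked)
    weights = [math.exp(l - peak) if l > NEG_INF else 0.0 for l in masked]
    return random.choices(range(len(logits)), weights=weights, k=1)[0]
```

In practice the engine also controls stopping: the end-of-sequence token is typically only allowed once the grammar is in an accepting state, i.e. once the JSON object is actually complete.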
Where it gets subtle
- Tokenization vs. grammars. The grammar is defined over characters (or bytes). The model emits tokens, which can be multi-character chunks that straddle grammar states. A token might encode ", " — a quote, a comma, a space, another quote — all at once. The constraint engine has to reason about which token sequences can continue the prefix legally, not which characters. That’s a real engineering problem; it’s why Outlines, XGrammar, and similar projects exist as their own engineering effort, not three-line scripts. The library has to compute, for whichever grammar state the decoder is in, the set of legal tokens — which is bigger than the set of legal next characters and depends on the model’s specific vocabulary. Outlines famously precomputes this as an index from grammar states to legal token IDs; XGrammar argues that for some grammars (e.g. with infinitely many pushdown-automaton states) full precomputation isn’t possible and you have to mask adaptively at decode time. Different libraries pick different points on that trade. There’s a toy sketch of the token-level question after this list.
- Constrained decoding doesn’t make the answer correct. It makes the answer parse. The model can still emit a syntactically valid JSON object whose values are wrong, hallucinated, or not what you wanted. Schema enforcement is a guard against parse errors, not against hallucination. This is a common confusion in shipping product: “we added structured outputs and the agent still gave wrong answers.” Yes — those are separate problems.
- Grammars can hurt quality. Forcing the model into a grammar can push probability mass onto tokens it didn’t want to emit, which is a small distributional perturbation at every step. There is published work — including “Let Me Speak Freely?” (Tam et al., 2024) — arguing that aggressive format constraints can degrade reasoning quality on some tasks. There are also blog responses and provider experience pushing back, arguing that with good grammar design and prompting the effect is small or zero. I don’t have a clean, unanimous answer here; treat it as an open empirical question, and measure on your own task before assuming either direction.
- Why “function calling” looks like its own feature. Tool/function calling APIs are a specialized, productized version of constrained decoding: the schema is the function’s argument schema, and the output is forced to be a valid call. The provider is doing exactly the masking dance under the hood, plus some prompt-side scaffolding to teach the model when a tool call is appropriate.
- Streaming gets weird. If you stream tokens to a UI, a partially emitted constrained JSON object is almost-but-not-quite parseable at every intermediate step. Renderers either parse the partial with a forgiving parser or wait for the close brace. Both have failure modes.
- The same trick works for any grammar. JSON is the popular case, but constrained decoding will happily restrict output to valid SQL, valid regex, a specific BNF, a list of allowed strings, or a regex over the vocabulary. “Structured output” is the friendly name; the underlying machinery is general.
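Here is the toy sketch promised in the tokenization bullet above. The two-string “language” and the mini-vocabulary are invented; real engines answer the same question with an FSM or PDA walk over the model’s actual vocabulary rather than by enumerating a finite language.

```python
# Toy "grammar": the only two strings the schema admits.
LANGUAGE = ['{"ok": true}', '{"ok": false}']

# A made-up slice of a tokenizer vocabulary. Note the multi-character tokens:
# they are why the mask has to be computed over tokens, not characters.
VOCAB = {0: '{"', 1: 'ok', 2: '":', 3: ' true', 4: ' false', 5: '}', 6: '": ', 7: 'true}'}

def legal_tokens(prefix: str) -> set[int]:
    """Which tokens keep `prefix` extendable to some string in the language?"""
    return {
        tid for tid, piece in VOCAB.items()
        if any(s.startswith(prefix + piece) for s in LANGUAGE)
    }

print(legal_tokens('{"ok":'))   # {3, 4}: only ' true' or ' false' survive
print(legal_tokens('{"ok": '))  # {7}: one token spans the value AND the closing brace
```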
The thing to walk away with: structured output isn’t a prompt-engineering problem the model is bad at. It’s a fundamental mismatch between a probability distribution and a hard constraint. The only way to guarantee the format at decode time is to let the constraint veto the distribution at every step. Prompting and retries can get you most of the way, but they’re statistical, not categorical — and which one you reach for should depend on whether your pipeline treats parse failures as noise or as bugs.
Famous related terms
- JSON mode —
  JSON mode ≈ "the output will be syntactically valid JSON" + nothing about which JSON. The minimal version: you get a parseable object, but no guarantee it matches your schema. Provider docs still note edge cases (e.g. truncation under max-tokens), so even “valid” is best-effort.
- Structured Outputs / strict tool use —
  Structured Outputs = JSON mode + schema conformance. Provider product names vary (OpenAI’s “Structured Outputs,” Anthropic’s strict tool use / structured outputs); the underlying mechanism is constrained decoding against the schema.
- Function calling / tool use —
  function calling = constrained decoding + a per-tool argument schema + prompt scaffolding for "when to call". The agent-flavored version of structured output.
- Constrained decoding —
  constrained decoding = mask out illegal tokens at each step + sample from what's left. The general technique under all of the above.
- Outlines / XGrammar / llama.cpp grammars — open-source libraries that implement constrained decoding for self-hosted models. Worth reading the source if you want to demystify what hosted “structured output” features are doing.
- Hallucination — the failure mode structured output does not fix.
- Tokenization — the reason constrained decoding is harder than “filter by character.”
Going deeper
- Efficient Guided Generation for Large Language Models (Willard & Louf, arXiv 2023) — the paper behind the Outlines library; the cleanest explanation of how to compile a regex or grammar into a per-step token mask over a given model’s vocabulary.
- Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models (Tam et al., 2024) — the best-known “format constraints can hurt” paper. Read it alongside the blog responses from constrained-decoding library authors; the truth is task-specific.
- XGrammar (Dong et al., 2024) — a more recent constrained-decoding engine focused on making the per-step mask cheap enough to keep up with batched LLM serving.
- The OpenAI “Structured Outputs” and Anthropic tool-use docs — useful not for the marketing, but to notice what they promise (parses, matches schema) versus what they don’t (correctness).