
Why do LLMs hallucinate confidently instead of saying 'I don't know'?

The model isn't lying. It was never trained to know when to stop talking.

AI & ML intro Apr 29, 2026

Why it exists

The first time you catch a language model fabricating a citation — a real-sounding paper title, plausible authors, a journal that exists, a year that fits, and none of it true — the natural reaction is to ask what’s broken. Why didn’t it just say “I don’t know”? A search engine would have shrugged. A junior engineer would have admitted the gap. The model wrote a confident paragraph instead.

Nothing is broken. The behavior is what falls out of how the thing was built. An LLM is a function that takes a sequence of tokens and outputs a probability distribution over the next one. It was trained, for months, on the single objective of making that distribution match the next token in a giant pile of human-written text. Every gradient step rewarded it for producing the most likely continuation. None of those steps explicitly rewarded it for recognizing the continuation it was about to produce wasn’t grounded in anything real.

So when you ask “what paper showed X?” and there is no such paper in the model’s parameters, the model still has to emit a next token. The distribution doesn’t have a “no-paper” option. It has tokens. The most probable tokens, given that the prompt looks like a citation request, are ones that look like a citation. The model fluently emits one. The fluency is the whole problem: it’s the same fluency that makes it useful for everything else.
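To make that concrete, here is a toy sketch in Python. The vocabulary, the logits, and the prompt are all made up; no real model is this small, and the numbers aren’t from anywhere. The point is the shape of the situation: at generation time the model has to pick something from its vocabulary, and “there is no such paper” isn’t in it unless it was trained in.

```python
import torch

# Pretend vocabulary and pretend logits after a prompt like
# "What paper showed X? The title of the paper was ..."
vocab = ["Attention", "BERT", "Improving", "Language", "The", "<eos>"]
logits = torch.tensor([3.1, 1.2, 2.4, 0.8, 2.9, -1.5])

probs = torch.softmax(logits, dim=-1)
next_token = vocab[torch.multinomial(probs, num_samples=1).item()]

# There is no "no such paper" entry in the vocabulary. The distribution only
# ranks continuations that look right, so one of them gets emitted regardless.
print({w: round(p, 3) for w, p in zip(vocab, probs.tolist())}, "->", next_token)
```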

Hallucination, in this frame, isn’t a glitch in an otherwise truthful system. It’s the same machinery as every other answer the model gives — just applied to a question whose true answer wasn’t in the training data or wasn’t recoverable from it. The system’s default mode is “complete the text.” “Don’t know” is a learned exception that has to be installed on top.

Why it matters now

Every product built on an LLM in 2026 inherits this property. A chatbot confidently inventing a refund policy is the same failure mode as a coding agent importing a library function that doesn’t exist, which is the same failure mode as a research assistant fabricating a citation. Different surfaces, identical underlying cause: the model was asked something it didn’t actually know, and its job description is “produce fluent text,” not “produce true text.”

This is also why the engineering response to hallucination almost never lives inside the model itself. It lives around the model: RAG to ground answers in real documents, tool calls so the model can actually look things up instead of guessing, structured outputs so a downstream validator can reject malformed answers, and human review for anything load-bearing. If you’ve wondered why so much agent infrastructure is “shove ground truth into the prompt at the last moment” — this is why.
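For the “shove ground truth into the prompt” pattern specifically, a minimal sketch looks something like the following. Both retrieve and llm_complete are stand-in stubs invented for this illustration, not any particular library’s API.

```python
def retrieve(question: str) -> list[str]:
    # Stand-in for real retrieval: vector search, keyword lookup, a database query.
    return ["Refund policy: items may be returned within 30 days with a receipt."]

def llm_complete(prompt: str) -> str:
    # Stand-in for the actual model call; echoes the prompt so the sketch runs.
    return prompt

def build_grounded_prompt(question: str, documents: list[str]) -> str:
    context = "\n\n".join(f"[doc {i}] {d}" for i, d in enumerate(documents, 1))
    return (
        "Answer using only the documents below. "
        "If the answer is not in them, say you don't know.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

question = "What is the refund window?"
print(llm_complete(build_grounded_prompt(question, retrieve(question))))
```

Everything load-bearing moves into the retrieved documents and whatever validates the output; the model’s job shrinks to paraphrasing what it was handed.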

The short answer

hallucination = next-token model + no built-in "I don't know" + a prompt the model can't actually answer

The model is doing exactly what it was trained to do: emit a fluent continuation. When the prompt asks for a fact that isn’t reliably encoded in its weights, “fluent” and “true” come apart, and “fluent” wins because “fluent” is what was optimized.

How it works

Three pieces stack to produce the behavior.

1. The training objective doesn’t reward calibration. Pretraining is next-token prediction over text scraped from the internet and books. The loss function compares the model’s predicted distribution to the actual next token. There is no term in that loss that says “if you don’t know, output a token meaning I don’t know.” There can’t be — the training data doesn’t come with a label for which sentences the model will eventually struggle with. So the model learns to predict text; calibration about its own knowledge is a side effect at best. A minimal sketch of that loss appears after this list.

2. Post-training teaches style, not always epistemics. After pretraining comes RLHF and similar methods. Humans rate responses; the model is fine-tuned to produce the kind of response humans prefer. The catch is what humans tend to prefer: confident, fluent, specific, helpful answers. A response that hedges or refuses on a question the rater thinks should have an answer often gets penalized. So the post-training step can actively increase the model’s tendency to produce a confident answer, even on questions where confidence isn’t warranted. The exact magnitude of this effect varies by lab and is hard to measure from outside — the public details on modern post-training pipelines are sparse — but the direction is well-documented in the research literature. A toy sketch of the preference loss appears after this list as well.

3. The model has no privileged access to its own uncertainty. This is the subtle one. You might hope the model “knows” when it’s guessing — that there’s some internal signal, like a low-probability distribution over the next token, that maps cleanly to “I’m unsure.” Sometimes there is. Often there isn’t. A model can be highly confident (a very peaked distribution) about a token that turns out to be wrong, because the training data made that continuation look overwhelmingly likely in that context. Confidence in the output distribution is confidence about the text, not about the world. The two only line up when the training data lined them up. There’s active research on whether internal model states encode something like a truth signal that could be read out separately; I don’t have a settled, citable bottom line on how well that works in current production models, and I’d treat any strong claim either way with suspicion. The last sketch after this list shows the text/world gap in a few lines.
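Here are the three sketches promised above. All of them are toy Python: made-up tensors, made-up vocabularies, made-up reward scores. They illustrate the shape of each objective, not anyone’s actual training code.

First, the pretraining loss from point 1. It is just cross-entropy against whatever token the text actually contained; there is nowhere in it for “I don’t know” to live.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 50_000, 8
logits = torch.randn(seq_len, vocab_size)           # the model's prediction at each position
targets = torch.randint(0, vocab_size, (seq_len,))  # the token that actually came next

loss = F.cross_entropy(logits, targets)
# Gradient descent pushes probability mass toward whatever the text contained.
# No term rewards abstaining when the model has nothing reliable to say.
```

Second, the preference step from point 2, as a Bradley-Terry-style pairwise loss. Real pipelines differ by lab and are mostly undisclosed, but the point survives the simplification: the loss optimizes for whichever response the rater picked, and nothing in it checks truth.

```python
import torch
import torch.nn.functional as F

reward_chosen = torch.tensor([1.8])    # reward-model score for the confident, specific answer
reward_rejected = torch.tensor([0.3])  # score for the hedged "I'm not sure" answer

loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
# If raters tend to pick confident answers, that preference is what gets
# optimized into the model; factual accuracy never enters the loss directly.
```

Third, the text/world gap from point 3, with made-up numbers. Low entropy and a high max probability tell you the model is confident about the text; they say nothing about whether the text is true.

```python
import torch

probs = torch.tensor([0.96, 0.02, 0.01, 0.01])   # a very peaked next-token distribution
entropy = -(probs * probs.log()).sum()
max_prob = probs.max()

print(f"entropy={entropy.item():.3f} nats, max_prob={max_prob.item():.2f}")
# Both signals say "confident". Neither says whether the favored token
# corresponds to a real paper, person, or fact.
```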

A small worked example. Ask a model: “What was the title of the paper that introduced the term ‘banana attention’ in 2019?” No such paper exists. The model’s internal state at that prompt looks roughly like: “this is a citation question, the user expects a paper title, here are the kinds of tokens that follow ‘The title of the paper was…’ in my training data.” It samples from that. Out comes a fluent, plausible, fabricated title. The model didn’t lie. It pattern-matched, because pattern-matching is all it does.

The reason “just say I don’t know” is harder than it sounds: the model would have to (a) detect that this particular prompt is in its unknown-unknown zone, (b) override the strong prior that “user asked a question, fluent specific answer is what gets rewarded,” and (c) emit a refusal that itself looks like a fluent continuation. All three are learnable behaviors — modern models are noticeably better at this than 2022-era ones — but they’re added against the grain of the base objective, not naturally produced by it.
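The naive fix, a confidence threshold that triggers a refusal, is easy to sketch and worth seeing precisely because of where it falls short. The threshold and the equation of “low max probability” with “doesn’t know” are both assumptions made up for this sketch; as point 3 above argues, a peaked distribution can still be wrong.

```python
import torch

def answer_or_abstain(logits: torch.Tensor, threshold: float = 0.5) -> str:
    probs = torch.softmax(logits, dim=-1)
    if probs.max().item() < threshold:
        return "I don't know."
    return f"token {probs.argmax().item()}"  # stand-in for actual decoding

print(answer_or_abstain(torch.tensor([0.2, 0.3, 0.1, 0.25])))  # flat distribution: abstains
print(answer_or_abstain(torch.tensor([9.0, 0.1, 0.2, 0.3])))   # peaked: answers, right or not
```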

A useful mental model: the LLM is a very good mimic of “what someone who knew the answer would say.” When such a someone exists in the training distribution, the mimicry is also true. When such a someone doesn’t exist — because the answer isn’t knowable, isn’t in the data, or isn’t recoverable from how it was encoded — the mimicry continues anyway, because mimicry is what the weights compute. The output is indistinguishable on the surface; that’s the whole problem.

Going deeper

A note on what I’m sure of: the mechanism above (next-token objective, no native “don’t know” signal, post-training that can amplify confidence) is well-established in the research literature. The quantitative picture — exactly how often current frontier models hallucinate, on which task families, and how much each post-training trick helps — moves every few months and varies wildly by benchmark. Treat any specific number you read with the same skepticism you’d apply to a model’s own confident citation.