
How does an AI model decide what to say?

It looks like one big choice — you type a question, you get an answer. Underneath it's thousands of tiny choices, made one token at a time, with no plan and no rewind.


Why it exists

The question feels simple. You typed a prompt. The model wrote an answer. Something in there made a decision: which words to use, which facts to assert, which direction to take the response. Where does that decision happen? Is there a moment, somewhere inside the network, where the model “picks the answer”?

The honest answer is: no. There is no such moment. The model never decides on an answer, because there is no single act of deciding. The answer assembles itself one token at a time, and the model has no commitment to any of it until each token is already chosen. By the time the response feels like a thing, several hundred or thousand tiny picks have already happened, each one conditional on the picks before it, and none of them representing the answer as a whole.

That sounds wrong because it doesn’t match how it feels to read the output. Coherent paragraphs look like they were planned. The mechanism underneath is stranger than that, and most of the user-visible quirks of language models fall out of it once you see it clearly.

Why it matters now

Most of the things engineers run into when they ship LLM features stop being mysterious once you stop thinking of generation as a single decision:

- The same model, hosted by two different providers, produces noticeably different outputs for the same prompt.
- A small change to a sampling setting changes the character of everything the model writes.
- A response contradicts its own opening paragraph and never goes back to fix it.

If your mental model is “the model has an answer in mind and is typing it out,” the above are confusing. If your mental model is “the model is taking one step at a time, with no plan it has access to,” they’re inevitable.

The short answer

model decides ≈ forward pass over the whole context + a sampler picks one token + append it + repeat

There is no other step. Every “decision” you’re seeing is the cumulative effect of doing this hundreds or thousands of times in a row. The forward pass is the model. The sampler is not — it’s a small, cheap function that runs after, and changing it changes everything you see.

How it works

One step

Start with the prompt. It’s been chopped into tokens — small chunks of text, each with an integer ID. The model takes that whole sequence and runs it through the transformer: dozens of layers of attention and feed-forward computation. The output of all that work, at the very end, is a single vector of logits — one number per token in the model’s vocabulary. The vocabulary is typically 30k–200k tokens.

That vector of logits is what the model “thinks.” It is not an answer. It’s a guess, for this position, at how likely each possible next token is.
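To make that concrete, here is a minimal sketch of a single forward pass using the Hugging Face transformers library, with GPT-2 standing in as the model. GPT-2 is just a convenient example; any causal LM exposes the same shape of output.

    # One forward pass: the whole token sequence goes in, and the output
    # we care about is the logit vector at the last position.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # The prompt becomes a sequence of integer token IDs.
    input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

    with torch.no_grad():
        logits = model(input_ids).logits   # shape: [1, seq_len, vocab_size]

    # Only the last position matters for picking the *next* token.
    next_token_logits = logits[0, -1]      # one number per vocabulary entry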

Pass the logits through softmax to convert them into a probability distribution. Now you have, e.g., “the” at 0.31, “a” at 0.22, “however” at 0.04, “octopus” at 0.0000003, all the way down through every entry in the vocabulary.

This is everything the model has to say about the next token. It is not a token. It is a distribution.
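Continuing the sketch above, here is the softmax step and a peek at the top of the resulting distribution. The exact tokens and probabilities you see depend on the model and the prompt.

    # Turn the logit vector into a probability distribution and
    # inspect the most likely next tokens.
    probs = torch.softmax(next_token_logits, dim=-1)   # entries sum to 1.0

    top = torch.topk(probs, k=5)
    for p, idx in zip(top.values, top.indices):
        print(f"{tokenizer.decode(int(idx))!r}  {p.item():.4f}")
    # Every vocabulary entry gets *some* probability; nearly all of it
    # is vanishingly small.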

The sampler

To get an actual token, something has to pick from that distribution. That’s sampling, and it lives outside the model — usually in the inference server. The simplest sampler is “pick the highest-probability token.” The most common in production is “draw randomly from the distribution after reshaping it” (see temperature and the related top-p / top-k knobs covered in “why beam search died for LLMs”).

The sampler is the only thing in the entire pipeline that turns the model’s smear of probabilities into a definite token. It’s small, it’s cheap, and the choice of sampler is part of why two providers running the same model can produce noticeably different outputs.
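As a sketch, here are the two samplers described above, operating on the logit vector from the earlier snippet. Real inference servers layer more on top (top-p, top-k, repetition penalties), but the core really is this small.

    # Two samplers over the same distribution. Greedy always takes the
    # argmax; temperature sampling reshapes the logits, then draws.
    def greedy(logits):
        return int(torch.argmax(logits))

    def sample(logits, temperature=0.8):
        probs = torch.softmax(logits / temperature, dim=-1)
        return int(torch.multinomial(probs, num_samples=1))

    print(tokenizer.decode(greedy(next_token_logits)))   # deterministic
    print(tokenizer.decode(sample(next_token_logits)))   # varies run to run

Note that the model is untouched in both cases; the only thing that changed is the function reading the distribution.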

Repeat

Once a token is sampled, append it to the sequence. The new sequence — your original prompt plus this one token — goes back through the network. A new forward pass. A new logit vector. A new distribution. A new sample. Append.

Do this until the model emits a special end-of-turn token, or until you hit a length cap.

That is the whole loop. Forward pass, sample, append, forward pass, sample, append. A 500-token answer is 500 of these. A 10,000-token reasoning trace is 10,000. (In practice, the KV cache reuses the attention work already done on prior tokens, so each new step is fast — but the logical loop is the same.)
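Here is the whole loop as a self-contained sketch, again with GPT-2 as a stand-in. This naive version recomputes attention over the full sequence on every step; a real server would reuse the KV cache, but the logic is identical.

    # Forward pass, sample, append, repeat, until EOS or a length cap.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def generate(prompt, max_new_tokens=50, temperature=0.8):
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids
        for _ in range(max_new_tokens):
            with torch.no_grad():
                logits = model(input_ids).logits[0, -1]        # fresh distribution
            probs = torch.softmax(logits / temperature, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)  # the sampler
            if next_id.item() == tokenizer.eos_token_id:
                break                                          # end-of-sequence token
            input_ids = torch.cat([input_ids, next_id.unsqueeze(0)], dim=-1)
        return tokenizer.decode(input_ids[0])

    print(generate("The capital of France is"))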

What “deciding” actually is

A few consequences are worth pulling out, because they’re the parts that don’t feel right at first:

- The model never commits to an answer. Each token is picked conditional on the tokens already emitted, including any earlier missteps, which are now just context like everything else.
- There is no rewind. Once a token is appended, every later step treats it as fixed; the model can only continue from it, never revise it.
- The answer you read is one sampled path among the many the distributions allowed. Run the loop again and the path can differ.

Where the metaphor breaks

A few honest caveats, places where “no plan at all” is too strong:

- Each step’s distribution is computed from the entire context by a network trained on long-range structure, so every “local” pick already reflects global patterns. That is how locally chosen tokens manage to trace globally coherent text.
- Interpretability work has found forward-looking structure in the hidden states: activations that appear to anticipate, say, a rhyme word several tokens before it is written. The representations can encode where the text is heading, even if nothing reads that out as an explicit plan.
- Reasoning models write intermediate steps into their own context before answering. That is a plan of sorts, but one that exists as visible text the model sampled token by token, not as hidden intent.

The headline still holds: there is no single decision, just many small ones; the model is not executing a plan it has access to, only sampling from a fresh distribution at each position; and the answer you read is the path the sampler walked, not a hidden truth being read out.
