How does an AI model decide what to say?
It looks like one big choice — you type a question, you get an answer. Underneath it's thousands of tiny choices, made one token at a time, with no plan and no rewind.
Why it exists
The question feels simple. You typed a prompt. The model wrote an answer. Something in there made a decision: which words to use, which facts to assert, which direction to take the response. Where does that decision happen? Is there a moment, somewhere inside the network, where the model “picks the answer”?
The honest answer is: no. There is no such moment. The model never decides on an answer, because there is no single act of deciding. The answer assembles itself one token at a time, and the model has no commitment to any of it until each token is already chosen. By the time the response feels like a thing, several hundred or thousand tiny picks have already happened, each one conditional on the picks before it, and none of them representing the answer as a whole.
That sounds wrong because it doesn’t match how it feels to read the output. Coherent paragraphs look like they were planned. The mechanism underneath is stranger than that, and most of the user-visible quirks of language models fall out of it once you see it clearly.
Why it matters now
Most of the things engineers run into when they ship LLM features stop being mysterious once you stop thinking of generation as a single decision:
- Why the same prompt gives different answers. Each token is sampled from a probability distribution. Different rolls, different completions.
- Why chain-of-thought prompting helps. The model’s “reasoning” is the trajectory of tokens it writes. Asking it to write the steps changes which token is most likely at each step, and that changes the destination.
- Why prefilling the assistant’s reply works. Once the model is conditioned on “Sure, here's the JSON: {”, the most likely next tokens are JSON, not refusals. You haven’t changed the model — you’ve put it on a different trajectory.
- Why the model “can’t take it back” mid-sentence. Tokens already generated are now part of the prompt. There is no rewind. The model is committed to finishing the thing it started.
If your mental model is “the model has an answer in mind and is typing it out,” the above are confusing. If your mental model is “the model is taking one step at a time, with no plan it has access to,” they’re inevitable.
The short answer
model decides ≈ forward pass over the whole context + a sampler picks one token + append it + repeat
There is no other step. Every “decision” you’re seeing is the cumulative effect of doing this hundreds or thousands of times in a row. The forward pass is the model. The sampler is not — it’s a small, cheap function that runs after, and changing it changes everything you see.
How it works
One step
Start with the prompt. It’s been chopped into tokens — small chunks of text, each with an integer ID. The model takes that whole sequence and runs it through the transformer: dozens of layers of attention and feed-forward computation. The output of all that work, at the very end, is a single vector of logits — one number per token in the model’s vocabulary. The vocabulary is typically 30k–200k tokens.
That vector of logits is what the model “thinks.” It is not an answer. It’s a guess, for this position, at how likely each possible next token is.
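To make the step concrete, here is what one forward pass looks like in code. This is a minimal sketch using the Hugging Face transformers library, with gpt2 standing in as a small open model; the prompt and model choice are illustrative assumptions, not anything specific to the models discussed above.

```python
# One forward pass: the whole prompt goes in, one logit vector comes out.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # gpt2: illustrative stand-in
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids)                 # dozens of layers of attention + FFN run here

logits = out.logits[0, -1]           # scores for the NEXT token only
print(logits.shape)                  # torch.Size([50257]): one number per vocab entry
```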
Pass the logits through softmax to convert them into a probability distribution. Now you have, e.g., “the” at 0.31, “a” at 0.22, “however” at 0.04, “octopus” at 0.0000003, all the way down through every entry in the vocabulary.
This is everything the model has to say about the next token. It is not a token. It is a distribution.
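In code, the conversion is a few lines. A self-contained sketch with a toy four-token vocabulary; the numbers are made up for illustration.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    # Subtract the max before exponentiating for numerical stability;
    # the resulting probabilities are identical.
    z = np.exp(logits - logits.max())
    return z / z.sum()

logits = np.array([2.1, 1.7, 0.2, -12.0])   # toy 4-token vocabulary
probs = softmax(logits)
print(probs)          # approx [0.55, 0.37, 0.08, 0.0000004]
print(probs.sum())    # 1.0: it is now a distribution
```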
The sampler
To get an actual token, something has to pick from that distribution. That’s sampling, and it lives outside the model — usually in the inference server. The simplest sampler is “pick the highest-probability token.” The most common in production is “draw randomly from the distribution after reshaping it” (see temperature and the related top-p / top-k knobs covered in why beam search died for LLMs).
The sampler is the only thing in the entire pipeline that turns the model’s smear of probabilities into a definite token. It’s small, it’s cheap, and the choice of sampler is part of why two providers running the same model can produce noticeably different outputs.
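Here is the difference between those two samplers as a sketch. The function names are mine; only temperature reshaping is shown, and top-p / top-k would be a few extra lines trimming the tail before the draw.

```python
import numpy as np

rng = np.random.default_rng()

def greedy(logits: np.ndarray) -> int:
    # "Pick the highest-probability token": deterministic, no randomness at all.
    return int(np.argmax(logits))

def sample(logits: np.ndarray, temperature: float = 1.0) -> int:
    # Reshape the distribution, then draw from it. temperature < 1 sharpens
    # it toward the top tokens; temperature > 1 flattens it.
    z = logits / temperature
    z = np.exp(z - z.max())
    probs = z / z.sum()
    return int(rng.choice(len(probs), p=probs))

logits = np.array([2.1, 1.7, 0.2, -12.0])
print(greedy(logits))         # always 0
print(sample(logits, 0.8))    # usually 0, sometimes 1 or 2
```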
Repeat
Once a token is sampled, append it to the sequence. The new sequence — your original prompt plus this one token — goes back through the network. A new forward pass. A new logit vector. A new distribution. A new sample. Append.
Do this until the model emits a special end-of-turn token, or until you hit a length cap.
That is the whole loop. Forward pass, sample, append, forward pass, sample, append. A 500-token answer is 500 of these. A 10,000-token reasoning trace is 10,000. (In practice, the KV cache reuses the attention work already done on prior tokens, so each new step is fast — but the logical loop is the same.)
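Put together, the loop fits in a screenful. A sketch assuming gpt2 again; it recomputes the full prefix on every step for clarity, where a real inference server would reuse the KV cache.

```python
# Forward pass, sample, append, repeat, until end-of-turn or a length cap.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The robot slowly opened the door and", return_tensors="pt").input_ids
for _ in range(40):                                   # length cap
    with torch.no_grad():
        logits = model(ids).logits[0, -1]             # fresh distribution, every step
    probs = torch.softmax(logits / 0.8, dim=-1)       # temperature 0.8
    next_id = torch.multinomial(probs, num_samples=1) # the sampler's one job
    if next_id.item() == tokenizer.eos_token_id:      # special end token
        break
    ids = torch.cat([ids, next_id.unsqueeze(0)], dim=-1)  # append: committed, no rewind

print(tokenizer.decode(ids[0]))
```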
What “deciding” actually is
A few consequences are worth pulling out, because they’re the parts that don’t feel right at first:
- The model has no internal “answer” that gets serialized into tokens. The tokens are the computation. Whatever shape the answer eventually has is a property of the trajectory the sampling rolled out, not of some hidden plan. There is no place inside the network where the full answer is sitting, waiting to be written.
- Each step’s distribution depends on every token chosen so far. This is how coherence happens at all. The forward pass at step 437 sees all 436 tokens already committed. It has no choice but to be consistent with them — the patterns the model learned during pretraining strongly weight tokens that “fit” the context.
- Earlier tokens are load-bearing. A token sampled at step 5 reshapes the distribution at step 6, which reshapes the distribution at step 7. Small perturbations early can fork the trajectory hard. This is why temperature 0 looks stable: the same maximum-likelihood path gets followed, so the small perturbations don’t fire (a short demo follows this list).
- There is no rewind. The model can’t un-sample a token. If a wrong fact gets emitted at token 12, by token 50 the model is still continuing from “wrong fact” and will often double down rather than retract — once the prefix exists, the next-token distribution is shaped by it, and locally coherent continuations dominate. This is one mechanical contributor to some hallucinations: not a lie, but a trajectory the sampler stepped onto and that the model is now finishing under continuation pressure.
- Chain-of-thought works because it moves the answer farther down the trajectory. Each intermediate step the model writes becomes part of the context for the next step. The final answer is now conditioned on its own derivation, which on average produces better answers than asking for the answer in one shot.
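The demo promised above: the same prompt run with two different seeds forks early and never reconverges, while greedy decoding walks the identical path every run. A sketch assuming gpt2 and the transformers generate API; the prompt and seeds are arbitrary.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
ids = tokenizer("Once upon a time", return_tensors="pt").input_ids
pad = tokenizer.eos_token_id   # silence the missing-pad-token warning

torch.manual_seed(0)
a = model.generate(ids, do_sample=True, max_new_tokens=25, pad_token_id=pad)
torch.manual_seed(1)
b = model.generate(ids, do_sample=True, max_new_tokens=25, pad_token_id=pad)
c = model.generate(ids, do_sample=False, max_new_tokens=25, pad_token_id=pad)

print(tokenizer.decode(a[0]))   # one trajectory
print(tokenizer.decode(b[0]))   # forks at an early token, ends somewhere else
print(tokenizer.decode(c[0]))   # greedy: identical on every run
```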
Where the metaphor breaks
A few honest caveats — places where “no plan at all” is too strong:
- The forward pass isn’t purely about the next token. The transformer’s internal representations seem to carry information about several positions ahead. Anthropic’s 2025 interpretability work on Claude 3.5 Haiku, for instance, found that when writing rhyming poetry the model appears to commit internally to the line’s end-word before generating the words leading up to it. So a better way to put it is: the model can have anticipations, but those anticipations only become externally committed text when a token is actually sampled. How general this kind of planning is, and at what horizons, is an active research question — I’m describing the shape, not citing a settled result.
- Some inference systems do work several tokens ahead. Speculative decoding drafts several tokens with a small model, then has the big model verify them in a single pass. This doesn’t change the underlying logic — the big model still has the final say at every position — but mechanically the loop is no longer strictly one-token-at-a-time (a simplified sketch follows this list).
- “Reasoning” models complicate the picture. Models trained with extra reinforcement learning on long chains of thought sample large internal scratchpads before producing the visible answer. The decision-making is still token-by-token, but the visible answer is now a function of a much longer hidden trajectory the user never sees.
- There is no clean theorem for why this loop produces good answers. Empirically it does — see why next-token prediction generalizes — and the intuition is that to predict text well, the model has to model what produced the text. But “good distribution at each step + sampling” being enough for coherent multi-paragraph answers is a thing that works, not a thing that’s been derived from first principles.
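For the speculative-decoding caveat, here is the shape of the loop in its simplest form: a greedy-only variant where distilgpt2 drafts for gpt2, both being assumptions made for the sketch. Real implementations use a probabilistic accept/reject rule that preserves the target model’s sampling distribution; this only shows why the big model still has the final say.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
draft = AutoModelForCausalLM.from_pretrained("distilgpt2")  # small, fast drafter
target = AutoModelForCausalLM.from_pretrained("gpt2")       # has the final say

ids = tokenizer("The weather today is", return_tensors="pt").input_ids
K = 4                                        # draft length per round
for _ in range(8):                           # a few speculation rounds
    # 1) Small model drafts K tokens greedily, one at a time.
    proposal = ids
    for _ in range(K):
        with torch.no_grad():
            nxt = draft(proposal).logits[0, -1].argmax().view(1, 1)
        proposal = torch.cat([proposal, nxt], dim=-1)
    # 2) Big model scores every drafted position in a single forward pass.
    with torch.no_grad():
        tlogits = target(proposal).logits[0]
    # 3) Keep drafted tokens only while they match the big model's own choice.
    accepted = ids
    for pos in range(ids.shape[1], proposal.shape[1]):
        choice = tlogits[pos - 1].argmax().view(1, 1)   # target's pick for this slot
        accepted = torch.cat([accepted, choice], dim=-1)
        if choice.item() != proposal[0, pos].item():
            break   # disagreement: keep the target's token, drop the rest
    ids = accepted

print(tokenizer.decode(ids[0]))   # identical to greedy decoding with gpt2 alone
```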
The headline still holds: there is no single decision, just many small ones; the model is not executing a plan it has access to, only sampling from a fresh distribution at each position; and the answer you read is the path the sampler walked, not a hidden truth being read out.
Famous related terms
- LLM — neural net + “predict the next token” objective at scale. The thing producing the distribution at every step.
- Logits — the raw vector of scores the model emits per vocabulary token. Pre-softmax, pre-sampling.
- Softmax — exp(logits) / sum(exp(logits)). Turns the raw scores into a probability distribution.
- Sampling / temperature — turn the distribution into a single token. The step that actually converts model output into text.
- Greedy decoding — always pick the highest-probability token. The simplest sampler; deterministic on paper, often boring in practice.
- Beam search — keep the top-k partial sequences alive at each step. Used to be the default decoder; lost to plain sampling for open-ended generation.
- Chain-of-thought — ask the model to write its steps before its answer. Works by changing the trajectory the sampler walks.
- Hallucination — a confidently wrong continuation. One mechanical contributor: a trajectory the sampler stepped onto early that the model then has to finish coherently.
Going deeper
- The Curious Case of Neural Text Degeneration (Holtzman et al., 2019) — the paper that showed plain argmax decoding on strong language models produces looping, degenerate text. The empirical case for “the model’s most-probable continuation isn’t the one you want.”
- Andrej Karpathy, Let’s build GPT (YouTube, 2023) — builds a tiny language model from scratch and shows the sampling loop running token-by-token. Much easier to internalize once you see the loop in code.
- Anthropic, On the Biology of a Large Language Model / “tracing thoughts” interpretability writeup (2025) — the source for the rhyme-planning result mentioned above. Worth reading as a concrete example of internal computation that doesn’t fit a strict “next-token-only” picture.
- Any modern LLM provider’s API reference — the parameters they expose (temperature, top_p, top_k, logit_bias, stop sequences) are all knobs on the sampler, not the model. Reading them with that frame in mind makes the docs make more sense.