Why does temperature exist as a knob?
If the model knows the right answer, why is there a dial that asks it to be wrong on purpose?
Why it exists
The first time most engineers hit an LLM API, temperature looks like a
mistake. There’s a parameter, usually a float between 0 and 2, that the docs
describe with words like “creativity” or “randomness.” Cranking it up makes
the output weirder. Cranking it down makes the output more stable. Neither
description tells you what it actually is, and the framing is misleading
in a specific way: it suggests the model has a “right answer” and you’re
deciding how far to wander from it.
That’s not what’s happening.
A language model doesn’t pick a next token. It produces, on every step, a
full probability distribution over its entire vocabulary — tens or hundreds
of thousands of numbers, each one a guess at how likely that token is to
come next. “the” might get 0.31, “a” might get 0.22, “octopus” might get
0.0000003. The model’s output is the distribution, not a word.
To get text out of that, somebody has to actually pick. That picking step is called sampling, and it’s not part of the model — it’s a separate stage that runs after the forward pass. Temperature is a knob on the sampler, not on the model.
The reason the knob exists is that “always pick the most likely token” — the obvious-seeming default — turns out to produce worse text than picking probabilistically. It loops. It collapses into the same safe phrases. It stops surprising itself, which means it stops surprising you. So real samplers reach for the distribution and reshape it before sampling, and temperature is the simplest control on that reshaping.
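To make that split concrete, here is a minimal sketch in Python with NumPy. The three-token vocabulary and the logits are invented for illustration; the point is only that the model’s job ends at the distribution and the sampler’s job begins there.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend output of one forward pass: one raw score (logit) per vocabulary entry.
# Real vocabularies have tens or hundreds of thousands of entries; three is enough here.
vocab = ["the", "a", "octopus"]
logits = np.array([2.0, 1.0, -12.0])

# The model's job ends here: scores in, probability distribution out (softmax).
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# The sampler's job starts here: actually pick a token from that distribution.
greedy_pick = vocab[int(np.argmax(probs))]              # "always take the top token"
sampled_pick = vocab[rng.choice(len(vocab), p=probs)]   # pick probabilistically

print(dict(zip(vocab, probs.round(6))), greedy_pick, sampled_pick)
```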
Why it matters now
Temperature shows up everywhere a model is generating text, which is now everywhere:
- Determinism vs. variety in agents. A coding agent that re-runs the same prompt and gets a different patch every time is hard to debug. Set temperature to 0 and (mostly) the same prompt yields the same output. That’s not a personality choice; it’s an engineering one.
- Evals are temperature-sensitive. A benchmark score reported “at temperature 0.7” and one at “temperature 0” can differ noticeably for the same model. If a leaderboard doesn’t tell you, you’re comparing apples to oranges.
- Creative tasks need spread. Brainstorming, fiction, marketing copy — all collapse into mush at temperature 0 because the model just keeps reaching for the single most-likely continuation. You want the distribution to actually be a distribution.
- “Hallucination” interacts with temperature, but isn’t caused by it. Higher temperature widens the set of tokens the model is willing to emit, which can let through factually wrong ones — but a confident wrong answer at temperature 0 is just as much a hallucination. Temperature isn’t a truthfulness knob.
The short answer
temperature = a number that flattens or sharpens the model's probability distribution before a token is sampled from it
At temperature 1, you sample from the model’s distribution as-is. Below 1, you make the peaks taller and the valleys deeper — the most-likely tokens get even more likely, and the long tail gets crushed. Above 1, you flatten the distribution — unlikely tokens get a real shot. Temperature 0 is a limit case: pick the single most-likely token, every time, no randomness left.
How it works
The mechanism is a one-line change inside the softmax that converts the model’s raw output scores (“logits”) into probabilities.
Vanilla softmax over logits z:
p_i = exp(z_i) / sum_j exp(z_j)
With temperature T, you divide every logit by T first:
p_i = exp(z_i / T) / sum_j exp(z_j / T)
That’s the whole intervention. The model’s logits don’t change. The distribution you sample from does.
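In code, the change is just as small. A sketch in Python with NumPy (the function name is mine; subtracting the max logit is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits into a probability distribution, reshaped by temperature."""
    z = np.asarray(logits, dtype=float) / temperature  # the whole intervention
    z = z - z.max()                                    # stability shift; cancels out in the ratio
    e = np.exp(z)
    return e / e.sum()
```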
Walk through what T does:
- T = 1 — divide-by-1 is a no-op. You sample from the model’s native distribution.
- T < 1 (e.g. 0.2) — dividing by a small number magnifies the gaps between logits before they get exponentiated. The biggest logit pulls even further ahead. The distribution becomes spiky. Sampling from a spiky distribution almost always gives you the top token.
- T → 0 — the limit of the above. The single highest-logit token has probability 1; everything else is 0. This is mathematically argmax, often called “greedy decoding.”
- T > 1 (e.g. 1.5) — dividing by a number bigger than 1 shrinks the gaps between logits. The distribution flattens toward uniform. Rare tokens become plausible. At very high T, output approaches gibberish because nearly every token is roughly equally likely.
A worked example with three candidate tokens and logits [2.0, 1.0, 0.0]:
T = 1.0: probs ≈ [0.66, 0.24, 0.09] moderate preference for the top
T = 0.5: probs ≈ [0.87, 0.12, 0.02] strong preference for the top
T = 0.1: probs ≈ [1.00, 0.00, 0.00] effectively greedy
T = 2.0: probs ≈ [0.51, 0.31, 0.19] much closer to uniform
Same model, same logits — four different sampling regimes.
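Those rows are easy to reproduce; the printed values match the table above up to rounding.

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.0])

for T in (1.0, 0.5, 0.1, 2.0):
    z = logits / T              # the temperature division
    probs = np.exp(z - z.max()) # softmax, with the usual stability shift
    probs /= probs.sum()
    print(f"T = {T:>3}: probs ≈ {np.round(probs, 2)}")
```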
A few things that trip people up:
- Temperature 0 is not always actually deterministic. On the math, yes: argmax is a function. In practice, hosted APIs run on GPUs in batched, parallel kernels where floating-point summations are not order-stable, and ties (or near-ties) in the top logits resolve differently across runs. So “temperature 0” usually means “nearly deterministic” in production. The exact reasons are well-documented in the CUDA / GPU numerics literature; the user-visible takeaway is “don’t bet correctness on bit-exact reproducibility.”
- Temperature is not the only sampler knob. Most APIs also expose top-p (nucleus) and top-k. These chop off the tail of the distribution before sampling. Temperature reshapes the distribution; top-p/top-k truncate it. They compose, and the order matters. Different providers apply them in different orders, and the docs don’t always say which. (A sketch of one possible ordering follows this list.)
- Different APIs use different scales. OpenAI’s chat API accepts temperature in roughly [0, 2]. Anthropic’s Claude API accepts roughly [0, 1]. The same numeric value (0.7) is not the same intervention across providers, because their underlying logit distributions and any internal renormalization differ. Calibrate per model.
- Temperature does nothing during training. It’s purely an inference-time sampling parameter. The model is the same model whether you set temperature to 0 or 2; only the picking step changes.
- It’s not really “creativity.” That word makes it sound like a deeper cognitive setting. It isn’t. It’s a slider on how much you trust the model’s top guess versus how much you let the long tail in.
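Here is the sketch promised above: one possible ordering (temperature first, then top-k, then top-p), in Python with NumPy. The helper sample_next_token is illustrative, not a real library call, and it is not a reconstruction of any specific provider’s sampler; as noted, providers differ and don’t always document the order.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Illustrative sampler: temperature, then top-k, then top-p (nucleus).

    Real providers may apply these knobs in a different order.
    """
    rng = rng if rng is not None else np.random.default_rng()
    z = np.asarray(logits, dtype=float)

    # 1. Temperature reshapes the distribution (T -> 0 approaches greedy decoding).
    z = z / max(temperature, 1e-8)
    probs = np.exp(z - z.max())
    probs /= probs.sum()

    # 2. Top-k truncates: keep only the k most likely tokens.
    if top_k is not None:
        cutoff = np.sort(probs)[-min(top_k, len(probs))]
        probs = np.where(probs >= cutoff, probs, 0.0)

    # 3. Top-p truncates: keep the smallest set of tokens whose mass reaches p.
    if top_p is not None:
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        n_keep = int(np.searchsorted(cumulative, top_p)) + 1
        mask = np.zeros_like(probs)
        mask[order[:n_keep]] = probs[order[:n_keep]]
        probs = mask

    probs /= probs.sum()  # renormalize whatever survived the truncation
    return int(rng.choice(len(probs), p=probs))

# Same logits, three regimes: near-greedy, native, flattened.
logits = [2.0, 1.0, 0.0, -1.0]
for T in (0.1, 1.0, 1.5):
    picks = [sample_next_token(logits, temperature=T, top_p=0.9) for _ in range(10)]
    print(T, picks)
```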
The deep reason a knob like this has to exist is that there is no single correct distribution to sample from for every task. “Translate this sentence” wants a sharp distribution that always finds the same right answer. “Brainstorm names for a pet snail” wants a flat one that lets genuinely different ideas through. The model produces logits; the caller decides which task they’re doing. Temperature is the cheapest control we have for moving between those modes.
Famous related terms
- Softmax — softmax = exp(logits) / sum(exp(logits)) — the function temperature is parameterizing. Turns arbitrary scores into a probability distribution.
- Greedy decoding — greedy = argmax of the logits at every step. The T → 0 limit. Simple, deterministic, often boring.
- Top-k sampling — top-k = keep the k highest-probability tokens, sample from those. A truncation, not a reshape.
- Top-p / nucleus sampling — top-p = keep tokens whose probabilities sum to ≥ p, sample from those. Adapts how many tokens are in play based on how peaked the distribution already is. Often combined with temperature.
- Logit bias — logit bias = additive nudge to a specific token’s logit before sampling — the “I want to forbid this token” or “I want to encourage that one” knob. (A toy sketch follows this list.)
- LLM — LLM = transformer + next-token objective at scale — the thing producing the logits in the first place.
- Hallucination — hallucination = confidently wrong output — interacts with temperature but is not caused by it.
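Of those, logit bias is the simplest to picture in code: add a constant to a specific token’s raw score before the softmax. A toy illustration with made-up logits, not any particular provider’s API:

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.0])     # three candidate tokens
bias   = np.array([0.0, 0.0, -100.0])  # a huge negative bias effectively forbids token 2

z = logits + bias
probs = np.exp(z - z.max())
probs /= probs.sum()
print(probs.round(3))  # token 2's probability is driven to ~0
```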
Going deeper
- The original “temperature” framing comes from statistical physics, where the same exp(−E / T) form (the Boltzmann distribution) describes how thermal energy spreads particles over states. High T means a more uniform spread; low T means everything settles into the ground state. The borrowing into ML is direct, not metaphorical.
- The Curious Case of Neural Text Degeneration (Holtzman et al., 2019) — the paper that put nucleus sampling on the map and articulated why pure-likelihood decoding produces dull, repetitive text. Worth reading for the empirical case that you don’t actually want the most-likely continuation.
- Any modern LLM provider’s API reference, side-by-side: compare how OpenAI, Anthropic, and Google describe temperature, top_p, and top_k. The differences in what each one defaults to and how they interact tell you most of what you need to know about why your prompt behaves differently across providers.
What I’m confident about: the math (logit / T inside softmax), the qualitative behavior across the T range, and the fact that “temperature 0” on hosted APIs is near-deterministic but not bit-exact. What I’m less confident about: the exact order in which any specific provider composes temperature with top-p / top-k, and any internal logit renormalization they do before exposing the knob. Those details change across model versions and aren’t always documented; check the current API reference rather than trust a blog post.