Why isn't temperature 0 actually deterministic?
You set temperature to 0, send the same prompt twice, get two different answers. The math says argmax is a function. The hardware disagrees.
Why it exists
Every engineer who has tried to write a regression test against a hosted LLM hits this wall.
You set `temperature: 0`. You re-send the exact same prompt. You expect the exact same output, because the docs say temperature 0 means “always pick the most likely token” and that’s a function — same input, same output, end of story. Then the second response comes back slightly different. Sometimes a word, sometimes a whole reordered paragraph. Run it ten times and you’ll see two or three variants.
This is not a bug in your code. It’s not the model being “creative.” The sampling step really is argmax. The reason the output drifts has nothing to do with the model’s distribution and everything to do with what’s happening to the numbers underneath: floating-point arithmetic on a GPU, running in a kernel whose behavior depends on what other requests happen to be in the same batch as yours.
It’s worth understanding because the determinism story breaks in a place people don’t look. If you assume “temperature 0 = reproducible,” you’ll write tests that flake, caches that miss, and evals that drift between runs of the same model on the same hardware. The model isn’t lying to you. The abstraction is.
Why it matters now
Determinism is load-bearing for a lot of what people are trying to build on top of LLMs in 2026:
- Evals and benchmarks. “We re-ran the same eval and the score moved by half a point” is constant noise in the model-comparison world. Some of that is genuine model variance. A surprising amount is just nondeterministic inference on the same model.
- Caching by output. If you wanted to cache “this prompt → this response” to skip a model call, you can’t, because the response isn’t stable. (Caching by prompt still works, which is what providers actually ship.)
- Agent debugging. When a coding agent does the wrong thing, the first thing you want is to replay the run. If the same prompt at T=0 gives a different tool call, you can’t isolate whether the bug was in the model, the harness, or the world it was acting on.
- Scientific reproducibility. Papers that report a benchmark number without specifying batch size, hardware, kernel library, and concurrency conditions are reporting a number with hidden error bars.
Recent engineering work (see Going deeper) has shown that the user-visible piece of this problem is mostly tractable if you make inference kernels batch-invariant. But the default for hosted APIs and most open-source serving stacks is still “nearly deterministic, not bit-exact,” and unless you know your stack has done that work, that’s the right mental model.
The short answer
nondeterminism at T=0 = floating-point non-associativity + batch-dependent GPU kernels
Argmax over the logits is deterministic. The logits themselves aren’t bit-identical across runs, because the matrix multiplications that produce them sum thousands of floating-point numbers in a different order each time — and on top of that, the kernels chosen depend on the shape of the batch your request happens to land in. Different order of additions, different rounding, occasionally different argmax.
How it works
Three independent reasons stack on top of each other. Any one of them is enough to break bit-exact reproducibility; together they make it almost impossible by default.
1. Floating-point addition isn’t associative
In real arithmetic, (a + b) + c = a + (b + c). In floating-point arithmetic, those two expressions can give different answers, because each intermediate sum is rounded to fit in 16 or 32 bits. Sum a thousand near-zero numbers in one order, get one result. Sum them in a different order, get a result that differs in the last few bits.
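A minimal way to see this in plain Python (double precision, no GPU required): summing the same values in different orders will frequently disagree in the last few bits.

```python
import random

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]

forward = sum(xs)                  # left-to-right accumulation
backward = sum(reversed(xs))       # same numbers, opposite order
chunked = sum(sum(xs[i:i + 1000]) for i in range(0, len(xs), 1000))  # tree-like grouping

print(forward == backward)         # often False
print(forward - backward)          # a tiny, nonzero residue
print(forward - chunked)           # different grouping, different rounding
```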
A single attention layer’s logits come from summing thousands of products. The final logits — the ones argmax runs on — are the result of many such reductions stacked through dozens of layers. Tiny rounding differences anywhere in that chain can flip a near-tie at the top of the final distribution.
Most of the time, the top token is far enough ahead that this doesn’t matter. Occasionally, two tokens are within a hair’s breadth — say, “the” at logit 8.4012 and “a” at 8.4007 — and a different summation order tips which one wins. That single token then changes the rest of the generation, because the model conditions on its own outputs.
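To make the near-tie concrete, here is a toy illustration with invented logit values (not taken from any real model): a rounding-noise-sized nudge is enough to flip which token greedy decoding picks.

```python
# Toy near-tie: two candidate tokens whose logits differ by 5e-4.
logits_run_1 = {"the": 8.4012, "a": 8.4007}

# A different summation order perturbs each logit by rounding noise of roughly this size.
logits_run_2 = {"the": 8.4012 - 0.0007, "a": 8.4007 + 0.0002}

pick = lambda logits: max(logits, key=logits.get)   # greedy decoding = argmax
print(pick(logits_run_1), pick(logits_run_2))       # 'the' then 'a': one flipped token, two different continuations
```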
2. GPU kernels can reduce in input-shape-dependent order
You could in principle write a matmul that always sums in the same order for a given input shape. Real high-performance GPU kernels do something weaker: they’re often deterministic given a fixed input shape, but the reduction strategy they pick can vary with shape. Sources of variability:
- Auto-tuned kernel selection. Libraries like cuBLAS and cuDNN pick from multiple kernel implementations at runtime based on tensor shapes; two different shapes can pick two different algorithms with different tiling and reduction trees.
- Parallel reduction trees. A sum of N numbers is split across many threads; the partial-sum tree’s shape depends on tile and split choices, which depend on shape.
- Atomic adds in some routines. Certain GPU operations accumulate partial results with atomic adds, which complete in whatever order the scheduler runs the threads. NVIDIA documents this for specific routines. Worth noting: for typical LLM forward passes, atomics aren’t usually the operative cause of run-to-run variation — the Thinking Machines piece linked below is pretty firm on this point.
Every modern deep-learning framework has a “deterministic mode” flag that forces order-stable kernels. Turning it on costs throughput. Hosted inference providers, optimizing for tokens/sec/dollar, don’t turn it on by default.
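In PyTorch, for example, the opt-in looks roughly like this (a sketch using the real flags; the reproducibility docs linked under Going deeper list exactly which ops are affected):

```python
import os
import torch

# cuBLAS needs a fixed workspace to give reproducible matmuls on CUDA >= 10.2;
# this must be set before the first CUDA call.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.use_deterministic_algorithms(True)   # error out if an op has no deterministic kernel
torch.backends.cudnn.benchmark = False     # stop shape-based autotuning from swapping algorithms
torch.backends.cudnn.deterministic = True  # force cuDNN's deterministic kernels
```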
3. Your batch isn’t your batch
This is the one most engineers don’t see coming, and the recent Thinking Machines piece (below) argues it’s the dominant cause of T=0 drift on hosted APIs.
Inference servers don’t run one request at a time. They batch many concurrent requests together so the GPU isn’t sitting idle. When your request arrives, it gets stitched into a batch with whoever else is on the server right now: a 200-token prompt next to a 10,000-token prompt, padded or packed together, processed in a single forward pass.
The mechanism isn’t that other users’ values leak into yours — they don’t. It’s that the shape of the combined tensor changes which kernel implementation gets picked, which tiling it uses, and therefore the order in which your own values get summed. The arithmetic on your numbers changes because they’re sitting in a differently-shaped tensor.
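You can reproduce the shape effect without a serving stack at all. The sketch below assumes a CUDA GPU and half precision (whether the bits actually differ depends on your GPU and library versions); it computes the same row’s logits alone and inside a larger batch:

```python
import torch

torch.manual_seed(0)
device = "cuda"
W = torch.randn(4096, 4096, dtype=torch.float16, device=device)      # stand-in weight matrix
x = torch.randn(1, 4096, dtype=torch.float16, device=device)         # "your" request
others = torch.randn(63, 4096, dtype=torch.float16, device=device)   # everyone else on the server

alone = x @ W                          # batch of 1
batched = torch.cat([x, others]) @ W   # same row, batch of 64
print(torch.equal(alone[0], batched[0]))  # may be False: same inputs, different kernel/tiling, different rounding
```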
For mixture-of-experts models, there’s an additional path: in capacity-limited MoE implementations, expert capacity is allocated per batch, and two requests competing for the same expert can cause one to overflow and get routed differently than it would alone. Whether this actually happens depends on the specific MoE serving policy.
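For the MoE path, here is a toy capacity-limited router (one possible policy, not any particular model’s implementation) showing how the same token can land on a different expert depending on who else is in the batch:

```python
import torch

def route_top1(scores: torch.Tensor, capacity: int) -> list[int]:
    """Toy capacity-limited top-1 routing: each token goes to its best-scoring
    expert unless that expert is already full, then falls back to the next one."""
    n_experts = scores.shape[1]
    load = [0] * n_experts
    assignment = []
    for token_scores in scores:
        for e in token_scores.argsort(descending=True).tolist():
            if load[e] < capacity:
                assignment.append(e)
                load[e] += 1
                break
    return assignment

my_token = torch.tensor([[2.0, 1.0]])       # prefers expert 0
others = torch.tensor([[3.0, 0.1]] * 4)     # the rest of the batch also prefers expert 0

alone = route_top1(my_token, capacity=4)
busy = route_top1(torch.cat([others, my_token]), capacity=4)
print(alone[0], busy[-1])  # 0 when alone, 1 when expert 0 is full; the downstream computation differs
```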
So you can send the same prompt twice in a row, and the only thing that changed between call 1 and call 2 is which other users were on the server. That’s enough to change kernel selection and occasionally flip the argmax.
This is why local inference (one model, one request, fixed batch shape, fixed hardware, deterministic kernels enabled) can be made bit-reproducible, and hosted APIs typically aren’t — not without engineering work that gives up some of the batching flexibility that makes them affordable.
Where it gets subtle
- It’s usually small. Most prompts produce the same first 50 tokens across runs, then diverge once a near-tie hits. The output is qualitatively the same; the bytes aren’t.
- Greedy decoding hides it less than you’d think. People assume that setting `temperature: 0` and calling it “deterministic mode” closes the question. It only closes the sampling question. Everything upstream of the sampler is still nondeterministic.
- Different precisions amplify it differently. Models served in bfloat16 or fp8 have less headroom against rounding noise than fp32. Quantized serving can make near-ties flip more often.
- “Seeded” APIs don’t solve it. Some providers offer a `seed` parameter, and the docs themselves usually say it’s best-effort and not guaranteed reproducible. The seed pins the sampler’s randomness — useful at T > 0 — but the upstream kernel and batch-shape variation is still there. Same seed, same prompt, T=0 on a busy server: still drifts.
- It can be defeated, with effort. It is possible to build a fully deterministic inference stack — pinned kernels, fixed batch shapes, invariant reductions across batch sizes. The cost is throughput and engineering work. Recent research has shown that most of the nondeterminism people attribute to GPUs is really about batch-invariance of kernels, and that fixing that makes inference reproducible at a real but bounded throughput cost.
The one-sentence version: argmax is a function, but the inputs to argmax are the output of a long chain of floating-point reductions whose order depends on the shape of the batch your request lands in, so the function gets a slightly different input each time.
Famous related terms
- Temperature — `temperature` = a knob that flattens or sharpens the logit distribution before sampling. The thing people think controls determinism.
- Greedy decoding — `greedy` = argmax of the logits at every step. Deterministic on identical logits; the logits are the problem.
- KV cache — `KV cache` = stored keys and values for past tokens, reused on each step. Cache layout and paging strategy can subtly change reduction order too.
- Batch invariance — `batch invariance` = a kernel returns bit-identical results regardless of how its inputs are batched. The property that, if enforced everywhere, would make hosted inference reproducible.
- Floating-point non-associativity — `(a + b) + c ≠ a + (b + c)` in finite precision. The root cause sitting under everything else.
- Seeded sampling — `seed` = pin the sampler's RNG. Fixes nondeterminism at T > 0; does nothing at T = 0.
- LLM — the thing whose logits we’re arguing about.
Going deeper
- Defeating Nondeterminism in LLM Inference — Horace He and collaborators at Thinking Machines Lab, September 2025. An engineering note (not a peer-reviewed paper) arguing that the user-visible source of T=0 drift on hosted LLMs is specifically the batch-invariance of inference kernels — i.e. that kernels return different results for the same logical input depending on how they’re batched — rather than generic GPU scheduling noise. Sharper than this post on the actual mechanism; worth reading.
- PyTorch’s reproducibility docs — concrete on which operations are nondeterministic on GPU and what `torch.use_deterministic_algorithms(True)` actually changes.
- What Every Computer Scientist Should Know About Floating-Point Arithmetic (David Goldberg, ACM Computing Surveys 23(1), March 1991) — the canonical reference for why the arithmetic doesn’t behave the way the math does. Thirty-five years old and still the right starting point.
Confidence note: I’m confident about the underlying mechanisms — floating-point non-associativity, shape-dependent kernel selection, and batch composition affecting that shape. The Thinking Machines piece in Going deeper argues batch-invariance of kernels is the dominant user-visible cause on hosted LLMs; I find that argument persuasive but haven’t independently measured it across providers, and provider internals (kernel libraries, batching policy, precision, MoE routing) aren’t public. Treat the relative weights of the three mechanisms as “all three are real and stack,” not as a settled ranking for any specific stack.