Why floating-point addition isn't associative
Schoolroom math says (a + b) + c equals a + (b + c). On a real computer it doesn't, and that one fact ripples out into nondeterministic GPU reductions, irreproducible training runs, and LLM outputs that aren't bit-stable across hardware.
Why it exists
You re-run the same training script with the same seed, the same data, the same model, the same hardware. The loss curve at step 10,000 is almost the same as last time — but not quite. Off by 0.0003. Run it again: off by something different. Nothing in your code changed. Nothing in your config changed. The bits going into the GPU are identical. The bits coming out aren’t. This is not a bug in CUDA or in your framework. It’s the consequence of a single, deeply unintuitive fact about computer arithmetic: floating-point addition isn’t associative. (a + b) + c and a + (b + c) can give different answers, and on modern hardware they routinely do.
The reason is finite precision. A real number has, in principle, infinitely many digits. A floating-point number has a fixed number of bits — 32, 16, 8 — split between an exponent (how big) and a mantissa (how precisely). After every single arithmetic operation, the result is rounded back to fit. Round, round, round. Once you accept that, the non-associativity is forced: the rounding of the intermediate values you produce depends on the order you produce them in. Different order, different intermediates, different rounding errors, different final answer.
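A quick way to see that per-operation rounding in isolation (the values here are illustrative, not part of the counterexample below):
import numpy as np
np.float32(1.0) + np.float32(1e-8)   # → 1.0 (the addend is below half the FP32 gap near 1.0, about 1.2e-7)
np.float64(1.0) + np.float64(1e-8)   # → 1.00000001 (FP64 has enough mantissa bits to keep it)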
This is not a flaw the standards committee forgot to fix. IEEE 754 — the spec almost every CPU and GPU implements — is deliberately non-associative. The trade is intentional: fixed-width floats give you reproducible per-operation rounding in exchange for giving up the algebraic laws of the real numbers. You keep one and lose the other; you can’t have both at finite cost.
Why it matters now
This sounds like a textbook curiosity. It isn’t — it’s the load-bearing explanation for several things engineers hit constantly in 2026.
- GPU reductions associate differently from serial code, and many paths are nondeterministic. Summing a million numbers on a GPU isn’t done left-to-right; it’s done as a parallel tree of partial sums whose shape depends on tile sizes, the kernel the runtime picked for the current input shape, and — for paths that use atomics or unordered work queues — the actual scheduling. Different shape or different reduction path, different rounding errors, different final logits. (A fixed kernel on the same hardware can be bit-stable; the surprises tend to come from atomics, library version drift, or shape-dependent kernel dispatch.)
- Training runs aren’t bitwise reproducible. Same script, same seed, same hardware — the loss curve is close but not identical. Frameworks expose flags to force order-stable reductions, but they cost throughput, so the default is “fast and almost-deterministic” rather than “slower and bit-exact.”
- Temperature-zero LLM sampling drifts. Even with greedy decoding, the argmax is computed over logits that are themselves the output of a long chain of GPU reductions. A near-tie between two tokens can flip based on which other requests happened to share the inference batch. That’s the thread the why-temperature-zero-isnt-deterministic post pulls on.
- Lower precision amplifies it. In BF16 or FP8, the rounding step throws away more bits per operation, so the same reordering produces a larger drift. The non-associativity is the same; the gap between “two valid answers” is just wider.
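If you want to watch that gap widen yourself, here is a minimal sketch (my own, not from any framework; numpy has no BF16 or FP8, so FP16 stands in for “fewer mantissa bits”, and the exact numbers printed vary with the data):
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(50_000)

def ordered_sum(values, dtype):
    # Strict left-to-right accumulation, rounding to `dtype` after every add.
    total = dtype(0)
    for v in values:
        total = dtype(total + dtype(v))
    return total

for dtype in (np.float32, np.float16):
    forward = ordered_sum(data, dtype)
    backward = ordered_sum(data[::-1], dtype)
    # Same numbers, same dtype, opposite order; the FP16 gap is typically much wider.
    print(dtype.__name__, abs(float(forward) - float(backward)))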
The short answer
float non-associativity = finite mantissa + rounding after every op
Real numbers carry as many digits as they need. Floating-point numbers don’t — every result gets rounded to a fixed mantissa width. Rounding is where the information loss happens, and information loss isn’t order-invariant. Re-arrange the additions, you re-arrange which intermediate values get rounded, and that changes the final answer. The same equation, evaluated two valid ways, gives two slightly different floats.
How it works
Here is the cleanest counterexample. In FP32, with a = 1e20, b = -1e20, c = 1:
import numpy as np
a, b, c = np.float32(1e20), np.float32(-1e20), np.float32(1.0)
(a + b) + c # → 1.0
a + (b + c) # → 0.0
Same three numbers. Same operator. Different parenthesization. Different answer. (You can paste those four lines into a Python shell and reproduce the split.)
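More precision moves the threshold but doesn't rescue the algebra. Plain Python floats are FP64, and the same three inputs split the same way, because 1 is still below half a ULP at 1e20 (about 8e3 in FP64 versus about 4e12 in FP32):
(1e20 + -1e20) + 1.0   # → 1.0
1e20 + (-1e20 + 1.0)   # → 0.0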
Walk through (a + b) + c. First 1e20 + (-1e20) = 0 exactly — two equal-magnitude opposites cancel, no rounding needed. Then 0 + 1 = 1. The 1 survived because it never got near a giant number.
Walk through a + (b + c). First -1e20 + 1. The number -1e20 lives at a magnitude where the gap between consecutive FP32 values is enormous — much bigger than 1. (That gap has a name: ULP.) At 1e20, one ULP in FP32 is roughly 2^43 ≈ 9e12. Adding 1 to -1e20 produces a true result that lies between two representable FP32 values, and rounding picks the nearer one — which is -1e20 itself. The 1 was annihilated by the rounding step. Then 1e20 + (-1e20) = 0. The 1 is gone, never to return.
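You can check both numbers from that walkthrough directly; numpy's np.spacing reports the gap (one ULP) at a given value:
import numpy as np
np.spacing(np.float32(1e20))          # → ~8.8e12, the distance to the neighbouring FP32 value
np.float32(-1e20) + np.float32(1.0)   # → -1e20: the 1 sits below half that gap, so rounding drops it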
Re-association is what decided whether the small number survived. Pair it with its near-equal partner first and it lives. Pair it with a giant first and it’s silently rounded away.
The general shape: every floating-point operation is round(true_result). Re-association doesn’t change true_result, but it changes which intermediate gets rounded. If a small number meets a big number first, the small one is silently swallowed. If two big numbers meet first and cancel, the small one survives. The associative law is a property of the exact arithmetic underneath, not of the rounded arithmetic the hardware actually performs.
A few consequences fall out of this:
- Commutativity is not broken. a + b and b + a produce the same float in IEEE 754 (NaN payload edge cases aside). The rounding rule doesn’t care about the order of two operands. Re-grouping is the thing that breaks, not re-ordering the two operands of a single addition.
- Sum order matters in long reductions. Summing a million floats left-to-right, in pairs, or in a parallel tree gives three different answers. None of them is “wrong” — they’re all valid IEEE 754 computations of subtly different expressions. The “best” answer (closest to the exact sum) usually comes from a tree or from Kahan summation, not from naive left-to-right; see the sketch after this list.
- GPUs lean into this. A GPU matmul doesn’t sum left-to-right — it splits the dot product across thousands of threads and combines partial sums in a tree whose shape depends on tile choices. That’s the source of the “same matmul, different bits” problem. Frameworks like PyTorch expose deterministic-mode flags that force order-stable reductions, but at a real throughput cost. The exact source of nondeterminism in a given training run is hardware- and library-specific, and you generally have to read framework determinism docs and vendor numerical-precision guides to track it down case by case.
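To make the sum-order bullet concrete, here is a small sketch (sizes and data are illustrative, and the exact error magnitudes will differ from run to run and machine to machine):
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(100_000).astype(np.float32)
ref64 = np.sum(x, dtype=np.float64)   # FP64 reference, close to the exact sum of these FP32 inputs

def left_to_right(values):
    # Naive serial reduction: round after every single add.
    total = np.float32(0.0)
    for v in values:
        total = np.float32(total + v)
    return total

def kahan(values):
    # Compensated summation: `c` carries the low-order bits that rounding dropped.
    total = np.float32(0.0)
    c = np.float32(0.0)
    for v in values:
        y = np.float32(v - c)
        t = np.float32(total + y)
        c = np.float32((t - total) - y)
        total = t
    return total

results = {
    "left-to-right": left_to_right(x),
    "pairwise": np.sum(x),   # np.sum typically reduces a contiguous array in a pairwise (tree-like) order
    "kahan": kahan(x),
}
for name, value in results.items():
    print(name, abs(float(value) - float(ref64)))
On this kind of data the left-to-right error is usually the largest and the compensated sum's the smallest, but the exact gaps depend on the data, the precision, and the library version.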
The honest seam: I’m describing the mechanism — finite mantissa plus per-op rounding plus reordering. The empirical question of how much this matters for a given workload (training a 70B model? doing a 3×3 matmul?) depends on numerical conditioning, precision, and reduction strategy in ways that don’t compress to a single rule. For a deeper treatment, the canonical reference is Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic (1991).
Famous related terms
- ULP — the gap between adjacent floats at a given magnitude. Under round-to-nearest, an addend below half a ULP rounds away entirely; it’s the reason 1e20 + 1 swallows the 1.
- Kahan summation — a running total plus a compensation term that recovers lost bits. A clever trick that recovers most of the precision a naive long sum throws away. Costs a few extra ops per element.
- Denormals (subnormals) — floats below the smallest normal value, encoded with reduced precision. The format’s escape hatch for “almost zero” — slow on many CPUs, and often flushed to zero in fast-math GPU paths.
- TF32 — FP32 inputs rounded to TF32 precision inside tensor cores, with FP32 accumulation. Trades a few mantissa bits for big throughput; another place rounding order shows up.
- BF16 vs FP16 — different ways to spend a 16-bit budget; the format you pick changes how much the non-associativity hurts.
- Why temperature 0 isn’t deterministic — the most visible downstream consequence.
Going deeper
- David Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic, ACM Computing Surveys 23(1), 1991. Thirty-plus years old and still the canonical reference. If you only read one thing, read this.
- IEEE 754-2019 — the standard itself. Dense, but the rounding rules that make non-associativity unavoidable are stated here, not folklore.
- Nicholas Higham, Accuracy and Stability of Numerical Algorithms — the textbook treatment of error analysis. Heavy reading; useful when “just be careful” stops being good enough.