Why bf16 won the training format wars
Half-precision floats came in two flavors: fp16, which had been around for years, and bf16, which kept fp32's exponent and threw away mantissa bits. The less-precise format won. Here's why that's not a typo.
Why it exists
Imagine you can only fit a 4-digit number on a label. You have two choices for what those four digits mean. Option A: cover everything from 0.0001 up to 9999, but only with rough resolution (you can tell 12.34 from 12.35, but not from 12.345). Option B: cover only 0.001 up to 999, but with sharper resolution at every value. Both are “4 digits,” but they’re aimed at very different worlds. That’s the choice every 16-bit float format is making — how to spend a fixed bit budget between how big a number you can write down and how precisely you can write it. fp16 picked precision. bf16 picked range. For training neural networks, range turned out to matter much more, and that’s why bf16 won.
Here’s the puzzle. You’re training a neural network. You’d like to use 16-bit floats instead of 32-bit floats — half the memory, half the bandwidth, often more than double the throughput on modern matrix engines. There are two candidates sitting on the shelf:
- fp16, standardized in IEEE 754, has been in graphics hardware for over a decade. 5 exponent bits, 10 mantissa bits.
- bf16, invented at Google for the TPU and later adopted by everyone else. 8 exponent bits, 7 mantissa bits.
Read those numbers carefully. bf16 has fewer mantissa bits than fp16 — 7 vs. 10. It’s the less precise of the two. And yet bf16 is now the default mixed-precision format on every major training stack I’ve seen public recipes for, and fp16 is treated as the format you have to work around.
That should read as backwards, because the entire historical case for floating point was “trade range for precision smartly, but get plenty of both.” bf16 deliberately throws precision away to keep range. Why was that the right trade?
The short version: training a neural network is a lot less sensitive to how precisely you can represent a number than to how small or large a number you can represent at all. fp16’s narrower exponent meant gradients underflowed to zero, training diverged, and engineers spent years inventing workarounds (loss scaling, mixed precision recipes) to paper over the problem. bf16 just gives the gradients somewhere to live, accepts that the last couple of bits of the mantissa are noise anyway, and gets out of the way.
Why it matters now
If you train models, this is one of the few hardware decisions that’s still visible in your code. PyTorch’s torch.bfloat16 and torch.float16 are not interchangeable, and picking the wrong one can be the difference between a stable run and one where the loss goes to NaN at hour 14.
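You can see the gap with nothing but the standard library: Python's `struct` module speaks IEEE fp16 natively (format `'e'`), and bf16 can be simulated by truncating an fp32 bit pattern. A sketch (truncation for simplicity; real hardware rounds to nearest even):

```python
import struct

def roundtrip_fp16(x: float) -> float:
    # struct's 'e' format is IEEE binary16 (fp16)
    return struct.unpack("<e", struct.pack("<e", x))[0]

def roundtrip_bf16(x: float) -> float:
    # simulate bf16 by keeping only the top 16 bits of the fp32 pattern
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

grad = 1e-8  # a plausible tail gradient in a deep network

print(roundtrip_fp16(grad))  # 0.0 -- underflows out of fp16's range entirely
print(roundtrip_bf16(grad))  # ~1e-8 -- survives, with <1% rounding error
```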
It also explains why specific accelerators became popular when they did. Google developed bfloat16 for Cloud TPU v2 and v3; Cloud TPU became publicly available in beta in February 2018. NVIDIA’s tensor cores gained native bf16 with the A100 (Ampere, 2020). After that point, “use mixed precision” stopped meaning “fp16 with a bag of tricks” and started meaning “bf16, which Just Works.” Frameworks shifted defaults, and by the H100 generation (2022) bf16 was the de facto choice in most published foundation-model training recipes.
It matters even outside training. Inference is moving toward fp8 and below (fp8 in H100, fp4 announced for Blackwell), and the design decisions there inherit bf16’s lesson: when you have to throw bits away, throw away precision before you throw away range.
The short answer
bf16 = sign + 8-bit exponent (same as fp32) + 7-bit mantissa
bf16 is essentially fp32 with the bottom 16 bits of the mantissa lopped off (plus a rounding rule, and in practice TPUs flush denormals to zero). It has the same exponent range as fp32 — the same ability to represent very small gradients and very large activations — but only enough mantissa precision for ~3 decimal digits. That’s plenty for gradient descent, which spends its life adding noisy small updates to noisy current weights. It is not plenty for, say, solving a stiff ODE — but neural network training isn’t that.
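A minimal sketch of the "lopped off" claim (truncating here for simplicity; hardware bf16 conversion rounds to nearest even, which halves the worst-case error):

```python
import struct

def bf16(x: float) -> float:
    # fp32 with the low 16 bits zeroed = sign + 8-bit exponent + 7-bit mantissa
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

x = 3.14159265
print(bf16(x))               # 3.140625: about 3 decimal digits survive
print(abs(bf16(x) - x) / x)  # relative error, always below 2**-7 (~0.8%)
```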
How it works
A floating-point number is sign × mantissa × 2^exponent. Three parameters.
You get a fixed bit budget; you split it between mantissa and exponent.
fp32: 1 sign | 8 exponent | 23 mantissa (32 bits total)
fp16: 1 sign | 5 exponent | 10 mantissa (16 bits)
bf16: 1 sign | 8 exponent | 7 mantissa (16 bits)
The exponent sets the range: the smallest and largest numbers the format can express. The mantissa sets the precision: how finely you can subdivide the gap between consecutive representable numbers.
What the exponent buys you
With 5 exponent bits, fp16’s normal range runs roughly 6e-5 to 6e4. Below 6e-5, you fall off into subnormals or simply round to zero. With 8 exponent bits, bf16’s normal range runs from about 1e-38 to 3e38 — same as fp32.
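Those numbers aren't magic; they fall straight out of the bit split. A quick sketch for IEEE-style normal numbers (ignoring subnormals):

```python
def format_stats(e_bits: int, m_bits: int):
    """Normal range and machine epsilon implied by an exponent/mantissa split."""
    bias = 2 ** (e_bits - 1) - 1
    smallest_normal = 2.0 ** (1 - bias)
    largest = (2 - 2.0 ** -m_bits) * 2.0 ** bias  # top exponent code is inf/NaN
    epsilon = 2.0 ** -m_bits                      # gap from 1.0 to the next value
    return smallest_normal, largest, epsilon

for name, e, m in [("fp32", 8, 23), ("fp16", 5, 10), ("bf16", 8, 7)]:
    lo, hi, eps = format_stats(e, m)
    print(f"{name}: normals [{lo:.1e}, {hi:.1e}], epsilon {eps:.1e}")
```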
That gap matters because of how training works. The gradient of a deep network is a product of many small numbers — one per layer. Multiply 50 small numbers together and the result is very small. In fp16, those tail gradients silently round to zero, and any weight that depended on them stops learning. The standard fp16 fix is loss scaling: multiply the loss by some large constant (1024, 65536) so the resulting gradients land in fp16’s representable window, then divide back out before the optimizer step. It works, but you have to dynamically tune the scale — too high and you overflow to infinity, too low and you underflow. Frameworks shipped entire subsystems (torch.cuda.amp, NVIDIA’s APEX) just to manage it.
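The bookkeeping that recipe implies looks roughly like this (an illustrative sketch, not any framework's actual API; in PyTorch, torch.cuda.amp's GradScaler plays this role):

```python
class DynamicLossScaler:
    """Sketch of fp16 dynamic loss scaling. bf16 training needs none of this."""

    def __init__(self, scale: float = 2.0**16, growth_interval: int = 2000):
        self.scale = scale                  # loss is multiplied by this
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, grads_overflowed: bool) -> bool:
        """Call once per step; returns True if the optimizer may apply it."""
        if grads_overflowed:                # inf/NaN in grads: back off, skip step
            self.scale /= 2.0
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            self.scale *= 2.0               # stable for a while: probe higher
            self.good_steps = 0
        return True
```

The loss is multiplied by `scale` before the backward pass and the gradients divided by it before the optimizer step; every overflow throws the whole step away.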
bf16 has fp32’s exponent. Gradients that fit in fp32 fit in bf16. No loss scaling needed. Mixed-precision recipes in bf16 are dramatically simpler.
What the mantissa costs you
bf16’s 7 mantissa bits give you roughly 1-in-128 relative precision — about 0.8%. That sounds bad. The reason it’s fine for training is more interesting than the usual “neural networks are noise-tolerant” hand-wave.
Stochastic gradient descent is, fundamentally, an algorithm that takes noisy gradient estimates from a small batch and adds them to weights with a small learning rate. The noise from minibatch sampling typically dwarfs the quantization noise from low-precision math by orders of magnitude. Adding ~1% relative error to a number that already has ~10–100% noise in it from the batch you happened to sample doesn’t move the needle.
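A toy regression makes the scale difference concrete (illustrative numbers only; bf16 simulated by bit truncation):

```python
import random
import statistics
import struct

random.seed(0)

def bf16(x: float) -> float:
    # simulate bf16 storage by truncating the fp32 bit pattern
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# Toy problem: y = 2x + unit Gaussian noise; gradient of squared loss at w = 0.
data = [(x, 2 * x + random.gauss(0, 1))
        for x in (random.uniform(-1, 1) for _ in range(4096))]

def grad(batch):
    return statistics.mean(2 * (0 * x - y) * x for x, y in batch)

full = grad(data)  # near-true gradient from the whole dataset
mini = [grad(random.sample(data, 8)) for _ in range(200)]

minibatch_noise = statistics.stdev(mini) / abs(full)
quant_noise = abs(bf16(full) - full) / abs(full)
print(f"minibatch noise ~{minibatch_noise:.0%}, bf16 noise ~{quant_noise:.2%}")
```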
There’s a real cost — accumulators inside matrix multiplies still need fp32 to avoid losing precision in long sums, which is why tensor cores compute bf16 × bf16 → fp32 and only round back at the end. And the optimizer state (Adam moments) is usually kept in fp32 because it accumulates across millions of steps and small biases compound. But the weights and activations — the things that make up the bulk of memory and bandwidth — sit happily in bf16.
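The accumulator point is easy to demonstrate: round a running sum to bf16 after every add and it stalls once the addend falls below the accumulator's precision (bf16 simulated by truncation here; a round-to-nearest accumulator stalls slightly later but stalls all the same):

```python
import struct

def bf16(x: float) -> float:
    # simulate bf16 storage by truncating the fp32 bit pattern
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

term = bf16(1e-4)  # one small product out of a long dot product
n = 100_000        # exact sum would be ~10.0

acc_bf16 = 0.0
for _ in range(n):
    acc_bf16 = bf16(acc_bf16 + term)  # round after every add

acc_fp32_style = 0.0
for _ in range(n):
    acc_fp32_style += term            # accumulate in high precision...
acc_fp32_style = bf16(acc_fp32_style) # ...round once at the end

print(acc_bf16, acc_fp32_style)  # the bf16 accumulator stalls near 0.016
```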
Why fp16 came first anyway
fp16 wasn’t designed for ML. It was designed for graphics — pixel shaders, HDR color, normal maps — where the values you store are bounded (a color channel, a unit-length normal vector) and you care about visible precision across that bounded range. For graphics, 5 exponent bits is plenty and 10 mantissa bits is a real upgrade over 8-bit fixed-point.
It got reused for ML because the hardware existed. The mismatch — that ML gradients want exponent and graphics shaders wanted mantissa — only became obvious once people tried to train large models in it. bf16 is the version that was designed once someone actually thought about the workload.
The seam: bf16 isn’t always enough
bf16 is great for training and inference of dense models. It starts to fray in a few places:
- Numerically delicate operations. Some normalization schemes (LayerNorm’s variance computation, log-sum-exp in softmax) benefit from being computed in fp32 even when surrounding ops are bf16. Most frameworks do this implicitly.
- Long-running accumulators. As mentioned, Adam state usually stays fp32. So do gradient accumulators across micro-batches in pipeline parallel training.
- Below 16 bits. Once you go to fp8, the exponent budget gets tight again — fp8 comes in two flavors (E4M3 with 4 exponent bits, E5M2 with 5) precisely because there’s no good single answer at that width. The bf16 lesson — keep exponent — pushes you toward E5M2 for things that need range, and E4M3 for things where you’ve already controlled the range with scaling.
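A quick way to feel that fp8 trade-off: count how many powers of two from a gradient-like dynamic range land in each flavor's normal range (limits per the OCP fp8 spec: E4M3 tops out at 448 because it reclaims exponent codes for normal values; E5M2 is IEEE-style with a max of 57344):

```python
# magnitudes spanning ~6e-5 up to ~1.6e4, the kind of spread gradients show
values = [2.0**k for k in range(-14, 15)]

fits_e4m3 = sum(2.0**-6 <= v <= 448.0 for v in values)     # normals: [2^-6, 448]
fits_e5m2 = sum(2.0**-14 <= v <= 57344.0 for v in values)  # normals: [2^-14, 57344]

print(fits_e4m3, fits_e5m2)  # E5M2 covers the whole spread; E4M3 clips both ends
```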
I don’t have a clean public timeline for when each frontier lab switched their default training format from fp16+loss-scaling to bf16. By the early 2020s it was already the dominant choice in publicly described training recipes that ran on bf16-capable hardware, but the exact internal cutover dates inside frontier labs aren’t something I’d assert without a source.
Famous related terms
- fp32 — 1 sign + 8 exponent + 23 mantissa. The “default” float for decades; still the format of choice for optimizer state and numerically sensitive reductions.
- fp16 — 1 sign + 5 exponent + 10 mantissa. IEEE half: more precise than bf16, narrower range. Lost the training fight; still useful in graphics and some inference settings.
- fp8 (E4M3 / E5M2) — bf16’s logic taken one step further. Two flavors, because at 8 bits you really do have to pick range or precision, not both.
- Mixed precision training — compute in low precision, accumulate in high precision. The recipe that makes any of this actually train.
- Loss scaling — multiply the loss by a big constant, divide the gradients back out, dynamically tune the constant. The fp16 workaround that bf16 made unnecessary.
- Tensor core — a matrix-multiply unit with native low-precision input and fp32 accumulation. The hardware that made low-precision training cheap.
Going deeper
- Google’s bfloat16: The secret to high performance on Cloud TPUs (Cloud blog, 2019) is the cleanest public explainer for the format and why Google picked the exponent/mantissa split they did. It references bfloat16 figures adapted from 2018 TensorFlow Dev Summit material, so “publicly discussed by 2018” is supportable.
- Mixed Precision Training (Micikevicius et al., arXiv 1710.03740, October 2017; ICLR 2018) is the canonical paper for the fp16 + loss-scaling recipe. Reading it is the fastest way to understand what bf16 got rid of. The author list spans NVIDIA and Baidu, not NVIDIA alone.
- IEEE 754-2008 first standardized binary16 (fp16); IEEE 754-2019 retained it. bf16 is not IEEE-standardized — it’s a de facto standard codified by hardware vendors (Google, NVIDIA, ARM, Intel) rather than by IEEE. That’s a small but interesting status difference.
- Higham, Accuracy and Stability of Numerical Algorithms — not ML-specific and not light reading, but the canonical treatment of why “precision” and “range” are different things and how to reason about each.