Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why FP8 training is stable

FP8 has only 256 representable values. Training a frontier model in it sounds insane — and it almost is. Here's the trick that makes it work.

AI & ML intermediate Apr 30, 2026

Why it exists

A trillion-parameter LLM is mostly matrix multiplies. Halve the number of bits each number takes, and you roughly double how many of those multiplies you can do per second on the same chip — and halve the memory traffic feeding them. That is the entire prize. Going from FP32 down to 16-bit (FP16 or BF16) gave the field one such doubling. Going to 8 bits is the next one.

The reason “just use 8 bits” sounds insane is that an 8-bit float has exactly 256 representable values. That’s not a typo. Across a tensor with millions of entries spanning many orders of magnitude — gradients near zero, activation outliers shooting into the thousands — 256 buckets is brutal. Naively cast everything to FP8 and the loss curve diverges within a few hundred steps.

So FP8 training sat in the “in theory yes, in practice no” bucket for years. NVIDIA’s Hopper architecture (H100, announced March 2022) was the first NVIDIA GPU architecture with native FP8 tensor cores; the question was whether anyone could actually keep a model converging on them. By late 2024 there were public examples — DeepSeek-V3 (arXiv preprint, Dec 27 2024) used an FP8 mixed-precision training framework and, in their ablations, reported relative loss error below 0.25% versus BF16 baselines.

The interesting question isn’t can you train in FP8 — it’s what makes it stable, given that 256 values per tensor really is the constraint.

Why it matters now

Every halving of training precision is, roughly, a halving of training cost — both in flops and in HBM bandwidth. For a frontier run, that’s tens of millions of dollars and weeks of wall-clock time. It also expands what fits at all: more parameters, longer context, bigger batches, on the same cluster.

It matters past the giant labs too. FP8 inference is increasingly common for serving — the same numerical tricks let you keep a deployed model honest at roughly half the memory of BF16. Understanding why FP8 training works tells you why FP8 inference works, since they share the failure modes.

The short answer

FP8 training = (E4M3 + E5M2) + per-tensor scale factors + a BF16 master copy of weights

Two FP8 formats, not one: E4M3 for the forward pass where precision matters, and E5M2 for gradients, which span a huge dynamic range. (DeepSeek-V3 deviates here and uses E4M3 for all of its FP8 GEMMs — more on that below.) Each tensor gets its own scale factor that slides its values into FP8’s narrow window before quantization, then slides back out. And the “real” weights — the ones the optimizer updates — live in higher precision; FP8 is just the format the matmuls run in.

How it works

Three ideas, each addressing a specific way naive FP8 explodes.

1. Two formats, because one isn’t enough.

FP8 has 8 bits to spend. You can spend more on the exponent (range) or more on the mantissa (precision); you can't have both. NVIDIA standardized two splits: E4M3 (4 exponent bits, 3 mantissa bits, largest finite value 448) and E5M2 (5 exponent bits, 2 mantissa bits, largest finite value 57,344).

The “hybrid” recipe — E4M3 for forward activations and weights, E5M2 for backward gradients — is the default in NVIDIA’s Transformer Engine. Forward values cluster in a manageable range; gradients can be tiny one layer and large the next, so they need the headroom.
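To make the trade-off concrete, here's a rough simulation of how the same values land in each format. It is not bit-exact FP8 (no round-to-nearest-even, no NaN handling) and the helper name is made up, but the ranges and mantissa widths match the two splits above.

```python
import math

def quantize_fp8_sim(x, man_bits, max_val, min_exp):
    """Round x to a simulated FP8 grid: clamp to the format's range, then
    keep only `man_bits` of mantissa. A sketch, not a bit-exact cast."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    mag = min(abs(x), max_val)                       # saturate at the format max
    exp = max(math.floor(math.log2(mag)), min_exp)   # subnormals share min_exp
    quantum = 2.0 ** (exp - man_bits)                # spacing of representable values
    return sign * round(mag / quantum) * quantum

# E4M3: 3 mantissa bits, max 448,   smallest normal exponent -6
# E5M2: 2 mantissa bits, max 57344, smallest normal exponent -14
for x in (0.00007, 1.1, 800.0):
    e4m3 = quantize_fp8_sim(x, man_bits=3, max_val=448.0, min_exp=-6)
    e5m2 = quantize_fp8_sim(x, man_bits=2, max_val=57344.0, min_exp=-14)
    print(f"{x:>10}:  E4M3 -> {e4m3:<10}  E5M2 -> {e5m2}")
```

The tiny value underflows to zero in E4M3 but survives in E5M2; the large one saturates at 448 in E4M3 but not in E5M2; the mid-range value rounds more finely in E4M3. That is the whole trade.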

2. Per-tensor scaling, because the dynamic range is the enemy.

If your tensor’s largest absolute value is 800, and E4M3’s max is 448, every value above 448 saturates to the same number — you’ve thrown away the tail. If it’s 0.001, almost everything rounds to zero — you’ve thrown away the body. The fix is a single FP32 scale factor s per tensor: store x / s in FP8, multiply by s on the way out. Now you only need to fit the shape of the distribution into 256 buckets, not its absolute scale.
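A minimal sketch of that quantize/dequantize round trip, using NumPy and the same crude E4M3 rounding as above (the function names are illustrative, not any library's API):

```python
import numpy as np

E4M3_MAX = 448.0

def fp8_e4m3_round(x):
    # Crude E4M3 simulator: clamp to the format's range, then keep 3 mantissa
    # bits by rounding at each value's binary exponent (subnormals share -6).
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    mag = np.maximum(np.abs(x), 2.0 ** -9)          # 2^-9 is the smallest subnormal
    exp = np.maximum(np.floor(np.log2(mag)), -6.0)
    quantum = 2.0 ** (exp - 3)
    return np.round(x / quantum) * quantum

def fp8_quantize_per_tensor(x):
    # One FP32 scale per tensor: slide the max-abs onto E4M3's max, quantize,
    # and return the payload plus the scale needed to undo it.
    amax = np.abs(x).max()
    scale = max(amax, 1e-12) / E4M3_MAX
    return fp8_e4m3_round(x / scale), scale

# Values that would otherwise saturate (800 > 448) or underflow to zero:
x = np.array([800.0, 0.03, -1.7, 0.0005])
x_fp8, scale = fp8_quantize_per_tensor(x)
print("dequantized:", x_fp8 * scale)   # multiply by the scale on the way out
```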

Picking s is the whole game. Set it from the current tensor’s max-abs and you pay a reduction op every step. A common production trick is delayed scaling: keep a short history of recent max-abs values, derive s from that. Cheap, but vulnerable — one outlier iteration poisons the history and the next step blows up. This failure mode shows up in the literature on FP8 training stability (e.g. Lee et al., 2024 — “To FP8 and Back Again”); how often it bites in real frontier runs isn’t something I can source publicly.
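A toy sketch of the delayed-scaling idea, to show where the vulnerability lives. The class and history length here are invented for illustration; real recipes (e.g. in NVIDIA's Transformer Engine) add more machinery around the same core.

```python
from collections import deque

E4M3_MAX = 448.0

class DelayedScaler:
    """Derive this step's scale from a short history of past max-abs values,
    instead of paying a reduction over the current tensor."""
    def __init__(self, history_len=16, init_amax=1.0):
        self.amax_history = deque([init_amax], maxlen=history_len)

    def scale(self):
        # Chosen so the largest recently seen value maps onto E4M3's max.
        return max(self.amax_history) / E4M3_MAX

    def update(self, tensor_amax):
        # Record the max-abs actually observed this step, for future steps.
        self.amax_history.append(tensor_amax)

scaler = DelayedScaler()
for step, amax in enumerate([3.0, 2.5, 4.0, 900.0, 3.0, 2.8]):
    s = scaler.scale()   # uses history only; the 900.0 step saturates badly,
    scaler.update(amax)  # then poisons the scale for the steps that follow
    print(f"step {step}: scale from history = {s:.4f}, this step's amax = {amax}")
```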

3. A higher-precision master copy of the weights.

This is the part people miss. The matmuls run in FP8. The weights themselves — the parameters the optimizer updates — are kept in BF16 or FP32. Each step you cast down to FP8 to compute, then apply the gradient update to the high-precision master copy. Optimizer state (Adam’s moments) also stays in higher precision.

Why? Because gradient updates are tiny — often many orders of magnitude smaller than the weights they’re updating. In FP8 those updates round to zero. In BF16 they don’t. So FP8 buys you the matmul throughput; the master copy buys you the ability to actually accumulate learning. This is the same pattern as FP16 mixed-precision training (Micikevicius et al., 2017) — FP8 just pushes it further.
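A toy comparison that shows why the update has to land in the higher-precision copy. The to_fp8_sim helper is the same crude E4M3 rounding as in the earlier sketches, not a real FP8 cast, and the update size is just a plausible lr * grad.

```python
import numpy as np

def to_fp8_sim(x, man_bits=3, max_val=448.0, min_exp=-6):
    # Crude E4M3 rounding, as in the earlier sketches.
    x = np.clip(x, -max_val, max_val)
    mag = np.maximum(np.abs(x), 2.0 ** (min_exp - man_bits))
    exp = np.maximum(np.floor(np.log2(mag)), min_exp)
    quantum = 2.0 ** (exp - man_bits)
    return np.round(x / quantum) * quantum

update = 1e-4                    # a typical per-step weight change: lr * grad
w_fp8_only = np.float32(1.0)     # weights kept only in FP8 (what NOT to do)
w_master   = np.float32(1.0)     # higher-precision master copy (FP32 here)

for _ in range(1000):
    w_fp8_only = to_fp8_sim(w_fp8_only - update)   # rounds straight back to 1.0
    w_master   = w_master - update                 # accumulates normally

print("FP8-only weight after 1000 steps:   ", w_fp8_only)  # still 1.0
print("master-copy weight after 1000 steps:", w_master)    # ~0.9
```

The FP8-only weight never learns anything: every step's update is smaller than the format's rounding step, so it vanishes. The master copy accumulates all thousand of them.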

The seam: where naive FP8 still breaks.

Per-tensor scaling assumes one number can describe a whole tensor’s distribution. That’s wrong when a tensor has outliers — a few entries 100× larger than the rest. The big values force s up, which crushes the small values into a handful of buckets. Activation outliers in transformer attention are known to be a real problem here.

DeepSeek-V3’s contribution was to scale at finer granularity: 1×128 tiles for activations, 128×128 blocks for weights. Each tile/block gets its own scale, so an outlier only ruins precision for its neighborhood, not the whole tensor. They also use E4M3 (not the hybrid E4M3/E5M2 split) for all FP8 GEMMs, arguing the finer-grained scaling reclaimed enough effective range; some non-GEMM ops still stay in BF16/FP32 in their framework. Whether this fine-grained approach is now the dominant production recipe across other frontier labs, I don’t have public information to say — labs are quiet about training stacks.
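A sketch of what block-wise scaling looks like for a weight matrix, using the 128×128 block size mentioned above (activations get 1×128 tiles analogously). The function is illustrative, not DeepSeek's code; the point is that the outlier only inflates one block's scale.

```python
import numpy as np

E4M3_MAX = 448.0

def blockwise_scales(w, block=128):
    """One FP32 scale per 128x128 block instead of one per tensor, so an
    outlier only costs precision inside its own neighborhood."""
    rows, cols = w.shape
    scales = np.empty((rows // block, cols // block), dtype=np.float32)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = w[i:i + block, j:j + block]
            scales[i // block, j // block] = np.abs(tile).max() / E4M3_MAX
    return scales

w = np.random.default_rng(0).standard_normal((256, 256)).astype(np.float32)
w[3, 7] = 500.0                                    # a single outlier entry
print("per-tensor scale:", np.abs(w).max() / E4M3_MAX)  # dominated by the outlier
print("per-block scales:\n", blockwise_scales(w))       # only block (0, 0) is inflated
```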

Going deeper