Why quantization works
Stuffing a 70-billion-parameter model into 4-bit weights sounds like it should ruin it. It mostly doesn't — and the reason is more about how the model gets used at inference than about the math of rounding.
Why it exists
The first time someone tells you that a 70-billion-parameter model trained in 16-bit precision can be squashed down to 4-bit weights and still answer almost as well, the natural reaction is that it can't be right. Sixteen bits to four bits is a 4× cut in bits per weight, which drops each weight from 65,536 representable values to just 16. If you did that to a photograph, it would visibly fall apart. If you did it to audio, it would hiss. Why doesn't the model just become gibberish?
It’s a fair question, and the answer is the whole reason quantization has gone from research curiosity to default deployment trick in maybe three years. The short version: a trained LLM turns out to be much more robust to weight noise than its precision suggests, and inference turns out to be bottlenecked on something quantization directly relieves — moving bytes, not crunching numbers. So the cost of quantization is small and the benefit is large, in a way that’s specific to how this kind of model gets used, not a universal law about neural networks.
This post is about why that asymmetry exists. The mechanics of any particular scheme (GPTQ, AWQ, bitsandbytes, FP8, INT4, MXFP4) you can look up; the load-bearing intuition is what makes them all keep working.
Why it matters now
Quantization is the lever that decides whether a given model actually runs on a given GPU. A 70B model in 16-bit weights is roughly 140 GB. An H100 SXM has 80 GB of VRAM. The model does not fit. In INT8 it’s around 70 GB and barely fits with no room for KV cache. In INT4 it’s around 35 GB and you have real headroom for users. The whole “can a hobbyist run this on one card?” question is decided here.
The same lever shows up in serving economics. Dense decode at small batch is memory-bandwidth bound — the GPU has to stream the model’s weights through the compute units to produce each token — so cutting weight bytes by 4× cuts the per-token bandwidth bill by something close to 4×, modulo metadata and kernel overhead. That’s why many public inference providers, on-device runtimes (llama.cpp, MLX, mobile chips), and frontier training labs treat quantization as default rather than exotic. NVIDIA’s Hopper generation even introduced FP8 in hardware so that training could move below 16-bit (NVIDIA H100 Transformer Engine announcement).
The short answer
quantization = lower-precision storage of weights + a calibrated map back to real numbers + a quality budget you spend carefully
You replace each weight (originally a 16-bit float) with a small integer plus a per-group scale factor that says “multiply by this to get back to roughly the original number.” You then accept that the roughly part introduces a little noise, and you bet — correctly, most of the time — that the model has enough redundancy that the noise gets averaged out before it reaches the output.
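To make the recipe concrete, here is a minimal round-trip in NumPy. It is purely illustrative: symmetric 4-bit integers and a single scale for the whole tensor, which no serious scheme stops at; the per-group refinement is what Idea 2 below is about.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096)        # toy "weights", roughly bell-shaped around zero

# quantize: map each weight to one of 16 integer levels (-8..7) via a shared scale
scale = np.abs(w).max() / 7
q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)

# dequantize: the calibrated map back to real numbers
w_hat = q * scale

print("worst-case rounding error:", np.abs(w - w_hat).max())   # bounded by scale / 2
print("typical weight magnitude: ", np.abs(w).mean())
```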
How it works
Four ideas, stacked. Strip out any one of them and quantization stops working.
Idea 1 — neural networks are wildly over-parameterized
A trained transformer has billions of weights, and individually most of them carry a tiny amount of information. The output of any given layer is a sum over thousands of weights times their inputs, and the intuition is that small rounding errors on each weight tend to average out before they reach the output — the central limit theorem is the shape of the argument, not a clean theorem about it (the errors aren’t IID and the model is learned, so the math isn’t strict). Round each weight to the nearest 4-bit value and most of the noise washes out inside that sum, most of the time.
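A quick empirical check of that claim, as a toy NumPy sketch (the layer shape and the naive per-row rounding are arbitrary choices, not any particular method's):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, size=(512, 4096))   # one toy layer's weight matrix
x = rng.normal(0, 1.0, size=4096)           # an activation vector

# naive symmetric 4-bit rounding, one scale per output row
scale = np.abs(W).max(axis=1, keepdims=True) / 7
W_hat = np.clip(np.round(W / scale), -8, 7) * scale

per_weight = np.abs(W - W_hat).mean() / np.abs(W).mean()
layer_out = np.linalg.norm(W @ x - W_hat @ x) / np.linalg.norm(W @ x)

# the output error stays near the per-weight noise level instead of growing with
# the 4096 noisy terms in each sum, because the rounding errors add incoherently
print(f"per-weight relative error ~{per_weight:.1%}, layer-output relative error ~{layer_out:.1%}")
```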
The same fact shows up in the older neural-network literature as “overparameterization buys robustness” — pruning, distillation, and quantization all lean on it. Surveys like Gholami et al. (2021) frame the empirical version: networks tolerate weight noise far better than the bit-level math would predict.
I should name a gap here. There isn't, as far as I know, a clean theory that says "a model with N parameters can survive X bits of weight noise per parameter." The evidence is empirical: people quantize, they measure benchmark drop, and on big language models the drop from 16-bit to 4-bit on careful schemes is often within single-digit percentage points on standard benchmarks. The intuition above is the right shape of explanation, but the precise envelope — including whether bigger models really tolerate more relative noise, or just more absolute parameters worth of noise — is something you measure, not derive.
Idea 2 — the distribution of weights is friendly
Quantization works by picking a numerical range and slicing it into equal-sized buckets. If your weights were uniformly spread over a huge range, this would be wasteful — most buckets would sit empty between extremes.
In trained transformers, weights mostly cluster tightly around zero in a roughly bell-shaped distribution. So a small fixed range (say, the 99.9th percentile of magnitudes in a layer) covers almost every weight, and you can give that range generous resolution. The group_size=128 and per-channel-scale tricks common to libraries like GPTQ and AWQ exploit this: cut the weights into small groups, pick a tight range per group, quantize inside that range. Each group gets its own scale factor, so a layer where one channel has wider weights doesn't drag down the resolution of the rest. (GPTQ (Frantar et al., 2022) layers an additional trick on top — approximate second-order error compensation as it quantizes weight columns one at a time — but grouped scales are the part that the friendly distribution lets you get away with.)
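A minimal sketch of what grouped scales buy, assuming one unusually wide slice of weights (the sizes are made up; the group size of 128 just mirrors the common default):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096)
w[:32] *= 25                        # one slice of unusually wide weights

def int4_roundtrip(vals, scale):
    return np.clip(np.round(vals / scale), -8, 7) * scale

# one scale for the whole tensor: the wide slice forces a wide range on everyone
per_tensor = int4_roundtrip(w, np.abs(w).max() / 7)

# one scale per group of 128: only the group holding the wide slice pays for it
groups = w.reshape(-1, 128)
scales = np.abs(groups).max(axis=1, keepdims=True) / 7
per_group = int4_roundtrip(groups, scales).reshape(-1)

print("mean error, per-tensor scale:", np.abs(w - per_tensor).mean())
print("mean error, per-group scales:", np.abs(w - per_group).mean())
```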
This is the same reason JPEG works on photos: the signal is not adversarial, it has structure, and you can exploit the structure to spend bits where they matter.
Idea 3 — the bottleneck is bandwidth, not precision
Here’s the part that makes quantization especially worth it for LLMs as opposed to, say, classical scientific computing.
When a dense transformer generates a token at small batch size, the GPU effectively has to read the model’s weights out of HBM into the compute units, multiply, and stream activations back. For a dense 70B model in FP16, that’s reading ~140 GB of weights per token. With H100’s ~3.35 TB/s of HBM bandwidth, that puts a roofline of ~40 ms per token even if the math itself took zero time. (Batching, mixture-of-experts sparsity, and KV-cache reads change the picture — but for batch-1 dense decode, the math is genuinely not the bottleneck.)
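The roofline is just division; spelling out the arithmetic with the figures quoted above (no allowance for kernel overhead, KV-cache reads, or batching):

```python
params = 70e9                 # dense 70B model
hbm_bytes_per_s = 3.35e12     # H100 SXM HBM bandwidth

for name, bytes_per_weight in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    weight_bytes = params * bytes_per_weight
    ms_per_token = weight_bytes / hbm_bytes_per_s * 1e3   # batch-1 decode streams every weight once
    print(f"{name}: {weight_bytes / 1e9:.0f} GB of weights, roofline ~{ms_per_token:.0f} ms/token")
```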
This is why “weight-only quantization” is the dominant trick in practice. You store the weights in INT4, you read them out of HBM at 4 bits per weight (4× less data), then you dequantize on the fly back to FP16 (or BF16) inside the kernel and do the actual matmul in the higher precision. You pay for some extra compute on dequantization — but compute was sitting idle anyway because you were waiting on memory. So the speedup is roughly proportional to the bit reduction, and the quality cost is just the rounding error from idea 1.
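In code terms, the division of labor looks roughly like this. It's a NumPy stand-in for what a fused GPU kernel does; real implementations pack two INT4 values per byte and fuse the dequantization into the matmul tile loop, both of which this sketch skips.

```python
import numpy as np

def weight_only_linear(x, q_w, scales):
    """Toy weight-only-quantized linear layer.

    x:      activations in 16-bit, shape (batch, d_in)
    q_w:    INT4-valued weights stored as int8, shape (d_out, d_in)
    scales: one 16-bit scale per output row, shape (d_out, 1)
    """
    w = q_w.astype(np.float16) * scales    # dequantize on the fly: cheap elementwise work
    return x @ w.T                         # the matmul itself runs at 16-bit, as in the original model

q_w = np.array([[3, -2], [7, 1]], dtype=np.int8)
scales = np.array([[0.01], [0.02]], dtype=np.float16)
x = np.ones((1, 2), dtype=np.float16)
print(weight_only_linear(x, q_w, scales))   # ~[[0.01, 0.16]]
```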
This is the asymmetry that makes quantization shine for LLM inference specifically. In a compute-bound workload (like a small CNN doing image classification on a beefy GPU), shrinking the weights doesn’t help much because you weren’t bottlenecked on reading them. In LLM decode, you absolutely were, so the saving translates almost directly to throughput.
Idea 4 — outliers are the part that bites
If quantization were uniformly easy, you’d see no papers about it. The interesting part is the failure mode, and the failure mode has a name: outlier features.
Tim Dettmers and collaborators showed in LLM.int8() (2022) that as transformers cross a certain scale (around 6.7B parameters in the models they studied), a small number of feature dimensions in the activations start carrying values much larger than the rest — the paper reports magnitudes up to ~20× the typical range, concentrated in specific dimensions. If you quantize naively, those outliers force you to pick a numerical range wide enough to contain them, which wastes precision on the overwhelming majority of “normal” values, and the model’s quality collapses.
LLM.int8() handles this by detecting those outlier dimensions at runtime and routing them through a 16-bit matrix multiplication while quantizing the rest to 8-bit. The paper reports keeping more than 99.9% of values in 8-bit while preserving accuracy on models up to 175B parameters. Later activation-aware schemes — SmoothQuant shifts the difficulty between weights and activations, AWQ chooses per-channel scales that protect the most salient channels — respond to the same underlying observation: not all weights and activations are created equal, and the few that matter most have to be protected.
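A sketch of the decomposition idea in the spirit of LLM.int8(). It is heavily simplified: the real method uses vector-wise scales and fused CUDA kernels, while this version uses one crude scale per tensor; the 6.0 threshold matches the paper's default, everything else is illustrative.

```python
import numpy as np

def mixed_precision_matmul(x, w, outlier_threshold=6.0):
    """Toy LLM.int8()-style decomposition: activation dimensions whose values
    exceed the threshold take a 16-bit path; everything else goes through int8."""
    outlier_cols = np.abs(x).max(axis=0) > outlier_threshold   # the few "hot" feature dimensions
    normal_cols = ~outlier_cols

    # int8 path for the well-behaved majority of activations and weights
    sx = np.abs(x[:, normal_cols]).max() / 127
    sw = np.abs(w[normal_cols, :]).max() / 127
    xq = np.round(x[:, normal_cols] / sx).astype(np.int8)
    wq = np.round(w[normal_cols, :] / sw).astype(np.int8)
    y_int8 = (xq.astype(np.int32) @ wq.astype(np.int32)) * (sx * sw)

    # 16-bit side channel for the outlier dimensions, typically well under 1% of columns
    y_fp16 = x[:, outlier_cols] @ w[outlier_cols, :]

    return y_int8 + y_fp16
```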
This is the seam in the story. Quantization “just works” on the average weight; the engineering is almost entirely about the few percent of weights and activations where it doesn’t.
Where the seams show
A few honest caveats:
- Quantization is post-hoc, mostly. The standard recipe is: train in BF16 or FP16, then quantize for inference. There are quantization-aware training schemes too, and FP8 training is now real on Hopper hardware, but the dominant industrial pattern is still “train high, serve low.” The story above is about that asymmetry; training has its own constraints.
- Aggressive quantization eventually does break. INT4 with careful schemes is often fine. INT3 often starts to bite. INT2 and 1-bit weights are an active research area, not a default. The cliff exists; you just hit it lower than your intuition suggested, and exactly where depends on the model, the scheme, and the benchmark.
- Benchmarks can mislead about quality. A model can score similarly on MMLU after quantization but feel measurably worse on long-context tasks, multilingual tasks, or code generation. The quality budget isn’t a single number, and a provider quietly moving from FP16 to INT8 to INT4 is betting that the slice of quality its users notice is the slice that survives.
- The KV cache is also being quantized. Most of this post is about weight quantization, but modern engines also store the KV cache in FP8 or INT8 to fit longer contexts. The mechanisms are similar but the failure modes differ — KV outliers behave differently from weight outliers.
Famous related terms
- GPTQ — GPTQ = post-training INT4 weight quantization + second-order error correction per layer — the workhorse 4-bit method on open-weight models; from Frantar et al., 2022.
- AWQ — AWQ = activation-aware weight quantization + per-channel scaling that protects salient weights — alternative to GPTQ that uses activation statistics rather than Hessian information.
- LLM.int8() — LLM.int8() = 8-bit matmul + a 16-bit side-channel for outlier feature dimensions — the paper that named the outlier problem and made INT8 inference work at 175B scale.
- FP8 (E4M3 / E5M2) — FP8 = 8-bit floating point + two formats for activations vs gradients — Hopper's hardware bet that even training can move below 16 bits; see NVIDIA Transformer Engine docs.
- BF16 vs FP16 — BF16 = FP32's exponent + truncated mantissa — the precision most modern training is already done at; quantization compresses below this.
- Weight-only quantization — weight-only quantization = compressed weights in HBM + dequantize-on-the-fly inside the kernel — the variant that exploits the memory-bandwidth bottleneck specifically.
Going deeper
- Frantar, Ashkboos, Hoefler, Alistarh — GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (ICLR 2023). arXiv. The 4-bit-on-LLMs paper.
- Dettmers, Lewis, Belkada, Zettlemoyer — LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (NeurIPS 2022). arXiv. The outlier-features paper.
- Lin et al. — AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (MLSys 2024 best paper). arXiv, GitHub.
- Gholami et al. — A Survey of Quantization Methods for Efficient Neural Network Inference (2021). arXiv. The textbook overview of why neural nets tolerate this in the first place.
- NVIDIA — Using FP8 with Transformer Engine. Docs. The hardware side of the same story.