
Why quantization works

Stuffing a 70-billion-parameter model into 4-bit weights sounds like it should ruin it. It mostly doesn't — and the reason is more about how the model gets used at inference than about the math of rounding.


Why it exists

The first time someone tells you that a 70-billion-parameter model trained in 16-bit precision can be squashed down to 4-bit weights and still answer almost as well, the natural reaction is that it can't be right. Sixteen bits to four is a 4× cut in storage, and the number of distinct values a weight can take falls from 65,536 to 16. If you did that to a photograph, it would visibly fall apart. If you did it to audio, it would hiss. Why doesn't the model just become gibberish?

It’s a fair question, and the answer is the whole reason quantization has gone from research curiosity to default deployment trick in maybe three years. The short version: a trained LLM turns out to be much more robust to weight noise than its precision suggests, and inference turns out to be bottlenecked on something quantization directly relieves — moving bytes, not crunching numbers. So the cost of quantization is small and the benefit is large, in a way that’s specific to how this kind of model gets used, not a universal law about neural networks.

This post is about why that asymmetry exists. You can look up the mechanics of any particular scheme (GPTQ, AWQ, bitsandbytes, FP8, INT4, MXFP4); what's worth internalizing is the load-bearing intuition that makes them all keep working.

Why it matters now

Quantization is the lever that decides whether a given model actually runs on a given GPU. A 70B model in 16-bit weights is roughly 140 GB. An H100 SXM has 80 GB of VRAM. The model does not fit. In INT8 it’s around 70 GB and barely fits with no room for KV cache. In INT4 it’s around 35 GB and you have real headroom for users. The whole “can a hobbyist run this on one card?” question is decided here.
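To make the arithmetic concrete, here it is as a back-of-the-envelope sketch in Python (weights only; real deployments add scale metadata, activations, and the KV cache on top):

```python
# Back-of-the-envelope weight footprint: params * bits_per_weight / 8 bytes.
# Ignores per-group scale metadata (a fraction of a bit per weight) and KV cache.
def weight_gb(n_params: float, bits: int) -> float:
    return n_params * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"70B at {bits:>2}-bit: {weight_gb(70e9, bits):5.0f} GB")
# 70B at 16-bit:   140 GB
# 70B at  8-bit:    70 GB
# 70B at  4-bit:    35 GB
```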

The same lever shows up in serving economics. Dense decode at small batch is memory-bandwidth bound — the GPU has to stream the model’s weights through the compute units to produce each token — so cutting weight bytes by 4× cuts the per-token bandwidth bill by something close to 4×, modulo metadata and kernel overhead. That’s why many public inference providers, on-device runtimes (llama.cpp, MLX, mobile chips), and frontier training labs treat quantization as default rather than exotic. NVIDIA’s Hopper generation even introduced FP8 in hardware so that training could move below 16-bit (NVIDIA H100 Transformer Engine announcement).

The short answer

quantization = lower-precision storage of weights + a calibrated map back to real numbers + a quality budget you spend carefully

You replace each weight (originally a 16-bit float) with a small integer plus a per-group scale factor that says “multiply by this to get back to roughly the original number.” You then accept that the roughly part introduces a little noise, and you bet — correctly, most of the time — that the model has enough redundancy that the noise gets averaged out before it reaches the output.
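Here is a minimal numpy sketch of that round trip: symmetric round-to-nearest with one scale per group. The function names are mine for illustration, not any particular library's API.

```python
import numpy as np

def quantize_group(w: np.ndarray, bits: int = 4):
    """Symmetric round-to-nearest: w is stored as q * scale, q a small integer."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for signed 4-bit
    scale = np.abs(w).max() / qmax              # the per-group scale factor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_group(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale         # the calibrated map back

rng = np.random.default_rng(0)
group = rng.normal(0, 0.02, size=128).astype(np.float32)  # one group of weights
q, scale = quantize_group(group)
roundtrip = dequantize_group(q, scale)
print(f"max abs error: {np.abs(roundtrip - group).max():.5f}")  # small, nonzero
```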

How it works

Four ideas, stacked. Strip out any one of them and quantization stops working.

Idea 1 — neural networks are wildly over-parameterized

A trained transformer has billions of weights, and individually most of them carry very little information. The output of any given layer is a sum over thousands of weights times their inputs, and the intuition is that small rounding errors on each weight tend to cancel out inside that sum before they reach the output. The central limit theorem is the shape of the argument, not a clean theorem about it: the errors aren't IID and the model is learned, so the math isn't strict. Round each weight to the nearest 4-bit value and most of the noise washes out, most of the time.
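You can watch the cancellation in a toy example. Synthetic Gaussian weights and activations, so this illustrates the shape of the argument rather than saying anything about real models:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096)           # one output neuron's weights
x = rng.normal(0.0, 1.0, size=(4096, 1000))    # a batch of activation vectors

# 4-bit round-to-nearest with a single scale for the whole vector
scale = np.abs(w).max() / 7
dw = np.clip(np.round(w / scale), -8, 7) * scale - w   # per-weight rounding error

signal = np.abs(w @ x).mean()                  # typical output magnitude
actual = np.abs(dw @ x).mean()                 # errors partly cancel in the sum
coherent = (np.abs(dw) @ np.abs(x)).mean()     # if every error pushed the same way
print(f"output magnitude: {signal:.2f}, actual error: {actual:.2f}, "
      f"worst-case coherent error: {coherent:.2f}")
```

The actual error lands far below the coherent worst case, which is the sqrt(N)-versus-N gap the central-limit intuition predicts. The residue that survives is what the tighter per-group scales of idea 2 go after.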

The same fact shows up in the older neural-network literature as “overparameterization buys robustness” — pruning, distillation, and quantization all lean on it. Surveys like Gholami et al. (2021) frame the empirical version: networks tolerate weight noise far better than the bit-level math would predict.

I should name a gap here. There isn't, as far as I know, a clean theory that says "a model with N parameters can tolerate this many bits of rounding noise per weight." The evidence is empirical: people quantize, they measure benchmark drop, and on big language models the drop from 16-bit to 4-bit with careful schemes is often within single-digit percentage points on standard benchmarks. The intuition above is the right shape of explanation, but the precise envelope (including whether bigger models really tolerate more relative noise, or just absorb more absolute noise across more parameters) is something you measure, not derive.

Idea 2 — the distribution of weights is friendly

Quantization works by picking a numerical range and slicing it into equal-sized buckets. If a few extreme weights forced that range to be huge, this would be wasteful: the bulk of the weights would crowd into a handful of buckets near zero while most buckets sat nearly empty out toward the extremes.

In trained transformers, weights mostly cluster tightly around zero in a roughly bell-shaped distribution. So a small fixed range (say, the 99.9th percentile of magnitudes in a layer) covers almost every weight, and you can give that range generous resolution. The group_size=128 and per-channel-scale tricks common to schemes like GPTQ and AWQ exploit this: cut the weights into small groups, pick a tight range per group, quantize inside that range. Each group gets its own scale factor, so a layer where one channel has wider weights doesn't drag down the resolution of the rest. (GPTQ (Frantar et al., 2022) layers an additional trick on top, approximate second-order error compensation as it quantizes weight columns one at a time, but grouped scales are the part that the friendly distribution lets you get away with.)
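A toy comparison makes the grouping concrete: plant one artificially wide channel in an otherwise tight weight vector, and a single shared scale wrecks everyone's resolution, while group_size=128 keeps the damage local. Synthetic data, illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(0, 0.02, size=1024)
w[:8] *= 25                                   # one wide channel stretches the range

def rms_error(w: np.ndarray, group_size: int) -> float:
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7   # per-group 4-bit scale
    q = np.clip(np.round(groups / scales), -8, 7)
    return float(np.sqrt(((q * scales - groups) ** 2).mean()))

print(f"single scale for all 1024 weights: {rms_error(w, 1024):.5f}")
print(f"group_size=128:                    {rms_error(w, 128):.5f}")
```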

This is the same reason JPEG works on photos: the signal is not adversarial, it has structure, and you can exploit the structure to spend bits where they matter.

Idea 3 — the bottleneck is bandwidth, not precision

Here’s the part that makes quantization especially worth it for LLMs as opposed to, say, classical scientific computing.

When a dense transformer generates a token at small batch size, the GPU effectively has to read the model's weights out of HBM into the compute units, multiply, and stream activations back. For a dense 70B model in FP16, that's reading ~140 GB of weights per token. With the H100's ~3.35 TB/s of HBM bandwidth, that puts a roofline of ~42 ms per token even if the math itself took zero time. (Batching, mixture-of-experts sparsity, and KV-cache reads change the picture, but for batch-1 dense decode the math is genuinely not the bottleneck.)
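The roofline arithmetic, spelled out:

```python
# Batch-1 dense decode roofline: every weight byte crosses HBM once per token.
weight_bytes_fp16 = 70e9 * 2          # 70B params at 2 bytes each = 140 GB
hbm_bytes_per_sec = 3.35e12           # H100 SXM HBM3 bandwidth
floor_ms = weight_bytes_fp16 / hbm_bytes_per_sec * 1e3
print(f"FP16 floor: {floor_ms:.1f} ms/token")      # ~41.8 ms, ~24 tok/s ceiling
print(f"INT4 floor: {floor_ms / 4:.1f} ms/token")  # ~10.4 ms, ~96 tok/s ceiling
```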

This is why “weight-only quantization” is the dominant trick in practice. You store the weights in INT4, you read them out of HBM at 4 bits per weight (4× less data), then you dequantize on the fly back to FP16 (or BF16) inside the kernel and do the actual matmul in the higher precision. You pay for some extra compute on dequantization — but compute was sitting idle anyway because you were waiting on memory. So the speedup is roughly proportional to the bit reduction, and the quality cost is just the rounding error from idea 1.
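Schematically, the dataflow looks like this. In a real kernel the dequantize happens per tile inside the fused matmul, so only the 4-bit weights ever cross HBM; numpy can mimic the arithmetic but not the memory traffic:

```python
import numpy as np

def weight_only_matmul(q, scales, x):
    """Sketch of a weight-only-quantized matmul (dataflow only, not a kernel)."""
    w = q.astype(np.float16) * scales     # dequantize on the fly...
    return w @ x                          # ...then matmul in the higher precision

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(256, 256)).astype(np.float16)
scales = (np.abs(w).max(axis=1, keepdims=True) / 7).astype(np.float16)
q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)  # int4 range, int8 storage

x = rng.normal(size=(256, 4)).astype(np.float16)
print(np.abs(weight_only_matmul(q, scales, x) - w @ x).max())  # small rounding error
```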

This is the asymmetry that makes quantization shine for LLM inference specifically. In a compute-bound workload (like a small CNN doing image classification on a beefy GPU), shrinking the weights doesn’t help much because you weren’t bottlenecked on reading them. In LLM decode, you absolutely were, so the saving translates almost directly to throughput.

Idea 4 — outliers are the part that bites

If quantization were uniformly easy, you’d see no papers about it. The interesting part is the failure mode, and the failure mode has a name: outlier features.

Tim Dettmers and collaborators showed in LLM.int8() (2022) that as transformers cross a certain scale (around 6.7B parameters in the models they studied), a small number of feature dimensions in the activations start carrying values much larger than the rest — the paper reports magnitudes up to ~20× the typical range, concentrated in specific dimensions. If you quantize naively, those outliers force you to pick a numerical range wide enough to contain them, which wastes precision on the overwhelming majority of “normal” values, and the model’s quality collapses.

LLM.int8() handles this by detecting those outlier dimensions at runtime and routing them through a 16-bit matrix multiplication while quantizing the rest to 8-bit. The paper reports keeping more than 99.9% of values in 8-bit while preserving accuracy on models up to 175B parameters. Later activation-aware schemes — SmoothQuant shifts the difficulty between weights and activations, AWQ chooses per-channel scales that protect the most salient channels — respond to the same underlying observation: not all weights and activations are created equal, and the few that matter most have to be protected.
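Here's the decomposition in schematic form. The detect-and-split structure follows the LLM.int8() idea, but the scaling below is simplified to per-tensor (the paper uses vector-wise scales), and none of this is the bitsandbytes implementation:

```python
import numpy as np

def mixed_precision_matmul(x, w, threshold=6.0):
    """Toy LLM.int8()-style decomposition.

    Feature dimensions of x whose values exceed `threshold` are treated as
    outliers and routed through a full-precision matmul; everything else is
    quantized to int8. Per-tensor scales here for brevity; the paper uses
    vector-wise scaling.
    """
    outlier = np.abs(x).max(axis=0) > threshold       # which feature dims
    y_hi = x[:, outlier] @ w[outlier, :]              # high-precision path

    x_lo, w_lo = x[:, ~outlier], w[~outlier, :]
    sx = np.abs(x_lo).max() / 127 + 1e-12             # activation scale
    sw = np.abs(w_lo).max() / 127 + 1e-12             # weight scale
    qx = np.round(x_lo / sx).astype(np.int8)
    qw = np.round(w_lo / sw).astype(np.int8)
    y_lo = (qx.astype(np.int32) @ qw.astype(np.int32)) * (sx * sw)
    return y_hi + y_lo

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 512)); x[:, 7] *= 30          # plant one outlier dimension
w = rng.normal(size=(512, 256)) * 0.02
err = np.abs(mixed_precision_matmul(x, w) - x @ w).max() / np.abs(x @ w).max()
print(f"relative error with outlier routing: {err:.4f}")
```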

This is the seam in the story. Quantization “just works” on the average weight; the engineering is almost entirely about the few percent of weights and activations where it doesn’t.

Where the seams show

A few honest caveats:

Going deeper