Why does GPU memory bandwidth matter more than FLOPS for LLM inference?
You bought the GPU for the teraflops. At inference time, almost none of them are doing anything. The bottleneck is moving the weights, not multiplying them.
Why it exists
Picture streaming a 4K movie on a slow Wi-Fi connection. Your laptop’s CPU is more than fast enough to play the video — it sits mostly idle. The movie stutters because frames can’t reach the laptop fast enough. Speeding up the CPU wouldn’t help; you need a fatter pipe. Running a large model on a GPU has exactly this shape. The GPU has tens of thousands of math units doing nothing most of the time. The bottleneck is dragging the model’s weights — tens of gigabytes — out of memory and into the math units, fast enough to keep them fed. That’s memory bandwidth.
If you’ve ever shopped for a GPU to run a local model, you’ve noticed something strange. A consumer card and a data-center card might have wildly different FLOPS numbers, but the tokens-per-second you actually get from a 70B model tracks something else almost perfectly: memory bandwidth, the number on the spec sheet measured in GB/s.
A consumer 4090 has roughly 1 TB/s of memory bandwidth. An H100 SXM has roughly 3.35 TB/s. Apple’s M2 Ultra has around 800 GB/s of unified memory bandwidth and is somehow competitive with discrete GPUs on local inference despite descending from a laptop chip lineage. Meanwhile the compute gap between these parts is much larger than the memory-bandwidth gap. So how come the slow-on-paper machines don’t fall further behind?
That puzzle is the question this post exists to answer: at inference time, your GPU is barely computing anything. It is reading. Generating a single token forces it to drag the entire model’s weights from VRAM to its compute units, do a small amount of arithmetic, and throw the weights away. Token N+1 has to read all of them again. The throughput of that read pipe — bandwidth — is the actual ceiling.
Why it matters now
Almost every cost and performance question in modern LLM serving is secretly a memory-bandwidth question:
- “Why is my 7B model so much faster than my 70B model on the same GPU?” Roughly because there are 10× fewer bytes of weights to ship per token. The arithmetic is cheaper too, but the arithmetic was never the limit.
- “Why does quantization speed up inference so much, when it doesn’t reduce the number of multiplications?” Because halving the bytes per weight halves the bytes you have to move per token. Bandwidth-bound work scales with bytes, not with operations. (A quick bytes-per-token sketch follows this list.)
- “Why is speculative decoding a free lunch?” Because verifying ten proposed tokens reads the weights once — the same single trip through the model that one-token decoding costs. You amortize the bandwidth bill over more tokens.
- “Why is batching the trick every serving engine reaches for?” Same reason. One weight read, many sequences using it.
- “Why are H100s and Blackwells so expensive when their FLOPS-per-dollar isn’t that extreme?” Because they ship HBM, and HBM is what you’re actually paying for.
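To make the first two bullets concrete, here is a quick bytes-per-token sketch in Python. The parameter counts and bit widths are round illustrative numbers, not exact model sizes.

```python
def weight_bytes_per_token(n_params: float, bits_per_weight: int) -> float:
    """Bytes of weights that must stream from VRAM to generate one token."""
    return n_params * bits_per_weight / 8

# Round, illustrative numbers only.
for name, params, bits in [("7B, FP16  ", 7e9, 16),
                           ("70B, FP16 ", 70e9, 16),
                           ("70B, 4-bit", 70e9, 4)]:
    gb = weight_bytes_per_token(params, bits) / 1e9
    print(f"{name}: ~{gb:.0f} GB read per token")
# 7B, FP16  : ~14 GB  -> roughly 10x fewer bytes than the 70B
# 70B, FP16 : ~140 GB
# 70B, 4-bit: ~35 GB  -> same multiplications, a quarter of the bytes
```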
If your mental cost model for inference is “FLOPS in, tokens out,” it is predicting the wrong things. A bandwidth-first model predicts the right things, and explains a pile of otherwise mysterious engineering choices.
The short answer
LLM decode speed ≈ GPU memory bandwidth ÷ model size in bytes
To generate one token, the GPU has to stream every model weight from VRAM to its compute units. The compute itself is fast and finishes early; the weights take time to arrive. Tokens per second is, to a first approximation, how many times per second the GPU can drag the whole model across that pipe.
How it works
Look at what actually happens in a single decode step (one new token):
1. A small input — the new token’s vector — enters the model.
2. For every layer, the GPU multiplies that vector by the layer’s weight matrices.
3. The result becomes the input for the next layer.
4. At the top, you sample a token.
Step 2 is “matrix times vector” — matvec. This is the part that decides the speed, and it has a property worth staring at: the matrix is huge (the weights), and the vector is small (one token’s activations). Each weight is read from memory, used in one multiply-add, and then never touched again for this token.
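Here is a toy sketch of that shape in NumPy. The layer count and sizes are made up for illustration; a real transformer layer has several weight matrices plus attention against the KV cache, but the memory-traffic pattern is the same.

```python
import numpy as np

# Toy decode step: one token's activation vector pushed through a stack of layers.
d_model, n_layers = 1024, 8
rng = np.random.default_rng(0)

# The weights: large matrices that must stream from memory on every decode step.
weights = [rng.standard_normal((d_model, d_model), dtype=np.float32)
           for _ in range(n_layers)]

x = rng.standard_normal(d_model, dtype=np.float32)  # one token's activations (small)

for W in weights:
    # Every element of W is read once, used in one multiply-add, and then not
    # touched again for this token.
    x = W @ x

# Generating the next token repeats the full read of every W.
```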
That ratio — operations per byte loaded — is called arithmetic intensity. For matvec, it’s basically 1: one multiply-add per weight loaded. Modern GPUs need an arithmetic intensity in the dozens to hundreds before they become compute-bound rather than memory-bound. Decode is nowhere near that threshold. The compute units sit idle waiting for VRAM; the bandwidth pipe runs flat out.
You can sanity-check this with a back-of-envelope calculation. A 70B-param model at FP16 is ~140 GB of weights. On a single H100 with ~3.35 TB/s of HBM bandwidth, the absolute ceiling on per-token decode is roughly:
3.35 TB/s ÷ 140 GB ≈ 24 tokens/sec
That’s a hard upper bound from physics, ignoring all overhead. Real systems land somewhere below it. (A 70B model doesn’t even fit on a single H100 in FP16, so in practice you’d shard or quantize, but the shape of the calculation is what matters.) Notice what’s not in that calculation: the GPU’s FLOPS rating. It doesn’t appear because it isn’t the bottleneck.
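The same back-of-envelope as a few lines of Python, so you can plug in your own card’s bandwidth and model size; the constants are the rounded spec-sheet figures used above.

```python
def decode_ceiling_tokens_per_sec(bandwidth_bytes_per_s: float,
                                  weight_bytes: float) -> float:
    """Hard upper bound on decode speed: one full read of the weights per token."""
    return bandwidth_bytes_per_s / weight_bytes

H100_HBM_BW = 3.35e12     # ~3.35 TB/s, rounded spec-sheet figure
LLAMA_70B_FP16 = 140e9    # ~140 GB of FP16 weights

print(decode_ceiling_tokens_per_sec(H100_HBM_BW, LLAMA_70B_FP16))  # ~24 tokens/sec
```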
Now contrast with prefill — processing the prompt before any tokens come out. Prefill multiplies weights against many tokens at once (the whole prompt), so each weight gets reused across all those tokens. Arithmetic intensity goes up, the compute units actually get used, and prefill is genuinely compute-bound. This is why “time to first token” (prefill-bound) and “tokens per second after that” (decode-bound) live on different curves and respond to different optimizations. The same GPU that is FLOPS-bound during prefill is bandwidth-bound during decode.
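A minimal sketch of that arithmetic-intensity contrast, counting FLOPs and weight bytes for one hypothetical weight matrix (activation traffic ignored; the matrix size and prompt length are assumptions for illustration):

```python
def arithmetic_intensity(n_tokens: int, d_in: int = 8192, d_out: int = 8192,
                         bytes_per_weight: int = 2) -> float:
    """FLOPs per byte of weights loaded when a [d_out, d_in] matrix is applied
    to n_tokens token vectors."""
    flops = 2 * d_out * d_in * n_tokens          # one multiply + one add per weight per token
    bytes_loaded = d_out * d_in * bytes_per_weight
    return flops / bytes_loaded

print(arithmetic_intensity(n_tokens=1))    # decode: 1.0 FLOP/byte, far below the GPU's threshold
print(arithmetic_intensity(n_tokens=512))  # prefill on a 512-token prompt: 512.0, compute-bound territory
```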
Why batching breaks the rule (and why it has limits)
Batching multiple users’ requests together is the closest thing to a free lunch the bandwidth model allows. If 32 users each want a token, you can load the weights once and do 32 independent matvecs against them, turning matvec into matmat. Arithmetic intensity goes up by 32×. Until, that is, something else fills up — usually the KV cache, which grows per request and per token and eventually pushes you back up against a bandwidth wall, just on different data. Memory bandwidth is still the dominant constraint; batching just lets you share the bill.
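A rough sketch of where the bill moves, using assumed Llama-2-70B-like shapes (80 layers, 8 KV heads of dimension 128, FP16 everywhere); real serving engines track far more than this.

```python
# Assumed Llama-2-70B-like shapes; treat as illustrative, not exact.
WEIGHT_BYTES = 140e9                        # FP16 weights
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2   # K+V, 80 layers, 8 KV heads, dim 128, FP16 (~0.33 MB)

def bytes_read_per_decode_step(batch_size: int, context_len: int) -> float:
    """Weights are read once per step and shared; each sequence reads its own KV cache."""
    weight_read = WEIGHT_BYTES
    kv_read = batch_size * context_len * KV_BYTES_PER_TOKEN
    return weight_read + kv_read

for batch in (1, 32, 256):
    total = bytes_read_per_decode_step(batch, context_len=4096)
    print(f"batch {batch:3d}: ~{total / 1e9:.0f} GB read per step, "
          f"weights are {WEIGHT_BYTES / total:.0%} of it")
# batch   1: ~141 GB, weights ~99%  -> classic weight-bound decode
# batch  32: ~183 GB, weights ~77%  -> one weight read amortized over 32 tokens of output
# batch 256: ~484 GB, weights ~29%  -> the KV cache is now most of the bandwidth bill
```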
Where this gets fuzzy
- The “≈” is doing real work. Real inference involves attention reads against the KV cache (also bandwidth-bound, but on cache bytes not weight bytes), kernel launch overhead, communication between GPUs in a sharded setup, and some genuinely compute-bound bits. The “bandwidth ÷ size” estimate is an upper bound, not a prediction.
- It’s a decode-time argument. Training, prefill, and very-large-batch serving all push toward compute-bound regimes. “Bandwidth matters more than FLOPS” is specifically a claim about single-request decode.
- Architectures change the picture. MoE models route each token through only some of the weights, which changes how many bytes per token you actually move. I don’t have hard numbers for the bandwidth/FLOPS balance on production MoE serving — the public treatment of this is patchy and the trade-offs depend on how routing, expert placement, and batching interact in ways that aren’t always documented.
- Apple’s unified memory is a quieter but related story. Big unified bandwidth plus big unified capacity makes laptops surprisingly good at inference for their compute class. The exact reasons it punches above its weight at specific model sizes involve memory layout and software maturity I’d be guessing about.
The point isn’t that FLOPS don’t matter. They matter for training, prefill, and image/video models with very different intensity profiles. The point is that the mental model “compute = speed” comes from a world that isn’t the one we’re in for LLM inference.
Famous related terms
- Arithmetic intensity — FLOPs ÷ bytes loaded — the dial that decides whether you’re compute-bound or memory-bound.
- Roofline model — a chart with two ceilings, bandwidth and FLOPS — a one-page mental model from HPC for predicting which of the two is going to bite first. Worth knowing.
- HBM — stacked DRAM bonded to the GPU package — what makes a data-center GPU expensive, and the actual scarce resource.
- Quantization — same weights, fewer bits each — speeds up inference because bytes-moved drops, not because operations drop.
- KV cache — the other big bandwidth consumer at decode time; your GPU is reading both weights and cache on every token.
- MoE (Mixture of Experts) — many experts, route each token to a few — changes the bytes-per-token math by activating only some weights.
- Speculative decoding — small model drafts, big model verifies in parallel — works by amortizing one weight read across many tokens.
Going deeper
- Horace He, Making Deep Learning Go Brrrr From First Principles — the clearest writeup I know on memory-bound vs compute-bound and arithmetic intensity for ML workloads. If you only read one thing, read this.
- The roofline model literature (Williams, Waterman, Patterson, 2009) — the original HPC framing this whole argument inherits.
- The vLLM and TensorRT-LLM design notes — a lot of their cleverness only makes sense once you accept that bandwidth, not compute, is what’s scarce.