Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why does GPU memory bandwidth matter more than FLOPS for LLM inference?

You bought the GPU for the teraflops. At inference time, almost none of them are doing anything. The bottleneck is moving the weights, not multiplying them.

AI & ML · intermediate · Apr 29, 2026

Why it exists

Picture streaming a 4K movie on a slow Wi-Fi connection. Your laptop’s CPU is more than fast enough to play the video — it sits mostly idle. The movie stutters because frames can’t reach the laptop fast enough. Speeding up the CPU wouldn’t help; you need a fatter pipe. Running a large model on a GPU has exactly this shape. The GPU has tens of thousands of math units doing nothing most of the time. The bottleneck is dragging the model’s weights — tens of gigabytes — out of memory and into the math units, fast enough to keep them fed. That’s memory bandwidth.

If you’ve ever shopped for a GPU to run a local model, you’ve noticed something strange. A consumer card and a data-center card might have wildly different FLOPS numbers, but the tokens-per-second you actually get from a 70B model tracks something else almost perfectly: memory bandwidth, the number on the spec sheet measured in GB/s.

A consumer 4090 has roughly 1 TB/s of memory bandwidth. An H100 SXM has roughly 3.35 TB/s. Apple’s M2 Ultra has around 800 GB/s of unified memory bandwidth and is somehow competitive with discrete GPUs on local inference despite descending from a laptop-class chip lineage. Meanwhile the compute gap between these parts is much larger than the memory-bandwidth gap. So how come the slow-on-paper machines don’t fall further behind?

The answer is the question this post exists to answer: at inference time, your GPU is barely computing anything. It is reading. Generating a single token forces it to drag the entire model’s weights from VRAM through to its compute units, do a small amount of arithmetic, and throw the weights away. Token N+1 has to read all of them again. The throughput of that read pipe — bandwidth — is the actual ceiling.

Why it matters now

Almost every cost and performance question in modern LLM serving is secretly a memory-bandwidth question.

If your mental cost model for inference is “FLOPS in, tokens out,” it is predicting the wrong things. A bandwidth-first model predicts the right things, and explains a pile of otherwise mysterious engineering choices.

The short answer

LLM decode speed ≈ GPU memory bandwidth ÷ model size in bytes

To generate one token, the GPU has to stream every model weight from VRAM to its compute units. The compute itself is fast and finishes early; the weights take time to arrive. Tokens per second is, to a first approximation, how many times per second the GPU can drag the whole model across that pipe.
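To make the approximation concrete, here is a minimal sketch in Python. The bandwidth figures are the rough spec-sheet numbers quoted above; the model size (an 8B model at FP16) is an assumption picked purely for illustration.

```python
# Back-of-envelope decode ceiling: tokens/sec ≈ bandwidth / model bytes.
# Bandwidth figures are the rough spec-sheet numbers quoted above;
# the 8B-at-FP16 model size is an assumption for illustration.

def decode_ceiling_tokens_per_sec(bandwidth_bytes_per_sec: float,
                                  model_bytes: float) -> float:
    """Upper bound on decode speed: how many times per second the GPU
    can stream the full set of weights out of memory."""
    return bandwidth_bytes_per_sec / model_bytes

MODEL_BYTES = 8e9 * 2  # ~8B params at FP16 (2 bytes each) ≈ 16 GB

for name, bw in [("RTX 4090", 1.0e12), ("H100 SXM", 3.35e12), ("M2 Ultra", 0.8e12)]:
    ceiling = decode_ceiling_tokens_per_sec(bw, MODEL_BYTES)
    print(f"{name}: ~{ceiling:.0f} tokens/sec ceiling")
```

The ceilings scale directly with bandwidth, which is why the spec-sheet bandwidth number tracks real-world tokens-per-second so closely.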

How it works

Look at what actually happens in a single decode step (one new token):

  1. A small input — the new token’s vector — enters the model.
  2. For every layer, the GPU multiplies that vector by the layer’s weight matrices.
  3. The result becomes the input for the next layer.
  4. At the top, you sample a token.
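If it helps to see the shape of that loop in code, here is a deliberately stripped-down sketch: no attention, no norms, no nonlinearities, and toy sizes, just a small vector multiplied through a stack of large weight matrices.

```python
import numpy as np

# Toy decode step: one activation vector flows through a stack of layers,
# each reduced here to a single weight matrix. Real layers also do attention,
# norms, and nonlinearities; this keeps only the matvec shape.
HIDDEN = 1024     # toy hidden size, far smaller than a real model
N_LAYERS = 8      # toy layer count

rng = np.random.default_rng(0)
layers = [rng.standard_normal((HIDDEN, HIDDEN), dtype=np.float32) / HIDDEN ** 0.5
          for _ in range(N_LAYERS)]

x = rng.standard_normal(HIDDEN, dtype=np.float32)   # the new token's vector
for W in layers:
    x = W @ x    # every weight in W is read once, used once, and discarded

# x would now go through the output head to pick the next token (step 4).
```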

Step 2 is “matrix times vector” — matvec. This is the part that decides the speed, and it has a property worth staring at: the matrix is huge (the weights), and the vector is small (one token’s activations). Each weight is read from memory, used in one multiply-add, and then never touched again for this token.

That ratio — operations per byte loaded — is called arithmetic intensity. For matvec, it’s basically 1: one multiply-add per weight loaded. Modern GPUs need an arithmetic intensity in the dozens to hundreds before they become compute-bound rather than memory-bound. Decode is nowhere near that threshold. The compute units sit idle waiting for VRAM; the bandwidth pipe runs flat out.
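To put rough numbers on that threshold, the sketch below compares matvec intensity to the "ridge point" where a GPU flips from memory-bound to compute-bound. The ~1,000 TFLOPS dense FP16 figure used for the H100 is an approximation on my part, so treat it as an assumption rather than a spec.

```python
# Arithmetic intensity of a square matvec at FP16, versus the ridge point
# a GPU needs to cross before compute, not memory, becomes the limit.
# The ~1000 TFLOPS dense FP16 figure for an H100 SXM is an approximation.

n = 8192                      # hypothetical weight matrix dimension
flops = 2 * n * n             # one multiply + one add per weight
bytes_loaded = 2 * n * n      # each FP16 weight is 2 bytes, read once
intensity = flops / bytes_loaded
print(f"matvec arithmetic intensity: {intensity:.1f} FLOPs/byte")        # ~1

peak_flops = 1.0e15           # ~1000 TFLOPS dense FP16 (assumed)
peak_bw = 3.35e12             # ~3.35 TB/s HBM
ridge = peak_flops / peak_bw
print(f"intensity needed to be compute-bound: ~{ridge:.0f} FLOPs/byte")  # ~300
```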

You can sanity-check this with a back-of-envelope calculation. A 70B-param model at FP16 is ~140 GB of weights. On a single H100 with ~3.35 TB/s of HBM bandwidth, the absolute ceiling on per-token decode is roughly:

3.35 TB/s ÷ 140 GB ≈ 24 tokens/sec

That’s a hard upper bound from physics, ignoring all overhead. Real systems land somewhere below it. (A 70B model doesn’t even fit on a single H100 in FP16, so in practice you’d shard or quantize, but the shape of the calculation is what matters.) Notice what’s not in that calculation: the GPU’s FLOPS rating. It doesn’t appear because it isn’t the bottleneck.
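Here is the same envelope as code, deriving the byte count from the parameter count, plus the quantized variant the parenthetical hints at. The 0.5 bytes per parameter for 4-bit is a rough figure that ignores quantization overhead.

```python
# Same back-of-envelope calculation, with model bytes derived from the
# parameter count, and a rough 4-bit variant (overhead ignored).
params = 70e9
h100_bandwidth = 3.35e12                      # bytes/sec

for label, bytes_per_param in [("FP16", 2.0), ("4-bit (approx.)", 0.5)]:
    model_bytes = params * bytes_per_param
    ceiling = h100_bandwidth / model_bytes
    print(f"{label}: {model_bytes / 1e9:.0f} GB of weights, "
          f"~{ceiling:.0f} tokens/sec ceiling")
```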

Now contrast with prefill — processing the prompt before any tokens come out. Prefill multiplies weights against many tokens at once (the whole prompt), so each weight gets reused across all those tokens. Arithmetic intensity goes up, the compute units actually get used, and prefill is genuinely compute-bound. This is why “time to first token” (prefill-bound) and “tokens per second after that” (decode-bound) live on different curves and respond to different optimizations. The same GPU that is FLOPS-bound during prefill is bandwidth-bound during decode.
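You can see the crossover directly by reusing the intensity arithmetic from above: intensity is roughly "tokens processed per weight load," so a long prompt clears the ridge point and a single decode token does not. The matrix dimension and the H100 figures are the same assumptions as before, and activation traffic is ignored for simplicity.

```python
# Weight-side arithmetic intensity when one n x n FP16 weight matrix is
# applied to `tokens` activation vectors at once. Activation reads/writes
# are ignored; they are small relative to the weights when tokens << n.
ridge = 1.0e15 / 3.35e12      # ~300 FLOPs/byte (assumed H100 figures)

def weight_intensity(tokens: int, n: int = 8192) -> float:
    flops = 2 * n * n * tokens       # one multiply-add per weight per token
    weight_bytes = 2 * n * n         # FP16 weights loaded once, reused per token
    return flops / weight_bytes      # ≈ tokens

for tokens in [1, 32, 512, 2048]:
    bound = "compute-bound" if weight_intensity(tokens) > ridge else "bandwidth-bound"
    print(f"{tokens:5d} tokens per weight load -> "
          f"intensity ~{weight_intensity(tokens):.0f} FLOPs/byte ({bound})")
```

A 2,048-token prompt lands well past the ridge point; a single decode token sits at roughly 1 FLOP/byte, which is why the two phases live on different curves.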

Why batching breaks the rule (and why it has limits)

Batching multiple users’ requests together is the closest thing to a free lunch the bandwidth model allows. If 32 users each want a token, you can load the weights once and do 32 independent matvecs against them, turning matvec into matmat. Arithmetic intensity goes up by 32×. Until, that is, something else fills up — usually the KV cache, which grows per request and per token and eventually pushes you back into a bandwidth wall, just on different data. Memory bandwidth is the dominant constraint; batching just lets you share the bill.
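A rough sketch of where that wall sits. The architecture numbers below (80 layers, 8 KV heads with grouped-query attention, head dimension 128, FP16 cache) are assumptions in the neighborhood of a Llama-2-70B-style model, not something this post pins down.

```python
# Why batching eventually hits a different memory wall: the KV cache grows
# per request and per token, and at decode time it also has to be streamed
# from VRAM every step. Architecture numbers (roughly Llama-2-70B-like,
# with grouped-query attention) are assumptions for illustration.
n_layers   = 80
n_kv_heads = 8
head_dim   = 128
bytes_per  = 2        # FP16 cache

kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per   # K and V
print(f"KV cache per token: ~{kv_bytes_per_token / 1024:.0f} KiB")

seq_len, batch = 4096, 32
kv_total = kv_bytes_per_token * seq_len * batch
print(f"KV cache for {batch} requests at {seq_len} tokens: ~{kv_total / 1e9:.0f} GB")
```

Under these assumptions a 32-way batch at 4K context carries tens of gigabytes of cache that must be streamed every decode step on top of the weights: the bandwidth wall again, just on different data.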

Where this gets fuzzy

The point isn’t that FLOPS don’t matter. They matter for training, prefill, and image/video models with very different intensity profiles. The point is that the mental model “compute = speed” comes from a world that isn’t the one we’re in for LLM inference.

Going deeper