Why VRAM is the bottleneck for LLM serving
It's not FLOPS, it's not network, it's not the CPU. The thing that decides whether your model fits and how many users you can serve is a number printed on the GPU's spec sheet — and three things fight to consume it.
Why it exists
The first time you try to run a real model on a real GPU, the error you get is not “too slow.” It’s not “ran out of compute.” It’s CUDA out of memory. You read the message twice, check nvidia-smi, and discover that the box has plenty of CPU RAM, plenty of disk, plenty of cores doing nothing — and the one number that mattered, the VRAM on the GPU itself, is full.
That experience scales. Whole serving stacks, pricing models, and research agendas exist because VRAM is small, expensive, and the thing every part of an LLM inference pipeline simultaneously wants more of. It’s not that VRAM is the only thing that matters — bandwidth, FLOPS, interconnect all matter — it’s that VRAM is the one that decides whether your job runs at all, and after that, it’s the one that decides how many users you can serve from one box. Compute you can wait for. Memory you can’t conjure.
To make this concrete: an NVIDIA H100 SXM ships with 80 GB of HBM3. Llama 3.1 70B in FP16 is roughly 140 GB just for weights. The flagship inference GPU of the last cycle cannot fit one copy of a 70B model at native precision. Every serving decision after that point — quantize? shard across two GPUs? offload to CPU? rent the H200 with 141 GB instead? — is a direct consequence of that mismatch.
Why it matters now
Three things compete for the same VRAM, and the three of them together explain almost every weird number you’ll see in an inference engine’s config:
- The model weights. Fixed cost, paid once per replica. Bigger model, more weights. Lower precision (FP16 → INT8 → INT4) shrinks this by integer factors but with quality risk.
- The KV cache. Per-token, per-request state. Grows with context length and number of concurrent users. This is the variable cost.
- Activations and workspace. The intermediate tensors a forward pass needs, plus framework overhead. Smaller than the other two, but non-zero.
Every one of those three eats from the same 80 GB (or 141 GB, or 192 GB) pool. The optimizations you’ve heard of — quantization, mixture of experts, paged attention, continuous batching, KV-cache compression, model parallelism — are all attacks on one of those three. There isn’t a separate “make it cheap” lever. The lever is “use less VRAM.”
This also shapes the public economics. Hosted LLM pricing is denominated in tokens, but the unit cost behind those tokens is “GPU-hour.” How many concurrent users a GPU-hour can serve is determined almost entirely by how much VRAM is left after the weights load. That’s why providers care so much about KV-cache efficiency: every byte you don’t spend on KV is a byte you can spend on another user’s KV.
The short answer
VRAM budget = weights + KV cache + activations — and the GPU runs at full speed only on data that lives in VRAM.
GPUs can read their own VRAM at terabytes per second; reading anything else (CPU RAM over PCIe, disk, the next GPU over NVLink) is at least an order of magnitude slower per byte. Anything you spill out of VRAM, you pay for on every forward pass. So the practical rule is: whatever you want to serve has to fit, with all three components above summing to less than the card’s capacity, or you take a step-function penalty.
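That budget rule is simple enough to state as a one-line fit check. The numbers below are illustrative assumptions pulled from this section (INT8 70B weights, a modest KV and overhead allowance, an 80 GB card), not measurements from any particular engine:

```python
def fits(weights_gib, kv_gib, activations_gib, capacity_gib=80):
    """True if the three VRAM buckets sum to less than the card's capacity."""
    return weights_gib + kv_gib + activations_gib <= capacity_gib

# Llama 3.1 70B in INT8 (~70 GB weights) on an H100 (80 GB):
print(fits(70, 8, 2))    # True  -> fits, barely
# The same model in FP16 (~141 GB of weights) on the same card:
print(fits(141, 8, 2))   # False -> step-function penalty territory
```

There is no partial credit in this check, which is the point: crossing the capacity line does not slow you down proportionally, it pushes you onto a different, much slower path.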
How it works
Three buckets, one budget. Let’s size them.
Bucket 1 — weights
A model with P parameters at b bytes per parameter takes roughly P × b bytes. Llama 3.1 70B has about 70.6B parameters; in FP16 that’s 70.6e9 × 2 ≈ 141 GB. In INT8: ~70 GB. In INT4: ~35 GB.
This is the floor of your VRAM cost. It does not depend on how many users are connected, how long their prompts are, or how hot it is in the datacenter. You pay it the moment you load the model and you keep paying it until the process exits.
The weights number is also why the jump from H100 (80 GB) to H200 (141 GB HBM3e, per NVIDIA’s spec sheet) isn’t a routine refresh — it’s the difference between “70B FP16 doesn’t fit on one card” and “70B FP16 fits, with about 1 GB to spare for a single short conversation, no batching.” Anything bigger than that and you’re sharding across GPUs anyway, which is its own cost.
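The P × b arithmetic is worth scripting once. A sketch using the parameter count from the text; the byte widths are the standard storage costs for each precision:

```python
P = 70.6e9  # Llama 3.1 70B parameter count, per the text

def weight_gb(params, bytes_per_param):
    # Decimal GB, to match how spec sheets quote capacity.
    return params * bytes_per_param / 1e9

for name, b in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    print(f"{name}: {weight_gb(P, b):.1f} GB")
# FP16: 141.2 GB, INT8: 70.6 GB, INT4: 35.3 GB
```

Run against an 80 GB card, only the INT4 row leaves meaningful room for KV cache, which is the whole quantization story in one loop.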
Bucket 2 — KV cache
This is the bucket that grows.
For a transformer, each token in a sequence has to remember its key and value vectors at every layer so future tokens can attend back to it without recomputing. That memory is the KV cache. Per token, the size is roughly:
kv_per_token = 2 × num_layers × num_kv_heads × head_dim × bytes_per_elem
The leading 2 is for K and V. For Llama 2 7B in FP16, this works out to about 0.5 MB per token (32 layers × 32 heads × 128 dims × 2 × 2 bytes), which matches the Baseten inference guide’s numbers.
A 4,096-token context for a single user therefore eats around 2 GB of VRAM in KV cache for a 7B model. Scale up to 70B-class models and longer contexts and the per-user KV-cache footprint reaches the multi-gigabyte range fast — even with the GQA trick that modern Llamas use to compress num_kv_heads.
Multiply by your concurrent user count and you can see the squeeze. On an H100 hosting a 70B model in INT8 (~70 GB of weights), there is maybe 8–10 GB of room left for everyone’s KV cache combined. That cap is exactly what determines how many simultaneous users a single GPU can hold. There’s no compute reason it couldn’t do more — there’s nowhere to put their state.
Bucket 3 — activations and overhead
The intermediate tensors of a forward pass — the residual stream, the attention scores, the MLP intermediates — also live in VRAM. For inference (no backward pass, no optimizer state) this is much smaller than weights or KV. Frameworks add their own overhead: CUDA context, allocator fragmentation, NCCL buffers, the inference engine’s own bookkeeping. Call it single-digit gigabytes on a big card; non-zero, not the headline.
The honest version: I don’t have a public, authoritative number for “activations + framework overhead per request” that holds across vLLM, TensorRT-LLM, and SGLang at once. It depends on max-batch-size config, chunked-prefill settings, and how aggressive the engine is about reusing scratch space. Treat this bucket as “small but not zero, and the part of your budget that surprises you.”
Why VRAM specifically, not “just use system RAM”
A modern datacenter GPU’s HBM bandwidth is in the multi-TB/s range — H100 SXM advertises ~3.35 TB/s, H200 about 4.8 TB/s (NVIDIA H200 datasheet). Host RAM over PCIe Gen5 x16 tops out around 64 GB/s peak, NVLink between GPUs is a few hundred GB/s. Decode is memory-bandwidth-bound: every token generated streams the entire model’s weights through the compute units once. If those weights live in VRAM you’re at HBM speed; if any fraction lives in host RAM you’re at PCIe speed for that fraction, and your tokens-per-second collapses by the ratio.
So “spill the model to CPU” is not a graceful degradation. It’s a cliff. This is why VRAM, specifically, is the bottleneck — not “memory” in the abstract. The compute units of the GPU can only run at speed on data that lives in their own backyard.
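The cliff can be sized with a roofline-style sketch: if every decoded token streams all the weight bytes once, tokens/sec is capped at bandwidth divided by weight bytes. The bandwidth figures are the spec-sheet numbers quoted above; real engines land below these ceilings, and batching amortizes the weight reads across users:

```python
def decode_ceiling_tps(weight_bytes, bandwidth_bytes_per_s):
    # Upper bound: one full pass over the weights per generated token.
    return bandwidth_bytes_per_s / weight_bytes

w_int8_70b = 70e9   # ~70 GB of INT8 weights
hbm = 3.35e12       # H100 SXM HBM3, ~3.35 TB/s
pcie = 64e9         # PCIe Gen5 x16, ~64 GB/s

print(f"{decode_ceiling_tps(w_int8_70b, hbm):.0f} tok/s from HBM")    # ~48
print(f"{decode_ceiling_tps(w_int8_70b, pcie):.1f} tok/s from PCIe")  # ~0.9
```

A ~50× ratio between the two ceilings is the quantitative version of “not a graceful degradation”: even spilling a fraction of the weights drags the whole pass toward the PCIe number.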
Where the seams show
A few honest caveats:
- Quantization is not free. INT4 quantization shrinks weights by 4× vs FP16, but the quality cost depends heavily on the model and the quantization scheme (GPTQ, AWQ, SmoothQuant, etc.). The right way to think about it: you are buying VRAM with a quality budget whose size depends on benchmarks you should actually run.
- MoE shifts the problem, doesn’t dissolve it. A mixture-of-experts model has a smaller active parameter count per token but a larger total parameter count, and all the experts have to live in VRAM somewhere — usually sharded across GPUs — because you don’t know ahead of time which expert will be picked.
- The KV cache numbers above assume FP16 KV. Modern engines often store KV in FP8 or INT8 to halve or quarter this, at some quality cost. If you’ve seen a “we doubled context length on the same GPU” headline, KV-cache quantization is usually the lever.
- VRAM capacity scales much slower than parameter counts. GPU memory has roughly doubled per generation; frontier-model parameter counts have grown faster than that for years. The gap is structural, not a marketing artifact, which is why HBM stacking and multi-GPU sharding are permanent features of the landscape, not transitional hacks.
Famous related terms
- KV cache = stored attention K/V tensors per token + per layer + per request — the variable cost in the VRAM budget; the reason concurrent-user count is capped.
- HBM = stacked DRAM dies + wide interface + on-package placement — the physical reason VRAM has TB/s bandwidth; also the reason it’s expensive and small.
- Memory-bandwidth bound = compute waits on bytes arriving + not on math finishing — explains why “spill to system RAM” isn’t graceful.
- Quantization = lower-precision weights + carefully managed quality loss — the most direct way to buy VRAM headroom; INT8 / INT4 / FP8 variants.
- PagedAttention = OS-style paging applied to the KV cache — kills the fragmentation that would otherwise cap how many KV bytes you can actually use; vLLM’s contribution.
- Tensor parallelism = split each weight matrix across N GPUs + sync after each layer — what you do when one card’s VRAM isn’t enough; pays for it in interconnect bandwidth.
- Mixture of experts = many experts + sparse routing per token — bigger total VRAM footprint, smaller per-token compute; reshapes the budget.
Going deeper
- NVIDIA — Mastering LLM Techniques: Inference Optimization. Blog post. Walks through the weights / KV / activations split with concrete numbers.
- Baseten — A guide to LLM inference and performance. Blog post. The clearest practitioner write-up of the VRAM math, including the per-token KV-cache formula.
- Kwon et al. — Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023). arXiv. The vLLM paper; explains how much of the practical VRAM ceiling was actually fragmentation.
- NVIDIA — H100 and H200 product pages. H100, H200. The capacity and bandwidth numbers behind every “does it fit?” calculation.