What does 'X parameters' mean in an LLM?
Llama 3.1 70B, DeepSeek-V3 671B, Phi-4 14B — what is that number actually counting, and why is it the headline figure on every model release?
Why it exists
Pick any model on a leaderboard, a download page, or an AI news headline and there’s a number stapled to its name. Llama 3.1 70B. DeepSeek-V3 671B. Phi-4 14B. GPT-3 175B. The “B” — for billion — is doing real work: it’s how people compare models at a glance, why larger model SKUs cost more per token, how engineers decide whether the thing fits on their GPU. But the number is rarely explained. What is it counting?
It’s the parameter count: the total number of values inside the model that were learned during training, not written down by a human. When a release says “70B,” it means the model file on disk holds roughly 70 billion such numbers, each one shaped by gradients from billions or trillions of training tokens until next-token prediction got better.
The reason this number — instead of, say, lines of code or layer count — became the headline is that nearly every other interesting property of the model scales with it: how much memory you need to load the weights, how many floating-point operations per generated token, how well the model generalizes. The thing the field figured out around 2020 (see scaling laws) is that, for transformer LLMs trained on next-token prediction, the parameter count is mostly the dial. Architecture details — depth vs. width, head count, exact vocabulary — turn out to be second-order corrections inside a wide range. So everyone gravitated to printing the parameter count on the box.
Why it matters now
Three places the number lands as a real engineering constraint, not a marketing figure:
- Memory. Each parameter is one floating-point number. In bf16 (the standard training format) that’s 2 bytes. So a 70B model takes ~140 GB just to hold the weights — too big for one H100 (80 GB), fine across two. Quantization to int4 brings the same weights down to ~35 GB. The parameter count is what makes that arithmetic go.
- Inference cost. Each generated token costs roughly `2N` FLOPs on a dense model with `N` parameters. At 70B, that’s ~140 billion FLOPs per token. This is why the same prompt is dramatically faster on an 8B model than a 70B one — and why “active” parameters matter so much for mixture-of-experts models. (Both this and the memory arithmetic are sketched in code right after this list.)
- Capability, loosely. Bigger models, trained well, learn more. The relationship is messy at the edges (a well-trained 8B beats a poorly trained 30B all day), but on broad public benchmarks 1B-class models still don’t usually match well-trained 70B-class ones. The parameter count is a rough proxy for capability — useful for ballpark expectations, misleading if you use it to compare two well-engineered models in different families.
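A minimal sketch of that arithmetic (plain Python, nothing else needed); the byte sizes and the `2N` figure are the rules of thumb from the bullets above, not measurements of any particular runtime:

```python
N = 70e9  # a 70B-class dense model

def weight_gb(n_params: float, bytes_per_param: float) -> float:
    """GB needed just to hold the weights at a given precision."""
    return n_params * bytes_per_param / 1e9

print(f"bf16 weights: {weight_gb(N, 2.0):.0f} GB")  # ~140 GB: needs two 80 GB H100s
print(f"int4 weights: {weight_gb(N, 0.5):.0f} GB")  # ~35 GB: fits on one

# Dense forward pass: each parameter does one multiply and one add per token.
print(f"FLOPs per generated token: {2 * N:.1e}")    # ~1.4e11
```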
The short answer
`parameter = one learned number inside the model`
A parameter count is the total number of values (weights and biases) inside the neural network that were set by training rather than by a human writing code. “Llama 3.1 70B” means there are about 70 billion such numbers in the model file. Most of them live in matrices used by the transformer’s feed-forward and attention layers; multiplying input vectors through those matrices is most of what running the model means.
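To make “learned number” concrete, here is the smallest possible example, a sketch assuming PyTorch is installed; one linear layer already shows exactly what gets counted:

```python
import torch.nn as nn

# One linear layer mapping 4 inputs to 3 outputs: a 3x4 weight matrix
# plus 3 biases. Every entry is a parameter, a number set by training.
layer = nn.Linear(4, 3)

n_params = sum(p.numel() for p in layer.parameters())
print(n_params)  # 15 = 4*3 weights + 3 biases
```

The same one-liner, `sum(p.numel() for p in model.parameters())`, is how you would count a real checkpoint; the 70B headline is that sum taken over a much larger module.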
How it works
Three questions worth separating: where the parameters live, what they cost, and what gets counted.
Where the parameters live
A transformer LLM is mostly stacks of matrix multiplications. For a model with hidden dimension `d` and `L` layers, each layer contains roughly:
- Attention projections — four matrices (Q, K, V, output), each ~`d × d` in vanilla multi-head attention. That’s ≈`4d²` parameters per layer. Modern variants like grouped-query attention shrink the K and V projections, so the real number is a bit less.
- Feed-forward / MLP — two matrices, with hidden dim typically ~`4d`, giving ~`8d²` parameters per layer. SwiGLU and similar variants use three matrices instead of two; the constant changes but the order is the same.
- Layer norms and biases — `O(d)` per layer. Basically rounding error. (The sketch just below checks the first two counts against real layers.)
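A quick sanity check of those per-layer counts, using plain PyTorch layers as stand-ins; a small `d` keeps it cheap, and the ratios are what matter:

```python
import torch.nn as nn

d = 512  # small stand-in for the hidden dimension

# Vanilla attention: four d-by-d projection matrices (Q, K, V, output).
attn = nn.ModuleList([nn.Linear(d, d, bias=False) for _ in range(4)])

# Vanilla FFN: up-project to 4d, back down to d.
ffn = nn.Sequential(nn.Linear(d, 4 * d, bias=False),
                    nn.Linear(4 * d, d, bias=False))

count = lambda module: sum(p.numel() for p in module.parameters())
print(count(attn) / d**2)  # 4.0 -> the ~4d² attention term
print(count(ffn) / d**2)   # 8.0 -> the ~8d² FFN term
```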
Adding up: a “vanilla” transformer layer is roughly `12d²` parameters. With `L` layers, the body of the model is ~`12 · L · d²`. Embeddings and output head add ~`2 · V · d` where `V` is the vocab size; for big vocabularies (100k+) this is a couple of percent of the total.
For Llama 3.1 70B (`d = 8192`, `L = 80`, `V ≈ 128k`), the back-of-envelope is `12 × 80 × 8192² ≈ 64B` from the body plus a couple of billion for embeddings — not exactly 70B, but close enough that the math is recognizably doing the right thing. The remaining gap comes from architecture specifics (SwiGLU’s three matrices, the precise FFN width, untied output embeddings). The full architecture table is in the Llama 3 paper; the formula above is the part worth carrying in your head.
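The same back-of-envelope as a function, a sketch of the vanilla-transformer formula with the Llama 3.1 70B shapes plugged in (real architectures shift the constants, as noted above):

```python
def approx_params(d: int, n_layers: int, vocab: int) -> float:
    """Vanilla-transformer estimate: ~12·d² per layer, plus embeddings.
    GQA, SwiGLU, and exact FFN widths shift the constants in practice."""
    body = 12 * n_layers * d**2    # ~4d² attention + ~8d² FFN per layer
    embeddings = 2 * vocab * d     # input embedding + untied output head
    return body + embeddings

# Llama 3.1 70B shapes: d = 8192, L = 80, V ≈ 128k.
print(f"{approx_params(8192, 80, 128_000) / 1e9:.1f}B")  # ~66.5B vs. the actual ~70B
```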
The takeaway: the FFN matrices hold most of the parameters — typically around two-thirds of the body. This is why the FFN is what mixture-of-experts replaces. That’s where the budget is.
What the parameters cost
Two physical costs scale linearly with parameter count:
- Storage / memory — bytes per param × N. In bf16, 2 bytes. In fp8, 1 byte. In int4, 0.5 bytes. The model file size and the VRAM needed to hold the weights are direct multiplications of N. During training you also pay for optimizer state, which scales directly with N (Adam keeps about 2 extra numbers per parameter). The KV cache and activations scale with batch and sequence length and the model’s hidden dim — not directly with N.
- FLOPs per token — roughly `2N` for a forward pass on a dense model. Each weight participates in one multiply and one add per token of input. The backward pass during training is about another `4N`, which is where the rule of thumb “training a model on `D` tokens costs `≈6ND` FLOPs” used in scaling-law papers comes from. (Worked through in the sketch below.)
The `2N`-per-token relationship is why inference cost is so legible. Doubling the parameter count roughly doubles the per-token cost. It’s also why mixture-of-experts is interesting: an MoE model only fires a subset of its parameters per token, so total params and active params come apart.
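And the training-side rule of thumb as arithmetic; the 15T-token figure is roughly the Llama 3 training budget, used here only for illustration:

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Scaling-law rule of thumb: ~6·N·D FLOPs total
    (2N forward + ~4N backward, per training token)."""
    return 6 * n_params * n_tokens

# 70B parameters trained on ~15T tokens (roughly the Llama 3 recipe):
print(f"{train_flops(70e9, 15e12):.1e} FLOPs")  # ~6.3e24
```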
What gets counted
This is where the number on the box gets slippery.
- Embeddings. Some early scaling-law work (Kaplan et al., 2020) reports non-embedding parameter counts, on the grounds that embeddings don’t scale the same way as the rest of the network. Most modern model cards report total parameters, embeddings included. The two numbers can differ by a few percent — more for small models with big vocabularies.
- Tied vs. untied embeddings. Some models share weights between the input embedding and the output projection (tied); some don’t (untied). Tied counts those parameters once, untied counts them twice. Two architecturally similar models can land at different totals just from this.
- Active vs. total params (MoE). A mixture-of-experts model has many expert FFNs but routes only a couple per token. DeepSeek-V3 has 671B total parameters but only ~37B active per token. “DeepSeek-V3 671B” looks like a 671B model on the shelf and costs more like a 37B model to run per token. The headline number stops being a single-axis comparison the moment MoE enters.
- Quantization. A 70B model run in int4 still has 70B parameters — the count doesn’t change — but each parameter takes fewer bytes (and may be slightly less accurate). Quantization changes bytes-per-param, not N.
So when you read “X B parameters,” ask: is it dense or MoE? (If MoE: total or active?) Total or non-embedding? For most dense model cards the headline is total parameters unless stated otherwise, and the math above lines up.
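To see how far total and active come apart in practice, a sketch using DeepSeek-V3’s published numbers; memory is paid on the total (every expert must be resident to route to), per-token compute on the active subset:

```python
total_params = 671e9   # DeepSeek-V3 total: all experts, resident in memory
active_params = 37e9   # parameters the router actually fires per token

# Memory scales with TOTAL parameters (fp8 here: 1 byte each).
print(f"weights at fp8: {total_params * 1.0 / 1e9:.0f} GB")  # ~671 GB

# Per-token compute scales with ACTIVE parameters: ~2·N_active FLOPs.
print(f"FLOPs per token: {2 * active_params:.1e}")           # ~7.4e10, 37B-class
```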
Famous related terms
- Weights — `weights = parameters in matrices, learned by training` — often used as a synonym for “parameters.” Strictly, weights are the matrix entries and biases are the per-row offsets; both count toward the parameter total.
- Active parameters (MoE) — `active params = params actually used per token` — the relevant cost number for mixture-of-experts models, often much smaller than the total.
- FLOPs per token — `FLOPs/token ≈ 2N` (dense, forward pass) — the rule of thumb that turns parameter count into inference cost.
- bf16 / fp8 / int4 — `bytes per param = how the parameter is stored` — quantization shrinks bytes per param without changing the parameter count itself.
- Scaling laws — `loss falls as a power law in (N, D, C)` — the empirical fact that made parameter count the headline number in the first place. See why scaling laws exist.
- VRAM — `VRAM needed ≈ bytes/param × N + KV cache + activations` — why parameter count is the first thing checked when GPU memory is the bottleneck. (Estimated concretely in the sketch after this list.)
- Mixture of experts — `MoE = many FFN experts + a router` — decouples total params from active params. The model can be huge on disk and cheap per token at the same time.
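A rough serving-memory sketch of that VRAM line: weights plus KV cache, activations ignored for brevity. The layer and head shapes below are the published Llama 3.1 70B ones; the sequence length and batch size are arbitrary assumptions:

```python
def serving_gb(n_params, bytes_per_param, n_layers, n_kv_heads, head_dim,
               seq_len, batch, kv_bytes=2):
    """VRAM ≈ weights + KV cache (activations ignored for brevity)."""
    weights = n_params * bytes_per_param
    # KV cache: two tensors (K and V) per layer, per cached token.
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * kv_bytes
    return (weights + kv_cache) / 1e9

# Llama 3.1 70B: 80 layers, 8 KV heads (GQA), head_dim 128, bf16 throughout.
print(f"{serving_gb(70e9, 2, 80, 8, 128, seq_len=8192, batch=1):.0f} GB")  # ~143
```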
Going deeper
- Attention Is All You Need — Vaswani et al., 2017, the canonical reference for which matrices a transformer layer is actually made of.
- Andrej Karpathy’s Let’s build GPT — the fastest way to turn `12·L·d²` from a formula into something you’ve felt, by building a tiny transformer from scratch and watching the parameter count grow.
- The Llama 3 Herd of Models — Meta, 2024, with the architecture tables for a real 8B/70B/405B model if you want to see where modern variants deviate from the rough formula.
12·L·d²from a formula into something you’ve felt, by building a tiny transformer from scratch and watching the parameter count grow. - The Llama 3 Herd of Models — Meta, 2024, with the architecture tables for a real 8B/70B/405B model if you want to see where modern variants deviate from the rough formula.