Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

What does 'X parameters' mean in an LLM?

Llama 3.1 70B, DeepSeek-V3 671B, Phi-4 14B — what is that number actually counting, and why is it the headline figure on every model release?

AI & ML · intro · May 4, 2026

Why it exists

Pick any model on a leaderboard, a download page, or an AI news headline and there’s a number stapled to its name. Llama 3.1 70B. DeepSeek-V3 671B. Phi-4 14B. GPT-3 175B. The “B” — for billion — is doing real work: it’s how people compare models at a glance, why larger model SKUs cost more per token, how engineers decide whether the thing fits on their GPU. But the number is rarely explained. What is it counting?

It’s the parameter count: the total number of values inside the model that were learned during training, not written down by a human. When a release says “70B,” it means the model file on disk holds roughly 70 billion such numbers, each one shaped by gradients from billions or trillions of training tokens until next-token prediction got better.

The reason this number — instead of, say, lines of code or layer count — became the headline is that nearly every other interesting property of the model scales with it: how much memory you need to load the weights, how many floating-point operations per generated token, how well the model generalizes. The thing the field figured out around 2020 (see scaling laws) is that, for transformer LLMs trained on next-token prediction, the parameter count is mostly the dial. Architecture details — depth vs. width, head count, exact vocabulary — turn out to be second-order corrections inside a wide range. So everyone gravitated to printing the parameter count on the box.

Why it matters now

Three places the number lands as a real engineering constraint, not a marketing figure:

- Memory: whether the weights fit on the GPU (or cluster) you actually have.
- Compute: how many floating-point operations each generated token costs, and with them latency and energy.
- Price: why the larger SKU in a model family costs more per token; it costs more to serve.

The short answer

parameter = one learned number inside the model

A parameter count is the total number of values (weights and biases) inside the neural network that were set by training rather than by a human writing code. “Llama 3.1 70B” means there are about 70 billion such numbers in the model file. Most of them live in matrices used by the transformer’s feed-forward and attention layers; multiplying input vectors through those matrices is most of what running the model means.
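To make "learned number" concrete, here's a minimal sketch in PyTorch (my choice of framework, not something the post prescribes): a toy two-layer network, and the one-liner practitioners use to count its parameters.

```python
import torch.nn as nn

# A toy two-layer network: every weight and bias below is a "parameter",
# initialized randomly and then adjusted by gradient descent during training.
model = nn.Sequential(
    nn.Linear(512, 2048),  # weight: 2048 x 512, bias: 2048
    nn.ReLU(),             # no parameters: just a fixed function
    nn.Linear(2048, 512),  # weight: 512 x 2048, bias: 512
)

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters")  # 2,099,712 -> about 2.1M
```

Run the same counting loop over a 70B checkpoint and you get the headline number; the matrices are just vastly larger and arranged into transformer layers.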

How it works

Three questions worth separating: where the parameters live, what they cost, and what gets counted.

Where the parameters live

A transformer LLM is mostly stacks of matrix multiplications. For a model with hidden dimension d and L layers, each layer contains roughly:

- Attention: four d × d projection matrices (query, key, value, output), for 4d² parameters.
- Feed-forward (FFN): an up-projection of shape d × 4d and a down-projection of shape 4d × d, for 8d² parameters.
- Biases and layer-norm scales: a handful of length-d vectors, which round to zero at this scale.

Adding up: a “vanilla” transformer layer is roughly 12d² parameters. With L layers, the body of the model is ~12 · L · d². Embeddings and output head add ~2 · V · d where V is the vocab size; for big vocabularies (100k+) this is a couple of percent of the total.

For Llama 3.1 70B (d = 8192, L = 80, V ≈ 128k), the back-of-envelope is 12 × 80 × 8192² ≈ 64B from the body plus a couple of billion for embeddings — not exactly 70B, but close enough that the math is recognizably doing the right thing. The remaining gap comes from architecture specifics (SwiGLU’s three matrices, the precise FFN width, untied output embeddings). The full architecture table is in the Llama 3 paper; the formula above is the part worth carrying in your head.
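The back-of-envelope fits in a few lines of Python. The shapes below are Llama 3.1 70B's published configuration; the function is a sketch of the approximation above, not the exact architecture (it ignores SwiGLU's third matrix and the precise FFN width, which is where the missing few billion live).

```python
def approx_params(d: int, n_layers: int, vocab: int) -> int:
    # Body: each layer ~ 4d^2 (attention) + 8d^2 (FFN) = 12d^2.
    body = 12 * n_layers * d**2
    # Input embeddings plus untied output head: ~ 2 * V * d.
    embeddings = 2 * vocab * d
    return body + embeddings

# Llama 3.1 70B-ish shape: d = 8192, L = 80, V ~ 128k.
n = approx_params(d=8192, n_layers=80, vocab=128_000)
print(f"{n / 1e9:.1f}B")  # ~66.5B, close to but under the headline 70B
```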

The takeaway: the FFN matrices hold most of the parameters — typically around two-thirds of the body. This is why the FFN is what mixture-of-experts replaces. That’s where the budget is.

What the parameters cost

Two physical costs scale linearly with parameter count:

- Memory. At 16-bit precision each parameter takes 2 bytes, so just loading an N-parameter model takes ~2N bytes (70B → ~140 GB), before activations and KV cache claim their share.
- Compute. Generating a token pushes an activation through every weight once, at roughly 2 FLOPs per parameter (one multiply, one add): ~2N FLOPs per token for a dense model.

The 2N-per-token relationship is why inference cost is so legible. Doubling the parameter count roughly doubles the per-token cost. It’s also why mixture-of-experts is interesting: an MoE model only fires a subset of its parameters per token, so total params and active params come apart.
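A sketch of both costs, using the 2-bytes-per-parameter and 2-FLOPs-per-parameter rules of thumb from above. The DeepSeek-V3 active count (~37B) is taken from its release materials; the point of the comparison is that memory follows total parameters while per-token compute follows active ones.

```python
def weights_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Weights-only memory at 16-bit precision (no KV cache, no activations)."""
    return n_params * bytes_per_param / 1e9

def flops_per_token(n_active: float) -> float:
    """Forward-pass cost: ~2 FLOPs per *active* parameter per generated token."""
    return 2 * n_active

for name, total, active in [
    ("Llama 3.1 70B (dense)", 70e9, 70e9),
    ("DeepSeek-V3 (MoE)",     671e9, 37e9),
]:
    print(f"{name}: {weights_gb(total):.0f} GB of weights, "
          f"{flops_per_token(active) / 1e9:.0f} GFLOPs/token")
```

The MoE row is the striking one: DeepSeek-V3 needs roughly ten times the memory of Llama 3.1 70B but costs about half as much compute per token.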

What gets counted

This is where the number on the box gets slippery.

- Dense vs. MoE. A dense model fires every parameter on every token, so one number tells the whole story. A mixture-of-experts model routes each token through a few experts, so there are two numbers: DeepSeek-V3 is 671B total but only ~37B active per token. Memory scales with the total; per-token compute scales with the active count.
- Total vs. non-embedding. Some papers (the original scaling-laws work among them) count only non-embedding parameters, since embedding tables grow with the vocabulary rather than with the network body.

So when you read “X B parameters,” ask: is it dense or MoE? (If MoE: total or active?) Total or non-embedding? For most dense model cards the headline is total parameters unless stated otherwise, and the math above lines up.
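If you have the weights on disk, you don't have to trust the box. Safetensors files carry every tensor's shape in a header, so you can total the parameters without loading a single weight. A sketch, assuming the `safetensors` package and a single-file checkpoint (the filename is a placeholder):

```python
from math import prod

from safetensors import safe_open

total = 0
with safe_open("model.safetensors", framework="pt") as f:  # placeholder path
    for name in f.keys():
        shape = f.get_slice(name).get_shape()  # read shape from header only
        total += prod(shape)

print(f"{total / 1e9:.2f}B parameters")
```

Sharded checkpoints split the weights across several files; run the same loop over each shard and sum.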

Going deeper