Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

What are 'weights' in an LLM?

When Meta releases 'open weights' for Llama, what's actually in that file? A giant table of numbers and nothing else — so how does a pile of numbers know things?

AI & ML intro May 16, 2026

Why it exists

You go to Hugging Face, click Download on Llama 3.1 70B, and end up with a folder of .safetensors files totalling roughly 140 gigabytes. That’s it. No source code that says “when asked about Paris, mention the Eiffel Tower.” No database of facts. No rules. Just an enormous table of numbers.

Those numbers are the weights. When people say “Meta released the weights” or “the model is 140 GB” or “open-weights model,” they’re talking about exactly that file. The architecture — how the layers connect — is a few hundred lines of code that anyone can write. The weights are what makes the running program Llama rather than a random untrained network outputting gibberish.

The word survives from the original mental picture of a neural network: each connection between artificial neurons has a strength — a weight — that says how much one neuron’s output feeds into the next. Adjust the weights and you change the function the network computes. Train on enough text and the weights settle into a configuration that, when you feed in tokens and multiply them through, produces output that looks like a fluent answer.

Why it matters now

Three places the weight file shows up as more than a definition:

The short answer

weights = the numbers inside the model's matrices, set by training

A neural network is mostly matrix multiplications. The weights are the entries in those matrices. They’re set during training by gradient descent and then frozen. Running the model means pushing token vectors through those matrices: multiply, add, repeat. Everything the model “knows” is encoded in those numbers and nowhere else.
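To make "multiply, add, repeat" concrete, here is a minimal sketch of a two-layer forward pass in plain Python. The layer sizes, the weight values, and the ReLU nonlinearity are all made up for illustration; a real transformer has attention, normalization, and billions of entries, but the arithmetic at the bottom is the same kind.

```python
# Toy forward pass: "multiply, add, repeat" with hand-picked weights.
# All sizes and values here are invented for illustration.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(vector):
    """A simple nonlinearity; real transformers use other functions."""
    return [max(0.0, v) for v in vector]

# Two layers of frozen weights (standing in for what training produced).
layer1 = [[0.5, -1.0],
          [1.5, 0.25],
          [-0.5, 2.0]]           # maps 2 numbers to 3 numbers
layer2 = [[1.0, 0.0, -1.0],
          [0.5, 0.5, 0.5]]       # maps 3 numbers to 2 numbers

def forward(token_vector):
    hidden = relu(matvec(layer1, token_vector))
    return matvec(layer2, hidden)

output = forward([2.0, 1.0])
print(output)  # [-1.0, 2.125]
```

Nothing in `forward` changes at inference time. Swap in different numbers for `layer1` and `layer2` and the same code computes a different function, which is the whole point of calling the weights "the model."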

How it works

Three questions worth separating: what a single weight is, how a pile of weights stores knowledge, and what the difference between “weights” and “parameters” actually is.

What a single weight is

A weight is one floating-point number. Usually 16 bits (bf16), sometimes quantized down to 8 or 4 bits for inference.
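The bits-per-weight figure is what turns a parameter count into a download size. A rough sketch of the arithmetic, assuming a round 70 billion parameters and ignoring file-format overhead (real quantized formats also store some extra scaling data, so treat the smaller numbers as lower bounds):

```python
# Back-of-envelope storage for a model at different precisions.
# Assumes exactly 70e9 parameters and no file-format overhead.

PARAM_COUNT = 70_000_000_000

def size_gb(param_count, bits_per_weight):
    """Approximate storage in gigabytes (1 GB = 10**9 bytes)."""
    return param_count * bits_per_weight / 8 / 1e9

sizes = {bits: size_gb(PARAM_COUNT, bits) for bits in (16, 8, 4)}
for bits, gb in sizes.items():
    print(f"{bits:>2} bits per weight: about {gb:.0f} GB")
```

At 16 bits this comes out to about 140 GB, at 8 bits about 70 GB, and at 4 bits about 35 GB.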

It lives at a fixed position in a specific matrix in a specific layer. Say row 4096, column 2071, of the down-projection matrix in layer 42’s feed-forward block. The number might be -0.00347. By itself, that number means nothing.

What it does is mechanical: when an input vector passes through that matrix, element 2071 of the input gets multiplied by -0.00347 and added into element 4096 of the output. That tiny contribution combines with thousands of others in the same row to produce one number of the next layer’s input. There is no “this weight means cat”: each weight takes part in a dot product every time a token passes through its layer, and every output value is a weighted sum over an entire row of weights.
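The same mechanics on a matrix small enough to read. This sketch uses an invented 3x4 matrix and follows one weight, `W[1][2]`, to show that its entire job is one multiply and one add into one output element.

```python
# One weight's mechanical role in a matrix-vector product.
# The 3x4 matrix and the input are made-up stand-ins for a real layer.

W = [[0.5, -0.25, 1.0, 0.0],
     [2.0, 0.5, -0.5, 1.5],
     [0.0, 1.0, 0.25, -1.0]]
x = [1.0, 2.0, 4.0, -2.0]

row, col = 1, 2                      # the weight we follow: W[1][2] = -0.5

# Full output: each element is the dot product of one row with the input.
y = [sum(w * xi for w, xi in zip(r, x)) for r in W]

# What this single weight adds to y[row], and everything else in that row.
contribution = W[row][col] * x[col]
rest_of_row = sum(W[row][j] * x[j] for j in range(len(x)) if j != col)

print("y =", y)                                   # y = [4.0, -2.0, 5.0]
print("contribution of W[1][2]:", contribution)   # -2.0
```

The weight touches `y[1]` and nothing else. Whether that contribution "means" anything depends entirely on what the surrounding rows, columns, and layers do with it.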

How weights store knowledge

This is the part that feels suspicious until you sit with it: there is no lookup table inside the model. No row that says “France → Paris.” Training adjusts weights until the act of running them — multiplying a sequence of vectors through every layer — produces output probabilities that match the training distribution.
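"Training adjusts weights" can be sketched with a single weight. The example below is a deliberately tiny stand-in: it fits one number `w` so that `w * x` matches some targets, using squared error. Real LLM training optimizes next-token probabilities over billions of weights at once, but the loop has the same shape: measure the error, compute its slope, nudge the weight downhill. The data, learning rate, and step count are arbitrary choices.

```python
# Gradient descent on a single weight: nudge w until w * x matches the targets.
# Data, learning rate, and step count are arbitrary choices for this sketch.

data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]   # (input, target); target = 3 * input

w = 0.0                # start from an uninformed weight
learning_rate = 0.05

for step in range(200):
    # Slope of the mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= learning_rate * grad

print(round(w, 4))  # 3.0
```

Nobody wrote "multiply by 3" anywhere. The rule exists only as the value the weight settled on, which is the one-number version of how a model ends up "knowing" something.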

The result is that knowledge ends up distributed. A fact like “Paris is the capital of France” isn’t stored in one weight or one neuron. It’s spread across many weights in many layers, all of which also participate in encoding millions of other facts. You can usually zero out a single weight and the model’s behavior barely changes; degrade enough of them and behavior falls apart, but rarely in a clean “it forgot France” way.
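A toy version of the "zero out a single weight" experiment. This uses a random 256x256 matrix, not a trained model, so it says nothing about facts or France; it only illustrates the arithmetic reason a single weight rarely matters much, which is that it is one term among many in one output element. The seed, size, and chosen position are arbitrary.

```python
import math
import random

# Zero one entry of a random matrix and measure how far the output moves.
# A random matrix is NOT a trained model; this only shows the arithmetic.
random.seed(0)
DIM = 256

W = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]
x = [random.uniform(-1, 1) for _ in range(DIM)]

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

baseline = matvec(W, x)

r, c = 17, 101                        # arbitrary position to ablate
W_ablated = [row[:] for row in W]     # copy so the original stays intact
W_ablated[r][c] = 0.0
ablated = matvec(W_ablated, x)

changed_rows = [i for i in range(DIM) if baseline[i] != ablated[i]]
diff_norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(baseline, ablated)))
base_norm = math.sqrt(sum(a * a for a in baseline))
relative_change = diff_norm / base_norm

print("rows affected:", changed_rows)
print(f"relative change in output: {relative_change:.4%}")
```

Only the row containing the zeroed weight can change, and the shift is small next to the size of the whole output vector. Zeroing many weights at once is a different story, since the errors pile up across rows and then across layers.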

The closest thing to “where a concept lives” is the attention heads and feed-forward circuits that activate when that concept shows up. Mechanistic interpretability is the research program trying to reverse-engineer those circuits — naming subsets of weights that, together, implement something a human can describe. Progress is real but partial. The honest summary in 2026: for any specific weight in a frontier-size model, we generally cannot say what it does in isolation.

Weights vs parameters

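The two words get used interchangeably and most of the time that’s fine. The technical distinction: parameters are every number learned during training, while weights, strictly speaking, are the entries of the model’s matrices. Biases and normalization scales are parameters but not weights in that narrow sense.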

Biases and norm scales are a small fraction of the total, on the order of a percent or less in a modern transformer, because they scale with the hidden dimension d while weight matrices scale with d². So the “X B parameters” headline on a model card is dominated by weights, and “weight count” and “parameter count” come out close enough that people use them interchangeably. The headline number is the count; the weights are the contents.
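The d versus d² scaling is easy to check with a quick count. The layout below is a simplified assumption, not any particular model's architecture: four d-by-d attention matrices, a two-matrix feed-forward block with inner width 4d, a bias on every linear layer, and two norm scales per block. Embeddings are ignored. Many recent models drop the biases entirely, which makes the non-matrix share even smaller.

```python
# Count matrix entries vs. per-feature vectors in one simplified transformer block.
# The layout is an assumption for illustration, not a specific model's design.

def block_counts(d):
    matrix_entries = 4 * d * d + 2 * (d * 4 * d)   # grows with d**2
    biases = 4 * d + (4 * d + d)                   # grows with d
    norm_scales = 2 * d                            # grows with d
    return matrix_entries, biases + norm_scales

fractions = {}
for d in (512, 4096, 8192):
    matrices, vectors = block_counts(d)
    fractions[d] = vectors / (matrices + vectors)
    print(f"d={d}: non-matrix share of parameters = {fractions[d]:.4%}")
```

The share shrinks as d grows, and under these assumptions it is already well under one percent at d = 512.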

What’s literally in the file

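A .safetensors file (the modern standard format) is three parts laid end to end: an 8-byte little-endian integer giving the length of the header, a JSON header listing each tensor’s name, dtype, shape, and byte offsets, and then the raw tensor bytes themselves.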

That’s the learned state of the model. You still need the architecture code (a few hundred lines) and the tokenizer files (vocabulary + merge rules) to turn the file into something that takes a prompt and emits text — but training produced exactly what’s in the .safetensors.
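The layout is simple enough to build and read by hand. The sketch below assembles a tiny blob in the safetensors layout in memory (an 8-byte little-endian length, a JSON header, then raw tensor bytes) and parses it back using only the standard library. The tensor name `demo.weight` and its four values are invented; for real files, the official `safetensors` library is the sensible tool.

```python
import json
import struct

# Build a tiny blob in the safetensors layout by hand (no library needed).
# The tensor name "demo.weight" and its values are made up.
values = [0.5, -1.0, 1.5, 0.25]
tensor_bytes = struct.pack("<4f", *values)             # four little-endian float32
header = {"demo.weight": {"dtype": "F32", "shape": [2, 2],
                          "data_offsets": [0, len(tensor_bytes)]}}
header_bytes = json.dumps(header).encode("utf-8")
blob = struct.pack("<Q", len(header_bytes)) + header_bytes + tensor_bytes

# Read it back: length prefix, then JSON header, then raw tensor bytes.
(header_len,) = struct.unpack("<Q", blob[:8])
parsed = json.loads(blob[8:8 + header_len])
data = blob[8 + header_len:]

info = parsed["demo.weight"]
start, end = info["data_offsets"]      # offsets are relative to the data section
restored = list(struct.unpack("<4f", data[start:end]))
print(info["dtype"], info["shape"], restored)
```

Note what the header does not contain: no code, no layer wiring, no instructions. It is a directory of named number arrays, and the architecture code decides which array gets multiplied where.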

Going deeper