What are 'weights' in an LLM?
When Meta releases 'open weights' for Llama, what's actually in that file? A giant table of numbers and nothing else — so how does a pile of numbers know things?
Why it exists
You go to Hugging Face, click Download on Llama 3.1 70B, and end up with a folder of .safetensors files totalling roughly 140 gigabytes. That’s it. No source code that says “when asked about Paris, mention the Eiffel Tower.” No database of facts. No rules. Just an enormous table of numbers.
Those numbers are the weights. When people say “Meta released the weights” or “the model is 140 GB” or “open-weights model,” they’re talking about exactly that file. The architecture — how the layers connect — is a few hundred lines of code that anyone can write. The weights are what makes the running program Llama rather than a random untrained network outputting gibberish.
The word survives from the original mental picture of a neural network: each connection between artificial neurons has a strength — a weight — that says how much one neuron’s output feeds into the next. Adjust the weights and you change the function the network computes. Train on enough text and the weights settle into a configuration that, when you feed in tokens and multiply them through, produces output that looks like a fluent answer.
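To make "adjust the weights and you change the function" concrete, here is a toy sketch: a "network" with exactly one weight, fitted by gradient descent. The target function (y = 3x), the learning rate, and the step count are all illustrative assumptions, not anything from a real training run. Real training applies the same kind of nudge to billions of weights at once.

```python
# Toy example: a "network" with a single weight w that computes y = w * x.
# The target (y = 3x), learning rate, and step count are illustrative
# assumptions chosen so the loop converges quickly.

def train_single_weight(data, w=0.0, lr=0.05, steps=200):
    """Fit w by gradient descent on mean squared error."""
    n = len(data)
    for _ in range(steps):
        # Derivative of mean((w*x - y)^2) with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / n
        w -= lr * grad  # step against the slope of the loss
    return w

data = [(x, 3.0 * x) for x in (-2.0, -1.0, 0.5, 1.0, 2.0)]
w_trained = train_single_weight(data)
print(f"learned weight: {w_trained:.4f}")
```

The weight starts at 0.0 and settles at the value that makes the outputs match the data. Nothing in the loop mentions the number 3; it falls out of the examples.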
Why it matters now
Three places the weight file shows up as more than a definition:
- “Open weights” vs “open source.” Meta ships Llama under a license it calls open source, but the OSI’s Open Source AI Definition rejects Llama-style releases because the training data and full training pipeline aren’t published. You can run, fine-tune, and quantize the released weights — but you can’t reproduce them from scratch. The dispute over what “open” should mean in AI falls almost exactly along the line between releasing the weights and releasing what made them.
- What gets backed up, copied, leaked. A model release is a weight release. The frontier labs treat their weight files as crown jewels in part because shipping a copy of the file is, functionally, shipping the model.
- What fine-tuning, quantization, distillation, and merging actually do. All of these modify or compress the weights. Holding “the weight file is a big array of numbers in named matrices” in your head is what makes the rest of the toolkit legible.
The short answer
weights = the numbers inside the model's matrices, set by training
A neural network is mostly matrix multiplications. The weights are the entries in those matrices. They’re set during training by gradient descent and then frozen. Running the model means pushing token vectors through those matrices: multiply, add, repeat. Everything the model “knows” is encoded in those numbers and nowhere else.
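A minimal sketch of "multiply, add, repeat," assuming a plain stack of layers with made-up sizes and random, untrained weights. The ReLU between layers is an added assumption (without some nonlinearity a stack of matrices collapses into one matrix), and a real transformer also has attention, normalization, and residual connections that this leaves out.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def forward(layers, x):
    """Push x through each (W, b) pair: multiply, add, repeat."""
    for i, (W, b) in enumerate(layers):
        x = [a + bi for a, bi in zip(matvec(W, x), b)]
        if i < len(layers) - 1:
            x = relu(x)  # nonlinearity between layers (an assumption here)
    return x

def random_layer(rng, n_out, n_in):
    """Untrained layer: random weights, zero biases."""
    W = [[rng.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

rng = random.Random(0)
layers = [random_layer(rng, 8, 4), random_layer(rng, 8, 8), random_layer(rng, 3, 8)]
x = [1.0, -0.5, 0.25, 2.0]  # stand-in for a token vector
out = forward(layers, x)
print(out)
```

Because the weights are frozen, calling forward again with the same input returns the same output. The only way to change what this function computes is to change the numbers in layers.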
How it works
Three questions worth separating: what a single weight is, how a pile of weights stores knowledge, and what the difference between “weights” and parameters actually is.
What a single weight is
A weight is one floating-point number. Usually 16 bits (bf16), sometimes quantized down to 8 or 4 bits for inference.
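The bit width is also where the file size comes from. A rough sketch of the arithmetic, assuming a nominal 70 billion weights and ignoring file headers and any extra metadata a quantized format stores:

```python
def weight_file_gb(n_params, bits_per_weight):
    """Approximate storage in gigabytes (10**9 bytes) for the raw weights."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 70e9  # nominal "70B"; the exact count differs slightly

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{weight_file_gb(N_PARAMS, bits):.0f} GB")
```

At 16 bits per weight this gives about 140 GB, matching the download size mentioned earlier; 8-bit and 4-bit storage come to about 70 GB and 35 GB.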
It lives at a fixed position in a specific matrix in a specific layer. Say
row 4096, column 2071, of the down-projection matrix in layer 42’s
feed-forward block. The number might be -0.00347. By itself, that number
means nothing.
What it does is mechanical: when an input vector passes through that
matrix, the value at index 2071 of the input gets multiplied by -0.00347
and added into element 4096 of the output. That tiny contribution combines
with millions of others across the matrix to produce the next layer’s
input. There is no “this weight means cat.” The same weight takes part in
the dot product for every token the model processes, and every output value
is a weighted sum over the entire input vector.
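The bookkeeping for one weight can be checked on a tiny stand-in matrix. The sizes, values, and the watched position (row 1, column 2) below are made up for illustration; only the -0.00347 is borrowed from the example above.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Tiny stand-in for a projection matrix: 3 outputs, 4 inputs.
W = [
    [0.10, -0.20, 0.05, 0.00],
    [0.30, 0.01, -0.00347, 0.25],
    [-0.15, 0.40, 0.20, -0.10],
]
x = [1.0, 2.0, 4.0, -1.0]

row, col = 1, 2                      # position of the weight being watched
contribution = W[row][col] * x[col]  # what this one weight adds to y[row]

y = matvec(W, x)

# Zero that single weight and recompute.
W_without = [r[:] for r in W]
W_without[row][col] = 0.0
y_without = matvec(W_without, x)

print("contribution:", contribution)
print("change in y[row]:", y[row] - y_without[row])
```

Only the output element in the weight's row moves, and it moves by exactly the product of the weight and the matching input value (up to floating-point rounding). Every other output element is untouched.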
How weights store knowledge
This is the part that feels suspicious until you sit with it: there is no lookup table inside the model. No row that says “France → Paris.” Training adjusts weights until the act of running them — multiplying a sequence of vectors through every layer — produces output probabilities that match the training distribution.
The result is that knowledge ends up distributed. A fact like “Paris is the capital of France” isn’t stored in one weight or one neuron. It’s spread across many weights in many layers, all of which also participate in encoding millions of other facts. You can usually zero out a single weight and the model’s behavior barely changes; degrade enough of them and behavior falls apart, but rarely in a clean “it forgot France” way.
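A toy version of the zero-out experiment, with heavy caveats: this is a single random 64×64 matrix, not a trained model, so it says nothing about facts or forgetting. It only illustrates the arithmetic reason one weight matters so little: its contribution is one term among thousands. The matrix size, the random seed, and the "half the weights" damage level are all assumptions.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relative_change(y_ref, y_new):
    """Size of the output change relative to the size of the original output."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(y_ref, y_new)))
    den = math.sqrt(sum(a * a for a in y_ref))
    return num / den

def zero_random_weights(W, k, rng):
    """Return a copy of W with k randomly chosen entries set to zero."""
    n_cols = len(W[0])
    damaged = [row[:] for row in W]
    for idx in rng.sample(range(len(W) * n_cols), k):
        damaged[idx // n_cols][idx % n_cols] = 0.0
    return damaged

rng = random.Random(0)
n = 64
W = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = matvec(W, x)

one_zeroed = relative_change(y, matvec(zero_random_weights(W, 1, rng), x))
half_zeroed = relative_change(y, matvec(zero_random_weights(W, n * n // 2, rng), x))

print(f"one weight zeroed:    {one_zeroed:.4f}")
print(f"half of them zeroed:  {half_zeroed:.4f}")
```

Zeroing one entry shifts the output by a small fraction of its size; zeroing half of them changes it substantially. In a frontier-size model each matrix has far more entries, so a single weight's share is smaller still.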
The closest thing to “where a concept lives” is the attention heads and feed-forward circuits that activate when that concept shows up. Mechanistic interpretability is the research program trying to reverse-engineer those circuits — naming subsets of weights that, together, implement something a human can describe. Progress is real but partial. The honest summary in 2026: for any specific weight in a frontier-size model, we generally cannot say what it does in isolation.
Weights vs parameters
The two words get used interchangeably and most of the time that’s fine. The technical distinction:
- Weights are the entries of the matrices, the multipliers in output = W · input + b.
- Biases are the per-row additive offsets, the b in the same equation.
- Parameters = weights + biases + any other learned scalars (e.g. layer-norm scales).
Biases and norm scales are a small fraction of the total — on the order of a
percent or less in a modern transformer, because they scale with hidden
dimension d while weight matrices scale with d². So the “X B parameters”
headline on a model card is dominated by weights, and “weight count” and
“parameter count” come out close enough that people use them
interchangeably. The headline number is the count; the weights are the
contents.
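The d versus d² argument is easy to check by counting. A small sketch, assuming a single square layer of the form output = W · input + b with hidden size d = 8192 (the hidden size that appears in the shape example in the next section):

```python
def linear_layer_counts(d_in, d_out):
    """Return (weights, biases) for a layer computing output = W · input + b."""
    return d_in * d_out, d_out

d = 8192
weights, biases = linear_layer_counts(d, d)
fraction = biases / (weights + biases)

print(f"weights: {weights:,}")
print(f"biases:  {biases:,}")
print(f"bias share of parameters: {fraction:.4%}")
```

For this one layer the biases are about 0.012% of the parameters. Doubling d doubles the bias count but quadruples the weight count, so the share only shrinks as models get wider.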
What’s literally in the file
A .safetensors file (the modern standard format) is three parts laid
end to end:
- An 8-byte little-endian integer giving the length of the JSON header that follows.
- A JSON header with one entry per tensor: name, shape, dtype, and a data_offsets pair pointing into the byte buffer. E.g. model.layers.42.mlp.down_proj.weight, shape [8192, 28672], dtype bf16.
- The raw tensor bytes, packed contiguously.
That’s the learned state of the model. You still need the architecture code
(a few hundred lines) and the tokenizer files (vocabulary + merge rules) to
turn the file into something that takes a prompt and emits text — but
training produced exactly what’s in the .safetensors.
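The three-part layout is simple enough to read with the standard library. The sketch below builds a tiny file in memory and parses it back, following the layout described above. It is not the official safetensors library. The tensor name toy.weight and its values are hypothetical, and F32 is used instead of bf16 because Python's struct module can unpack 32-bit floats directly.

```python
import json
import struct

def build_safetensors(tensors):
    """Serialize {name: (dtype, shape, raw_bytes)} into the three-part layout."""
    header, buffer = {}, b""
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            # Offsets are relative to the start of the byte buffer.
            "data_offsets": [len(buffer), len(buffer) + len(raw)],
        }
        buffer += raw
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    # Part 1: 8-byte little-endian header length. Part 2: JSON. Part 3: bytes.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + buffer

def read_header(blob):
    """Return (header_dict, position where the tensor bytes start)."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8:8 + header_len])
    return header, 8 + header_len

def read_tensor_f32(blob, name):
    """Return the flat list of float32 values stored under `name`."""
    header, data_start = read_header(blob)
    begin, end = header[name]["data_offsets"]
    raw = blob[data_start + begin:data_start + end]
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

# Hypothetical 2x3 tensor; values chosen to be exact in float32.
values = [0.5, -0.25, 1.0, 2.0, -1.5, 0.0]
blob = build_safetensors({
    "toy.weight": ("F32", [2, 3], struct.pack("<6f", *values)),
})

header, data_start = read_header(blob)
print(header)
print(read_tensor_f32(blob, "toy.weight"))
```

Reading only the first 8 bytes plus the header is enough to list every tensor's name, shape, and dtype without loading any weights. A real reader would also need to handle the optional `__metadata__` entry in the header and dtypes such as BF16.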
Famous related terms
- Parameters (parameters = weights + biases + other learned scalars): the count of all trainable numbers. See what ‘X parameters’ means for why that number is the headline on every model card.
- Open-weights model (open-weights = the trained weight file is downloadable): distinct from open-source, which would also require the training code and data needed to reproduce the weights.
- Checkpoint (checkpoint = a snapshot of the weights at one point in training): what gets saved every few thousand steps so a long training run can resume if a node dies.
- Gradient descent (gradient descent = nudge each weight against the slope of the loss): the algorithm that produces the weights from training data.
- Fine-tuning (fine-tuning = continue training a model's weights on new data): modifies the same weight file rather than starting from scratch. See why fine-tuning is cheap.
- Distillation (distillation = train a small model to copy a big model's outputs): produces a smaller weight file that mimics a larger one. See why distillation exists.
- Quantization (quantization = store each weight in fewer bits): doesn’t change which weights exist, only how they’re encoded. See why quantization works.
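The quantization entry above can be sketched with one simple scheme: symmetric 8-bit quantization with a single shared scale per tensor. This is an illustrative assumption about the method, not what any particular tool does; production quantizers typically use per-group scales and more careful rounding. The weight values are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    # Fall back to 1.0 so an all-zero tensor does not divide by zero.
    scale = (max(abs(w) for w in weights) / 127) or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]

weights = [-0.00347, 0.0121, -0.0450, 0.0302, 0.0008]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("stored integers:", q)
print("restored:", restored)
print("largest rounding error:", max_error)
```

There are still five weights at the same five positions. What changed is that each one is stored as a small integer plus a shared scale, and the price is a rounding error of at most half a scale step per weight.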
Going deeper
- The safetensors format spec is the primary source for what’s literally in the file you downloaded: the header-length prefix, the JSON header with data_offsets, then the raw bytes.
- Andrej Karpathy’s Neural Networks: Zero to Hero is the best explainer for what a weight does: it builds a network from scratch and shows what gradient descent does to it step by step, until “a weight is just a number in a matrix” stops feeling abstract.
- Anthropic’s A Mathematical Framework for Transformer Circuits — the rabbit hole for “where does knowledge live in those numbers”: the entry point into mechanistic interpretability and treating weights as circuits you can reverse-engineer.