Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why LoRA exists

Full fine-tuning of a 70B model means storing optimizer state for 70 billion weights. LoRA trains under 1% of the parameters and, on the tasks people have tested, often matches the result. The trick is a hypothesis about the shape of the update.


Why it exists

Imagine you have a 70-billion-parameter base model and a few thousand examples of how you want it to behave. The naive plan is to load the model, run gradient descent on every weight, and save the result.

That plan is brutal. Adam, the optimizer everyone reaches for, keeps two extra FP32 numbers per parameter (the running mean and variance of gradients). On a 70B model, that’s ~560 GB of optimizer state alone, on top of the weights and gradients. You also have to checkpoint the result — and if you want five different fine-tunes for five customers, that’s five copies of a 70B-parameter model sitting on disk.
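The arithmetic behind that figure, if you want to check it:

```python
params = 70e9            # weights in the base model
adam_moments = 2         # running mean and variance per parameter
fp32_bytes = 4
print(params * adam_moments * fp32_bytes / 1e9, "GB")  # 560.0 GB
```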

LoRA exists because someone noticed the whole setup was wasteful. The update you’re trying to learn during fine-tuning — the difference between “base model” and “base model that does my task” — turns out to have a very particular shape. You don’t need 70 billion degrees of freedom to express it. A few million is usually enough.

That’s the entire pitch. Freeze the base. Train a tiny side-channel. Match the quality of full fine-tuning at a fraction of the cost.

Why it matters now

LoRA isn’t a clever optimization buried in some lab’s training code. It became the standard PEFT baseline for fine-tuning open-weight models, and the tooling around open-weight fine-tuning is built with it in mind.

If you’re choosing between fine-tuning, prompting, and RAG, you’re really choosing whether to ship a LoRA or not. Knowing why LoRA works tells you when it’s the right tool.

The short answer

LoRA = freeze the base weights, add a low-rank update ΔW = BA, and train only A and B.

Instead of learning a new full-size weight matrix, you learn two skinny matrices whose product approximates the update. If W is 4096×4096 and your rank r is 8, then B is 4096×8 and A is 8×4096. That’s ~65k trainable numbers in place of ~16.8M, about 256× fewer parameters, while keeping most of the quality.
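You can check the arithmetic in a few lines of Python:

```python
d = k = 4096
r = 8
full = d * k              # one full-size update matrix
lora = d * r + r * k      # B (d x r) plus A (r x k)
print(full, lora, full / lora)  # 16777216 65536 256.0
```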

How it works

The intrinsic-rank hypothesis

The whole technique rests on a guess that turned out to be right in practice: the adaptation a fine-tune performs lives in a low-dimensional subspace, even though the model itself is huge.

This wasn’t pulled out of nowhere. Aghajanyan et al. (2020), in Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning, showed you can fine-tune RoBERTa to ~90% of full performance on MRPC by optimizing only ~200 parameters projected randomly back into the full weight space. One reading of that result: the pretrained model already has the right features, and fine-tuning just steers them. Hu et al. (2021) took the next step: if the effective update is low-dimensional, why not bake that constraint into the parameterization itself? Don’t sample a random subspace — learn one, by writing the update as a product of two low-rank matrices.
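A toy version of the random-subspace experiment makes the idea tangible. This is a sketch, not the paper’s setup (they project into real model weights using structured transforms like Fastfood); every name here is made up for illustration:

```python
import torch

D, d = 100_000, 200                     # full weight count vs. intrinsic dimension
theta0 = torch.randn(D)                 # stand-in for pretrained weights, frozen
P = torch.randn(D, d) / d ** 0.5        # fixed random projection, never trained
# Pretend the update the task needs happens to lie in P's 200-dim subspace:
target = theta0 + P @ torch.randn(d)

z = torch.zeros(d, requires_grad=True)  # the only trainable parameters
opt = torch.optim.Adam([z], lr=1e-1)
for _ in range(300):
    theta = theta0 + P @ z              # effective weights
    loss = ((theta - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())                      # near zero: 200 numbers were enough
# The paper's surprise is that real fine-tuning behaves like this toy:
# the update the task needs (approximately) lies in a low-dim subspace.
```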

So the hypothesis is: ΔW (the change to a weight matrix during fine-tuning) is well-approximated by BA, where B ∈ ℝ^{d×r} and A ∈ ℝ^{r×k} and r is much smaller than d or k.

The empirical result: yes, often it is. r=8 frequently matches r=64 on downstream tasks. The QLoRA paper, in an Alpaca-style sweep on LLaMA-7B with LoRA applied to all layers, reported that rank had effectively no effect on final task performance — a narrower observation than “rank doesn’t matter,” but a striking one.

What this doesn’t prove: it doesn’t show that a full fine-tune “actually” only moves weights in a low-rank subspace. It only shows that if you constrain the update to be low-rank, you don’t lose much. Which is what you wanted operationally; the deeper claim is still under active debate.
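If you want to poke at the operational claim yourself, truncated SVD gives the best rank-r approximation of any matrix (Eckart–Young), so you can measure how much of an update survives the constraint. The ΔW below is fabricated as low-rank-plus-noise just to exercise the code; with real checkpoints you’d diff the fine-tuned and base weights:

```python
import torch

d, r = 1024, 8
# Fabricated update: a rank-8 signal plus small noise. With real checkpoints,
# use dW = W_finetuned - W_base instead.
dW = torch.randn(d, r) @ torch.randn(r, d) + 0.1 * torch.randn(d, d)

U, S, Vh = torch.linalg.svd(dW)
dW_r = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]   # best rank-r approximation
ratio = torch.linalg.norm(dW - dW_r) / torch.linalg.norm(dW)
print(f"relative error of rank-{r} approximation: {ratio:.3f}")
```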

The mechanics

For a frozen pretrained weight matrix W₀ ∈ ℝ^{d×k}, the LoRA parameterization replaces

y = W₀ x

with

y = W₀ x + (α/r) · B A x

where B ∈ ℝ^{d×r}, A ∈ ℝ^{r×k}, and α is a scalar scaling factor. In the original paper’s scheme, B is initialized to zero and A to random Gaussian, so BA = 0 at step 0 and the adapted model starts as an exact copy of the base. (Library defaults differ in details — e.g. Hugging Face PEFT uses Kaiming-uniform for A and zeros for B — but the “starts at zero update” property is preserved.) Training updates only A and B; W₀ never moves.
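Here’s what that parameterization looks like as code: a minimal PyTorch sketch, not a library API (the class name and init constants are mine):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # W0 (and bias) never move
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # random Gaussian init
        self.B = nn.Parameter(torch.zeros(d, r))         # zeros, so BA = 0 at step 0
        self.scale = alpha / r

    def forward(self, x):
        # y = W0 x + (alpha/r) * B A x
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65536 = 2 * 4096 * 8
```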

A few details worth knowing:

- The α/r factor keeps the update’s scale roughly stable as you change r, so learning rates tuned at one rank transfer reasonably to another.
- The original paper adapted only the attention projections (Wq and Wv in most experiments); applying LoRA to every linear layer is now common, and is what QLoRA does.
- At inference you can merge the adapter into the base weight, W = W₀ + (α/r)·BA, so a merged LoRA adds zero latency. Left unmerged, many adapters can share one copy of the base and be swapped per request (see the check just after this list).
- The artifact you ship is just A and B for each adapted matrix: megabytes, not hundreds of gigabytes.
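The merge identity is a one-line check:

```python
import torch

d = k = 512; r = 8; alpha = 16
W0 = torch.randn(d, k)                  # frozen base weight
B, A = torch.randn(d, r), torch.randn(r, k)
x = torch.randn(k)

two_path = W0 @ x + (alpha / r) * (B @ (A @ x))
merged_W = W0 + (alpha / r) * (B @ A)   # fold the adapter into the base once
print(torch.allclose(two_path, merged_W @ x, atol=1e-3))  # True
```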

Why the savings are this dramatic

Stack a few effects (rough numbers just below):

- Gradients and Adam moments exist only for A and B. The two FP32 moments that dominated the full fine-tune budget now cover a few hundred million parameters at most, not 70 billion.
- The adapters themselves are tiny: the ~256× shrink from the example above applies to every adapted matrix.
- Checkpoints shrink accordingly: five customers means five small adapter files on top of one shared base, not five full copies of the 70B weights.
- QLoRA (next section) quantizes the frozen base to 4-bit, shrinking the one remaining large term.
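Rough arithmetic for the stack, assuming BF16 weights and gradients, FP32 Adam moments, and a ~0.5% adapter budget (a typical choice, not a rule); activations and framework overhead are ignored:

```python
GB = 1e9

def train_memory_gb(trainable, frozen):
    weights = (trainable + frozen) * 2   # BF16 weights, 2 bytes each
    grads = trainable * 2                # BF16 gradients, trainable params only
    adam = trainable * 8                 # two FP32 moments, trainable params only
    return (weights + grads + adam) / GB

full = 70e9
adapter = 0.005 * full                   # ~0.5% trainable, an assumed budget
print(train_memory_gb(full, 0))          # ~840 GB before activations
print(train_memory_gb(adapter, full))    # ~144 GB, dominated by frozen weights
```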

QLoRA: stacking the trick on quantization

QLoRA (Dettmers et al., May 2023) is the most consequential follow-up. The base model is quantized to 4-bit (NF4, a data type tailored for normally-distributed weights), kept frozen, and dequantized on the fly to FP16/BF16 to compute forward and backward passes. Only the LoRA adapters and the optimizer state for them live in 16-bit. The headline: fine-tune a 65B model on one 48 GB GPU, recovering full 16-bit fine-tuning quality on the benchmarks they tested. This is what made “fine-tune a frontier-scale model on a workstation” actually true.
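In the Hugging Face stack this takes a few lines. A hedged sketch: the model name is a placeholder, the hyperparameters are illustrative, and "all-linear" is a shortcut recent versions of PEFT accept for targeting every linear layer:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Base model stored in 4-bit NF4, dequantized to BF16 on the fly for compute.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "your-base-model",                  # placeholder, not a real checkpoint name
    quantization_config=bnb,
)
model = prepare_model_for_kbit_training(model)

# LoRA on every linear layer, as in the QLoRA recipe.
peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules="all-linear", task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()      # adapters only; the 4-bit base is frozen
```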

Where it gets subtle

The whole technique is a bet on a structural observation about fine-tuning, not a clever optimizer trick. That’s why it generalizes — the same idea works for transformer LLMs, diffusion image models, and just about anything else where you start from a strong pretrained base.

Going deeper

- Hu et al. (2021), "LoRA: Low-Rank Adaptation of Large Language Models," arXiv:2106.09685. The original paper.
- Aghajanyan et al. (2020), "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning," arXiv:2012.13255. The result the low-rank bet was built on.
- Dettmers et al. (2023), "QLoRA: Efficient Finetuning of Quantized LLMs," arXiv:2305.14314. LoRA over a 4-bit frozen base.