Why LoRA exists
Fully fine-tuning a 70B model means storing optimizer state for all 70 billion weights. LoRA trains under 1% of the parameters and, on the tasks people have tested, often matches the result. The trick is a hypothesis about the shape of the update.
Why it exists
Imagine you have a 70-billion-parameter base model and a few thousand examples of how you want it to behave. The naive plan is to load the model, run gradient descent on every weight, and save the result.
That plan is brutal. Adam, the optimizer everyone reaches for, keeps two extra FP32 numbers per parameter (running estimates of the gradient’s first and second moments). On a 70B model, that’s 70B × 2 × 4 bytes ≈ 560 GB of optimizer state alone, on top of the weights and gradients. You also have to checkpoint the result — and if you want five different fine-tunes for five customers, that’s five copies of a 70B-parameter model sitting on disk.
LoRA exists because someone noticed the whole setup was wasteful. The update you’re trying to learn during fine-tuning — the difference between “base model” and “base model that does my task” — turns out to have a very particular shape. You don’t need 70 billion degrees of freedom to express it. A few million is usually enough.
That’s the entire pitch. Freeze the base. Train a tiny side-channel. Match the quality of full fine-tuning at a fraction of the cost.
Why it matters now
LoRA isn’t a clever optimization buried in some lab’s training code. It became the standard parameter-efficient fine-tuning (PEFT) baseline for open-weight models, and it shapes the serving infrastructure built around them.
- One base model, many adapters. A LoRA adapter for a 7B model is often tens of megabytes. You can keep hundreds of them on a single server and hot-swap between customers, languages, or styles without reloading the base weights. Multi-tenant fine-tune serving (think per-customer adapters) is only economical because of this. (A hot-swap sketch follows this list.)
- Single-GPU fine-tuning. Combined with 4-bit quantization (the QLoRA recipe, Dettmers et al., 2023), you can fine-tune a 65B-parameter model on a single 48 GB GPU. That number was unimaginable in 2022.
- The open-weight ecosystem leans on this. A large share of the community fine-tunes on Hugging Face are LoRA or LoRA-derived adapters — full fine-tunes of a 70B model are out of reach for hobbyists, but a LoRA isn’t. (I don’t have a reliable count, just an impression from browsing.)
- Image-model land too. Stable Diffusion’s “civitai-style” character and style packs are commonly distributed as LoRA adapters. Same trick, different domain.
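The hot-swap pattern from the first bullet, sketched with Hugging Face PEFT. The model ID and adapter paths here are placeholders; the load_adapter/set_adapter calls are the real API, though details shift across versions.

```python
# Hot-swapping per-customer adapters over one frozen base (Hugging Face PEFT).
# "base-model-id" and the adapter paths below are placeholders.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("base-model-id")

# Load two adapters over the same frozen base weights.
model = PeftModel.from_pretrained(base, "adapters/customer-a", adapter_name="customer-a")
model.load_adapter("adapters/customer-b", adapter_name="customer-b")

model.set_adapter("customer-a")   # route requests for customer A
# ... serve ...
model.set_adapter("customer-b")   # swap without reloading the base model
```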
If you’re choosing between fine-tuning, prompting, and RAG, you’re really choosing whether to ship a LoRA or not. Knowing why LoRA works tells you when it’s the right tool.
The short answer
LoRA = freeze base weights + add a low-rank update ΔW = BA + only train A and B
Instead of learning a new full-size weight matrix, you learn two skinny matrices whose product approximates the update. If W is 4096×4096 and your rank r is 8, then B is 4096×8 and A is 8×4096. That’s ~65k trainable numbers in place of ~16.8M — about 256× fewer parameters — while keeping most of the quality.
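The arithmetic is worth checking once by hand; a few lines of plain Python reproduce the numbers above:

```python
# Sanity-checking the parameter counts above (pure Python, no dependencies).
d = k = 4096   # shape of the frozen weight matrix W
r = 8          # LoRA rank

full_update = d * k            # a full-size delta W: 16,777,216 (~16.8M)
lora_update = d * r + r * k    # B (4096x8) plus A (8x4096): 65,536 (~65k)

print(full_update, lora_update, full_update // lora_update)  # ratio: 256
```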
How it works
The intrinsic-rank hypothesis
The whole technique rests on a guess that turned out to be right in practice: the adaptation a fine-tune performs lives in a low-dimensional subspace, even though the model itself is huge.
This wasn’t pulled out of nowhere. Aghajanyan et al. (2020), in Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning, showed you can fine-tune RoBERTa to ~90% of full performance on MRPC by optimizing only ~200 parameters projected randomly back into the full weight space. One reading of that result: the pretrained model already has the right features, and fine-tuning just steers them. Hu et al. (2021) took the next step: if the effective update is low-dimensional, why not bake that constraint into the parameterization itself? Don’t sample a random subspace — learn one, by writing the update as a product of two low-rank matrices.
So the hypothesis is: ΔW (the change to a weight matrix during fine-tuning) is well-approximated by BA, where B ∈ ℝ^{d×r} and A ∈ ℝ^{r×k} and r is much smaller than d or k.
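One way to make “well-approximated by BA” concrete is a truncated SVD, which (by the Eckart–Young theorem) gives the best rank-r factorization of any matrix. The NumPy sketch below builds a synthetic ΔW that is low-rank plus noise (an assumption standing in for what the hypothesis claims about real fine-tuning updates) and measures how little a rank-8 factorization loses:

```python
# Illustrating the rank-r factorization, not proving the hypothesis:
# the low-rank structure of delta_w is constructed by assumption here.
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 512, 512, 8

# A genuinely low-rank update plus a little noise.
delta_w = rng.normal(size=(d, r)) @ rng.normal(size=(r, k))
delta_w += 0.01 * rng.normal(size=(d, k))

# Truncated SVD: the best rank-r approximation, written as B @ A.
U, S, Vt = np.linalg.svd(delta_w, full_matrices=False)
B = U[:, :r] * S[:r]   # d x r
A = Vt[:r, :]          # r x k

rel_err = np.linalg.norm(delta_w - B @ A) / np.linalg.norm(delta_w)
print(f"relative error of rank-{r} factorization: {rel_err:.4f}")  # tiny
```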
The empirical result: yes, often it is. r=8 frequently matches r=64 on downstream tasks. The QLoRA paper, in an Alpaca-style sweep on LLaMA-7B with LoRA applied to all layers, reported that rank had effectively no effect on final task performance — a narrower observation than “rank doesn’t matter,” but a striking one.
What this doesn’t prove: it doesn’t show that a full fine-tune “actually” only moves weights in a low-rank subspace. It only shows that if you constrain the update to be low-rank, you don’t lose much. Which is what you wanted operationally; the deeper claim is still under active debate.
The mechanics
For a frozen pretrained weight matrix W₀ ∈ ℝ^{d×k}, the LoRA parameterization replaces
y = W₀ x
with
y = W₀ x + (α/r) · B A x
where B ∈ ℝ^{d×r}, A ∈ ℝ^{r×k}, and α is a scalar scaling factor.
In the original paper’s scheme, B is initialized to zero and A to random Gaussian, so BA = 0 at step 0 and the adapted model starts as an exact copy of the base. (Library defaults differ in details — e.g. Hugging Face PEFT uses Kaiming-uniform for A and zeros for B — but the “starts at zero update” property is preserved.) Training updates only A and B; W₀ never moves.
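A minimal version of this layer is easy to write down. The PyTorch sketch below is for intuition, not the PEFT implementation; the Gaussian scale of 0.01 for A is an arbitrary choice here, and real libraries pick their init more carefully.

```python
# A minimal LoRA linear layer, written directly from the equations above.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze W0 (and its bias)
            p.requires_grad = False
        d, k = base.out_features, base.in_features
        # Original-paper init: A ~ Gaussian, B = 0, so BA = 0 at step 0
        # and the adapted model starts as an exact copy of the base.
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)
        self.B = nn.Parameter(torch.zeros(d, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W0 x + (alpha/r) * B A x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

To adapt a real model you would swap, say, each attention projection for LoRALinear(proj, r=8); only A and B receive gradients.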
A few details worth knowing:
- Why two matrices? Rank-r updates can be written as the sum of r outer products. BA is just that, factored. The r is the bottleneck — it’s the only place rank can be lost.
- The α/r scaling factor. It decouples the scale of the update from the choice of rank. The paper itself just calls α a constant they didn’t tune; in practice many recipes tie α to r (often α = r or α = 2r) so you can sweep rank without re-tuning the learning rate. The reference loralib actually defaults α = 1 — the “tie alpha to rank” convention is a community norm, not a paper prescription.
- Where you put adapters matters. The original paper applied LoRA only to the attention projection matrices (W_q, W_v). Later practice often extends adapters to the feed-forward blocks too; QLoRA, for instance, applies LoRA to all linear layers in a transformer block. I don’t have a clean meta-analysis to cite for “all-layers always wins,” so treat this as common practice rather than a settled result. (A config sketch follows this list.)
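Here is the placement question expressed as a Hugging Face PEFT config. This is a sketch: the target_modules names are LLaMA-style and vary by architecture, and defaults drift between PEFT versions.

```python
# Original-paper placement (attention q/v projections only), via PEFT.
# Module names are LLaMA-style; other architectures use different names.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                                  # rank of the update
    lora_alpha=16,                        # the alpha in alpha/r; alpha = 2r is a common convention
    target_modules=["q_proj", "v_proj"],  # widen this list for all-layers LoRA
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()        # typically well under 1% trainable
```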
Why the savings are this dramatic
Stack a few effects:
- Trainable parameters drop ~100–1000×. ~65k vs. ~16.8M per matrix in the example above. Across a whole model, well under 1% of weights are trainable.
- Optimizer state drops by the same factor. Adam’s two extra FP32 numbers per trainable parameter are now negligible. This is the real memory win — bigger than the parameter-count win for most setups, because optimizer state is what blows up VRAM in full fine-tuning.
- Adapters are tiny on disk. A 7B LoRA adapter is on the order of tens of MB; the base model is ~14 GB at FP16. Distributing per-customer fine-tunes becomes feasible.
- No inference penalty if you merge. At serving time you can compute W = W₀ + (α/r) · BA once and use the merged weights. Same forward-pass cost as the base model. Or you can keep the adapter separate and hot-swap, paying a small overhead per layer. (A minimal merge sketch follows this list.)
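The merge itself is one line of math. A minimal sketch in the notation above; in practice PEFT’s merge_and_unload() does this for every adapted layer.

```python
# Merging per W = W0 + (alpha/r) * B @ A, so serving costs the same as the base.
import torch

@torch.no_grad()
def merge_lora(W0: torch.Tensor, B: torch.Tensor, A: torch.Tensor,
               alpha: float, r: int) -> torch.Tensor:
    # Same shape as W0, so it can be swapped in place of the original weight.
    return W0 + (alpha / r) * (B @ A)
```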
QLoRA: stacking the trick on quantization
QLoRA (Dettmers et al., May 2023) is the most consequential follow-up. The base model is quantized to 4-bit (NF4, a data type tailored for normally-distributed weights), kept frozen, and dequantized on the fly to FP16/BF16 to compute forward and backward passes. Only the LoRA adapters and the optimizer state for them live in 16-bit. The headline: fine-tune a 65B model on one 48 GB GPU, recovering full 16-bit fine-tuning quality on the benchmarks they tested. This is what made “fine-tune a frontier-scale model on a workstation” actually true.
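In config form the recipe is short. A sketch assuming the transformers + bitsandbytes + peft stack; these APIs shift between versions, so treat the flags as indicative rather than definitive.

```python
# The QLoRA recipe: 4-bit NF4 frozen base + 16-bit LoRA adapters on top.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # the NF4 data type from the paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to BF16 for compute
)
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b",
                                             quantization_config=bnb)
model = prepare_model_for_kbit_training(model)

# QLoRA applies adapters to all linear layers; "all-linear" needs a recent PEFT.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         target_modules="all-linear",
                                         task_type="CAUSAL_LM"))
```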
Where it gets subtle
- LoRA is not literally identical to full fine-tuning. A late-2024 paper, LoRA vs Full Fine-tuning: An Illusion of Equivalence, argues the two reach different solutions in weight space even when downstream metrics look similar — different generalization, different forgetting behavior. The practical advice from the LoRA community has long been: treat LoRA as the default but evaluate against full fine-tuning when the task is hard or the data is large. (I haven’t independently re-derived the paper’s claims; flagging it as a real ongoing debate, not a settled question.)
- Rank isn’t the only knob. Which layers you adapt, the α scaling, the learning rate, and whether you also tune embeddings all matter. The rank-vs-quality curve is famously flat for many tasks, which is why r=8 keeps working.
- It’s parameter-efficient, not knowledge-efficient. LoRA shrinks how many parameters update, not how much data you need. If your fine-tune is bad, more rank won’t save you; better data will.
- Knowledge injection is still hard. As with full fine-tuning, LoRA is a worse tool for stuffing new factual knowledge into a model than for adjusting style, format, or task framing. Retrieval is usually the right answer for “here are the facts I want it to know.”
The whole technique is a bet on a structural observation about fine-tuning, not a clever optimizer trick. That’s why it generalizes — the same idea works for transformer LLMs, diffusion image models, and just about anything else where you start from a strong pretrained base.
Famous related terms
- Full fine-tuning — full fine-tuning = unfreeze every weight + run gradient descent on the whole model. The baseline LoRA is compared against. Best quality when you can afford it; rarely worth it.
- PEFT — PEFT = umbrella for "fine-tune by training a tiny number of extra parameters". LoRA is the most popular member; adapters, prefix tuning, IA³, and prompt tuning are siblings.
- QLoRA — QLoRA = 4-bit quantized frozen base + 16-bit LoRA adapters on top. Dettmers et al., 2023. The reason single-GPU fine-tuning of 65B models is on the table.
- Adapters (Houlsby et al., 2019) — adapter = small bottleneck MLP inserted between transformer layers + only train it. The pre-LoRA PEFT idea; LoRA’s main practical advantage is no extra inference latency when you merge.
- DoRA — DoRA ≈ LoRA + decompose into magnitude and direction. A 2024 variant that often beats LoRA at the same parameter budget; whether it’s worth the added complexity depends on the workload.
- Why fine-tuning is cheap — the broader story LoRA is one ingredient of.
- Why VRAM is the bottleneck — explains why the optimizer-state savings are the win that matters.
- Intrinsic dimension — intrinsic dimension = the smallest subspace you can fine-tune in and still do well. The Aghajanyan/Li line of work LoRA built its hypothesis on.
Going deeper
- LoRA: Low-Rank Adaptation of Large Language Models (Hu, Shen, Wallis, Allen-Zhu, Li, Wang, Wang, Chen — arXiv 2106.09685, June 2021; ICLR 2022). The paper.
- Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning (Aghajanyan, Zettlemoyer, Gupta — 2020). The empirical result LoRA’s hypothesis sits on top of.
- QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers, Pagnoni, Holtzman, Zettlemoyer — arXiv 2305.14314, May 2023; NeurIPS 2023). The follow-up that made fine-tuning a 65B model on one GPU real.
- The Hugging Face PEFT library — five minutes of reading the LoRA example code is the fastest way to internalize how small the trainable footprint actually is.
- LoRA vs Full Fine-tuning: An Illusion of Equivalence (arXiv 2410.21228, 2024) — pushback on the strongest version of the “LoRA = full fine-tune” claim. Useful for calibrating how much to trust LoRA on hard tasks.