Why model merging works at all
Take two fine-tunes of the same model, average their weights element-wise, and you often get a model better than either parent. Naively, this shouldn't work — neural net loss surfaces are wildly non-convex. The reason it works tells you something deep about where fine-tuning actually lives.
Why it exists
Try this thought experiment. You have two fine-tuned versions of the same base LLM — one tuned for code, one tuned for medical Q&A. Each is a giant tensor of weights. You want a model that’s good at both. The textbook moves are: train a third model on a mixture of both datasets, or use a router that picks one expert per query, or fine-tune on top of one of them with the other’s data.
Now consider the absurd move: just take the two weight tensors and average them, element by element. merged = (model_A + model_B) / 2. No retraining, no router, no extra data. Just arithmetic on the parameters of two completely separate fine-tunes.
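In code, the whole operation is one dictionary comprehension. A minimal numpy sketch, with toy parameter dicts standing in for real checkpoints (the names and shapes here are illustrative, not any real model's):

```python
import numpy as np

def merge_average(model_a, model_b):
    """Uniform weight merge: element-wise mean of two parameter dicts.
    Both models must share the same architecture (same keys, same shapes)."""
    assert model_a.keys() == model_b.keys()
    return {name: (model_a[name] + model_b[name]) / 2.0
            for name in model_a}

# Toy "fine-tunes": two parameter dicts with identical shapes.
model_a = {"w": np.array([1.0, 2.0]), "b": np.array([0.0])}
model_b = {"w": np.array([3.0, 4.0]), "b": np.array([2.0])}

merged = merge_average(model_a, model_b)
# merged["w"] → [2.0, 3.0], merged["b"] → [1.0]
```

For real checkpoints the dicts would come from something like a framework's state-dict export, but the arithmetic is exactly this.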
The naive prediction is that this should produce garbage. Neural networks are highly non-linear functions of their weights. The loss surface they’re trained on is famously non-convex — that is, full of hills, valleys, and saddle points, where the midpoint between two low-loss configurations could easily be a high-loss configuration. Average two reasonable solutions to a hard optimization problem and, in the general case, you get an unreasonable one. That’s why nobody used to bother trying.
But empirically, on fine-tunes that share a pretrained base, weight averaging works. Wortsman et al.’s “Model soups” paper (2022, arXiv:2203.05482) showed that averaging the weights of dozens of independently fine-tuned CLIP models produced a single model that beat every individual ingredient on ImageNet, with no extra inference cost. Since then, the trick has spread: task-vector arithmetic, TIES-merging, DARE, and a sprawling ecosystem of merge recipes on Hugging Face producing models that often top the leaderboards their components couldn’t crack.
The interesting question is why this works. The answer turns out to be specific and load-bearing: fine-tunes don’t actually go very far from where they started.
Why it matters now
Model merging is now a routine production technique. A meaningful fraction of the open-weight models ranking high on community leaderboards are merges, not full training runs — produced for thousands of dollars in compute (sometimes much less) instead of millions. The economics are striking: if you have ten fine-tunes already trained, you can produce a hundred candidate merges for the cost of inference-time experimentation, then ship the best one.
The technique is also load-bearing for federated learning, where you can’t share data across silos but you can share weights, and for continual learning, where you want to add a capability without retraining from scratch. In both cases, the question “is averaging weights a sensible operation?” used to have an embarrassed answer (“kind of, sometimes, if you squint”). The post-2022 literature converted that into a real engineering tool — bounded, but real.
The short answer
model merging = element-wise weight averaging of fine-tunes that share a pretrained starting point
It works because fine-tuning, despite the name, doesn’t actually move the model very far. Fine-tunes from the same pretrained checkpoint stay inside a roughly flat, connected region of the loss surface — the same loss basin their parent lived in. The straight line between two points inside one basin stays inside the basin. So the average isn’t a leap into the void; it’s a step inside the neighborhood the fine-tunes were already exploring.
How it works
The mechanism is best understood through three observations, in order.
1. Fine-tuning is a small perturbation
Pretraining a frontier LLM involves trillions of tokens and weeks on thousands of GPUs. Fine-tuning typically involves a few thousand to a few million tokens over a few hours. The gradients are smaller, the learning rate is lower, and you stop early. Whatever the metaphor “fine-tuning” suggests, the literal picture is that the weights barely move — the L2 distance between the pretrained checkpoint and a typical fine-tune is tiny relative to the scale of the weights themselves.
This is the same observation that makes LoRA work: the change induced by fine-tuning is low-rank and small. If fine-tuning genuinely traversed the loss landscape, LoRA wouldn’t be a good approximation — but it is.
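A quick way to see this on any pair of checkpoints is to compare the norm of the update to the norm of the base. A numpy sketch with synthetic arrays standing in for the two checkpoints — the 1% perturbation scale is an illustrative assumption, not a measured value from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for a pretrained checkpoint and a fine-tune of it:
# the "fine-tune" is the base plus a small perturbation.
base = rng.normal(size=100_000)                    # "pretrained" weights
finetune = base + 0.01 * rng.normal(size=100_000)  # "fine-tuned" weights

# Relative displacement: ||finetune - base|| / ||base||.
rel = np.linalg.norm(finetune - base) / np.linalg.norm(base)
# Here rel ≈ 0.01; the claim in the text is that real SFT checkpoints
# show similarly small ratios relative to the weight scale.
```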
2. Linear mode connectivity
Frankle, Dziugaite, Roy, and Carbin (2020, arXiv:1912.05671) — building on earlier work by Garipov, Izmailov, Podoprikhin, Vetrov, and Wilson on loss-surface geometry — established a result called linear mode connectivity: two networks trained from the same initialization (with the same data ordering up to some point, then diverging) tend to be connected by a straight line of low loss in weight space. You can interpolate between them and the loss along the path doesn’t spike.
This is the load-bearing fact for merging. If A and B are linearly mode-connected, then (A + B)/2 has loss comparable to A and B — not somewhere on the other side of a barrier. The mean is in the basin.
The shared-initialization condition turns out to be the crucial caveat. Two networks trained from different random inits typically don’t connect linearly; the line between them passes through high-loss regions. This is why you can’t just merge any two models — the recipe explicitly requires a shared parent.
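The standard empirical check for linear mode connectivity is an interpolation sweep: evaluate the loss at points along the straight line between the two weight vectors and look for a barrier. A toy numpy sketch of the procedure — here on a convex least-squares problem with two nearby solutions standing in for two fine-tunes, so the flat path is guaranteed by construction rather than discovered; for real networks you would run the same sweep over actual checkpoints:

```python
import numpy as np

def loss(w, X, y):
    """Mean squared error of a linear model y ≈ X @ w."""
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=200)

# Two low-loss solutions: the least-squares optimum and a nearby point,
# standing in for two fine-tunes of a shared parent.
w_a, *_ = np.linalg.lstsq(X, y, rcond=None)
w_b = w_a + 0.01 * rng.normal(size=5)

# Sweep the straight line w(t) = (1 - t) * w_a + t * w_b and record loss.
losses = [loss((1 - t) * w_a + t * w_b, X, y)
          for t in np.linspace(0.0, 1.0, 11)]

# The "barrier" is how far the worst point on the path rises above the
# worse endpoint. Linear mode connectivity = no (significant) barrier.
barrier = max(losses) - max(losses[0], losses[-1])
```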
3. Pretraining as the basin selector
Here’s the synthesis. Pretraining is enormously expensive partly because it does the work of finding a deep, wide loss basin in the absurdly high-dimensional weight space — a basin that generalizes well, where many directions of small perturbation still produce a working language model. Fine-tuning then does the much smaller job of relocating to a particular point inside that basin that happens to be good at the fine-tuning task.
When you have two fine-tunes sharing a pretrained parent, you have two points inside the same basin. Linear mode connectivity says the straight line between them stays in the basin. Averaging is just picking the midpoint of that line. The averaged model inherits whatever properties the basin has — including, often, the union of the capabilities the two endpoints learned, because the directions in weight space that encode “knows about code” and “knows about medical Q&A” turn out to be largely orthogonal at this scale, and orthogonal updates compose by addition.
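The orthogonality claim is easy to illustrate numerically: in high dimension, two independently derived task vectors have near-zero cosine similarity, so adding both to the base barely disturbs either one's direction. A numpy sketch with random perturbations standing in for real task vectors (the dimensionality and scale are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50_000  # stand-in for a model's parameter count

base = rng.normal(size=d)
# Two hypothetical fine-tunes as small, independent perturbations,
# standing in for "code" and "medical" specializations.
tau_code = 0.01 * rng.normal(size=d)  # task vector: finetune_A - base
tau_med = 0.01 * rng.normal(size=d)   # task vector: finetune_B - base

# Independent high-dimensional directions are nearly orthogonal:
cos = tau_code @ tau_med / (
    np.linalg.norm(tau_code) * np.linalg.norm(tau_med))

# Task arithmetic: compose capabilities by adding task vectors to the base.
multi_task = base + tau_code + tau_med
```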
That last clause is where the deeper interpretive work happens, and it’s where I want to be careful.
What goes wrong, and where the seams are
The story above is clean. Reality is messier:
- Permutation symmetry. Neural networks have a vast symmetry group — you can permute the neurons in a hidden layer (and correspondingly permute the rows of the next layer’s weight matrix) without changing the function. Two networks trained from different inits may compute similar functions but live in different “permutation copies” of the same basin. Naive averaging then destroys both. There’s a research line (Ainsworth, Hayase, Srinivasa 2022, Git Re-Basin) on permutation-aligning networks before averaging, with partial success on small-scale models.
- Interference between fine-tunes. When two fine-tunes both modify the same parameters in opposite directions, averaging cancels them. TIES-merging (Yadav et al., 2023) and DARE (Yu et al., 2023) are recipes that explicitly handle this — drop small or conflicting updates, keep the dominant ones. They consistently outperform plain averaging on multi-task merges.
- Catastrophic forgetting at the boundary. A fine-tune that drifts far from the pretrained parent — e.g. heavily RL’d reasoning models — may have left the original basin. Merging those doesn’t enjoy the linear-mode-connectivity guarantee. The community pattern is that SFT-style fine-tunes merge cleanly; aggressive RL post-training merges less cleanly.
- Capability superposition is fragile. The “orthogonal directions add up” intuition is approximate. In practice, merging too many fine-tunes degrades performance on each — the basin is connected but it isn’t infinite. Most successful merge recipes top out at a small number of ingredients, or use weighted combinations rather than uniform averages.
- It’s not magic. A merged model is bounded above by what’s reachable inside the basin. It can’t acquire a capability neither parent had. (You can sometimes appear to — but that’s usually because the capability was latent in the pretrained base and a parent unlocked it; the merge inherits the unlock.)
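The interference problem and the trim-and-resolve fix can be sketched in a few lines. This is a simplified illustration of the TIES recipe's three steps (trim, elect signs, merge survivors), not the paper's exact algorithm, and the task vectors are toy values chosen to exhibit a sign conflict:

```python
import numpy as np

def ties_merge(task_vectors, keep_frac=0.2):
    """Simplified TIES-style merge of task vectors (finetune - base).
    1. Trim: zero out all but the largest-magnitude entries per vector.
    2. Elect signs: per coordinate, the sign with more total magnitude wins.
    3. Merge: average surviving entries that agree with the elected sign."""
    tvs = np.stack(task_vectors)
    k = int(tvs.shape[1] * keep_frac)
    trimmed = np.zeros_like(tvs)
    for i, tv in enumerate(tvs):
        top = np.argsort(np.abs(tv))[-k:]   # indices of the k largest
        trimmed[i, top] = tv[top]
    elected = np.sign(np.sum(trimmed, axis=0))
    agree = (np.sign(trimmed) == elected) & (trimmed != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    return np.where(agree, trimmed, 0.0).sum(axis=0) / counts

# Coordinate 0 conflicts in sign: plain averaging would shrink it to 0.4,
# while the TIES-style merge keeps the dominant +1.0 update intact.
tv_a = np.array([1.0, 0.0, 0.5, 0.0, 0.0])
tv_b = np.array([-0.2, 0.8, 0.5, 0.0, 0.0])
merged = ties_merge([tv_a, tv_b], keep_frac=0.6)
# merged → [1.0, 0.8, 0.5, 0.0, 0.0]
```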
So the honest version of “model merging works” is: weight-space averaging of shared-parent fine-tunes is a real, useful, surprisingly cheap operation that exploits a specific geometric fact about the loss surface around pretrained checkpoints. It’s not arbitrage. The pretraining run did the expensive work — finding the basin — and merging just rearranges what’s inside it.
What I’m not sure about
The “fine-tunes share a basin” framing is well-supported empirically but the theoretical understanding is partial. The exact width of the basin, why pretraining produces wide basins (versus narrow ones that wouldn’t tolerate this), and the precise scaling of merge quality with number of ingredients — these are active research questions, not settled.
I also can’t tell you definitively which merge recipe is best in 2026. Linear averaging, TIES, DARE, task arithmetic, SLERP, and various weighted hybrids each have papers showing they win on some benchmark. The community-favored recipes shift; the safest claim is that some merge recipe usually beats naive averaging, not that any one recipe dominates.
Famous related terms
- Linear mode connectivity — LMC = property that two networks are connected by a low-loss straight line in weight space — the geometric fact that licenses merging.
- Task arithmetic — task vector = (fine-tuned weights) − (pretrained weights); compose tasks by adding/subtracting these vectors — Ilharco et al. 2022, Editing Models with Task Arithmetic. The “merging is just addition” framing made explicit.
- Model soups — model soup = average of many fine-tunes of the same architecture — Wortsman et al. 2022; the paper that popularized the technique at scale.
- TIES-merging — TIES = trim small updates + resolve sign conflicts + average survivors — Yadav et al. 2023, the standard “do better than uniform averaging” recipe.
- LoRA — adjacent story: also exploits “fine-tuning is small,” but compresses the update rather than averaging multiple updates.
- Permutation symmetry — permutation symmetry = permute neurons in a layer (and the matching weights in the next layer) without changing the function — the obstacle to merging across random inits. Solved partially by Git Re-Basin.
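The permutation symmetry is easy to verify directly: reorder a hidden layer's rows and the next layer's matching columns, and the network computes exactly the same function from a different point in weight space. A toy numpy MLP (shapes and the specific permutation are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 2-layer MLP: y = W2 @ relu(W1 @ x).
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

def forward(W1, W2, x):
    return W2 @ np.maximum(W1 @ x, 0.0)

# Permute the hidden units: reorder the rows of W1 and, to compensate,
# the columns of W2. The function is unchanged.
perm = np.array([2, 0, 3, 1])
W1_p = W1[perm]        # permute hidden-layer rows
W2_p = W2[:, perm]     # permute next layer's columns to match

y = forward(W1, W2, x)
y_p = forward(W1_p, W2_p, x)
# y and y_p are identical: two distinct points in weight space,
# one function. Averaging W1 with W1_p naively would wreck it.
```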
Going deeper
- Wortsman et al., Model soups (2022) — the paper that put weight averaging on the map for fine-tunes; results on CLIP and ImageNet are still the cleanest demonstration.
- Frankle et al., Linear Mode Connectivity and the Lottery Ticket Hypothesis (2020) — for the geometric “why,” paired with the earlier Garipov et al. Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs (2018).
- Ilharco et al., Editing Models with Task Arithmetic (2022) — for the arithmetic-on-task-vectors view, which clarifies what merging is doing compositionally.
- Yadav et al., TIES-Merging (2023) — the canonical “beat naive averaging” recipe; understanding the failure mode it addresses (sign conflicts) is more useful than memorizing the formula.