Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why model merging works at all

Take two fine-tunes of the same model, average their weights element-wise, and you often get a model better than either parent. Naively, this shouldn't work — neural net loss surfaces are wildly non-convex. The reason it works tells you something deep about where fine-tuning actually lives.


Why it exists

Try this thought experiment. You have two fine-tuned versions of the same base LLM — one tuned for code, one tuned for medical Q&A. Each is a giant tensor of weights. You want a model that’s good at both. The textbook moves are: train a third model on a mixture of both datasets, or use a router that picks one expert per query, or fine-tune on top of one of them with the other’s data.

Now consider the absurd move: just take the two weight tensors and average them, element by element. merged = (model_A + model_B) / 2. No retraining, no router, no extra data. Just arithmetic on the parameters of two completely separate fine-tunes.
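In PyTorch terms the whole operation is one dictionary comprehension. A minimal sketch, assuming both checkpoints are plain state_dicts saved from the same architecture (the file names are placeholders):

```python
import torch

# Minimal sketch of naive merging. Both files are assumed to be plain
# state_dicts of the same architecture, so keys and shapes line up exactly.
state_a = torch.load("finetune_code.pt", map_location="cpu")
state_b = torch.load("finetune_medical.pt", map_location="cpu")

assert state_a.keys() == state_b.keys(), "merging requires a shared architecture"

merged = {name: (state_a[name] + state_b[name]) / 2 for name in state_a}

torch.save(merged, "merged.pt")
```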

The naive prediction is that this should produce garbage. Neural networks are highly non-linear functions of their weights. The loss surface they’re trained on is famously non-convex — that is, full of hills, valleys, and saddle points, where the midpoint between two low-loss configurations could easily be a high-loss configuration. Average two reasonable solutions to a hard optimization problem and, in the general case, you get an unreasonable one. That’s why nobody used to bother trying.

But empirically, on fine-tunes that share a pretrained base, weight averaging works. Wortsman et al.’s “Model soups” paper (2022, arXiv:2203.05482) showed that averaging the weights of dozens of independently fine-tuned CLIP models produced a single model that beat every individual ingredient on ImageNet, with no extra inference cost. Since then, the trick has spread: task-vector arithmetic, TIES-merging, DARE, and a sprawling ecosystem of merge recipes on Hugging Face producing models that often top the leaderboards their components couldn’t crack.

The interesting question is why this works. The answer turns out to be specific and load-bearing: fine-tunes don’t actually go very far from where they started.

Why it matters now

Model merging is now a routine production technique. A meaningful fraction of the open-weight models ranking high on community leaderboards are merges, not full training runs — produced for thousands of dollars in compute (sometimes much less) instead of millions. The economics are striking: if you have ten fine-tunes already trained, you can produce a hundred candidate merges for the cost of inference-time experimentation, then ship the best one.

The technique is also load-bearing for federated learning, where you can’t share data across silos but you can share weights, and for continual learning, where you want to add a capability without retraining from scratch. In both cases, the question “is averaging weights a sensible operation?” used to have an embarrassed answer (“kind of, sometimes, if you squint”). The post-2022 literature converted that into a real engineering tool — bounded, but real.

The short answer

model merging = element-wise weight averaging of fine-tunes that share a pretrained starting point

It works because fine-tuning, despite the name, doesn’t actually move the model very far. Fine-tunes from the same pretrained checkpoint stay inside a roughly flat, connected region of the loss surface — the same loss basin their parent lived in. The straight line between two points inside one basin stays inside the basin. So the average isn’t a leap into the void; it’s a step inside the neighborhood the fine-tunes were already exploring.

How it works

The mechanism is best understood through three observations, in order.

1. Fine-tuning is a small perturbation

Pretraining a frontier LLM involves trillions of tokens and weeks on thousands of GPUs. Fine-tuning typically involves a few thousand to a few million tokens over a few hours. The gradients are smaller, the learning rate is lower, and you stop early. Whatever the metaphor “fine-tuning” suggests, the literal fact is that the weights barely move — the L2 distance between the pretrained checkpoint and a typical fine-tune is tiny relative to the scale of the weights themselves.

This is the same observation that makes LoRA work: the change induced by fine-tuning is low-rank and small. If fine-tuning genuinely traversed the loss landscape, LoRA wouldn’t be a good approximation — but it is.
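You can check the “weights barely move” claim on your own checkpoints by measuring the fine-tune’s displacement as a fraction of the base norm. A sketch; the function name is mine, and what counts as “tiny” varies by model and recipe:

```python
import torch

def relative_drift(base_state, tuned_state):
    """L2 distance from base to fine-tune, divided by the base's own norm."""
    sq_diff, sq_base = 0.0, 0.0
    for name, base_w in base_state.items():
        delta = tuned_state[name].float() - base_w.float()
        sq_diff += delta.pow(2).sum().item()
        sq_base += base_w.float().pow(2).sum().item()
    return (sq_diff ** 0.5) / (sq_base ** 0.5)

# A small ratio here is the quantitative version of "fine-tuning is a
# small perturbation": the fine-tune sits close to its parent.
```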

2. Linear mode connectivity

Frankle, Dziugaite, Roy, and Carbin (2020, arXiv:1912.05671) — building on earlier work by Garipov, Izmailov, Podoprikhin, Vetrov, and Wilson on loss-surface geometry — established a result called linear mode connectivity: two networks trained from the same initialization (with the same data ordering up to some point, then diverging) tend to be connected by a straight line of low loss in weight space. You can interpolate between them and the loss along the path doesn’t spike.

This is the load-bearing fact for merging. If A and B are linearly mode-connected, then (A + B)/2 has loss comparable to A and B — not somewhere on the other side of a barrier. The mean is in the basin.
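To see the flatness directly, sweep the interpolation coefficient from 0 to 1 and evaluate at each point. A sketch, where evaluate stands in for whatever held-out-loss loop you already have:

```python
import torch

def interpolate(state_a, state_b, alpha):
    """Weights at position alpha on the straight line from A to B."""
    return {name: (1 - alpha) * state_a[name] + alpha * state_b[name]
            for name in state_a}

def loss_along_path(model, state_a, state_b, evaluate, steps=11):
    # Linear mode connectivity predicts a flat curve: no spike between
    # alpha=0 (pure A) and alpha=1 (pure B). `evaluate` is a stand-in
    # for your own evaluation loop on held-out data.
    losses = []
    for i in range(steps):
        alpha = i / (steps - 1)
        model.load_state_dict(interpolate(state_a, state_b, alpha))
        losses.append(evaluate(model))
    return losses
```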

The shared-initialization condition turns out to be the crucial caveat. Two networks trained from different random inits typically don’t connect linearly; the line between them passes through high-loss regions. This is why you can’t just merge any two models — the recipe explicitly requires a shared parent.

3. Pretraining as the basin selector

Here’s the synthesis. Pretraining is enormously expensive partly because it does the work of finding a deep, wide loss basin in the absurdly high-dimensional weight space — a basin that generalizes well, where many directions of small perturbation still produce a working language model. Fine-tuning then does the much smaller job of relocating to a particular point inside that basin that happens to be good at the fine-tuning task.

When you have two fine-tunes sharing a pretrained parent, you have two points inside the same basin. Linear mode connectivity says the straight line between them stays in the basin. Averaging is just picking the midpoint of that line. The averaged model inherits whatever properties the basin has — including, often, the union of the capabilities the two endpoints learned, because the directions in weight space that encode “knows about code” and “knows about medical Q&A” turn out to be largely orthogonal at this scale, and orthogonal updates compose by addition.
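The task-vector arithmetic mentioned earlier makes this picture operational: subtract the base to get each fine-tune’s delta, check that the deltas barely overlap, then add them back onto the base. A sketch under those assumptions; the helper names are illustrative:

```python
import torch

def task_vector(base_state, tuned_state):
    """The delta that fine-tuning added on top of the base."""
    return {k: tuned_state[k] - base_state[k] for k in base_state}

def add_task_vectors(base_state, vectors, scale=1.0):
    """base + scale * (sum of deltas): orthogonal updates compose by addition."""
    merged = {k: v.clone() for k, v in base_state.items()}
    for vec in vectors:
        for k in merged:
            merged[k] = merged[k] + scale * vec[k]
    return merged

def cosine(u, v):
    """Cosine similarity of two task vectors, flattened across all tensors.
    A value near zero is the 'largely orthogonal' claim in the text."""
    dot = nu = nv = 0.0
    for k in u:
        a, b = u[k].float().flatten(), v[k].float().flatten()
        dot += torch.dot(a, b).item()
        nu += a.pow(2).sum().item()
        nv += b.pow(2).sum().item()
    return dot / ((nu ** 0.5) * (nv ** 0.5))
```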

That last clause is where the deeper interpretive work happens, and it’s where I want to be careful.

What goes wrong, and where the seams are

The story above is clean. Reality is messier:

- The shared-parent requirement is strict. Merge two models trained from different random initializations and the straight line between them crosses high-loss territory; the output is garbage, exactly as the naive prediction says.
- The small-perturbation premise can fail. Fine-tunes that drift far from the base (long training runs, aggressive learning rates, heavy distribution shift) can leave the flat part of the basin, and merge quality degrades with them.
- The task directions are only largely orthogonal. Where two fine-tunes push the same weights in conflicting directions, naive averaging cancels or corrupts both updates. That interference is exactly what TIES-merging (which resolves sign conflicts between task vectors) and DARE (which sparsifies the deltas before adding them) were designed to mitigate.

So the honest version of “model merging works” is: weight-space averaging of shared-parent fine-tunes is a real, useful, surprisingly cheap operation that exploits a specific geometric fact about the loss surface around pretrained checkpoints. It’s not arbitrage. The pretraining run did the expensive work — finding the basin — and merging just rearranges what’s inside it.

What I’m not sure about

The “fine-tunes share a basin” framing is well-supported empirically but the theoretical understanding is partial. The exact width of the basin, why pretraining produces wide basins (versus narrow ones that wouldn’t tolerate this), and the precise scaling of merge quality with number of ingredients — these are active research questions, not settled.

I also can’t tell you definitively which merge recipe is best in 2026. Linear averaging, TIES, DARE, task arithmetic, SLERP, and various weighted hybrids each have papers showing they win on some benchmark. The community-favored recipes shift; the safest claim is that some merge recipe usually beats naive averaging, not that any one recipe dominates.
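For concreteness, here is my reading of DARE’s core move, with the caveat that the drop rate is a tunable and real implementations wrap this in more machinery: zero out a random fraction of each task vector’s entries, then rescale the survivors so the expected update is unchanged.

```python
import torch

def dare_sparsify(task_vec, drop_rate=0.9):
    # DARE-style drop-and-rescale, as I understand it: randomly zero a
    # fraction of each delta tensor, then divide the survivors by
    # (1 - drop_rate) so the update is unchanged in expectation.
    kept = {}
    for name, delta in task_vec.items():
        mask = torch.rand_like(delta, dtype=torch.float32) >= drop_rate
        kept[name] = delta * mask / (1.0 - drop_rate)
    return kept
```

The sparsified deltas are then added back onto the base, as in the task-arithmetic sketch above.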

Going deeper