Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why does mixture-of-experts exist?

A 671B-parameter model whose per-token compute is closer to a 37B one. The trick isn't compression — it's that most of the weights sit out most of the time.

AI & ML · intermediate · Apr 29, 2026

Why it exists

There’s a frustrating asymmetry at the heart of LLM scaling. Bigger models are smarter — that part has held up depressingly well. But every parameter you add gets paid for twice: once at training time (more flops to push gradients through) and again at inference time (more bytes to stream through GPU memory for every single token). At some point you can’t afford the model you’d actually like to have.

The frustrating part is that you suspect, looking at how the model behaves, that you don’t need every parameter for every token. The bit of the network that knows Python syntax probably isn’t doing much when you ask about French grammar. The bit that has memorized obscure chemistry doesn’t fire on small talk. A dense model — one where every weight participates in every forward pass — is paying full price for capacity it isn’t using on this particular input.

Mixture-of-experts (MoE) is the architecture that takes that intuition seriously. Instead of one big feed-forward block in some or all transformer layers, you have N parallel feed-forward blocks (“experts”), and a tiny router that, per token, picks the k it thinks are most relevant. The other N − k don’t run. You get the capacity of a much bigger model and the per-token compute of a much smaller one — at the cost of a lot more memory and a lot more engineering.

Why it matters now

It matters because the frontier has quietly gone sparse.

If you’re an engineer reasoning about cost, latency, or where the next models are going, “active parameters” vs “total parameters” is the distinction that matters. A 671B MoE and a 671B dense model are not the same kind of object — they have wildly different active compute, bandwidth profiles, and serving topologies. (The total parameter footprint is comparable; what differs is how much of it runs per token and how it has to be laid out across hardware.)

The short answer

MoE = N expert FFNs + a router that picks k of them per token

In MoE layers, the feed-forward block is replaced by a bank of N parallel feed-forward sub-networks. A small router (usually a single linear layer plus a top-k selection) reads the token’s hidden state and decides which k experts get to process it. Only those k run. The output is a weighted combination of their results, then the rest of the transformer continues as normal. Training jointly learns the experts and the router. (Not every transformer block has to be an MoE layer — many architectures interleave dense and MoE layers — but for simplicity assume every FFN is replaced.)
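In code, the routing decision on its own fits in a few lines. A minimal sketch (NumPy, made-up shapes; softmax over the selected scores is just one common normalization choice):

```python
import numpy as np

def route(hidden, router_weights, k=2):
    """Pick k of N experts for one token.

    hidden:         (d_model,)   the token's hidden state
    router_weights: (N, d_model) the router, a single linear layer
    Returns the k chosen expert indices and their mixing weights.
    """
    scores = router_weights @ hidden           # one score per expert, shape (N,)
    top = np.argsort(scores)[-k:]              # indices of the k highest scores
    w = np.exp(scores[top] - scores[top].max())
    return top, w / w.sum()                    # normalize over the selected k only
```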

How it works

Picture the standard transformer block:

hidden -> attention -> add+norm -> FFN -> add+norm -> hidden'

The FFN is a fat two-layer MLP, and in a dense model it’s where the bulk of the parameters live. MoE swaps that single FFN for N of them in parallel:

hidden -> attention -> add+norm -> router picks k of N FFNs -> combine -> add+norm -> hidden'

For a token x, the router computes scores router(x) -> R^N and takes the top k (often k=2; k=1 in Switch Transformer, k=8 in DeepSeek-V3). It normalizes those k scores into weights (the exact normalization varies: softmax is common; DeepSeek-V3 uses sigmoid affinities normalized over the selected experts), runs only those k experts, and combines their outputs as a weighted sum. The other N − k experts contribute nothing for this token. Different tokens take different paths through the same layer. That’s the whole idea.
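Here is the whole layer as a single-token sketch, again in NumPy with illustrative shapes. The experts are plain two-layer ReLU MLPs and the gate uses softmax over the selected scores; a real implementation batches tokens, uses the model’s actual activation, and may normalize differently, as noted above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, N, k = 64, 256, 8, 2                 # illustrative sizes only

# N expert FFNs (two weight matrices each) plus one small router.
experts = [(rng.standard_normal((d_ff, d_model)) * 0.02,
            rng.standard_normal((d_model, d_ff)) * 0.02) for _ in range(N)]
router_w = rng.standard_normal((N, d_model)) * 0.02

def moe_ffn(x):
    scores = router_w @ x                           # router scores, shape (N,)
    top = np.argsort(scores)[-k:]                   # top-k expert indices
    gate = np.exp(scores[top] - scores[top].max())
    gate /= gate.sum()                              # weights over the selected k
    out = np.zeros_like(x)
    for g, i in zip(gate, top):                     # only k of the N experts run
        w_in, w_out = experts[i]
        out += g * (w_out @ np.maximum(w_in @ x, 0.0))
    return out                                      # weighted sum of k expert outputs

token = rng.standard_normal(d_model)
print(moe_ffn(token).shape)                         # (64,)
```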

A few things follow from that picture, and most of MoE engineering is dealing with them.

Why “active” vs “total” parameters

The router activates a fixed k of N experts per token. So per-token compute scales with k, not N. The capacity of the model — how much knowledge it can encode — scales with N. So you decouple the two:

DeepSeek-V3 is the cleanest demonstration: 671B total / 37B active is roughly an 18× ratio. You’re getting the knowledge-soaking capacity of something near 671B with the per-token compute closer to 37B. Whether the quality matches a hypothetical dense 671B is a different question, and there’s no public like-for-like dense-671B baseline to compare against. The honest claim is “much more capacity per active flop than dense,” not “free lunch.”
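The decoupling is easy to see with back-of-the-envelope arithmetic. The sizes below are made up for illustration (they are not DeepSeek-V3’s actual dimensions); the point is only that total FFN parameters grow with N while active ones grow with k:

```python
d_model, d_ff = 4096, 14336            # hidden size and expert inner size (assumed)
N, k = 64, 2                           # experts per MoE layer, experts used per token

params_per_expert = 2 * d_model * d_ff         # two weight matrices per expert FFN
total_ffn  = N * params_per_expert             # what you have to store
active_ffn = k * params_per_expert             # what one token actually computes with

print(f"total : {total_ffn / 1e9:.2f}B FFN params per layer")   # grows with N
print(f"active: {active_ffn / 1e9:.2f}B FFN params per layer")  # grows with k
print(f"ratio : {total_ffn / active_ffn:.0f}x")                 # = N / k = 32
```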

The router is a load-balancing nightmare

The first thing that goes wrong if you’re not careful: the router collapses. A few experts win every routing decision, the rest are never picked, never get gradient, and atrophy. You’ve trained an N-expert model that effectively uses 2.

The fix is an auxiliary load-balancing loss added to training: a penalty term that nudges the router toward using all experts roughly equally over a batch. Different MoE papers use different versions (Switch Transformer’s auxiliary load-balancing loss; DeepSeek-V3’s auxiliary-loss-free balancing scheme plus a small sequence-wise balance loss). They don’t fully solve the problem — a lot of recent MoE routing research is in schemes that balance better with less explicit pressure — but they keep training from degenerating into “one expert does everything.”
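For concreteness, here is a simplified reading of the Switch-Transformer-style auxiliary loss (top-1 routing, NumPy): per expert, it multiplies the fraction of tokens actually routed there by the mean router probability that expert received, and it is smallest when both are uniform. A sketch of the idea, not a drop-in from any codebase.

```python
import numpy as np

def load_balance_loss(router_probs, assignments, N):
    """Switch-style auxiliary load-balancing loss (simplified, top-1).

    router_probs: (T, N) softmax probabilities the router produced per token
    assignments:  (T,)   index of the expert each token was actually sent to
    """
    T = router_probs.shape[0]
    frac_routed = np.bincount(assignments, minlength=N) / T   # f_i per expert
    mean_prob   = router_probs.mean(axis=0)                   # P_i per expert
    return N * np.sum(frac_routed * mean_prob)                # ~1 when perfectly balanced
```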

There’s also capacity factor: each expert can only process so many tokens per batch. If too many tokens want the same expert in the same step, somebody loses. Many systems set a capacity factor and drop overflowed tokens; others engineer the routing and balancing carefully enough to avoid drops in practice.
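Capacity is usually expressed as a multiplier on the even split. A sketch of the bookkeeping, using the simplest overflow policy (drop: the token skips the expert and rides the residual connection); real systems vary:

```python
import numpy as np

def apply_capacity(assignments, N, capacity_factor=1.25):
    """Cap tokens per expert in one batch; overflow tokens lose their slot.

    assignments: (T,) chosen expert per token (top-1 for simplicity)
    Returns a boolean mask: True where the token kept its expert.
    """
    T = len(assignments)
    capacity = int(np.ceil(capacity_factor * T / N))   # slots per expert
    counts = np.zeros(N, dtype=int)
    kept = np.zeros(T, dtype=bool)
    for t, e in enumerate(assignments):                # earlier tokens win the slots
        if counts[e] < capacity:
            counts[e] += 1
            kept[t] = True
    return kept
```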

The hidden bill: memory

Here’s the seam in the “MoE is cheaper” story. Active params are what scales the compute per token. But for inference you have to keep all N experts somewhere in the serving system — on a single GPU, across a multi-GPU node, or sharded across hosts — because the router might pick any of them on any token. So while MoE reduces active compute relative to an equally large dense model, it adds parameter-residency and cross-device routing costs on top of the memory-bandwidth wall that already dominates LLM serving.
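To put numbers on it: serving memory is set by total parameters, while per-token compute is set by active ones. A rough weights-only footprint, assuming 16-bit weights (quantization changes the constant, not the shape of the problem):

```python
total_params  = 671e9   # all experts must be resident somewhere in the serving system
active_params = 37e9    # what a single token's forward pass actually uses
bytes_per_param = 2     # assumed 16-bit weights

print(f"weights to hold : ~{total_params  * bytes_per_param / 1e9:.0f} GB")  # ~1342 GB
print(f"dense-37B-sized : ~{active_params * bytes_per_param / 1e9:.0f} GB")  # ~74 GB
```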

This is why MoE inference is mostly a story about distributed serving: the experts get sharded across GPUs (expert parallelism), and every MoE layer turns into an all-to-all exchange that sends each token to the devices holding its selected experts and gathers the results back.

So the MoE pitch isn’t “cheaper to deploy.” It’s “potentially cheaper per quality unit to deploy, if you can afford the extra memory and the routing overhead.” Whether that math works depends a lot on your traffic pattern and your hardware.

Where MoE breaks the textbook intuitions

A few things that surprise people the first time:

Going deeper