Why does mixture-of-experts exist?
A 671B-parameter model whose per-token compute is closer to a 37B one. The trick isn't compression — it's that most of the weights sit out most of the time.
Why it exists
There’s a frustrating asymmetry at the heart of LLM scaling. Bigger models are smarter — that part has held up depressingly well. But every parameter you add gets paid for twice: once at training time (more flops to push gradients through) and again at inference time (more bytes to stream through GPU memory for every single token). At some point you can’t afford the model you’d actually like to have.
The frustrating part is that you suspect, looking at how the model behaves, that you don’t need every parameter for every token. The bit of the network that knows Python syntax probably isn’t doing much when you ask about French grammar. The bit that has memorized obscure chemistry doesn’t fire on small talk. A dense model — one where every weight participates in every forward pass — is paying full price for capacity it isn’t using on this particular input.
Mixture-of-experts (MoE) is the architecture that takes that intuition seriously. Instead of one big feed-forward block in some or all transformer layers, you have N parallel feed-forward blocks (“experts”), and a tiny router that, per token, picks the k it thinks are most relevant. The other N − k don’t run. You get the capacity of a much bigger model and the per-token compute of a much smaller one — at the cost of a lot more memory and a lot more engineering.
Why it matters now
It matters because the frontier has quietly gone sparse.
- Mixtral 8x7B (Mistral, Jan 2024) — ~47B total parameters, ~13B active per token, two of eight experts routed per layer. (paper)
- DeepSeek-V3 (Dec 2024) — 671B total parameters, 37B active per token, with one shared expert plus 8 of 256 routed experts active. (technical report)
- GPT-4 is, per persistent industry rumor (originally from George Hotz, echoed by others), an MoE with on the order of eight experts. OpenAI has not confirmed the architecture, so treat the specifics as unverified — but the direction of the rumor is consistent with the rest of the field.
If you’re an engineer reasoning about cost, latency, or where the next models are going, “active parameters” vs “total parameters” is the distinction that matters. A 671B MoE and a 671B dense model are not the same kind of object — they have wildly different active compute, bandwidth profiles, and serving topologies. (The total parameter footprint is comparable; what differs is how much of it runs per token and how it has to be laid out across hardware.)
The short answer
MoE = N expert FFNs + a router that picks k of them per token
In MoE layers, the feed-forward block is replaced by a bank of N parallel feed-forward sub-networks. A small router (usually a single linear layer plus a top-k selection) reads the token’s hidden state and decides which k experts get to process it. Only those k run. The output is a weighted combination of their results, then the rest of the transformer continues as normal. Training jointly learns the experts and the router. (Not every transformer block has to be an MoE layer — many architectures interleave dense and MoE layers — but for simplicity assume every FFN is replaced.)
How it works
Picture the standard transformer block:
hidden -> attention -> add+norm -> FFN -> add+norm -> hidden'
The FFN is a fat two-layer MLP, and in a dense model it’s where the bulk of the parameters live. MoE swaps that single FFN for N of them in parallel:
hidden -> attention -> add+norm -> router picks k of N FFNs -> combine -> add+norm -> hidden'
For a token x, the router computes scores router(x) -> R^N and takes the top-k (often k=2; Switch Transformer uses k=1, DeepSeek-V3 k=8). It normalizes those k scores into weights (the exact normalization varies: softmax is common, while DeepSeek-V3 uses sigmoid affinities normalized over the selected experts), runs only those k experts, and combines their outputs as a weighted sum. The other N − k experts contribute nothing for this token. Different tokens take different paths through the same layer. That’s the whole idea.
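Here’s the same dataflow as code. A deliberately naive PyTorch sketch (illustrative names, softmax over the selected k, a slow Python loop instead of fused kernels), not how any production MoE layer is implemented:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Toy top-k MoE layer: N expert FFNs plus a linear router. Illustrative only."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)   # scores: R^d_model -> R^N
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model), batch and sequence dims already flattened
        scores = self.router(x)                              # (tokens, N)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)             # normalize over the chosen k only
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]                          # expert picked in this slot, per token
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():                               # run expert e only on its tokens
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

Real implementations gather tokens per expert and run grouped matmuls; the loops here exist only to make the routing visible.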
A few things follow from that picture, and most of MoE engineering is dealing with them.
Why “active” vs “total” parameters
The router activates a fixed k of N experts per token. So per-token compute scales with k, not N. The capacity of the model — how much knowledge it can encode — scales with N. So you decouple the two:
- Total params ≈ what fits in GPU memory and what the model can know.
- Active params ≈ what runs per token and what you pay for in flops.
DeepSeek-V3 is the cleanest demonstration: 671B total / 37B active is roughly an 18× ratio. You’re getting the knowledge-soaking capacity of something near 671B with the per-token compute closer to 37B. Whether the quality matches a hypothetical dense 671B is a different question, and there’s no public like-for-like dense-671B baseline to compare against. The honest claim is “much more capacity per active flop than dense,” not “free lunch.”
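A back-of-the-envelope sketch of that decoupling, with made-up numbers (nothing below is a real model config):

```python
# All quantities in arbitrary parameter units; purely illustrative.
N, k, shared = 256, 8, 1        # routed experts, routed-active per token, shared experts
expert_params = 1.0             # parameters in one expert FFN
dense_params = 40.0             # attention, embeddings, etc. (always active)

total = dense_params + (N + shared) * expert_params       # what must fit in memory
active = dense_params + (k + shared) * expert_params      # what runs per token
print(f"total={total:.0f}, active={active:.0f}, ratio={total / active:.1f}x")
# Capacity scales with N; per-token flops scale with k. Raising N barely moves `active`.
```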
The router is a load-balancing nightmare
The first thing that goes wrong if you’re not careful: the router collapses. A few experts win every routing decision, the rest are never picked, never get gradient, and atrophy. You’ve trained an N-expert model that effectively uses 2.
The classic fix is an auxiliary load-balancing loss added to training: a penalty term that nudges the router toward using all experts roughly equally over a batch. Different MoE papers balance differently (Switch Transformer’s auxiliary load-balancing loss; DeepSeek-V3’s auxiliary-loss-free bias-adjustment scheme plus a small sequence-wise balance loss). None of these fully solves the problem (a lot of recent MoE routing research is in schemes that balance better with less explicit pressure), but they keep training from degenerating into "one expert does everything."
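For concreteness, here’s roughly what the Switch-Transformer-style penalty looks like; coefficients and details vary from paper to paper, and this sketch assumes top-1 dispatch:

```python
import torch
import torch.nn.functional as F

def switch_aux_loss(router_logits: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    """Sketch of Switch Transformer's balancing loss: alpha * N * sum_i(f_i * P_i),
    where f_i is the fraction of tokens dispatched to expert i and P_i is the mean
    router probability for expert i. It is minimized when routing is uniform."""
    probs = F.softmax(router_logits, dim=-1)               # (tokens, N)
    n_experts = probs.shape[-1]
    top1 = probs.argmax(dim=-1)                            # hard top-1 assignment
    f = F.one_hot(top1, n_experts).float().mean(dim=0)     # dispatch fraction per expert
    p = probs.mean(dim=0)                                  # mean router probability per expert
    return alpha * n_experts * (f * p).sum()
```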
There’s also capacity factor: each expert can only process so many tokens per batch. If too many tokens want the same expert in the same step, somebody loses. Many systems set a capacity factor and drop overflowed tokens; others engineer the routing and balancing carefully enough to avoid drops in practice.
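A toy version of that mechanism, assuming top-1 routing and earlier-tokens-win tie-breaking (real systems differ on both counts):

```python
import torch

def expert_capacity(tokens: int, n_experts: int, capacity_factor: float) -> int:
    # Each expert takes at most capacity_factor times its fair share of the batch.
    return int(capacity_factor * tokens / n_experts)

def keep_mask(expert_idx: torch.Tensor, capacity: int) -> torch.Tensor:
    """expert_idx: (tokens,) top-1 expert per token. Returns True for kept tokens;
    tokens arriving after their expert is full are dropped."""
    kept = torch.zeros_like(expert_idx, dtype=torch.bool)
    counts: dict[int, int] = {}
    for t, e in enumerate(expert_idx.tolist()):
        if counts.get(e, 0) < capacity:
            kept[t] = True
            counts[e] = counts.get(e, 0) + 1
    return kept
```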
The hidden bill: memory
Here’s the seam in the “MoE is cheaper” story. Active params are what scales the compute per token. But for inference you have to keep all N experts somewhere in the serving system — on a single GPU, across a multi-GPU node, or sharded across hosts — because the router might pick any of them on any token. So while MoE reduces active compute compared to an equally large dense model, it adds parameter-residency and cross-device routing costs that the memory bandwidth wall of LLM serving already makes painful.
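The arithmetic is blunt. A rough sketch, assuming one byte per parameter (fp8) and ignoring KV cache, activations, and batching:

```python
total_params, active_params = 671e9, 37e9   # DeepSeek-V3-scale numbers
bytes_per_param = 1.0                       # fp8; double these for fp16/bf16

resident_gb = total_params * bytes_per_param / 1e9    # must live somewhere: ~671 GB
touched_gb = active_params * bytes_per_param / 1e9    # read per token: ~37 GB
print(f"resident ~{resident_gb:.0f} GB, touched per token ~{touched_gb:.0f} GB")
# No single 80 GB GPU holds the resident set, hence the serving strategies below.
```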
This is why MoE inference is mostly a story about distributed serving:
- Expert parallelism — different experts live on different GPUs. Tokens get routed across the network to wherever their experts are. This works at scale but introduces all-to-all communication on every MoE layer, which is its own performance cliff (see the sketch after this list).
- Big-host inference — a single machine with enough VRAM (or a tightly-coupled multi-GPU node) holds the whole model. Cleaner, but the hardware is expensive.
- Offloading — keep cold experts on CPU memory or NVMe, page them in on demand. Only viable for small batch sizes; the latency penalty when you guess wrong is brutal.
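A single-process simulation of why expert parallelism is communication-heavy, assuming uniform routing and evenly sharded experts (both idealizations):

```python
import torch

torch.manual_seed(0)
ranks, experts_per_rank, tokens_per_rank, k = 8, 32, 1024, 2
n_experts = ranks * experts_per_rank

# Each token picks k experts; under expert parallelism each pick may live elsewhere.
picks = torch.rand(ranks, tokens_per_rank, n_experts).topk(k, dim=-1).indices
dest_rank = picks // experts_per_rank             # rank hosting each picked expert
home_rank = torch.arange(ranks)[:, None, None]    # rank where each token lives
remote = (dest_rank != home_rank).float().mean().item()
print(f"{remote:.0%} of expert calls leave the local rank")   # ~88% with 8 ranks
```

With experts spread uniformly over 8 ranks, roughly 7 of every 8 expert calls cross the network, which is the all-to-all that dominates MoE layer time at scale.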
So the MoE pitch isn’t “cheaper to deploy.” It’s “potentially cheaper per quality unit to deploy, if you can afford the extra memory and the routing overhead.” Whether that math works depends a lot on your traffic pattern and your hardware.
Where MoE breaks the textbook intuitions
A few things that surprise people the first time:
- Experts don’t specialize the way the name suggests. The “expert” naming is more aspirational than mechanistic. Reported analyses tend to find experts firing on subtler statistical patterns — token-position regularities, syntactic structures, punctuation — rather than neat human categories like “medicine” or “code.” Some MoE work (DeepSeekMoE, for example) deliberately encourages finer-grained specialization, and there are interpretability results showing real category-like structure in some setups, but the default is messier than the name implies.
- MoE training instability is real. Sparse routing creates discrete, non-differentiable decisions in the forward pass (you literally drop experts), and the gradients flowing through the router are noisy. Sparsely-gated MoE was introduced by Shazeer et al. in 2017; Switch Transformer (2021) simplified routing to top-1 and reported stable training at much larger scale. A lot of post-2021 MoE work is about taming this further.
- Batching can help less. In a dense model, batching multiple requests amortizes weight-loading costs across them. In an MoE, different tokens in the batch route to different experts, so the same expert might be needed for only a fraction of the batch — fragmenting the work. The “free lunch” of batch-amortized memory bandwidth shrinks. This is part of why MoE inference engines look different from dense ones.
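A toy simulation of that last point, assuming a uniform random router (real routers are skewed, which leaves cold experts even more fragmented):

```python
import torch

torch.manual_seed(0)
batch_tokens, n_experts, k = 64, 64, 2
# Route every token to k experts at random and count each expert's share.
choices = torch.rand(batch_tokens, n_experts).topk(k, dim=-1).indices
counts = torch.bincount(choices.flatten(), minlength=n_experts)
print(counts.float().mean().item())   # ~2 tokens per expert, on average
# Each expert's weights are loaded from HBM to process only ~2 tokens,
# so the batch amortizes weight-loading far less than a dense FFN would.
```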
Famous related terms
- Sparsely-gated mixture-of-experts — `sparse MoE = dense MoE + top-k routing` — the Shazeer et al. 2017 paper that introduced the modern formulation: thousands of experts, only a few active per example, applied between LSTM layers. Pre-transformer, but it laid down the sparse-gating template widely reused in later work.
- Switch Transformer — `Switch ≈ MoE with k=1` — Fedus, Zoph, Shazeer (2021). Routes each token to exactly one expert, simpler and more stable than top-2. The paper reports training sparse models up to a trillion parameters.
- Mixtral 8x7B — `Mixtral = 8 experts + top-2 routing in MoE layers` — Mistral’s first MoE release (paper). 47B total, 13B active. Note: not "8 separate 7B models bolted together" — only the FFN blocks are replicated; attention layers are shared.
- DeepSeekMoE / DeepSeek-V3 — `DeepSeekMoE = many fine-grained experts + shared experts + balancing tweaks` — pushes toward a much larger N (256 routed + 1 shared) with a small k, aiming to increase combinatorial flexibility and isolate shared/common knowledge in dedicated experts. (V3 report)
- Router / gating network — `router = small linear layer + top-k selection per token` — the (usually) tiny layer that picks experts per token. The single most fragile component in an MoE; much of the routing research is about making it better-behaved.
- Active vs total parameters — `active params = what runs per token; total params = what fits in memory` — the distinction the whole architecture exists to create. When you see "37B active / 671B total," now you know what it's saying.
- Expert parallelism — `EP = "shard the experts across GPUs"` — the distributed-systems half of MoE serving, and where the all-to-all communication shows up.
- Auxiliary load-balancing loss — `aux loss = main loss + a penalty that nudges the router toward using all experts` — the training-time hack that keeps the router from collapsing onto a few favorites. Every MoE paper has its own version.
Going deeper
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer — Shazeer et al., 2017. The paper that put modern MoE on the map. Pre-transformer (the experts sit between LSTM layers), but the routing-and-balancing skeleton is the one we still use.
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity — Fedus, Zoph, Shazeer, 2021. The “actually we can train these stably” paper.
- Mixtral of Experts — Jiang et al., 2024. A clear architecture paper for an open frontier MoE; useful if you want to read the design top-to-bottom in one place.
- DeepSeek-V3 Technical Report — DeepSeek-AI, 2024. A report on a very large open-weight MoE (671B total / 37B active) with a candid description of the engineering it took.
- DeepSeekMoE: Towards Ultimate Expert Specialization — the architecture paper behind the V3 expert design.
- For GPT-4: there is no official architecture paper. The “8-expert MoE, ~1.76T total” claim traces back to George Hotz’s 2023 leak; it’s been repeated widely but never formally confirmed by OpenAI, so it belongs in the “industry rumor consistent with the broader trend” bucket rather than the cited-fact bucket.