Why SwiGLU replaced ReLU in transformers
Modern LLMs ditched the simplest activation function in deep learning for a multiplicative gate nobody can fully explain. Here's why.
Why it exists
If you opened a deep-learning textbook in 2016, the activation function on every page
was ReLU.
It is the simplest nonlinearity that works: a kink at zero, a straight line above
it, no exponentials. ResNets used it. The original Transformer paper used it.
For a long time it was the default for the same reason printf is the default
debug tool — it’s cheap, it’s understood, it gets out of the way.
Then something quietly happened between 2017 and 2022. BERT and GPT-2 swapped ReLU for GeLU. Then PaLM and LLaMA swapped GeLU for SwiGLU. By 2024, if you opened a frontier open-weights model, the feed-forward layer had three matrices instead of two and a multiplication you’d never seen in a textbook. Something pushed the field to abandon a thing that was famously fine.
Why it matters now
Most modern LLM serving stacks — vLLM, TensorRT-LLM,
llama.cpp — ship a fused kernel for the SwiGLU feed-forward block, because
the three-matrix shape is what nearly every open-weights frontier model uses.
Quantization schemes and tensor-parallel splits have to handle that shape
specifically. If you’re reading LLaMA or Qwen or Mistral source code and the
mlp block has gate_proj, up_proj, and down_proj, that’s a gated FFN —
in those models specifically, SwiGLU. Knowing why that shape won is the
difference between “the FFN is a black box” and “I can predict how this
kernel allocates memory.”
The short answer
SwiGLU(x) = (Swish(xW) ⊙ (xV)) W₂
In words: instead of the classic feed-forward layer f(xW₁) · W₂, you compute
two projections of the input, pass one through a smooth activation
(Swish),
multiply them elementwise (the “gate”), and then project back down with a
third matrix W₂. The gate lets the network decide, per coordinate, how much
of the other branch’s signal to let through.
How it works
Walk through what each piece is actually doing.
The classic FFN block (Vaswani et al., 2017). Given input x, compute
h = ReLU(x W₁ + b₁), then y = h W₂ + b₂. Two matmuls, one nonlinearity. The
hidden dimension is typically 4× the model dimension — that’s where most of the
parameter count of a transformer actually lives, more than attention.
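The classic block really is just two matmuls around a ReLU. A minimal NumPy sketch (dimension sizes are illustrative, not from any particular model):

```python
import numpy as np

def classic_ffn(x, W1, b1, W2, b2):
    """Vaswani-style FFN: y = ReLU(x W1 + b1) W2 + b2."""
    h = np.maximum(x @ W1 + b1, 0.0)  # ReLU: clamp negatives to zero
    return h @ W2 + b2

d_model, d_ff = 8, 32  # hidden dim = 4 x model dim, as in the original paper
rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model))
b2 = np.zeros(d_model)

x = rng.standard_normal((2, d_model))  # a "batch" of 2 token vectors
y = classic_ffn(x, W1, b1, W2, b2)
print(y.shape)  # back to (2, d_model)
```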
Step 1: ReLU → GeLU (around 2018). ReLU has two annoyances. Below zero it is
exactly flat — the gradient is zero, so a neuron stuck in the negative region
gets no learning signal (“dying ReLU”). And the kink at zero means the function
isn’t differentiable there, which is fine in practice but ugly in theory.
Hendrycks & Gimpel’s GeLU paper (arXiv:1606.08415) proposed x · Φ(x) where
Φ is the Gaussian CDF.
Same shape as ReLU at the extremes, but smooth, and crucially with non-vanishing
gradient on the negative side, so neurons there still learn. BERT and GPT-2 picked it up and it became the new default. The
empirical gains were small but consistent.
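The difference on the negative side is easy to see numerically. A quick sketch of exact GeLU via the Gaussian CDF (using the error function from the standard library):

```python
import math

def gelu(x):
    """Exact GeLU: x * Phi(x), where Phi is the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    return max(x, 0.0)

# Negative inputs: ReLU is exactly zero (no signal, no gradient),
# GeLU is small but nonzero, so learning can continue.
print(relu(-1.0))  # 0.0
print(gelu(-1.0))  # ~ -0.1587
# Far from zero, GeLU behaves like ReLU (near-identity for large x).
print(gelu(3.0))   # ~ 2.996
```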
Step 2: GeLU → SwiGLU (around 2020-2022). This is the weirder jump. Noam Shazeer’s “GLU Variants Improve Transformer” (arXiv:2002.05202, 2020) tested a family of GLU variants in the FFN sublayer. The pattern is:
SwiGLU(x) = (Swish(x W) ⊙ (x V)) W₂
Two input projections (W and V) instead of one. One goes through Swish, the
other stays linear. They get multiplied elementwise (⊙). Then W₂ projects
back. The gate is the new ingredient — the linear branch can amplify or
suppress each coordinate of the activated branch.
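Putting the pieces together, a minimal SwiGLU FFN looks like this (NumPy sketch; biases omitted, as in LLaMA-style implementations, and the dimensions are illustrative):

```python
import numpy as np

def swish(x):
    """Swish / SiLU: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W, V, W2):
    """SwiGLU FFN: (Swish(x W) * (x V)) W2 -- elementwise gate, then project down."""
    return (swish(x @ W) * (x @ V)) @ W2

d_model, d_ff = 6, 16
rng = np.random.default_rng(1)
W  = rng.standard_normal((d_model, d_ff))  # the activated branch ("gate_proj")
V  = rng.standard_normal((d_model, d_ff))  # the linear branch ("up_proj")
W2 = rng.standard_normal((d_ff, d_model))  # projection back down ("down_proj")

x = rng.standard_normal((3, d_model))
y = swiglu_ffn(x, W, V, W2)
print(y.shape)  # (3, d_model)
```

Note the three weight matrices: that is exactly the gate_proj / up_proj / down_proj triple you see in open-weights model source code.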
Why “free performance”? To keep the parameter count comparable to the
classic two-matrix FFN, SwiGLU implementations scale the hidden dimension
by 2/3 (so roughly 8/3 · d_model instead of 4 · d_model). With matched
parameters, SwiGLU still outperforms GeLU and ReLU on perplexity and
downstream benchmarks. PaLM used it, LLaMA 1/2/3 use it, and most open-weights
models published after late 2022 use it.
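The 2/3 trick is just arithmetic: the classic FFN costs 8·d² parameters (d×4d plus 4d×d), and three matrices with hidden dim 8d/3 cost the same. A quick check with an illustrative model dimension:

```python
# Parameter-count check for the 2/3 hidden-dim trick.
d = 4096  # illustrative d_model, roughly LLaMA-7B-sized

# Classic FFN: two matrices, (d, 4d) and (4d, d).
classic = d * (4 * d) + (4 * d) * d   # = 8 * d**2

# SwiGLU FFN: three matrices, hidden dim shrunk to (2/3) * 4d = 8d/3.
h = round(8 * d / 3)
swiglu = d * h + d * h + h * d        # gate_proj + up_proj + down_proj

print(classic, swiglu)                # both ~ 8 * d**2
print(swiglu / classic)               # ~ 1.0 (exact match up to rounding of h)
```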
Why does the gate help? Honestly — and this is the seam — nobody really knows. Shazeer’s paper closes with a now-famous line attributing the improvement to “divine benevolence.” The hand-wavy story is that multiplicative interactions let the network represent things linear-plus-ReLU can’t easily represent (e.g. quadratic functions of the input), and the gate gives it a learnable per-coordinate dial. But the honest answer is that we have an empirical result that replicates across scales, and a mechanism story that’s plausible but not proven. The field adopted it because the loss curves were better, not because someone derived it from first principles.
One more thing — the cost. SwiGLU has three big matmuls in the FFN
(gate_proj, up_proj, down_proj) instead of two. That changes how kernels
fuse, how tensor parallelism
splits the weights, and how quantization schemes group the channels. If you’ve
ever wondered why every LLM inference engine has a dedicated swiglu kernel,
that’s why.
Famous related terms
- Swish — Swish(x) = x · sigmoid(βx) (often β = 1) — a smooth, self-gating activation found via neural architecture search at Google in 2017. Looks like ReLU at the extremes but dips slightly negative around zero, and outperformed ReLU on the architectures the original paper tested.
- GeLU — GeLU(x) = x · Φ(x) — the smooth ReLU that BERT and GPT-2 used. Still the default in many vision transformers.
- GLU — GLU(x) = (xW) ⊙ sigmoid(xV) — the original gated linear unit from Dauphin et al. (2017), built for convolutional language models. SwiGLU is the same shape with Swish swapped in for sigmoid.
- GeGLU — GeGLU(x) = GeLU(xW) ⊙ (xV) — Shazeer’s other top GLU variant. Roughly tied with SwiGLU on benchmarks; SwiGLU won the popularity contest largely because PaLM and LLaMA picked it.
- Feed-forward layer — FFN ≈ matmul → activation → matmul — where most of a transformer’s parameters live. The activation function here is what this whole post is about.
Going deeper
- GLU Variants Improve Transformer (Shazeer, 2020) — three pages, a table, and “divine benevolence.”
- Gaussian Error Linear Units (Hendrycks & Gimpel, 2016) — the GeLU paper.
- PaLM (Chowdhery et al., 2022) — early high-profile frontier model published with SwiGLU as part of the architecture.
- LLaMA paper (Touvron et al., 2023) — describes the SwiGLU FFN with the 2/3 hidden-dim trick.