
Why SwiGLU replaced ReLU in transformers

Modern LLMs ditched the simplest activation function in deep learning for a multiplicative gate nobody can fully explain. Here's why.


Why it exists

If you opened a deep-learning textbook in 2016, the activation function on every page was ReLU. It is the simplest nonlinearity that works: a kink at zero, a straight line above it, no exponentials. ResNets used it. The original Transformer paper used it. For a long time it was the default for the same reason printf is the default debug tool — it’s cheap, it’s understood, it gets out of the way.

Then something quietly happened between 2017 and 2022. BERT and GPT-2 swapped ReLU for GeLU. Then PaLM and LLaMA swapped GeLU for SwiGLU. By 2024, if you opened a frontier open-weights model, the feed-forward layer had three matrices instead of two and a multiplication you’d never seen in a textbook. Something pushed the field to abandon a thing that was famously fine.

Why it matters now

Most modern LLM serving stacks — vLLM, TensorRT-LLM, llama.cpp — ship a fused kernel for the SwiGLU feed-forward block, because the three-matrix shape is what nearly every open-weights frontier model uses. Quantization schemes and tensor-parallel splits have to handle that shape specifically. If you’re reading LLaMA or Qwen or Mistral source code and the mlp block has gate_proj, up_proj, and down_proj, that’s a gated FFN — in those models specifically, SwiGLU. Knowing why that shape won is the difference between “the FFN is a black box” and “I can predict how this kernel allocates memory.”

The short answer

SwiGLU(x) = (Swish(xW) ⊙ (xV)) W₂

In words: instead of the classic feed-forward layer f(xW₁) W₂, you compute two projections of the input, pass one through a smooth activation (Swish, defined as x · σ(x) and also called SiLU), multiply them elementwise (the “gate”), and then project back down with a third matrix W₂. The gate lets the network decide, per coordinate, how much of the other branch’s signal to let through.
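
In code, the whole block is three matmuls and one elementwise product. A minimal functional sketch in PyTorch (variable names and sizes are illustrative, not taken from any codebase):

```python
import torch
import torch.nn.functional as F

d_model, d_hidden = 512, 1376           # illustrative sizes, not from any model
x = torch.randn(4, d_model)             # a batch of 4 token vectors
W = torch.randn(d_model, d_hidden)      # gate branch
V = torch.randn(d_model, d_hidden)      # linear branch
W2 = torch.randn(d_hidden, d_model)     # down projection

# Swish(x) = x * sigmoid(x); PyTorch ships it as F.silu
y = (F.silu(x @ W) * (x @ V)) @ W2      # shape: (4, d_model)
```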

How it works

Let’s walk through what each piece is actually doing.

The classic FFN block (Vaswani et al., 2017). Given input x, compute h = ReLU(x W₁ + b₁), then y = h W₂ + b₂. Two matmuls, one nonlinearity. The hidden dimension is typically 4× the model dimension — that’s where most of the parameter count of a transformer actually lives, more than attention.
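
For reference, here is that classic block as a PyTorch module (a sketch; the naming is mine, biases included as in the original paper):

```python
import torch
import torch.nn as nn

class ClassicFFN(nn.Module):
    """The 2017-style feed-forward block: expand 4x, ReLU, project back."""
    def __init__(self, d_model: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, 4 * d_model)   # x W₁ + b₁
        self.w2 = nn.Linear(4 * d_model, d_model)   # h W₂ + b₂

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(torch.relu(self.w1(x)))
```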

Step 1: ReLU → GeLU (around 2018). ReLU has two annoyances. Below zero it is exactly flat: the gradient is zero, so a neuron stuck in the negative region gets no learning signal (“dying ReLU”). And the kink at zero means the function isn’t differentiable there, which is fine in practice but ugly in theory. Hendrycks & Gimpel’s GeLU paper (arXiv:1606.08415) proposed x · Φ(x), where Φ is the Gaussian CDF. Same shape as ReLU at the extremes, but smooth, and with a small but nonzero gradient for moderately negative inputs, so neurons there still get a learning signal. BERT and GPT-2 picked it up and it became the new default. The empirical gains were small but consistent.
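
To see the difference concretely, here is the exact GeLU and its gradient at negative inputs (a sketch; PyTorch ships this as F.gelu):

```python
import math
import torch

def gelu(x: torch.Tensor) -> torch.Tensor:
    # x · Φ(x), with Φ the standard normal CDF written via erf
    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))

x = torch.tensor([-2.0, -0.5], requires_grad=True)
gelu(x).sum().backward()
print(x.grad)   # both entries nonzero; ReLU's gradient at these inputs is 0
```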

Step 2: GeLU → SwiGLU (around 2020-2022). This is the weirder jump. Noam Shazeer’s “GLU Variants Improve Transformer” (arXiv:2002.05202, 2020) tested a family of GLU variants in the FFN sublayer. The pattern is:

SwiGLU(x) = (Swish(x W) ⊙ (x V)) W₂

Two input projections (W and V) instead of one. One goes through Swish, the other stays linear. They get multiplied elementwise (the ⊙). Then W₂ projects back. The gate is the new ingredient: the linear branch can amplify or suppress each coordinate of the activated branch.
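
As a module, this is essentially the mlp block you’ll find in LLaMA-family source (a sketch with LLaMA-style names; real implementations add dtype, device, and parallelism plumbing):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated FFN: three projections instead of two; no biases, as in LLaMA."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)  # W, the Swish branch
        self.up_proj = nn.Linear(d_model, d_hidden, bias=False)    # V, the linear branch
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)  # W₂

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish(xW) ⊙ (xV), then project back down
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```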

Why “free performance”? To keep the parameter count comparable to the classic two-matrix FFN, SwiGLU implementations scale the hidden dimension to 2/3 of the usual 4× (so roughly 8/3 · d_model instead of 4 · d_model), compensating for the third matrix. With matched parameters, SwiGLU still outperforms GeLU and ReLU on perplexity and downstream benchmarks. PaLM used it, LLaMA 1/2/3 use it, and most open-weights models published after late 2022 use it.
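
The arithmetic, with LLaMA-7B’s numbers as a check (the round-up-to-a-multiple-of-256 step matches my reading of the LLaMA reference code; treat that detail as an assumption):

```python
d_model = 4096                       # LLaMA-7B model dimension

# Classic FFN: a d×4d and a 4d×d matrix → 8·d² parameters
classic_params = 8 * d_model**2

# Gated FFN: three matrices of d·h parameters each → 3·d·h total.
# Setting 3·d·h = 8·d² gives h = (8/3)·d ≈ 2.67·d.
h = int(8 * d_model / 3)             # 10922
h = 256 * ((h + 255) // 256)         # round up → 11008, LLaMA-7B's actual size
print(h, 3 * d_model * h, classic_params)   # 11008 135266304 134217728
```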

Why does the gate help? Honestly — and this is the seam — nobody really knows. Shazeer’s paper closes with a now-famous line attributing the improvement to “divine benevolence.” The hand-wavy story is that multiplicative interactions let the network represent things linear-plus-ReLU can’t easily represent (e.g. quadratic functions of the input), and the gate gives it a learnable per-coordinate dial. But the honest answer is that we have an empirical result that replicates across scales, and a mechanism story that’s plausible but not proven. The field adopted it because the loss curves were better, not because someone derived it from first principles.

One more thing — the cost. SwiGLU has three big matmuls in the FFN (gate_proj, up_proj, down_proj) instead of two. That changes how kernels fuse, how TP splits the weights, and how quantization schemes group the channels. If you’ve ever wondered why every LLM inference engine has a dedicated swiglu kernel, that’s why.
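
One concrete consequence: because gate_proj and up_proj read the same input, serving engines typically concatenate them and do one wide matmul instead of two. A sketch of the idea (the fusion pattern is common, e.g. in vLLM; the exact mechanics vary by engine):

```python
import torch
import torch.nn.functional as F

d_model, d_hidden = 4096, 11008
w_gate = torch.randn(d_hidden, d_model)
w_up = torch.randn(d_hidden, d_model)
w_down = torch.randn(d_model, d_hidden)

# Stack the two input projections into one [2h, d] weight:
# one big matmul feeds both branches at once.
w_fused = torch.cat([w_gate, w_up], dim=0)

x = torch.randn(4, d_model)
gate, up = (x @ w_fused.T).chunk(2, dim=-1)   # split back into the two branches
y = (F.silu(gate) * up) @ w_down.T            # shape: (4, d_model)
```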

Going deeper