Why matrix multiplication is the bottleneck of modern ML

Modern ML is mostly one operation in a trench coat. Understanding why matmul dominates explains hardware, software, and why GPUs eat the world.

Math intermediate Apr 29, 2026

Why it exists

Every time ChatGPT writes a word, Midjourney paints a pixel, or your phone unlocks with Face ID, the hidden grunt work behind the scenes is the same one operation: multiplying enormous tables of numbers together. That’s matrix multiplication — “matmul.” It’s why a single high-end GPU costs more than the rest of a gaming PC combined: the GPU is essentially a factory built to do this one calculation millions of times in parallel. Take this operation away and modern AI doesn’t just slow down — it disappears.

Open the profiler on almost any modern model — a transformer, a diffusion model, a ResNet from a decade ago — and you’ll see the same picture. One operation eats 80–95% of the FLOPs, most of the memory traffic, and most of the wall-clock time: matrix multiplication. Everything else — the activations, the normalizations, the softmaxes — is rounding error in the budget.

That’s a strange thing to discover. Why would a field as varied as “vision, language, audio, robotics, protein folding” all collapse onto a single linear-algebra primitive? It’s not because researchers picked it; it’s because everything else stopped scaling, and matmul didn’t.

Why it matters now

If you write code in the AI era, this one fact silently shapes almost every decision around you:

GPUs dominate ML because they’re matmul firehoses. Their matrix-multiply units are most of the silicon.
The reason an LLM bills you per token, not per “thought,” is that each token costs roughly one matmul-pass through the model’s weights.
Quantization (FP16 → INT8 → INT4) is interesting because matmul’s cost goes down with bit-width. Nothing else really does.
“Why is my model slow?” answers almost always reduce to: a matmul is the wrong shape, the weights don’t fit in cache, or you spend more time moving them than multiplying.

The short answer

modern neural net ≈ a stack of (matmul + cheap nonlinearity)

A neural network layer, stripped of branding, is: take a vector of inputs, multiply it by a learned weight matrix, add a bias, apply a cheap pointwise function (ReLU, GELU, softmax). Stack a hundred of those. That’s it. Even attention is matmul wearing a different hat — attention(Q, K, V) = softmax(Q · Kᵀ / √d) · V is three matmuls glued by a softmax.

So when you train or run a model, you are mostly asking your hardware to multiply matrices. Faster matmul = faster everything.

How it works

There are two reasons matmul ate the field. One is architectural; one is physical.

The architectural reason. Around the late 2000s and early 2010s, deep learning beat hand-engineered features at vision, then speech, then language. The thing the winning models had in common was: lots of layers, each layer a big matmul. Convnets (Conv2D unrolls into a matmul under the hood, via im2col), RNNs, transformers — different shapes, same primitive. Researchers stopped designing exotic operations because exotic operations didn’t fit on the hardware they had.

The physical reason. Matmul has a property hardware loves: it does a lot of FLOPs per byte of memory it touches. Multiplying two N×N matrices is O(N³) arithmetic and O(N²) memory. So if N is big, you do many multiplies for each value you load from HBM. That ratio — arithmetic intensity — is exactly what modern accelerators are built to exploit. They can do tens of teraflops, but only if you feed them work that doesn’t drown in memory traffic.

Most other operations don’t have this property. Adding two vectors does one FLOP per two loads — the chip sits idle waiting for memory. So as compute got cheaper faster than memory got faster (the memory wall), every operation that wasn’t matmul-shaped fell off the Pareto frontier.

There’s a second-order effect. Once GPUs added dedicated matmul units (NVIDIA shipped tensor cores in Volta, 2017), the gap widened: a tensor-core matmul can be 8–16× faster than the same FLOPs done as generic vector math. So a model built out of matmuls runs an order of magnitude faster than the same FLOP budget spent on anything else. Researchers followed the speed.

The seam. Matmul’s dominance isn’t a law of nature — it’s a feedback loop between hardware and architectures. There are operations (sparse attention, structured matrices, state-space models) that are mathematically cheaper but currently slower in practice because the hardware isn’t shaped for them. Whether that loop ever breaks — whether something dethrones matmul — is genuinely open. I don’t have a confident prediction. Mamba and friends are the most credible recent challengers, but as of early 2026, transformers still win at the frontier, and the frontier still runs on matmul.

One more thing worth naming: the theoretical exponent of matrix multiplication is below 3 — Strassen’s algorithm is O(N^2.807), and there’s a long line of asymptotically faster algorithms going down toward ~2.37. Almost none of them are used in practice for ML, because they have worse constants, worse numerical stability, and worse cache behavior at the sizes we actually run. Real GPU matmul is the naive O(N³) algorithm, hand-tuned to within an inch of its life.

GEMM — GEMM = General Matrix Multiply — the BLAS-3 routine C = αAB + βC. Almost every ML framework’s matmul call eventually lands in a hand-tuned GEMM kernel.
Arithmetic intensity — arithmetic intensity = FLOPs / bytes moved — the number that decides whether a kernel is compute-bound or memory-bound. Matmul wins by being high.
Roofline model — roofline ≈ plot of arithmetic intensity vs achievable FLOPs — a back-of-envelope diagram for “is this kernel as fast as it can be?”.
Tensor cores / MXUs — tensor core = matmul ASIC bolted onto a GPU SM — why a 2026 GPU does ML matmul much faster than its raw FP32 number suggests.
im2col — im2col = unroll a convolution into a matmul — the trick that turned convnets into a matmul workload, which is why GPUs ate computer vision.

Going deeper

The roofline model paper (Williams, Waterman, Patterson, 2008) is the cleanest articulation of why arithmetic intensity decides everything.
Horace He’s writeups on “making deep learning go brrr” are the best practical introduction to matmul-shape thinking.
For the algorithmic frontier: search for recent results on the matrix-multiplication exponent ω. They keep inching it down; none of them have hit production ML.