Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why GPU kernels are still hand-tuned

A modern data-center GPU can do hundreds of teraflops of matrix math. A naive, correct implementation of the same math leaves most of that on the floor. Here's why moving the bytes — not doing the FLOPs — is the actual job.

AI & ML intermediate Apr 29, 2026

Why it exists

If you’ve ever wondered why an industry of very smart people spends its time writing fresh CUDA kernels for what is, mathematically, the same matrix multiply that’s been in textbooks since the 1800s — this post is for you.

The puzzle goes like this. A modern data-center GPU advertises something like hundreds of teraflops of matrix throughput. You write a textbook-correct matrix multiply on it: three nested loops, the standard operation. You measure. You get a small fraction of the advertised number. Maybe 5%, maybe 10%. The GPU is not broken. The math is not wrong. The kernel is wrong — in a sense that has nothing to do with correctness and everything to do with where the bytes were when you needed them.

The deep reason is the same reason memory hierarchy exists at all: compute got faster much faster than main memory got faster, and the gap is huge. The job of a “good” kernel is mostly not to do the FLOPs faster — you can’t, the silicon picks the rate. The job is to keep the FLOP units fed, by carefully arranging which bytes live in which level of the memory hierarchy at which moment. That arranging is what hand-tuning is.

Why it matters now

Every LLM you use runs on top of a stack of these hand-tuned kernels. The canonical example is FlashAttention (Dao, Fu, Ermon, Rudra, Ré, May 2022, arXiv:2205.14135). It computes the exact same attention output as the textbook formulation — not an approximation — and reports a 15% end-to-end speedup on BERT-large at sequence length 512, a 3x speedup on GPT-2 at sequence length 1K, and 2.4x on long-range arena at 1K–4K. The mathematics didn’t change. Only where the intermediate values lived changed.

That pattern repeats everywhere. cuBLAS ships many GEMM kernels and uses runtime heuristics (via cuBLASLt) to pick one based on the matrix shapes, GPU model, and data type. Compiler stacks like Triton exist so people other than NVIDIA’s own kernel team can write kernels that land in the same ballpark. When a new GPU generation ships (Ampere → Hopper → Blackwell), the top-end kernels often need retuning and sometimes substantial rewrites, because the hierarchy changed shape.

This is also why “just port it to the GPU” rarely gets you the headline number. The headline number assumes someone hand-tuned the kernel.

The short answer

good kernel = correct math + data placement that keeps the FLOP units fed

A GPU’s compute units run far faster than its main memory can deliver bytes. A naive kernel reads each operand from main memory once per use, which starves the compute. A hand-tuned kernel rearranges the work so each byte loaded from main memory gets used many times before being evicted — by tiling, fusing adjacent operations, and carefully choosing what lives in registers vs. on-chip cache vs. main memory. Same FLOPs, very different wall-clock.

How it works

The whole story is captured by one ratio.

The bandwidth wall, in one number

A modern GPU has a memory hierarchy roughly like this:

registers           : tiny, per-thread, ~instant
shared memory/SRAM  : ~tens of KB to ~hundreds of KB per SM, very fast
L2 cache            : ~tens of MB, fast
HBM (main memory)   : tens of GB, slow (relatively)

HBM is fast in absolute terms — an H100 SXM has roughly 3.35 TB/s of HBM bandwidth, per public NVIDIA specs. The problem is that the compute side is even faster. H100’s tensor cores deliver about 1,000 BF16 TFLOPS dense (roughly 2,000 with the 2:4 sparsity feature, which most workloads don’t use). Divide dense compute by bandwidth and you get the arithmetic intensity break-even point: about 295 FLOPs per byte loaded. Below that ridge, you’re memory-bound and the tensor cores are idle waiting for bytes. Above it, you’re compute-bound and the silicon is actually busy.

That number — “you must do hundreds of FLOPs per byte you load” — is the whole game. The roofline model just draws this as a graph: there’s a sloped ceiling on the left (bandwidth-limited) and a flat ceiling on the right (compute-limited), and your kernel sits somewhere underneath. Naive kernels live on the sloped part. Good kernels claw their way to the flat part.

Why naive matmul is bandwidth-bound

Multiply two N×N matrices the textbook way. Total FLOPs: 2N³. If every multiply-add reloads both operands from HBM, the ratio is constant — and miserably small: 2 FLOPs per 8 bytes loaded is 0.25 FLOP/byte for fp32, or 0.5 FLOP/byte for fp16 and bf16. That's roughly three orders of magnitude below the 295-FLOP/byte ridge. Memory-bound isn't even the right word; you'd be HBM-bound to the point of caricature.

Tiling fixes this. Load a B×B block of A and a B×B block of B into shared memory once. Do the B³ multiply-adds those blocks admit. Now arithmetic intensity scales with B, the tile size. Make B big enough — limited by how much shared memory you have — and you cross the roofline ridge. cuBLAS does this. So does every reasonable matmul library. siboehm's CUDA-MMM worklog walks through a sequence of progressively more careful versions and lands within ~95% of cuBLAS for square matrices (siboehm.com/articles/22/CUDA-MMM).
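The scaling is easy to check with back-of-the-envelope arithmetic. A simplified model (counting only HBM traffic for the tiles themselves, ignoring the output and any reuse via L2 or registers):

```python
# Arithmetic intensity of a shared-memory-tiled matmul, HBM traffic only.
# Per pair of BxB tiles staged into shared memory (2*B*B elements read),
# the kernel performs B**3 multiply-adds = 2*B**3 FLOPs against them,
# so intensity simplifies to B / bytes_per_elem.
def tiled_intensity(B, bytes_per_elem):
    flops = 2 * B**3                         # B^3 fused multiply-adds
    hbm_bytes = 2 * B * B * bytes_per_elem   # one tile of A + one tile of B
    return flops / hbm_bytes

print(tiled_intensity(1, 4))     # naive fp32 (B=1): 0.25 FLOP/byte
print(tiled_intensity(1, 2))     # naive fp16 (B=1): 0.5 FLOP/byte
print(tiled_intensity(128, 2))   # 128x128 fp16 tiles: 64.0 FLOP/byte
```

Note that even 128-wide tiles leave this simple model short of the ~295 ridge; real kernels stack further reuse on top (register blocking, L2 residency), which is part of why the tuning space is so large.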

The catch: the right tile size depends on the matrix shape, the GPU model, the data type, and how much shared memory the surrounding kernels are using. This is why cuBLAS is hundreds of kernels, not one — different shapes want different tilings. There is no single best kernel.

Why FlashAttention is the canonical example

Standard attention does softmax(QK^T / √d) V. Implemented naively: compute the N×N matrix QK^T, write it to HBM, read it back to apply softmax, write it again, read it again to multiply by V. The N×N intermediate doesn’t fit anywhere except HBM, and you traverse it multiple times.

FlashAttention’s insight is that you don’t need the whole N×N matrix at any one moment. You can tile Q, K, V into blocks, hold a block in SRAM, and compute a chunk of the output incrementally — provided you can do softmax incrementally too. That last bit is the online softmax trick (which predates FlashAttention; the contribution was wiring it into a tiled GPU kernel). Result: the N×N matrix never gets written to HBM at all. The kernel is still bandwidth-bound, but it needs far less bandwidth, and wall-clock drops accordingly.
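The online softmax trick is small enough to demonstrate directly. A sketch in plain Python (the block size stands in for what fits in SRAM; the real kernel fuses this with the matmuls and never makes the final pass over the full input):

```python
import math
import random

# Online (streaming) softmax: one pass over blocks maintains a running max m
# and a running sum s of exp(x_i - m). When a later block raises the max,
# the old sum is rescaled by exp(m_old - m_new) before new terms are added.
def online_softmax(xs, block):
    m = float("-inf")   # running max
    s = 0.0             # running sum of exp(x_i - m)
    for start in range(0, len(xs), block):
        chunk = xs[start:start + block]
        m_new = max(m, max(chunk))
        s = s * math.exp(m - m_new) + sum(math.exp(v - m_new) for v in chunk)
        m = m_new
    # The attention kernel rescales its partial output accumulator the same
    # way instead of making this second pass over the inputs.
    return [math.exp(v - m) / s for v in xs]

rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(1000)]

# Reference: standard two-pass numerically stable softmax.
m_full = max(xs)
denom = sum(math.exp(v - m_full) for v in xs)
reference = [math.exp(v - m_full) / denom for v in xs]

blocked = online_softmax(xs, block=64)
assert all(abs(a - b) < 1e-12 for a, b in zip(blocked, reference))
```

The running rescale is why the N×N score matrix never needs to exist: each block's contribution to the softmax denominator can be folded in as it streams past.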

Same output. Same numerical answer (modulo non-associative floating-point reordering, which is a real but small caveat). Several times faster.

Why this has to be redone every GPU generation

Each GPU generation moves the cliff. Hopper (H100) added the Tensor Memory Accelerator (TMA), which extends Ampere’s async-copy model with a hardware unit that can move whole tiles in the background, freeing registers and simplifying the producer/consumer pipelines kernels rely on. Ampere kernels could overlap data movement with compute too — Hopper just makes it cheaper, easier, and faster to express. FlashAttention-3 (Shah, Bikshandi, Zhang, Thakkar, Ramani, Dao, July 2024) is largely a rewrite of FlashAttention-2 to exploit TMA and warp specialization, and reports up to ~75% of H100’s theoretical peak BF16 throughput vs. ~35% for FlashAttention-2 on the same hardware (PyTorch blog).

The kernel didn’t get smarter about the math. It got smarter about the machine.

The seam: this is genuinely hard, and getting harder

A few honest caveats.

Going deeper