Why GPU kernels are still hand-tuned
A modern GPU can do tens of teraflops of matrix math. A naive, correct implementation of the same math leaves most of that on the floor. Here's why moving the bytes — not doing the FLOPs — is the actual job.
Why it exists
If you’ve ever wondered why an industry of very smart people spends its time writing fresh CUDA kernels for what is, mathematically, the same matrix multiply that’s been in textbooks since the 1800s — this post is for you.
The puzzle goes like this. A modern data-center GPU advertises something like hundreds of teraflops of matrix throughput. You write a textbook-correct matrix multiply on it: three nested loops, the standard operation. You measure. You get a small fraction of the advertised number. Maybe 5%, maybe 10%. The GPU is not broken. The math is not wrong. The kernel is wrong — in a sense that has nothing to do with correctness and everything to do with where the bytes were when you needed them.
The deep reason is the same reason memory hierarchy exists at all: compute got faster much faster than main memory got faster, and the gap is huge. The job of a “good” kernel is mostly not to do the FLOPs faster — you can’t, the silicon picks the rate. The job is to keep the FLOP units fed, by carefully arranging which bytes live in which level of the memory hierarchy at which moment. That arranging is what hand-tuning is.
Why it matters now
Every LLM you use runs on top of a stack of these hand-tuned kernels. The canonical example is FlashAttention (Dao, Fu, Ermon, Rudra, Ré, May 2022, arXiv:2205.14135). It computes the exact same attention output as the textbook formulation — not an approximation — and reports a 15% end-to-end speedup on BERT-large at sequence length 512, a 3x speedup on GPT-2 at sequence length 1K, and 2.4x on Long-Range Arena at sequence lengths 1K–4K. The mathematics didn’t change. Only where the intermediate values lived changed.
That pattern repeats everywhere. cuBLAS ships many GEMM kernels and uses runtime heuristics (via cuBLASLt) to pick one based on the matrix shapes, GPU model, and data type. Compiler stacks like Triton exist so people other than NVIDIA’s own kernel team can write kernels that land in the same ballpark. When a new GPU generation ships (Ampere → Hopper → Blackwell), the top-end kernels often need retuning and sometimes substantial rewrites, because the hierarchy changed shape.
This is also why “just port it to the GPU” rarely gets you the headline number. The headline number assumes someone hand-tuned the kernel.
The short answer
good kernel = correct math + data placement that keeps the FLOP units fed
A GPU’s compute units run far faster than its main memory can deliver bytes. A naive kernel reads each operand from main memory once per use, which starves the compute. A hand-tuned kernel rearranges the work so each byte loaded from main memory gets used many times before being evicted — by tiling, fusing adjacent operations, and carefully choosing what lives in registers vs. on-chip cache vs. main memory. Same FLOPs, very different wall-clock.
How it works
The whole story is captured by one ratio.
The bandwidth wall, in one number
A modern GPU has a memory hierarchy roughly like this:
- registers: tiny, per-thread, ~instant
- shared memory / SRAM: ~tens of KB to ~hundreds of KB per SM, very fast
- L2 cache: ~tens of MB, fast
- HBM (main memory): tens of GB, slow (relatively)
HBM is fast in absolute terms — an H100 SXM has roughly 3.35 TB/s of HBM bandwidth, per public NVIDIA specs. The problem is that the compute side is even faster. H100’s tensor cores deliver about 1,000 BF16 TFLOPS dense (roughly 2,000 with the 2:4 sparsity feature, which most workloads don’t use). Divide dense compute by bandwidth and you get the arithmetic intensity break-even point: about 295 FLOPs per byte loaded. Below that ridge, you’re memory-bound and the tensor cores are idle waiting for bytes. Above it, you’re compute-bound and the silicon is actually busy.
That number — “you must do hundreds of FLOPs per byte you load” — is the whole game. The roofline model just draws this as a graph: there’s a sloped ceiling on the left (bandwidth-limited) and a flat ceiling on the right (compute-limited), and your kernel sits somewhere underneath. Naive kernels live on the sloped part. Good kernels claw their way to the flat part.
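To make that ratio concrete, here's a tiny host-side sketch (plain C++, the kind of thing that would sit in the host portion of a .cu file) that computes the ridge point from the published figures above and the roofline ceiling for a given arithmetic intensity. The 0.25 FLOP/byte value plugged in is the worst-case naive fp32 matmul figure derived in the next section; real naive kernels sit somewhat above this line because caches absorb part of the traffic.

```cuda
#include <algorithm>
#include <cstdio>

int main() {
    // Public H100 SXM figures quoted above; swap in your own GPU's specs.
    const double peak_tflops = 989.0;  // dense BF16 tensor-core TFLOPS (~1,000)
    const double hbm_tbps    = 3.35;   // HBM bandwidth in TB/s

    // Ridge point: FLOPs you must perform per byte loaded from HBM before
    // compute, rather than bandwidth, becomes the limit.
    const double ridge = peak_tflops / hbm_tbps;            // ~295 FLOP/byte
    std::printf("ridge point: %.0f FLOP/byte\n", ridge);

    // Roofline ceiling: attainable TFLOPS = min(peak, intensity * bandwidth).
    const double intensity = 0.25;  // worst-case naive fp32 matmul (next section)
    const double ceiling   = std::min(peak_tflops, intensity * hbm_tbps);
    std::printf("ceiling at %.2f FLOP/byte: %.2f TFLOPS (%.3f%% of peak)\n",
                intensity, ceiling, 100.0 * ceiling / peak_tflops);
    return 0;
}
```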
Why naive matmul is bandwidth-bound
Multiply two N×N matrices the textbook way. Total FLOPs: 2N³. If every multiply-add reloads both inputs from HBM, the ratio is constant — and miserably small. About 0.25 FLOP/byte for fp32, 0.5 FLOP/byte for fp16 or bf16. Two orders of magnitude below the 295-FLOP/byte ridge. Memory-bound isn’t even the right word; you’d be HBM-bound to the point of caricature.
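That access pattern looks like this as a CUDA kernel: one thread per output element, every operand fetched from global memory on every trip through the inner loop. A minimal sketch, not any particular library's kernel:

```cuda
// Naive matmul: C = A * B for N x N row-major fp32 matrices.
// Each A and B operand comes from global memory (backed by L2/HBM) on every
// iteration, so each loaded byte participates in exactly one multiply-add.
__global__ void matmul_naive(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k) {
            acc += A[row * N + k] * B[k * N + col];  // 2 FLOPs for 8 bytes touched
        }
        C[row * N + col] = acc;
    }
}
```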
Tiling fixes this. Load a B×B block of A and a B×B block of B into shared memory once. Do B³ multiply-adds against those blocks. Now arithmetic intensity scales with B, the tile size. Make B big enough — limited by how much shared memory you have — and you cross the roofline ridge. cuBLAS does this. So does every reasonable matmul library. siboehm’s CUDA-MMM worklog walks through twelve progressively more careful versions and lands within ~95% of cuBLAS for square matrices (siboehm.com/articles/22/CUDA-MMM).
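A minimal sketch of that tiling (square matrices, N divisible by the tile size, no bounds checks, none of the vectorization or register blocking a real kernel adds; roughly the shared-memory-caching step early in siboehm's progression):

```cuda
#define TILE 32  // tile edge: 32x32 threads per block, 2 x 4 KB of shared memory

// Tiled matmul: each block stages a TILE x TILE slab of A and of B into
// shared memory once, then every thread reuses those staged values TILE
// times before the next slab is loaded.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // One global load per tile element...
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // ...then TILE multiply-adds against data that is already on-chip.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```

Each loaded value is now reused TILE times, so arithmetic intensity grows linearly with the tile size until shared memory and register pressure push back.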
The catch: the right tile size depends on the matrix shape, the GPU model, the data type, and how much shared memory the surrounding kernels are using. This is why cuBLAS is hundreds of kernels, not one — different shapes want different tilings. There is no single best kernel.
Why FlashAttention is the canonical example
Standard attention does softmax(QK^T / √d) V. Implemented naively: compute the N×N matrix QK^T, write it to HBM, read it back to apply softmax, write it again, read it again to multiply by V. The N×N intermediate doesn’t fit anywhere except HBM, and you traverse it multiple times.
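Written out, that's three separate passes over an N×N buffer. A plain C++ sketch of the structure (single head, no batching, row-major; the function name is mine), just to make the traffic pattern visible:

```cuda
#include <algorithm>
#include <cmath>
#include <vector>

// Naive attention, one head: O = softmax(Q K^T / sqrt(d)) V.
// The N x N score matrix S is fully materialized and traversed three times;
// on a GPU that buffer lives in HBM, and each pass is HBM traffic.
void attention_naive(const float* Q, const float* K, const float* V,
                     float* O, int N, int d) {
    std::vector<float> S(size_t(N) * N);
    const float scale = 1.0f / std::sqrt(float(d));

    // Pass 1: S = Q K^T / sqrt(d)   -- writes N*N values.
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            float s = 0.0f;
            for (int k = 0; k < d; ++k) s += Q[i * d + k] * K[j * d + k];
            S[size_t(i) * N + j] = s * scale;
        }

    // Pass 2: row-wise softmax      -- reads and rewrites N*N values.
    for (int i = 0; i < N; ++i) {
        float* row = &S[size_t(i) * N];
        float m = *std::max_element(row, row + N);
        float l = 0.0f;
        for (int j = 0; j < N; ++j) { row[j] = std::exp(row[j] - m); l += row[j]; }
        for (int j = 0; j < N; ++j) row[j] /= l;
    }

    // Pass 3: O = S V               -- reads N*N values one more time.
    for (int i = 0; i < N; ++i)
        for (int k = 0; k < d; ++k) {
            float o = 0.0f;
            for (int j = 0; j < N; ++j) o += S[size_t(i) * N + j] * V[j * d + k];
            O[i * d + k] = o;
        }
}
```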
FlashAttention’s insight is that you don’t need the whole N×N matrix at any one moment. You can tile Q, K, V into blocks, hold a block in SRAM, and compute a chunk of the output incrementally — provided you can do softmax incrementally too. That last bit is the online softmax trick (which predates FlashAttention; the contribution was wiring it into a tiled GPU kernel). Result: the N×N matrix never gets written to HBM at all. The kernel is still bandwidth-bound; it just needs far less bandwidth.
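The online-softmax recurrence itself is small. Here's a host-side sketch for a single query row, consuming one score and one value row at a time (processing a block at a time is the same algebra); this is the recurrence FlashAttention builds on, not the kernel itself, and the names are mine:

```cuda
#include <algorithm>
#include <cfloat>
#include <cmath>
#include <vector>

// Online softmax for one query row: consume scores s_j and value rows v_j
// one at a time, keeping only a running max (m), a running normalizer (l),
// and a running unnormalized output (o). The full row of scores never has
// to exist at once, which is what lets the tiled kernel skip the N x N
// intermediate in HBM.
std::vector<float> online_softmax_row(const std::vector<float>& scores,
                                      const std::vector<std::vector<float>>& V) {
    const size_t d = V[0].size();
    float m = -FLT_MAX;              // running max of scores seen so far
    float l = 0.0f;                  // running sum of exp(score - m)
    std::vector<float> o(d, 0.0f);   // running unnormalized output

    for (size_t j = 0; j < scores.size(); ++j) {
        const float m_new = std::max(m, scores[j]);
        const float correction = std::exp(m - m_new);  // rescale old terms
        const float p = std::exp(scores[j] - m_new);   // weight of the new term
        l = l * correction + p;
        for (size_t k = 0; k < d; ++k)
            o[k] = o[k] * correction + p * V[j][k];
        m = m_new;
    }
    for (size_t k = 0; k < d; ++k) o[k] /= l;          // normalize once, at the end
    return o;
}
```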
Same output. Same numerical answer (modulo non-associative floating-point reordering, which is a real but small caveat). Several times faster.
Why this has to be redone every GPU generation
Each GPU generation moves the cliff. Hopper (H100) added the TMA, which extends Ampere’s async-copy model with a hardware unit that can move whole tiles in the background, freeing registers and simplifying the producer/consumer pipelines kernels rely on. Ampere kernels could overlap data movement with compute too — Hopper just makes it cheaper, easier, and faster to express. FlashAttention-3 (Shah, Bikshandi, Zhang, Thakkar, Ramani, Dao, July 2024) is largely a rewrite of FlashAttention-2 to exploit TMA and warp-specialization, and reports up to ~75% of H100’s theoretical peak BF16 throughput vs. ~35% for FlashAttention-2 on the same hardware (PyTorch blog).
The kernel didn’t get smarter about the math. It got smarter about the machine.
The seam: this is genuinely hard, and getting harder
A few honest caveats.
- Auto-tuning is real but partial. Triton, CUTLASS, TVM, and reinforcement-learning approaches (e.g. CUDA-L2, arXiv:2512.02551) automate parts of the search. They’ve narrowed the gap, especially for matmul-shaped problems. They have not eliminated the need for human insight on novel ops.
- “Beating cuBLAS” headlines are usually shape-specific. Worklogs that beat cuBLAS on H100 (cudaforfun.substack.com) typically pick a matrix shape where cuBLAS’s heuristics chose a suboptimal kernel. cuBLAS as a library is very hard to beat across the whole shape space.
- The cost shifted, not disappeared. Models scaled, sequence lengths exploded, and the bandwidth gap widened. Hand-tuning matters more now, not less, even though tools have gotten better.
- I don’t have a clean public number for what fraction of total inference cost on, say, a frontier-model API call lands in hand-tuned kernels vs. framework overhead. The qualitative claim “almost all of it” is what every engineer who’s profiled one will tell you, but I’m not citing a specific public breakdown because I don’t have one I’d defend.
Famous related terms
- Roofline model: roofline = bandwidth ceiling on the left + FLOPs ceiling on the right. The diagram every kernel engineer keeps in their head (Williams, Waterman, Patterson, 2009).
- FlashAttention: FlashAttention = standard attention + IO-aware tiling + online softmax. The kernel that made long-context transformers practical.
- Tensor core: tensor core = matrix-multiply unit + native low-precision input + fp32 accumulate. The hardware whose hunger for bytes is the whole reason this matters.
- HBM: HBM = stacked DRAM + wide on-package interposer. Fast relative to DDR, slow relative to the compute it feeds.
- Triton: Triton ≈ CUDA with a less painful programming model. A Python-embedded DSL from OpenAI for writing kernels at the block level.
- cuBLAS / CUTLASS: cuBLAS = closed library of many GEMM kernels + runtime heuristics to pick one; CUTLASS ≈ cuBLAS as templates you can extend. Both NVIDIA.
Going deeper
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al., arXiv:2205.14135, May 2022). Read this once even if you never write a kernel — the framing in section 2 (“Standard Attention Implementation”) is the cleanest illustration of “the math wasn’t the bottleneck” you’ll find.
- Simon Boehm’s How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance (siboehm.com) is the worklog. Twelve kernels, each annotated. If the roofline model feels abstract, this makes it concrete.
- Williams, Waterman, Patterson, Roofline: An Insightful Visual Performance Model for Multicore Architectures (CACM, April 2009) — the original roofline paper. Predates the GPU AI era and aged extremely well.
- FlashAttention-3 (Shah et al., July 2024, PyTorch blog) for the Hopper-specific story (TMA, warp-specialization, FP8). Useful as evidence that “the kernel” is a moving target, not a finished artifact.