Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why is attention quadratic?

Doubling the context length makes attention 4× more expensive, not 2×. That single fact shapes every trade-off in modern LLM serving — and explains what FlashAttention actually changed (it's not what most people think).

AI & ML · intermediate · Apr 29, 2026

Why it exists

There’s a question every engineer who works with LLMs eventually trips over: why is context length so expensive?

Memory is cheap. FLOPs are cheap. The model weights don’t grow when you feed in a longer prompt. So why does doubling a 4k prompt to 8k make the request feel four times slower, instead of twice? Why do providers charge per token instead of per request? Why did “1M context” take so much engineering effort to ship when “100k context” already worked?

The reason has nothing to do with the model being big. It’s a property of attention itself, the operation at the heart of every transformer. Attention asks every token in the sequence to look at every other token. Every pair. A sequence of N tokens produces an N×N matrix of pairwise interactions, and the compute needed to evaluate that matrix grows as N² — quadratically. The memory footprint is N² in the naive implementation, but this turns out to be fixable, which is most of what FlashAttention is about. The compute isn’t.

That single fact is the source of nearly every cost-and-latency story in modern LLM serving. It’s why context length is rationed. It’s why prompt caching exists. It’s why long context models still feel sluggish even on H100s. The N² is not a bug or a sloppy implementation; it’s baked into what self-attention is.

Why it matters now

For most of the early transformer era, sequences were short. Translation tasks were a few hundred tokens. BERT’s default was 512. Quadratic was annoying but cheap in absolute terms.

Three things changed:

If attention were linear in N, none of this would be a problem. The entire ecosystem of tricks — sliding windows, sparse attention, KV caching, prompt caching, FlashAttention, linear-attention research — exists because it isn’t.

The short answer

attention cost ∝ N² · d (where N = sequence length, d = head dimension)

Self-attention computes a score for every pair of tokens (a query for token i times a key for token j, for all i, j). That’s an N×N matrix. Building it, softmaxing it, and using it to weight the values is unavoidably quadratic in N. You can hide where the matrix lives in memory, but you cannot avoid computing all N² entries — at least not without changing what attention means.
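To see the scaling concretely, count the score-matrix entries at a few prompt sizes. This is plain arithmetic, nothing model-specific:

# Pairwise attention scores per head, per layer: N^2
for n in [1_000, 4_000, 8_000, 100_000]:
    print(f"N = {n:>7,}: {n * n:>14,} scores")

# N =   1,000:      1,000,000 scores
# N =   4,000:     16,000,000 scores
# N =   8,000:     64,000,000 scores
# N = 100,000: 10,000,000,000 scores

Doubling N quadruples the count; going from 4k to 100k multiplies it by 625.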

How it works

Let’s actually do the arithmetic, because it makes the rest of the post obvious.

For one attention head with sequence length N and head dimension d_h, you have three matrices: Q (queries), K (keys), and V (values), each of shape N×d_h, one row per token.

The math is:

scores  = Q · Kᵀ      # shape N×N
weights = softmax(scores / √d_h)   # still N×N
output  = weights · V # shape N×d_h

Look at the shapes. Q · Kᵀ produces an N×N matrix. Each cell (i, j) is the dot product between query i and key j — how much should token i attend to token j. There are N² cells, each costing O(d_h) to compute, so total work is O(N²·d_h). The output weights · V is another O(N²·d_h) matmul. The softmax is O(N²) — one normalization per row. Summing across H heads, total per-layer attention compute is roughly O(N²·d_model) where d_model = H·d_h is the hidden size.
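If you want to run the shapes yourself, here is a minimal single-head version in NumPy. It's illustrative only (real implementations are batched, causally masked, and fused), and the helper name is mine, but the N×N intermediate is right there:

import numpy as np

def attention(Q, K, V):
    """Single-head softmax attention. Q, K, V: arrays of shape (N, d_h)."""
    d_h = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_h)                  # (N, N) -- the quadratic object
    scores -= scores.max(axis=-1, keepdims=True)     # stabilize the exp
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (N, d_h)

N, d_h = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d_h)) for _ in range(3))
print(attention(Q, K, V).shape)   # (1024, 64)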

Compare this to a feed-forward layer, which is O(N·d_model²) — linear in N, quadratic in the hidden size. The standard transformer expands the FFN inner dimension to ~4·d_model, so the FFN constant is bigger than attention’s by roughly that factor. Working through the algebra, attention starts dominating the FFN once N is on the order of a few times d_model — for a typical d_model ≈ 4096, that’s somewhere around N ≈ 16k tokens, very roughly. Below that, the FFN is the expensive part and attention is cheap; above it, attention dominates and grows hard.
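A back-of-the-envelope FLOP count makes that crossover visible. The sketch below uses the usual approximations (2 FLOPs per multiply-add, two N²·d_model matmuls for attention, an FFN with a 4·d_model inner dimension), so treat the exact numbers as rough:

d_model = 4096

def attn_flops(n):
    # QK^T plus weights @ V: two matmuls of ~2 * n^2 * d_model FLOPs each
    return 2 * (2 * n**2 * d_model)

def ffn_flops(n):
    # up-projection and down-projection through a 4*d_model inner dim
    return 2 * (2 * n * d_model * 4 * d_model)

for n in [2_000, 8_000, 16_000, 64_000]:
    print(f"N = {n:>6,}: attention/FFN FLOP ratio = {attn_flops(n) / ffn_flops(n):.2f}")

# N =  2,000: attention/FFN FLOP ratio = 0.12
# N =  8,000: attention/FFN FLOP ratio = 0.49
# N = 16,000: attention/FFN FLOP ratio = 0.98
# N = 64,000: attention/FFN FLOP ratio = 3.91

The ratio is just N / (4·d_model), which is where the "a few times d_model" rule of thumb comes from.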

Why “every pair” is non-negotiable

You might ask: do we really need all N² scores? Couldn’t every token just look at, say, the last few tokens?

You can. That’s what sliding-window attention does, and what most “long context” architectures eventually fall back on for the bulk of the work. But it’s not free — you’ve changed the model. A token that only attends to a local window cannot directly route information from a distant token in one layer; the information has to hop, layer by layer, like a bucket brigade. Vanilla attention’s whole appeal was that every token could route to every other token in a single layer. Take that away and you’ve broken the property that made transformers work.

This is the actual source of the quadratic: the modeling claim that any pair of tokens might matter. The cost is the price of the claim. Approximations that drop the N² are all making some bet about which pairs you can safely ignore.
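To make "changing the model" concrete, here's what a causal sliding window looks like as a mask. The helper is my own illustration, not any particular library's API:

import numpy as np

def sliding_window_mask(N, w):
    """True where token i may attend to token j: causal, and within w tokens."""
    i = np.arange(N)[:, None]
    j = np.arange(N)[None, :]
    return (j <= i) & (i - j < w)

print(sliding_window_mask(N=6, w=3).astype(int))
# Each row has at most w ones: token i only sees tokens i-2..i.

A real implementation never scores the masked pairs at all, which is how the work drops from O(N²) to O(N·w); the mask is just the clearest way to see which bet is being made.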

What about the softmax?

The softmax in softmax(QKᵀ / √d) is the other thing you can’t easily avoid. It’s a row-wise normalization across N entries — to compute weight (i, j), you need the sum of exp(score_{i, k}) over all k. That’s a global dependency: every column matters for every row. This is why “linear attention” research (which replaces softmax with a kernel that factors as a product) is its own subfield — without softmax, you can sometimes avoid materializing the N×N matrix at all, but you also lose some of the expressive sharpness that makes attention useful in practice. There’s no consensus on whether the trade is worth it; production frontier models still use softmax attention. (My read, not a sourced claim.)
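Here's the factorization trick in miniature, to show why dropping the softmax changes the complexity class. With a feature map φ, each output row is φ(q_i)·(Σ_j φ(k_j)v_jᵀ) up to normalization, and the sum over j is computed once. The φ below (elu(x)+1, from the "Transformers are RNNs" line of work) is one common choice; this is a non-causal sketch with my own helper names, not a production kernel:

import numpy as np

def phi(x):
    # elu(x) + 1: a positive feature map used in some linear-attention papers
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    Qf, Kf = phi(Q), phi(K)            # (N, d_h) each
    KV = Kf.T @ V                      # (d_h, d_h): summed once, O(N * d_h^2)
    Z = Qf @ Kf.sum(axis=0)            # (N,) per-row normalizer
    return (Qf @ KV) / Z[:, None]      # (N, d_h) -- no N×N matrix anywhere

Total work is O(N·d_h²) instead of O(N²·d_h): linear in N, at the cost of replacing the softmax's sharp, input-dependent weighting with whatever the kernel can express.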

KV cache makes decode linear, not prefill

A common confusion: “doesn’t the KV cache make this linear?” Sort of, but only for decode.

When you generate token N+1 autoregressively, you reuse the K and V tensors you already computed for tokens 1…N. The new token’s query attends against those cached keys, which is O(N·d) per step — linear in the cached length. Generating N tokens one at a time is therefore O(N²·d) total (the sum 1 + 2 + … + N), but each individual step is linear.
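A minimal sketch of one decode step, assuming K_cache and V_cache hold the keys and values for the N tokens seen so far (the names are mine):

import numpy as np

def decode_step(q_new, K_cache, V_cache):
    """One autoregressive step. q_new: (d_h,); caches: (N, d_h). O(N * d_h) work."""
    d_h = q_new.shape[-1]
    scores = K_cache @ q_new / np.sqrt(d_h)    # (N,) -- one row, never N×N
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache                   # (d_h,)

After each step the new token's key and value rows are appended to the caches, so step N+1 costs O((N+1)·d_h); that growing per-step cost is the 1 + 2 + … + N sum above.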

Prefill is different. When the model first reads your 50,000-token prompt, there is no cache yet. Every token’s query has to be scored against every other token’s key in one giant matmul — the full N² hit. This is why time-to-first-token grows so badly with prompt length and why prompt caching, which lets the provider skip the prefill on a reused prefix, is such a big deal.

What FlashAttention actually changed (and what it didn’t)

This is the part most people get wrong, so it’s worth being precise.

FlashAttention — the 2022 paper by Tri Dao and collaborators — did not make attention subquadratic. Its FLOP count is still O(N²·d). The paper is open about this; the contribution is “IO-awareness,” not a complexity reduction.

What it changed was where the N×N matrix lives. Standard attention writes the full N×N scores matrix out to GPU HBM, then reads it back to softmax it, then writes the result, then reads it again to multiply by V. Those reads and writes are the actual bottleneck on a GPU, because SRAM on-chip is much faster than HBM but much smaller. The original FlashAttention paper’s A100 example puts the gap at roughly 19 TB/s SRAM (20 MB total) versus 1.5 TB/s HBM (40 GB) — about 13× the bandwidth in about 1/2000th the capacity. The arithmetic units finish quickly and then sit idle waiting for memory. (Same theme as the memory bandwidth story for inference more broadly.)

FlashAttention’s trick is tiling: split Q, K, V into blocks small enough to fit in SRAM, compute the partial attention for each block fused into one kernel (scores → softmax → output, all in SRAM), and never materialize the full N×N matrix in HBM at all. Total FLOPs are the same. Total HBM traffic drops dramatically. The headline number from the paper is up to ~7.6× speedup on the attention computation itself on GPT-2; end-to-end the reported speedups were more modest (the abstract mentions ~3× on GPT-2 and ~15% on BERT-large). The bigger speedups arrive once sequences are long enough that attention is most of the work.
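A toy version of the tiling idea, in NumPy: walk over K/V blocks, keep a running row-max and running softmax denominator, rescale the accumulators whenever the max moves, and never form the N×N matrix. This is a sketch of the algorithm's structure under those assumptions, not the CUDA kernel (the real thing also tiles Q, fuses everything into one kernel, and handles the backward pass):

import numpy as np

def flash_attention_sketch(Q, K, V, block=128):
    """Blockwise softmax attention; only (N, block) tiles are ever materialized."""
    N, d_h = Q.shape
    out = np.zeros_like(Q)
    m = np.full(N, -np.inf)      # running row max
    l = np.zeros(N)              # running softmax denominator
    for s in range(0, N, block):
        Kb, Vb = K[s:s+block], V[s:s+block]
        scores = Q @ Kb.T / np.sqrt(d_h)           # (N, block) tile
        m_new = np.maximum(m, scores.max(axis=1))
        scale = np.exp(m - m_new)                  # rescale old accumulators
        p = np.exp(scores - m_new[:, None])
        l = l * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ Vb
        m = m_new
    return out / l[:, None]

The FLOPs here match the naive version exactly; the thing that changed is that nothing of size N×N is ever written to main memory.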

So: attention is still quadratic in compute. It’s also now linear in memory (you don’t store the N×N matrix), which is what made very long contexts feasible at all. The bottleneck shifted from memory I/O to actual arithmetic — which is where the N² lives, and which we don’t know how to remove without changing the model.

Where the math actually breaks down

A few subtleties worth seeing the seams on:

The thing to take away: when someone says “attention is quadratic,” they mean a specific, narrow, mathematically-forced thing — the N×N pairwise score matrix is intrinsic to softmax self-attention. Every long-context optimization in production is either (a) hiding where that matrix lives so it doesn’t blow up memory (FlashAttention), (b) skipping the prefill entirely when the prefix repeats (prompt caching), or (c) changing what attention means and accepting some loss of expressivity (sparse / sliding / linear variants). Knowing which bucket a given optimization sits in is most of understanding modern serving.

Going deeper