Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why is attention quadratic?

Doubling the context length makes attention 4× more expensive, not 2×. That single fact shapes every trade-off in modern LLM serving — and explains what FlashAttention actually changed (it's not what most people think).

AI & ML · intermediate · Apr 29, 2026

Why it exists

There’s a question every engineer who works with LLMs eventually trips over: why is context length so expensive?

Memory is cheap. FLOPs are cheap. The model weights don’t grow when you feed in a longer prompt. So why does doubling a 4k prompt to 8k make the request feel four times slower, instead of twice? Why do providers charge per token instead of per request? Why did “1M context” take so much engineering effort to ship when “100k context” already worked?

The reason has nothing to do with the model being big. It’s a property of attention itself, the operation at the heart of every transformer. Attention asks every token in the sequence to look at every other token. Every pair. A sequence of N tokens produces an N×N matrix of pairwise interactions, and the compute needed to evaluate that matrix grows as N² — quadratically. The memory footprint is N² in the naive implementation, but this turns out to be fixable, which is most of what FlashAttention is about. The compute isn’t.

That single fact is the source of nearly every cost-and-latency story in modern LLM serving. It’s why context length is rationed. It’s why prompt caching exists. It’s why long context models still feel sluggish even on H100s. The N² is not a bug or a sloppy implementation; it’s baked into what self-attention is.

Why it matters now

For most of the early transformer era, sequences were short. Translation tasks were a few hundred tokens. BERT’s default was 512. Quadratic was annoying but cheap in absolute terms.

Three things changed:

If attention were linear in N, none of this would be a problem. The entire ecosystem of tricks — sliding windows, sparse attention, KV caching, prompt caching, FlashAttention, linear-attention research — exists because it isn’t.

The short answer

attention cost ∝ N² · d (where N = sequence length, d = head dimension)

Self-attention computes a score for every pair of tokens (a query for token i times a key for token j, for all i, j). That’s an N×N matrix. Building it, softmaxing it, and using it to weight the values is unavoidably quadratic in N. You can hide where the matrix lives in memory, but you cannot avoid computing all N² entries — at least not without changing what attention means.
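To see the scaling concretely, count the score-matrix entries at a few prompt sizes. This is plain arithmetic, nothing model-specific:

# Pairwise attention scores per head, per layer: N^2
for n in [1_000, 4_000, 8_000, 100_000]:
    print(f"N = {n:>7,}: {n * n:>14,} scores")

# N =   1,000:      1,000,000 scores
# N =   4,000:     16,000,000 scores
# N =   8,000:     64,000,000 scores
# N = 100,000: 10,000,000,000 scores

Doubling N quadruples the count; going from 4k to 100k multiplies it by 625.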

How it works

Let’s actually do the arithmetic, because it makes the rest of the post obvious.

For one attention head with sequence length N and head dimension d_h, you have three matrices: Q (queries), K (keys), and V (values), each of shape N×d_h, one row per token.

The math is:

scores  = Q · Kᵀ      # shape N×N
weights = softmax(scores / √d_h)   # still N×N
output  = weights · V # shape N×d_h

Look at the shapes. Q · Kᵀ produces an N×N matrix. Each cell (i, j) is the dot product between query i and key j — how much should token i attend to token j. There are N² cells, each costing O(d_h) to compute, so total work is O(N²·d_h). The output weights · V is another O(N²·d_h) matmul. The softmax is O(N²) — one normalization per row. Summing across H heads, total per-layer attention compute is roughly O(N²·d_model) where d_model = H·d_h is the hidden size.
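If you want to run the shapes yourself, here is a minimal single-head version in NumPy. It's illustrative only (real implementations are batched, causally masked, and fused), and the helper name is mine, but the N×N intermediate is right there:

import numpy as np

def attention(Q, K, V):
    """Single-head softmax attention. Q, K, V: arrays of shape (N, d_h)."""
    d_h = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_h)                  # (N, N) -- the quadratic object
    scores -= scores.max(axis=-1, keepdims=True)     # stabilize the exp
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (N, d_h)

N, d_h = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d_h)) for _ in range(3))
print(attention(Q, K, V).shape)   # (1024, 64)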

Compare this to a feed-forward layer, which is O(N·d_model²) — linear in N, quadratic in the hidden size. The standard transformer expands the FFN inner dimension to ~4·d_model, so the FFN constant is bigger than attention’s by roughly that factor. Working through the algebra, attention starts dominating the FFN once N is on the order of a few times d_model — for a typical d_model ≈ 4096, that’s somewhere around N ≈ 16k tokens, very roughly. Below that, the FFN is the expensive part and attention is cheap; above it, attention dominates and grows hard.
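A back-of-the-envelope FLOP count makes that crossover visible. The sketch below uses the usual approximations (2 FLOPs per multiply-add, two N²·d_model matmuls for attention, an FFN with a 4·d_model inner dimension), so treat the exact numbers as rough:

d_model = 4096

def attn_flops(n):
    # QK^T plus weights @ V: two matmuls of ~2 * n^2 * d_model FLOPs each
    return 2 * (2 * n**2 * d_model)

def ffn_flops(n):
    # up-projection and down-projection through a 4*d_model inner dim
    return 2 * (2 * n * d_model * 4 * d_model)

for n in [2_000, 8_000, 16_000, 64_000]:
    print(f"N = {n:>6,}: attention/FFN FLOP ratio = {attn_flops(n) / ffn_flops(n):.2f}")

# N =  2,000: attention/FFN FLOP ratio = 0.12
# N =  8,000: attention/FFN FLOP ratio = 0.49
# N = 16,000: attention/FFN FLOP ratio = 0.98
# N = 64,000: attention/FFN FLOP ratio = 3.91

The ratio is just N / (4·d_model), which is where the "a few times d_model" rule of thumb comes from.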

Why “every pair” is non-negotiable

You might ask: do we really need all N² scores? Couldn’t every token just look at, say, the last few tokens?

You can. That’s what sliding-window attention does, and what most “long context” architectures eventually fall back on for the bulk of the work. But it’s not free — you’ve changed the model. A token that only attends to a local window cannot directly route information from a distant token in one layer; the information has to hop, layer by layer, like a bucket brigade. Vanilla attention’s whole appeal was that every token could route to every other token in a single layer. Take that away and you’ve broken the property that made transformers work.

This is the actual source of the quadratic: the modeling claim that any pair of tokens might matter. The cost is the price of the claim. Approximations that drop the N² are all making some bet about which pairs you can safely ignore.
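To make "changing the model" concrete, here's what a causal sliding window looks like as a mask. The helper is my own illustration, not any particular library's API:

import numpy as np

def sliding_window_mask(N, w):
    """True where token i may attend to token j: causal, and within w tokens."""
    i = np.arange(N)[:, None]
    j = np.arange(N)[None, :]
    return (j <= i) & (i - j < w)

print(sliding_window_mask(N=6, w=3).astype(int))
# Each row has at most w ones: token i only sees tokens i-2..i.

A real implementation never scores the masked pairs at all, which is how the work drops from O(N²) to O(N·w); the mask is just the clearest way to see which bet is being made.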

What about the softmax?

The softmax in softmax(QKᵀ / √d) is the other thing you can’t easily avoid. It’s a row-wise normalization across N entries — to compute weight (i, j), you need the sum of exp(score_{i, k}) over all k. That’s a global dependency: every column matters for every row. This is why “linear attention” research (which replaces softmax with a kernel that factors as a product) is its own subfield — without softmax, you can sometimes avoid materializing the N×N matrix at all, but you also lose some of the expressive sharpness that makes attention useful in practice. There’s no consensus on whether the trade is worth it; production frontier models still use softmax attention. (My read, not a sourced claim.)
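Here's the factorization trick in miniature, to show why dropping the softmax changes the complexity class. With a feature map φ, each output row is φ(q_i)·(Σ_j φ(k_j)v_jᵀ) up to normalization, and the sum over j is computed once. The φ below (elu(x)+1, from the "Transformers are RNNs" line of work) is one common choice; this is a non-causal sketch with my own helper names, not a production kernel:

import numpy as np

def phi(x):
    # elu(x) + 1: a positive feature map used in some linear-attention papers
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    Qf, Kf = phi(Q), phi(K)            # (N, d_h) each
    KV = Kf.T @ V                      # (d_h, d_h): summed once, O(N * d_h^2)
    Z = Qf @ Kf.sum(axis=0)            # (N,) per-row normalizer
    return (Qf @ KV) / Z[:, None]      # (N, d_h) -- no N×N matrix anywhere

Total work is O(N·d_h²) instead of O(N²·d_h): linear in N, at the cost of replacing the softmax's sharp, input-dependent weighting with whatever the kernel can express.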

KV cache makes decode linear, not prefill

A common confusion: “doesn’t the KV cache make this linear?” Sort of, but only for decode.

When you generate token N+1 autoregressively, you reuse the K and V tensors you already computed for tokens 1…N. The new token’s query attends against those cached keys, which is O(N·d) per step — linear in the cached length. Generating N tokens one at a time is therefore O(N²·d) total (the sum 1 + 2 + … + N), but each individual step is linear.
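A minimal sketch of one decode step, assuming K_cache and V_cache hold the keys and values for the N tokens seen so far (the names are mine):

import numpy as np

def decode_step(q_new, K_cache, V_cache):
    """One autoregressive step. q_new: (d_h,); caches: (N, d_h). O(N * d_h) work."""
    d_h = q_new.shape[-1]
    scores = K_cache @ q_new / np.sqrt(d_h)    # (N,) -- one row, never N×N
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache                   # (d_h,)

After each step the new token's key and value rows are appended to the caches, so step N+1 costs O((N+1)·d_h); that growing per-step cost is the 1 + 2 + … + N sum above.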

Prefill is different. When the model first reads your 50,000-token prompt, there is no cache yet. Every token’s query has to be scored against every other token’s key in one giant matmul — the full N² hit. This is why time-to-first-token grows so badly with prompt length and why prompt caching, which lets the provider skip the prefill on a reused prefix, is such a big deal.

What FlashAttention actually changed (and what it didn’t)

This is the part most people get wrong, so it’s worth being precise.

FlashAttention — the 2022 paper by Tri Dao and collaborators — did not make attention subquadratic. Its FLOP count is still O(N²·d). The paper is open about this; the contribution is “IO-awareness,” not a complexity reduction.

What it changed was where the N×N matrix lives. Standard attention writes the full N×N scores matrix out to GPU HBM, then reads it back to softmax it, then writes the result, then reads it again to multiply by V. Those reads and writes are the actual bottleneck on a GPU, because SRAM on-chip is much faster than HBM but much smaller. The original FlashAttention paper’s A100 example puts the gap at roughly 19 TB/s SRAM (20 MB total) versus 1.5 TB/s HBM (40 GB) — about 13× the bandwidth in about 1/2000th the capacity. The arithmetic units finish quickly and then sit idle waiting for memory. (Same theme as the memory bandwidth story for inference more broadly.)

FlashAttention’s trick is tiling: split Q, K, V into blocks small enough to fit in SRAM, compute the partial attention for each block fused into one kernel (scores → softmax → output, all in SRAM), and never materialize the full N×N matrix in HBM at all. Total FLOPs are the same. Total HBM traffic drops dramatically. The headline number from the paper is up to ~7.6× speedup on the attention computation itself on GPT-2; end-to-end the reported speedups were more modest (the abstract mentions ~3× on GPT-2 and ~15% on BERT-large). The bigger speedups arrive once sequences are long enough that attention is most of the work.
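A toy version of the tiling idea, in NumPy: walk over K/V blocks, keep a running row-max and running softmax denominator, rescale the accumulators whenever the max moves, and never form the N×N matrix. This is a sketch of the algorithm's structure under those assumptions, not the CUDA kernel (the real thing also tiles Q, fuses everything into one kernel, and handles the backward pass):

import numpy as np

def flash_attention_sketch(Q, K, V, block=128):
    """Blockwise softmax attention; only (N, block) tiles are ever materialized."""
    N, d_h = Q.shape
    out = np.zeros_like(Q)
    m = np.full(N, -np.inf)      # running row max
    l = np.zeros(N)              # running softmax denominator
    for s in range(0, N, block):
        Kb, Vb = K[s:s+block], V[s:s+block]
        scores = Q @ Kb.T / np.sqrt(d_h)           # (N, block) tile
        m_new = np.maximum(m, scores.max(axis=1))
        scale = np.exp(m - m_new)                  # rescale old accumulators
        p = np.exp(scores - m_new[:, None])
        l = l * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ Vb
        m = m_new
    return out / l[:, None]

The FLOPs here match the naive version exactly; the thing that changed is that nothing of size N×N is ever written to main memory.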

So: attention is still quadratic in compute. It’s also now linear in memory (you don’t store the N×N matrix), which is what made very long contexts feasible at all. The bottleneck shifted from memory I/O to actual arithmetic — which is where the N² lives, and which we don’t know how to remove without changing the model.

Where the math actually breaks down

A few subtleties worth seeing the seams on:

The thing to take away: when someone says “attention is quadratic,” they mean a specific, narrow, mathematically-forced thing — the N×N pairwise score matrix is intrinsic to softmax self-attention. Every long-context optimization in production is either (a) hiding where that matrix lives so it doesn’t blow up memory (FlashAttention), (b) skipping the prefill entirely when the prefix repeats (prompt caching), or (c) changing what attention means and accepting some loss of expressivity (sparse / sliding / linear variants). Knowing which bucket a given optimization sits in is most of understanding modern serving.

Going deeper