
Why CPUs have three levels of cache

Look at a CPU die shot and you'll find more area spent on memory than on math — and that memory is split into L1, L2, and L3. The split exists because no single cache can be both big and fast, so the chip builds a ladder instead.


Why it exists

Think about how you actually work at a desk. The thing you’re touching right now is in your hand. The papers you keep reaching for are spread on the desk. The rest is in a drawer, and the bulk of it is in a filing cabinet down the hall. Nobody designs it that way on purpose — it just falls out of two facts that fight each other: the surface you can reach instantly is small, and the storage that holds everything is far away.

A CPU has the exact same problem, and it’s not a small one. A modern processor core can do arithmetic in well under a nanosecond. But fetching a value from main memory — the DRAM sticks on your motherboard — takes on the order of a hundred nanoseconds. That’s not a rounding error. That’s the core sitting on its hands for hundreds of arithmetic operations it could have done, waiting for one number to arrive. If every instruction had to wait for memory, a multi-gigahertz CPU would crawl.

So chip designers put a small, fast memory on the chip itself, close to the core, and call it a cache. The obvious next question — the one this post is about — is: if a fast on-chip cache is so good, why not just make it huge and skip slow DRAM entirely? The answer is that you can’t. A cache that’s big is, for unavoidable physical reasons, also slow. So instead of one cache, the chip builds a ladder of them: tiny and fast at the top, large and sluggish at the bottom. The rungs of that ladder are conventionally called L1, L2, and L3.

Why it matters now

This isn’t museum-piece architecture — it’s still where a lot of real-world performance is won or lost. Single-thread speed stopped climbing two decades ago (see the power wall), so for memory-bound code the way you make it faster now is largely by not waiting on memory. A loop that walks an array in order flies, because the hardware prefetcher can see the pattern and pull the data in ahead of time; the same loop chasing pointers around a linked list stalls constantly, because every step is a fresh gamble against that hundred-nanosecond DRAM latency. Similar amounts of source-level work, wildly different speed — and the cache hierarchy is the whole reason why.
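To make that concrete, here is a minimal C sketch of the two loops. The names are mine and it isn't a benchmark, just the two access patterns side by side (the list nodes are assumed to be scattered around the heap, as they usually end up in practice):

    /* Two O(n) summations over the same values, different layouts. */
    #include <stddef.h>

    /* Sequential walk: consecutive addresses, so the hardware
       prefetcher can stream data in ahead of the loop. */
    long sum_array(const long *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Pointer chase: the next address is unknown until the current
       load finishes, so a miss pays the full memory latency. */
    struct node { long value; struct node *next; };

    long sum_list(const struct node *p) {
        long s = 0;
        for (; p != NULL; p = p->next)
            s += p->value;
        return s;
    }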

It also shows up the moment you go parallel. On a typical x86 chip, L1 and L2 are private to each core while L3 is shared across cores (the exact layout varies by vendor and generation — more on that below). That sharing line is where multi-threaded performance gets subtle: if two cores keep touching the same cache line and at least one of them writes to it, the line ping-pongs between their caches in a way two cores touching separate lines never suffer.
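Here is a minimal sketch of that ping-pong in C with POSIX threads. It assumes a 64-byte cache line, which is common but not universal, and the iteration counts are arbitrary:

    /* False sharing: two threads write two *different* counters.
       If the counters share a cache line, the line bounces between
       the cores' caches on every write. Assumes 64-byte lines. */
    #include <pthread.h>
    #include <stdio.h>

    struct counters {
        volatile long a;              /* written only by thread 1 */
        char pad[64 - sizeof(long)];  /* keeps b on a different line */
        volatile long b;              /* written only by thread 2 */
    };
    /* volatile stops the compiler collapsing each loop into one add */

    static struct counters c;

    static void *bump_a(void *arg) {
        (void)arg;
        for (long i = 0; i < 100000000L; i++) c.a++;
        return NULL;
    }

    static void *bump_b(void *arg) {
        (void)arg;
        for (long i = 0; i < 100000000L; i++) c.b++;
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%ld %ld\n", c.a, c.b);
        return 0;
    }

Delete the pad field and time both builds: same code, same two counters, but now every write drags the line away from the other core.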

The short answer

cache hierarchy = a stack of caches, each trading size for speed, because no single memory can be both big and fast

You’d love one memory that is enormous and answers instantly. Physics won’t sell you that. What it will sell you is a spectrum: at one end, a tiny store that answers in a clock tick or two; at the other, big DRAM that takes ~100× longer. The hierarchy is the engineering move of buying several points along that spectrum and stacking them — L1 tiny and instant, L2 bigger and slower, L3 bigger still and slower still — so that most requests get caught near the top and only the rare miss pays the full price of going to DRAM.

How it works

Start with the thing that makes the whole idea work, because without it the hierarchy would be pointless: locality. Real programs don’t touch memory randomly. They touch the same data again soon (temporal locality — a loop counter, a hot object), and they touch data near what they just touched (spatial locality — the next element of an array). Because of locality, a small cache holding “the stuff used recently and its neighbours” tends to catch the large majority of accesses for typical workloads even though it holds a tiny fraction of the data. If programs were truly random, no cache would help and none of this would exist.
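Spatial locality is easy to demonstrate. A C sketch (the matrix size is an arbitrary pick, big enough to spill out of any cache):

    /* C stores 2-D arrays row-major: each row is contiguous. */
    #include <stdio.h>

    #define N 4096
    static double m[N][N];   /* 128 MB, far larger than any cache */

    /* Walks memory in order: every cache line fetched is fully used. */
    double sum_by_rows(void) {
        double s = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += m[i][j];
        return s;
    }

    /* Jumps N * 8 bytes per step: each fetched line yields one value
       before the walk moves on, wasting the rest of it. */
    double sum_by_cols(void) {
        double s = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += m[i][j];
        return s;
    }

    int main(void) {
        printf("%f %f\n", sum_by_rows(), sum_by_cols());
        return 0;
    }

Identical arithmetic, identical result, and on most machines the column-order walk is several times slower. That gap is pure locality.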

Now, why can’t one cache be both big and fast? The pressures are physical. Signals take time to cross silicon, so a bigger array means longer wires and a longer worst-case trip to the farthest cell. A bigger cache also means more work per lookup: more address bits to decode and more tags to compare before it even knows whether it holds the data. And the fastest SRAM cells are the biggest and most power-hungry, so a chip can afford tens of kilobytes of them, not tens of megabytes.

So you don’t pick one. You build a ladder, where each rung trades size for speed:

flowchart LR
    Core[Core] -->|asks first: a few cycles| L1[L1 cache]
    L1 -->|on miss: slower| L2[L2 cache]
    L2 -->|on miss: slower still| L3[L3 cache]
    L3 -->|on miss: ~100+ cycles| RAM[Main memory / DRAM]

Read it left-to-right as a series of bets. The core asks L1 first. L1 is tiny — on the order of tens of kilobytes per core — and that’s the point: small enough to answer in a few cycles. Miss in L1 and you ask L2: bigger (commonly hundreds of KB to a few MB per core), so it takes longer, but still on-chip and far faster than DRAM. Miss in L2 and you ask L3: bigger again (often tens of MB), commonly shared by a group of cores, slower still. Only if L3 misses do you pay the full trip to DRAM.

The numbers above are deliberately given as orders of magnitude, not exact figures — the precise cycle counts and sizes vary by chip generation, vendor, and even product tier, and I’m not going to invent specific values. What’s robust across all of them is the ratios: each level down is roughly several-fold to an order-of-magnitude slower and bigger than the one above, and the jump from L3 to DRAM is the biggest cliff in the whole stack.
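You can measure the cliffs yourself rather than trust anyone's table. A C sketch, assuming POSIX clock_gettime; the sizes and hop count are arbitrary choices:

    /* Chase pointers through a shuffled ring of growing size. When
       the ring outgrows a cache level, ns-per-hop jumps. The random
       cycle defeats the prefetcher, exposing raw latency. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        for (size_t bytes = 16 << 10; bytes <= (size_t)128 << 20; bytes <<= 1) {
            size_t n = bytes / sizeof(size_t);
            size_t *ring = malloc(n * sizeof *ring);
            if (!ring) return 1;

            /* Sattolo's algorithm: one random cycle over all n slots. */
            for (size_t i = 0; i < n; i++) ring[i] = i;
            for (size_t i = n - 1; i > 0; i--) {
                size_t j = (size_t)rand() % i;
                size_t tmp = ring[i]; ring[i] = ring[j]; ring[j] = tmp;
            }

            size_t hops = 1 << 24, p = 0;
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (size_t i = 0; i < hops; i++) p = ring[p];
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                      + (t1.tv_nsec - t0.tv_nsec);
            /* printing p keeps the chase from being optimized away */
            printf("%8zu KB: %6.2f ns/hop (%zu)\n", bytes >> 10, ns / hops, p);
            free(ring);
        }
        return 0;
    }

Plot ns-per-hop against size and the output forms a staircase: one step per cache level, with the tallest riser where L3 gives way to DRAM.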

Why does stacking actually pay off, instead of just averaging out to “medium speed for everything”? Because of locality, the hit rates near the top are high for typical workloads. If L1 catches the large majority of accesses in a few cycles, and L2 catches most of the rest, then the average access time is dragged close to L1’s speed — even though the rare miss-all-the-way-to-DRAM access is brutally slow. The hierarchy works because it’s a weighted average where the fast levels carry most of the weight.
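Here is that weighted average as arithmetic, in a tiny C program with made-up round numbers (chosen only to show the shape of the calculation, not to describe any chip):

    /* Average memory access time, computed bottom-up:
       AMAT = t_L1 + miss_L1 * (t_L2 + miss_L2 * (t_L3 + miss_L3 * t_DRAM)) */
    #include <stdio.h>

    int main(void) {
        double l1 = 4, l2 = 12, l3 = 40, dram = 200;  /* cycles, invented */
        double m1 = 0.10, m2 = 0.50, m3 = 0.50;       /* miss rates, invented */
        double amat = l1 + m1 * (l2 + m2 * (l3 + m3 * dram));
        printf("%.1f cycles\n", amat);                /* 12.2 with these inputs */
        return 0;
    }

Even with DRAM fifty times slower than L1 in this toy, the average lands at barely three times L1's latency, because the slow levels are weighted by small miss rates.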

Two more details matter and surprise people. First, caches don't move single bytes: they move fixed-size blocks called cache lines, commonly 64 bytes. Touch one byte and you get its neighbours for free, which is why spatial locality pays and why the false-sharing ping-pong above bites threads that never touch the same variable. Second, L1 is usually not one cache but two: a split instruction cache and data cache, so the core can fetch code and load data at the same time, while L2 and L3 are unified.

One honest caveat on all of this: “three levels, private L1/L2, shared L3” describes the mainstream desktop and server x86 layout, and it’s the right mental model to start from — but it isn’t a law. Vendors differ. Some chips share L3 only within a cluster of cores rather than the whole die; some share L2 across a small group; some add a further system-level cache below L3. The principle — a ladder trading size for speed because you can’t have both — is universal; the exact number of rungs and who shares which one is a design choice.

So the answer to the opening question — why not one huge fast cache — is that the request “huge and fast” is physically incoherent. The hierarchy is what you build when you can’t have both: a ladder that’s mostly-fast on average, with a slow bottom rung you visit as rarely as locality lets you.

Going deeper