Why AI accelerators are wrapped in stacks of HBM
Open any photo of a modern AI GPU and you'll see the giant compute die in the middle, ringed by short, fat towers of memory soldered millimeters away. Those towers are HBM, and they exist because regular DRAM physically cannot feed a matrix engine fast enough.
Why it exists
Picture a busy restaurant where the chef can cook a dish in 5 seconds, but the pantry is in a building two blocks away. It doesn’t matter how fast the chef is — the kitchen runs at the speed of the runner fetching ingredients. To go faster, you don’t hire a faster chef. You move the pantry into the kitchen. HBM is the AI-chip version of that move. Instead of making the GPU compute faster, designers physically picked up the memory chips and stacked them millimeters away from the processor, on the same package. The chef and the pantry now share a counter. That short distance is the entire reason modern AI is possible at this scale.
Look at a package shot of a GPU meant for AI — an H100, an MI300, a TPU package — and the visual is almost always the same. There’s a big square of compute logic in the middle, and right up against it, separated by a millimeter or two of silicon interposer, sit four to eight short rectangular towers. Those towers are HBM stacks. They look out of place — like someone glued extra chips to the GPU — and that visual oddity is the whole story.
Regular computer memory doesn’t sit there. DDR sticks live a few centimeters away on the motherboard, connected by long copper traces. GDDR chips, used in gaming GPUs, sit a centimeter away on the same PCB. HBM sits next to the compute die on the same package, bonded to a silicon interposer, with thousands of wires running between them.
The reason is brutally simple: AI workloads don’t need more compute as much as they need more bandwidth — bytes per second from memory into the matrix multiplier — and the only way to get that many bytes that fast is to put the memory practically inside the chip.
Why it matters now
To a first approximation, modern frontier-model training and inference are memory-bandwidth problems dressed up as compute problems. A transformer forward pass on a single token spends most of its time reading weights and the KV cache out of memory and feeding them into matrix multipliers. The multipliers themselves are rarely the bottleneck; the wires between memory and multipliers usually are. This is why a chip’s HBM bandwidth — measured in terabytes per second, not gigabytes — ends up being one of the most-quoted numbers in any new AI silicon launch, sometimes ahead of the FLOP count.
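To see how directly bandwidth caps decode speed, here’s a back-of-envelope sketch (the model size, precision, and bandwidth below are assumed example numbers, not measurements of any specific chip):

```python
# Lower bound on per-token decode latency when the only cost is streaming
# every weight out of HBM once per token (batch size 1, no KV cache, no
# overlap). All inputs are illustrative assumptions.

def min_decode_latency_s(n_params: float, bytes_per_param: float,
                         hbm_bytes_per_s: float) -> float:
    """Time to read the full weight set through the matrix units once."""
    return (n_params * bytes_per_param) / hbm_bytes_per_s

# Assumed example: 70B parameters in fp16 on a ~3.35 TB/s part.
t = min_decode_latency_s(70e9, 2, 3.35e12)
print(f"~{t * 1e3:.0f} ms/token floor, so at most ~{1 / t:.0f} tokens/s")
# ~42 ms/token, ~24 tokens/s -- regardless of how many FLOPs the compute
# die advertises.
```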
It’s also why chip supply has gotten weird. The HBM market is dominated by a handful of DRAM manufacturers (publicly: SK hynix, Samsung, Micron), and constraints there now constrain who can build AI accelerators at all. The compute logic isn’t usually the gating part. The stacked memory is.
The short answer
HBM = stacked DRAM dies + through-silicon vias + silicon interposer + very wide, slow bus
HBM is just regular DRAM cells, rearranged. You take eight or twelve DRAM dies, stack them physically on top of each other, drill vertical wires straight through the silicon to connect them (those are TSVs), and then sit that whole tower on an interposer right next to the compute die. The bus to the compute die isn’t fast per wire — each pin runs at modest, DDR-class speeds, well below GDDR’s per-pin rates — but the bus is very wide, often 1024 bits per stack. Bandwidth = width × clock, and when you can’t push the clock further, you push the width.
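As a quick sanity check on that width-times-clock arithmetic (the per-pin rate below is an assumed HBM3-class figure; shipping parts bin differently):

```python
# Bandwidth = bus width x per-pin data rate. One 1024-bit HBM stack at an
# assumed 6.4 Gb/s per pin:
bus_width_bits = 1024
pin_rate_gbps = 6.4                                  # assumed, HBM3-class
stack_gb_per_s = bus_width_bits * pin_rate_gbps / 8  # bits -> bytes
print(f"~{stack_gb_per_s:.0f} GB/s per stack")       # ~819 GB/s
```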
How it works
Three pieces have to be true at once.
1. Bandwidth is set by physics, not cleverness.
Memory bandwidth for any DRAM technology is roughly pins × bits per pin per clock × clock. Push the clock too hard and signal integrity falls apart on the long PCB traces — DDR5 sticks max out around 6–8 GT/s per pin in practice, and GDDR pushes higher because its traces are shorter, but every step up in clock costs more power for diminishing returns. The physics is set by capacitance, inductance, and how loudly a wire couples to its neighbors. You can’t just print “10 GHz” on the box.
So if you want more bandwidth and you can’t run faster, you run wider. A DDR5 channel is 64 bits. A modern HBM stack is 1024 bits (split into 16 channels of 64 bits each). That’s the trick: instead of trying to win the GHz race, HBM wins the parallel-pins race. A single H100 with five HBM stacks moves a few terabytes per second. The same compute die paired with sticks of DDR would move maybe a tenth of that.
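A sketch of where that “maybe a tenth” comes from, holding the per-pin rate equal so only the width differs (channel counts and rates are assumed, spec-level numbers; real parts often clock lower):

```python
def aggregate_bw_gb_s(channels: int, width_bits: int, rate_gt_s: float) -> float:
    # total bandwidth = channels x bits per channel x transfers/s, in GB/s
    return channels * width_bits * rate_gt_s / 8

ddr5 = aggregate_bw_gb_s(channels=8, width_bits=64, rate_gt_s=6.4)    # big server CPU
hbm3 = aggregate_bw_gb_s(channels=5, width_bits=1024, rate_gt_s=6.4)  # 5 HBM stacks
print(f"DDR5: ~{ddr5:.0f} GB/s, HBM3: ~{hbm3:.0f} GB/s, "
      f"ratio ~{hbm3 / ddr5:.0f}x")
# ~410 GB/s vs ~4096 GB/s: a 10x gap from pin count alone, at the same
# per-pin speed.
```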
2. You can’t have 1024 wires per stack on a normal PCB.
This is where the interposer matters. A printed circuit board can route maybe a few hundred fast signals to a chip before you run out of layers and space. A silicon interposer — basically a thin extra slab of silicon underneath both the compute die and the HBM stacks — is fabricated with the same lithography used for chips, so it can carry thousands of microscopic traces packed tightly together. That’s the only way a 1024-bit-per-stack bus is physically buildable.
The interposer is also why HBM has to live so close. Beyond a few millimeters those tiny traces start losing signal integrity and burning power. So HBM towers ring the compute die — they have nowhere else to go.
3. Stacking is how you get density without growing the package.
A single DRAM die only stores so many bits. To hit the tens of gigabytes per stack that AI workloads want, the dies are physically stacked — eight high, twelve high, sometimes more — and connected vertically through TSVs. Stacking trades cost and yield (you have to bond and test each layer; one bad die can ruin the stack) for density and very short vertical wires.
The standard account is that this stacking is genuinely hard manufacturing. Yield on a 12-high stack is worse than yield on a single die, and thermals are awkward — heat from the bottom die has to leave through layers of memory above it. I don’t have a reliable public number for current HBM stack yield, and the major manufacturers don’t publish one, so take “it’s hard” as the qualitative claim, not a specific figure.
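To make the compounding concrete without pretending to know real figures, here’s a toy model: every yield number in it is invented, but the structure is the qualitative point (one bad die kills the stack, so stack yield falls geometrically with height):

```python
# Toy yield model: if each die independently survives bonding and test with
# probability p, and one bad die ruins the stack, an n-high stack yields
# p**n. The 99%/95% inputs are made up for illustration -- manufacturers
# don't publish real values.
for p in (0.99, 0.95):
    for n in (8, 12):
        print(f"per-die yield {p:.0%}, {n}-high stack -> {p ** n:.0%}")
# 99% per die: 92% at 8-high, 89% at 12-high
# 95% per die: 66% at 8-high, 54% at 12-high
```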
Why this favors AI workloads specifically.
A transformer doing inference reads gigabytes of weights, runs them through a matrix multiply once, and throws them away. The arithmetic-to-memory ratio (the “arithmetic intensity”) is low — you don’t reuse each loaded weight much before you need the next one. That makes the workload memory-bandwidth bound, not compute-bound. HBM exists for exactly this regime. It’s overkill for a normal CPU running a database, where caches and a few channels of DDR are fine. It’s the right size for a chip that has to feed thousands of multiply-accumulate units every nanosecond.
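A sketch of that arithmetic-intensity calculation for batch-1 decode, where a weight matrix is read once per matrix-vector multiply (the dimensions and fp16 precision are arbitrary example choices):

```python
# Arithmetic intensity of a matrix-vector multiply: an m x n fp16 weight
# matrix costs 2*m*n FLOPs (multiply + add) and 2*m*n bytes of weight
# traffic, so each loaded byte is used for about one FLOP.
m, n, bytes_per_weight = 8192, 8192, 2
flops = 2 * m * n
weight_bytes = m * n * bytes_per_weight
print(f"arithmetic intensity ~ {flops / weight_bytes:.1f} FLOPs/byte")  # ~1.0
# Matrix engines need hundreds of FLOPs per byte to stay busy; at ~1
# FLOP/byte the memory bus, not the multiplier, sets the pace.
```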
Training has different bottlenecks (gradients, activations, optimizer state) but the memory-bandwidth dependence is, if anything, sharper. The roofline plot for a typical transformer kernel sits firmly on the memory-bandwidth slope, not the compute ceiling.
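The roofline statement itself fits in one line: attainable throughput is the lesser of the compute ceiling and bandwidth times intensity. The peak-FLOPs and bandwidth figures below are assumed, round numbers:

```python
def attainable_flops(peak_flops: float, bw_bytes_s: float,
                     intensity_flops_per_byte: float) -> float:
    # Roofline: min(compute ceiling, memory-bandwidth slope)
    return min(peak_flops, bw_bytes_s * intensity_flops_per_byte)

# Assumed accelerator: 1e15 FLOP/s peak, 3.35e12 B/s of HBM bandwidth.
for ai in (1, 10, 100, 300):
    frac = attainable_flops(1e15, 3.35e12, ai) / 1e15
    print(f"intensity {ai:>3} FLOPs/byte -> {frac:6.1%} of peak")
# At intensity 1 a kernel reaches ~0.3% of peak; it takes ~300 FLOPs/byte
# to climb off the bandwidth slope onto the compute ceiling.
```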
Famous related terms
- GDDR — GDDR ≈ DDR + shorter traces + more aggressive clocks — what gaming GPUs use; bandwidth between DDR and HBM, but cheap and on a normal PCB.
- DDR / LPDDR — DDR = standard PC main memory + 64-bit channels + commodity sticks — what your laptop and server CPU talk to. Plenty for general computing, far too narrow for a matrix engine.
- Through-silicon via (TSV) — TSV = vertical hole + plated metal + die-to-die connection — the physical trick that makes a stack act like one chip electrically.
- Silicon interposer — interposer = thin silicon slab + microscopic traces + a way to wire two dies together at chip density — what lets HBM sit next to the compute die instead of across a motherboard.
- Memory bandwidth — the metric HBM exists to maximize.
- Roofline model — roofline = peak FLOPs ceiling + memory-bandwidth slope — the chart that says, for a given kernel, whether compute or memory is your wall.
Going deeper
- The JEDEC HBM standard documents (HBM, HBM2, HBM2E, HBM3, HBM3E) are the canonical specs for bus widths, stack heights, and per-pin rates. Successive revisions mostly raise the per-pin rate and the stack height; the 1024-bit stack width has stayed put across these generations.
- AMD’s 2015 paper “High Bandwidth Memory: The New Standard for Graphics” (published with SK hynix) is the early-era pitch; it lays out the interposer-and-stack argument cleanly.
- For the workload side, Williams, Waterman, and Patterson’s “Roofline: An Insightful Visual Performance Model for Multicore Architectures” (CACM, 2009) is the source for why memory-bandwidth-bound is a useful category at all.
- I don’t have a reliable public source for current HBM stack yields or per-stack manufacturing costs; if you find one, it’ll usually come from analyst firms (TrendForce, SemiAnalysis) rather than the manufacturers themselves.