Why GPUs ended up running AI even though they were built for graphics
GPUs were designed to shade pixels. Then the same hardware turned out to be the best general-purpose neural-net engine on the planet. That fit isn't an accident: graphics and deep learning need the same thing from silicon.
Why it exists
If you’d told a graphics engineer in 1999 that the chip they were designing to draw triangles in Quake would, twenty-five years later, be the most expensive component in a hyperscaler datacenter and the bottleneck of human civilization’s compute budget, they would have laughed. GPUs were not built for science. They were built so that the same shader program could run, independently, on every pixel of a screen at sixty frames per second.
And yet today essentially every frontier neural network — every LLM, every diffusion model, every recommender — is trained and served on GPUs (or things that look like GPUs, like TPUs). The interesting question isn't whether this happened; it's why the fit is so absurdly good. Graphics and deep learning look like completely different problems. Why does the same hardware win at both?
The short version: both problems are, deep down, the same shape — do the same arithmetic to a giant pile of numbers, all at once, with no branching. The hardware that solved one already solved the other. We just didn’t notice for a while.
Why it matters now
If you’re a software engineer in 2026, the cost curve of your product probably has a GPU in the denominator. The reason an inference call costs what it costs, the reason fine-tuning a 70B model is expensive, the reason a startup’s runway is partly a function of NVIDIA’s gross margin — it all comes back to the fact that the only chips on Earth that can do dense linear algebra at scale, cheaply, are descendants of pixel shaders.
Understanding why graphics hardware became AI hardware also tells you what an alternative would have to look like. Custom AI silicon (TPUs, Trainium, MTIA, Cerebras wafers) isn’t trying to “be a better GPU” — it’s trying to keep the parts of the GPU that matter for neural nets and drop the parts that exist purely because of legacy graphics.
The short answer
GPU = thousands of slow arithmetic lanes + wide memory + "everyone runs the same instruction" execution model
A GPU is not a fast computer. A single GPU “core” is much weaker than a CPU core — slower clock, little to no branch prediction, a sliver of cache per lane. What a GPU has is many of them, all forced to run the same instruction at the same time, fed by an unusually fat pipe to memory. That’s exactly the recipe for shading a million pixels. It’s also exactly the recipe for multiplying two big matrices, which is what a neural network mostly is.
How it works
Three threads of the story have to come together before the fit is clear.
1. Graphics is embarrassingly parallel, and it’s all matmul under the hood.
To render a 3D scene, the GPU does roughly: take a list of vertices, multiply each one by a 4×4 transformation matrix to get screen coordinates; then for every pixel covered, run a small program (a shader) to decide its color. The work on one pixel doesn’t depend on the work on the next pixel. There are millions of them. They all run the same program. This is the textbook definition of data parallelism.
Underneath, the operations are dominated by tiny matrix multiplies and dot products: transforming vertices, lighting calculations, texture sampling, color blending. By the late 1990s, GPU vendors had built silicon that could do thousands of these in parallel, every frame, forever. The execution model that eventually fell out of this is called SIMT (single instruction, multiple threads): many threads, all running the same instruction in lockstep on different data.
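To make the shape concrete, here is a minimal CUDA sketch of the vertex-transform step: one thread per vertex, every thread executing the same 4×4 multiply on its own data. The kernel name and launch configuration are illustrative, not any real driver's pipeline.

```cuda
#include <cuda_runtime.h>

// One thread per vertex. Every thread executes the same instructions;
// only the data (its own vertex) differs. This is SIMT in miniature.
__global__ void transform_vertices(const float4* __restrict__ in,
                                   float4* __restrict__ out,
                                   const float* __restrict__ m,  // 4x4 row-major
                                   int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 v = in[i];
    out[i] = make_float4(
        m[0]  * v.x + m[1]  * v.y + m[2]  * v.z + m[3]  * v.w,
        m[4]  * v.x + m[5]  * v.y + m[6]  * v.z + m[7]  * v.w,
        m[8]  * v.x + m[9]  * v.y + m[10] * v.z + m[11] * v.w,
        m[12] * v.x + m[13] * v.y + m[14] * v.z + m[15] * v.w);
}

// Launch: transform_vertices<<<(n + 255) / 256, 256>>>(d_in, d_out, d_m, n);
```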
2. Neural networks turn out to be the same shape.
A forward pass through a transformer is, to a first approximation, a stack of large matrix multiplications interleaved with cheap element-wise operations (additions, nonlinearities, normalizations). Backprop is more matrix multiplications. Training a model for months is, mostly, doing matmul forever. (See why matmul is the bottleneck.)
Matmul is the most embarrassingly parallel thing in numerical computing. Every output element is an independent dot product. There are no branches, no sequential dependencies inside the multiply, no need for elaborate control flow. It’s the exact workload SIMT was built for — except instead of “one instruction over a million pixels,” it’s “one instruction over a million matrix tiles.”
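The neural-net version is nearly the same program with a different loop. A deliberately naive sketch: one thread per output element, each computing an independent dot product. Production kernels tile through shared memory and use Tensor Cores, but the parallel shape is the same.

```cuda
// C = A @ B, with A (M x K), B (K x N), C (M x N), all row-major.
// One thread per output element: no branches, no cross-thread
// dependencies, just an independent dot product per (row, col).
__global__ void matmul_naive(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;
    float acc = 0.0f;
    for (int k = 0; k < K; ++k)
        acc += A[row * K + k] * B[k * N + col];
    C[row * N + col] = acc;
}
```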
The realization that GPUs could run general numerical code crystallized around 2007 with NVIDIA’s CUDA, which exposed the GPU as a programmable parallel machine instead of a fixed-function pixel pipeline. Deep learning’s takeoff moment — the AlexNet ImageNet result in 2012 — ran on two consumer GeForce GTX 580 cards and CUDA. That wasn’t a coincidence; it was the first widely visible case of “the graphics chip is also the math chip.”
3. Wide memory, not fast clocks, is the actual moat.
Much of modern training and serving is limited by how fast you can move weights and activations between memory and the arithmetic units, not by how fast the units themselves can multiply: big square matmuls can keep the compute busy, but elementwise ops and small-batch inference live and die on bandwidth. (See memory bandwidth.) GPUs ship with HBM — DRAM stacks bonded next to the chip — delivering terabytes per second of bandwidth. That bandwidth exists because graphics, too, is bandwidth-bound: you’re streaming textures and framebuffers constantly. The same memory subsystem that fed pixel shaders now feeds matrix multiply units.
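A quick back-of-the-envelope check makes the regimes visible. Arithmetic intensity is FLOPs per byte moved; a workload is bandwidth-bound whenever its intensity falls below the hardware's ratio of peak FLOP/s to memory bandwidth. The hardware numbers below are illustrative ballparks, not any vendor's spec sheet:

```cuda
#include <cstdio>

// Arithmetic intensity = FLOPs / bytes moved. The hardware "ridge" is
// peak FLOP/s divided by memory bandwidth; below it, you are
// bandwidth-bound. Assumes fp16 operands and ideal reuse.
static void check(const char* name, double M, double N, double K) {
    double flops = 2.0 * M * N * K;                // one multiply-add each
    double bytes = 2.0 * (M * K + K * N + M * N);  // 2 bytes per fp16 value
    double intensity = flops / bytes;
    double ridge = 1e15 / 3e12;                    // ~333 FLOPs/byte (ballpark)
    printf("%s: %.1f FLOPs/byte -> %s\n", name, intensity,
           intensity > ridge ? "compute-bound" : "bandwidth-bound");
}

int main() {
    check("square training GEMM (4096^3)", 4096, 4096, 4096);
    check("batch-1 decode GEMV (1 x 4096 x 4096)", 1, 4096, 4096);
    return 0;
}
```

The square GEMM lands comfortably above the ridge; the batch-1 decode multiply lands at roughly one FLOP per byte, which is why serving a model token-by-token is almost purely a memory-bandwidth problem.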
The CPU world chose latency: small, fast caches, complex prediction, lots of silicon spent on making one thread fast. The GPU world chose throughput: many slow lanes, very wide memory, no per-thread cleverness. AI happens to want throughput, not latency.
What’s vestigial, and where the seams show.
Not everything on a GPU is useful for AI. Texture units, raster operators, ray-tracing cores, video encoders — all dead weight for neural-net work. NVIDIA has gradually paved over this by adding Tensor Cores, which are essentially “matmul accelerators bolted into the same SIMT envelope.” A modern training GPU is mostly a matmul engine wearing a graphics chip’s skin. Custom AI chips like TPUs go further and drop the graphics legacy entirely: a TPU is, roughly, a giant systolic array with a tiny bit of control logic around it.
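Here is roughly what "matmul accelerator bolted into the SIMT envelope" looks like from software: with CUDA's WMMA intrinsics, a whole warp cooperatively computes one small matrix tile on a Tensor Core. A minimal sketch for a single 16×16×16 fp16 tile; real kernels loop this over large matrices, and it requires compute capability 7.0 or later.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes one 16x16 output tile: D = A * B + 0.
// The warp's 32 threads collectively hold the tile in "fragments";
// mma_sync issues the actual multiply to the Tensor Core.
__global__ void wmma_tile(const half* A, const half* B, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;

    wmma::fill_fragment(c, 0.0f);
    wmma::load_matrix_sync(a, A, 16);   // leading dimension = 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(c, a, b, c);
    wmma::store_matrix_sync(D, c, 16, wmma::mem_row_major);
}

// Launch with exactly one warp: wmma_tile<<<1, 32>>>(dA, dB, dD);
```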
I should be honest about gaps. The exact die-area split between “graphics legacy” and “AI-relevant” silicon on, say, an H100 or a B200 isn’t something I have a reliable public number for, and the boundary is fuzzy because units like memory controllers and schedulers serve both. The standard account — that AI workloads are now the design target and graphics is increasingly the side gig — is well-supported by NVIDIA’s own roadmap statements, but I don’t want to pretend I have a precise breakdown.
The deeper reason this all worked is older than either field: when a workload is “the same arithmetic, repeated, with no dependencies,” the hardware that wins is the hardware that gives up everything else for parallel arithmetic and bandwidth. Graphics demanded that shape first. Deep learning showed up later and discovered the seat was already warm.
Famous related terms
- CUDA = GPU + general-purpose programming model. The 2007 software layer that turned graphics chips into compute chips. Most of NVIDIA’s moat is CUDA, not silicon.
- SIMT = many threads + one shared instruction stream. The GPU execution model: branches that diverge between threads serialize and waste lanes, which is why GPU code avoids data-dependent control flow (see the sketch after this list).
- Tensor Cores = small matmul units + mixed precision + on-die. Added to GPUs specifically because the AI workload outgrew the general-purpose lanes. Most of an H100’s FLOPs come from these.
- TPU ≈ giant systolic array + minimal control. Google’s bet on “drop the graphics legacy entirely.”
- HBM = DRAM stacks + bonded next to the chip + huge bus. The memory technology that makes “stream weights at TB/s” affordable. Without it, you can’t feed the matmul units.
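Since the SIMT entry above points at divergence, here is the canonical footgun in miniature: a data-dependent branch that splits a warp, forcing the hardware to run both sides serially with lanes masked off. A hypothetical sketch:

```cuda
// A branch whose condition differs within a warp: the hardware runs
// the if-side with the odd lanes masked off, then the else-side with
// the even lanes masked off. Effective throughput is roughly halved.
__global__ void divergent(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)
        out[i] = in[i] * 2.0f;   // even lanes execute; odd lanes idle
    else
        out[i] = in[i] + 1.0f;   // then odd lanes execute; even lanes idle
}
```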
Going deeper
- The AlexNet paper (Krizhevsky, Sutskever, Hinton, 2012) is the moment GPUs visibly became AI hardware. Worth reading just for the matter-of-fact “we used two GTX 580s” framing.
- NVIDIA’s CUDA documentation and the original 2008 Lindholm et al. paper on the Tesla architecture lay out the SIMT model in plain terms.
- Jouppi et al. “In-Datacenter Performance Analysis of a Tensor Processing Unit” (2017) is the canonical pitch for “what a chip looks like if you start from neural nets instead of from pixels.”
- For the broader framing — that throughput architectures beat latency architectures whenever the workload allows it — the later editions of Hennessy & Patterson’s “Computer Architecture: A Quantitative Approach,” especially the chapter on data-level parallelism, are the standard reference.