Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why image generation went diffusion, not autoregressive

LLMs are autoregressive: predict the next token. Image models could have been the same — predict the next pixel. Almost none of the dominant ones are. Here's why the field walked away from that approach.

AI & ML · intermediate · Apr 29, 2026

Why it exists

If you came to generative AI through LLMs, the architecture story sounds settled. You take a giant transformer, you train it to predict the next token in a sequence, you sample from it. Text, code, JSON, anything that can be tokenized — same recipe.

So the obvious question, the first time you look at how Stable Diffusion or Midjourney or DALL·E or Sora actually work, is: why isn’t this the same trick? Pixels are just numbers. You could flatten an image into a sequence of pixel values and predict the next one. Some early models did exactly that — PixelRNN and PixelCNN in 2016, Image GPT in 2020. They worked. They just didn’t win.

For most of the last five years the dominant open-weights and publicly documented image/video systems — DDPM, Stable Diffusion (1.x/2.x/SDXL), Imagen, and Sora on the video side — have been diffusion models (or close cousins like flow matching). Different machine, different objective, different sampling procedure. For closed systems like Midjourney, Runway, and Veo the exact internals aren’t public, so treat “diffusion won” as a claim about the cluster of papers and open weights, not a per-product attribution.

This post is about why. The short version is that “predict the next pixel” is a real architecture, but it solves a problem images don’t have (a natural ordering) while ignoring problems they do have (every pixel matters at once, and tiny independent errors compound). Diffusion gave up on sequence prediction entirely and replaced it with something that fits the geometry of images much better: start from noise, denoise toward an image, in many small steps.

Engineers integrating image or video generation into products keep running into the consequences — cost per image, latency, why a single sampling step looks blurry, why guidance scale matters, why there’s no “streaming first pixels” the way there’s streaming first tokens. Those are all downstream of this choice.

Why it matters now

Diffusion is the production form factor for visual generation. If you’re shipping anything visual, the cost model, the failure modes, and the controllability primitives all come from this design choice.

The short answer

diffusion model = a denoiser + a fixed noise schedule + sample by reversing the schedule from pure noise

You train one neural network to do one job: given a noisy image and a number telling it how noisy, predict the noise (or equivalently, a slightly-cleaner version). To generate a new image, you start with pure random noise and apply that denoiser many times in sequence, each step removing a little more noise, until what’s left is an image. The “generative model” is the entire reverse-noising trajectory, not a single forward pass.

That’s it. The rest of the post is why this beat autoregressive pixels for images, and where the seams are.

How it works

To see why diffusion fits images, it helps to first see what’s wrong with the LLM-style approach when you point it at pixels.

The problem with “predict the next pixel”

An autoregressive image model has to choose an order. Top-left to bottom-right? Hilbert curve? Coarse-to-fine over patches? Whatever you pick, you’re now claiming that pixel (i, j) only depends on pixels that came earlier in your ordering. That’s not how images work. Pixel (100, 100) depends on pixel (101, 101) just as much as the other way around. The autoregressive factorization picks a side anyway, because it has to.

This bites in three ways:

  1. Long-range coherence is hard. By the time the model is choosing a pixel near the bottom of a face, it has already committed to the pixels near the top — eyes, hairline. If the bottom doesn’t match (chin shape, lighting), the model can’t go back. LLMs have the same problem in principle, but text is roughly causally ordered (we read left-to-right, the next word does mostly depend on prior words). Pixels aren’t.
  2. Errors compound multiplicatively. Every sampled pixel conditions on every previous sampled pixel. A small mistake early — a slightly-off skin tone — gets baked into the conditional distribution for everything after. With millions of pixels per image, the joint distribution drifts.
  3. Sequence length is brutal. A 512×512 RGB image is ~786,000 “tokens” if you go pixel-by-pixel. Image tokenizers cut this down sharply — Parti, for example, uses a 32×32 = 1,024-token grid for a 256×256 image — but the per-token cost of an autoregressive transformer plus quadratic attention still makes naive AR image generation expensive enough that early pixel-RNN/CNN models could only ever produce small images.

You can fix some of this — Image GPT coped by shrinking images to tiny resolutions and quantizing colors with a learned k-means palette, and modern token-based image models (Parti, plus the autoregressive part of GPT-4o image generation) use far more capable learned tokenizers that shrink the spatial grid itself. There are also non-autoregressive token-based generators in the same family — MaskGIT (Chang et al., CVPR 2022) deliberately isn’t raster-scan AR; it iteratively unmasks tokens in parallel, which is closer in spirit to diffusion than to next-token prediction. The “predict the next image token in raster order” branch is the one that lost, not “all token-based image models.”
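To make the sequence-length point concrete, here is the back-of-envelope arithmetic. It is pure Python with no model involved; the only assumption beyond the numbers quoted above is that attention cost scales with the square of sequence length.

```python
# Back-of-envelope arithmetic for the sequence-length point above. Pure Python,
# no model involved; the only extra assumption is quadratic attention cost.
h, w, c = 512, 512, 3

pixels_as_tokens = h * w * c       # raster-scan, one token per channel value
print(pixels_as_tokens)            # 786_432

# Parti-style learned tokenizer: a 256x256 image becomes a 32x32 token grid.
token_grid = 32 * 32
print(token_grid)                  # 1_024

# Self-attention cost grows with sequence length squared, so the gap explodes:
print((pixels_as_tokens / token_grid) ** 2)   # 589_824x more attention work
```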

What diffusion actually does

Diffusion flips the problem. Instead of generating an image one piece at a time, it generates the whole image at every step, but starts with one that’s almost entirely noise and gradually removes the noise.

The training story is small enough to hold in your head:

  1. Take a real image x.
  2. Pick a random “timestep” t between 0 (clean) and T (pure noise).
  3. Add Gaussian noise to x according to a fixed schedule that knows how much noise corresponds to step t. Call the result x_t.
  4. Show the network x_t and t. Ask it to predict the noise that was added (this is the DDPM noise-prediction objective from Ho, Jain, Abbeel 2020).
  5. Loss = mean squared error between predicted and actual noise.
  6. Repeat over millions of (image, timestep) pairs.

That’s the entire training objective. No adversarial loss, no likelihood-by-pixel-ordering, no discriminator. One regression problem.
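Here is roughly what that loop looks like in code. This is a minimal sketch, not a recipe from any particular paper or codebase: the linear beta schedule, the tiny MLP denoiser, and the crude timestep input are placeholders standing in for a real U-Net or DiT.

```python
# A minimal sketch of the DDPM-style training loop described above, assuming a
# toy stand-in denoiser. The schedule, MLP, and timestep encoding are
# illustrative placeholders, not any production model's internals.
import torch
import torch.nn as nn

T = 1000                                     # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)        # fixed noise schedule (step 3)
alphas_bar = torch.cumprod(1.0 - betas, 0)   # cumulative signal fraction per step

# Stand-in denoiser over flattened 3x32x32 images; real models use a U-Net or DiT.
denoiser = nn.Sequential(nn.Linear(3 * 32 * 32 + 1, 256), nn.ReLU(),
                         nn.Linear(256, 3 * 32 * 32))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

def training_step(x0):                       # x0: a batch of clean images (step 1)
    B = x0.shape[0]
    t = torch.randint(0, T, (B,))                        # random timestep per image (step 2)
    a = alphas_bar[t].view(B, 1, 1, 1)
    eps = torch.randn_like(x0)                           # the noise we will ask it to find
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps           # noised image at step t (step 3)
    inp = torch.cat([x_t.flatten(1), t.float().view(B, 1) / T], dim=1)
    eps_pred = denoiser(inp).view_as(x0)                 # predict the added noise (step 4)
    loss = nn.functional.mse_loss(eps_pred, eps)         # plain MSE regression (step 5)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

loss = training_step(torch.randn(8, 3, 32, 32))          # dummy batch, just to run the step
```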

To generate, you reverse the schedule:

  1. Sample pure noise x_T.
  2. For t = T, T-1, ..., 1: ask the network “what noise is in x_t at step t?” Subtract a fraction of it. You now have x_{t-1}, slightly less noisy.
  3. After T steps you’ve reached x_0, an image.
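A matching sketch of the reverse loop, continuing the training example above (it reuses denoiser, betas, alphas_bar, and T). This is plain DDPM ancestral sampling; production systems usually swap in faster samplers, but the shape of the loop is the same.

```python
# The reverse loop, continuing the training sketch above: it reuses `denoiser`,
# `betas`, `alphas_bar`, and `T`. Plain DDPM ancestral sampling; real pipelines
# usually swap in faster samplers (DDIM, DPM-Solver) with far fewer steps.
import torch

alphas = 1.0 - betas

@torch.no_grad()
def sample(shape=(1, 3, 32, 32)):
    x = torch.randn(shape)                               # step 1: pure noise x_T
    for t in reversed(range(T)):                         # step 2: walk the schedule backwards
        inp = torch.cat([x.flatten(1),
                         torch.full((shape[0], 1), t / T)], dim=1)
        eps_pred = denoiser(inp).view(shape)             # "what noise is in x_t?"
        # Remove this step's share of the predicted noise (the DDPM posterior mean).
        x = (x - betas[t] / (1 - alphas_bar[t]).sqrt() * eps_pred) / alphas[t].sqrt()
        if t > 0:                                        # add back a little fresh noise,
            x = x + betas[t].sqrt() * torch.randn(shape) # except on the final step
    return x                                             # step 3: x_0, a (toy) image

img = sample()
```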

A few things fall out of this that are worth pausing on:

  1. Every step touches the whole image, so there is no pixel ordering to choose and no early mistake that later pixels are forced to condition on.
  2. There is also nothing like streaming the first pixels: intermediate states are just noisier versions of the entire image, which is why image APIs don't stream the way text APIs do.
  3. The cost of a sample scales with the number of denoising steps, which is where the cost-per-image and latency questions from the intro come from.

Latent diffusion: the practical version

The DDPM paper (Ho, Jain, Abbeel 2020) ran diffusion in pixel space. That works for small images, but for 512×512 or 1024×1024 it’s expensive — you’re running a U-Net over the full image at every step.

Latent diffusion (Rombach et al., CVPR 2022; this is the architecture behind Stable Diffusion) added one trick: train an autoencoder first that compresses images into a smaller latent space (e.g. 512×512 RGB becomes a 64×64×4 tensor), and then run diffusion in that latent space. The denoiser is smaller, the per-step compute is much lower, and a (lossy) autoencoder handles the high-frequency detail. The LDM paper frames this explicitly as a perceptual-quality vs. compression trade — the autoencoder is not lossless, just lossy in ways the diffusion stage doesn’t care about.
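The per-step compute saving is easy to eyeball from those shapes; the numbers below just restate them.

```python
# Rough per-step compute argument for latent diffusion, using the shapes quoted
# above (8x spatial downsampling and 4 latent channels, the SD 1.x/2.x setup).
pixel_space = 512 * 512 * 3       # what a pixel-space denoiser processes each step
latent_space = 64 * 64 * 4        # what the latent-space denoiser processes each step

print(pixel_space / latent_space) # 48.0x fewer values per denoising step
# Multiply by the dozens of denoising steps in a typical sampler and this factor
# is a large part of what makes high-resolution diffusion practical on one GPU.
```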

Almost every open-weights image diffusion model since (Stable Diffusion 1.x, 2.x, SDXL, plenty of the third-party fine-tunes) uses this idea. “Stable Diffusion” up to SDXL is approximately “latent diffusion + a text encoder fed in via cross-attention + a public release.” SD3 moved to flow-matching internals, so the lineage is no longer a single recipe.
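If you want to poke at this lineage directly, the Hugging Face diffusers library exposes the knobs this post keeps returning to, step count and guidance scale, as plain arguments. A hedged sketch (the checkpoint id is one public SD 2.1 release; it downloads weights and assumes a CUDA GPU):

```python
# Sampling from an open-weights latent-diffusion model with the two knobs that
# matter most in production: number of denoising steps and guidance strength.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a lighthouse at dusk, oil painting",
    num_inference_steps=30,   # more steps = more denoiser passes = more latency
    guidance_scale=7.5,       # classifier-free guidance strength
).images[0]
image.save("lighthouse.png")
```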

Why the field bet on this

The pivotal empirical moment was Dhariwal and Nichol, Diffusion Models Beat GANs on Image Synthesis (2021). Up to that point GANs were the state of the art on image quality. After that point, on ImageNet at 256×256 and 512×512, diffusion was both higher quality (better FID) and more stable to train. GANs are notoriously fiddly — mode collapse, training instability, hyperparameter sensitivity. The diffusion training loop is boring in comparison: one regression objective, no discriminator, no adversarial dynamics. That mattered a lot when scaling up.

So what diffusion gave the field, in plain terms: one stable regression objective instead of adversarial training, sample quality that beat the GANs of the day, and global coherence essentially for free, because every denoising step sees the whole image, with no ordering to pick and no early pixel to regret.

The cost is that you need many forward passes per sample. That’s a real, painful trade — and a huge amount of recent work (consistency models, rectified flow, distillation) is about pushing the step count down without tanking quality.

Where diffusion misleads you

Going deeper

What I’m confident about: the noise-prediction training objective, the high-level reasons autoregressive pixels lost (ordering, error compounding, sequence length), and the latent-diffusion compute argument. What I’m less confident about: the exact share of commercial frontier systems still using “classical” diffusion vs. flow matching vs. autoregressive image tokens — labs publish selectively, and the labels in marketing copy don’t always match the math underneath. Treat the cluster, not the individual attribution, as the load-bearing claim.