Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why image generation went diffusion, not autoregressive

LLMs are autoregressive: predict the next token. Image models could have been the same — predict the next pixel. Almost none of the dominant ones are. Here's why the field walked away from that approach.

AI & ML · intermediate · Apr 29, 2026

Why it exists

If you came to generative AI through LLMs, the architecture story sounds settled. You take a giant transformer, you train it to predict the next token in a sequence, you sample from it. Text, code, JSON, anything that can be tokenized — same recipe.

So the obvious question, the first time you look at how Stable Diffusion or Midjourney or DALL·E or Sora actually work, is: why isn’t this the same trick? Pixels are just numbers. You could flatten an image into a sequence of pixel values and predict the next one. Some early models did exactly that — PixelRNN and PixelCNN in 2016, Image GPT in 2020. They worked. They just didn’t win.

For most of the last five years the dominant open-weights and publicly documented image/video systems — DDPM, Stable Diffusion (1.x/2.x/SDXL), Imagen, and Sora on the video side — have been diffusion models (or close cousins like flow matching). Different machine, different objective, different sampling procedure. For closed systems like Midjourney, Runway, and Veo the exact internals aren’t public, so treat “diffusion won” as a claim about the cluster of papers and open weights, not a per-product attribution.

This post is about why. The short version is that “predict the next pixel” is a real architecture, but it solves a problem images don’t have (a natural ordering) while ignoring problems they do have (every pixel matters at once, and tiny independent errors compound). Diffusion gave up on sequence prediction entirely and replaced it with something that fits the geometry of images much better: start from noise, denoise toward an image, in many small steps.

Engineers integrating image or video generation into products keep running into the consequences — cost per image, latency, why a single sampling step looks blurry, why guidance scale matters, why there’s no “streaming first pixels” the way there’s streaming first tokens. Those are all downstream of this choice.

Why it matters now

Diffusion is the production form factor for visual generation. If you’re shipping anything visual, the cost model, the failure modes, and the controllability primitives all come from this design choice.

The short answer

diffusion model = a denoiser + a fixed noise schedule + sample by reversing the schedule from pure noise

You train one neural network to do one job: given a noisy image and a number telling it how noisy, predict the noise (or equivalently, a slightly-cleaner version). To generate a new image, you start with pure random noise and apply that denoiser many times in sequence, each step removing a little more noise, until what’s left is an image. The “generative model” is the entire reverse-noising trajectory, not a single forward pass.

That’s it. The rest of the post is why this beat autoregressive pixels for images, and where the seams are.

How it works

To see why diffusion fits images, it helps to first see what’s wrong with the LLM-style approach when you point it at pixels.

The problem with “predict the next pixel”

An autoregressive image model has to choose an order. Top-left to bottom-right? Hilbert curve? Coarse-to-fine over patches? Whatever you pick, you’re now claiming that pixel (i, j) only depends on pixels that came earlier in your ordering. That’s not how images work. Pixel (100, 100) depends on pixel (101, 101) just as much as the other way around. The autoregressive factorization picks a side anyway, because it has to.

This bites in three ways:

  1. Long-range coherence is hard. By the time the model is choosing a pixel near the bottom of a face, it has already committed to the pixels near the top — eyes, hairline. If the bottom doesn’t match (chin shape, lighting), the model can’t go back. LLMs have the same problem in principle, but text is roughly causally ordered (we read left-to-right, the next word does mostly depend on prior words). Pixels aren’t.
  2. Errors compound multiplicatively. Every sampled pixel conditions on every previous sampled pixel. A small mistake early — a slightly-off skin tone — gets baked into the conditional distribution for everything after. With millions of pixels per image, the joint distribution drifts.
  3. Sequence length is brutal. A 512×512 RGB image is ~786,000 “tokens” if you go pixel-by-pixel. Image tokenizers cut this down sharply — Parti, for example, uses a 32×32 = 1,024-token grid for a 256×256 image — but the per-token cost of an autoregressive transformer plus quadratic attention still makes naive AR image generation expensive enough that early pixel-RNN/CNN models could only ever produce small images.

You can fix some of this — Image GPT coped by shrinking images to tiny resolutions and quantizing colors with a learned k-means palette, and modern token-based image models (Parti, plus the autoregressive part of GPT-4o image generation) use far more capable learned tokenizers that shrink the spatial grid itself. There are also non-autoregressive token-based generators in the same family — MaskGIT (Chang et al., CVPR 2022) deliberately isn’t raster-scan AR; it iteratively unmasks tokens in parallel, which is closer in spirit to diffusion than to next-token prediction. The “predict the next image token in raster order” branch is the one that lost, not “all token-based image models.”
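To make the sequence-length point concrete, here is the back-of-envelope arithmetic. It is pure Python with no model involved; the only assumption beyond the numbers quoted above is that attention cost scales with the square of sequence length.

```python
# Back-of-envelope arithmetic for the sequence-length point above. Pure Python,
# no model involved; the only extra assumption is quadratic attention cost.
h, w, c = 512, 512, 3

pixels_as_tokens = h * w * c       # raster-scan, one token per channel value
print(pixels_as_tokens)            # 786_432

# Parti-style learned tokenizer: a 256x256 image becomes a 32x32 token grid.
token_grid = 32 * 32
print(token_grid)                  # 1_024

# Self-attention cost grows with sequence length squared, so the gap explodes:
print((pixels_as_tokens / token_grid) ** 2)   # 589_824x more attention work
```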

What diffusion actually does

Diffusion flips the problem. Instead of generating an image one piece at a time, it generates the whole image at every step, but starts with one that’s almost entirely noise and gradually removes the noise.

The training story is small enough to hold in your head:

  1. Take a real image x.
  2. Pick a random “timestep” t between 0 (clean) and T (pure noise).
  3. Add Gaussian noise to x according to a fixed schedule that knows how much noise corresponds to step t. Call the result x_t.
  4. Show the network x_t and t. Ask it to predict the noise that was added (this is the DDPM noise-prediction objective from Ho, Jain, Abbeel 2020).
  5. Loss = mean squared error between predicted and actual noise.
  6. Repeat over millions of (image, timestep) pairs.

That’s the entire training objective. No adversarial loss, no likelihood-by-pixel-ordering, no discriminator. One regression problem.
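Here is roughly what that loop looks like in code. This is a minimal sketch, not a recipe from any particular paper or codebase: the linear beta schedule, the tiny MLP denoiser, and the crude timestep input are placeholders standing in for a real U-Net or DiT.

```python
# A minimal sketch of the DDPM-style training loop described above, assuming a
# toy stand-in denoiser. The schedule, MLP, and timestep encoding are
# illustrative placeholders, not any production model's internals.
import torch
import torch.nn as nn

T = 1000                                     # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)        # fixed noise schedule (step 3)
alphas_bar = torch.cumprod(1.0 - betas, 0)   # cumulative signal fraction per step

# Stand-in denoiser over flattened 3x32x32 images; real models use a U-Net or DiT.
denoiser = nn.Sequential(nn.Linear(3 * 32 * 32 + 1, 256), nn.ReLU(),
                         nn.Linear(256, 3 * 32 * 32))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

def training_step(x0):                       # x0: a batch of clean images (step 1)
    B = x0.shape[0]
    t = torch.randint(0, T, (B,))                        # random timestep per image (step 2)
    a = alphas_bar[t].view(B, 1, 1, 1)
    eps = torch.randn_like(x0)                           # the noise we will ask it to find
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps           # noised image at step t (step 3)
    inp = torch.cat([x_t.flatten(1), t.float().view(B, 1) / T], dim=1)
    eps_pred = denoiser(inp).view_as(x0)                 # predict the added noise (step 4)
    loss = nn.functional.mse_loss(eps_pred, eps)         # plain MSE regression (step 5)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

loss = training_step(torch.randn(8, 3, 32, 32))          # dummy batch, just to run the step
```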

To generate, you reverse the schedule:

  1. Sample pure noise x_T.
  2. For t = T, T-1, ..., 1: ask the network “what noise is in x_t at step t?” Subtract a fraction of it. You now have x_{t-1}, slightly less noisy.
  3. After T steps you’ve reached x_0, an image.
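A matching sketch of the reverse loop, continuing the training example above (it reuses denoiser, betas, alphas_bar, and T). This is plain DDPM ancestral sampling; production systems usually swap in faster samplers, but the shape of the loop is the same.

```python
# The reverse loop, continuing the training sketch above: it reuses `denoiser`,
# `betas`, `alphas_bar`, and `T`. Plain DDPM ancestral sampling; real pipelines
# usually swap in faster samplers (DDIM, DPM-Solver) with far fewer steps.
import torch

alphas = 1.0 - betas

@torch.no_grad()
def sample(shape=(1, 3, 32, 32)):
    x = torch.randn(shape)                               # step 1: pure noise x_T
    for t in reversed(range(T)):                         # step 2: walk the schedule backwards
        inp = torch.cat([x.flatten(1),
                         torch.full((shape[0], 1), t / T)], dim=1)
        eps_pred = denoiser(inp).view(shape)             # "what noise is in x_t?"
        # Remove this step's share of the predicted noise (the DDPM posterior mean).
        x = (x - betas[t] / (1 - alphas_bar[t]).sqrt() * eps_pred) / alphas[t].sqrt()
        if t > 0:                                        # add back a little fresh noise,
            x = x + betas[t].sqrt() * torch.randn(shape) # except on the final step
    return x                                             # step 3: x_0, a (toy) image

img = sample()
```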

A few things fall out of this that are worth pausing on:

  1. Every step touches the whole image, so there is no pixel ordering to choose and no early mistake that later pixels are forced to condition on.
  2. There is also nothing like streaming the first pixels: intermediate states are just noisier versions of the entire image, which is why image APIs don't stream the way text APIs do.
  3. The cost of a sample scales with the number of denoising steps, which is where the cost-per-image and latency questions from the intro come from.

Latent diffusion: the practical version

The DDPM paper (Ho, Jain, Abbeel 2020) ran diffusion in pixel space. That works for small images, but for 512×512 or 1024×1024 it’s expensive — you’re running a U-Net over the full image at every step.

Latent diffusion (Rombach et al., CVPR 2022; this is the architecture behind Stable Diffusion) added one trick: train an autoencoder first that compresses images into a smaller latent space (e.g. 512×512 RGB becomes a 64×64×4 tensor), and then run diffusion in that latent space. The denoiser is smaller, the per-step compute is much lower, and a (lossy) autoencoder handles the high-frequency detail. The LDM paper frames this explicitly as a perceptual-quality vs. compression trade — the autoencoder is not lossless, just lossy in ways the diffusion stage doesn’t care about.
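The per-step compute saving is easy to eyeball from those shapes; the numbers below just restate them.

```python
# Rough per-step compute argument for latent diffusion, using the shapes quoted
# above (8x spatial downsampling and 4 latent channels, the SD 1.x/2.x setup).
pixel_space = 512 * 512 * 3       # what a pixel-space denoiser processes each step
latent_space = 64 * 64 * 4        # what the latent-space denoiser processes each step

print(pixel_space / latent_space) # 48.0x fewer values per denoising step
# Multiply by the dozens of denoising steps in a typical sampler and this factor
# is a large part of what makes high-resolution diffusion practical on one GPU.
```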

Almost every open-weights image diffusion model since (Stable Diffusion 1.x, 2.x, SDXL, plenty of the third-party fine-tunes) uses this idea. “Stable Diffusion” up to SDXL is approximately “latent diffusion + a text encoder fed in via cross-attention + a public release.” SD3 moved to flow-matching internals, so the lineage is no longer a single recipe.
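If you want to poke at this lineage directly, the Hugging Face diffusers library exposes the knobs this post keeps returning to, step count and guidance scale, as plain arguments. A hedged sketch (the checkpoint id is one public SD 2.1 release; it downloads weights and assumes a CUDA GPU):

```python
# Sampling from an open-weights latent-diffusion model with the two knobs that
# matter most in production: number of denoising steps and guidance strength.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a lighthouse at dusk, oil painting",
    num_inference_steps=30,   # more steps = more denoiser passes = more latency
    guidance_scale=7.5,       # classifier-free guidance strength
).images[0]
image.save("lighthouse.png")
```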

Why the field bet on this

The pivotal empirical moment was Dhariwal and Nichol, Diffusion Models Beat GANs on Image Synthesis (2021). Up to that point GANs were the state of the art on image quality. After that point, on ImageNet at 256×256 and 512×512, diffusion was both higher quality (better FID) and more stable to train. GANs are notoriously fiddly — mode collapse, training instability, hyperparameter sensitivity. The diffusion training loop is boring in comparison: one regression objective, no discriminator, no adversarial dynamics. That mattered a lot when scaling up.

So what diffusion gave the field, in plain terms: one stable regression objective instead of adversarial training, sample quality that beat the GANs of the day, and global coherence essentially for free, because every denoising step sees the whole image, with no ordering to pick and no early pixel to regret.

The cost is that you need many forward passes per sample. That’s a real, painful trade — and a huge amount of recent work (consistency models, rectified flow, distillation) is about pushing the step count down without tanking quality.

Where diffusion misleads you

Going deeper

What I’m confident about: the noise-prediction training objective, the high-level reasons autoregressive pixels lost (ordering, error compounding, sequence length), and the latent-diffusion compute argument. What I’m less confident about: the exact share of commercial frontier systems still using “classical” diffusion vs. flow matching vs. autoregressive image tokens — labs publish selectively, and the labels in marketing copy don’t always match the math underneath. Treat the cluster, not the individual attribution, as the load-bearing claim.