Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

How does an LLM 'see' an image?

You paste a screenshot into ChatGPT and it reads the text, describes the scene, answers questions. But the model only ever predicts text tokens — so how does a picture get into it at all?

AI & ML intermediate May 20, 2026

Why it exists

You drag a screenshot of an error message into ChatGPT and it reads the stack trace back to you. You photograph a fridge and ask what you can cook. You paste a chart and it pulls the trend out. From the outside it feels like the model looked at the picture.

But step back to what an LLM actually is: a machine that takes a sequence of tokens and predicts the next one. Its whole universe is a list of integer IDs drawn from a fixed vocabulary of words and word-pieces. A photo is none of those things — it’s a grid of millions of pixel brightnesses. So there’s a real puzzle here: how does something with no notion of “pixel” end up answering questions about a picture?

The trick is almost suspiciously simple. You don’t teach the transformer to see. You turn the image into the only thing the transformer understands — vectors in a sequence — and drop them into the context window right next to the word vectors. The model then does the exact same thing it always does: attention over a sequence, predict the next token. What looks like “seeing” is image-shaped tokens flowing through the very same machinery that handles text.

Why it matters now

Vision is no longer a bolt-on. The default models people reach for in 2026 — GPT-4o, Claude, Gemini, the open Qwen-VL and Llama families — are natively multimodal: the same model that writes your code also reads the screenshot of the bug. Three concrete places this shows up:

The short answer

image input = picture cut into patches → each patch becomes a vector → those vectors are fed to the transformer as tokens, right beside the words

A picture is split into a grid of small fixed-size squares called patches. Each patch is flattened and run through a small learned function that turns it into a vector, the same kind of vector a word gets from the embedding table. Those vectors are placed in the sequence alongside the text vectors, and from that point on the transformer treats them identically: nothing in its attention layers distinguishes vectors that came from pixels from vectors that came from words. It just runs attention over all of them and predicts text. That’s the whole idea; the rest is detail about how the patch becomes a good vector.

How it works

Why patches, and not pixels

The naïve idea, feeding the model one token per pixel, dies on arithmetic. A modest 512×512 image is 262,144 pixels. Attention cost grows with the square of the sequence length, so a quarter-million-token image would mean roughly 69 billion pairwise comparisons in every attention layer, for one small image. So instead the image is carved into a grid of patches, say 16×16 pixels each, and each patch becomes one token. Now that 512×512 image is a 32×32 grid: 1,024 tokens, not a quarter million. This is the core move from the 2020 paper that kicked off the whole approach, titled, literally, An Image Is Worth 16×16 Words.
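The arithmetic above can be checked in a few lines. This is a minimal sketch: the helper names `patch_token_count` and `attention_pairs` are made up for illustration, and "pairs" here simply means sequence length squared, ignoring heads, layers, and any attention optimizations a real system might use.

```python
def patch_token_count(height, width, patch):
    """Number of patch tokens for an image whose sides divide evenly by `patch`."""
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return (height // patch) * (width // patch)


def attention_pairs(seq_len):
    """Full self-attention compares every token with every token."""
    return seq_len * seq_len


pixel_tokens = 512 * 512                         # one token per pixel
patch_tokens = patch_token_count(512, 512, 16)   # one token per 16x16 patch

print(pixel_tokens, attention_pairs(pixel_tokens))   # 262144 68719476736
print(patch_tokens, attention_pairs(patch_tokens))   # 1024 1048576
print(attention_pairs(pixel_tokens) // attention_pairs(patch_tokens))  # 65536
```

Cutting the token count by 256× cuts the pairwise attention work by 256², which is where most of the saving comes from.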

Each patch is flattened from its little square of pixel values into a long list of numbers (a 16×16 RGB patch gives 16 × 16 × 3 = 768 of them), then multiplied by a learned matrix that projects it to the model’s vector width. That’s the patch embedding: the visual analogue of looking a word up in the embedding table. One extra ingredient is added, a positional encoding, because a bare set of patch vectors has lost all sense of where each patch sat, and “the cat is above the dog” depends entirely on that.
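Here is a toy version of flatten, project, and add position, in plain Python so the indexing is visible. It is a sketch under loud assumptions: the image is a 4×4 grayscale grid rather than a real RGB photo, the patch size is 2, and `weight` and `position` are random stand-ins for parameters a real model would learn. The function names are hypothetical.

```python
import random


def patchify(image, patch):
    """Cut an H x W grid (list of rows) into flattened patches, row-major order."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[top + dy][left + dx]
                            for dy in range(patch) for dx in range(patch)])
    return patches


def embed_patches(patches, weight, position):
    """Project each flattened patch with `weight`, then add its position vector."""
    dim = len(weight[0])
    embedded = []
    for index, flat in enumerate(patches):
        vector = [sum(value * weight[k][j] for k, value in enumerate(flat))
                  + position[index][j]
                  for j in range(dim)]
        embedded.append(vector)
    return embedded


rng = random.Random(0)
PATCH, DIM = 2, 3   # toy sizes, chosen only to keep the example readable
image = [[row * 4 + col for col in range(4)] for row in range(4)]

patches = patchify(image, PATCH)   # 4 patches, 4 values each
# Assumption: random numbers stand in for the learned matrix and position table.
weight = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(PATCH * PATCH)]
position = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(len(patches))]

tokens = embed_patches(patches, weight, position)
print(patches[0])                    # [0, 1, 4, 5]
print(len(tokens), len(tokens[0]))   # 4 3
```

Without the `position` term, shuffling the patches would only shuffle the output vectors, so nothing downstream could recover where each patch sat.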

A photo

To the computer it’s just a grid of pixel brightnesses — no words, no objects, no “cat.”

Cut into patches

Carve the grid into fixed-size squares — e.g. 16×16 pixels. Each square is one “visual word.”

Patch → vector (+ position)

Flatten each patch and project it into a vector — like looking a word up in the embedding table. A position signal records where it sat.

Encode & project → image tokens

A vision encoder mixes the patches together; a projector maps them into the LLM’s vector space. Out come image tokens.

One sequence: image tokens + words → answer

The image tokens sit in the context right beside the words of your question. The transformer attends over all of them and predicts text — same loop as always.

Example sequence: IMG IMG What is this? → A tabby cat on a sofa.
An image is cut into patches; each patch is projected into a vector; a vision encoder and a projector turn those into image tokens in the LLM’s embedding space; the image tokens join the text tokens in one sequence and the model predicts an answer. Grid sizes and counts are illustrative.

From patches to “image tokens”

Patch embeddings on their own are raw. Before the language model sees them, two things usually happen.

First, a vision encoder (almost always a ViT, a transformer that runs attention over the patches) lets every patch look at every other patch. A patch that’s part of an eye gets context from the patches that form the rest of the face. The output is one context-aware vector per patch. Many systems use an encoder pretrained with CLIP, which is trained to match images with their captions, so its vectors already align with language.
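To make "every patch looks at every other patch" concrete, here is a single, heavily simplified attention pass over three toy patch vectors. The assumptions are significant: queries, keys, and values are the raw vectors themselves, whereas a real ViT layer applies separate learned projections, multiple heads, residual connections, and an MLP. The names are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Replace each vector with a similarity-weighted mix of all the vectors.

    Simplification: no learned query/key/value projections and a single head.
    """
    scale = math.sqrt(len(vectors[0]))
    mixed = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in vectors]
        weights = softmax(scores)
        mixed.append([sum(w * v[j] for w, v in zip(weights, vectors))
                      for j in range(len(query))])
    return mixed


# Toy patch embeddings: the first two are similar, the third is different.
patch_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
contextual = self_attention(patch_vectors)

print(len(contextual), len(contextual[0]))   # 3 2
```

Each output row is a weighted average of all the inputs, so it carries some information from every patch while leaning toward the patches most similar to it.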

Second, a projector (often called a connector or adapter) maps those vision vectors into the language model’s own embedding space, producing the vectors I’ve been calling image tokens. The honest summary is that there isn’t one standard projector — this is the part that differs most across models:

Either way, the result is a set of vectors sitting in the LLM’s vector space: anywhere from a few dozen to a few thousand, depending on the connector and the image resolution. They get concatenated with the embedded text tokens, and the combined sequence flows through the transformer. From here it is exactly the decode loop you already know: attention mixes information across the whole sequence, letting a word like “What” pull from the image tokens, and the model predicts text one token at a time.
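The projection and concatenation step can be sketched as follows. This assumes the simplest connector family, a single linear layer; the weights, dimensions, and text embeddings are invented toy values, and `project` is a hypothetical helper, not any particular model's API.

```python
def project(vectors, weight):
    """Map each vision vector into the LLM width with one linear layer."""
    out_dim = len(weight[0])
    return [[sum(v[k] * weight[k][j] for k in range(len(v)))
             for j in range(out_dim)]
            for v in vectors]


# Toy widths: the vision encoder emits 2-d vectors, the LLM expects 4-d.
vision_out = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # one vector per patch
projector = [[1.0, 0.0, 2.0, 0.0],                  # hypothetical weights,
             [0.0, 1.0, 0.0, 2.0]]                  # learned in a real model

# Stand-ins for the embedded text tokens of the question "What is this?"
text_embeddings = [[0.1] * 4, [0.2] * 4, [0.3] * 4]

image_tokens = project(vision_out, projector)
sequence = image_tokens + text_embeddings   # one sequence for the transformer

print(image_tokens[2])                    # [0.5, 0.5, 1.0, 1.0]
print(len(sequence), len(sequence[0]))    # 6 4
```

Once the two lists are joined, every row has the same width, which is what lets the same attention layers run over image and text positions alike.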

A caveat worth stating plainly: the exact architecture inside closed models like GPT-4o or Claude isn’t public. The patch → encoder → projector → shared sequence shape above is the well-documented open-model recipe (LLaVA, Qwen-VL, BLIP-2) and the standard account of how these systems work. Flamingo is a close relative that feeds its image vectors to the language model through added cross-attention layers rather than placing them in the input sequence. Treat the specific connector as “one of these families,” not a claim about any one vendor’s internals.

Why the failure modes look the way they do

Once you see the pipeline, the quirks stop being mysterious:

Going deeper