How does an LLM 'see' an image?
You paste a screenshot into ChatGPT and it reads the text, describes the scene, answers questions. But the model only ever predicts text tokens — so how does a picture get into it at all?
Why it exists
You drag a screenshot of an error message into ChatGPT and it reads the stack trace back to you. You photograph a fridge and ask what you can cook. You paste a chart and it pulls the trend out. From the outside it feels like the model looked at the picture.
But step back to what an LLM actually is: a machine that takes a sequence of tokens and predicts the next one. Its whole universe is a list of integer IDs drawn from a fixed vocabulary of words and word-pieces. A photo is none of those things — it’s a grid of millions of pixel brightnesses. So there’s a real puzzle here: how does something with no notion of “pixel” end up answering questions about a picture?
The trick is almost suspiciously simple. You don’t teach the transformer to see. You turn the image into the only thing the transformer understands — vectors in a sequence — and drop them into the context window right next to the word vectors. The model then does the exact same thing it always does: attention over a sequence, predict the next token. What looks like “seeing” is image-shaped tokens flowing through the very same machinery that handles text.
Why it matters now
Vision is no longer a bolt-on. The default models people reach for in 2026 — GPT-4o, Claude, Gemini, the open Qwen-VL and Llama families — are natively multimodal: the same model that writes your code also reads the screenshot of the bug. Three concrete places this shows up:
- Token cost balloons on images. Because an image becomes a block of tokens, a single high-resolution screenshot can cost more than a long paragraph of text — and you pay for it on every turn it stays in context. Knowing an image is tokens is how you predict your bill.
- The failure modes are specific and repeatable. Vision models miscount objects, fumble exact spatial relationships, and misread tiny text. Those aren’t random — they fall straight out of how the image gets chopped up, covered below.
- “Read this document” is now the same pipeline as “describe this photo.” Screenshot-to-answer, receipt scanning, UI agents that click around a screen — they all ride on the same image-into-tokens machinery, so its limits are their limits.
The short answer
image input = picture cut into patches → each patch becomes a vector → those vectors are fed to the transformer as tokens, right beside the words
A picture is split into a grid of small fixed-size squares called patches. Each patch is flattened and run through a small learned function that turns it into a vector — the same kind of vector a word gets from the embedding table. Those vectors are placed in the sequence alongside the text vectors, and from that point on the transformer cannot tell which vectors came from pixels and which came from words. It just runs attention over all of them and predicts text. That’s the whole idea; the rest is detail about how the patch becomes a good vector.
How it works
Why patches, and not pixels
The naïve idea — feed the model one token per pixel — dies on arithmetic. A modest 512×512 image is 262,144 pixels. Since attention cost grows with the square of the sequence length, a quarter-million-token image would be wildly out of reach. So instead the image is carved into a grid of patches — say 16×16 pixels each — and each patch becomes one token. Now that 512×512 image is a 32×32 grid: 1,024 tokens, not a quarter million. This is the core move from the 2020 paper that kicked off the whole approach, titled — literally — An Image Is Worth 16×16 Words.
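That arithmetic is easy to check in a few lines. This is only a sketch: `token_count` is a made-up helper name, it assumes the image divides evenly into patches, and it treats attention cost as simply the number of token pairs.

```python
def token_count(height: int, width: int, patch: int = 1) -> int:
    """Number of tokens when each patch x patch square becomes one token."""
    if height % patch or width % patch:
        raise ValueError("this sketch assumes the image divides evenly into patches")
    return (height // patch) * (width // patch)


per_pixel = token_count(512, 512)             # one token per pixel
per_patch = token_count(512, 512, patch=16)   # one token per 16x16 patch

print(per_pixel)   # 262144
print(per_patch)   # 1024

# Attention compares every token with every other token, so its cost
# scales with the square of the sequence length.
pair_ratio = per_pixel**2 // per_patch**2
print(pair_ratio)  # 65536
```

Patching shrinks the sequence 256-fold, and because the cost is quadratic, the attention work shrinks by 256 squared.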
Each patch is flattened from its little square of pixel values into a long list of numbers (a 16×16 RGB patch is 16 × 16 × 3 = 768 values), then multiplied by a learned matrix that projects it to the model’s vector width. That’s the patch embedding: the visual analogue of looking a word up in the embedding table. One extra ingredient is added, a positional encoding, because a bare set of patch vectors has lost all sense of where each patch sat, and “the cat is above the dog” depends entirely on that.
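Here is roughly what that looks like in NumPy. Treat it as a minimal sketch under loud assumptions: `patchify`, `W_embed`, `pos_embed` and the tiny sizes are all invented for illustration, the weights are random rather than trained, and a learned per-position vector is only one of several ways real models encode position.

```python
import numpy as np

rng = np.random.default_rng(0)

PATCH = 16      # patch side length in pixels
D_MODEL = 64    # illustrative vector width, far smaller than a real model's


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened patches of shape (N, patch*patch*C)."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "sketch assumes an exact grid"
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    grid = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * cols, patch * patch * c)


image = rng.random((64, 64, 3))     # stand-in for a real RGB image
patches = patchify(image, PATCH)    # 4 x 4 grid of patches, 768 numbers each

# Randomly initialised here; in a trained model both of these are learned.
W_embed = rng.normal(size=(patches.shape[1], D_MODEL)) * 0.02
pos_embed = rng.normal(size=(patches.shape[0], D_MODEL)) * 0.02

# Project each flattened patch, then add a vector that marks its position.
patch_embeddings = patches @ W_embed + pos_embed

print(patches.shape, patch_embeddings.shape)   # (16, 768) (16, 64)
```

The matrix multiply plays the role of the embedding-table lookup, and `pos_embed` is what lets later layers tell the top-left patch from the bottom-right one.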
The pipeline, step by step:
- To the computer the image is just a grid of pixel brightnesses: no words, no objects, no “cat.”
- Carve the grid into fixed-size squares, e.g. 16×16 pixels. Each square is one “visual word.”
- A vision encoder mixes the patches together; a projector maps them into the LLM’s vector space. Out come image tokens.
- The image tokens sit in the context right beside the words of your question. The transformer attends over all of them and predicts text, same loop as always.
From patches to “image tokens”
Patch embeddings on their own are raw. Before the language model sees them, two things usually happen.
First, a vision encoder (almost always a ViT, a transformer that runs attention over the patches) lets every patch look at every other patch. A patch that’s part of an eye gets context from the patches that form the rest of the face. The output is one context-aware vector per patch. Many systems use an encoder pretrained with CLIP, which trains on image-caption pairs, so its vectors already align with language.
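To make "every patch looks at every other patch" concrete, here is a single-head self-attention step over a set of patch vectors. It is a sketch, not a ViT: a real encoder stacks many multi-head layers with MLPs, residual connections and normalisation, and the names and sizes below (`self_attention`, `w_q`, `w_k`, `w_v`, 16 patches of width 64) are illustrative assumptions with random weights.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over N patch vectors of width d."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (N, N): every patch scored against every patch
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights


n_patches, d = 16, 64
x = rng.normal(size=(n_patches, d))           # stand-in patch embeddings
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

contextual, weights = self_attention(x, w_q, w_k, w_v)
print(contextual.shape, weights.shape)        # (16, 64) (16, 16)
```

Row `i` of `weights` says how much patch `i` draws from each other patch, which is how the "eye" patch picks up context from the rest of the face.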
Second, a projector (often called a connector or adapter) maps those vision vectors into the language model’s own embedding space, producing the vectors I’ve been calling image tokens. The honest summary is that there isn’t one standard projector — this is the part that differs most across models:
- A plain projection. The simplest approach applies a small network to each patch vector independently. The original LLaVA used a single learned linear projection from the vision features into the LLM’s embedding space (later LLaVA versions swapped in a two-layer MLP). Either way it’s sequence-preserving: one visual feature in maps to one image token out.
- A resampler / cross-attention. Some designs (Flamingo’s perceiver resampler, BLIP-2’s Q-Former) use a fixed, small set of learned query vectors that attend to the patches and pull out a fixed number of image tokens, regardless of how many patches went in. This caps the token count.
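The practical difference between the two families is what happens to the token count, which a toy version makes visible. Everything here is an assumption for illustration: the function names, the widths, and the random weights. The `resample` function is one bare cross-attention step, much simpler than Flamingo's perceiver resampler or BLIP-2's Q-Former.

```python
import numpy as np

rng = np.random.default_rng(0)

D_VISION, D_LLM = 32, 48   # illustrative widths, not taken from any real model


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def linear_projector(features, w, b):
    """Sequence-preserving: N vision vectors in, N image tokens out."""
    return features @ w + b


def resample(features, queries, w_out):
    """Fixed-size: K learned queries cross-attend to N features, giving K image tokens."""
    scores = queries @ features.T / np.sqrt(features.shape[-1])   # (K, N)
    pooled = softmax(scores, axis=-1) @ features                  # (K, D_VISION)
    return pooled @ w_out                                         # (K, D_LLM)


w = rng.normal(size=(D_VISION, D_LLM)) * 0.02
b = np.zeros(D_LLM)
queries = rng.normal(size=(8, D_VISION))   # K = 8 queries (random here, learned in practice)

for n_patches in (64, 256):
    feats = rng.normal(size=(n_patches, D_VISION))
    print(n_patches, linear_projector(feats, w, b).shape, resample(feats, queries, w).shape)
# 64 (64, 48) (8, 48)
# 256 (256, 48) (8, 48)
```

The linear projector's output grows with the number of patches; the resampler always hands the LLM eight tokens, trading some detail for a predictable cost.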
Either way, the result is anywhere from a few dozen to several hundred vectors sitting in the LLM’s vector space. In most designs they get concatenated with the embedded text tokens, and the combined sequence flows through the transformer (Flamingo is the exception: it feeds the resampled vectors in through extra cross-attention layers instead). From here it is exactly the decode loop you already know: attention mixes information across the whole sequence, letting a word like “What” pull from the image tokens, and the model predicts text one token at a time.
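For the concatenating designs, the join itself is unglamorous. A sketch, with invented sizes, made-up token IDs, and random stand-ins for the learned embedding table and the projector output:

```python
import numpy as np

rng = np.random.default_rng(0)

D_LLM = 48     # illustrative embedding width
VOCAB = 1000   # illustrative vocabulary size

embedding_table = rng.normal(size=(VOCAB, D_LLM))   # learned in a real model

image_tokens = rng.normal(size=(8, D_LLM))     # stand-in for the projector's output
prompt_ids = np.array([17, 256, 3, 941])       # made-up token IDs for the question

text_vectors = embedding_table[prompt_ids]     # ordinary embedding lookup, shape (4, 48)

# One sequence: image tokens first, then the words of the question.
sequence = np.concatenate([image_tokens, text_vectors], axis=0)
print(sequence.shape)   # (12, 48)
```

After this point nothing in `sequence` records which rows came from pixels. Real systems typically mark the image's position with special placeholder tokens and may interleave several images with text, but the principle is the same.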
A caveat worth stating plainly: the exact architecture inside closed models like GPT-4o or Claude isn’t public. The patch → encoder → projector → shared sequence shape above is the well-documented open-model recipe (LLaVA, Qwen-VL, Flamingo, BLIP-2) and the standard account of how these systems work; treat the specific connector as “one of these families,” not a claim about any one vendor’s internals.
Why the failure modes look the way they do
Once you see the pipeline, the quirks stop being mysterious:
- Tiny text and fine detail get lost. Detail smaller than a patch can be smeared into a single vector. If a patch is 16 pixels wide and a line of text is 6 pixels tall, that text barely survives the projection. This is why models miss small captions, footnotes, or watermarks.
- High-resolution images are tiled. To read detail, systems often cut a big image into several tiles and run each through the encoder, then stitch the tokens together (LLaVA’s high-res variants and the “high detail” mode in OpenAI’s vision API both do versions of this). More tiles means more tokens means more cost — which is exactly why a detailed screenshot can cost hundreds to over a thousand tokens. The precise token-counting formula is in each provider’s docs; the mechanism is tiling.
- Counting and exact layout are hard. “How many people are in this photo?” asks the model to integrate information across many patches and keep a running count, which next-token prediction over patch vectors isn’t reliably good at. It is the same family of weakness that makes LLMs bad at counting the letters in a word.
- There’s often no separate OCR step. In the open vision-language architectures above, reading text from an image emerges from the same patch-attention machinery, not a dedicated character-recognition module — which is why a model’s reading is fluent but occasionally hallucinates a character. (Some products do bolt a real OCR engine on top for documents, and the internals of closed models aren’t public, so this is a claim about the base VLM recipe, not every system you might use.)
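The tiling point is worth a quick back-of-envelope model, because it explains why cost jumps with resolution. This is deliberately not any provider's formula: `tiled_token_count` and all three default numbers are hypothetical, and real systems usually resize the image before tiling.

```python
import math


def tiled_token_count(width: int, height: int, tile: int = 512,
                      tokens_per_tile: int = 256, overview_tokens: int = 256) -> int:
    """Rough token count: one low-res overview plus one block of tokens per tile.

    All defaults are made-up illustrative values, not a real pricing rule.
    """
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return overview_tokens + tiles * tokens_per_tile


print(tiled_token_count(512, 512))    # 512   (1 tile)
print(tiled_token_count(1024, 768))   # 1280  (2 x 2 = 4 tiles)
```

Doubling the width and height roughly quadruples the tile count, so the token bill grows with image area rather than with how much you care about the picture.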
Famous related terms
- Multimodal model: multimodal = one model that takes more than one input type (text + images) in the same sequence. The umbrella term for everything in this post.
- ViT (Vision Transformer): ViT = transformer + self-attention over image patches instead of words. The encoder that turns patches into context-aware vectors.
- Patch embedding: patch embedding = flatten a square of pixels + project it into a vector. The image analogue of a word embedding.
- CLIP: CLIP = image encoder + text encoder trained so matching pairs land near each other. The common pretraining recipe that makes vision vectors “speak language.”
- Projector / connector: projector = small network mapping vision vectors into the LLM's embedding space. The seam where the two modalities are stitched together.
- VLM (Vision-Language Model): VLM = vision encoder + projector + LLM. The full stack this post describes.
Going deeper
- An Image Is Worth 16×16 Words (Dosovitskiy et al., 2020) — the primary source for patchifying an image and treating each patch as a token; read it for where “patch = visual word” comes from.
- Visual Instruction Tuning / LLaVA (Liu et al., 2023) — the clearest worked example of the projector idea: an open model that connects a CLIP vision encoder to an LLM with just a small trained projection.
- Learning Transferable Visual Models From Natural Language Supervision / CLIP (Radford et al., 2021) — the rabbit hole on why a vision encoder’s output can align with language at all, which is what lets the projector’s job be small.