Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

What is a transformer?

The neural network architecture behind every modern LLM, image model, and protein folder — and the one big idea that made it work: drop recurrence, let every token look at every other token directly.

AI & ML · intro · Apr 30, 2026

Why it exists

Before transformers, the default way to handle a sequence — a sentence, a time series, an audio clip — was an RNN or its better-behaved cousin, the LSTM. You fed the model one token, it updated a hidden state, you fed it the next token, it updated again. Sentence as conveyor belt.

That design had two problems that compounded each other.

The first was parallelism. Step t needs the hidden state from t−1, which needs t−2, and so on. You couldn’t really use a GPU the way GPUs want to be used — thousands of cores at once — because the work was a chain. No amount of hardware fixed it.
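To make the chain concrete, here's a toy numpy sketch of the recurrence (sizes and weights are made up for illustration): the loop body for step t cannot start until step t−1 has finished, no matter how many cores you have.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # toy hidden size, purely illustrative
W_h = rng.normal(0, 0.1, (d, d))       # hidden-to-hidden weights
W_x = rng.normal(0, 0.1, (d, d))       # input-to-hidden weights
tokens = rng.normal(size=(5, d))       # a 5-token sequence of input vectors

h = np.zeros(d)
for x in tokens:                       # step t waits on step t-1: a serial chain
    h = np.tanh(W_h @ h + W_x @ x)     # hidden state squeezed through every step
```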

The second was long-range dependencies. By the time information from the start of a passage reached the end, it had been squeezed through hundreds of tiny update steps and mostly washed out. LSTMs helped but didn’t solve it. Connecting “the trophy” at the start of a paragraph to “it” at the end meant fighting the architecture.

The 2017 paper Attention Is All You Need (Vaswani et al.) made a sharper move than people expected: throw recurrence out entirely. Instead of carrying state forward step by step, let every token look directly at every other token, in parallel, in a single operation called self-attention. No conveyor belt. No washing-out. Embarrassingly parallel on a GPU.
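Here's what that swap looks like as a minimal numpy sketch: single head, toy sizes, no masking, all of which are simplifications of what real models do. The key point is that every pairwise interaction comes out of one matrix multiply, with no loop over time steps.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention: every token scores every other token at once."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # all pairwise scores, one matmul
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)                 # softmax: each row sums to 1
    return w @ V                                     # each output mixes every token

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                          # 5 tokens, toy dimension 8
W_q, W_k, W_v = (rng.normal(0, 0.1, (8, 8)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)               # shape (5, 8), no recurrence
```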

That is the whole reason the transformer exists. The rest is engineering around that one swap.

Why it matters now

Once people had an architecture that scaled cleanly with compute, scale itself became the lever — and the transformer turned out to be ridiculously general. The same block design sits underneath every modern LLM (GPT, Claude, Gemini, Llama), most strong vision models (ViT and successors), speech recognition (Whisper), and protein structure prediction (AlphaFold's core blocks are attention-based, even if the full system is more than a vanilla transformer). If you're trying to understand any current AI system below the API layer, "transformer" is almost certainly the shape you're looking at.

The short answer

transformer = stack of (self-attention + feed-forward) blocks, with positional info baked into the input

A transformer turns a sequence of tokens into a sequence of contextual vectors by repeatedly mixing information across positions (self-attention) and then transforming each position individually (feed-forward). Stack that block enough times, train at scale, and you get the substrate behind nearly every modern AI system.

How it works

Four pieces, in order:

1. Token embeddings. Text is split into tokens (see tokenization) and each token ID looks up a vector in a learned table — its embedding. A sentence becomes a matrix.
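As a sketch, with illustrative sizes and hypothetical token IDs (neither matches any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 50_000, 512        # illustrative sizes only
table = rng.normal(0, 0.02, (vocab_size, d_model))  # the learned embedding table

token_ids = [17, 404, 9021]              # hypothetical IDs for a 3-token sentence
X = table[token_ids]                     # the sentence is now a (3, 512) matrix
```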

2. Positional information. Self-attention is order-blind — shuffle the tokens and the math gives you the same answer. So the architecture has to inject where each token sits into the input. The original paper used fixed sinusoidal positional encodings; modern models mostly use learned or rotary variants. See why positional encodings exist and why RoPE replaced sinusoidal.
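For concreteness, here's the original sinusoidal scheme as a small sketch (learned and rotary variants work differently and aren't shown):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal encodings from the original paper (d_model must be even)."""
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]         # even dimension indices
    angles = pos / 10_000.0 ** (i / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)                 # sine on even dimensions
    enc[:, 1::2] = np.cos(angles)                 # cosine on odd dimensions
    return enc

# the first block's input is then: embeddings + sinusoidal_positions(seq_len, d_model)
enc = sinusoidal_positions(seq_len=3, d_model=512)
```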

3. The block, repeated N times. Each transformer block has two sublayers: a self-attention layer, which mixes information across positions, and a position-wise feed-forward network (FFN), which transforms each position's vector independently.

Around each sublayer the architecture wraps a residual connection and a layer norm. Those two pieces are what make stacking dozens of these blocks trainable at all.

A useful mental image is the residual stream: each token has a vector that travels straight through the stack, and every block adds something to it. Attention reads and writes back; the FFN reads and writes back. The stream is the working memory.
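Putting that together, here's a toy block in numpy. Everything in it is a simplifying assumption: a single attention mixer instead of multiple heads, a ReLU FFN, and pre-norm placement (one common choice; exact placement varies by model, as noted at the end).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                        # toy model width

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / (x.std(-1, keepdims=True) + eps)

def attention(x):
    """Stand-in single-head mixer (real blocks use multi-head attention)."""
    scores = x @ x.T / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ x

W1 = rng.normal(0, 0.1, (d, 4 * d))          # conventional 4x expansion
W2 = rng.normal(0, 0.1, (4 * d, d))

def ffn(x):
    """Position-wise MLP: transforms each token's vector independently."""
    return np.maximum(x @ W1, 0.0) @ W2      # ReLU here; modern models vary

def block(x):
    # pre-norm: normalize, run the sublayer, add the result back to the stream
    x = x + attention(layer_norm(x))         # attention reads and writes back
    x = x + ffn(layer_norm(x))               # FFN reads and writes back
    return x

x = rng.normal(size=(5, d))                  # 5 token vectors in the residual stream
for _ in range(4):                           # "repeated N times"
    x = block(x)
```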

4. Final projection. At the top of the stack, each token’s vector is projected into a distribution over the vocabulary. For an LLM, the distribution at the last position is the next-token prediction. Sample, append, run again — that’s generation.
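A sketch of that loop, with a hypothetical logits_fn standing in for the full model (in reality it would be the entire embed, position, blocks, project pipeline above):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(logits_fn, token_ids, steps):
    """Generation loop: project, sample, append, run again.
    `logits_fn` is a hypothetical stand-in: token IDs -> (seq_len, vocab) logits."""
    for _ in range(steps):
        logits = logits_fn(token_ids)[-1]            # distribution at the last position
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                         # softmax over the vocabulary
        next_id = int(rng.choice(len(probs), p=probs))  # sample a token...
        token_ids = token_ids + [next_id]               # ...append, run again
    return token_ids

toy = lambda ids: rng.normal(size=(len(ids), 100))   # fake logits, 100-token vocab
print(generate(toy, [1, 2, 3], steps=5))
```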

I’m being deliberately loose about the math (queries, keys, values, softmax, head counts, hidden sizes). The shape is re-derivable; the hyperparameters change with every new model and aren’t worth memorizing.

Going deeper

A note on what I’m sure of: the high-level shape — embeddings, positional info, stacked attention + FFN blocks with residuals and norm, final projection — is the same across essentially every modern transformer. The details (head counts, hidden sizes, exact norm placement, positional scheme, FFN activation) vary model to model and aren’t pinned down here.