What is a transformer?
The neural network architecture behind every modern LLM, image model, and protein folder — and the one big idea that made it work: drop recurrence, let every token look at every other token directly.
Why it exists
Before transformers, the default way to handle a sequence — a sentence, a time series, an audio clip — was an RNN or its better-behaved cousin, the LSTM. You fed the model one token, it updated a hidden state, you fed it the next token, it updated again. Sentence as conveyor belt.
That design had two problems that compounded each other.
The first was parallelism. Step t needs the hidden state from t−1, which needs t−2, and so on. You couldn’t really use a GPU the way GPUs want to be used — thousands of cores at once — because the work was a chain. No amount of hardware fixed it.
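To make the chain concrete, here is a minimal sketch of the recurrence in NumPy; the shapes and random weights are made up, purely to show the dependency, not taken from any real model:

```python
import numpy as np

# Toy shapes and random weights, purely to show the dependency chain.
rng = np.random.default_rng(0)
d = 64
W_h = rng.normal(size=(d, d)) * 0.1   # hidden-to-hidden weights
W_x = rng.normal(size=(d, d)) * 0.1   # input-to-hidden weights

def rnn_step(h_prev, x_t):
    return np.tanh(h_prev @ W_h + x_t @ W_x)

tokens = rng.normal(size=(100, d))    # a 100-token sequence of vectors
h = np.zeros(d)
for x_t in tokens:        # this loop is inherently serial:
    h = rnn_step(h, x_t)  # step t cannot start until step t-1 finishes
```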
The second was long-range dependencies. By the time information from the start of a passage reached the end, it had been squeezed through hundreds of tiny update steps and mostly washed out. LSTMs helped but didn’t solve it. Connecting “the trophy” at the start of a paragraph to “it” at the end meant fighting the architecture.
The 2017 paper Attention Is All You Need (Vaswani et al.) made a sharper move than people expected: throw recurrence out entirely. Instead of carrying state forward step by step, let every token look directly at every other token, in parallel, in a single operation called self-attention. No conveyor belt. No washing-out. Embarrassingly parallel on a GPU.
That is the whole reason the transformer exists. The rest is engineering around that one swap.
Why it matters now
Once people had an architecture that scaled cleanly with compute, scale itself became the lever — and the transformer turned out to be ridiculously general. The same block design sits underneath every modern LLM (GPT, Claude, Gemini, Llama), most strong vision models (ViT and successors), speech recognition (Whisper), and protein structure prediction (AlphaFold’s core blocks are attention-based, even if the full system is more than a vanilla transformer). If you’re trying to understand any current AI system above the API layer, “transformer” is almost certainly the shape you’re looking at.
The short answer
transformer = stack of (self-attention + feed-forward) blocks, with positional info baked into the input
A transformer turns a sequence of tokens into a sequence of contextual vectors by repeatedly mixing information across positions (self-attention) and then transforming each position individually (feed-forward). Stack that block enough times, train at scale, and you get the substrate behind nearly every modern AI system.
How it works
Five pieces, in order:
1. Token embeddings. Text is split into tokens (see tokenization) and each token ID looks up a vector in a learned table — its embedding. A sentence becomes a matrix.
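As a sketch, with a toy vocabulary size and embedding dimension (both made up; real models use tens of thousands of tokens and hundreds or thousands of dimensions):

```python
import numpy as np

# Toy numbers for illustration only.
vocab_size, d_model = 1000, 16
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model))  # learned in training

token_ids = np.array([17, 4, 902, 4])   # a tokenized sentence, as IDs
x = embedding_table[token_ids]          # shape (4, d_model): the matrix
```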
2. Positional information. Self-attention is order-blind — shuffle the tokens and the math gives you the same answer. So the architecture has to inject where each token sits into the input. The original paper used fixed sinusoidal positional encodings; modern models mostly use learned or rotary variants. See why positional encodings exist and why RoPE replaced sinusoidal.
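Here is the original sinusoidal scheme as a sketch (it assumes an even d_model; as noted above, modern models mostly use other schemes):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # The fixed scheme from the 2017 paper: each position gets a unique
    # pattern of sines and cosines at geometrically spaced frequencies.
    # Assumes d_model is even.
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # (1, d_model / 2)
    angles = pos / (10000.0 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Injected by simple addition, before the first block:
# x = x + sinusoidal_positions(seq_len, d_model)
```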
3. The block, repeated N times. Each transformer block has two sublayers:
- Self-attention. Every position computes a weighted sum over the vectors at all positions, where the weights are learned and depend on the content. This is where information moves between tokens.
- Feed-forward network (a small MLP). A per-token transformation, applied independently at each position. This is where each token “thinks” about what it just gathered.
Around each sublayer the architecture wraps a residual connection and a layer norm. Those two pieces are what make stacking dozens of these blocks trainable at all.
A useful mental image is the residual stream: each token has a vector that travels straight through the stack, and every block adds something to it. Attention reads and writes back; the FFN reads and writes back. The stream is the working memory.
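A minimal single-head version of the block, written against that mental image. The pre-norm placement, the sizes, the ReLU activation, and the weights shared across blocks are all simplifying assumptions for the sketch, not a description of any specific model:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    # Every position scores every position; weights depend on content.
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (seq, seq) relevance scores
    return softmax(scores) @ v               # weighted mix across positions

def ffn(x, W1, W2):
    # Per-token MLP, applied independently at each position.
    return np.maximum(0.0, x @ W1) @ W2      # ReLU here; activations vary

def block(x, p):
    # The residual stream: each sublayer reads the stream, adds back an update.
    x = x + self_attention(layer_norm(x), *p["attn"])
    x = x + ffn(layer_norm(x), *p["ffn"])
    return x

d, d_ff = 16, 64
rng = np.random.default_rng(0)
p = {  # real models learn separate weights for every block and head
    "attn": [rng.normal(size=(d, d)) * 0.1 for _ in range(3)],
    "ffn": [rng.normal(size=(d, d_ff)) * 0.1, rng.normal(size=(d_ff, d)) * 0.1],
}
x = rng.normal(size=(4, d))   # 4 tokens in...
for _ in range(6):            # ...repeated N times
    x = block(x, p)           # ...4 contextual vectors out
```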
4. Final projection. At the top of the stack, each token’s vector is projected into a distribution over the vocabulary. For an LLM, the distribution at the last position is the next-token prediction. Sample, append, run again — that’s generation.
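The loop itself, as a sketch. Here run_transformer is a stand-in stub for the full stack above, and every shape is a made-up toy value, just so the loop runs end to end:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 16
embedding_table = rng.normal(size=(vocab_size, d_model))
W_out = rng.normal(size=(d_model, vocab_size)) * 0.1   # the final projection

def run_transformer(ids):
    # Stand-in stub for the full stack (embeddings + positions + N blocks);
    # a bare table lookup here, just so the loop below actually runs.
    return embedding_table[ids]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

token_ids = [17, 4, 902]                   # the prompt, as token IDs
for _ in range(10):
    x = run_transformer(token_ids)         # (seq_len, d_model)
    logits = x[-1] @ W_out                 # last position -> scores over vocab
    next_id = int(rng.choice(vocab_size, p=softmax(logits)))
    token_ids.append(next_id)              # sample, append, run again
```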
I’m being deliberately loose about the math (queries, keys, values, softmax, head counts, hidden sizes). The shape is re-derivable; the hyperparameters change with every new model and aren’t worth memorizing.
Famous related terms
- Attention —
attention = each token weights every other token by learned relevance
The mechanism the architecture is named after.
- Self-attention vs cross-attention —
self-attention = a sequence attending to itself; cross-attention = one sequence attending to a different sequence (e.g. a decoder reading from an encoder’s output)
- Positional encoding / RoPE —
positional encoding = "where am I in the sequence?" injected into the input
See also why RoPE replaced sinusoidal.
- Decoder-only / encoder-only / encoder-decoder —
decoder-only = causal self-attention only; encoder-only = bidirectional self-attention only; encoder-decoder = encoder + decoder with cross-attention
Three flavors of the same block. Encoder-only (BERT-style) reads a whole sequence at once. Decoder-only (GPT-style) is causal — each position only attends to earlier ones, which is what makes next-token generation work (see the sketch after this list). Encoder-decoder (the original 2017 design) does both.
- Residual connection —
residual = output + input
Lets gradients flow straight from the top of the stack to the bottom, which is most of why very deep transformers train at all.
- LLM —
LLM = transformer + "predict the next token" objective at scale
The most visible product of this architecture.
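What "causal" means mechanically, as a sketch: a mask of negative infinities added to the (seq, seq) attention scores before the softmax, so each position’s weights on future positions come out exactly zero:

```python
import numpy as np

def causal_mask(seq_len):
    # -inf above the diagonal: softmax turns those scores into exactly-zero
    # weights, so position i can only attend to positions 0..i.
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(upper == 1, -np.inf, 0.0)

# Added to the (seq, seq) attention scores before the softmax:
# scores = scores + causal_mask(scores.shape[-1])
print(causal_mask(4))
```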
Going deeper
- Attention Is All You Need (Vaswani et al., 2017) — the original paper. Short, readable, diagrams still hold up.
- Jay Alammar, The Illustrated Transformer — the canonical visual walkthrough; it has done more to teach this architecture than any textbook.
- Andrej Karpathy, Let’s build GPT: from scratch, in code, spelled out — builds a tiny decoder-only transformer end-to-end on video. After watching it once, the literature stops feeling like incantations.
A note on what I’m sure of: the high-level shape — embeddings, positional info, stacked attention + FFN blocks with residuals and norm, final projection — is the same across essentially every modern transformer. The details (head counts, hidden sizes, exact norm placement, positional scheme, FFN activation) vary model to model and aren’t pinned down here.