What is an LLM?
A neural network trained to predict the next token of text — and why that simple goal scaled into something that feels like reasoning.
Why it exists
For decades, language software was hand-built rule by rule. Translation, search, summarization, dialogue — each was a separate brittle system. If you wanted a program that could “understand” a sentence, you wrote a parser, a grammar, a domain ontology, and prayed.
The hope behind LLMs is older than the technology: maybe a single model, fed enough text, could learn the structure of language by itself — without anyone sitting down to write the rules. Once it could, you wouldn’t build a separate system for translation and another for summarization. You’d just ask.
That hope kept failing because models weren’t expressive enough and training data wasn’t big enough. The transformer architecture (2017) and the leap in GPU compute changed both at once. Suddenly the simplest possible objective — “predict the next word” — produced models that could write code, explain jokes, and pass professional exams. Not because anyone taught them to. Because at scale, “predict the next word well” turns out to require a lot of competence.
Why it matters now
LLMs are the substrate behind almost every consumer AI product in 2026: chatbots, coding assistants, document Q&A, agents that book flights, customer support that doesn’t sound like customer support. Whole company functions are being rewritten around them. The shift is comparable to the early web — except compressed into about three years.
If you build software, you will integrate one. If you don’t, one will be in the tools you use tomorrow. The mechanics are worth understanding even at a high level, because the ways LLMs fail — hallucination, prompt injection, context limits — are the new bugs in the systems we’re shipping.
The short answer
LLM = neural net + "predict the next token" objective at scale
An LLM is a neural network trained on huge amounts of text to predict the next token (≈ word piece) given the tokens that came before. That’s the entire training objective. Everything else — answering questions, writing essays, following instructions — is an emergent consequence of doing that prediction well across enough text.
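To make that objective concrete, here is a minimal sketch in plain Python. The context, the probabilities, and the "correct" next token are all invented for illustration (a real model computes the probabilities from its weights); the point is only the shape of the training signal: the loss is the negative log of the probability the model gave to the token that actually came next.

```python
import math

# Toy setup: the probabilities below are invented purely to illustrate the
# training signal. A real LLM computes them from billions of learned weights.
def toy_model(context):
    return {"the": 0.05, "cat": 0.05, "sat": 0.60, "mat": 0.25, ".": 0.05}

context = ["the", "cat"]   # the tokens seen so far
actual_next = "sat"        # the token that really came next in the training text

probs = toy_model(context)
loss = -math.log(probs[actual_next])   # cross-entropy at this one position
print(f"P(next = '{actual_next}') = {probs[actual_next]:.2f}, loss = {loss:.3f}")
# Pretraining nudges the weights so this loss shrinks, over an enormous number of positions.
```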
How it works
Three pieces, in order:
1. Tokenization. Text is chopped into tokens — small chunks like " the", "un", "derstand", ".". Each token gets an integer ID. A typical vocabulary is 30k–200k tokens. A sentence becomes a list of integers.
2. The transformer. A stack of layers (often 30–100) processes the token sequence. Each layer does two things:
- Attention — every token looks at every other token in the context and decides which ones are relevant. “It” in “the cat sat on the mat because it was warm” learns to attend to “mat”. (A toy numeric sketch of this appears after the list.)
- A feed-forward network — a learned transformation applied per token.
After all layers, the model produces a probability distribution over the vocabulary for the next token.
3. Sampling. Pick a token from that distribution (sometimes the most likely, sometimes randomly weighted by probability), append it to the context, and repeat. Token by token, an answer appears. (The whole loop, toy tokenizer included, is sketched in code below.)
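To make the attention step from part 2 concrete, here is a toy NumPy sketch of scaled dot-product attention over three tokens. The 4-dimensional vectors are made up, and real models learn separate query, key, and value projections and run many attention heads in parallel, which this sketch skips.

```python
import numpy as np

# Three tokens, each represented by a made-up 4-dimensional vector. Real models
# learn separate query/key/value projections per layer; this toy version reuses
# the same vectors for all three roles.
x = np.array([
    [1.0, 0.0, 1.0, 0.0],   # "the"
    [0.0, 1.0, 0.0, 1.0],   # "cat"
    [1.0, 1.0, 0.0, 0.0],   # "sat"
])
q, k, v = x, x, x
d = q.shape[-1]

scores = q @ k.T / np.sqrt(d)            # relevance of every token to every other token
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over each row
output = weights @ v                     # each token becomes a weighted mix of all tokens

print(np.round(weights, 2))              # row i: how much token i attends to each token
```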
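And here is the whole pipeline of steps 1 and 3 in one loop: a hand-written six-token vocabulary standing in for a real subword tokenizer, a fake_transformer function standing in for the real model, and probability-weighted sampling. Only the shape of the loop matches real systems; every name and number in it is invented.

```python
import random

# Step 1: a toy tokenizer. Real tokenizers use learned subword vocabularies with
# tens of thousands of entries; this one has six, invented for the demo.
vocab = {" the": 0, " cat": 1, " sat": 2, " on": 3, " mat": 4, ".": 5}
id_to_token = {i: t for t, i in vocab.items()}

def fake_transformer(token_ids):
    # Stand-in for the real model: returns one probability per vocabulary entry.
    # A real transformer computes these from its weights; these are made up.
    canned = {
        (0, 1):          [0.01, 0.01, 0.90, 0.05, 0.02, 0.01],  # "the cat" -> likely "sat"
        (0, 1, 2):       [0.02, 0.01, 0.01, 0.85, 0.01, 0.10],  # -> likely "on"
        (0, 1, 2, 3):    [0.70, 0.05, 0.05, 0.05, 0.10, 0.05],  # -> likely "the"
        (0, 1, 2, 3, 0): [0.02, 0.03, 0.02, 0.02, 0.88, 0.03],  # -> likely "mat"
    }
    return canned.get(tuple(token_ids), [0.05, 0.05, 0.05, 0.05, 0.05, 0.75])

# Step 3: the sampling loop: predict, pick, append, repeat.
tokens = [vocab[" the"], vocab[" cat"]]
for _ in range(5):
    probs = fake_transformer(tokens)
    next_id = random.choices(range(len(probs)), weights=probs)[0]  # weighted by probability
    tokens.append(next_id)
    if next_id == vocab["."]:
        break

print("".join(id_to_token[i] for i in tokens))   # most likely: " the cat sat on the mat."
```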
That’s the base model. Two more steps make it useful for chat:
- Instruction tuning — fine-tune on examples of “user asks X, assistant responds Y” so the model learns to follow requests instead of just continuing text.
- Reinforcement learning from human/AI feedback (RLHF / RLAIF) — train the model to prefer responses humans (or another model) rated higher. This is what makes the assistant feel helpful and avoid obviously bad outputs. (The reward-model loss at the core of this is sketched below.)
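At the heart of that feedback step is a reward model trained on comparisons. Here is a hedged sketch of the standard pairwise preference loss, assuming the usual setup where labelers pick the better of two responses; the two reward scores are made up, where a real pipeline would get them from a learned reward model.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Reward scores for two candidate responses to the same prompt. In a real pipeline
# these come from a learned reward model; the numbers here are made up.
reward_chosen = 2.1    # the response the labeler preferred
reward_rejected = 0.4  # the response the labeler rejected

# Pairwise preference loss: push the chosen response's reward above the rejected one's.
loss = -math.log(sigmoid(reward_chosen - reward_rejected))
print(f"loss = {loss:.3f}")   # shrinks as the margin between chosen and rejected grows
```

The assistant is then trained with reinforcement learning to produce responses this reward model scores highly, typically with a penalty for drifting too far from the instruction-tuned starting point.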
The “intelligence” is mostly compressed into the transformer’s weights — usually billions to trillions of numbers — formed during pretraining on internet-scale text.
Famous related terms
- Transformer — transformer ≈ stack of (attention + feed-forward) layers. The neural network architecture LLMs are built on.
- Tokenization — tokenization = text → list of integer IDs. Turning text into the integer IDs the model actually sees.
- Embeddings — embedding = thing → vector you can do math on. The vectors that represent tokens (and meanings) inside the model. (A tiny similarity sketch follows this list.)
- Attention — attention = each token weights every other token by learned relevance. The mechanism that lets each token look at every other token.
- Context window — context window = how many tokens the model can see at once.
- Hallucination — hallucination = confident output that isn't true. Confidently producing something false, because next-token prediction has no built-in truth check.
- RLHF — RLHF = supervised fine-tune + reward model + RL loop. The alignment step that shapes a base model into an assistant.
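To make "a vector you can do math on" concrete, here is a tiny cosine-similarity sketch with invented 3-dimensional embeddings. Nothing about the specific numbers is real; the useful property is that vectors for related things end up pointing in similar directions.

```python
import math

# Made-up 3-dimensional embeddings. Real embedding vectors are learned and have
# hundreds or thousands of dimensions, but the arithmetic is identical.
embeddings = {
    "cat":         [0.9, 0.1, 0.3],
    "kitten":      [0.85, 0.15, 0.35],
    "spreadsheet": [0.1, 0.9, 0.2],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(round(cosine(embeddings["cat"], embeddings["kitten"]), 3))       # high: similar meaning
print(round(cosine(embeddings["cat"], embeddings["spreadsheet"]), 3))  # low: unrelated
```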
Going deeper
- Attention Is All You Need (Vaswani et al., 2017) — the transformer paper.
- Language Models are Few-Shot Learners (Brown et al., 2020) — GPT-3, the paper where scaling really announced itself.
- 3Blue1Brown’s Neural Networks series — visual intuition for how a transformer actually computes.
- Andrej Karpathy’s Let’s build GPT video — builds a tiny LLM from scratch in about two hours.