10 famous AI-ML terms
The vocabulary you keep hearing on every podcast — neural network, transformer, RLHF, RAG — compressed to one line each, then unpacked.
Why it exists
If you’ve followed AI for the last few years, you’ve collected a pile of words — transformer, embedding, RLHF, hallucination, agent — that get used like everyone already knows them. They sound technical because they are, but most of them compress to a single sentence once someone bothers to draw the picture.
This post is that picture, ten times. Each entry is one line of compression and a short paragraph of unpack. The goal is not to teach you the math — the goal is to make the words stop being noise.
Why it matters now
In 2026, AI vocabulary has leaked into product specs, hiring posts, news headlines, and dinner conversations. The cost of being fuzzy on what a word means stopped being “I sound silly at a party” and started being “I can’t tell whether the press release is meaningful.” Compression beats mystique. If a term truly requires a paragraph to explain, that’s signal too — but most of these don’t.
I picked these ten because they’re the words that show up most often when non-specialists try to discuss modern AI. Other strong candidates (fine-tuning, chain-of-thought, diffusion, MoE, quantization) are in Famous related terms at the bottom.
The short answer
AI-ML jargon = math you can re-derive + branding you have to memorize
Most of the famous terms are an old idea (matrix math, gradient descent, probability) wrapped in a new name and a specific recipe. The recipe is the part worth knowing.
How it works
1. Neural network
neural network = stack of (linear layer + nonlinearity) + gradient descent
A function approximator built from many tiny, dumb pieces. Each “neuron” multiplies its inputs by weights, adds a bias, and passes the result through a simple nonlinearity. Stack enough of those, and the whole thing can fit patterns nobody knows how to write down by hand. Training means nudging all the weights, one mini-batch at a time, in whatever direction reduces the loss. See neural network.
2. LLM (Large Language Model)
LLM = neural net + "predict the next token" objective at scale
A transformer trained on enormous amounts of text to guess the next token given the ones that came before. Everything else — answering questions, writing code, holding a conversation — is what falls out of doing that one prediction job extremely well across enough text. See LLM.
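To make "predict the next token" concrete, here is a toy sketch: a made-up four-word vocabulary and made-up logits (the numbers are purely illustrative), turned into a probability distribution with softmax. Real models do exactly this over vocabularies of tens of thousands of tokens, once per generated token.

```python
import numpy as np

# Hypothetical logits a model might emit for the context
# "the cat sat on the" -- invented numbers, for illustration only.
vocab = ["mat", "dog", "moon", "chair"]
logits = np.array([4.0, 1.5, 0.5, 2.5])

# Softmax turns raw scores into a probability distribution.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Greedy decoding: just take the most probable token.
next_token = vocab[int(np.argmax(probs))]  # -> "mat"
```

Sampling from `probs` instead of taking the argmax is where "temperature" and the rest of the decoding knobs live.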
3. Transformer
transformer = stack of (self-attention + feed-forward) blocks + positional info
The architecture under almost every modern AI system. Its central trick is letting every token in a sequence look at every other token directly, in one parallel operation, instead of marching through the sequence step by step the way older RNNs did. That’s why it scales on GPUs, and that’s why we have the current wave at all. See transformer.
4. Attention
attention = each token weights every other token by learned relevance
Inside a transformer block, every position decides “which other positions matter to me right now?” and pulls in a weighted blend of their information. The weights are learned, content-dependent, and recomputed every layer. It’s the mechanism that lets the word “it” find “the cat” three sentences earlier. See attention.
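Scaled dot-product attention is short enough to write out in full. A NumPy sketch with toy shapes (5 tokens, 16 dimensions); multiple heads, masking, and the learned Q/K/V projections are omitted:

```python
import numpy as np

def attention(Q, K, V):
    """Each row of the output is a blend of the rows of V,
    weighted by how well that query matches each key."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # every token vs. every token
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)        # softmax: each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))          # 5 tokens, 16-dim each
out, w = attention(x, x, x)           # self-attention: Q, K, V from the same sequence
```

The `weights` matrix is the “who looks at whom” table, and it is recomputed from the content at every layer.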
5. Embedding
embedding = a thing → a vector you can do math on
Words, sentences, images, even users — each gets mapped to a list of numbers (often hundreds to a few thousand) so similar things land near each other in that space. Once everything is a vector, “find similar items” becomes a distance query, which is what powers semantic search, recommendations, and the input layer of essentially every modern language model. See embeddings.
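The “distance query” is one line of math. A sketch with made-up 4-dimensional vectors; real embeddings come out of a trained model and have hundreds to thousands of dimensions:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0 means unrelated."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Invented embeddings, for illustration only.
emb = {
    "cat":    np.array([0.90, 0.80, 0.10, 0.00]),
    "kitten": np.array([0.85, 0.75, 0.20, 0.10]),
    "truck":  np.array([0.00, 0.10, 0.90, 0.80]),
}

# "Find similar items" = compare distances in the vector space.
cat_kitten = cosine(emb["cat"], emb["kitten"])   # high
cat_truck = cosine(emb["cat"], emb["truck"])     # low
```

Semantic search is this comparison run against millions of stored vectors at once.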
6. Tokenization
tokenization = text → list of integer IDs the model actually sees
Models don’t read characters or words — they read token IDs from a fixed vocabulary, typically tens of thousands of entries. A token might be " the", "un", or "derstand". This is why models miscount letters in a word, why some languages cost more per request than others, and why “context length” is measured in tokens, not words. See tokenization.
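A toy greedy tokenizer makes the splitting behavior visible. The vocabulary here is invented; real tokenizers learn theirs from data (e.g. via byte-pair encoding) and handle every possible input:

```python
# Invented 5-entry vocabulary; real ones have tens of thousands of entries.
vocab = {" the": 1, "un": 2, "derstand": 3, "able": 4, " ": 5}

def tokenize(text):
    """Greedy longest-match tokenization against the vocabulary."""
    ids = []
    while text:
        match = max((t for t in vocab if text.startswith(t)), key=len, default=None)
        if match is None:
            raise ValueError(f"no token for {text!r}")
        ids.append(vocab[match])
        text = text[len(match):]
    return ids

ids = tokenize("understandable")  # -> [2, 3, 4]: one word, three tokens
```

The model only ever sees `[2, 3, 4]` — which is why asking it to count the letters in "understandable" is harder than it looks.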
7. RLHF (Reinforcement Learning from Human Feedback)
RLHF = supervised fine-tune + reward model + RL loop
The step that turns a base LLM (which just continues text) into an assistant (which follows instructions and avoids obviously bad outputs). Humans rank pairs of model responses; a separate “reward model” learns to predict those rankings; then the main model is trained, via reinforcement learning, to produce outputs the reward model likes. See why RLHF exists.
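The reward-model step is the most concrete part of the pipeline, and it fits in a sketch. Here a linear model is trained with the pairwise (Bradley–Terry) loss used in RLHF; the feature vectors standing in for “preferred” and “dispreferred” responses are randomly generated, so everything below is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up features: preferred responses are shifted so they're separable on average.
chosen   = rng.normal(1.0, 1.0, size=(64, 8))   # human-preferred responses
rejected = rng.normal(0.0, 1.0, size=(64, 8))   # dispreferred responses

w = np.zeros(8)  # reward model parameters
for _ in range(200):
    margin = chosen @ w - rejected @ w
    # Pairwise loss: -log sigmoid(r_chosen - r_rejected).
    # Its gradient pushes chosen scores up and rejected scores down.
    grad = -(1 - 1 / (1 + np.exp(-margin)))[:, None] * (chosen - rejected)
    w -= 0.1 * grad.mean(axis=0)

# How often the trained reward model agrees with the human rankings.
accuracy = np.mean(chosen @ w > rejected @ w)
```

The RL loop then optimizes the main model against this learned scorer — which is also why reward models can be gamed.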
8. Hallucination
hallucination = confident output that isn't true
The signature failure mode of LLMs: a plausible-sounding citation, function name, or fact that simply doesn’t exist. There’s no internal “do I actually know this?” check — next-token prediction will produce something, and at the boundary of the model’s knowledge that something is often fabricated. See hallucination.
9. RAG (Retrieval-Augmented Generation)
RAG = retrieve relevant docs + stuff them into the prompt + generate
A way to ground an LLM in information it wasn’t trained on (your company’s docs, last week’s news, a private codebase) without retraining the model. At query time, you fetch the most relevant chunks — usually via embedding similarity — and paste them into the context before the question. The model then answers using that context. See RAG.
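The whole pipeline fits in a sketch. The embedding function below is a crude bag-of-words stand-in so the example is self-contained (a real system would call an embedding model), and the documents and question are invented:

```python
import numpy as np

docs = [
    "Refunds are processed within 14 days of purchase.",
    "The office is closed on public holidays.",
    "Support tickets are answered within one business day.",
]
question = "How long do refunds take?"

def split_words(text):
    return text.lower().replace("?", " ").replace(".", " ").split()

# Bag-of-words vectors over a shared vocabulary -- a stand-in for
# real embeddings, which come from a trained model.
vocab = sorted({w for t in docs + [question] for w in split_words(t)})
def embed(text):
    v = np.array([split_words(text).count(w) for w in vocab], dtype=float)
    return v / (np.linalg.norm(v) + 1e-9)

doc_vecs = np.array([embed(d) for d in docs])
scores = doc_vecs @ embed(question)       # similarity of question to each chunk
best = docs[int(np.argmax(scores))]       # retrieve the most relevant chunk

# Stuff the retrieved context into the prompt; this is what you'd send to the LLM.
prompt = f"Context:\n{best}\n\nQuestion: {question}\nAnswer using only the context."
```

Production RAG adds chunking, reranking, and citation of sources, but the shape is this: retrieve, paste, generate.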
10. Agent
agent = model + harness (tools + loop + memory + control flow)
An LLM by itself just produces text. An agent is what you get when you wrap it in a loop that lets it call tools (search, code execution, file edits, APIs), feeds the results back, and keeps going until the task is done. The buzz is about the model; most of the engineering is the harness. See agent harness.
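The harness is ordinary control flow. In this sketch the “model” is a hard-coded stand-in so the loop is visible; a real harness would call an LLM API wherever `fake_model` is called, and the `CALL`/`FINAL` protocol is invented for illustration:

```python
import ast
import operator

def calculator(expr):
    """A 'tool': safely evaluate +,-,*,/ arithmetic via the AST."""
    ops = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}
    def ev(n):
        if isinstance(n, ast.BinOp): return ops[type(n.op)](ev(n.left), ev(n.right))
        if isinstance(n, ast.Constant): return n.value
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)

def fake_model(history):
    # Pretend-LLM: asks for the tool once, then answers from its result.
    if not any(m.startswith("tool:") for m in history):
        return "CALL calculator 37 * 43"
    result = history[-1].split(":", 1)[1]
    return f"FINAL The answer is {result}"

def run_agent(task, tools, max_steps=5):
    history = [f"user: {task}"]                     # memory
    for _ in range(max_steps):                      # the loop
        action = fake_model(history)
        if action.startswith("FINAL"):              # control flow: done?
            return action.removeprefix("FINAL ").strip()
        _, name, arg = action.split(" ", 2)         # e.g. "CALL calculator 37 * 43"
        history.append(f"tool:{tools[name](arg)}")  # feed the result back
    return "gave up"

answer = run_agent("What is 37 * 43?", {"calculator": calculator})
```

Swap `fake_model` for a real model and the structure is recognizably the same: tools, loop, memory, stopping condition.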
Famous related terms
- Fine-tuning — fine-tuning = continue training a pretrained model on your own data. Cheap because the hard work was done in pretraining; see why fine-tuning is cheap.
- Chain-of-thought — CoT = prompt the model to write its reasoning before its answer. Often improves accuracy on multi-step problems; see chain-of-thought.
- Diffusion model — diffusion = learn to denoise + run the denoiser repeatedly. The generative recipe behind most current image and video systems (the architecture inside is usually a U-Net or a diffusion transformer); see why diffusion models exist.
- Mixture of Experts (MoE) — MoE = many expert FFNs + a router that picks a few per token. How recent frontier models grow capacity without paying the full per-token compute cost; see why mixture of experts exists.
- Quantization — quantization = store weights in fewer bits (8, 4, even 2) instead of 16/32. Makes big models fit on smaller hardware; see why quantization works.
- Context window — context window = how many tokens the model can see at once. Bigger isn’t automatically better — see why lost in the middle.
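Of these, quantization is the easiest to demystify in code. A sketch of symmetric int8 quantization: store 8-bit integers plus a single float scale per tensor (real schemes typically use per-channel or per-group scales, and more care):

```python
import numpy as np

def quantize_int8(w):
    """Map floats to int8: one scale factor, then round and clip."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate floats on the fly."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(256, 256)).astype(np.float32)  # toy weight matrix
q, scale = quantize_int8(w)

bytes_fp32, bytes_int8 = w.nbytes, q.nbytes        # 4x smaller storage
max_err = np.abs(dequantize(q, scale) - w).max()   # worst-case rounding error
```

The trade is exactly what the one-liner says: a quarter of the memory for a bounded amount of rounding error.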
Going deeper
- Attention Is All You Need (Vaswani et al., 2017) — the transformer paper. Short and still the right starting point.
- Language Models are Few-Shot Learners (Brown et al., 2020) — the GPT-3 paper that made scaling undeniable.
- Training language models to follow instructions with human feedback (Ouyang et al., 2022) — the InstructGPT paper that popularized the SFT + reward model + PPO recipe for instruction-tuned LLMs. The RLHF idea itself is older (Christiano et al. 2017; Stiennon et al. 2020).
- Andrej Karpathy, Let’s build GPT: from scratch, in code, spelled out — builds a tiny LLM end-to-end on video; the best two hours you can spend on this list.
- 3Blue1Brown, Neural Networks series — visual intuition for what a neural net actually computes.