Why do embeddings exist?
Computers want numbers, but you also want 'cat' and 'kitten' to live next to each other. Embeddings are the trick that makes both true at once.
Why it exists
A neural network can’t read the word “cat.” It can only multiply numbers by other numbers. So at some point, every piece of text, every image, every product in your catalog has to become a list of floats. The boring way to do that — assign each word an arbitrary integer ID — works for indexing but throws away every interesting fact about the word. In an ID-based world, “cat” and “kitten” are as far apart as “cat” and “thermodynamics.” That’s a tragedy if you wanted to build search, recommendations, deduplication, or anything that cares about meaning.
Embeddings exist to fix that. The goal is a representation that satisfies two demands simultaneously:
- It’s a vector of numbers, so a model (or a database) can do math on it.
- Similar things land near each other in that vector space, so distance becomes a proxy for “are these alike?”
The second demand is the magical one. Once your representation has it, suddenly an enormous family of problems — find documents about this question, suggest songs like this song, cluster these support tickets, flag these two reviews as near-duplicates — collapses into one operation: nearest-neighbor lookup in a vector space.
The earliest hint that this was even possible came from distributional semantics: if “cat” and “kitten” show up around the same other words, maybe “cat” and “kitten” mean something similar. Decades of work turned that hint into a tool, and at some point along the way the tool got a name — embedding — and quietly became one of the load-bearing pieces of modern software.
Why it matters now
If you build anything on top of LLMs in 2026, you’re using embeddings whether or not you call them that.
- RAG — the dominant pattern for “answer questions from my docs” — is, at its core, an embedding-based search step bolted onto an LLM. The model is the visible part. The retrieval is doing most of the actual work.
- Semantic search in your product (the box that finds “how do I cancel my subscription” when the user typed “stop the bill”) is embeddings plus a vector index.
- Recommendation systems built in the last few years are mostly “embed users, embed items, look up neighbors.”
- Clustering, deduplication, classification with very few labels — all routine once your data lives in a sensible vector space.
- Vector databases (Pinecone, Weaviate, Qdrant, Chroma, Postgres with pgvector) exist as a category only because embeddings made vector lookup the new primary key for unstructured data.
The practical reason engineers care: an embedding model plus a vector store gets you 80% of “AI features” without ever fine-tuning anything. It’s the cheapest, most reliable AI lever in the toolbox, and it predates LLMs by a decade.
The short answer
embedding = a learned function (text | image | thing) → ℝᵈ such that semantic similarity ≈ vector similarity
An embedding is a fixed-length list of floats — usually a few hundred to a few thousand dimensions — produced by a model that was trained so that “things that mean similar stuff” come out as “vectors that point in similar directions.” Once you have that, “are these two things alike?” becomes a dot product.
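To make that concrete, here is a minimal numpy sketch. The three vectors are made-up stand-ins for what an embedding model would return (real outputs have hundreds to thousands of dimensions); the only point is that "are these alike?" reduces to a normalized dot product.

import numpy as np

# Made-up stand-ins for embedding-model outputs; real vectors have
# hundreds to thousands of dimensions.
cat     = np.array([0.8, 0.1, 0.3])
kitten  = np.array([0.7, 0.2, 0.3])
invoice = np.array([0.1, 0.9, 0.0])

def cosine(a, b):
    # Direction carries the meaning, so normalize before the dot product.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(cat, kitten))   # close to 1: semantically similar
print(cosine(cat, invoice))  # much lower: unrelated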
How it works
Two questions to keep separate: how do you build the function, and how do you use the vectors once you have them?
Building the function
The training trick has the same shape across most embedding models, even when the details differ wildly:
Pull together representations of things that should be similar. Push apart representations of things that shouldn’t.
That’s it. The art is in defining “should” and “shouldn’t” without humans labeling every pair.
A few canonical recipes:
- word2vec (Mikolov et al., 2013). Slide a window across a huge corpus. For each center word, treat the surrounding words as “should be similar” and randomly-sampled other words as “shouldn’t be.” Train a tiny network to make that true. The famous party trick — vec("king") − vec("man") + vec("woman") ≈ vec("queen") — fell out almost as a side effect. (How robust that analogy actually is has been debated since; the point is that linear structure showed up in the geometry at all.)
- Sentence and document encoders (Sentence-BERT, modern OpenAI / Cohere / Voyage / open-source embedding models). Same idea, but the “thing” is now a whole sentence or passage, and the supervision often comes from question/answer pairs, paraphrase pairs, or click data — natural sources of “these two should be close.” The encoder is typically a transformer; the output is a single pooled vector.
- Image and multimodal embeddings (CLIP and successors). Train one encoder for images and one for text on hundreds of millions of (image, caption) pairs from the web, with the objective: a real pair should be close, a mismatched pair should be far. Result: a shared vector space where pictures and the words that describe them end up near each other.
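To see the shared "pull together, push apart" shape in code, here is a toy, one-directional sketch of a CLIP-style contrastive loss in numpy. The function name, batch, and encoder outputs are all invented for illustration; real training uses a deep-learning framework, very large batches, and the symmetric text-to-image plus image-to-text version of this loss.

import numpy as np

def info_nce(text_vecs, image_vecs, temperature=0.07):
    # Row i of text_vecs and row i of image_vecs are a real pair
    # ("should be similar"); every other combination in the batch is
    # treated as a mismatch ("shouldn't be").
    t = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)
    v = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    logits = t @ v.T / temperature   # (batch, batch) cosine similarities
    # Cross-entropy where the "correct" column for row i is column i.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
texts = rng.normal(size=(4, 8))    # made-up text-encoder outputs, batch of 4
images = rng.normal(size=(4, 8))   # made-up image-encoder outputs, same batch
print(info_nce(texts, images))     # training adjusts the encoders to push this down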
You almost never train your own embedding model. You pick one off the shelf, sized to your needs and your budget, and use it as a black box.
Using the vectors
Once your data is embedded, three operations cover most of what you’ll do:
similarity(a, b) = cosine(a, b) # how alike are these two?
search(q, corpus) = top_k by cosine # which are most like q?
cluster(corpus) = k-means / HDBSCAN # group by proximity
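A brute-force version of the search operation fits in a few lines of numpy, assuming the corpus and query were embedded by the same model. This is an illustrative sketch rather than any library's API, and it only stays practical up to corpora of modest size, which is where ANN indexes take over.

import numpy as np

def normalize(m):
    # Scale rows to unit length so a dot product equals cosine similarity.
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

def top_k(query_vec, corpus_vecs, k=5):
    # Brute-force nearest neighbors by cosine similarity.
    sims = normalize(corpus_vecs) @ normalize(query_vec)
    idx = np.argsort(-sims)[:k]
    return idx, sims[idx]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 384))   # pretend these are document embeddings
query = rng.normal(size=384)            # pretend this is an embedded question
ids, scores = top_k(query, corpus, k=8)
print(ids, scores)

# Clustering by proximity is one import away (scikit-learn assumed):
# from sklearn.cluster import KMeans
# labels = KMeans(n_clusters=10).fit_predict(normalize(corpus))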
A worked sketch of the RAG case:
# offline, once
for chunk in docs:
    store(id=chunk.id, vector=embed(chunk.text), payload=chunk)
# online, per question
q_vec = embed(user_question)
hits = vector_db.search(q_vec, top_k=8)
prompt = SYSTEM + format(hits) + user_question
answer = llm(prompt)
The LLM looks like the smart part. The embedding step is what made the right eight chunks land in the prompt in the first place.
A few things that surprise people the first time:
- Cosine similarity, not Euclidean distance, is usually the right metric. Embedding models tend to use the direction of the vector to encode meaning and let the magnitude drift. Two vectors pointing the same way are “similar” even if one is longer (there’s a two-line check of this right after this list).
- Dimensionality is a budget choice, not a quality knob. 1536-dim vectors are not “smarter” than 384-dim vectors in any deep sense; they’re just more expensive to store and search, and sometimes — not always — slightly more accurate. Newer models (Matryoshka-style) even let you truncate the same vector to a shorter prefix and still have it work, trading accuracy for cost on a slider.
- Embedding spaces from different models do not line up. A vector from one model is meaningless to another. If you re-embed your corpus with a new model, you have to re-embed your queries too. This is the single most common operational footgun.
- They are not magic semantic understanding. Embeddings encode whatever signal the training objective rewarded. That’s usually “topical similarity,” but it can also drag in surface features (length, language, formality) you didn’t ask for. When retrieval surprises you, the embedding is usually the suspect.
- They drift with the world. A 2022 model has never heard of a 2025 product name. The vectors will be confidently wrong about it. This is not really a bug; it’s the same staleness problem every cached representation has, and it’s why “re-embed when you swap models or add a domain” lives on every team’s runbook.
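The cosine-versus-magnitude point is easy to verify directly: scaling a vector leaves its cosine similarity to the original unchanged, while its Euclidean distance grows without bound.

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 10 * a   # same direction, ten times the magnitude

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)

print(cosine)     # ~1.0: identical as far as cosine similarity cares
print(euclidean)  # ~33.7: far apart by Euclidean distance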
The reason it all works at all is the second-most-important fact in modern ML, after “scale helps”: when you train a big enough model on a big enough corpus with a sensible “similar things should be near each other” objective, the geometry that falls out is useful far beyond what the objective explicitly asked for. You trained on next-word prediction or contrastive pairs and you got, almost as a gift, a coordinate system where “find documents like this one” is a one-line operation. That gift is the entire reason embeddings became infrastructure.
Famous related terms
- Vector / vector space — vector = ordered list of numbers. The container; embeddings are vectors with a learned meaning.
- Cosine similarity — cos(a, b) = (a·b) / (‖a‖‖b‖) — measures the angle between two vectors. The standard “are these alike?” function.
- Vector database — vector DB = store + ANN index over embeddings. Built so that “find the nearest 10 of 100M vectors” returns in milliseconds.
- ANN (approximate nearest neighbor) — ANN ≈ nearest-neighbor search + an index that's allowed to be slightly wrong. Without it, vector search at scale is unaffordable.
- RAG (Retrieval-Augmented Generation) — RAG = embedding-based retrieval + LLM generation. The most common way embeddings show up in product code today.
- word2vec — word2vec = shallow net + "predict context word" objective. The result that made everyone believe in the geometry.
- CLIP — CLIP = image encoder + text encoder + contrastive (image, caption) loss. The recipe behind shared image/text embeddings.
- Tokenization — the step before embedding: chopping text into the discrete units (tokens) that the embedding lookup table is indexed by.
Going deeper
- Efficient Estimation of Word Representations in Vector Space (Mikolov et al., 2013) — the word2vec paper, still the cleanest introduction to the contrastive idea.
- Sentence-BERT (Reimers & Gurevych, 2019) — the move from word vectors to usable sentence vectors.
- Learning Transferable Visual Models From Natural Language Supervision (Radford et al., 2021) — CLIP, the cross-modal version.
- The docs for any current embedding API (OpenAI, Cohere, Voyage, open-source sentence-transformers) — five minutes of reading the “use this for X” section will give you a working mental map of what’s available off the shelf.
A note on what I’m sure of: the high-level story (objectives, use cases, the contrastive shape, the “vectors point similar directions for similar things” property) is well-established. Specific benchmark numbers and “best embedding model right now” rankings change every few months — treat any such claim as something to verify against a current leaderboard rather than memorize.