Why does cosine similarity dominate over Euclidean distance in embeddings?
Two vectors can be far apart and still mean the same thing. Cosine similarity asks the only question that turns out to matter: are they pointing the same way?
Why it exists
Picture two Spotify playlists. Yours has 200 songs; a friend’s has 20. Both are roughly 60% rock, 30% pop, 10% jazz. Do you have similar taste? If you measured “how far apart are the raw song counts,” you’d say no — yours is ten times bigger. If you measured “are the proportions the same,” you’d say yes. Cosine similarity is the second measurement. It throws away how much there is and only asks whether two things point in the same direction. That’s exactly the question you want to ask of an embedding, and exactly the wrong question to ask of a map.
If you’ve ever wired up RAG, opened a vector database, or read the docs for any embedding API, you’ve seen the same line: use cosine similarity to compare vectors. Almost nobody stops to ask why. Euclidean distance — the straight-line ruler distance you learned in school — is right there, it works in any number of dimensions, and it’s what “distance” means in normal life. So why did the entire field quietly agree to ignore it?
The short version is: when you turn meaning into a vector, the direction of the vector is the part that carries meaning, and the length of the vector tends to drift around for boring reasons — how long the document was, how confident the model felt, how many tokens it averaged over. Euclidean distance treats length and direction as equally important. Cosine similarity throws length away on purpose. In embedding space, that turns out to be exactly the right thing to do.
This post is about why that works out, what cosine similarity actually is once you cut through the formula, and where it stops being the right tool.
Why it matters now
Anyone building on top of LLMs in 2026 is doing nearest-neighbor lookups in a vector space many times a day, often without thinking about it. The choice of similarity metric is buried inside pgvector, Pinecone, FAISS, Qdrant — and it shows up as a configuration flag (cosine / dot / l2) that most teams set once and never revisit.
Picking the wrong one quietly degrades retrieval quality. A RAG system that uses Euclidean distance over un-normalized embeddings will rank short, terse passages systematically differently from long, verbose ones — not because they mean different things but because their vectors are different lengths. The model never told you that was going to happen. The metric did.
So this is one of those small choices that sits under a lot of working software. Worth understanding once.
The short answer
cosine_similarity(a, b) = (a · b) / (‖a‖ · ‖b‖) = cos(θ)
It’s the cosine of the angle between two vectors. It ignores how long each vector is and only asks how aligned they are. Two vectors pointing the same way score 1, perpendicular ones score 0, opposite ones score −1. Euclidean distance, by contrast, asks “how far apart are the two tips?” — which mixes “different direction” and “different length” into a single number, even though direction is the part embedding models were specifically trained to make meaningful.
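In code it’s one line plus two norms. Here’s a minimal sketch in NumPy (the vectors below are made up for illustration), showing how the two metrics disagree when two vectors point the same way but differ in length:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(θ) = (a · b) / (‖a‖ ‖b‖)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # ‖a − b‖: straight-line distance between the two tips
    return float(np.linalg.norm(a - b))

a = np.array([3.0, 4.0, 0.0])
b = np.array([0.3, 0.4, 0.0])    # same direction, one tenth the length

print(cosine_similarity(a, b))   # ≈ 1.0: identical by angle
print(euclidean_distance(a, b))  # ≈ 4.5: far apart by ruler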
How it works
Two vectors a and b in some high-dimensional space. There are basically two reasonable ways to ask “are these similar?”
- Euclidean distance — ‖a − b‖. The straight line between the two tips. Smaller is more similar.
- Cosine similarity — (a · b) / (‖a‖ ‖b‖). Normalize both vectors to unit length first, then take the dot product. Larger is more similar.
Geometrically, cosine similarity is what you get if you slide both vectors to the origin, project them onto the unit sphere, and then measure how close they are. Length information is discarded by the projection. Direction is all that’s left.
Why direction is the meaningful part
Embedding models are trained with objectives like “pull these two representations together, push those two apart.” The training signal shapes where the vector points. It does not, in general, pin down a canonical length. Two side effects fall out of that:
- Vectors of related things end up clustered along similar directions — that’s the property the loss explicitly rewarded.
- Vector magnitudes encode incidental stuff — passage length, token count, how many activations happened to be large that day. Models often pool over tokens (mean, max, last hidden state of a [CLS] token) and that pooling step is one of the main ways magnitude leaks in.
If you compare with Euclidean distance, the second effect contaminates the first. A short query and a long document about the exact same topic can have very different magnitudes, so their tips can be far apart in raw space even though their directions agree. Cosine sidesteps the issue: project both onto the unit sphere, look at the angle, done.
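Here’s a toy sketch of that leak (not how any real model pools, just an illustration): a shared “topic” direction plus per-token noise, sum-pooled over tokens. A 5-token note and a 200-token passage about the same topic agree almost perfectly in direction, while their magnitudes, and therefore their Euclidean distance, diverge:

import numpy as np

rng = np.random.default_rng(0)
topic = rng.normal(size=64)      # the "meaning" direction, shared by both passages

def embed(n_tokens: int) -> np.ndarray:
    # toy pooling: each token is topic + noise, summed over the passage,
    # so magnitude grows with passage length while direction barely moves
    tokens = topic + 0.1 * rng.normal(size=(n_tokens, 64))
    return tokens.sum(axis=0)

short_doc = embed(5)
long_doc = embed(200)

cos = short_doc @ long_doc / (np.linalg.norm(short_doc) * np.linalg.norm(long_doc))
print(round(float(cos), 3))                            # ≈ 1.0: same direction, same topic
print(round(float(np.linalg.norm(short_doc - long_doc)), 1))  # huge: very different lengths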
The “they’re almost the same metric” trick
Here’s the part that’s worth carrying around in your head. If both vectors have been normalized to unit length (‖a‖ = ‖b‖ = 1), then:
‖a − b‖² = ‖a‖² + ‖b‖² − 2(a · b)
= 1 + 1 − 2(a · b)
= 2 − 2 · cos(θ)
So on the unit sphere, Euclidean distance and cosine similarity are monotonically related — ranking by one gives the same nearest neighbors as ranking by the other. That’s why a vector database that “only supports L2” can still do cosine search: you normalize the vectors at insert time and the index does the right thing.
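You can check the identity numerically with a pair of random vectors:

import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(size=1536)
b = rng.normal(size=1536)

# project both onto the unit sphere
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)

cos = float(a_hat @ b_hat)
l2_squared = float(np.linalg.norm(a_hat - b_hat) ** 2)

print(l2_squared, 2 - 2 * cos)   # same number: ‖a − b‖² = 2 − 2·cos(θ) on the unit sphere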
It’s also why, in practice, lots of systems quietly store unit-normalized embeddings and use plain dot product as the similarity score. With normalization, dot product is cosine. Without it, dot product is “cosine, weighted by how big the vectors happen to be” — sometimes useful (popular items get a magnitude boost in some recsys setups) but usually a footgun.
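Here’s what that pattern looks like as a sketch in FAISS, with random placeholder vectors standing in for real embeddings. normalize_L2 rescales each row to unit length in place, and IndexFlatIP does exact inner-product search, which after normalization is cosine search:

import numpy as np
import faiss

d = 384                                      # embedding dimension (placeholder)
corpus = np.random.rand(10_000, d).astype("float32")
query = np.random.rand(1, d).astype("float32")

# normalize at insert time and at query time; inner product is now cosine
faiss.normalize_L2(corpus)
faiss.normalize_L2(query)

index = faiss.IndexFlatIP(d)                 # exact inner-product index
index.add(corpus)

scores, ids = index.search(query, 5)         # top-5 neighbors by cosine similarity
print(ids[0], scores[0])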
A concrete example
Imagine three documents, embedded into 3D for the sake of the illustration:
A = [10, 0, 0] # short doc about cats
B = [ 1, 0, 0] # one-sentence note about cats
C = [ 0, 9, 0] # short doc about thermodynamics
Euclidean distances: ‖A − B‖ = 9, ‖A − C‖ ≈ 13.5. So B is “closer” to A than C is — fine, that matches intuition.
Cosine similarities: cos(A, B) = 1, cos(A, C) = 0. B is identical to A in direction; C is fully orthogonal. The cosine view says “A and B are about the same thing; A and C are about totally different things,” which is closer to what you actually want from a search system.
The Euclidean ranking happens to agree here, but it’s quantitatively muddled by the magnitude gap between A and B. Now imagine A and B are 1536-dimensional, the magnitude gap is correlated with document length in your corpus, and you’re asking “rank ten thousand documents by how much they match this query.” The muddle compounds.
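If you want to check the arithmetic, the toy example above fits in a few lines of NumPy:

import numpy as np

A = np.array([10.0, 0.0, 0.0])   # short doc about cats
B = np.array([ 1.0, 0.0, 0.0])   # one-sentence note about cats
C = np.array([ 0.0, 9.0, 0.0])   # short doc about thermodynamics

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(np.linalg.norm(A - B), np.linalg.norm(A - C))   # 9.0, ≈13.45
print(cos(A, B), cos(A, C))                           # 1.0, 0.0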
Where cosine is the wrong tool
Cosine isn’t universally correct. A few honest exceptions:
- When magnitude is meaningful. If you’ve designed your vectors so that “more important” or “more confident” really does mean “longer,” cosine throws that away. Some classical TF-IDF and recsys setups intentionally lean on magnitude.
- When the embedding model wasn’t trained for it. Some image and recsys embeddings are trained against Euclidean / L2 objectives; using cosine on those is a metric mismatch, and you can usually tell by reading the model card. (My read is that the AI-era text embedding models have converged hard on cosine/dot, but the broader ML world has not — I don’t have a clean survey number for what fraction goes which way.)
- For dense numerical features (sensor readings, geographic coordinates, anything where the axes have units). Cosine is mostly a semantic-space tool. On a map, Euclidean is what you want.
Famous related terms
- Dot product — a · b = Σ aᵢ bᵢ — cosine’s unnormalized cousin. Same ranking as cosine when vectors are unit-length, faster to compute, and the actual primitive vector indexes use under the hood.
- Euclidean distance (L2) — ‖a − b‖ — straight-line distance. The default in geometry, the wrong default in semantic search unless you normalize first.
- L2 normalization — a / ‖a‖ — projecting a vector onto the unit sphere. The bridge between “I have a Euclidean index” and “I want cosine semantics.”
- ANN index — the data structure (HNSW, IVF, etc.) that makes cosine/dot/L2 search sub-linear at billion-vector scale.
- Embedding — see embeddings — the reason you ever need a similarity metric in the first place.
Going deeper
- Any decent linear algebra reference on inner-product spaces — the identity ‖a − b‖² = ‖a‖² + ‖b‖² − 2(a · b) is the whole story underneath the “cosine and L2 are equivalent on the unit sphere” result.
- The docs for pgvector, FAISS, or your vector database of choice — read the section on distance metrics. Most of them note explicitly that cosine is implemented as “normalize, then dot product.”
- Embedding model cards (OpenAI, Cohere, Voyage, sentence-transformers) almost always state the recommended similarity metric. When they do, trust the card over the default.
A note on what I’m sure of and what I’m not. The mathematical claims here — the formulas, the unit-sphere equivalence, the direction-vs-magnitude framing — are standard. The empirical claim that modern text embedding models are trained such that direction carries the signal and magnitude carries noise is the consensus story I’ve seen across model cards and tutorials, but I don’t have a single canonical citation that proves it across every embedding model in use. If a specific embedding you’re using disagrees, the model card is the source of truth, not this post.