Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why vector search is approximate on purpose

Exact nearest-neighbor search exists, works, and is correct. At scale, the AI-era retrieval stack quietly walks away from it. The reason is more interesting than “it’s faster.”

AI & ML intermediate Apr 29, 2026

Why it exists

The first time you read the docs for a vector database — Pinecone, Qdrant, Weaviate, pgvector, FAISS — there’s a phrase that should make you stop and squint: approximate nearest neighbor. Approximate. The system whose entire job is “find me the most similar vector to this query” is openly telling you it might not return the actual most similar vector.

These systems do support exact search — pgvector’s flat index, FAISS’s IndexFlat, Qdrant’s exact mode, Weaviate’s flat index. They’re just not what most production deployments switch on once the corpus gets big. Approximate becomes the default. That’s the strange thing. We don’t accept it from databases elsewhere. A SQL WHERE id = 42 that returned “probably row 42, but maybe row 39, sorry” would be a bug. Why is the AI-era retrieval stack happy to ship the equivalent?

There are two answers, and both are worth carrying around. The boring one is that exact NN search degrades sharply in high dimensions: the data structures that make exact search fast in low-dimensional space lose most of their pruning power, and you end up not far from a linear scan over the whole corpus. The interesting one is empirical: in the workloads where ANN dominates, the embeddings themselves are noisy enough that an ANN index’s near-misses are roughly the same size as the embedding model’s own ranking jitter. So an “exact” answer to a fuzzy question is no better than an approximate answer to it. That’s a regime claim, not a law — true for text/RAG, less true for tasks with cleaner signals — and it’s the one worth understanding.

Why it matters now

If you’re building anything on RAG or semantic search, ANN is the layer your latency, recall, and bill all sit on top of. Treating it as a black box leads to predictable disappointments: recall that quietly sags as the corpus grows, latency that jumps when a single search parameter is nudged, memory footprints well beyond the raw vectors, and benchmark-chasing for recall points that no end-to-end eval can detect.

The mental model “vector DB = approximate on purpose, and that’s fine because embeddings are fuzzy” answers most of those.

The short answer

ANN = a sub-linear-time index + a small accepted recall loss + a big speedup, justified because the embedding distance is itself fuzzy

Exact nearest-neighbor in high dimensions tends to collapse toward linear scan — the indexes that prune well in 2D or 3D lose most of their power. ANN indexes (graph-based like HNSW, partition-based like IVF, hash-based like LSH) skip most of the data using clever structure, at the cost of occasionally missing the true top-k. In text/RAG-style workloads the cost is usually acceptable because the embedding model’s own ranking is already noisy at the margins.
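
To make the trade concrete, here is a minimal sketch in FAISS comparing an exact flat index against an HNSW index on the same corpus. Random vectors stand in for real embeddings, and the sizes and HNSW parameters (M, efSearch) are illustrative choices rather than recommendations; absolute timings depend on your machine.

```python
# Sketch: exact (IndexFlatL2) vs. approximate (IndexHNSWFlat) search in FAISS.
# Random vectors stand in for embeddings; sizes and parameters are illustrative.
import time

import faiss
import numpy as np

d, n, n_q, k = 384, 100_000, 1_000, 10
rng = np.random.default_rng(0)
corpus = rng.standard_normal((n, d)).astype("float32")
queries = rng.standard_normal((n_q, d)).astype("float32")

# Exact search: brute-force distance to every vector in the corpus.
flat = faiss.IndexFlatL2(d)
flat.add(corpus)
t0 = time.perf_counter()
_, exact_ids = flat.search(queries, k)
t_exact = time.perf_counter() - t0

# Approximate search: HNSW graph index.
hnsw = faiss.IndexHNSWFlat(d, 32)   # 32 = graph connectivity (M)
hnsw.hnsw.efConstruction = 200      # build-time effort
hnsw.add(corpus)
hnsw.hnsw.efSearch = 64             # query-time speed/recall knob
t0 = time.perf_counter()
_, ann_ids = hnsw.search(queries, k)
t_ann = time.perf_counter() - t0

# Recall@k: fraction of the true top-k that the ANN result set contains.
recall = np.mean([len(set(exact_ids[i]) & set(ann_ids[i])) / k for i in range(n_q)])
print(f"exact: {t_exact:.2f}s   hnsw: {t_ann:.2f}s   recall@{k}: {recall:.3f}")
```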

How it works

Two things have to be true for the ANN trade to make sense. Both turn out to be true, for very different reasons.

1. Exact NN at scale gets ugly fast

There’s a classical result loosely called the curse of dimensionality. The compact statement: under many distributions, as dimensionality grows the distances between points concentrate, and partitioning structures that pruned aggressively in low dimensions lose most of their leverage. A k-d tree splits space at each node, but in high-dimensional spaces a query’s “nearby” region touches almost every branch, so the tree ends up walking most of the data anyway. Beyer et al. (1999) is the standard reference for the concentration effect.
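
The concentration effect is easy to see in a toy experiment: draw random points, pick a query, and watch the relative gap between the nearest and farthest point shrink as the dimension grows. The numpy sketch below uses uniform random data, which is not what embeddings look like, but the direction of the trend is the point.

```python
# Sketch: distance concentration on random data. As d grows, the nearest and
# farthest points from a query end up at nearly the same distance, so pruning
# "by distance" stops cutting off much of the search space.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
for d in (2, 10, 100, 1_000):
    points = rng.random((n, d))
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:>5}  (max - min) / min = {contrast:.3f}")
```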

That leaves you with two mostly-honest options for big high-dimensional corpora: brute-force the scan, computing the distance from the query to every single vector, or build an exact spatial index (k-d trees, ball trees and friends) that, past a few dozen dimensions, ends up touching most of the data anyway.

Neither comfortably hits the low-millisecond, large-corpus, single-node bar that production semantic search is usually held to. So the field went looking for something that isn’t exact.
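
Both options are easy to stress-test. The sketch below times scikit-learn's exact k-d tree against a plain brute-force scan on random data as the dimension grows; the sizes are arbitrary and the exact crossover depends on data and hardware, but the pattern to look for is the tree's advantage fading.

```python
# Sketch: an exact k-d tree vs. a brute-force scan as dimensionality grows.
# Random data and arbitrary sizes; timings vary by machine, the trend is the point.
import time

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n, n_q, k = 50_000, 200, 10

for d in (3, 30, 300):
    X = rng.standard_normal((n, d))
    q = rng.standard_normal((n_q, d))
    for algo in ("kd_tree", "brute"):
        nn = NearestNeighbors(n_neighbors=k, algorithm=algo).fit(X)
        t0 = time.perf_counter()
        nn.kneighbors(q)                 # exact top-k either way
        print(f"d={d:>4}  {algo:>7}: {time.perf_counter() - t0:.3f}s")
```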

2. Approximation is often effectively free in embedding space

This is the part that’s worth internalizing, with the caveat that it’s a regime claim, not a theorem. It’s true for the workloads that drove ANN to dominance (text-style RAG, semantic search over noisy human-written corpora). It’s less true for tasks with cleaner signals, which I’ll come back to.

An embedding model maps text (or images, or whatever) into a vector. The training objective shapes which directions are close, but it does not nail down a precise distance: two equally-good paraphrases of a sentence land in nearby-but-not-identical points, and the model’s “rank” of the top results for a query is genuinely noisy near the boundary. Re-train the model with a different seed, or even use a different pooling strategy, and the ordering of result #4 vs. result #7 will jiggle. Result #1 vs. result #500 won’t.

Now layer an ANN index on top. A well-tuned HNSW index typically returns the true top-k most of the time, and when it misses, it usually returns vectors that are almost as close as the true neighbor — its errors look like “swap result #5 with result #6,” not “completely miss the topic.” That error is roughly the same kind, and the same scale, as the embedding’s own ranking noise.
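
You can check the near-miss claim on your own data by asking where the items an HNSW index returns actually sit in the exact ranking. A sketch with FAISS, again on random stand-in vectors with illustrative parameters:

```python
# Sketch: where do HNSW's "misses" land in the exact ranking? If the answer is
# "just outside the top-k", the errors are swaps, not topic changes.
# Random stand-in vectors; illustrative parameters.
import faiss
import numpy as np

d, n, n_q, k, depth = 384, 50_000, 500, 10, 100
rng = np.random.default_rng(0)
corpus = rng.standard_normal((n, d)).astype("float32")
queries = rng.standard_normal((n_q, d)).astype("float32")

flat = faiss.IndexFlatL2(d)
flat.add(corpus)
exact_ids = flat.search(queries, depth)[1]   # exact ranking, top-100 per query

hnsw = faiss.IndexHNSWFlat(d, 32)
hnsw.add(corpus)
hnsw.hnsw.efSearch = 64
ann_ids = hnsw.search(queries, k)[1]

# For each query, find the deepest exact rank among the ANN's top-k results
# ("depth" stands for "beyond the exact top-100").
worst_ranks = []
for i in range(n_q):
    rank_of = {int(v): r for r, v in enumerate(exact_ids[i])}
    worst_ranks.append(max(rank_of.get(int(v), depth) for v in ann_ids[i]))
print(f"median deepest exact rank inside the ANN top-{k}: {np.median(worst_ranks):.0f}")
```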

Stack them and you get the argument: in this regime, the ANN-vs-exact gap lives below the embedding-vs-truth gap. So spending compute to close the ANN gap doesn’t move end-to-end retrieval quality much. You’re sharpening a measurement whose underlying signal was already blurry.
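
That comparison can be sketched, though only as a simulation: below, the embedding jitter is faked by adding small Gaussian noise to the corpus vectors, standing in for a re-trained model or a different pooling strategy, and the noise scale of 0.05 is an arbitrary illustration rather than a measured property of any real model. What matters is the shape of the measurement: overlap with exact ground truth for a jittered-but-exact search versus an unjittered-but-approximate one.

```python
# Sketch: compare two gaps against the same exact ground truth.
#   Gap 1: exact search over a slightly perturbed ("jittered") embedding space,
#          a stand-in for re-training the embedding model or changing pooling.
#   Gap 2: HNSW approximate search over the original, unperturbed space.
# The 0.05 noise scale is an arbitrary illustration, not a measured quantity.
import faiss
import numpy as np

d, n, n_q, k = 384, 50_000, 500, 10
rng = np.random.default_rng(0)
corpus = rng.standard_normal((n, d)).astype("float32")
queries = rng.standard_normal((n_q, d)).astype("float32")

def exact_topk(xb, xq, k):
    index = faiss.IndexFlatL2(xb.shape[1])
    index.add(xb)
    return index.search(xq, k)[1]

truth = exact_topk(corpus, queries, k)

# Gap 1: exact search, jittered vectors.
jitter = (0.05 * rng.standard_normal(corpus.shape)).astype("float32")
jittered = exact_topk(corpus + jitter, queries, k)

# Gap 2: approximate (HNSW) search, original vectors.
hnsw = faiss.IndexHNSWFlat(d, 32)
hnsw.add(corpus)
hnsw.hnsw.efSearch = 64
ann = hnsw.search(queries, k)[1]

def overlap(a, b):
    return np.mean([len(set(a[i]) & set(b[i])) / k for i in range(n_q)])

print(f"overlap with truth, jittered exact search: {overlap(truth, jittered):.3f}")
print(f"overlap with truth, HNSW on original:      {overlap(truth, ann):.3f}")
```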

This is why “high recall” benchmarks for ANN are easy to over-index on. The last few percentage points of recall often cost a sizable multiple in latency, and in an end-to-end RAG eval the difference is frequently inside the noise of the LLM’s own answer variability. (I don’t have a single canonical citation for this — it’s the folklore claim that “embedding error dominates ANN error” in noisy text workloads. The shape is well-supported by ANN benchmarks like ann-benchmarks.com, but how universally it holds across embedding models and corpora varies by setup.)

How the indexes actually skip work

Three families dominate, and each cheats in a different direction:

- Graph-based (HNSW): build a navigable graph over the vectors, then answer a query by greedily walking the graph from an entry point toward the query, computing distances only along the path.
- Partition-based (IVF): cluster the corpus up front, then scan only the few partitions whose centroids sit closest to the query, often with product quantization (PQ) to compress the stored vectors.
- Hash-based (LSH): hash vectors so that similar ones tend to collide, then compare the query only against whatever shares its buckets.

All three give you sub-linear query time and a knob that says “trade more compute for fewer mistakes.” The memory bill varies by family — graph indexes like HNSW typically end up larger than the raw vectors; IVF combined with PQ is often smaller, sometimes by an order of magnitude.
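
The knob is literal. On HNSW it is efSearch; on an IVF index in FAISS it is nprobe, the number of partitions scanned per query. A sketch of sweeping it, with random stand-in vectors and illustrative sizes:

```python
# Sketch: sweeping the compute/recall knob on an IVF index in FAISS.
# nprobe = how many of the nlist partitions get scanned per query.
# Random stand-in vectors; nlist and sizes are illustrative.
import time

import faiss
import numpy as np

d, n, n_q, k, nlist = 384, 100_000, 1_000, 10, 256
rng = np.random.default_rng(0)
corpus = rng.standard_normal((n, d)).astype("float32")
queries = rng.standard_normal((n_q, d)).astype("float32")

# Exact ground truth for measuring recall.
flat = faiss.IndexFlatL2(d)
flat.add(corpus)
truth = flat.search(queries, k)[1]

quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf.train(corpus)          # k-means over the corpus to form the partitions
ivf.add(corpus)

for nprobe in (1, 4, 16, 64):
    ivf.nprobe = nprobe
    t0 = time.perf_counter()
    ids = ivf.search(queries, k)[1]
    dt = time.perf_counter() - t0
    recall = np.mean([len(set(truth[i]) & set(ids[i])) / k for i in range(n_q)])
    print(f"nprobe={nprobe:>3}  {dt:.3f}s  recall@{k}: {recall:.3f}")
```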

Where the argument breaks down

The “approximation is free” story is contingent. A few honest exceptions:

- Cleaner signals. Near-duplicate detection, fingerprinting, and exact-entity lookup have a true nearest neighbor that really is the answer; there, an index swapping #1 for #6 is an error, not noise, and exact or very-high-recall search earns its cost.
- Small corpora. If a flat index already answers in a few milliseconds, the approximate index buys speed you didn't need at the price of tuning and recall you now have to monitor.
- Recall as the requirement. Deduplication passes, compliance-style “find every relevant document” retrieval, and building ground truth for your own benchmarks all treat a missed neighbor as a real miss, whatever the embedding noise.

The thing to take away: vector DBs are approximate not because the field gave up on correctness, but because correctness is being measured against a ranking that was already approximate before the index ever saw it. The index is rounded to the precision of the question it’s being asked.

Going deeper