Why vector search is approximate on purpose
Exact nearest-neighbor search exists, works, and is correct. At scale, the AI-era retrieval stack quietly walks away from it. The reason is more interesting than 'it's faster.'
Why it exists
The first time you read the docs for a vector database — Pinecone, Qdrant,
Weaviate, pgvector, FAISS — there’s a phrase that should make you stop
and squint: approximate nearest neighbor. Approximate. The system whose
entire job is “find me the most similar vector to this query” is openly
telling you it might not return the actual most similar vector.
These systems do support exact search — pgvector’s flat index, FAISS’s
IndexFlat, Qdrant’s exact mode, Weaviate’s flat index. They’re just not
what most production deployments switch on once the corpus gets big.
Approximate becomes the default. That’s the strange thing. We don’t accept
it from databases elsewhere. A SQL WHERE id = 42 that returned “probably
row 42, but maybe row 39, sorry” would be a bug. Why is the AI-era
retrieval stack happy to ship the equivalent?
There are two answers, and both are worth carrying around. The boring one is that exact NN search degrades sharply in high dimensions: the data structures that make exact search fast in low-dimensional space lose most of their pruning power, and you end up not far from a linear scan over the whole corpus. The interesting one is empirical: in the workloads where ANN dominates, the embeddings themselves are noisy enough that an ANN index’s near-misses are roughly the same size as the embedding model’s own ranking jitter. So an “exact” answer to a fuzzy question is no better than an approximate answer to it. That’s a regime claim, not a law — true for text/RAG, less true for tasks with cleaner signals — and it’s the one worth understanding.
Why it matters now
If you’re building anything on RAG or semantic search, ANN is the layer your latency, recall, and bill all sit on top of. Treating it as a black box leads to predictable disappointments:
- You won’t understand the dial. Every ANN index exposes a speed-vs-accuracy knob (HNSW’s efSearch, IVF’s nprobe, etc.). Without a model of why it’s approximate, you tune it by guessing.
- You’ll over-spec. Teams routinely demand “100% recall” from vector search and then pay a steep multiple in latency or hardware for an improvement the downstream LLM can’t even see.
- You’ll misdiagnose retrieval failures. When RAG returns a bad passage, the cause is usually the embedding being a bad fit for the query, or the chunking strategy — not the ANN index missing a true neighbor. Blaming the index is a common dead end.
- You’ll pick the wrong index for the wrong reason. HNSW vs. IVF vs. ScaNN vs. flat search isn’t a flavor war; each is a different point on a curve, and which point you want depends on data size, update frequency, and how much recall you actually need.
The mental model “vector DB = approximate on purpose, and that’s fine because embeddings are fuzzy” answers most of those.
The short answer
ANN = sub-linear-time index + accept a small recall loss + big speedup, justified because the embedding distance is itself fuzzy
Exact nearest-neighbor in high dimensions tends to collapse toward linear scan — the indexes that prune well in 2D or 3D lose most of their power. ANN indexes (graph-based like HNSW, partition-based like IVF, hash-based like LSH) skip most of the data using clever structure, at the cost of occasionally missing the true top-k. In text/RAG-style workloads the cost is usually acceptable because the embedding model’s own ranking is already noisy at the margins.
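To pin down what “exact” means here: brute-force search is just a full scan with a similarity kernel, then keep the top k. A minimal NumPy sketch, with the corpus size, dimensionality, and k chosen purely for illustration:

```python
import numpy as np

# Toy corpus: 100k vectors of dimension 384 (sizes are illustrative, not advice).
rng = np.random.default_rng(0)
corpus = rng.standard_normal((100_000, 384)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)   # unit-normalize for cosine

query = rng.standard_normal(384).astype(np.float32)
query /= np.linalg.norm(query)

# Exact nearest-neighbor search: score every vector, keep the top k.
# This is the thing ANN indexes approximate -- O(N * d) work per query.
scores = corpus @ query                    # cosine similarity via dot product
k = 10
top = np.argpartition(-scores, k)[:k]      # unordered indices of the k best scores
top = top[np.argsort(-scores[top])]        # sort just those k
print(top, scores[top])
```

Every query pays for every vector; that is the bill the ANN indexes below are trying to avoid.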
How it works
Two things have to be true for the ANN trade to make sense. Both turn out to be true, for very different reasons.
1. Exact NN at scale gets ugly fast
There’s a classical result loosely called the curse of dimensionality. The compact statement: under many distributions, as dimensionality grows the distances between points concentrate, and partitioning structures that pruned aggressively in low dimensions lose most of their leverage. A k-d tree splits space at each node, but in high-dimensional spaces a query’s “nearby” region touches almost every branch, so the tree ends up walking most of the data anyway. Beyer et al. (1999) is the standard reference for the concentration effect.
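You can watch the concentration effect directly: as dimensionality grows, the farthest point stops being much farther than the nearest one, so “ignore everything far away” stops pruning. The sketch below uses i.i.d. Gaussian data, which is the textbook case of the effect rather than a model of real embeddings:

```python
import numpy as np

# Relative contrast (d_max - d_min) / d_min between a query and a random
# point set, as dimensionality grows. When this collapses toward 0, spatial
# partitioning structures lose their ability to rule out most of the data.
rng = np.random.default_rng(0)
n = 5_000

for d in (2, 16, 128, 1024):
    points = rng.standard_normal((n, d))
    q = rng.standard_normal(d)
    dists = np.linalg.norm(points - q, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative contrast: {contrast:.2f}")
```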
That leaves you with mostly-honest options for big high-dimensional corpora:
- Linear scan with a fast inner-product kernel. Correct, embarrassingly parallel, and expensive: every query touches every vector. Possible on a big GPU for moderate corpora, but the per-query cost quickly blows past what a search product can tolerate once you scale out.
- Distributed linear scan. Same per-query cost, spread across machines. Buys throughput, not latency.
Neither comfortably hits the low-millisecond, large-corpus, single-node bar that production semantic search is usually held to. So the field went looking for something that isn’t exact.
2. Approximation is often effectively free in embedding space
This is the part that’s worth internalizing, with the caveat that it’s a regime claim, not a theorem. It’s true for the workloads that drove ANN to dominance (text-style RAG, semantic search over noisy human-written corpora). It’s less true for tasks with cleaner signals, which I’ll come back to.
An embedding model maps text (or images, or whatever) into a vector. The training objective shapes which directions are close, but it does not nail down a precise distance: two equally-good paraphrases of a sentence land in nearby-but-not-identical points, and the model’s “rank” of the top results for a query is genuinely noisy near the boundary. Re-train the model with a different seed, or even use a different pooling strategy, and the ordering of result #4 vs. result #7 will jiggle. Result #1 vs. result #500 won’t.
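That jitter is cheap to fake: perturb the corpus vectors slightly (a crude stand-in for “retrain with a different seed”) and compare the two rankings for the same query. Everything below is synthetic, including the 5% noise level, so treat it as flavor rather than measurement:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10_000, 256
corpus = rng.standard_normal((n, d)).astype(np.float32)
query = rng.standard_normal(d).astype(np.float32)

# "Same model, different seed": the same corpus plus a small perturbation.
noisy = corpus + 0.05 * rng.standard_normal((n, d)).astype(np.float32)

rank_a = np.argsort(-(corpus @ query))   # ranking from the original vectors
rank_b = np.argsort(-(noisy @ query))    # ranking from the perturbed vectors

# The tops of the two lists overlap heavily but not exactly: disagreements
# tend to be local swaps near rank boundaries, not wholesale reshuffles.
for k in (5, 10, 50):
    overlap = len(set(rank_a[:k]) & set(rank_b[:k])) / k
    print(f"top-{k} overlap between the two rankings: {overlap:.2f}")
print("original #1 still in perturbed top-10:", rank_a[0] in set(rank_b[:10]))
```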
Now layer an ANN index on top. A well-tuned HNSW index typically returns the true top-k most of the time, and when it misses, it usually returns vectors that are almost as close as the true neighbor — its errors look like “swap result #5 with result #6,” not “completely miss the topic.” That error is roughly the same kind, and the same scale, as the embedding’s own ranking noise.
Stack them and you get the argument: in this regime, the ANN-vs-exact gap lives below the embedding-vs-truth gap. So spending compute to close the ANN gap doesn’t move end-to-end retrieval quality much. You’re sharpening a measurement whose underlying signal was already blurry.
This is why “high recall” benchmarks for ANN are easy to over-index on. The last few percentage points of recall often cost a sizable multiple in latency, and in an end-to-end RAG eval the difference is frequently inside the noise of the LLM’s own answer variability. (I don’t have a single canonical citation for this — it’s the folklore claim that “embedding error dominates ANN error” in noisy text workloads. The shape is well-supported by ANN benchmarks like ann-benchmarks.com, but how far it generalizes across embedding models and corpora depends on the setup.)
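For reference, the number those benchmarks report is just set overlap with the exact answer. A tiny sketch, with made-up IDs standing in for real index output:

```python
import numpy as np

def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k that the approximate search also returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

# Pretend the index swapped one true neighbor (99) for a near-miss (64).
exact  = np.array([3, 17, 42, 8, 99, 4, 23, 56, 71, 12])
approx = np.array([3, 17, 42, 8, 64, 4, 23, 56, 71, 12])
print(recall_at_k(approx, exact, k=10))   # 0.9
```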
How the indexes actually skip work
Three families dominate, and each cheats in a different direction:
- Graph-based (HNSW). Build a multi-layer graph where each vector links to a small number of others. To search, greedy-walk the graph toward the query, descending layers. Skips work because most of the vectors are never visited at all. The dial: how many candidates to expand at each step (efSearch). Higher = closer to exact, slower.
- Partition-based (IVF). Cluster all vectors into, say, 4096 cells via k-means. At query time, find the cells closest to the query and search only those. Skips work by ignoring most cells. The dial: nprobe, how many cells to inspect. Higher = closer to exact, slower. Often combined with PQ to also shrink memory.
- Hash-based (LSH). Pick hash functions where similar vectors collide more often than dissimilar ones. Look up the query’s bucket, score what’s there. The theoretical pitch is probabilistic guarantees on recall as a function of distance — clean math, and it was the dominant ANN family for years. In practice it’s usually beaten by HNSW and IVF on modern dense embeddings, but it shows up where worst-case behavior matters more than average-case throughput.
All three give you sub-linear query time and a knob that says “trade more compute for fewer mistakes.” The memory bill varies by family — graph indexes like HNSW typically end up larger than the raw vectors; IVF combined with PQ is often smaller, sometimes by an order of magnitude.
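FAISS is a convenient place to see the dials side by side, since it exposes the flat, graph, and partition families behind one API. This is a hedged sketch: the class and attribute names (IndexFlatL2, IndexHNSWFlat, IndexIVFFlat, efSearch, nprobe) are the ones I believe FAISS uses, and the specific numbers (M=32, nlist=256, nprobe=8) are arbitrary starting points, not recommendations:

```python
import numpy as np
import faiss   # pip install faiss-cpu

d, n, k = 128, 50_000, 10                  # sizes chosen for illustration
rng = np.random.default_rng(0)
xb = rng.standard_normal((n, d)).astype(np.float32)
xq = rng.standard_normal((100, d)).astype(np.float32)

# Ground truth: exact (flat) search over everything.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, gt = flat.search(xq, k)

# Graph-based: HNSW. efSearch is the speed-vs-recall dial.
hnsw = faiss.IndexHNSWFlat(d, 32)          # 32 = links per node (M)
hnsw.add(xb)
hnsw.hnsw.efSearch = 64                    # higher = closer to exact, slower
_, hnsw_ids = hnsw.search(xq, k)

# Partition-based: IVF. nprobe is the dial.
nlist = 256                                # number of k-means cells
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf.train(xb)                              # runs k-means over the corpus
ivf.add(xb)
ivf.nprobe = 8                             # cells to visit per query
_, ivf_ids = ivf.search(xq, k)

def recall(ids):
    return np.mean([len(set(a) & set(b)) / k for a, b in zip(ids, gt)])

print("HNSW recall@10:", recall(hnsw_ids))
print("IVF  recall@10:", recall(ivf_ids))
```

Turning efSearch or nprobe up walks each index along its recall-vs-latency curve toward the flat index; turning them down walks it the other way.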
Where the argument breaks down
The “approximation is free” story is contingent. A few honest exceptions:
- Small corpora. When the corpus is small enough, exact search is fast enough that you should just do it. pgvector’s flat scan, FAISS’s IndexFlat, or even NumPy will give you exact answers quickly. ANN’s fixed costs (build time, memory, parameter tuning) aren’t worth it until the dataset earns them.
- Safety-critical lookups. If you’re using vector search for dedup, copyright matching, or anything where a missed true neighbor is a correctness bug rather than a quality wobble, exact search (or ANN as candidate generation followed by exact rerank) is the standard pattern.
- Tiny k and tight thresholds. If you only care whether the closest match is below distance t, and t is small, ANN’s near-miss errors can flip the answer. Exact rerank of the top candidates is the standard fix (sketched after this list).
- The “approximation is free” claim is empirical. It depends on the embedding model’s noise floor being similar in magnitude to the ANN index’s miss rate. For embeddings trained on much cleaner signals (face recognition, fingerprint matching), the noise floor is much lower and ANN errors do show up in end-to-end quality. The text/RAG case is the friendly one.
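The rerank pattern from the list above is small enough to sketch. Assume some ANN index has already produced a candidate pool a few times larger than k; the helper below (plain NumPy, hypothetical names) recomputes exact distances over just that pool and optionally applies the tight threshold:

```python
import numpy as np

def rerank_exact(query, corpus, candidate_ids, k, max_dist=None):
    """Re-score ANN candidates with exact L2 distance; keep top-k under a threshold."""
    dists = np.linalg.norm(corpus[candidate_ids] - query, axis=1)  # exact distances
    order = np.argsort(dists)
    ids, dists = candidate_ids[order][:k], dists[order][:k]
    if max_dist is not None:                 # the tight-threshold use case
        keep = dists <= max_dist
        ids, dists = ids[keep], dists[keep]
    return ids, dists

# Illustrative use: 500 candidates (random here, from your ANN index in practice),
# exactly reranked down to 10.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((100_000, 128)).astype(np.float32)
query = rng.standard_normal(128).astype(np.float32)
candidates = rng.choice(100_000, size=500, replace=False)
print(rerank_exact(query, corpus, candidates, k=10)[0])
```

The exact work is now proportional to the candidate pool, not the corpus, which is how the pattern keeps most of ANN’s speed. Note it only fixes ranking within the pool: a true neighbor the index never surfaced is still lost.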
The thing to take away: vector DBs are approximate not because the field gave up on correctness, but because correctness is being measured against a ranking that was already approximate before the index ever saw it. The index is rounded to the precision of the question it’s being asked.
Famous related terms
- HNSW — HNSW = multi-layer graph + greedy nearest-neighbor walk — a widely deployed graph-based ANN index. Common across pgvector, Qdrant, Weaviate, and FAISS.
- IVF — IVF = k-means clusters + search only the closest cells — the partition-based alternative; often paired with PQ for memory savings.
- PQ (Product Quantization) — PQ = chop vector into chunks + quantize each chunk — substantially shrinks memory at a small accuracy cost. Why FAISS can hold very large corpora in RAM.
- LSH — LSH = hash function where similar things collide — the original ANN family; clean probabilistic guarantees, less common in production for dense embeddings today.
- Recall@k — recall@k = (exact top-k neighbors retrieved by the index) / k — the metric ANN benchmarks live or die by. The number you tune the speed-vs-accuracy dial against.
- Cosine similarity — cosine similarity = dot product / (‖a‖·‖b‖) — the metric most text-embedding ANN setups are configured to optimize; dot product and L2 are also common (a quick numeric check follows the list).
- Embeddings — embedding = learned vector representation of a discrete thing — the vectors the index is full of, and the source of the noise that makes approximation acceptable in the regime where it does.
- RAG — RAG = retriever + generator + prompt assembly — the most common reason this whole stack exists.
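A quick numeric check of the cosine entry above, and of why setups that normalize embeddings once can get away with a plain dot product at query time (whether your index expects pre-normalized vectors depends on its configuration):

```python
import numpy as np

def cosine(a, b):
    # cosine similarity = dot product / (||a|| * ||b||)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
a, b = rng.standard_normal(384), rng.standard_normal(384)
a_hat, b_hat = a / np.linalg.norm(a), b / np.linalg.norm(b)

print(cosine(a, b))          # cosine on the raw vectors
print(float(a_hat @ b_hat))  # same value: dot product on unit-normalized vectors
```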
Going deeper
- Malkov & Yashunin, Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs, 2016 — the HNSW paper. Surprisingly readable; the algorithm is short.
- Jégou, Douze, Schmid, Product Quantization for Nearest Neighbor Search, 2011 — the PQ paper. Explains how FAISS gets its memory numbers.
- The ann-benchmarks.com site — open benchmarks comparing every major ANN library on real datasets. The recall-vs-QPS curves are the clearest picture of the trade-off space.
- The FAISS wiki — Meta’s library is the de facto reference implementation; its docs are also the best general explanation of the index families.