
Why Bloom filters exist

A data structure that answers "have I seen this?" with "definitely no" or "maybe" — and saves enormous amounts of work by being wrong on purpose.

Data intro Apr 29, 2026

Why it exists

You have a billion keys somewhere — on disk, in another datacenter, behind a slow API — and a stream of incoming questions: is this one of them? If the answer is “no” 99% of the time, you’d really rather not pay the cost of looking each time. A hash set would work, but a hash set of a billion strings is huge. You don’t have the RAM. You don’t even want the RAM, because you’re going to put one of these in front of every shard, every SSTable, every cache.

What you actually need is a tiny structure that can quickly say “don’t bother, the answer is definitely no” — and is allowed to occasionally lie in the other direction (say “maybe” when the truth is “no”). A “maybe” just means you fall through to the slow path you were going to take anyway. A wrong “no” would be a correctness bug; a wrong “maybe” is just a missed optimization.

That asymmetry is the entire point. Burton Howard Bloom published the idea in 1970 (Space/Time Trade-offs in Hash Coding with Allowable Errors, CACM), at a time when memory was scarce enough that being willing to be wrong was a real currency you could spend.

Why it matters now

Memory is cheaper, but the latency and bandwidth gaps between RAM, local SSD, the network, and cold object storage are still enormous — an order of magnitude or more at each step, depending on the workload. Anywhere there’s a steep cliff between two storage tiers, a Bloom filter wants to live at the top of it.

Concrete places they show up:

  1. LSM-tree storage engines (LevelDB, RocksDB, Cassandra, HBase) keep a filter per SSTable, so a point read can skip every file that can’t contain the key.
  2. Caches and CDNs use them to avoid storing “one-hit wonders” (objects requested exactly once), caching a URL only on its second sighting.
  3. Distributed databases ship a filter of join keys to the other side of a join, so rows that can’t possibly match never cross the network.
  4. Chrome historically checked URLs against a local Bloom filter of the Safe Browsing list before making any network call.

If you’re building anything in front of a slow store — a vector index, a dedup pipeline, a cache — and you find yourself burning time on “is this thing here?” lookups that almost always come back negative, the Bloom filter is the shape of the answer.

The short answer

Bloom filter = bit array + k hash functions

To insert x, you hash it k different ways, take each hash mod the array length, and flip those k bits to 1. To query x, you hash it the same k ways and check those bits. If any of them is 0, x is definitely not in the set. If all of them are 1, x is probably in the set — but it might just be that other insertions collectively flipped exactly those bits. There’s no remove (in the basic version), and no enumerate. You traded both for size.
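Here is the whole mechanism as a minimal Python sketch. The class and method names are mine, and salting a single SHA-256 per probe is just one easy way to fake k independent hash functions (the next section covers the trick real implementations use):

    import hashlib

    class BloomFilter:
        def __init__(self, m: int, k: int):
            self.m = m                            # number of bits
            self.k = k                            # number of hash functions
            self.bits = bytearray((m + 7) // 8)   # m bits, all zero

        def _indices(self, key: str):
            # k "different" hash functions, faked by salting one hash
            # with the probe number i.
            for i in range(self.k):
                digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.m

        def add(self, key: str) -> None:
            for idx in self._indices(key):
                self.bits[idx // 8] |= 1 << (idx % 8)   # flip bit to 1

        def might_contain(self, key: str) -> bool:
            # Any zero bit means "definitely no"; all ones means "maybe".
            return all((self.bits[idx // 8] >> (idx % 8)) & 1
                       for idx in self._indices(key))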

How it works

Picture a bit array of m bits, all zero. Pick k independent hash functions that map keys uniformly into [0, m). In practice you don’t keep k real hash functions around — you take two good base hashes h1, h2 (often by splitting a single 128-bit hash like MurmurHash3 or xxHash in half) and derive each of the k indices as (h1(x) + i · h2(x)) mod m for i = 0..k−1. That’s Kirsch & Mitzenmacher’s “less hashing, same performance” result; the common implementation pattern of just splitting one 128-bit hash is the practitioner’s shortcut on top of it.
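That derivation is short enough to show in full. A sketch, with MD5 standing in for a fast 128-bit hash purely because the standard library has one (a real implementation would reach for MurmurHash3 or xxHash):

    import hashlib

    def probe_indices(key: str, m: int, k: int) -> list[int]:
        # Split one 128-bit digest into two 64-bit halves h1 and h2,
        # then derive g_i(x) = (h1 + i * h2) mod m for i = 0..k-1.
        digest = hashlib.md5(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        # Keep h2 odd: if h2 were 0, all k probes would collapse onto one index.
        h2 = int.from_bytes(digest[8:], "big") | 1
        return [(h1 + i * h2) % m for i in range(k)]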

Insert "alice":

  1. Hash to, say, indices [3, 17, 42, 88] (with k=4).
  2. Set those bits to 1.

Insert "bob":

  1. Hash to, say, [3, 22, 71, 99].
  2. Set those bits to 1. Index 3 was already 1 — fine, it stays 1.

Query "carol":

  1. Hash to, say, [3, 22, 42, 88].
  2. All four bits are 1. The filter says “maybe.” This is a false positive: nobody inserted "carol", but Alice and Bob between them happened to flip every bit Carol’s hashes hit.

Query "dave":

  1. Hash to [3, 17, 50, 99].
  2. Bit 50 is 0. The filter says “definitely no.” This is never a false negative — once a key is in, every one of its k bits is 1, so the answer is always “maybe” or better.
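Running the BloomFilter sketch from above with a deliberately undersized array makes this concrete. The keys below are arbitrary, and with real hashes the specific colliding keys will differ from run to run:

    bf = BloomFilter(m=16, k=4)        # tiny on purpose, so collisions are likely
    bf.add("alice")
    bf.add("bob")

    print(bf.might_contain("alice"))   # always True: no false negatives
    absent = [f"user{i}" for i in range(1000)]   # never inserted
    maybes = sum(bf.might_contain(key) for key in absent)
    print(f"{maybes} of {len(absent)} absent keys came back 'maybe'")
    # The formula below predicts ~2.4% false positives at m=16, k=4, n=2.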

The math that matters is the false-positive rate. After inserting n items into m bits with k hashes, the probability a random query returns a false positive is approximately (1 − e^(−kn/m))^k. Two consequences worth remembering:

  1. For fixed m and n, the rate is minimized at k = (m/n) · ln 2 ≈ 0.69 · m/n hashes. Fewer hashes don’t set enough bits to discriminate; more hashes fill the array too quickly.
  2. At that optimal k, the rate is about 0.6185^(m/n): roughly 10 bits per key buys a 1% false-positive rate, and each additional ~4.8 bits per key cuts it by another factor of ten.

These are well-known textbook formulas; the derivation is in any standard probabilistic-data-structures reference. I’m not citing a specific source for the constants because they fall straight out of the equation above.
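If you’d rather check those numbers than derive them, a few lines of Python will do it (the parameters below are arbitrary):

    import math

    def false_positive_rate(n: int, m: int, k: int) -> float:
        # Approximate FP probability after n insertions into m bits, k hashes.
        return (1 - math.exp(-k * n / m)) ** k

    n = 1_000_000
    for bits_per_key in (8, 10, 12, 16):
        m = n * bits_per_key
        k = max(1, round(m / n * math.log(2)))   # optimal k ≈ 0.69 · m/n
        print(f"{bits_per_key} bits/key, k={k}: "
              f"FP ≈ {false_positive_rate(n, m, k):.4%}")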

The seams — places the abstraction leaks:

  1. No deletes. Clearing a bit might erase evidence of some other key that hashed to it. Counting Bloom filters replace each bit with a small counter to get deletion back, at roughly 4× the space.
  2. No resizing. m and k are fixed at construction, so once n grows past what you sized for, the false-positive rate climbs and never recovers. You rebuild from the source data, or chain filters in a scalable variant.
  3. The math assumes good hashing. The formula above treats the hash functions as uniform and independent; a biased or correlated hash quietly pushes the real false-positive rate past what you budgeted.
  4. k probes are k potential cache misses. The indices are deliberately scattered, which is exactly what CPUs hate. Blocked Bloom filters confine each key’s bits to a single cache line, trading a slightly worse rate for fewer misses.
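The first seam is the one people hit most often, so here is a sketch of the counting variant (naming is hypothetical, as before; real implementations pack 4-bit counters rather than full integers):

    import hashlib

    class CountingBloomFilter:
        def __init__(self, m: int, k: int):
            self.m, self.k = m, k
            self.counts = [0] * m        # a counter per slot instead of a bit

        def _indices(self, key: str):
            digest = hashlib.md5(key.encode()).digest()
            h1 = int.from_bytes(digest[:8], "big")
            h2 = int.from_bytes(digest[8:], "big") | 1
            return [(h1 + i * h2) % self.m for i in range(self.k)]

        def add(self, key: str) -> None:
            for idx in self._indices(key):
                self.counts[idx] += 1

        def remove(self, key: str) -> None:
            # Only safe for keys that were actually added; removing an
            # absent key corrupts the counts of whatever shares its slots.
            for idx in self._indices(key):
                self.counts[idx] -= 1

        def might_contain(self, key: str) -> bool:
            return all(self.counts[idx] > 0 for idx in self._indices(key))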

Going deeper