Why Bloom filters exist
A data structure that answers "have I seen this?" with "definitely no" or "maybe" — and saves enormous amounts of work by being wrong on purpose.
Why it exists
You have a billion keys somewhere — on disk, in another datacenter, behind a slow API — and a stream of incoming questions: is this one of them? If the answer is “no” 99% of the time, you’d really rather not pay the cost of looking each time. A hash set would work, but a hash set of a billion strings is huge. You don’t have the RAM. You don’t even want the RAM, because you’re going to put one of these in front of every shard, every SSTable, every cache.
What you actually need is a tiny structure that can quickly say “don’t bother, the answer is definitely no” — and is allowed to occasionally lie in the other direction (say “maybe” when the truth is “no”). A “maybe” just means you fall through to the slow path you were going to take anyway. A wrong “no” would be a correctness bug; a wrong “maybe” is just a missed optimization.
That asymmetry is the entire point. Burton Howard Bloom published the idea in 1970 (Space/Time Trade-offs in Hash Coding with Allowable Errors, CACM), at a time when memory was scarce enough that being willing to be wrong was a real currency you could spend.
Why it matters now
Memory is cheaper, but the latency and bandwidth gaps between RAM, local SSD, the network, and cold object storage are still enormous — orders of magnitude in each step, depending on the workload. Anywhere there’s a steep cliff between two storage tiers, a Bloom filter wants to live at the top of it.
Concrete places they show up:
- LSM-tree storage engines (RocksDB, Cassandra, LevelDB, HBase) commonly attach a Bloom filter to each on-disk sorted file (SSTable in RocksDB/Cassandra terms, HFile in HBase) so reads can skip files that definitely don’t contain the key. It’s usually configurable, not always on. See lsm-trees.
- CDNs and web caches sometimes use approximate-membership filters (Bloom or cuckoo) to avoid expensive negative lookups against a slower origin. I don’t have a clean public reference for which big CDN uses what — treat this as “the technique fits the shape of the problem,” not a vendor claim.
- Browsers historically used them for malicious-URL checks. Chromium shipped a Safe Browsing implementation backed by Bloom filters, then switched to an exact-prefix-set design around 2012. The current Safe Browsing protocol has moved on again; I’m not going to pretend to know its 2026 internals.
- Bigtable’s 2006 paper explicitly describes per-SSTable Bloom filters as a way to drastically reduce disk seeks for reads — especially reads for rows or columns that don’t exist.
- Bitcoin’s SPV wallets historically used them via BIP 37 to ask full nodes for transactions matching a set of addresses without sending the set in the clear. The privacy story turned out to be weak in practice — the filter parameters wallets chose often made the set recoverable — and the recurring lesson is that Bloom filters leak information sideways.
If you’re building anything in front of a slow store — a vector index, a dedup pipeline, a cache — and you find yourself burning time on “is this thing here?” lookups that almost always come back negative, the Bloom filter is the shape of the answer.
The short answer
Bloom filter = bit array + k hash functions
To insert x, you hash it k different ways, take each hash mod the array length, and flip those k bits to 1. To query x, you hash it the same k ways and check those bits. If any of them is 0, x is definitely not in the set. If all of them are 1, x is probably in the set — but it might just be that other insertions collectively flipped exactly those bits. There’s no remove (in the basic version), and no enumerate. You traded both for size.
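Here is a minimal sketch of that in Python. The class name, the parameters, and the use of blake2b with a per-probe salt are illustrative choices, not a reference implementation; a production filter would use a faster non-cryptographic hash and the double-hashing trick described below.

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m = m                            # number of bits
        self.k = k                            # number of hash functions
        self.bits = bytearray((m + 7) // 8)   # bit array, all zero

    def _indices(self, key: str):
        # Derive k indices by salting one keyed hash with the probe number.
        for i in range(self.k):
            h = hashlib.blake2b(key.encode(), salt=i.to_bytes(8, "little"))
            yield int.from_bytes(h.digest()[:8], "little") % self.m

    def add(self, key: str) -> None:
        for idx in self._indices(key):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely not present"; True means "maybe".
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indices(key))
```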
How it works
Picture a bit array of m bits, all zero. Pick k independent hash functions that map keys uniformly into [0, m). In practice you don’t keep k real hash functions around — you take two good base hashes h1, h2 (often by splitting a single 128-bit hash like MurmurHash3 or xxHash in half) and derive each of the k indices as (h1(x) + i · h2(x)) mod m for i = 0..k−1. That’s Kirsch & Mitzenmacher’s “less hashing, same performance” result; the common implementation pattern of just splitting one 128-bit hash is the practitioner’s shortcut on top of it.
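As a concrete illustration of that derived-index trick (not the exact code any particular library uses), here md5 stands in for a fast 128-bit hash like MurmurHash3 or xxHash, purely because it ships in the standard library:

```python
import hashlib

def probe_indices(key: str, m: int, k: int) -> list[int]:
    # One 128-bit digest, split into two 64-bit base hashes.
    digest = hashlib.md5(key.encode()).digest()
    h1 = int.from_bytes(digest[:8], "little")
    # Forcing h2 odd is a common defensive tweak so the probe sequence
    # doesn't collapse into a short cycle when m is a power of two.
    h2 = int.from_bytes(digest[8:], "little") | 1
    # Kirsch & Mitzenmacher: all k indices from just two base hashes.
    return [(h1 + i * h2) % m for i in range(k)]

print(probe_indices("alice", m=1 << 20, k=7))   # seven indices in [0, 2^20)
```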
Insert "alice":
- Hash to, say, indices
[3, 17, 42, 88](withk=4). - Set those bits to 1.
Insert "bob":
- Hash to, say,
[3, 22, 71, 99]. - Set those bits to 1. Index 3 was already 1 — fine, it stays 1.
Query "carol":
- Hash to, say,
[3, 22, 42, 88]. - All four bits are 1. The filter says “maybe.” This is a false positive: nobody inserted
"carol", but Alice and Bob between them happened to flip every bit Carol’s hashes hit.
Query "dave":
- Hash to
[3, 17, 50, 99]. - Bit 50 is 0. The filter says “definitely no.” This is never a false negative — once a key is in, every one of its
kbits is 1, so the answer is always “maybe” or better.
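Replaying the walkthrough with the sketch from earlier (the indices above were invented for the example; real hashes won't land on 3, 17, 42, 88, and at a reasonable filter size "carol" will usually get a true "definitely no"):

```python
bf = BloomFilter(m=128, k=4)
bf.add("alice")
bf.add("bob")
print(bf.might_contain("alice"))   # always True: no false negatives
print(bf.might_contain("carol"))   # usually False here; True would be a false positive
```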
The math that matters is the false-positive rate. After inserting n items into m bits with k hashes, the probability a random query returns a false positive is approximately (1 − e^(−kn/m))^k. Two consequences worth remembering:
- For a fixed target false-positive rate p, the optimal bit count is m ≈ −n · ln(p) / (ln 2)² — about 9.6 bits per key for 1% false positives, 14.4 bits per key for 0.1%. That’s the headline number: a Bloom filter for 1B keys at 1% false-positive rate is ~1.2 GB (≈ 1.12 GiB), versus a real hash set of those same keys typically running into many tens of GB once you account for key length, pointers, and runtime overhead. Easily an order of magnitude smaller in practice.
- The optimal k at that sizing is k = (m/n) · ln 2 — about 7 hashes for 1%, 10 for 0.1%. (A few lines of code after this list check these constants.)
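A quick sanity check of those constants, under the usual convention of rounding m up and k to the nearest integer:

```python
import math

def bloom_size(n: int, p: float) -> tuple[int, int]:
    """Optimal bit count m and hash count k for n keys at target FPR p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = round(m / n * math.log(2))
    return m, k

n = 1_000_000_000
m, k = bloom_size(n, 0.01)
print(m / n)         # ~9.59 bits per key
print(m / 8 / 1e9)   # ~1.20 GB for the whole filter
print(k)             # 7 hash functions
```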
These are well-known textbook formulas; the derivation is in any standard probabilistic-data-structures reference. I’m not citing a specific source for the constants because they fall straight out of the equation above.
The seams — places the abstraction leaks:
- No deletion. Clearing the k bits on remove would also clear bits other keys depend on. Counting Bloom filters fix this by storing small counters instead of bits, typically at ~3–4x the space cost depending on counter width. There are also cuckoo filters (Fan et al., 2014) that support deletion; for many practical workloads at low false-positive rates (below ~3%) they can use less space than space-optimized Bloom filters, and are a strong alternative when you need deletes. Bloom is still everywhere because it’s older, simpler, and good enough.
- You have to know n ahead of time. Insert too many keys and the false-positive rate blows up. As a rule of thumb, when kn/m ≈ 1 the filter is already well past half full — the expected fraction of 1-bits is 1 − 1/e ≈ 63% — and false positives have climbed well above whatever you originally sized for. Scalable Bloom filters sidestep this by layering filters of growing size to handle unknown n, at the cost of more lookups. (The simulation after this list shows the blow-up directly.)
- Cache behavior matters more than you’d guess. A naive Bloom filter for billions of keys touches k random cache lines per query. Blocked Bloom filters (Putze, Sanders, Singler, 2007) keep all k probes inside one cache line, which can substantially improve throughput on modern CPUs at a small hit to false-positive rate. If you’re using Bloom filters at hot-path scale, this is where the real performance lives.
- They leak set-membership in adversarial settings. Responses to repeated membership queries can leak information about what’s in the set. The Bitcoin SPV BIP 37 story is the canonical cautionary tale: in practice, the filter parameters wallets chose were often loose enough that observers could recover the address set and deanonymize the wallet.
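To see the second seam concretely, here is a small simulation reusing the BloomFilter sketch from earlier: size for n keys at a 1% target, then insert past n and measure the false-positive rate against keys that were never added.

```python
import math

n, p = 10_000, 0.01
m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
k = round(m / n * math.log(2))
bf = BloomFilter(m, k)                            # sketch class from above

absent = [f"absent-{i}" for i in range(10_000)]   # keys never inserted
inserted = 0
for target in (n, 2 * n, 4 * n):
    while inserted < target:
        bf.add(f"present-{inserted}")
        inserted += 1
    fpr = sum(bf.might_contain(q) for q in absent) / len(absent)
    print(f"{target:>6} keys in a filter sized for {n}: measured FPR ~ {fpr:.1%}")
```

With these numbers you should see roughly 1% at n, mid-teens percent at 2n, and around two-thirds at 4n, which is what (1 − e^(−kn/m))^k predicts as the load climbs.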
Famous related terms
- Hash table — hash table = array + hash function — the exact-answer cousin; trades memory for never being wrong. See hash-table.
- Cuckoo filter — cuckoo filter ≈ Bloom filter + deletion + better space at low false-positive rates — newer (2014) alternative; a strong default when deletions matter.
- HyperLogLog — HyperLogLog = hash + count leading zeros + harmonic mean — answers a different probabilistic question (how many distinct items?) in a few KB. Same spirit: be wrong on purpose, save orders of magnitude.
- Counting Bloom filter — counting Bloom = Bloom filter + small counters instead of bits — supports remove, costs more space.
- Scalable Bloom filter — scalable Bloom = sequence of Bloom filters of growing size — handles unknown n without blowing up the false-positive rate.
Going deeper
- Burton H. Bloom — Space/Time Trade-offs in Hash Coding with Allowable Errors, Communications of the ACM 13(7):422–426, July 1970. The original.
- Kirsch & Mitzenmacher — Less Hashing, Same Performance: Building a Better Bloom Filter, ESA 2006 (journal version 2008). Why you only need two real hash functions.
- Fan, Andersen, Kaminsky, Mitzenmacher — Cuckoo Filter: Practically Better Than Bloom, CoNEXT 2014.
- Putze, Sanders, Singler — Cache-, Hash- and Space-Efficient Bloom Filters, 2007. The blocked Bloom filter.
- Mitzenmacher & Upfal, Probability and Computing — a standard reference with a clean derivation of the false-positive formula.