Why B-trees still dominate database indexes
Disks read in blocks, not bytes. B-trees were designed around that one fact — and decades later, even on SSDs, no one has dethroned them.
Why it exists
You have a billion rows in a table and you want the one where user_id = 8429173. Reading every row to find it is obviously absurd. But the obvious alternatives are also bad: a sorted array would let you binary-search, but you'd have to rewrite huge chunks of it on every insert. A hash table gives you O(1) lookups but loses the ordering you need for WHERE created_at > '2026-01-01' or ORDER BY name LIMIT 50.
The deeper problem is that databases don’t live in RAM. They live on storage that’s read in blocks — historically a 4 KB or 8 KB chunk for a spinning disk, today often a 4 KB page on an SSD. Reading one byte and reading a whole page cost roughly the same. So the right data structure isn’t the one that minimizes comparisons; it’s the one that minimizes how many pages you touch to find your row.
B-trees are the data structure shaped exactly around that constraint.
Why it matters now
Every relational database engineers reach for in 2026 — Postgres, MySQL/InnoDB, SQLite, SQL Server, Oracle — uses a B-tree variant as the default index. So do most embedded key-value stores that need ordered iteration. When you write CREATE INDEX without specifying a type, you almost certainly get a B-tree.
That’s not because nobody tried alternatives. LSM trees power RocksDB, Cassandra, and most of the NoSQL wave. Hash indexes exist. Skip lists exist. Bitmap indexes exist. They all win on specific workloads. B-trees keep winning the default slot because the workload most applications actually have — mixed reads and writes, range queries, point lookups, ordered scans — is exactly what B-trees are tuned for.
The short answer
B-tree = sorted index + fat nodes sized to a disk page
A B-tree is a balanced search tree where each node holds hundreds of keys instead of one or two. That fan-out is the whole trick: with ~500 keys per node, a tree only 4 levels deep can index something like 500⁴ ≈ 62 billion entries. Four page reads to find any row. Compare to a binary tree, which would need ~36 levels for the same data and 36 page reads.
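The fan-out arithmetic is easy to check yourself. A quick sketch (the 500-keys-per-node figure is illustrative, as above):

```python
import math

def btree_depth(n_entries: int, fanout: int) -> int:
    """Number of levels (i.e. page reads) needed to index n_entries
    when each node routes to `fanout` children."""
    return math.ceil(math.log(n_entries, fanout))

print(btree_depth(1_000_000_000, 500))  # fat nodes: 4 page reads
print(btree_depth(1_000_000_000, 2))    # binary tree: 30 page reads
```

Even at a billion entries, fan-out of a few hundred keeps the tree four levels deep, while a binary tree needs ten times the page reads.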
How it works
A B-tree has three kinds of nodes:
- Root — the entry point; one per tree.
- Internal nodes — hold sorted keys plus pointers to child nodes.
- Leaf nodes — hold the actual indexed values (or pointers to the row).
Each node is sized to fit one disk page. To look up a key, you load the root, binary-search its keys to find which child to descend into, load that child, and repeat. Because each node holds hundreds of keys, you eliminate hundreds of branches per page read. That’s where the depth-4 number comes from.
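The descent described above can be sketched in a few lines, assuming nodes are plain dicts standing in for disk pages (a toy model, not any real engine's layout):

```python
from bisect import bisect_left, bisect_right

# Node shapes in this sketch:
#   internal: {"keys": [...], "children": [...]}  with len(children) == len(keys) + 1
#   leaf:     {"keys": [...], "values": [...]}
def search(node, key):
    while "children" in node:                # each iteration = one page read
        i = bisect_right(node["keys"], key)  # binary search within the node
        node = node["children"][i]
    i = bisect_left(node["keys"], key)       # exact-match probe in the leaf
    if i < len(node["keys"]) and node["keys"][i] == key:
        return node["values"][i]
    return None
```

The convention here is that keys equal to a separator route to the right child, which is what bisect_right gives you; the loop body is the "load a page, binary-search it, descend" step repeated once per level.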
Inserts keep the tree balanced by splitting full nodes: when a leaf overflows, it splits into two leaves and pushes a separator key up to the parent. If the parent overflows, it splits too. Splits propagate up the tree, but rarely all the way — and the tree stays balanced by construction, not by periodic rebalancing.
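The split itself is mechanical. A minimal sketch, using the same dict-as-page model and eliding the parent update (MAX_KEYS is a stand-in for how many keys fit in one page; real values are in the hundreds):

```python
MAX_KEYS = 4  # illustrative page capacity

def split_leaf(leaf):
    """Split an overflowing leaf in half; return (right_sibling, separator_key).
    The caller inserts separator_key into the parent, which may split in turn —
    that upward propagation is what keeps the tree balanced."""
    mid = len(leaf["keys"]) // 2
    right = {"keys": leaf["keys"][mid:], "values": leaf["values"][mid:]}
    leaf["keys"] = leaf["keys"][:mid]
    leaf["values"] = leaf["values"][:mid]
    return right, right["keys"][0]
```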
The B+ tree refinement
What databases actually ship is almost always a B+ tree, not a plain B-tree. Two differences matter:
- Only leaves hold data. Internal nodes hold only keys-as-routing-info. That makes internal nodes denser, which raises fan-out, which lowers tree height.
- Leaves are linked in a list. Once you find the start of a range, you walk the leaf list directly without going back up the tree. That's why WHERE created_at BETWEEN ... AND ... is fast.
When someone says “B-tree index” in a database context, they almost always mean B+ tree. The naming is sloppy but universal.
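The leaf-list walk can be sketched like so, assuming each leaf dict carries a "next" pointer and that the caller has already descended the tree to the leaf where the range starts:

```python
def range_scan(start_leaf, lo, hi):
    """Yield (key, value) pairs with lo <= key <= hi by walking linked leaves."""
    leaf = start_leaf
    while leaf is not None:
        for k, v in zip(leaf["keys"], leaf["values"]):
            if k > hi:
                return           # keys are sorted, so we can stop early
            if k >= lo:
                yield k, v
        leaf = leaf.get("next")  # sideways to the next leaf, never back up the tree
```

One descent to find the starting leaf, then purely sequential page reads: that access pattern is the whole reason range queries on B+ trees are cheap.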
Why fat nodes are still the right answer on SSDs
The original B-tree paper (Bayer & McCreight, 1972) was motivated by spinning disks, where seek time dominated and reading a whole track was almost free relative to repositioning the head. You'd expect SSDs, with their flat random-access cost, to break the analysis.
They don’t, for two reasons:
- SSDs still read in pages. The flash translation layer hides this from you, but under the hood, the smallest unit of read is a page (commonly 4 KB). Sub-page reads cost the same as full-page reads. The block-oriented assumption B-trees were designed around still holds.
- CPU caches and memory bandwidth. Even when the index lives entirely in RAM, fat sorted nodes are friendly to CPU caches and prefetchers. A binary tree of pointers chases cache lines; a B-tree node is a contiguous sorted array of keys you can binary-search inside one cache line’s worth of work.
I don’t have a clean public benchmark to cite for the in-memory B-tree-vs-binary-tree gap on modern CPUs, but the qualitative result — fat nodes win — shows up consistently in the engineering of in-memory databases too (e.g. SQLite’s in-memory tables still use B-trees).
Show the seams
A few places where the textbook story is misleading:
- B-trees aren’t great for write-heavy workloads. Every insert that causes a split writes multiple pages. LSM trees buffer writes in memory and flush sorted runs in batches, which gets you orders-of-magnitude better write throughput at the cost of read amplification and merge work. This is why high-ingest systems (Cassandra, ScyllaDB, RocksDB-backed stores) use LSMs instead.
- Concurrency is hard. Two transactions inserting into the same leaf can both trigger a split. Real implementations use schemes like latch crabbing or lock-free variants. The “B-tree” you read about in algorithms class is the single-threaded version; the one in your database is significantly more elaborate.
- “Covering indexes” change the calculus. If your index includes all the columns the query needs, the database never touches the table — it answers from the index alone. That’s a B+ tree feature (data lives in leaves) more than a B-tree one, and it’s why CREATE INDEX ... INCLUDE (...) exists in Postgres.
Famous related terms
- B+ tree — B+ tree = B-tree + (data only in leaves) + (leaf linked list) — the version actually shipped in databases.
- LSM tree — LSM = in-memory buffer + sorted runs + background merges — the write-optimized rival; powers RocksDB, Cassandra, modern NoSQL.
- Hash index — hash index = hash table on disk — O(1) point lookups, no range queries, rarely the default.
- Clustered index — clustered index ≈ table physically sorted by the index — InnoDB’s primary key works this way; Postgres’s doesn’t.
Going deeper
- Bayer & McCreight, Organization and Maintenance of Large Ordered Indexes (1972) — the original B-tree paper.
- Postgres docs, “Index Types” — short, concrete tour of B-tree, hash, GiST, GIN, BRIN.
- Goetz Graefe, Modern B-Tree Techniques (2011) — comprehensive survey of how real systems implement B-trees, including concurrency and recovery.