Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why B-trees still dominate database indexes

Disks read in blocks, not bytes. B-trees were designed around that one fact — and decades later, even on SSDs, no one has dethroned them.

Data · Intermediate · Apr 29, 2026

Why it exists

You have a billion rows in a table and you want the one where user_id = 8429173. Reading every row to find it is obviously absurd. But the obvious alternatives are also bad: a sorted array would let you binary-search, but you’d have to rewrite huge chunks of it on every insert. A hash table gives you O(1) lookups but loses the ordering you need for WHERE created_at > '2026-01-01' or ORDER BY name LIMIT 50.

The deeper problem is that databases don’t live in RAM. They live on storage that’s read in blocks — historically a 4 KB or 8 KB chunk for a spinning disk, today often a 4 KB page on an SSD. Reading one byte and reading a whole page cost roughly the same. So the right data structure isn’t the one that minimizes comparisons; it’s the one that minimizes how many pages you touch to find your row.

B-trees are the data structure shaped exactly around that constraint.

Why it matters now

Every relational database engineers reach for in 2026 — Postgres, MySQL/InnoDB, SQLite, SQL Server, Oracle — uses a B-tree variant as the default index. So do most embedded key-value stores that need ordered iteration. When you write CREATE INDEX without specifying a type, you almost certainly get a B-tree.

That’s not because nobody tried alternatives. LSM trees power RocksDB, Cassandra, and most of the NoSQL wave. Hash indexes exist. Skip lists exist. Bitmap indexes exist. They all win on specific workloads. B-trees keep winning the default slot because the workload most applications actually have — mixed reads and writes, range queries, point lookups, ordered scans — is exactly what B-trees are tuned for.

The short answer

B-tree = sorted index + fat nodes sized to a disk page

A B-tree is a balanced search tree where each node holds hundreds of keys instead of one or two. That fan-out is the whole trick: with ~500 keys per node, a tree only 4 levels deep can index something like 500⁴ ≈ 62 billion entries. Four page reads to find any row. Compare to a binary tree, which would need ~36 levels for the same data and 36 page reads.
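The depth arithmetic above is easy to check directly. A sketch (the 62 billion figure is the article's own example, not a real table size):

```python
import math

def height_needed(entries: int, fanout: int) -> int:
    """Smallest number of levels h such that fanout**h >= entries."""
    return math.ceil(math.log(entries) / math.log(fanout))

# Fan-out 500 (B-tree page) vs fan-out 2 (binary tree) for ~62 billion keys.
print(height_needed(62_000_000_000, 500))  # 4 page reads
print(height_needed(62_000_000_000, 2))    # 36 page reads
```

Each level is one page read, so the fan-out shows up directly as the gap between 4 I/Os and 36.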

How it works

A B-tree has three kinds of nodes: a single root, internal nodes, and leaves.

Each node is sized to fit one disk page. To look up a key, you load the root, binary-search its keys to find which child to descend into, load that child, and repeat. Because each node holds hundreds of keys, you eliminate hundreds of branches per page read. That’s where the depth-4 number comes from.
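The descent fits in a few lines of Python. This is an illustration, not any engine's actual page format; the `Node` layout and the hand-built two-level tree below are assumptions for the sketch:

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass, field

@dataclass
class Node:
    keys: list                                    # sorted keys in this "page"
    children: list = field(default_factory=list)  # child pages (empty in a leaf)
    values: list = field(default_factory=list)    # row data (leaves only)

def search(node: Node, key):
    """Descend root-to-leaf; each loop iteration stands in for one page read."""
    while node.children:                          # internal page: route downward
        node = node.children[bisect_right(node.keys, key)]
    i = bisect_left(node.keys, key)               # leaf page: exact-match check
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None

# A tiny two-level tree built by hand; separator 12 routes keys >= 12 right.
leaves = [Node(keys=[5, 9], values=["a", "b"]),
          Node(keys=[12, 20], values=["c", "d"])]
root = Node(keys=[12], children=leaves)
print(search(root, 9), search(root, 7))  # b None
```

The `bisect_right` inside the loop is the per-page binary search from the text: one page read eliminates every child subtree except one.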

Inserts keep the tree balanced by splitting full nodes: when a leaf overflows, it splits into two leaves and pushes a separator key up to the parent. If the parent overflows, it splits too. Splits propagate up the tree, but rarely all the way — and the tree stays balanced by construction, not by periodic rebalancing.
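The split step can be sketched with plain lists standing in for a leaf page (a simplification; real engines split byte-budgeted pages, not Python lists):

```python
def split_leaf(keys: list, values: list):
    """Split an overflowing leaf in half.

    Returns (separator, left, right). In a B+ tree the separator is *copied*
    up to the parent as a routing key; the entry itself stays in the right
    leaf so range scans still see it.
    """
    mid = len(keys) // 2
    left = (keys[:mid], values[:mid])
    right = (keys[mid:], values[mid:])
    return keys[mid], left, right

sep, left, right = split_leaf([3, 7, 11, 15, 19], ["a", "b", "c", "d", "e"])
print(sep)    # 11, pushed up to the parent as a routing key
print(left)   # ([3, 7], ['a', 'b'])
print(right)  # ([11, 15, 19], ['c', 'd', 'e'])
```

If pushing `sep` into the parent overflows it too, the parent splits the same way, which is the upward propagation described above.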

The B+ tree refinement

What databases actually ship is almost always a B+ tree, not a plain B-tree. Two differences matter:

  1. Only leaves hold data. Internal nodes hold only keys-as-routing-info. That makes internal nodes denser, which raises fan-out, which lowers tree height.
  2. Leaves are linked in a list. Once you find the start of a range, you walk the leaf list directly without going back up the tree. That’s why WHERE created_at BETWEEN ... AND ... is fast.
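The leaf-list walk behind a range query can be sketched like this (the `Leaf` structure and sibling pointer are an illustration; a real engine would first descend from the root to find `start_leaf`):

```python
from bisect import bisect_left
from dataclasses import dataclass

@dataclass
class Leaf:
    keys: list
    values: list
    next: "Leaf | None" = None   # sibling pointer: the leaf-list link

def range_scan(start_leaf: Leaf, lo, hi):
    """Yield (key, value) pairs with lo <= key <= hi, never re-ascending."""
    leaf, i = start_leaf, bisect_left(start_leaf.keys, lo)
    while leaf:
        while i < len(leaf.keys) and leaf.keys[i] <= hi:
            yield leaf.keys[i], leaf.values[i]
            i += 1
        if i < len(leaf.keys):   # stopped because we passed hi, not end-of-leaf
            return
        leaf, i = leaf.next, 0   # follow the sibling pointer to the next page

a = Leaf([1, 4], ["a", "b"])
b = Leaf([6, 9], ["c", "d"])
c = Leaf([12, 15], ["e", "f"])
a.next, b.next = b, c
print(list(range_scan(a, 4, 12)))  # [(4, 'b'), (6, 'c'), (9, 'd'), (12, 'e')]
```

One descent to find the start, then pure sequential page reads along the sibling chain: that is the access pattern behind a fast `BETWEEN`.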

When someone says “B-tree index” in a database context, they almost always mean B+ tree. The naming is sloppy but universal.

Why fat nodes are still the right answer on SSDs

The original B-tree paper (Bayer & McCreight, 1972) was motivated by spinning disks, where seek time dominated and reading a whole track was almost free relative to repositioning the head. You'd expect SSDs, with their flat random-access cost, to break the analysis.

They don’t, for two reasons:

  1. SSDs are still block devices. Flash is read through the controller a page at a time, commonly 4 KB, so a one-byte read costs about the same as a full-page read. The cost model that motivated B-trees survives; only the seek penalty shrank.
  2. The same model recurses up the memory hierarchy. RAM is fetched in cache lines, so even for nodes already in memory, binary-searching one fat, contiguous node beats chasing pointers through dozens of scattered binary-tree nodes.

I don’t have a clean public benchmark to cite for the in-memory B-tree-vs-binary-tree gap on modern CPUs, but the qualitative result — fat nodes win — shows up consistently in the engineering of in-memory databases too (e.g. SQLite’s in-memory tables still use B-trees).

Show the seams

A few places where the textbook story is misleading:

Going deeper