Why LSM trees exist
B-trees write where the key lives. LSM trees refuse to do that — and that refusal is the whole point.
Why it exists
A B-tree writes a key roughly where that key already lives on disk. That sounds harmless until you imagine a write-heavy workload — a metrics pipeline, a log ingester, a chat backend taking millions of small inserts per second. Each insert lands at a different spot in the tree, so the disk head (or the SSD’s flash translation layer) is constantly being asked to go somewhere else and update a few bytes. Random writes are the slowest thing a storage stack does.
LSM trees came out of refusing that bargain. The original 1996 paper by O’Neil, Cheng, Gawlick, and O’Neil — The Log-Structured Merge-Tree — was written in a world of spinning disks where a random seek cost ~10 ms and sequential I/O moved data 50–100x more cheaply per byte. The idea: never update a key in place. Just append. Sort and merge later, in the background, when you can do it sequentially.
That choice is why LevelDB, RocksDB, Cassandra, ScyllaDB, HBase, BigTable, DynamoDB’s storage engine, and most modern time-series databases are LSM-based. It’s also why your Postgres tables are not. Different bargains, different workloads.
Why it matters now
Even on NVMe SSDs, where random I/O is much cheaper than it was in 1996, the LSM bargain still pays off — just for slightly different reasons. SSDs have an internal flash translation layer that hates small random writes because flash can only be erased in big blocks. Sequential append is friendlier to the device and extends its lifetime. So the technique that was designed for spinning rust survived the transition to flash almost unchanged.
Anything you touch that calls itself a “key-value store,” “time-series database,” or “embedded storage engine” probably has an LSM tree at the bottom. If you’ve ever wondered why your Cassandra cluster has a “compaction” knob, or why RocksDB has stalls, or why a write-heavy DynamoDB table costs less than a read-heavy one — that’s the LSM showing through.
The short answer
LSM tree = in-memory sorted buffer + append-only sorted files on disk + background merge
Writes hit a small in-memory structure (usually a sorted skiplist), get journaled to a WAL for crash safety, and when the buffer fills, the whole thing gets flushed to disk as one sorted file. Reads check the buffer, then the disk files newest-to-oldest. A background process keeps merging files together so reads don’t have to check too many.
The trick: writes are always sequential, even when keys aren’t.
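To make that concrete, here is a tiny Python sketch of the claim (the file name and record format are made up for illustration): keys arrive in whatever order the workload produces them, but the only thing that touches disk is an append.

```python
# Toy illustration, not a real engine: random key order in, sequential appends out.
import json

writes = [("user:42", "a"), ("metric:cpu", "0.7"), ("user:7", "b")]  # arrival order, not key order

with open("wal.log", "a") as wal:        # disk only ever sees sequential appends
    for key, value in writes:
        wal.write(json.dumps([key, value]) + "\n")

memtable = dict(writes)                  # sorting happens in memory...
print(sorted(memtable.items()))          # ...and gets written out later as one sorted file
```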
How it works
Three layers, three rules.
1. The memtable. Incoming writes go into a sorted in-memory structure — typically a skiplist, because it supports concurrent inserts well. Same write also goes to the WAL on disk so a crash doesn’t lose it. No disk seek for the actual data structure; the WAL is one append.
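A minimal sketch of that write path, assuming a single-threaded toy engine: a bisect-maintained sorted list stands in for the skiplist, and the JSON-lines WAL format is invented for illustration.

```python
# Toy memtable + WAL: journal first (one sequential append), then a purely in-memory insert.
import bisect, json, os

class Memtable:
    def __init__(self, wal_path="wal.log"):
        self.keys, self.values = [], []           # parallel arrays kept in key order
        self.wal = open(wal_path, "a")            # append-only crash-safety journal

    def put(self, key, value):
        self.wal.write(json.dumps([key, value]) + "\n")
        self.wal.flush()
        os.fsync(self.wal.fileno())               # durable before the write is acknowledged
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value                # newer value wins, in memory only
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)

    def approximate_size(self):
        return sum(len(k) + len(str(v)) for k, v in zip(self.keys, self.values))
```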
2. Flush. When the memtable hits a size threshold (tens of MB usually), it gets frozen and written out as a single immutable file — an SSTable. The file is sorted, which is cheap because the memtable already was. Now the WAL for that memtable can be discarded.
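Flushing is then just a sequential walk over the already-sorted memtable. Continuing the same toy sketch (JSON lines instead of a real block format; the min/max bookkeeping is my own addition, used again in the read-path sketch below):

```python
import json

def flush(memtable, path):
    # The memtable is already sorted, so the SSTable is one sequential write.
    with open(path, "w") as f:
        for key, value in zip(memtable.keys, memtable.values):
            f.write(json.dumps([key, value]) + "\n")
    # Once this file is safely on disk, the WAL that covered this memtable can be deleted.
    return {"path": path, "min_key": memtable.keys[0], "max_key": memtable.keys[-1]}
```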
3. Compaction. Over time you accumulate many SSTables. Reads have to consult each one in newest-to-oldest order, which gets slow. So a background process picks several SSTables, does a merge-sort over them (sequential reads, sequential write), and produces fewer, larger files — discarding overwritten or deleted keys along the way.
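A compaction pass is essentially the merge step of merge sort. A sketch under the same toy assumptions (the tombstone marker is made up; real engines encode deletes differently):

```python
# Merge several sorted runs; for duplicate keys the newest file wins; dead keys vanish here.
import heapq, json

TOMBSTONE = "__deleted__"

def read_run(path):
    with open(path) as f:
        for line in f:
            yield tuple(json.loads(line))          # (key, value) pairs, sorted by key

def compact(input_paths, output_path):
    def tagged(path, age):                         # age 0 = newest file, so it sorts first
        for key, value in read_run(path):
            yield (key, age, value)

    runs = [tagged(path, age) for age, path in enumerate(input_paths)]
    with open(output_path, "w") as out:
        last_key = None
        for key, age, value in heapq.merge(*runs): # sequential reads in, sequential write out
            if key == last_key:
                continue                           # older version of a key already emitted
            last_key = key
            if value != TOMBSTONE:
                out.write(json.dumps([key, value]) + "\n")
```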
Most production LSMs organize SSTables into levels (they’re what LevelDB is named after): level 0 is freshly flushed, level 1 is ~10x bigger, level 2 ~10x bigger again, and so on. Compaction moves data downward. This bounds how many files a read has to consult to roughly log(total data / memtable size).
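The arithmetic is friendly. Assuming, purely for illustration, a 64 MB memtable and a 10x size ratio between levels (common defaults in the LevelDB/RocksDB family, though both are tunable):

```python
import math

memtable_mb, fanout, total_tb = 64, 10, 1
levels = math.ceil(math.log(total_tb * 1024 * 1024 / memtable_mb, fanout))
print(levels)  # ~5 levels for a terabyte of data: a handful of files per read, not thousands
```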
A read for key K does, in order:
1. Check the memtable.
2. Check the immutable memtable being flushed, if any.
3. For each level, find the SSTable whose key range contains K and look inside it.
Step 3 would be murder if every read had to touch every file, which is why every SSTable ships with a Bloom filter to skip files that definitely don’t contain the key.
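Putting that read path together, here is a sketch that reuses the toy pieces from the earlier sketches (read_run, TOMBSTONE, the Memtable class) plus a might_contain check like the Bloom filter sketched at the end of this piece; the min_key/max_key/bloom fields are my own bookkeeping, not any real engine's format.

```python
import bisect

def get(key, memtable, sstables):
    # 1. The live memtable always wins: it holds the newest data.
    i = bisect.bisect_left(memtable.keys, key)
    if i < len(memtable.keys) and memtable.keys[i] == key:
        return memtable.values[i]
    # 2./3. Then each on-disk run, newest to oldest.
    for table in sstables:                               # kept newest-first
        if not (table["min_key"] <= key <= table["max_key"]):
            continue                                     # key range rules this file out
        if not table["bloom"].might_contain(key):
            continue                                     # definitely not in this file
        for k, v in read_run(table["path"]):             # real engines binary-search a block index
            if k == key:
                return None if v == TOMBSTONE else v
    return None                                          # not found anywhere
```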
The cost shows up as write amplification and read amplification. A single logical write can be re-written several times as it gets compacted down through the levels — typically 10–30x in real RocksDB workloads, though I don’t have a single canonical number and it varies wildly by config. A single logical read can fan out to multiple files. This is the LSM’s deal with the devil: trade some background I/O and read complexity for very fast, very sequential foreground writes.
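One common back-of-envelope model for leveled compaction: each byte that descends a level gets merged with roughly fanout times as much data already living there, so per-level write amplification approaches the fanout. Treat this as a rough upper bound, not a measurement:

```python
fanout, levels = 10, 5
wal_plus_flush = 2                  # written once to the WAL, once at memtable flush
write_amp = wal_plus_flush + fanout * (levels - 1)
print(write_amp)                    # ~42x worst case; measured workloads usually land lower
```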
The seams: deletes are tombstones — markers that say “this key is gone” — and they only really go away when compaction sees them. Range scans get slower as levels grow. And tuning compaction is genuinely hard; the RocksDB tuning guide is famously long. I don’t have a strong sense of how much of that complexity is essential vs. accidental — there’s active research (e.g. tiered vs. leveled compaction tradeoffs) that suggests the design space isn’t settled.
Famous related terms
- B-tree — B-tree = sorted tree of pages + in-place updates — the other major on-disk index, optimized for read-heavy workloads with mixed point and range queries. See b-tree-indexes.
- WAL — WAL = append-only log + "write here before touching real state" — the crash-safety layer underneath both LSMs and B-tree databases. See write-ahead-log.
- SSTable — SSTable ≈ immutable sorted file + index + Bloom filter — the on-disk unit an LSM flushes and merges.
- Compaction — compaction = pick SSTables + merge-sort them + drop dead keys — the background work that keeps reads from drowning.
- Bloom filter — Bloom filter = bit array + k hash functions — answers “is this key definitely not in this file?” in a few cache lines. A minimal sketch follows this list.
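Here is that minimal Bloom filter. The double-hashing trick (deriving k probe positions from two hash values) is standard, though real engines use faster non-cryptographic hashes and size the bit array per SSTable; the numbers below are illustrative.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=8192, num_hashes=7):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") or 1   # avoid a degenerate second hash
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely not here"; True means "go read the file and check".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"), bf.might_contain("user:99"))  # True, almost certainly False
```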
Going deeper
- O’Neil, Cheng, Gawlick, O’Neil — The Log-Structured Merge-Tree (LSM-Tree), 1996. The original paper.
- Designing Data-Intensive Applications by Martin Kleppmann, chapter 3, has the clearest side-by-side of LSM vs. B-tree I know of.
- The RocksDB wiki’s pages on compaction styles and tuning — dense, but the source of truth for what real LSMs actually do.