Write-ahead logging
Why databases write your change to a log before they write it to the actual table — and why crash recovery is impossible without it.
Why it exists
Picture a database in the middle of a transaction. You’ve told it to move $100 from account A to account B. It debits A. The power cuts. When the machine boots, A is short $100 and B never got it. The money has evaporated.
The naive fix is “write both updates at once.” But disks don’t work like that. A single page write to spinning rust or even an SSD is the largest atomic unit you get for free, and a transaction routinely touches dozens of pages across multiple tables and indexes. There is no hardware primitive for “update twelve pages, all-or-nothing.” So databases have to manufacture atomicity in software, and they all reach for the same trick: write what you’re about to do to a small append-only file first, fsync it, and only then start touching the real data pages. That file is the WAL.
This idea is older than Postgres. It’s the central recovery technique in
the ARIES paper (Mohan et al., IBM, 1992), and pretty much every serious
relational database — Postgres, Oracle, SQL Server, MySQL/InnoDB, SQLite —
runs some descendant of it. Postgres is just where curious engineers most
often run into it by name, because pg_wal/ keeps filling up their disk.
Why it matters now
Three things you almost certainly use depend on WAL being a real, durable, ordered stream:
- Crash recovery. If the process is killed, the OS panics, or the box loses power, on restart Postgres replays the WAL forward from the last known-good checkpoint and the database comes back consistent.
- Replication. Streaming replication is literally “ship the WAL bytes to another server and replay them.” Read replicas, hot standbys, point-in-time recovery, and logical replication all sit on top of the same log.
- Performance. Counterintuitively, writing twice (once to WAL, once to the table) is faster than writing once carefully, because WAL writes are sequential appends and the random table-page writes can be batched, reordered, and deferred.
If you’ve ever wondered why synchronous_commit = off makes Postgres
faster but scarier, or why a busy database needs WAL on a fast disk even
if the dataset fits in RAM — this is the layer you’re poking at.
The short answer
WAL = append-only log of intended page changes + fsync before the data page is allowed to move
The rule, called the WAL rule, is one sentence: log first, data after. As long as the log entry describing a change is on stable storage before the modified data page is, the database can always reconstruct the truth after a crash. Either it sees the log record and redoes the change, or it doesn’t and the change effectively never happened. Atomicity falls out of that ordering.
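The ordering is small enough to sketch in toy Python (the file name, record format, and `commit` helper are all invented for illustration; real WAL records are binary and page-oriented):

```python
import os

def commit(wal_path: str, record: bytes) -> None:
    """Toy WAL rule: the log record must reach stable storage first."""
    with open(wal_path, "ab") as wal:
        wal.write(record + b"\n")   # describe the change, append-only
        wal.flush()
        os.fsync(wal.fileno())      # durable before any data page moves
    # Only now may the in-memory data page be written to the table
    # file -- and that write can be deferred until a later checkpoint.

commit("wal.log", b"page 7: set balance(A) -= 100")
commit("wal.log", b"page 9: set balance(B) += 100")
```

If the crash lands before the fsync returns, the client never saw a commit; if it lands after, the record is on disk and can be replayed. There is no third case, which is the whole point.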
How it works
A simplified life of one UPDATE:
1. The change is computed in memory (in the buffer cache). Postgres produces a small WAL record describing the modification: which page, which tuple, what bytes changed.
2. The WAL record is appended to an in-memory WAL buffer. At commit time (or sooner, if the buffer fills) it gets written to the WAL files in pg_wal/ and then fsync'd. Only after the fsync returns does Postgres tell the client "COMMIT successful."
3. The actual data page is now dirty in the buffer cache but has not been written to the table file yet. It can sit there for a while.
4. Eventually a checkpoint forces all dirty pages to disk and records "as of WAL position X, the data files are caught up." WAL files older than that position are no longer needed for crash recovery and can be recycled.
5. If the server crashes between steps 2 and 4, recovery starts from the last checkpoint, walks forward through the WAL, and reapplies every record whose effects didn't make it to the data files. This is called the redo phase.
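The replay half of the bargain can be sketched the same way. Everything below (the JSON record format, `append_record`, `redo`) is invented for the demo; real recovery replays binary records addressed by log sequence number:

```python
import json, os

def append_record(wal, page: str, value: int) -> None:
    """Log first: the record is durable before the 'table' changes."""
    wal.write(json.dumps({"page": page, "value": value}) + "\n")
    wal.flush()
    os.fsync(wal.fileno())

def redo(wal_path: str, checkpoint_offset: int = 0) -> dict:
    """Replay the WAL forward from the last checkpoint.

    Applying a record twice yields the same state (redo is idempotent
    here), so re-applying records whose effects already reached the
    table is harmless."""
    table = {}
    with open(wal_path) as wal:
        wal.seek(checkpoint_offset)
        for line in wal:
            rec = json.loads(line)
            table[rec["page"]] = rec["value"]
    return table

with open("toy_wal.log", "w") as wal:
    append_record(wal, "A", 400)   # debit
    append_record(wal, "B", 600)   # credit
# Crash here: the in-memory table is lost, the WAL survives.
print(redo("toy_wal.log"))        # {'A': 400, 'B': 600}
```

The checkpoint_offset parameter is the toy version of "recovery starts from the last checkpoint": everything before it is known to be in the data files already.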
The clever part is that step 2 is fast. WAL is a single sequential file (or small set of segment files) being appended to. Sequential I/O is what disks — even SSDs — are best at. Random updates across a 500 GB table would wreck a disk; one fat sequential append per transaction group does not.
A second clever part: WAL also lets Postgres lie about durability in a
controlled way. With synchronous_commit = off, commits return before
the WAL fsync completes. You get a much faster system, with the explicit
trade that a crash can lose the last few hundred milliseconds of
already-acknowledged commits. The database stays consistent — torn
transactions never appear — you just lose the tail. That distinction
(consistent vs. durable) only makes sense because WAL exists.
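A toy sketch of that trade-off, with invented class and method names (Postgres itself batches these flushes in a dedicated WAL writer process, roughly every wal_writer_delay, 200 ms by default):

```python
import os

class AsyncCommitWal:
    """Acknowledge commits before fsync; flush the tail in batches.

    A crash loses at most the records appended since the last flush,
    but recovery still sees a clean prefix of the log: consistent,
    just not fully durable."""

    def __init__(self, path: str):
        self.f = open(path, "ab")
        self.unflushed = 0

    def commit(self, record: bytes) -> None:
        self.f.write(record + b"\n")  # sits in a buffer, not yet durable
        self.unflushed += 1
        # return to the client immediately: fast, but not durable

    def flush(self) -> None:
        """What the periodic background flusher would do."""
        self.f.flush()
        os.fsync(self.f.fileno())
        self.unflushed = 0

wal = AsyncCommitWal("async_wal.log")
wal.commit(b"txn 1")
wal.commit(b"txn 2")   # both acknowledged, neither durable yet
wal.flush()            # now both are safe; a crash before this loses them
```

Torn records never appear because replay stops at the first incomplete record: the log is always a valid prefix, just possibly a short one.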
I should name a gap: I’m describing the standard ARIES-shaped picture
and the public Postgres docs. Postgres has its own departures —
notably, it does not use ARIES-style undo records the way DB2 does,
because MVCC
already keeps old row versions around, so rollback is “mark this
transaction aborted, the old version is already there.” If you want
the exact byte layout of a Postgres WAL record or the precise rules
for full-page writes, the source (src/backend/access/transam/) and
the docs are the ground truth, not this post.
Famous related terms
- Checkpoint — checkpoint = flush all dirty pages + record a "WAL replay can start here" marker — bounds how much WAL recovery has to chew through after a crash.
- fsync — fsync = "don't return until this file's bytes are on stable storage" — the syscall the WAL rule is built on. If your disk lies about fsync (some consumer SSDs historically did), WAL’s guarantees evaporate.
- MVCC — see postgres-mvcc — Postgres keeps multiple row versions, which changes how undo and rollback interact with WAL.
- Logical replication — logical replication ≈ decode WAL records back into row-level changes + ship them — the WAL turns out to also be a clean change-data-capture stream, which is half of why tools like Debezium exist.
- Redo log / binlog — redo log ≈ WAL by another name — Oracle’s redo log, MySQL’s InnoDB redo log and binlog are the same idea with different names and slightly different splits between durability and replication.
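The change-data-capture angle can be sketched too: decoding WAL-ish records back into row-level events is roughly what a logical replication slot does for a subscriber like Debezium. The record format and `decode` function below are invented for the demo, not the real WAL layout:

```python
import json

def decode(wal_lines):
    """Turn low-level WAL-ish records back into row-level change
    events, the way a logical replication slot feeds a subscriber.
    The record format here is made up for illustration."""
    for line in wal_lines:
        rec = json.loads(line)
        yield {"op": rec["op"], "table": rec["table"],
               "row": {rec["key_col"]: rec["key"], "value": rec["value"]}}

records = [
    '{"op": "UPDATE", "table": "accounts", "key_col": "id", "key": "A", "value": 400}',
    '{"op": "UPDATE", "table": "accounts", "key_col": "id", "key": "B", "value": 600}',
]
changes = list(decode(records))
print(changes[0]["op"], changes[0]["row"])  # UPDATE {'id': 'A', 'value': 400}
```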
Going deeper
- C. Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, ACM TODS, 1992.
- Postgres docs: Reliability and the Write-Ahead Log (https://www.postgresql.org/docs/current/wal.html).
- Postgres source: src/backend/access/transam/xlog.c for the core WAL machinery.