Write-ahead logging
Why databases write your change to a log before they write it to the actual table — and why crash recovery is impossible without it.
Why it exists
Picture a database in the middle of a transaction. You’ve told it to move $100 from account A to account B. It debits A. The power cuts. When the machine boots, A is short $100 and B never got it. The money has evaporated.
The naive fix is “write both updates at once.” But disks don’t work like that. A transaction routinely touches dozens of pages across multiple tables and indexes, and there is no hardware primitive for “update twelve pages, all-or-nothing.” Even a single 8 KB page write on spinning rust or an SSD can tear if a crash interrupts it mid-write. So databases have to manufacture atomicity in software, and they all reach for the same trick: write what you’re about to do to a small append-only file first, fsync it, and only then start touching the real data pages. That file is the WAL.
This idea is older than Postgres. It’s the central recovery technique in
the ARIES paper (Mohan et al., IBM, 1992), and pretty much every serious
relational database — Postgres, Oracle, SQL Server, MySQL/InnoDB, SQLite —
runs some descendant of it. Postgres is just where curious engineers most
often run into it by name, because pg_wal/ keeps filling up their disk.
Why it matters now
Three things you almost certainly use depend on WAL being a real, durable, ordered stream:
- Crash recovery. If the process is killed, the OS panics, or the box loses power, on restart Postgres replays the WAL forward from the last known-good checkpoint and the database comes back consistent.
- Replication. Streaming replication is literally “ship the WAL bytes to another server and replay them.” Read replicas, hot standbys, point-in-time recovery, and logical replication all sit on top of the same log.
- Performance. Counterintuitively, writing twice (once to WAL, once to the table) is faster than writing once carefully, because WAL writes are sequential appends and the random table-page writes can be batched, reordered, and deferred.
If you’ve ever wondered why synchronous_commit = off makes Postgres
faster but scarier, or why a busy database needs WAL on a fast disk even
if the dataset fits in RAM — this is the layer you’re poking at.
The short answer
WAL = append-only log of intended page changes + fsync before the data page is allowed to move
The rule, called the WAL rule, is one sentence: log first, data after. As long as the log entry describing a change is on stable storage before the modified data page is, the database can always reconstruct the truth after a crash. Either the WAL contains the transaction’s commit record — recovery redoes the change — or it doesn’t, and the transaction is treated as if it never happened. Atomicity falls out of that ordering. In the $100-transfer story: either both the debit and the credit make it through together (their commit record landed in the WAL before the crash) or neither effectively did (no commit record, the half-written rows are skipped on replay).
How it works
A simplified life of one UPDATE:
- The change is computed in memory (in the buffer cache). Postgres produces a small WAL record describing the modification: which page, which tuple, what bytes changed.
- The WAL record is appended to an in-memory WAL buffer. At commit
time (or sooner, if the buffer fills) it gets written to the WAL
files in
pg_wal/and thenfsync’d. Only after the fsync returns does Postgres tell the client “COMMIT successful.” - The actual data page is now dirty in the buffer cache but has not been written to the table file yet. It can sit there for a while.
- Eventually a checkpoint forces all dirty pages to disk and records “as of WAL position X, the data files are caught up.” WAL files older than that are now eligible for recycling — unless archiving, a replica, or a replication slot is still holding them (more on that below).
- If the server crashes between steps 2 and 4, recovery starts from the last checkpoint, walks forward through the WAL, and reapplies every record whose effects didn’t make it to the data file. This is called the redo phase.
The shape of that lifecycle, with the fsync barrier in the middle:
sequenceDiagram
participant C as Client
participant B as Backend
participant W as WAL (disk)
participant D as Data file (disk)
C->>B: UPDATE ... COMMIT
Note over B: build WAL record,<br/>dirty page in cache
B->>W: append record + fsync
W-->>B: durable
B-->>C: COMMIT ok
Note over B,D: page stays dirty; written later
B->>D: flush dirty page (at next checkpoint)
The commit acknowledgement crosses the wire before the dirty data page ever touches its file — that gap is exactly where WAL earns its keep.
The clever part is that step 2 is fast. WAL is a single sequential file (or small set of segment files) being appended to. Sequential I/O is what disks — even SSDs — are best at. Random updates across a 500 GB table would wreck a disk; one fat sequential append per transaction group does not.
A second clever part: WAL also lets Postgres lie about durability in a
controlled way. With synchronous_commit = off, commits return before
the WAL fsync completes. You get a much faster system, with the explicit
trade that a crash can lose the last few hundred milliseconds of
already-acknowledged commits. The database stays consistent — torn
transactions never appear — you just lose the tail. That distinction
(consistent vs. durable) only makes sense because WAL exists.
I should name a gap: I’m describing the standard ARIES-shaped picture
and the public Postgres docs. Postgres has its own departures —
notably, it does not use ARIES-style undo records the way DB2 does,
because MVCC
already keeps old row versions around, so rollback is “mark this
transaction aborted, the old version is already there.” If you want
the exact byte layout of a Postgres WAL record or the precise rules
for full-page writes, the source (src/backend/access/transam/) and
the docs are the ground truth, not this post.
Advanced usage
The WAL rule is one sentence, but the machinery built on top of it piles up fast. A handful of things that matter in practice:
Full-page writes — the torn-page defense. A page write to disk isn’t
atomic either. Postgres uses 8 KB pages, but the OS and storage layer
write in 4 KB (or smaller) sectors. A crash mid-write can leave half the
old page and half the new — a torn page.
Replaying a delta-style WAL record against a torn page produces garbage.
Postgres’s fix: after each checkpoint, the first time any page is
modified, the entire page is written into WAL, not just the change.
Recovery replays the full page first, overwriting whatever torn version
is on disk, and only then applies the subsequent deltas. This is why WAL
volume spikes immediately after a checkpoint, and why turning
full_page_writes = off is only safe if your storage independently
guarantees atomic page writes.
Group commit — sharing the fsync. An fsync that flushes 1 KB and
one that flushes 1 MB cost roughly the same; the latency is dominated by
the round trip to the disk’s durability barrier, not the bytes. So if a
hundred transactions are trying to commit at the same moment, you can
fsync the WAL once and acknowledge all of them. Postgres does this
implicitly: committing backends pile into the same WAL flush, so one
fsync satisfies many commits. It also exposes commit_delay and
commit_siblings for explicit pacing (“wait this many microseconds for
friends to show up, provided at least N other backends are
mid-transaction”). It’s the cheapest throughput win in a write-heavy
workload, and it falls out of WAL being a single shared append point.
WAL archiving and point-in-time recovery. Set archive_mode = on
and an archive_command that copies completed WAL segments to a backup
location, and Postgres will hand off each segment as it rotates. Take
one base backup, keep archiving WAL forever, and you can recover the
database to any moment covered by the base backup plus the WAL you
still have —
PITR.
The recovery target can be a timestamp, an XID, or a named restore
point. Mechanically it’s the same replay loop crash recovery uses, just
fed from an archive instead of pg_wal/ and stopped early on purpose.
Replication slots — pinning the log. WAL segments are eligible for
recycling once a checkpoint passes them, unless something is still
holding them — archiving, wal_keep_size, a standby, or a slot. If a
streaming replica or a logical-decoding consumer is lagging without
such protection, the bytes they still need can disappear out from under
them. A
replication slot
tells Postgres “do not recycle WAL past position X until this consumer
acknowledges past it.” Useful and dangerous in the same breath: a
forgotten slot on a dead consumer is the classic way to fill pg_wal/
until the disk runs out and the database refuses new writes.
Logical decoding — WAL as a change stream. The WAL is binary and
tied to physical page layouts, but a logical decoding plugin lets
Postgres decode committed table changes into a plugin-defined stream —
typically row-level events (“INSERT into orders, values …”) — without
anyone re-querying the table. This is what Debezium, pg_recvlogical,
and most Postgres
CDC
tooling sit on. The WAL was designed for recovery; it turned out to
also be an ordered, durable, at-least-once change stream (consumers
handle the occasional re-delivery after a crash), and the ecosystem
has been mining that second job ever since.
Famous related terms
- Checkpoint —
checkpoint = flush all dirty pages + record a "WAL replay can start here" marker— bounds how much WAL recovery has to chew through after a crash. - fsync —
fsync = "don't return until this file's bytes are on stable storage"— the syscall the WAL rule is built on. If your disk lies about fsync (some consumer SSDs historically did), WAL’s guarantees evaporate. - MVCC — see postgres-mvcc — Postgres keeps multiple row versions, which changes how undo and rollback interact with WAL.
- Logical replication —
logical replication ≈ decode WAL records back into row-level changes + ship them— the WAL turns out to also be a clean change-data-capture stream, which is half of why tools like Debezium exist. - Redo log —
redo log ≈ WAL by another name— Oracle’s redo log and MySQL’s InnoDB redo log play the same crash-recovery role as Postgres’s WAL. - MySQL binlog —
binlog = server-level log of committed changes, separate from InnoDB's redo log— used for replication and point-in-time recovery; commonly confused with the redo log because both are “logs,” but they live at different layers and exist for different reasons.
Going deeper
- C. Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, ACM TODS, 1992.
- Postgres docs: Reliability and the Write-Ahead Log (
https://www.postgresql.org/docs/current/wal.html). - Postgres source:
src/backend/access/transam/xlog.cfor the core WAL machinery.