Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Write-ahead logging

Why databases write your change to a log before they write it to the actual table — and why crash recovery is impossible without it.

Data · intermediate · Apr 29, 2026

Why it exists

Picture a database in the middle of a transaction. You’ve told it to move $100 from account A to account B. It debits A. The power cuts. When the machine boots, A is short $100 and B never got it. The money has evaporated.

The naive fix is “write both updates at once.” But disks don’t work like that. A single page write to spinning rust or even an SSD is the largest atomic unit you get for free, and a transaction routinely touches dozens of pages across multiple tables and indexes. There is no hardware primitive for “update twelve pages, all-or-nothing.” So databases have to manufacture atomicity in software, and they all reach for the same trick: write what you’re about to do to a small append-only file first, fsync it, and only then start touching the real data pages. That file is the WAL.
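The ordering can be sketched in a few lines. This is a toy, not Postgres's on-disk format: the JSON records, file names, and key/value "table" are all stand-ins, but the load-bearing sequence — append, fsync, only then touch data — is the real rule.

```python
import json
import os

def log_then_apply(wal_path, data_path, change):
    """Toy illustration of the WAL rule: log first, data after."""
    # 1. Append a record describing the intended change and force it to disk.
    with open(wal_path, "a") as wal:
        wal.write(json.dumps(change) + "\n")
        wal.flush()
        os.fsync(wal.fileno())  # the change is durable from this point on
    # 2. Only now may the data file change. A crash anywhere before this
    #    point is safe: recovery sees the log record and redoes the write.
    table = {}
    if os.path.exists(data_path):
        with open(data_path) as f:
            table = json.load(f)
    table[change["key"]] = change["value"]
    tmp = data_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(table, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, data_path)  # atomic rename stands in for a page write
```

Note that the data write here rewrites the whole "table"; a real engine updates fixed-size pages in place, which is exactly why it needs the log to paper over partial writes.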

This idea is older than Postgres. It’s the central recovery technique in the ARIES paper (Mohan et al., IBM, 1992), and pretty much every serious relational database — Postgres, Oracle, SQL Server, MySQL/InnoDB, SQLite — runs some descendant of it. Postgres is just where curious engineers most often run into it by name, because pg_wal/ keeps filling up their disk.

Why it matters now

Three things you almost certainly use depend on WAL being a real, durable, ordered stream:

  1. Replication. A streaming-replication standby is, at heart, a server continuously replaying the primary's WAL.
  2. Point-in-time recovery. Archive WAL segments alongside a base backup and you can restore to any moment between then and now.
  3. Change data capture. Logical decoding — the machinery behind tools like Debezium — reads committed changes out of the WAL instead of polling tables.

If you’ve ever wondered why synchronous_commit = off makes Postgres faster but scarier, or why a busy database needs WAL on a fast disk even if the dataset fits in RAM — this is the layer you’re poking at.

The short answer

WAL = append-only log of intended page changes + fsync before the data page is allowed to move

The write-ahead rule is one sentence: log first, data after. As long as the log entry describing a change reaches stable storage before the modified data page does, the database can always reconstruct the truth after a crash. Either it sees the log record and redoes the change, or it doesn't and the change effectively never happened. Atomicity falls out of that ordering.

How it works

A simplified life of one UPDATE:

  1. The change is computed in memory (in the buffer cache). Postgres produces a small WAL record describing the modification: which page, which tuple, what bytes changed.
  2. The WAL record is appended to an in-memory WAL buffer. At commit time (or sooner, if the buffer fills) it gets written to the WAL files in pg_wal/ and then fsync’d. Only after the fsync returns does Postgres tell the client “COMMIT successful.”
  3. The actual data page is now dirty in the buffer cache but has not been written to the table file yet. It can sit there for a while.
  4. Eventually a checkpoint forces all dirty pages to disk and records “as of WAL position X, the data files are caught up.” WAL segments older than that position are no longer needed for crash recovery and can be recycled.
  5. If the server crashes between steps 2 and 4, recovery starts from the last checkpoint, walks forward through the WAL, and reapplies every record whose effects didn’t make it to the data file. This is called the redo phase.
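The redo phase in step 5 can be sketched too. Assumptions, loudly labeled: records are the same toy JSON lines as before, and a record's "LSN" is just its line number, where real databases use byte offsets into the log. The checkpoint LSN marks the position the data files are guaranteed to have caught up to.

```python
import json
import os

def redo(wal_path, data_path, checkpoint_lsn):
    """Toy redo pass: reapply every WAL record at or past the checkpoint."""
    table = {}
    if os.path.exists(data_path):
        with open(data_path) as f:
            table = json.load(f)
    with open(wal_path) as wal:
        for lsn, line in enumerate(wal):
            if lsn < checkpoint_lsn:
                continue  # checkpoint guarantees data files already have this
            rec = json.loads(line)
            # Reapplying a change that did reach the data file is harmless
            # here; real engines compare per-page LSNs to skip such records.
            table[rec["key"]] = rec["value"]
    with open(data_path, "w") as f:
        json.dump(table, f)
    return table
```

The key property is that redo only needs the last checkpoint and everything after it — which is exactly why step 4 is allowed to recycle the older segments.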

The clever part is that step 2 is fast. WAL is a single sequential file (or small set of segment files) being appended to. Sequential I/O is what disks — even SSDs — are best at. Random updates across a 500 GB table would wreck a disk; one fat sequential append per transaction group does not.

A second clever part: WAL also lets Postgres lie about durability in a controlled way. With synchronous_commit = off, commits return before the WAL fsync completes. You get a much faster system, with the explicit trade that a crash can lose the last few hundred milliseconds of already-acknowledged commits. The database stays consistent — torn transactions never appear — you just lose the tail. That distinction (consistent vs. durable) only makes sense because WAL exists.
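The difference between the two commit modes is small enough to sketch. A hedged toy, not Postgres internals: the point is only where the fsync sits relative to the acknowledgment.

```python
import os

def commit_sync(wal, record):
    """Like synchronous_commit = on: acknowledge only after the fsync."""
    wal.write(record + "\n")
    wal.flush()
    os.fsync(wal.fileno())  # durable before the client hears "COMMIT"

def commit_async(wal, record):
    """Like synchronous_commit = off: acknowledge once the record is in
    the OS page cache. A background writer fsyncs shortly after; a crash
    in that window drops this commit, but records still hit the log in
    commit order, so no torn transaction can ever appear."""
    wal.write(record + "\n")
    wal.flush()  # visible to the OS, not yet guaranteed on disk
```

Both paths append to the same ordered log — that shared ordering is what keeps the async mode merely lossy at the tail rather than corrupting.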

I should name a gap: I’m describing the standard ARIES-shaped picture and the public Postgres docs. Postgres has its own departures — notably, it does not use ARIES-style undo records the way DB2 does, because MVCC already keeps old row versions around, so rollback is “mark this transaction aborted, the old version is already there.” If you want the exact byte layout of a Postgres WAL record or the precise rules for full-page writes, the source (src/backend/access/transam/) and the docs are the ground truth, not this post.

Going deeper