Why Git stores snapshots, not diffs
Git's reputation says 'version control = diffs.' Git's actual model says 'version control = snapshots, hashed.' That swap is the whole reason Git feels different.
Why it exists
If you came to Git after using Subversion or CVS, the mental model you arrived with was almost certainly “a repo is a sequence of diffs from the previous version,” and for Git that model is wrong. It is broadly how the older systems worked: to check out an old file, they replayed stored patches.
Git does not do this. A Git commit is a snapshot of the entire tree as it existed at that moment — every file, in full. When a file doesn’t change between two commits, Git doesn’t store a diff of zero bytes; it just points the new commit at the same file blob the old commit was already pointing at.
That sounds wasteful at first. It isn’t, and the reason it isn’t is the whole point of this post — and most of why Git feels qualitatively different from what came before.
Why it matters now
Almost every codebase you’ll touch as a software engineer in 2026 lives in Git.
The mental model leaks into everything: why git log is fast, why branches are
free, why git rebase can reorder history without corrupting it, why a
detached HEAD is a normal state and not a disaster, why GitHub can show you
any historical version of a file instantly without replaying patches.
It also matters because the snapshot model is what makes Git a content-addressed store — the same idea now showing up in IPFS, container image layers, Nix, and large-model weight stores. Git was an early mainstream example of “address things by what they are, not where they live,” and that idea keeps being rediscovered.
The short answer
git commit ≈ snapshot of the tree, addressed by the SHA-1 hash of its contents
A commit is a tiny object that points at a tree (a directory listing). The tree points at blobs (file contents) and at sub-trees. Every one of those objects is named by the hash of its own bytes. If two commits contain the same file, they end up pointing at the same blob — automatically, with no deduplication step.
There are no diffs in the storage model. Diffs are something Git computes on
demand when you ask git diff or git log -p.
How it works
Four object types, all hashed, all immutable:
- blob — the raw bytes of a file. Just bytes; no filename.
- tree — a directory: a list of entries (mode, name, hash) where the hash points to a blob (a file) or another tree (a sub-directory).
- commit — a tiny record with: the hash of one tree (the snapshot), the hash(es) of its parent commit(s), an author, a committer, a timestamp, a message.
- tag — an annotated pointer to a commit (less essential to the mental model).
Each object’s name is the SHA-1 of its contents (prefixed with a short header giving the object type and byte length), so identical content has identical names automatically. Add the same file to two repos in two countries; both repos call its blob the same 40-character hex string.
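To make the naming rule concrete, here is a minimal Python sketch of how a blob name can be computed by hand. The function names git_object_id and blob_id are made up for this post; the header layout (type, space, byte length, NUL byte) is the loose-object format described in Pro Git, and the sketch assumes a SHA-1 repository.

```python
import hashlib


def git_object_id(obj_type: str, payload: bytes) -> str:
    """Return the SHA-1 object name for a payload of the given type."""
    # Git hashes a short header ("<type> <size>\0") followed by the raw bytes.
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


def blob_id(data: bytes) -> str:
    """Name of a blob: only the bytes matter, never the filename."""
    return git_object_id("blob", data)


if __name__ == "__main__":
    print(blob_id(b""))               # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
    print(blob_id(b"hello world\n"))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

The second value should match what echo 'hello world' | git hash-object --stdin prints, on any machine, which is the "two repos in two countries" property in miniature.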
A walked-through example. Imagine a repo with two files, README.md and
src/main.py. Commit 1 has both. You then change only README.md and make
commit 2.
- Commit 1 → tree T1 → {README.md → blobA, src → tree S1}. S1 → {main.py → blobX}.
- Commit 2 → tree T2 → {README.md → blobB, src → tree S1}. Same S1. Same blobX.
Two full snapshots are stored, but blobX and the entire src/ tree object
exist exactly once on disk. The “diff” between commit 1 and commit 2 is
something Git derives at read time by walking the two trees and noticing that
README.md changed.
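The two-commit example can be modelled with a toy content-addressed store. This is a sketch of the idea, not Git's real object encoding: the store dictionary, the put helper, and the repr-based hashing are all assumptions made for illustration, and real Git serializes trees and commits in its own binary and text formats.

```python
import hashlib

# Toy content-addressed store: object id -> (type, payload).
store: dict[str, tuple[str, object]] = {}


def put(obj_type: str, payload) -> str:
    """Store an object under the hash of its own content and return that id."""
    raw = repr((obj_type, payload)).encode("utf-8")
    oid = hashlib.sha1(raw).hexdigest()
    store[oid] = (obj_type, payload)  # identical content lands on the same key
    return oid


def blob(data: bytes) -> str:
    return put("blob", data)


def tree(entries: dict[str, str]) -> str:
    # A tree maps names to ids of blobs or sub-trees; sort for a stable hash.
    return put("tree", tuple(sorted(entries.items())))


def commit(tree_id: str, parents: tuple[str, ...], message: str) -> str:
    return put("commit", (tree_id, parents, message))


def changed_paths(old_tree: str, new_tree: str, prefix: str = "") -> list[str]:
    """Derive a 'diff' at read time by walking two trees side by side."""
    if old_tree == new_tree:
        return []  # same hash means same content: skip the whole subtree
    old = dict(store[old_tree][1])
    new = dict(store[new_tree][1])
    out = []
    for name in sorted(set(old) | set(new)):
        a, b = old.get(name), new.get(name)
        if a == b:
            continue
        both_trees = (
            a is not None and b is not None
            and store[a][0] == "tree" and store[b][0] == "tree"
        )
        if both_trees:
            out += changed_paths(a, b, prefix + name + "/")
        else:
            out.append(prefix + name)
    return out


blob_a = blob(b"# README v1\n")
blob_b = blob(b"# README v2\n")
blob_x = blob(b"print('hi')\n")
s1 = tree({"main.py": blob_x})
t1 = tree({"README.md": blob_a, "src": s1})
t2 = tree({"README.md": blob_b, "src": s1})
c1 = commit(t1, (), "commit 1")
c2 = commit(t2, (c1,), "commit 2")

print(changed_paths(t1, t2))  # ['README.md']
print(len(store))             # 8 objects: 3 blobs, 3 trees, 2 commits
```

Nothing in put checks for duplicates. Sharing S1 and blobX between the two commits falls out of using the hash as the key, and changed_paths never looks inside src/ because the two tree ids are equal.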
This is why:
- Branches are cheap. A branch is a 41-byte file containing a commit hash. Creating one is free; you’re not copying anything.
- git log is fast. Walking commit parents is just chasing pointers through a hash-keyed object store; no patch replay.
- History is tamper-evident. Change one byte of one old file and every hash from that commit forward changes. Git’s identity is the chain of hashes; this is essentially the same idea blockchains were named for, and it predates them in mainstream tools.
- Rebase works. Re-writing history means producing new commit objects with new hashes; the old ones aren’t mutated, just orphaned until garbage collection.
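The tamper-evidence and branch-size claims can be checked with a few lines of Python. This is a simplified sketch: oid and build_history are hypothetical helpers, and the fields hashed here only stand in for Git's real commit format (which also includes author, committer, and timestamps).

```python
import hashlib


def oid(*parts: str) -> str:
    """Hash the given fields together; stands in for Git's object naming."""
    return hashlib.sha1("\0".join(parts).encode("utf-8")).hexdigest()


def build_history(file_versions: list[str]) -> list[str]:
    """Return one commit id per version; each commit names its parent's id."""
    commits = []
    parent = ""
    for i, content in enumerate(file_versions):
        tree_id = oid("tree", oid("blob", content))
        parent = oid("commit", tree_id, parent, f"commit {i + 1}")
        commits.append(parent)
    return commits


original = build_history(["v1", "v2", "v3"])
tampered = build_history(["v1", "v2 (edited)", "v3"])

# The commit before the edit keeps its id; the edited one and all later ones change.
print([a == b for a, b in zip(original, tampered)])  # [True, False, False]

# A loose branch ref is just the 40-character commit id plus a newline.
ref_file = (original[-1] + "\n").encode("ascii")
print(len(ref_file))  # 41
```

The same mechanics explain rebase: rewriting a commit produces a new id for it and for everything built on top, while the old objects stay in the store untouched.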
“But surely it can’t store full copies forever”
It doesn’t, exactly. The four object types above describe the logical model —
what Git tells the rest of itself it has. Underneath, Git has a second layer
called packfiles.
When a repo grows, git gc rolls many loose objects into a packfile and
there it does delta-compress similar blobs against each other to save space.
The crucial detail: the deltas in a packfile are an internal storage trick, not the model. They aren’t tied to commit history — Git picks whichever pair of similar blobs compresses best, regardless of which commits they belong to. The logical layer is still snapshots-by-hash; the deltas are just zip-like compression underneath.
So the slogan “Git stores snapshots, not diffs” is true at the layer that matters for reasoning about Git, even though at the bytes-on-disk layer Git absolutely uses deltas to save space. The trick is that the deltas don’t define identity. The hash of the snapshot does.
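A toy illustration of "deltas don't define identity": store one blob as a delta against another, rebuild it, and check that its name is unchanged. The delta scheme below (shared prefix and suffix, plus the differing middle) is invented for this sketch and is much cruder than Git's actual pack delta encoding; make_delta and apply_delta are hypothetical names.

```python
import hashlib


def blob_id(data: bytes) -> str:
    """SHA-1 blob name, computed over the full (undeltified) content."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


def make_delta(base: bytes, target: bytes) -> tuple[int, int, bytes]:
    """Toy delta: reuse base's shared prefix and suffix, keep only the middle."""
    limit = min(len(base), len(target))
    prefix = 0
    while prefix < limit and base[prefix] == target[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and base[-1 - suffix] == target[-1 - suffix]:
        suffix += 1
    middle = target[prefix:len(target) - suffix]
    return prefix, suffix, middle


def apply_delta(base: bytes, delta: tuple[int, int, bytes]) -> bytes:
    """Rebuild the full target from the base and the stored delta."""
    prefix, suffix, middle = delta
    return base[:prefix] + middle + base[len(base) - suffix:]


base = b"".join(b"line %d\n" % i for i in range(200))
target = base.replace(b"line 100\n", b"line one hundred\n")

delta = make_delta(base, target)
rebuilt = apply_delta(base, delta)

print(len(target), len(delta[2]))           # full size vs. bytes actually stored
print(blob_id(rebuilt) == blob_id(target))  # True
```

Which base the delta was taken against is irrelevant to the name. The store could pick a different base tomorrow, or none at all, and every hash that refers to this blob would stay valid.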
Where the model shows its seams
A few places it gets weird:
- Large binary files. Snapshots-by-hash is brutal for big binaries that change often: each version is a fully new blob, and packfile deltas don’t compress unrelated binary changes well. This is why Git LFS exists.
- Renames. Git doesn’t track renames as a first-class operation. A rename is a delete plus an add of identical content; git log --follow and git diff -M infer renames after the fact by comparing blob hashes and similarity. This works surprisingly well, but it falls down on heavy edits during a rename.
- SHA-1. Git’s identity layer was built on SHA-1, which is no longer considered cryptographically safe. Migration to SHA-256 has been in progress for years; I don’t have a confident read on how far adoption has gotten in practice. Last I checked it was supported but not the default for most hosting.
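Rename inference can be sketched as "pair each deleted path with the most similar added path." The version below is an approximation under stated assumptions: detect_renames is a hypothetical helper, and difflib's line-based ratio stands in for Git's own similarity score, which is computed differently. The 0.5 threshold mirrors the documented 50% default of -M.

```python
import difflib
import hashlib


def blob_id(data: bytes) -> str:
    header = b"blob " + str(len(data)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


def detect_renames(
    old: dict[str, bytes], new: dict[str, bytes], threshold: float = 0.5
) -> list[tuple[str, str, float]]:
    """Pair deleted paths with added paths whose content is similar enough."""
    deleted = {p: d for p, d in old.items() if p not in new}
    added = {p: d for p, d in new.items() if p not in old}
    renames = []
    for old_path, old_data in deleted.items():
        best = None
        for new_path, new_data in added.items():
            if blob_id(old_data) == blob_id(new_data):
                score = 1.0  # identical blob: an exact rename, no comparison needed
            else:
                matcher = difflib.SequenceMatcher(
                    None, old_data.splitlines(), new_data.splitlines()
                )
                score = matcher.ratio()
            if score >= threshold and (best is None or score > best[1]):
                best = (new_path, score)
        if best is not None:
            renames.append((old_path, best[0], best[1]))
            del added[best[0]]  # each added file can absorb only one rename
    return renames


util = b"def a():\n    return 1\n\ndef b():\n    return 2\n"
before = {"util.py": util, "notes.txt": b"todo\n"}
after = {
    "helpers.py": util.replace(b"return 2", b"return 3"),  # moved and lightly edited
    "notes.txt": b"todo\n",
    "brand_new.py": b"import sys\n",
}

for old_path, new_path, score in detect_renames(before, after):
    print(f"{old_path} -> {new_path} ({score:.0%} similar)")
```

Rewrite most of the file during the move and the score drops below the threshold, at which point the sketch, like Git, reports an unrelated delete and add.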
Famous related terms
- Content-addressed storage: CAS = blob + hash(blob) as its name. The general pattern Git is one example of.
- Hash table: hash table = array + hash function. The same “use the hash to find the thing” idea, scoped to one process. See hash-table.
- Merkle tree: Merkle tree ≈ tree where each node is the hash of its children's hashes. A Git tree object is a Merkle node; this is what makes whole-repo integrity follow from the top commit hash.
Going deeper
- Pro Git, chapter 10 (“Git Internals”). The free book on git-scm.com walks through blobs, trees, commits, and packfiles with hands-on git cat-file examples. Best single source for this material.
- git cat-file -p <hash> and git cat-file -t <hash> on any object in any real repo. Five minutes of poking will make the model concrete in a way no diagram can.
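If you would rather poke from Python, a loose object can be written and read back with nothing but zlib and hashlib. This sketch is a rough approximation of what git cat-file -t and -p report, assuming a SHA-1 repository. It uses a temporary directory standing in for .git/objects, the helper names are made up, and it only handles loose objects (anything already rolled into a packfile needs the real tool).

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_loose_object(objects_dir: Path, obj_type: str, payload: bytes) -> str:
    """Write an object the way a loose object is laid out: objects/ab/cdef..."""
    raw = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00" + payload
    oid = hashlib.sha1(raw).hexdigest()  # name comes from the uncompressed bytes
    path = objects_dir / oid[:2] / oid[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(raw))
    return oid


def read_loose_object(objects_dir: Path, oid: str) -> tuple[str, bytes]:
    """Return (type, payload), roughly what cat-file -t and -p would show."""
    raw = zlib.decompress((objects_dir / oid[:2] / oid[2:]).read_bytes())
    header, _, payload = raw.partition(b"\x00")
    obj_type, size = header.decode("ascii").split(" ")
    assert int(size) == len(payload), "size in header should match payload length"
    return obj_type, payload


with tempfile.TemporaryDirectory() as tmp:
    objects = Path(tmp) / "objects"  # stands in for .git/objects
    oid = write_loose_object(objects, "blob", b"hello world\n")
    obj_type, payload = read_loose_object(objects, oid)
    print(oid)       # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
    print(obj_type)  # blob
    print(payload)   # b'hello world\n'
```

Pointing read_loose_object at a real repository's .git/objects directory should work for any object that has not been packed yet; treat that as an experiment rather than a guarantee.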