
What is a hash function?

A deterministic shrinker that turns any blob of bytes into a fixed-size fingerprint — the same primitive that powers hash tables, Git commits, and password storage.

Computer Science intro Apr 30, 2026

Why it exists

The hash function shows up twice in computing history, from two very different directions, and both stories matter.

The first is data structures. You have a million keys and want lookup that doesn’t grow with the size of the collection. If you can compute directly from the key which slot in an array its value lives in, you don’t have to search at all — you jump there. The thing that does the computing is a hash function. (See hash table for the full story.)
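A toy version of that jump, in Python (a sketch of the idea only; it ignores collisions, which real hash tables must handle, and Python salts string hashes per process, so the slot is only stable within one run):

# Compute the slot directly from the key instead of searching for it.
num_slots = 8
slots = [None] * num_slots

def put(key, value):
    slots[hash(key) % num_slots] = (key, value)   # the key decides its own slot

def get(key):
    entry = slots[hash(key) % num_slots]          # the same computation finds it again
    return entry[1] if entry and entry[0] == key else None

put("apple", 3)
print(get("apple"))   # 3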

The second is cryptography. You want to publish a short fingerprint of a big file so anyone can later check it hasn’t been tampered with — without re-sending the whole file. The fingerprint must be small, deterministic, and hard to forge: nobody should be able to cook up a different file that lands on the same fingerprint.

Both want the same primitive: a small, deterministic, hard-to-collide fingerprint of arbitrary input. The two families it splits into — fast non-cryptographic hashes and slow cryptographic hashes — are tuned for different threat models but share a shape.

Why it matters now

You are using hash functions constantly, usually without noticing: every Git commit ID is one, every password check against a stored digest is one, and every dictionary lookup in your code runs one under the hood.

The short answer

hash function = deterministic map from arbitrary bytes → fixed-size fingerprint

You feed in any number of bytes — one character, a 4 GB video, the entire Linux kernel source — and you get back a fixed-size output (say, 256 bits). Same input always produces the same output. Different inputs should produce different outputs, and for cryptographic hashes it should be infeasible to engineer a clash.
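A quick way to see that shape from Python's standard library (hashlib here is just a convenient stand-in; any SHA-256 implementation behaves the same way):

import hashlib

# Any number of input bytes, always a 256-bit (32-byte) digest.
print(hashlib.sha256(b"hi").hexdigest())              # 8f434346... (matches the diagram below)
print(hashlib.sha256(b"hi" * 1_000_000).hexdigest())  # still exactly 64 hex characters
print(len(hashlib.sha256(b"hi").digest()))            # 32 bytes, every time

# Deterministic: hashing the same bytes twice gives the same fingerprint.
assert hashlib.sha256(b"hi").hexdigest() == hashlib.sha256(b"hi").hexdigest()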

There are two flavors, optimized for opposite goals:

- Non-cryptographic hashes (FNV, MurmurHash, xxHash) chase raw speed. They power hash tables, caches, and checksums, where the only adversary is bad luck.
- Cryptographic hashes (SHA-256, SHA-3, BLAKE3) trade speed for attack resistance. They back integrity checks, digital signatures, and password storage.

Reach for the wrong one and you either ship a slow hash table or a forgeable signature.
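To make the contrast concrete, here is one representative of each family from Python's standard library (zlib.crc32 standing in for the fast non-cryptographic side, SHA-256 for the cryptographic side; a sketch, not a benchmark):

import hashlib
import zlib

data = b"some payload"

# Non-cryptographic: 32-bit output, very fast, fine for catching accidental
# corruption or bucketing keys, trivial for an attacker to forge on purpose.
print(zlib.crc32(data))                  # an unsigned 32-bit integer

# Cryptographic: 256-bit output, slower, designed so nobody can engineer a clash.
print(hashlib.sha256(data).hexdigest())  # 64 hex characters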

How it works

The shape is always the same:

input (any length)  ─►  [ hash function ]  ─►  output (fixed length)
"hi"                ─►       SHA-256       ─►  8f434346… (256 bits)
"hi!"               ─►       SHA-256       ─►  c0535e4b… (256 bits)
4 GB video file     ─►       SHA-256       ─►  3d2b9f1c… (256 bits)

What people actually rely on is a small set of properties. Different hash functions satisfy different subsets of this list.

1. Determinism. Same input, same output. Always. Across machines, runs, processes. Without this nothing else makes sense.

2. Uniform distribution / avalanche effect. A good hash spreads inputs evenly across the output space. Cryptographic hashes go further: flipping one input bit flips about half the output bits, unpredictably. That’s why "hi" and "hi!" above look completely unrelated. (A quick way to measure this is sketched right after the list.)

3. Collision resistance. For a cryptographic hash, it should be infeasible to find any two inputs that hash to the same output — strictly stronger than uniform distribution. MD5 and SHA-1 once claimed this property and have since lost it: collisions have been published for both, so neither should be used where collision resistance matters. SHA-256, SHA-3, and BLAKE3 are the current defaults.

4. Preimage resistance and second-preimage resistance. You can’t run the function backwards, and you can’t take an existing (input, hash) pair and find a different input with the same hash. This is what lets a website store hash(password) instead of the password. (A bare-bones version of that check is sketched below, after the avalanche example.)
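The avalanche claim from property 2 is easy to check: take the two digests from the diagram above, XOR them, and count the bits that flipped. A small Python check (the exact count varies with the input pair, but it lands near half of 256):

import hashlib

a = int.from_bytes(hashlib.sha256(b"hi").digest(), "big")
b = int.from_bytes(hashlib.sha256(b"hi!").digest(), "big")

# XOR the 256-bit digests and count the 1 bits: that's how many output bits
# changed when a single character was appended to the input.
flipped = bin(a ^ b).count("1")
print(flipped, "of 256 bits differ")   # roughly 128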
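And the password-storage case from property 4, sketched with a plain SHA-256 (an illustration of the idea only; real systems add a per-user salt and a deliberately slow construction such as hashlib.pbkdf2_hmac or a dedicated password-hashing library):

import hashlib

def store(password: str) -> str:
    # The site keeps only this digest, never the password itself.
    return hashlib.sha256(password.encode()).hexdigest()

def check(attempt: str, stored_digest: str) -> bool:
    # Re-hash the attempt and compare; preimage resistance means the stored
    # digest alone doesn't let anyone recover the original password.
    return hashlib.sha256(attempt.encode()).hexdigest() == stored_digest

stored = store("correct horse battery staple")
print(check("correct horse battery staple", stored))   # True
print(check("wrong guess", stored))                    # False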

Not a feature: hash functions are not encryption. There is no key, and the output is meant to discard information — a 4 GB file does not fit in 256 bits, so the mapping cannot be reversible even in principle. If someone asks you to “decrypt this hash,” they’re confused about what a hash is.

The cleanest way to remember the split: every hash function has property 1 and tries for property 2. Cryptographic hashes additionally promise 3 and 4.

Going deeper

Honest gap: no throughput numbers here — those move with CPU generations and implementation quality, and a stale benchmark is worse than none. Measure on your own hardware if it matters.