What is a hash function?
A deterministic shrinker that turns any blob of bytes into a fixed-size fingerprint — the same primitive that powers hash tables, Git commits, and password storage.
Why it exists
The hash function shows up twice in computing history, from two very different directions, and both stories matter.
The first is data structures. You have a million keys and want lookup that doesn’t grow with the size of the collection. If you can compute directly from the key which slot in an array its value lives in, you don’t have to search at all — you jump there. The thing that does the computing is a hash function. (See hash table for the full story.)
The second is cryptography. You want to publish a short fingerprint of a big file so anyone can later check it hasn’t been tampered with — without re-sending the whole file. The fingerprint must be small, deterministic, and hard to forge: nobody should be able to cook up a different file that lands on the same fingerprint.
Both want the same primitive: a small, deterministic, hard-to-collide fingerprint of arbitrary input. The two families it splits into — fast non-cryptographic hashes and slow cryptographic hashes — are tuned for different threat models but share a shape.
Why it matters now
You are using hash functions constantly, usually without noticing.
- Every `dict`, `Map`, `HashMap`, `Set` in your language's stdlib.
- Every Git commit ID is a hash of the commit's contents.
- Package managers verify downloads against a published hash.
- Password storage stores the hash, not the password.
- Blockchains chain blocks together by hash.
- Content-addressed storage (IPFS, S3 ETags, Docker layers) names files by their hash so identical content collapses to one copy.
- HMAC wraps a hash function to authenticate messages inside TLS.
The short answer
hash function = deterministic map from arbitrary bytes → fixed-size fingerprint
You feed in any number of bytes — one character, a 4 GB video, the entire Linux kernel source — and you get back a fixed-size output (say, 256 bits). Same input always produces the same output. Different inputs should produce different outputs, and for cryptographic hashes it should be infeasible to engineer a clash.
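A minimal sketch of that contract in Python, using the standard library's hashlib; the inputs and the file path are just placeholders:

```python
import hashlib

# Same input, same fixed-size output, regardless of input length.
print(hashlib.sha256(b"hi").hexdigest())   # 64 hex chars = 256 bits
print(hashlib.sha256(b"hi").hexdigest())   # identical on every run, every machine
print(hashlib.sha256(b"hi!").hexdigest())  # one extra byte, completely different output

# A multi-gigabyte file produces the same 256 bits; feed it in chunks.
h = hashlib.sha256()
with open("video.mp4", "rb") as f:                    # placeholder path
    for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB at a time
        h.update(chunk)
print(h.hexdigest())
```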
There are two flavors, optimized for opposite goals:
- Non-cryptographic (xxHash, MurmurHash, FNV) — as fast as possible while spreading inputs evenly. Used inside hash tables and bloom filters. Speed is the feature; security isn’t on the menu.
- Cryptographic (SHA-256, SHA-3, BLAKE3) — slower, with extra guarantees about collisions and one-wayness. Used wherever an adversary might try to forge a fingerprint.
Reach for the wrong one and you either ship a slow hash table or a forgeable signature.
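To make the split concrete, here is a rough sketch. xxHash and MurmurHash live in third-party packages, so zlib.crc32 stands in below as the fast, non-cryptographic end of the spectrum; the point is the contrast in output size and intent, not a benchmark.

```python
import hashlib
import zlib

data = b"some key or file contents"

# Non-cryptographic territory: small output, very cheap, trivially forgeable.
# Fine for bucketing and checksums, useless against an adversary.
print(zlib.crc32(data))                  # 32-bit integer

# Cryptographic: 256-bit output with collision and preimage resistance.
print(hashlib.sha256(data).hexdigest())  # 64 hex characters
```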
How it works
The shape is always the same:
input (any length) ─► [ hash function ] ─► output (fixed length)
"hi" ─► SHA-256 ─► 8f434346… (256 bits)
"hi!" ─► SHA-256 ─► c0535e4b… (256 bits)
4 GB video file ─► SHA-256 ─► 3d2b9f1c… (256 bits)
What people actually rely on is a small set of properties. Different hash functions satisfy different subsets of this list.
1. Determinism. Same input, same output. Always. Across machines, runs, processes. Without this nothing else makes sense.
2. Uniform distribution / avalanche effect. A good hash spreads inputs evenly across the output space. Cryptographic hashes go further: flipping one input bit flips about half the output bits, unpredictably. That's why "hi" and "hi!" above look completely unrelated (the sketch after this list counts the flipped bits).
3. Collision resistance. For a cryptographic hash, it should be infeasible to find any two inputs that hash to the same output — strictly stronger than uniform distribution. MD5 and SHA-1 once claimed this property and have since lost it; both have published collisions and shouldn’t be used where collision resistance matters. SHA-256, SHA-3, and BLAKE3 are the current defaults.
4. Preimage resistance and second-preimage resistance. You can't run the function backwards, and you can't take an existing (input, hash) pair and find a different input with the same hash. This is what lets a website store hash(password) instead of the password.
Not a feature: hash functions are not encryption. There is no key, and the output is meant to discard information — a 4 GB file does not fit in 256 bits, so the mapping cannot be reversible even in principle. If someone asks you to “decrypt this hash,” they’re confused about what a hash is.
The cleanest way to remember the split: every hash function has property 1 and tries for property 2. Cryptographic hashes additionally promise 3 and 4.
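One way to see properties 1 and 2 in action, sketched with hashlib: hash two inputs that differ by a single character and count how many of the 256 output bits disagree. Expect a number near 128, though the exact count varies by input.

```python
import hashlib

def sha256_as_int(data: bytes) -> int:
    """Return the SHA-256 digest as a 256-bit integer."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

a = sha256_as_int(b"hi")
b = sha256_as_int(b"hi!")

# XOR leaves a 1 bit wherever the two digests disagree; count those bits.
flipped = bin(a ^ b).count("1")
print(f"{flipped} of 256 output bits differ")  # typically close to half
```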
Famous related terms
- Hash table — `hash table = array + hash function` — the canonical non-cryptographic use; the hash picks a bucket.
- HMAC — `HMAC ≈ keyed hash for message authentication` — wraps a hash function with a secret key so only key-holders can produce valid tags. Used inside TLS, JWTs, AWS request signing (a sketch follows this list).
- Salt — `salt = unique random prefix per input` — defeats rainbow-table precomputation; see password hashing.
- Merkle tree — `Merkle tree = binary tree of hashes of hashes` — lets you prove one leaf is in a huge dataset by showing only a logarithmic chain of hashes. Powers Git, Bitcoin, certificate transparency.
- Bloom filter — `bloom filter = bit array + k hash functions` — a probabilistic set-membership test that trades false positives for tiny memory.
- Cryptographic vs non-cryptographic — same shape, different tunings: xxHash is fast and forgeable; SHA-256 is slower and (currently) not. Pick by threat model, not by familiarity.
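For the HMAC entry above, a minimal sketch with Python's standard hmac module; the key and message are placeholder values:

```python
import hashlib
import hmac

key = b"shared-secret-key"        # placeholder; in practice loaded from secure config
message = b"amount=100&to=alice"  # placeholder message

# Only holders of the key can produce (or check) a valid tag for this message.
tag = hmac.new(key, message, hashlib.sha256).hexdigest()
print(tag)

# Verification recomputes the tag and compares in constant time.
expected = hmac.new(key, message, hashlib.sha256).hexdigest()
print(hmac.compare_digest(tag, expected))  # True
```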
Going deeper
- Introduction to Algorithms (CLRS), the hash-table chapter — best treatment of the data-structure side, including what “universal hashing” buys you.
- Bruce Schneier, Applied Cryptography — long-running reference on the cryptographic side; older editions predate SHA-3 and BLAKE3 but the framing of collision/preimage/second-preimage is timeless.
- The Keccak / SHA-3 design rationale from the Keccak team — worth reading once to see how a modern cryptographic hash is built (sponge construction), as opposed to the Merkle–Damgård lineage of MD5/SHA-1/SHA-2.
Honest gap: no throughput numbers here — those move with CPU generations and implementation quality, and a stale benchmark is worse than none. Measure on your own hardware if it matters.