Why compression works at all
Zip a folder of text and it shrinks; zip the zip and it doesn’t. Compression isn’t magic — it only ever exploits the patterns that were already there.
Why it exists
You take a folder of plain text files or an uncompressed bitmap, right-click, “Compress to .zip”, and it comes out a fraction of the size. So you try the obvious trick: zip the zip. It comes out the same size again — maybe a few bytes bigger. Run it a third time and nothing happens at all. If compression is just “make files smaller,” why does it refuse to keep going?
The answer is the whole point of this post: a compressor doesn’t make data smaller, it makes redundancy smaller. The original file was full of predictable structure — repeated words, repeated byte sequences, long runs of the same value. The first pass spent that structure. What’s left looks, statistically, like random noise, and noise has nothing left to exploit. The surprising claim underneath the everyday tool is: random data can’t be losslessly compressed — not on average, and not for almost any individual file. Compression is not a property of “shrinking” — it’s a property of prediction.
Why it matters now
This isn’t a museum fact; it sets hard limits on systems people use every day. It’s why a video codec leans on interframe prediction — consecutive frames are nearly identical, and that near-identity is the compressible part. It’s why already-compressed formats (JPEG, MP4, .zip) barely shrink when you zip them again, so backup tools skip the wasted second pass. It’s why encrypted traffic is roughly incompressible — good encryption is designed to look random, which by this same argument means there’s nothing to squeeze. And it’s why “infinite compression” pitches are a reliable tell for a scam: the counting argument below rules them out.
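You can watch the wasted second pass directly with Python's standard-library zlib (a quick sketch; the exact byte counts depend on the input and the zlib version):

```python
import zlib

# Any redundant input will do; here, one sentence repeated many times.
data = ("row after row of ordinary English text, full of repeated words and "
        "repeated phrases, the kind of thing every archive is made of. " * 300).encode()

once = zlib.compress(data, 9)    # first pass: spends the redundancy
twice = zlib.compress(once, 9)   # second pass: sees only noise-like bytes

print(len(data), len(once), len(twice))
# The repeated text collapses to a few hundred bytes; the second pass comes back
# a few bytes larger than the first, because the patterns are already gone.
```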
The short answer
compression = a model of what's likely + spend fewer bits on the likely cases
A compressor works by predicting. It builds (explicitly or implicitly) a model of which patterns are common, then assigns short codes to common things and long codes to rare things. Total size goes down only because the data really was predictable. Feed it data with no pattern — where every continuation is equally likely — and there are no “common cases” to give short codes to. The model has nothing to say, so on average the output is no smaller than the input, and the rare input that does shrink only does so by luck.
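Here is what "short codes for common things" looks like made concrete: a minimal Huffman coder in Python (my own sketch, not any particular library's API; it ignores the codebook you would need to ship alongside the coded bits).

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a prefix-free code in which frequent symbols get short codewords."""
    freq = Counter(text)
    # Heap entries: (total weight, tie-break id, {symbol: codeword so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)      # pop the two least likely subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]  # (single-symbol inputs would need a special case; skipped here)

text = "predictable text is full of repeated letters, and repetition is predictability"
code = huffman_code(text)
coded_bits = sum(len(code[ch]) for ch in text)
print(f"{8 * len(text)} raw bits -> {coded_bits} coded bits (codebook not counted)")
for ch, codeword in sorted(code.items(), key=lambda kv: len(kv[1]))[:4]:
    print(repr(ch), "->", codeword)   # the most common symbols get the shortest codewords
```

The total only shrinks because some symbols really are more frequent than others; run it on text where every symbol is equally common and the coded size approaches the raw size.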
How it works
A worked example: run-length encoding
Take the simplest real compressor. Suppose a scanline of a black-and-white image is:
WWWWWWWWWWWWBWWWWWWWWWWWWBBB
28 characters. Run-length encoding rewrites it as “12 W, 1 B, 12 W, 3 B”:
12W1B12W3B
10 characters. It worked — but look at why. The input had long runs, i.e. each pixel strongly predicted the next. Now feed the same encoder a scanline with no runs:
WBWBWBWBWBWBWBWBWBWBWBWBWBW
Run-length encoding produces 1W1B1W1B... — longer than the
original. Note this string isn’t random — “repeat WB” is an obvious
pattern — it just has no run-length redundancy, the only kind this
encoder knows how to spend. Every general-purpose compressor is this
same bet at a larger scale: it wins on inputs that match its model and
loses on inputs that don’t.
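Here is that bet in runnable form: a toy run-length coder (a sketch of the idea; real RLE formats pack counts into bytes rather than decimal digits) reproduces both outcomes.

```python
import re

def rle_encode(s):
    """Replace each run of identical characters with count + character."""
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append(f"{j - i}{s[i]}")
        i = j
    return "".join(out)

def rle_decode(s):
    """Invert rle_encode, proving no information was thrown away."""
    return "".join(ch * int(n) for n, ch in re.findall(r"(\d+)(\D)", s))

runs = "W" * 12 + "B" + "W" * 12 + "B" * 3   # the run-heavy scanline above
no_runs = "WB" * 13 + "W"                     # the alternating scanline

for line in (runs, no_runs):
    enc = rle_encode(line)
    assert rle_decode(enc) == line
    print(f"{len(line)} chars -> {len(enc)} chars")
# 28 -> 10 on the run-heavy line, 27 -> 54 on the alternating one.
```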
Why “lose on some inputs” is unavoidable
Here’s the precise version of the surprising claim — and it’s a counting argument, not hand-waving.
A lossless compressor must be invertible: distinct inputs map to distinct outputs, or you couldn’t decompress unambiguously. Now count. There are exactly 2¹⁰⁰ possible 100-bit files. There are only 2⁰ + 2¹ + … + 2⁹⁹ = 2¹⁰⁰ − 1 files shorter than 100 bits — fewer than the number of inputs. So no lossless compressor can map every 100-bit input to a shorter output: there isn’t enough room. By the pigeonhole principle, at least one 100-bit input must come out the same length or longer.
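The count is easy to reproduce; the snippet below is nothing more than the argument above in integer arithmetic (the shrink-by-k bound anticipates the next paragraph):

```python
n = 100
inputs = 2 ** n                                      # distinct 100-bit files
shorter = sum(2 ** length for length in range(n))    # files of length 0..99 bits
print(shorter == inputs - 1)                         # True: one fewer output than inputs

# The same count also bounds how many inputs can shrink by a lot: there are
# only 2**(n - k + 1) - 1 files that are at least k bits shorter than n bits.
for k in (1, 8, 32):
    fraction = (2 ** (n - k + 1) - 1) / inputs
    print(f"at most {fraction:.1e} of all {n}-bit files can shrink by >= {k} bits")
```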
The honest statement is therefore not “compressors always make some
file bigger by a lot” — it’s: no lossless compressor shrinks every
input, and only a small fraction of inputs can shrink by much. (If
many inputs shrank a lot, there wouldn’t be enough short outputs to
hold them — same counting again.) Real compressors are useful anyway
because the files people actually have — text, images, audio, code —
are a tiny, highly structured corner of the space of all possible
files. Compressors are tuned to win big on that corner and usually pay
only a small overhead on inputs that don’t fit their model — though a
mismatch can cost more, as the alternating-WB case above shows.
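A quick empirical version of that asymmetry, using zlib on a repetitive text-like input versus the same number of bytes from os.urandom (sizes vary by a few bytes between runs):

```python
import os
import zlib

structured = ("def mostly_boring_code():\n"
              "    return 'text, code and images are full of repeats'\n" * 2000).encode()
unstructured = os.urandom(len(structured))   # a stand-in for encrypted or random data

for name, data in (("structured", structured), ("random", unstructured)):
    out = zlib.compress(data, 9)
    print(f"{name:>10}: {len(data):>7} bytes -> {len(out):>7} bytes")
# The structured input shrinks by orders of magnitude; the random one comes out
# a few bytes larger, which is the small overhead of a mismatched model.
```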
Where entropy comes in
Information entropy makes “how predictable” into a number. The entropy of a data source — the process generating the symbols — is the average number of bits per symbol you’d need to encode it optimally, and it is a hard floor. No lossless scheme can average fewer bits per symbol than the source’s entropy; this is Shannon’s source-coding theorem. (It’s a statement about the source’s distribution, not any single finite file — but it’s what pins down what’s possible.)
That reframes everything above cleanly:
- A structured source (English text, an uncompressed image) has entropy well below its raw size — the next letter, the next pixel, is often guessable. The gap between raw size and entropy is exactly the room a compressor has to work in.
- A uniform random source has entropy equal to its raw size — every symbol is maximally surprising given the rest. Floor meets ceiling; there’s no gap, so there’s nothing to compress.
- Compressed output has had the redundancy its compressor’s model could find already removed. That’s why the second .zip pass does almost nothing — the patterns ZIP looks for are gone, even if the output isn’t literally at the entropy floor.
So “you can’t compress random data” isn’t a quirk of .zip. It’s the
source-coding theorem stated in plain words: you can’t represent a
source in fewer bits per symbol than its own unpredictability.
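Entropy itself is a one-line formula, short enough to compute for the sources above (a direct calculation; the probabilities are illustrative, not measured from any dataset):

```python
import math

def entropy(probs):
    """Shannon entropy, in bits per symbol, of a source distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The run-heavy scanline's source: 'W' far more likely than 'B'.
print(entropy([0.95, 0.05]))     # ~0.29 bits/symbol: lots of room below 1 bit/pixel
# A fair coin over the same two symbols:
print(entropy([0.5, 0.5]))       # exactly 1.0 bit/symbol: floor meets ceiling
# Uniform random bytes:
print(entropy([1 / 256] * 256))  # exactly 8.0 bits/symbol: nothing to squeeze out
```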
One honest seam
“Random” here means no exploitable structure for the model in use —
not “produced by a random process” in some cosmic sense. Data can look
random to one compressor and be wildly compressible to another that
knows the right pattern. The digits of π look like noise to gzip, but
“the first million digits of π” is a short description if your
decompressor knows how to compute π. Entropy is always relative to a
model; there is no model-free notion of “the” compressibility of a
specific finite string that’s also computable in practice. The
counting argument still holds regardless — no model wins on
everything.
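In the same spirit as the π example, here is a concrete instance of that seam, swapping π for a seeded pseudorandom generator (a sketch; needs Python 3.9+ for randbytes, and the exact compressed size varies):

```python
import random
import zlib

n = 100_000
data = random.Random(42).randbytes(n)   # fully determined by the tiny seed 42
print(len(zlib.compress(data, 9)))      # slightly MORE than 100_000: zlib sees only noise
# Yet "Mersenne Twister, seed 42, n = 100000" describes the whole string in a few
# dozen characters, for a decompressor that happens to know that model.
```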
Famous related terms
- Entropy — `entropy = average bits per symbol at optimal encoding` — the hard floor compression can approach but never beat. See why entropy uses log base 2.
- Huffman coding — `Huffman = greedy prefix-free code from symbol frequencies` — the classic “short codes for common symbols” scheme; its expected code length is provably within one bit per symbol of the source entropy (not counting the codebook you ship alongside).
- Run-length encoding — `RLE = replace a run with (value, count)` — the simplest compressor, and the clearest demonstration that the win comes entirely from redundancy in the input.
- Lossy compression — `lossy = throw away detail humans won’t miss, compress the rest` — JPEG and MP4 sidestep the entropy floor by changing the data first, not by beating the theorem.
- Kolmogorov complexity — `K(x) = length of the shortest program that outputs x` — the model-free ideal of “compressibility”; it is provably uncomputable, which is why real compressors use fixed models instead.
Going deeper
- Shannon, C. E. (1948), A Mathematical Theory of Communication — the source-coding theorem, for the precise statement that entropy is the lower bound on lossless compression.
- Elements of Information Theory by Cover and Thomas — the chapter on data compression works the Huffman/entropy-gap relationship out properly, if you want the theorem with its proof rather than its shape.