Why compression works at all
Zip a folder of text and it shrinks; zip the zip and it doesn’t. Compression isn’t magic — it only ever exploits the patterns that were already there.
Why it exists
You take a folder of plain text files or an uncompressed bitmap, right-click, “Compress to .zip”, and it comes out a fraction of the size. So you try the obvious trick: zip the zip. It comes out the same size again — maybe a few bytes bigger. Run it a third time and nothing happens at all. If compression is just “make files smaller,” why does it refuse to keep going?
The answer is the whole point of this post: a compressor doesn’t make data smaller, it makes redundancy smaller. The original file was full of predictable structure — repeated words, repeated byte sequences, long runs of the same value. The first pass spent that structure. What’s left looks, statistically, like random noise, and noise has nothing left to exploit. The surprising claim underneath the everyday tool is: random data can’t be losslessly compressed — not on average, and not for almost any individual file. Compression is not a property of “shrinking” — it’s a property of prediction.
Why it matters now
This isn’t a museum fact; it sets hard limits on systems people use every day. It’s why a video codec leans on interframe prediction — consecutive frames are nearly identical, and that near-identity is the compressible part. It’s why already-compressed formats (JPEG, MP4, .zip) barely shrink when you zip them again, so backup tools skip the wasted second pass. It’s why encrypted traffic is roughly incompressible — good encryption is designed to look random, which by this same argument means there’s nothing to squeeze. And it’s why “infinite compression” pitches are a reliable tell for a scam: the counting argument below rules them out.
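You can watch the wasted second pass directly with Python's standard-library zlib (a quick sketch; the exact byte counts depend on the input and the zlib version):

```python
import zlib

# Any redundant input will do; here, one sentence repeated many times.
data = ("row after row of ordinary English text, full of repeated words and "
        "repeated phrases, the kind of thing every archive is made of. " * 300).encode()

once = zlib.compress(data, 9)    # first pass: spends the redundancy
twice = zlib.compress(once, 9)   # second pass: sees only noise-like bytes

print(len(data), len(once), len(twice))
# The repeated text collapses to a few hundred bytes; the second pass comes back
# a few bytes larger than the first, because the patterns are already gone.
```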
The short answer
compression = a model of what's likely + spend fewer bits on the likely cases
A compressor works by predicting. It builds (explicitly or implicitly) a model of which patterns are common, then assigns short codes to common things and long codes to rare things. Total size goes down only because the data really was predictable. Feed it data with no pattern — where every continuation is equally likely — and there are no “common cases” to give short codes to. The model has nothing to say, so on average the output is no smaller than the input, and the rare input that does shrink only does so by luck.
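Here is what "short codes for common things" looks like made concrete: a minimal Huffman coder in Python (my own sketch, not any particular library's API; it ignores the codebook you would need to ship alongside the coded bits).

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a prefix-free code in which frequent symbols get short codewords."""
    freq = Counter(text)
    # Heap entries: (total weight, tie-break id, {symbol: codeword so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)      # pop the two least likely subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]  # (single-symbol inputs would need a special case; skipped here)

text = "predictable text is full of repeated letters, and repetition is predictability"
code = huffman_code(text)
coded_bits = sum(len(code[ch]) for ch in text)
print(f"{8 * len(text)} raw bits -> {coded_bits} coded bits (codebook not counted)")
for ch, codeword in sorted(code.items(), key=lambda kv: len(kv[1]))[:4]:
    print(repr(ch), "->", codeword)   # the most common symbols get the shortest codewords
```

The total only shrinks because some symbols really are more frequent than others; run it on text where every symbol is equally common and the coded size approaches the raw size.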
How it works
A worked example: run-length encoding
Take the simplest real compressor. Suppose a scanline of a black-and-white image is:
WWWWWWWWWWWWBWWWWWWWWWWWWBBB
28 characters. Run-length encoding rewrites it as “12 W, 1 B, 12 W, 3 B”:
12W1B12W3B
10 characters. It worked — but look at why. The input had long runs, i.e. each pixel strongly predicted the next. Now feed the same encoder a scanline with no runs:
WBWBWBWBWBWBWBWBWBWBWBWBWBW
Run-length encoding produces 1W1B1W1B... — longer than the
original. Note this string isn’t random — “repeat WB” is an obvious
pattern — it just has no run-length redundancy, the only kind this
encoder knows how to spend. Every general-purpose compressor is this
same bet at a larger scale: it wins on inputs that match its model and
loses on inputs that don’t.
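Here is that bet in runnable form: a toy run-length coder (a sketch of the idea; real RLE formats pack counts into bytes rather than decimal digits) reproduces both outcomes.

```python
import re

def rle_encode(s):
    """Replace each run of identical characters with count + character."""
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append(f"{j - i}{s[i]}")
        i = j
    return "".join(out)

def rle_decode(s):
    """Invert rle_encode, proving no information was thrown away."""
    return "".join(ch * int(n) for n, ch in re.findall(r"(\d+)(\D)", s))

runs = "W" * 12 + "B" + "W" * 12 + "B" * 3   # the run-heavy scanline above
no_runs = "WB" * 13 + "W"                     # the alternating scanline

for line in (runs, no_runs):
    enc = rle_encode(line)
    assert rle_decode(enc) == line
    print(f"{len(line)} chars -> {len(enc)} chars")
# 28 -> 10 on the run-heavy line, 27 -> 54 on the alternating one.
```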
Why “lose on some inputs” is unavoidable
Here’s the precise version of the surprising claim — and it’s a counting argument, not hand-waving.
A lossless compressor must be invertible: distinct inputs map to distinct outputs, or you couldn’t decompress unambiguously. Now count. There are exactly 2¹⁰⁰ possible 100-bit files. There are only 2⁰ + 2¹ + … + 2⁹⁹ = 2¹⁰⁰ − 1 files shorter than 100 bits — fewer than the number of inputs. So no lossless compressor can map every 100-bit input to a shorter output: there isn’t enough room. By the pigeonhole principle, at least one 100-bit input must come out the same length or longer.
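The count is easy to reproduce; the snippet below is nothing more than the argument above in integer arithmetic (the shrink-by-k bound anticipates the next paragraph):

```python
n = 100
inputs = 2 ** n                                      # distinct 100-bit files
shorter = sum(2 ** length for length in range(n))    # files of length 0..99 bits
print(shorter == inputs - 1)                         # True: one fewer output than inputs

# The same count also bounds how many inputs can shrink by a lot: there are
# only 2**(n - k + 1) - 1 files that are at least k bits shorter than n bits.
for k in (1, 8, 32):
    fraction = (2 ** (n - k + 1) - 1) / inputs
    print(f"at most {fraction:.1e} of all {n}-bit files can shrink by >= {k} bits")
```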
The honest statement is therefore not “compressors always make some
file bigger by a lot” — it’s: no lossless compressor shrinks every
input, and only a small fraction of inputs can shrink by much. (If
many inputs shrank a lot, there wouldn’t be enough short outputs to
hold them — same counting again.) Real compressors are useful anyway
because the files people actually have — text, images, audio, code —
are a tiny, highly structured corner of the space of all possible
files. Compressors are tuned to win big on that corner and usually pay
only a small overhead on inputs that don’t fit their model — though a
mismatch can cost more, as the alternating-WB case above shows.
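A quick empirical version of that asymmetry, using zlib on a repetitive text-like input versus the same number of bytes from os.urandom (sizes vary by a few bytes between runs):

```python
import os
import zlib

structured = ("def mostly_boring_code():\n"
              "    return 'text, code and images are full of repeats'\n" * 2000).encode()
unstructured = os.urandom(len(structured))   # a stand-in for encrypted or random data

for name, data in (("structured", structured), ("random", unstructured)):
    out = zlib.compress(data, 9)
    print(f"{name:>10}: {len(data):>7} bytes -> {len(out):>7} bytes")
# The structured input shrinks by orders of magnitude; the random one comes out
# a few bytes larger, which is the small overhead of a mismatched model.
```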
Where entropy comes in
Information entropy makes “how predictable” into a number. The entropy of a data source — the process generating the symbols — is the average number of bits per symbol you’d need to encode it optimally, and it is a hard floor. No lossless scheme can average fewer bits per symbol than the source’s entropy; this is Shannon’s source-coding theorem. (It’s a statement about the source’s distribution, not any single finite file — but it’s what pins down what’s possible.)
That reframes everything above cleanly:
- A structured source (English text, an uncompressed image) has entropy well below its raw size — the next letter, the next pixel, is often guessable. The gap between raw size and entropy is exactly the room a compressor has to work in.
- A uniform random source has entropy equal to its raw size — every symbol is maximally surprising given the rest. Floor meets ceiling; there’s no gap, so there’s nothing to compress.
- Compressed output has had the redundancy its compressor’s model could find already removed. That’s why the second .zip pass does almost nothing — the patterns ZIP looks for are gone, even if the output isn’t literally at the entropy floor.
So “you can’t compress random data” isn’t a quirk of .zip. It’s the
source-coding theorem stated in plain words: you can’t represent a
source in fewer bits per symbol than its own unpredictability.
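Entropy itself is a one-line formula, short enough to compute for the sources above (a direct calculation; the probabilities are illustrative, not measured from any dataset):

```python
import math

def entropy(probs):
    """Shannon entropy, in bits per symbol, of a source distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The run-heavy scanline's source: 'W' far more likely than 'B'.
print(entropy([0.95, 0.05]))     # ~0.29 bits/symbol: lots of room below 1 bit/pixel
# A fair coin over the same two symbols:
print(entropy([0.5, 0.5]))       # exactly 1.0 bit/symbol: floor meets ceiling
# Uniform random bytes:
print(entropy([1 / 256] * 256))  # exactly 8.0 bits/symbol: nothing to squeeze out
```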
One honest seam
“Random” here means no exploitable structure for the model in use —
not “produced by a random process” in some cosmic sense. Data can look
random to one compressor and be wildly compressible to another that
knows the right pattern. The digits of π look like noise to gzip, but
“the first million digits of π” is a short description if your
decompressor knows how to compute π. Entropy is always relative to a
model; there is no model-free notion of “the” compressibility of a
specific finite string that’s also computable in practice. The
counting argument still holds regardless — no model wins on
everything.
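In the same spirit as the π example, here is a concrete instance of that seam, swapping π for a seeded pseudorandom generator (a sketch; needs Python 3.9+ for randbytes, and the exact compressed size varies):

```python
import random
import zlib

n = 100_000
data = random.Random(42).randbytes(n)   # fully determined by the tiny seed 42
print(len(zlib.compress(data, 9)))      # slightly MORE than 100_000: zlib sees only noise
# Yet "Mersenne Twister, seed 42, n = 100000" describes the whole string in a few
# dozen characters, for a decompressor that happens to know that model.
```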
Famous related terms
- Entropy — `entropy = average bits per symbol at optimal encoding` — the hard floor compression can approach but never beat. See why entropy uses log base 2.
- Huffman coding — `Huffman = greedy prefix-free code from symbol frequencies` — the classic “short codes for common symbols” scheme; its expected code length is provably within one bit per symbol of the source entropy (not counting the codebook you ship alongside).
- Run-length encoding — `RLE = replace a run with (value, count)` — the simplest compressor, and the clearest demonstration that the win comes entirely from redundancy in the input.
- Lossy compression — `lossy = throw away detail humans won’t miss, compress the rest` — JPEG and MP4 sidestep the entropy floor by changing the data first, not by beating the theorem.
- Kolmogorov complexity — `K(x) = length of the shortest program that outputs x` — the model-free ideal of “compressibility”; it is provably uncomputable, which is why real compressors use fixed models instead.
Going deeper
- Shannon, C. E. (1948), A Mathematical Theory of Communication — the source-coding theorem, for the precise statement that entropy is the lower bound on lossless compression.
- Elements of Information Theory by Cover and Thomas — the chapter on data compression works the Huffman/entropy-gap relationship out properly, if you want the theorem with its proof rather than its shape.