Why does information entropy use log base 2?
Shannon could have picked any base for the logarithm in his entropy formula. He picked 2 — and the choice quietly fixes the unit you measure information in.
Why it exists
Open any reference on information theory and the same formula stares back at you:
H(X) = − Σ p(x) · log₂ p(x)
That little subscript 2 looks decorative — like the author had to pick
some base, and 2 was as good as any. It isn’t decorative. The base of
the logarithm is the part of the formula that decides what unit you’re
measuring in. Pick log₂, you get answers in bits. Pick ln, you
get nats. Pick log₁₀, you get dits (also called bans or
hartleys). The numerical answer changes by a constant factor; the
meaning of “1 unit of information” changes with it.
The interesting question isn’t “which base is mathematically correct?” — all of them are, because logs in different bases differ only by a constant. The interesting question is: why did the convention land on base 2, and what does it mean that we measure information in powers of two?
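To make the constant-factor relationship concrete, here is a minimal sketch in plain Python (the example distribution is arbitrary, chosen only for illustration) that computes the same entropy in all three bases and checks that they agree up to the conversion factor:

```python
import math

def entropy(probs, base=2):
    """Shannon entropy of a distribution, in the unit fixed by `base`."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# An arbitrary example distribution over four outcomes.
p = [0.5, 0.25, 0.125, 0.125]

bits = entropy(p, base=2)        # 1.75   bits
nats = entropy(p, base=math.e)   # ~1.213 nats
dits = entropy(p, base=10)       # ~0.527 dits (hartleys)

# All three describe the same uncertainty; they differ only by a constant factor.
assert abs(nats / math.log(2) - bits) < 1e-12    # nats -> bits: divide by ln 2
assert abs(dits / math.log10(2) - bits) < 1e-12  # dits -> bits: divide by log10 2
```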
Why it matters now
Every working programmer is steeped in base-2 thinking already. A byte is 8 bits. A 32-bit integer can take 2³² values. A cryptographic hash with 256 bits of output offers up to 256 bits of preimage resistance (and about 128 bits of collision resistance, since the birthday bound halves the exponent). A password has “X bits of entropy.” When a model card says a tokenizer has a vocabulary of 50,257 tokens, the implicit follow-up is “so each token is up to log₂(50257) ≈ 15.6 bits of information.”
These numbers all live in the same unit because the underlying
machinery — memory cells, network packets, hash digests, model logits
turned into token IDs — is binary. Choosing log₂ means the entropy
formula speaks the same language as the hardware. That’s not a deep
mathematical truth; it’s an engineering convention that happened to
stick because it’s useful. And it’s useful because the world we built
on top of Shannon is a world of yes/no decisions.
The short answer
information(event) = log₂(1 / p(event)) bits
A bit is the answer to one yes/no question. If an event has probability 1/2, learning that it happened tells you exactly one bit — one yes/no’s worth of news. If it has probability 1/4, it takes two yes/no questions to pin down, so it carries two bits. Entropy is just the expected number of yes/no questions you’d have to answer to identify a random outcome from the distribution. Base 2 falls out the moment you decide a “question” is binary.
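A few lines of Python make that reading literal (a sketch; the probabilities are just examples):

```python
import math

def information_bits(p):
    """Self-information of an event with probability p, in bits."""
    return math.log2(1 / p)

print(information_bits(1/2))   # 1.0    -> one yes/no question
print(information_bits(1/4))   # 2.0    -> two yes/no questions
print(information_bits(1/8))   # 3.0    -> three yes/no questions
print(information_bits(1/3))   # ~1.585 -> fractional bits are fine; they only
                               #           average out over many events
```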
How it works
Shannon’s 1948 paper A Mathematical Theory of Communication sets up
information from three axioms about how a sensible measure of “surprise”
should behave: it should be a function of probability only, it should
be smooth, and the surprise from two independent events should add.
The unique function (up to a constant) that satisfies those is
−log p. The base of the log is the only free parameter, and Shannon
chooses base 2 explicitly because the engineering context — telegraphy,
telephony, early digital systems — is binary. He even names the unit
bit (a contraction of binary digit, attributed in the paper to
John Tukey).
The “axioms → log” step is the deep part. The “base 2” step is purely about what unit you want your answers in.
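The additivity requirement is the one doing the real work, and it is easy to check numerically that −log delivers it. A quick sketch, with arbitrary probabilities:

```python
import math

def surprise_bits(p):
    """Surprise (self-information) of an event with probability p, in bits."""
    return -math.log2(p)

p_a, p_b = 0.3, 0.2    # two independent events
p_both = p_a * p_b     # joint probability under independence

# The surprise of "both happened" equals the sum of the individual surprises,
# because the log turns multiplication of probabilities into addition.
assert math.isclose(surprise_bits(p_both),
                    surprise_bits(p_a) + surprise_bits(p_b))
```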
The yes/no-question intuition
Forget the formula for a second. Imagine I pick a number from 1 to 8
uniformly at random and you have to guess it with yes/no questions. The
optimal strategy is binary search: “Is it ≤ 4?”, “Is it ≤ 2?”, “Is it
1?”. Three questions, always. And log₂(8) = 3.
Now imagine the distribution is skewed — I pick 1 with probability 1/2, 2 with probability 1/4, and 3, 4, 5, 6, 7, 8 each with probability 1/24. The optimal strategy isn’t even-split anymore; you’d ask “Is it 1?” first because half the time you’re done in one question. The expected number of yes/no questions you need is essentially the entropy in bits. (More precisely: it’s bounded between H and H+1, by the source-coding theorem. Any prefix-free code can’t beat H bits per symbol on average, and Huffman codes get within one bit.)
So entropy in bits has a real, mechanical meaning: it’s the average length, in yes/no questions, of the best possible binary interrogation of the distribution. That’s what makes base 2 the natural choice — it matches the granularity of the question being asked.
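To see the bound in action, here is a short sketch in plain Python. Rather than building the code tree explicitly, it uses a standard identity: the expected depth of a Huffman tree equals the sum of the probabilities of the merged nodes created while building it. The distribution is the skewed one from above.

```python
import heapq
import math

def entropy_bits(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def huffman_expected_length(probs):
    """Expected number of yes/no questions under an optimal (Huffman) strategy."""
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a = heapq.heappop(heap)   # repeatedly merge the two least likely nodes,
        b = heapq.heappop(heap)   # accumulating each merged node's probability
        total += a + b
        heapq.heappush(heap, a + b)
    return total

# 1 with prob 1/2, 2 with prob 1/4, and 3..8 with prob 1/24 each.
p = [1/2, 1/4] + [1/24] * 6

h = entropy_bits(p)                  # ~2.15 bits
l = huffman_expected_length(p)       # ~2.17 questions on average
assert h <= l < h + 1                # the source-coding bound from the text
```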
The other bases, and what they’re for
- log₂ → bits. What you want for compression, hashing, channel capacity in digital systems, password entropy, anything that ends up on a wire or in RAM.
- ln (natural log) → nats. What you want for math. Calculus loves the natural log because d/dx ln(x) = 1/x and there’s no awkward 1/ln(2) factor in the gradient. This is why machine learning loss functions — cross-entropy, KL divergence — are almost always written and computed in nats internally, even when the reported number eventually gets divided by ln 2 to display as bits-per-token.
- log₁₀ → dits / hartleys. Mostly historical. Hartley’s earlier (pre-Shannon) formulation was decimal-flavored, and log₁₀ makes sense if your “alphabet” is the ten digits. You see it occasionally in older communications papers and almost never in modern code.
The conversion is one line: 1 nat = 1 / ln(2) bits ≈ 1.4427 bits. So
when a research paper reports a model’s cross-entropy loss as 2.1 nats
per token and a benchmark site lists the same model at roughly 3.0 bits per
token, those are the same number, within rounding: 2.1 / ln(2) ≈ 3.03.
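In code the conversion is a one-liner (a sketch; the 2.1-nat figure is just the example from this paragraph, not a real model’s loss):

```python
import math

loss_nats_per_token = 2.1                           # what the training loop reports
loss_bits_per_token = loss_nats_per_token / math.log(2)
print(loss_bits_per_token)                          # ~3.03 bits per token
```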
Where the convention shows its seams
A few honest places where the base-2 choice leaks:
- ML training code is in nats, but ML evaluation is often in bits-per-character or bits-per-token. The conversion is mechanical but easy to forget — you can lose a factor of ln 2 and not notice for a while.
- Password entropy is bits, but humans pick passwords in characters. A “12-character password” doesn’t have 12 × log₂(26) ≈ 56 bits of entropy in practice, because human-chosen text is heavily non-uniform. The bit count is an upper bound, not a measurement.
- “2³² states ⇒ H = 32 bits” only holds for uniform distributions. A 32-bit value drawn from a skewed distribution has entropy strictly less than 32 bits (the sketch after the next paragraph makes this concrete). People use “bits” loosely to mean both “address space” and “Shannon entropy”, and the two only agree at uniformity.
These aren’t bugs in the formula; they’re places where the base-2 unit meets messy reality and a careful reader has to keep the two things straight.
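The last seam is easy to demonstrate. A sketch, with a made-up distribution over a 32-bit field:

```python
import math

# A 32-bit field where the value 0 appears 99% of the time and the remaining
# 1% of probability is spread uniformly over the other 2**32 - 1 values.
p_zero = 0.99
p_other_each = 0.01 / (2**32 - 1)

# Entropy = sum over outcomes of -p * log2(p); the long uniform tail is
# handled analytically instead of materialising four billion probabilities.
h = -p_zero * math.log2(p_zero) - (2**32 - 1) * p_other_each * math.log2(p_other_each)
print(h)   # ~0.40 bits, nowhere near the 32 bits the field could hold at uniformity
```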
Famous related terms
- Bit — the answer to one yes/no question — the unit that base 2 gives you. Named in Shannon’s 1948 paper, attributed there to John Tukey.
- Nat — 1 nat = 1/ln(2) bits ≈ 1.44 bits — the natural-log unit. The “right” unit for calculus on probability, the wrong one for talking to humans about file sizes.
- Cross-entropy — H(p, q) = − Σ p(x) log q(x) — entropy’s asymmetric cousin and the loss function nearly every classifier and language model is trained against.
- KL divergence — D_KL(p ‖ q) = Σ p(x) log(p(x)/q(x)) — the gap between cross-entropy and entropy; non-negative, zero iff the distributions agree.
- Huffman coding — a greedy construction of an optimal prefix-free code — concrete proof that the entropy bound is tight to within one bit per symbol.
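The relationship between the last three terms is mechanical enough to verify directly. A sketch with two arbitrary distributions:

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # the "true" distribution
q = [0.4, 0.4, 0.2]     # the model's guess

# KL divergence is exactly the gap between cross-entropy and entropy,
# and it vanishes when the model matches the truth.
assert math.isclose(kl_divergence(p, q), cross_entropy(p, q) - entropy(p))
assert math.isclose(kl_divergence(p, p), 0.0, abs_tol=1e-12)
```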
Going deeper
- Shannon, C. E. (1948), A Mathematical Theory of Communication. Section 1 sets up the unit choice and names the bit. Freely available from Bell Labs / archives.
- Cover and Thomas, Elements of Information Theory — the standard textbook treatment of why −log p is the unique choice (up to a constant), with the source-coding theorem worked out properly.
- Any cross-entropy loss implementation in PyTorch / JAX — the docs will tell you the output is in nats. Multiply by 1/ln(2) to get bits per token (the sketch below shows the conversion).
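As an illustration of that last item, a sketch assuming PyTorch is available (the tensor shapes and vocabulary size are made up for the example):

```python
import math
import torch
import torch.nn.functional as F

logits = torch.randn(8, 50257)             # batch of 8 positions, GPT-2-sized vocab
targets = torch.randint(0, 50257, (8,))    # the "correct" next-token ids

loss_nats = F.cross_entropy(logits, targets)   # computed with the natural log, i.e. nats
loss_bits = loss_nats / math.log(2)            # divide by ln 2 to report bits per token
print(float(loss_nats), float(loss_bits))
```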
A note on what I’m sure of and what I’m not. The mathematical claims — the axiomatic derivation of
−log p, the source-coding bound, the conversion factor between bases — are standard textbook material. The historical claim that Shannon attributed the term bit to Tukey is from the 1948 paper itself. Beyond that, I don’t have a clean source for why the base-2 convention won out so completely over base-e in engineering circles versus mathematics — my read is that it’s path dependence (binary hardware showed up first and stuck) more than a clean theoretical reason, but I’d take that as a guess, not a documented history.