Why does information entropy use log base 2?
Shannon could have picked any base for the logarithm in his entropy formula. He picked 2 — and the choice quietly fixes the unit you measure information in.
Why it exists
Open any reference on information theory and the same formula stares back at you:
H(X) = − Σ p(x) · log₂ p(x)
That little subscript 2 looks decorative — like the author had to pick
some base, and 2 was as good as any. It isn’t decorative. The base of
the logarithm is the part of the formula that decides what unit you’re
measuring in. Pick log₂, you get answers in bits. Pick ln, you
get nats. Pick log₁₀, you get dits (also called bans or
hartleys). The numerical answer changes by a constant factor; the
meaning of “1 unit of information” changes with it.
The interesting question isn’t “which base is mathematically correct?” — all of them are, because logs in different bases differ only by a constant. The interesting question is: why did the convention land on base 2, and what does it mean that we measure information in powers of two?
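To make the constant-factor relationship concrete, here is a minimal sketch in plain Python (the example distribution is arbitrary, chosen only for illustration) that computes the same entropy in all three bases and checks that they agree up to the conversion factor:

```python
import math

def entropy(probs, base=2):
    """Shannon entropy of a distribution, in the unit fixed by `base`."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# An arbitrary example distribution over four outcomes.
p = [0.5, 0.25, 0.125, 0.125]

bits = entropy(p, base=2)        # 1.75   bits
nats = entropy(p, base=math.e)   # ~1.213 nats
dits = entropy(p, base=10)       # ~0.527 dits (hartleys)

# All three describe the same uncertainty; they differ only by a constant factor.
assert abs(nats / math.log(2) - bits) < 1e-12    # nats -> bits: divide by ln 2
assert abs(dits / math.log10(2) - bits) < 1e-12  # dits -> bits: divide by log10 2
```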
Why it matters now
Every working programmer is steeped in base-2 thinking already. A byte is 8 bits. A 32-bit integer can take 2³² values. A cryptographic hash with 256 bits of output offers up to 256 bits of preimage resistance (and about 128 bits of collision resistance, since the birthday bound halves the exponent). A password has “X bits of entropy.” When a model card says a tokenizer has a vocabulary of 50,257 tokens, the implicit follow-up is “so each token is up to log₂(50257) ≈ 15.6 bits of information.”
These numbers all live in the same unit because the underlying
machinery — memory cells, network packets, hash digests, model logits
turned into token IDs — is binary. Choosing log₂ means the entropy
formula speaks the same language as the hardware. That’s not a deep
mathematical truth; it’s an engineering convention that happened to
stick because it’s useful. And it’s useful because the world we built
on top of Shannon is a world of yes/no decisions.
The short answer
information(event) = log₂(1 / p(event)) bits
A bit is the answer to one yes/no question. If an event has probability 1/2, learning that it happened tells you exactly one bit — one yes/no’s worth of news. If it has probability 1/4, it takes two yes/no questions to pin down, so it carries two bits. Entropy is just the expected number of yes/no questions you’d have to answer to identify a random outcome from the distribution. Base 2 falls out the moment you decide a “question” is binary.
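A few lines of Python make that reading literal (a sketch; the probabilities are just examples):

```python
import math

def information_bits(p):
    """Self-information of an event with probability p, in bits."""
    return math.log2(1 / p)

print(information_bits(1/2))   # 1.0    -> one yes/no question
print(information_bits(1/4))   # 2.0    -> two yes/no questions
print(information_bits(1/8))   # 3.0    -> three yes/no questions
print(information_bits(1/3))   # ~1.585 -> fractional bits are fine; they only
                               #           average out over many events
```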
How it works
Shannon’s 1948 paper A Mathematical Theory of Communication sets up
information from three axioms about how a sensible measure of “surprise”
should behave: it should be a function of probability only, it should
be smooth, and the surprise from two independent events should add.
The unique function (up to a constant) that satisfies those is
−log p. The base of the log is the only free parameter, and Shannon
chooses base 2 explicitly because the engineering context — telegraphy,
telephony, early digital systems — is binary. He even names the unit
bit (a contraction of binary digit, attributed in the paper to
John Tukey).
The “axioms → log” step is the deep part. The “base 2” step is purely about what unit you want your answers in.
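The additivity requirement is the one doing the real work, and it is easy to check numerically that −log delivers it. A quick sketch, with arbitrary probabilities:

```python
import math

def surprise_bits(p):
    """Surprise (self-information) of an event with probability p, in bits."""
    return -math.log2(p)

p_a, p_b = 0.3, 0.2    # two independent events
p_both = p_a * p_b     # joint probability under independence

# The surprise of "both happened" equals the sum of the individual surprises,
# because the log turns multiplication of probabilities into addition.
assert math.isclose(surprise_bits(p_both),
                    surprise_bits(p_a) + surprise_bits(p_b))
```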
The yes/no-question intuition
Forget the formula for a second. Imagine I pick a number from 1 to 8
uniformly at random and you have to guess it with yes/no questions. The
optimal strategy is binary search: “Is it ≤ 4?”, “Is it ≤ 2?”, “Is it
1?”. Three questions, always. And log₂(8) = 3.
Now imagine the distribution is skewed — I pick 1 with probability 1/2, 2 with probability 1/4, and 3, 4, 5, 6, 7, 8 each with probability 1/24. The optimal strategy isn’t even-split anymore; you’d ask “Is it 1?” first because half the time you’re done in one question. The expected number of yes/no questions you need is essentially the entropy in bits. (More precisely: it’s bounded between H and H+1, by the source-coding theorem. Any prefix-free code can’t beat H bits per symbol on average, and Huffman codes get within one bit.)
So entropy in bits has a real, mechanical meaning: it’s the average length, in yes/no questions, of the best possible binary interrogation of the distribution. That’s what makes base 2 the natural choice — it matches the granularity of the question being asked.
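To see the bound in action, here is a short sketch in plain Python. Rather than building the code tree explicitly, it uses a standard identity: the expected depth of a Huffman tree equals the sum of the probabilities of the merged nodes created while building it. The distribution is the skewed one from above.

```python
import heapq
import math

def entropy_bits(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def huffman_expected_length(probs):
    """Expected number of yes/no questions under an optimal (Huffman) strategy."""
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a = heapq.heappop(heap)   # repeatedly merge the two least likely nodes,
        b = heapq.heappop(heap)   # accumulating each merged node's probability
        total += a + b
        heapq.heappush(heap, a + b)
    return total

# 1 with prob 1/2, 2 with prob 1/4, and 3..8 with prob 1/24 each.
p = [1/2, 1/4] + [1/24] * 6

h = entropy_bits(p)                  # ~2.15 bits
l = huffman_expected_length(p)       # ~2.17 questions on average
assert h <= l < h + 1                # the source-coding bound from the text
```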
The other bases, and what they’re for
- log₂ → bits. What you want for compression, hashing, channel capacity in digital systems, password entropy, anything that ends up on a wire or in RAM.
- ln (natural log) → nats. What you want for math. Calculus loves the natural log because d/dx ln(x) = 1/x and there’s no awkward 1/ln(2) factor in the gradient. This is why machine learning loss functions — cross-entropy, KL divergence — are almost always written and computed in nats internally, even when the reported number eventually gets divided by ln 2 to display as bits-per-token.
- log₁₀ → dits / hartleys. Mostly historical. Hartley’s earlier (pre-Shannon) formulation was decimal-flavored, and log₁₀ makes sense if your “alphabet” is the ten digits. You see it occasionally in older communications papers and almost never in modern code.
The conversion is one line: 1 nat = 1 / ln(2) bits ≈ 1.4427 bits. So
when a research paper reports a model’s cross-entropy loss as 2.1 nats
per token and a benchmark site lists the same model at roughly 3.0 bits per
token, those are the same number, within rounding: 2.1 / ln(2) ≈ 3.03.
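In code the conversion is a one-liner (a sketch; the 2.1-nat figure is just the example from this paragraph, not a real model’s loss):

```python
import math

loss_nats_per_token = 2.1                           # what the training loop reports
loss_bits_per_token = loss_nats_per_token / math.log(2)
print(loss_bits_per_token)                          # ~3.03 bits per token
```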
Where the convention shows its seams
A few honest places where the base-2 choice leaks:
- ML training code is in nats, but ML evaluation is often in bits-per-character or bits-per-token. The conversion is mechanical but easy to forget — you can lose a factor of ln 2 and not notice for a while.
- Password entropy is bits, but humans pick passwords in characters. A “12-character password” doesn’t have 12 × log₂(26) ≈ 56 bits of entropy in practice, because human-chosen text is heavily non-uniform. The bit count is an upper bound, not a measurement.
- “2³² states ⇒ H = 32 bits” only holds for uniform distributions. A 32-bit value drawn from a skewed distribution has entropy strictly less than 32 bits (the sketch after the next paragraph makes this concrete). People use “bits” loosely to mean both “address space” and “Shannon entropy”, and the two only agree at uniformity.
These aren’t bugs in the formula; they’re places where the base-2 unit meets messy reality and a careful reader has to keep the two things straight.
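The last seam is easy to demonstrate. A sketch, with a made-up distribution over a 32-bit field:

```python
import math

# A 32-bit field where the value 0 appears 99% of the time and the remaining
# 1% of probability is spread uniformly over the other 2**32 - 1 values.
p_zero = 0.99
p_other_each = 0.01 / (2**32 - 1)

# Entropy = sum over outcomes of -p * log2(p); the long uniform tail is
# handled analytically instead of materialising four billion probabilities.
h = -p_zero * math.log2(p_zero) - (2**32 - 1) * p_other_each * math.log2(p_other_each)
print(h)   # ~0.40 bits, nowhere near the 32 bits the field could hold at uniformity
```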
Famous related terms
- Bit — the answer to one yes/no question — the unit that base 2 gives you. Named in Shannon’s 1948 paper, attributed there to John Tukey.
- Nat — 1 nat = 1/ln(2) bits ≈ 1.44 bits — the natural-log unit. The “right” unit for calculus on probability, the wrong one for talking to humans about file sizes.
- Cross-entropy — H(p, q) = − Σ p(x) log q(x) — entropy’s asymmetric cousin and the loss function nearly every classifier and language model is trained against.
- KL divergence — D_KL(p ‖ q) = Σ p(x) log(p(x)/q(x)) — the gap between cross-entropy and entropy; non-negative, zero iff the distributions agree.
- Huffman coding — a greedy construction of an optimal prefix-free code — concrete proof that the entropy bound is tight to within one bit per symbol.
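The relationship between the last three terms is mechanical enough to verify directly. A sketch with two arbitrary distributions:

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # the "true" distribution
q = [0.4, 0.4, 0.2]     # the model's guess

# KL divergence is exactly the gap between cross-entropy and entropy,
# and it vanishes when the model matches the truth.
assert math.isclose(kl_divergence(p, q), cross_entropy(p, q) - entropy(p))
assert math.isclose(kl_divergence(p, p), 0.0, abs_tol=1e-12)
```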
Going deeper
- Shannon, C. E. (1948), A Mathematical Theory of Communication. Section 1 sets up the unit choice and names the bit. Freely available from Bell Labs / archives.
- Cover and Thomas, Elements of Information Theory — the standard textbook treatment of why −log p is the unique choice (up to a constant), with the source-coding theorem worked out properly.
- Any cross-entropy loss implementation in PyTorch / JAX — the docs will tell you the output is in nats. Multiply by 1/ln(2) to get bits per token (the sketch below shows the conversion).
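As an illustration of that last item, a sketch assuming PyTorch is available (the tensor shapes and vocabulary size are made up for the example):

```python
import math
import torch
import torch.nn.functional as F

logits = torch.randn(8, 50257)             # batch of 8 positions, GPT-2-sized vocab
targets = torch.randint(0, 50257, (8,))    # the "correct" next-token ids

loss_nats = F.cross_entropy(logits, targets)   # computed with the natural log, i.e. nats
loss_bits = loss_nats / math.log(2)            # divide by ln 2 to report bits per token
print(float(loss_nats), float(loss_bits))
```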
A note on what I’m sure of and what I’m not. The mathematical claims — the axiomatic derivation of
−log p, the source-coding bound, the conversion factor between bases — are standard textbook material. The historical claim that Shannon attributed the term bit to Tukey is from the 1948 paper itself. Beyond that, I don’t have a clean source for why the base-2 convention won out so completely over base-e in engineering circles versus mathematics — my read is that it’s path dependence (binary hardware showed up first and stuck) more than a clean theoretical reason, but I’d take that as a guess, not a documented history.