
Why is the central limit theorem load-bearing?

Almost every confidence interval, A/B test, and gradient-noise argument quietly leans on one fact: averages of independent things look Gaussian, even when the things themselves don't.

Math · intro · Apr 29, 2026

Why it exists

Roll a single six-sided die. The result jumps wildly between 1 and 6 — no pattern, no preferred value. Now roll ten dice at a time and write down the average. Do that a few hundred times and plot the averages. Most of them cluster between 3 and 4, in a tidy bell shape, even though no individual die ever shows 3.5. That collapse from chaos to a predictable bell — every time you average enough independent things — is the central limit theorem. It is why a single user’s behavior on your website looks random but a daily average of a million users looks like a clean curve you can run statistics on.
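
You can run the experiment in a few lines. A minimal sketch in Python (numpy assumed; the counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# One die at a time: results bounce uniformly over 1..6, no bell in sight.
single_rolls = rng.integers(1, 7, size=1_000)

# Ten dice at a time, five hundred times: keep the average of each group.
averages = rng.integers(1, 7, size=(500, 10)).mean(axis=1)

print(single_rolls.mean())                          # ≈ 3.5, but the values are spread over 1..6
print(averages.mean())                              # ≈ 3.5 as well
print(np.mean((averages >= 3) & (averages <= 4)))   # roughly two thirds of the averages land in [3, 4]
```

A histogram of the averages shows the bell directly; with ten dice the spread (standard deviation) of the average is already about 0.54, roughly a third of a single die’s spread of about 1.71.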

Take any reasonable distribution — coin flips, latencies from a web service, daily revenue per user, gradients computed on random mini-batches. The shape can be ugly: skewed, bimodal, fat-tailed, nothing like a textbook curve. Now average a lot of independent draws from that distribution. The average doesn’t inherit the ugliness. It collapses to a bell.

That collapse is the central limit theorem (CLT), and once you notice how often you’re secretly averaging things, you start seeing it everywhere. It’s the reason a histogram of individual request latencies looks like a long-tailed mess but a histogram of daily mean latency looks like a tidy bump. It’s the reason an A/B test on a million users gives you a clean p-value even though the per-user metric is a chaotic mixture. It’s the reason SGD people argue about “Gaussian noise in the gradient” with a straight face when each individual sample’s gradient is anything but.

The theorem isn’t deep because the bell is special. It’s deep because the bell is the attractor: average enough independent finite-variance things and you can’t not end up there.

Why it matters now

Three places it’s quietly load-bearing for software engineers right now:

- Error bars and confidence intervals on dashboards and experiment readouts: the ±1.96 · σ/√n formula is the CLT applied to a sample mean.
- A/B testing: z-tests and t-tests on a difference in means assume that difference is approximately normal, which the CLT delivers once the user counts are large.
- SGD intuition: the “mini-batch gradient = true gradient + Gaussian noise” picture leans on the CLT applied to the average of per-sample gradients within a batch.

If the CLT failed quietly, none of these tools would announce a warning. They’d just be subtly wrong, and people would chase ghosts.

The short answer

sample_mean ≈ Normal(true_mean, σ / √n) for large n, almost regardless of the underlying distribution

If you average n independent draws from any distribution with finite variance σ², the distribution of that average is approximately a normal distribution with the same center as the original and a width that shrinks as 1/√n. The original distribution can be anything sane — uniform, exponential, a weird mixture. The mean forgets the shape and remembers only the center and the spread.
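
A quick numerical check of that claim, as a sketch (Python with numpy assumed; the exponential distribution and the sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Exponential with mean 2: heavily skewed, nothing like a bell.
mu, sigma, n = 2.0, 2.0, 400    # for an exponential, the std equals the mean

# 10,000 experiments, each averaging n independent draws.
sample_means = rng.exponential(scale=mu, size=(10_000, n)).mean(axis=1)

print(sample_means.mean())       # ≈ 2.0: same center as the original
print(sample_means.std())        # ≈ 0.1: the predicted σ / √n
print(sigma / np.sqrt(n))        # 0.1, for comparison
```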

How it works

The classical statement: let X₁, X₂, …, Xₙ be independent and identically distributed with mean μ and finite variance σ². Define the sample mean X̄ₙ = (X₁ + … + Xₙ) / n. Then as n → ∞,

√n · (X̄ₙ − μ) / σ  →  Normal(0, 1)

That’s it. The interesting structure is what’s missing: no assumption about the shape of the original distribution beyond “has a mean and a finite variance.” It can be discrete, continuous, ugly, mixed.
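
To make “convergence in distribution of the standardized mean” concrete, here is a sketch (numpy and scipy assumed) that standardizes sample means exactly as in the formula above and compares them to Normal(0, 1):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

mu, sigma, n = 2.0, 2.0, 400    # Exponential with mean 2 again, so μ = σ = 2

# Standardize each sample mean: √n · (X̄ₙ − μ) / σ
means = rng.exponential(scale=mu, size=(5_000, n)).mean(axis=1)
z = np.sqrt(n) * (means - mu) / sigma

# Empirical quantiles of z should match standard normal quantiles.
for q in (0.05, 0.25, 0.5, 0.75, 0.95):
    print(q, round(np.quantile(z, q), 3), round(stats.norm.ppf(q), 3))

# A goodness-of-fit check against Normal(0, 1). With enough repetitions even the
# small remaining skew becomes detectable, so don't read too much into the p-value.
print(stats.kstest(z, "norm"))
```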

Why a bell, specifically?

Two intuitions, neither rigorous, both useful.

1. The Gaussian is the only “shape” that’s stable under averaging. If you average two independent Gaussians, you get a Gaussian. Average two independent uniforms — you get a triangle. Average three uniforms — a piecewise-quadratic blob. Keep going and the blob smooths out toward a bell. The Gaussian is a fixed point of the averaging operator; everything else flows toward it. (A small numerical sketch of this flow follows the list.)

2. The Gaussian maximizes entropy at fixed mean and variance. Among all distributions with a given center and spread, the normal distribution is the most “spread out” / least committed to any particular shape. Averaging discards information about the original shape (you keep the mean and the variance, you lose the rest). What you’re left with is the most-uncertain distribution consistent with what you kept — which is the bell. (This is the maximum entropy view of the CLT, and it’s why information-theory people love it.)
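
Here is the numerical sketch promised under intuition 1 (numpy assumed; the grid resolution is an arbitrary choice): instead of sampling, convolve the densities directly and watch the flat uniform turn into a triangle, then a bell.

```python
import numpy as np

dx = 0.001
uniform = np.ones(1_000)    # density of Uniform(0, 1) on a grid with step dx

density = uniform.copy()
for k in range(2, 6):
    # Convolving densities = adding one more independent uniform to the sum.
    density = np.convolve(density, uniform, mode="full") * dx
    # 'density' is now the density of the sum of k uniforms: a triangle at k = 2,
    # piecewise quadratic at k = 3, visibly bell-shaped by k = 5.

mid = len(density) // 2
print(round(density[mid], 3), round(density[mid // 2], 3), round(density[0], 6))
# high in the middle, much lower halfway out, essentially zero at the edge
```

Rescaling the x-axis by 1/k turns sums into averages without changing the shape, so the same flow toward the bell applies to the mean.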
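
For intuition 2, the statement in symbols (a standard fact, written in the notation used above):

h(p) = −∫ p(x) log p(x) dx

h(Normal(μ, σ²)) = ½ · log(2π e σ²)  ≥  h(p)  for every density p with mean μ and variance σ²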

How fast is “as n → ∞”?

Fast for nice distributions, slow for ugly ones. The convergence rate is governed by the Berry–Esseen theorem, which roughly says the error scales as 1/√n, with a constant driven by the third absolute moment of the original (in practice: how skewed and heavy-tailed it is). Practical rules of thumb, with a rough numerical check after the list:

- Symmetric, light-tailed distributions (fair coins, uniforms): n in the tens is usually plenty.
- Moderately skewed distributions (exponential-like): the old “n ≈ 30” rule is roughly right for the center of the distribution, less so for the tails.
- Heavily skewed or fat-tailed distributions (per-user revenue, latencies with large outliers): n may need to be in the thousands before the normal approximation is trustworthy.
- Infinite variance (genuine power-law tails): the classical CLT does not apply at all; sums converge to other stable distributions instead.
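
The rough numerical check mentioned above, as a sketch (numpy and scipy assumed; Exponential(1) and the sample counts are arbitrary choices): estimate the largest gap between the distribution of the standardized mean and the standard normal at a few values of n.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def max_cdf_gap(n, reps=50_000):
    """Approximate worst-case gap between the standardized mean's empirical CDF
    and the standard normal CDF, for means of n Exponential(1) draws (μ = σ = 1)."""
    means = rng.exponential(size=(reps, n)).mean(axis=1)
    z = np.sort(np.sqrt(n) * (means - 1.0))
    empirical = np.arange(1, reps + 1) / reps
    return np.max(np.abs(empirical - stats.norm.cdf(z)))

for n in (5, 20, 80, 320):
    print(n, round(max_cdf_gap(n), 4))    # the gap shrinks roughly like 1/√n
```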

Where the theorem shows its seams

The CLT only promises convergence in distribution of the standardized mean. A few places where that matters:

- It is a statement about the mean, not about individuals. Single requests, users, or gradients stay exactly as ugly as they were; only the average turns Gaussian.
- The approximation is best in the middle of the distribution and worst in the tails. Claims about extreme quantiles (p99 latency, one-in-ten-thousand events) need far more data than claims about means, or a different tool entirely.
- Independence is doing real work. Correlated data (time series, users in the same cohort, requests during the same incident) can make the effective n much smaller than the nominal one; the small simulation after this list shows how large the effect can be.
- Finite variance is doing real work too. Genuinely heavy-tailed quantities can have sample means that converge very slowly, or, when the variance is infinite, not toward a Gaussian at all.
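
The simulation mentioned in the independence bullet, as a sketch (numpy assumed; the AR(1) model and ρ = 0.9 are arbitrary choices): with strongly correlated data, the naive σ/√n badly understates how much the mean actually wobbles.

```python
import numpy as np

rng = np.random.default_rng(4)

def correlated_series(n, rho=0.9):
    """AR(1) noise with mean 0 and unit marginal variance: each point leans on the previous one."""
    x = np.empty(n)
    x[0] = rng.normal()
    for t in range(1, n):
        x[t] = rho * x[t - 1] + np.sqrt(1.0 - rho**2) * rng.normal()
    return x

n = 1_000
means = np.array([correlated_series(n).mean() for _ in range(2_000)])

print(round(1.0 / np.sqrt(n), 3))   # 0.032: what σ/√n would claim, since σ = 1
print(round(means.std(), 3))        # ≈ 0.13: the mean actually wobbles about 4x more
```

For an AR(1) series the variance of the mean is inflated by roughly (1 + ρ)/(1 − ρ), a factor of 19 at ρ = 0.9, which matches the roughly 4x larger standard deviation above.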

I don’t have a clean source for the historical sequence — Laplace had a form of it in 1810, Lyapunov tightened it around 1900, Lindeberg and Lévy gave the modern proofs in the 1920s — and I’d defer to a real history-of-statistics source for the details rather than try to reconstruct that thread here.

Going deeper

A note on what I’m sure of and what I’m not. The mathematical statement and the standard caveats (independence, finite variance, Berry–Esseen rates) are textbook. The claim that mini-batch gradients in deep learning are not well-modeled as Gaussian is an active and somewhat contested research area; I’d treat the “SGD = GD + Gaussian noise” picture as a useful first approximation, not a theorem. The historical sketch above is rough — I’d check a real history-of-statistics source before quoting dates or attributions.