Central Limit Theorem
Why normality emerges from chaos — the shape of fluctuations, the rate of convergence, and why almost all of classical statistics works.
11.1 Why the LLN Isn’t Enough
Topic 10 closed with a qualitative statement: the sample mean $\bar{X}_n$ converges to the population mean $\mu$. The Strong Law of Large Numbers promises it happens almost surely, the Weak Law in probability, and the law of the iterated logarithm even pins down the precise a.s. oscillation rate $\sigma\sqrt{2\log\log n / n}$. What none of these tell us is the shape of the fluctuations at any given $n$.
That gap matters. When we report $\bar{X}_n$ from a survey of $n$ respondents, we want to say how confident we are that $\bar{X}_n$ is near $\mu$ — not merely that $\bar{X}_n \to \mu$ eventually. Confidence requires a distribution for the error $\bar{X}_n - \mu$, and the LLN gives us nothing of the kind.
The Central Limit Theorem fills that gap with one of the most surprising results in probability. Standardize the sample mean: subtract the true mean $\mu$, divide by the true standard error $\sigma/\sqrt{n}$. Call the result $Z_n$. Then — regardless of whether $X_i$ is Bernoulli, Exponential, Poisson, Uniform, or almost anything else with finite variance —

$$Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1).$$
The limiting distribution doesn't depend on the shape of $X$. Skewed, symmetric, heavy-tailed, bounded, discrete — they all converge to the same Gaussian. The CLT is the mathematical reason confidence intervals, $z$-tests, $p$-values, and the entire apparatus of frequentist inference work.
Here is the roadmap:
| Section | Result | What it says |
|---|---|---|
| §2 | De Moivre–Laplace | The Binomial becomes Normal — the CLT's historical root |
| §3 | Classical CLT (Lindeberg–Lévy) | iid + finite variance ⟹ $Z_n \xrightarrow{d} \mathcal{N}(0,1)$ |
| §4 | MGF proof | Taylor-expand $\log M_X(t/\sqrt{n})$, apply Lévy continuity |
| §5 | CF proof | Same structure with characteristic functions; no MGF required |
| §6 | Lindeberg CLT | iid is overkill — only need no single summand to dominate |
| §7 | Berry–Esseen | The rate is $O(1/\sqrt{n})$, and skewness controls the constant |
| §8 | Multivariate CLT | Random vectors converge to $\mathcal{N}(\mathbf{0}, \Sigma)$ |
| §9 | Delta method | Nonlinear transformations inherit the CLT with variance $[g'(\mu)]^2\sigma^2$ |
| §10 | ML connections | Confidence intervals, SGD noise, Bayesian CLT |
| §11 | Summary | Interactive explorer and reference table |
Throughout, we use $\Phi$ for the standard Normal CDF, $\varphi_X$ for characteristic functions (not to be confused with $\Phi$), $M_X$ for moment-generating functions, and $\xrightarrow{d}$ for convergence in distribution (Topic 9, Definition 9.6).
11.2 De Moivre–Laplace: The First CLT
The first CLT predates the general theory by almost two centuries. In 1733, Abraham de Moivre — working on a problem of fair games of chance — proved that the Binomial distribution, properly standardized, approaches the Normal. Laplace generalized it to arbitrary $p$ in 1812, and the result became known as the de Moivre–Laplace theorem. Every bell-curve-from-coin-flips demonstration the reader has ever seen is a visualization of this theorem.
Let $S_n \sim \text{Binomial}(n, p)$ for fixed $p \in (0, 1)$. Then

$$\frac{S_n - np}{\sqrt{np(1-p)}} \xrightarrow{d} \mathcal{N}(0, 1).$$

Equivalently, for any $a < b$,

$$P\!\left(a \le \frac{S_n - np}{\sqrt{np(1-p)}} \le b\right) \to \Phi(b) - \Phi(a).$$
Proof.
Write $S_n = X_1 + \cdots + X_n$ with $X_i \sim \text{Bernoulli}(p)$ iid. Let $q = 1 - p$. The Binomial PMF is

$$P(S_n = k) = \binom{n}{k} p^k q^{n-k}.$$
Near the mode $k \approx np$ we apply Stirling's approximation, $m! \sim \sqrt{2\pi m}\,(m/e)^m$, to the three factorials:

$$\binom{n}{k} p^k q^{n-k} \approx \sqrt{\frac{n}{2\pi k(n-k)}} \left(\frac{np}{k}\right)^{k} \left(\frac{nq}{n-k}\right)^{n-k}.$$
Simplify the prefactor: with $k \approx np$ and $n - k \approx nq$,

$$\sqrt{\frac{n}{2\pi k(n-k)}} \approx \frac{1}{\sqrt{2\pi npq}}.$$
Change variables to $x = \dfrac{k - np}{\sqrt{npq}}$, so $k = np + x\sqrt{npq}$ and $n - k = nq - x\sqrt{npq}$. The prefactor becomes $\dfrac{1}{\sqrt{2\pi npq}}\,(1 + o(1))$.
For the exponential factor, write $\dfrac{k}{np} = 1 + x\sqrt{\dfrac{q}{np}}$ and $\dfrac{n-k}{nq} = 1 - x\sqrt{\dfrac{p}{nq}}$. Then

$$\log\left[\left(\frac{np}{k}\right)^{k}\left(\frac{nq}{n-k}\right)^{n-k}\right] = -k\log\frac{k}{np} - (n-k)\log\frac{n-k}{nq}.$$
Expand each logarithm using $\log(1+u) = u - \frac{u^2}{2} + O(u^3)$ with $u = x\sqrt{q/(np)}$ and $u = -x\sqrt{p/(nq)}$. The first-order terms cancel (the mean of $x$ is $0$ by construction), and the second-order terms combine to

$$-\frac{x^2}{2}.$$
Higher-order terms are $O(n^{-1/2})$ and vanish in the limit. Multiplying the prefactor and the exponential factor:

$$P(S_n = k) \approx \frac{1}{\sqrt{2\pi npq}}\, e^{-x^2/2}.$$
The factor $\dfrac{1}{\sqrt{npq}}$ is the spacing between consecutive standardized values $x$ — it converts the PMF to a Riemann approximation of the density. Summing over $k$ with $a \le \dfrac{k - np}{\sqrt{npq}} \le b$ gives

$$P\!\left(a \le \frac{S_n - np}{\sqrt{npq}} \le b\right) \to \int_a^b \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx = \Phi(b) - \Phi(a),$$
which is exactly convergence in distribution to $\mathcal{N}(0, 1)$. ◼
With $n = 100$, $p = 0.5$: $np = 50$, $\sqrt{npq} = 5$. The standardization says $S_{100}$ is approximately $\mathcal{N}(50, 25)$. A 95% Normal band gives $50 \pm 1.96 \times 5 = 50 \pm 9.8$, i.e. $P(40.2 \le S_{100} \le 59.8) \approx 0.95$.
The exact Binomial probability is $P(41 \le S_{100} \le 59) \approx 0.943$ — the Normal approximation is off by less than $0.01$. For discrete-data accuracy, a continuity correction replaces the integer endpoints by half-integer ones: $P(S_n \le k) \approx \Phi\!\big((k + 0.5 - np)/\sqrt{npq}\big)$. This is what the explorer below toggles on and off.
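A quick numerical check of these numbers — a minimal sketch with scipy, assuming the $n = 100$, $p = 0.5$ setup above:

```python
import numpy as np
from scipy import stats

n, p = 100, 0.5
mu, sigma = n * p, np.sqrt(n * p * (1 - p))  # 50, 5

# 95% Normal band: 50 +/- 1.96*5 = [40.2, 59.8]; the integers inside are 41..59
exact = stats.binom.cdf(59, n, p) - stats.binom.cdf(40, n, p)

# Continuity correction: half-integer endpoints 40.5 and 59.5
corrected = stats.norm.cdf(59.5, mu, sigma) - stats.norm.cdf(40.5, mu, sigma)

print("nominal Normal band mass: 0.9500")
print(f"exact Binomial:           {exact:.4f}")     # ~0.943, off by < 0.01
print(f"continuity-corrected:     {corrected:.4f}") # within a few thousandths of exact
```

The continuity-corrected value lands within a few thousandths of the exact Binomial probability, while the uncorrected band overstates it by under $0.01$.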
De Moivre (1733) proved the $p = \tfrac{1}{2}$ case. Laplace (1812) extended to general $p$. Lyapunov (1901) gave the first general CLT under a third-moment condition. Lindeberg (1922) gave the sharp necessary-and-sufficient condition. Lévy (1925) supplied the characteristic function machinery, and Berry (1941) and Esseen (1942) proved the $O(1/\sqrt{n})$ rate. Two centuries of incremental sharpening separate de Moivre's coin-flip argument from the modern graduate-level CLT.
11.3 The Classical CLT (Lindeberg–Lévy)
The general CLT removes the Bernoulli restriction and asks only that the summands be iid with finite variance.
Let $X_1, X_2, \dots$ be iid with $E[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2 \in (0, \infty)$. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. Then

$$\frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} \mathcal{N}(0, 1).$$
Two conditions, and only two: independence + identical distribution, and finite variance. No shape restrictions, no moment conditions beyond $E[X^2] < \infty$. The theorem is blind to whether $X$ is discrete or continuous, bounded or unbounded, symmetric or skewed.
We simulate replications of the standardized mean $Z_n$ at four increasing sample sizes $n_1 < n_2 < n_3 < n_4$:
| Distribution | $n_1$ | $n_2$ | $n_3$ | $n_4$ |
|---|---|---|---|---|
| Uniform(0,1) | KS ≈ 0.02 | KS ≈ 0.01 | KS ≈ 0.01 | KS < 0.01 |
| Exponential(1) | KS ≈ 0.13 | KS ≈ 0.08 | KS ≈ 0.04 | KS ≈ 0.02 |
| Bernoulli(0.3) | KS ≈ 0.09 | KS ≈ 0.05 | KS ≈ 0.03 | KS ≈ 0.02 |
| Poisson(5) | KS ≈ 0.06 | KS ≈ 0.04 | KS ≈ 0.02 | KS ≈ 0.01 |
| Chi²(3) | KS ≈ 0.18 | KS ≈ 0.11 | KS ≈ 0.06 | KS ≈ 0.03 |

(KS = Kolmogorov–Smirnov distance from $\mathcal{N}(0,1)$; smaller is better.)
Symmetric, bounded distributions (Uniform) hit target accuracy almost immediately. Skewed distributions (Chi², Exponential) take longer. The rate is uniformly $O(1/\sqrt{n})$ — §7 will pin it down.
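A sketch of the simulation behind a table like this one — the replication count (10,000) and the sample sizes ($n = 10, 30, 100, 400$) are assumptions; scipy's `kstest` supplies the KS distance:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reps = 10_000  # assumed replication count

# (sampler, mean, std) for each underlying distribution
dists = {
    "Uniform(0,1)":   (lambda size: rng.uniform(0, 1, size), 0.5, np.sqrt(1/12)),
    "Exponential(1)": (lambda size: rng.exponential(1, size), 1.0, 1.0),
    "Bernoulli(0.3)": (lambda size: (rng.random(size) < 0.3).astype(float), 0.3, np.sqrt(0.21)),
    "Poisson(5)":     (lambda size: rng.poisson(5, size).astype(float), 5.0, np.sqrt(5.0)),
    "Chi2(3)":        (lambda size: rng.chisquare(3, size), 3.0, np.sqrt(6.0)),
}

for name, (sampler, mu, sigma) in dists.items():
    row = []
    for n in (10, 30, 100, 400):  # assumed sample sizes
        x = sampler((reps, n))
        z = (x.mean(axis=1) - mu) / (sigma / np.sqrt(n))  # standardized means
        ks = stats.kstest(z, "norm").statistic            # sup |F_emp - Phi|
        row.append(f"n={n}: KS={ks:.3f}")
    print(f"{name:15s} " + "  ".join(row))
```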
The Lindeberg–Lévy hypothesis is restrictive — real data rarely comes iid. But the iid assumption is not necessary; §6 replaces it with the Lindeberg condition, which is the true mechanism making the CLT work: no single summand should contribute an appreciable fraction of the total variance. iid is simply the simplest setting in which this happens automatically.
11.4 Proof via Moment-Generating Functions
The MGF proof is the most concrete. It assumes $M_X(t) = E[e^{tX}]$ exists in a neighborhood of zero (which excludes heavy-tailed distributions like Cauchy and Pareto without moments) but is otherwise a calculus exercise. The structure is identical to the Poisson limit proof of Theorem 9.13 — only the Taylor expansion differs.
Proof.
Without loss of generality take $\mu = 0$ and $\sigma = 1$ (otherwise work with $(X_i - \mu)/\sigma$). Then $Z_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n X_i$. By independence,

$$M_{Z_n}(t) = E\big[e^{tZ_n}\big] = \prod_{i=1}^n E\big[e^{tX_i/\sqrt{n}}\big] = \big[M_X(t/\sqrt{n})\big]^n.$$
Take logarithms:

$$\log M_{Z_n}(t) = n \log M_X(t/\sqrt{n}).$$
Since $M_X$ is smooth near $0$ and $M_X(0) = 1$, we can Taylor-expand around zero. Using $M_X'(0) = E[X] = 0$ and $M_X''(0) = E[X^2] = 1$:

$$M_X(s) = 1 + \frac{s^2}{2} + o(s^2) \quad \text{as } s \to 0.$$
Substitute $s = t/\sqrt{n}$:

$$M_X(t/\sqrt{n}) = 1 + \frac{t^2}{2n} + o(1/n).$$
Now apply $\log(1 + u) = u + O(u^2)$ with $u = \frac{t^2}{2n} + o(1/n)$:

$$\log M_X(t/\sqrt{n}) = \frac{t^2}{2n} + o(1/n).$$
Multiply by $n$:

$$\log M_{Z_n}(t) = \frac{t^2}{2} + o(1) \to \frac{t^2}{2}.$$
Exponentiating gives the pointwise limit $M_{Z_n}(t) \to e^{t^2/2}$, which is the MGF of $\mathcal{N}(0, 1)$. By MGF uniqueness (Expectation, Variance & Moments: Theorem 17) plus the MGF version of Lévy's continuity theorem, $Z_n \xrightarrow{d} \mathcal{N}(0, 1)$. ◼
For $X \sim \text{Exponential}(1)$, $\mu = 1$ and $\sigma^2 = 1$. Work with the centered variable $Y = X - 1$. Then

$$M_Y(s) = E\big[e^{s(X-1)}\big] = \frac{e^{-s}}{1 - s}, \qquad s < 1.$$
Expand: $e^{-s} = 1 - s + \frac{s^2}{2} - \frac{s^3}{6} + \cdots$ and $\frac{1}{1-s} = 1 + s + s^2 + s^3 + \cdots$. Multiply:

$$M_Y(s) = 1 + \frac{s^2}{2} + \frac{s^3}{3} + O(s^4).$$
The coefficient of $s^2$ is $\frac{1}{2}$ — consistent with $\sigma^2 = 1$. Substituting $s = t/\sqrt{n}$ and taking logs:

$$n \log M_Y(t/\sqrt{n}) = \frac{t^2}{2} + \frac{t^3}{3\sqrt{n}} + O(1/n).$$
The cubic term is $\frac{t^3}{3\sqrt{n}}$ — it vanishes as $n \to \infty$, giving the $\frac{t^2}{2}$ limit as expected. But the cubic coefficient is the reason Exponential converges slower than Uniform (where the cubic term vanishes by symmetry). This is the Berry–Esseen rate effect of §7.
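A numeric sanity check of this expansion — a sketch evaluating the exact $n \log M_Y(t/\sqrt{n})$ for the centered Exponential against the two-term approximation $\frac{t^2}{2} + \frac{t^3}{3\sqrt{n}}$:

```python
import numpy as np

def log_mgf_centered_exp(s):
    # M_Y(s) = e^{-s} / (1 - s)  =>  log M_Y(s) = -s - log(1 - s), valid for s < 1
    return -s - np.log1p(-s)

t = 1.0
for n in (10, 100, 1000, 10_000):
    exact = n * log_mgf_centered_exp(t / np.sqrt(n))
    approx = t**2 / 2 + t**3 / (3 * np.sqrt(n))
    print(f"n={n:>6}: n*logM = {exact:.6f}   two-term = {approx:.6f}   "
          f"diff = {exact - approx:.2e}")  # diff shrinks like 1/n
```

The residual shrinks like $1/n$ (driven by the $s^4/4$ term of $\log M_Y$), while the $\frac{t^3}{3\sqrt{n}}$ skewness term is what separates $Z_n$ from its Gaussian limit at finite $n$.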
Topic 9, Theorem 9.13 proved the Poisson limit theorem by the same five-step recipe: (1) compute the MGF of the standardized sum, (2) take the log, (3) Taylor-expand for large $n$, (4) show the limit is the target MGF, (5) invoke Lévy continuity and MGF uniqueness. Only step (3) differs between the two proofs: for Poisson, the Taylor expansion gives $\lambda(e^t - 1)$ in the limit; for the CLT, it gives $t^2/2$.
11.5 Proof via Characteristic Functions
MGFs have one flaw: they may not exist. The MGF is an integral that can diverge — Cauchy-distributed variables, Pareto, and anything with sub-exponential tails break it. The Fourier-analytic cousin, the characteristic function, is bulletproof: the integrand has modulus one, so the integral always converges. This generality is why the textbook CLT proof is via characteristic functions.
For a random variable $X$, the characteristic function is

$$\varphi_X(t) = E\big[e^{itX}\big], \qquad t \in \mathbb{R},$$
where $i = \sqrt{-1}$. Since $|e^{itX}| = 1$, the expectation exists for every distribution. The characteristic function uniquely determines the distribution: $\varphi_X = \varphi_Y$ implies $X \stackrel{d}{=} Y$.
The CF of $\mathcal{N}(0,1)$ is $\varphi(t) = e^{-t^2/2}$ (compute by contour integration or Hermite polynomials — standard). The CLT target is therefore $\varphi_{Z_n}(t) \to e^{-t^2/2}$ pointwise, and we need a continuity theorem to lift pointwise CF convergence to convergence in distribution.
Let $X_1, X_2, \dots$ be random variables with characteristic functions $\varphi_{X_n}$. Then $X_n \xrightarrow{d} X$ if and only if $\varphi_{X_n}(t) \to \varphi_X(t)$ for every $t \in \mathbb{R}$; and if $\varphi_{X_n} \to \varphi$ pointwise with $\varphi$ continuous at $0$, then $\varphi$ is the CF of some $X$ and $X_n \xrightarrow{d} X$.
The CF version is strictly stronger than the MGF version in Topic 9, Remark 2: no moment conditions required, the result always applies. The continuity-at-zero requirement on the limit function $\varphi$ is usually automatic — any CF of a proper probability distribution is continuous.
Proof.
Take $\mu = 0$, $\sigma = 1$, so $Z_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n X_i$. By independence,

$$\varphi_{Z_n}(t) = \big[\varphi_X(t/\sqrt{n})\big]^n.$$
Since $E[X] = 0$ and $E[X^2] = 1$, Taylor-expand $\varphi_X$ around zero using $\varphi_X'(0) = i\,E[X] = 0$ and $\varphi_X''(0) = -E[X^2] = -1$:

$$\varphi_X(s) = 1 - \frac{s^2}{2} + o(s^2) \quad \text{as } s \to 0.$$
This expansion holds whenever $E[X^2] < \infty$ — no higher moments needed. Substitute $s = t/\sqrt{n}$:

$$\varphi_X(t/\sqrt{n}) = 1 - \frac{t^2}{2n} + o(1/n).$$
Take logarithms (the principal branch is well-defined once $|\varphi_X(t/\sqrt{n}) - 1|$ is small, which it is for all fixed $t$ and large enough $n$):

$$\log \varphi_X(t/\sqrt{n}) = -\frac{t^2}{2n} + o(1/n).$$
Multiply by $n$:

$$\log \varphi_{Z_n}(t) = n \log \varphi_X(t/\sqrt{n}) = -\frac{t^2}{2} + o(1).$$
Exponentiate: $\varphi_{Z_n}(t) \to e^{-t^2/2}$, pointwise in $t$. Since $e^{-t^2/2}$ is continuous at zero, Lévy's continuity theorem (Theorem 3) gives $Z_n \xrightarrow{d} \mathcal{N}(0, 1)$. ◼
MGFs are real-valued, which makes Taylor expansion concrete and helps with first intuition. CFs are complex-valued but always exist. The MGF proof assumes $M_X$ finite in a neighborhood of zero — a nontrivial restriction (no Cauchy, no power-law tails without moments). The CF proof requires only $E[X^2] < \infty$. In graduate probability, the CF proof is standard; in a first exposure, the MGF proof is more transparent. We do both.
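The CF route can be checked in closed form for Exponential(1): the centered CF is $\varphi_Y(t) = e^{-it}/(1 - it)$, so $\varphi_{Z_n}(t) = \varphi_Y(t/\sqrt{n})^n$ exactly. A sketch in numpy complex arithmetic:

```python
import numpy as np

def cf_centered_exp(t):
    # CF of Y = X - 1 for X ~ Exponential(1): E[e^{itY}] = e^{-it} / (1 - it)
    return np.exp(-1j * t) / (1 - 1j * t)

t = np.linspace(-3, 3, 7)
target = np.exp(-t**2 / 2)  # CF of N(0,1)
for n in (10, 100, 1000):
    cf_zn = cf_centered_exp(t / np.sqrt(n)) ** n  # phi_{Z_n}(t) = phi_Y(t/sqrt n)^n
    err = np.max(np.abs(cf_zn - target))
    print(f"n={n:>5}: max |phi_Zn(t) - exp(-t^2/2)| on grid = {err:.4f}")
```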
11.6 The Lindeberg CLT
The iid assumption is unnecessarily strong. What the CLT actually needs is that no single summand contributes an outsize share of the total variance. Lindeberg (1922) pinned this down with a truncation condition that is both sufficient and — by a theorem of Feller — necessary.
Let $X_1, X_2, \dots$ be independent (not necessarily identically distributed) with $E[X_j] = 0$ and $\mathrm{Var}(X_j) = \sigma_j^2 < \infty$. Set $S_n = X_1 + \cdots + X_n$ and $s_n^2 = \sum_{j=1}^n \sigma_j^2$. The Lindeberg condition is

$$\lim_{n \to \infty} \frac{1}{s_n^2} \sum_{j=1}^n E\Big[X_j^2\,\mathbf{1}\{|X_j| > \varepsilon s_n\}\Big] = 0 \quad \text{for every } \varepsilon > 0.$$
The condition asks that the fraction of variance carried by summands that are individually large compared with $s_n$ vanishes. It does not ask any $X_j$ to be bounded, only that the tails are "not too concentrated in a few terms."
Under the setup of Definition 2, if the Lindeberg condition holds, then

$$\frac{S_n}{s_n} \xrightarrow{d} \mathcal{N}(0, 1).$$
Proof.
(Outline — the full proof is technical. See Durrett 2019, §3.4 for details.)
Define Gaussian surrogates $Y_j$, independent of each other and of the $X_j$, with $Y_j \sim \mathcal{N}(0, \sigma_j^2)$. Let $T_n = \frac{1}{s_n}\sum_{j=1}^n Y_j$, which is exactly $\mathcal{N}(0, 1)$ by independence. The strategy is to show that the CFs of $S_n/s_n$ and $T_n$ differ by $o(1)$.
Writing $\varphi_{X_j}(t/s_n) = 1 - \frac{\sigma_j^2 t^2}{2 s_n^2} + r_j(t)$, where $r_j$ is the Taylor remainder of a mean-zero variable (Feller 1971, Lemma XV.4.1), one bounds the $j$-th CF difference by the $j$-th Lindeberg contribution plus a term of order $\varepsilon\, |t|^3 \sigma_j^2 / s_n^2$.
Summing over $j$ and using $\sum_j \sigma_j^2 = s_n^2$ gives

$$\Big|\varphi_{S_n/s_n}(t) - e^{-t^2/2}\Big| \;\lesssim\; \frac{t^2}{s_n^2}\sum_{j=1}^n E\big[X_j^2\,\mathbf{1}\{|X_j| > \varepsilon s_n\}\big] \;+\; \varepsilon\,|t|^3.$$
Letting $n \to \infty$ first (so the Lindeberg sum vanishes) and then $\varepsilon \to 0$ gives pointwise CF convergence to $e^{-t^2/2}$. Lévy continuity finishes. ◼
The Lindeberg condition is often inconvenient to check because it involves truncated second moments. A sufficient condition using a higher moment — easier in practice — is due to Lyapunov.
Under the Lindeberg setup, if there exists $\delta > 0$ such that

$$\lim_{n \to \infty} \frac{1}{s_n^{2+\delta}} \sum_{j=1}^n E\big[|X_j|^{2+\delta}\big] = 0,$$

then $S_n / s_n \xrightarrow{d} \mathcal{N}(0, 1)$.
The Lyapunov condition implies the Lindeberg condition.
Proof.
On the event $\{|X_j| > \varepsilon s_n\}$, $X_j^2 \le \dfrac{|X_j|^{2+\delta}}{(\varepsilon s_n)^{\delta}}$. So

$$E\big[X_j^2\,\mathbf{1}\{|X_j| > \varepsilon s_n\}\big] \le \frac{E\big[|X_j|^{2+\delta}\big]}{(\varepsilon s_n)^{\delta}}.$$
Summing over $j$ and dividing by $s_n^2$:

$$\frac{1}{s_n^2}\sum_{j=1}^n E\big[X_j^2\,\mathbf{1}\{|X_j| > \varepsilon s_n\}\big] \le \frac{1}{\varepsilon^{\delta}} \cdot \frac{1}{s_n^{2+\delta}}\sum_{j=1}^n E\big[|X_j|^{2+\delta}\big] \longrightarrow 0.$$
◼
Take $X_1 \sim \mathcal{N}(0, 1)$ and $X_j \sim \mathcal{N}(0, 2^{-j})$ for $j \ge 2$, all independent. Then

$$s_n^2 = 1 + \sum_{j=2}^n 2^{-j} \to \frac{3}{2}, \qquad \frac{\sigma_1^2}{s_n^2} \to \frac{2}{3}.$$

The Lindeberg condition fails because $X_1$ carries essentially all the variance: $s_n$ stays bounded, so the truncation $\{|X_1| > \varepsilon s_n\}$ retains mass for every $n$. And sure enough, $S_n/s_n \sim \mathcal{N}(0,1)$ already — but not because of the CLT. It's simply inheriting the single Gaussian's distribution. Replace $X_1$ with $t_3/\sqrt{3}$ (Student $t$ with three degrees of freedom scaled by $1/\sqrt{3}$), still with variance $1$, and $S_n/s_n$ will be approximately a scaled $t_3$, not Normal — the CLT truly breaks.
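A simulation sketch of the broken case, using the variance ladder above with $X_1$ replaced by the scaled $t_3$ (the specific $2^{-j}$ ladder is the assumed concrete instance):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
reps = 50_000

for n in (10, 50, 200):
    sigmas = np.sqrt(np.array([1.0] + [2.0**-j for j in range(2, n + 1)]))
    s_n = np.sqrt(np.sum(sigmas**2))
    # Dominant summand: scaled t_3 (variance 1); the rest: small Gaussians
    x1 = rng.standard_t(3, reps) / np.sqrt(3.0)
    rest = rng.standard_normal((reps, n - 1)) * sigmas[1:]
    z = (x1 + rest.sum(axis=1)) / s_n
    ks = stats.kstest(z, "norm").statistic
    print(f"n={n:>4}: KS distance from N(0,1) = {ks:.3f}")  # does not shrink with n
```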
Feller (1935) proved the converse: if $S_n/s_n \xrightarrow{d} \mathcal{N}(0,1)$ and $\max_{j \le n} \sigma_j^2 / s_n^2 \to 0$ (a "negligibility" condition), then the Lindeberg condition holds. Together these constitute an if-and-only-if characterization: under negligibility, the Lindeberg condition is the precise mechanism making the CLT work. The Lyapunov condition is merely a more tractable sufficient one — it gives up some generality for ease of verification.
11.7 Berry–Esseen: How Fast Is Convergence?
The CLT is a limit theorem — it tells us where we end up, not how fast we get there. For applied work, the rate matters. A 95% confidence interval is meaningless if, at our actual $n$, the standardized mean is only vaguely Normal. Berry (1941) and Esseen (1942) gave the definitive answer: the CLT converges at rate $O(1/\sqrt{n})$, with a constant driven by the absolute third moment.
Let $X_1, X_2, \dots$ be iid with $E[X_i] = \mu$, $\mathrm{Var}(X_i) = \sigma^2 > 0$, and finite absolute third moment $\rho = E\big[|X_i - \mu|^3\big] < \infty$. Set $Z_n = \sqrt{n}\,(\bar{X}_n - \mu)/\sigma$. Let $F_n$ be the CDF of $Z_n$. Then there is an absolute constant $C$ such that

$$\sup_{x \in \mathbb{R}} \big|F_n(x) - \Phi(x)\big| \le \frac{C\,\rho}{\sigma^3 \sqrt{n}}.$$

The best known bound is $C \le 0.4748$ (Shevtsova, 2011).
Two reads:
- Rate. The $1/\sqrt{n}$ factor is the headline: double your sample size, cut the worst-case deviation from the Normal approximation by a factor of $\sqrt{2}$.
- Constant. $\rho/\sigma^3$ is the absolute third moment normalized by $\sigma^3$. For symmetric distributions $\rho/\sigma^3$ is modest; for right-skewed distributions like Exponential(1) ($\rho/\sigma^3 \approx 2.41$) or Chi²(1) (larger still), it is large and the approximation is correspondingly slower. Skewness is the main enemy of normality.
At $n = 100$ with $C = 0.4748$:
- Uniform(0, 1): $\rho/\sigma^3 = \frac{1/32}{(1/12)^{3/2}} \approx 1.30$, bound $\approx 0.062$. Empirical sup deviation: far smaller in simulation.
- Exponential(1): $\rho/\sigma^3 = 12/e - 2 \approx 2.41$, bound $\approx 0.115$. Empirical sup deviation: also far smaller, but several times the Uniform's (see the sketch below).
The bound is an upper envelope — the empirical deviation is typically 5–10× smaller than the bound. But the relative ordering is preserved: the ratio of Uniform to Exponential sup deviations tracks the ratio of their $\rho/\sigma^3$ values. The bound is not tight in general (it overstates the deviation), but it is tight in the worst case — there exist distributions (the Bernoulli family) for which the constant cannot be improved.
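A sketch computing both sides at $n = 100$ — the Berry–Esseen bound from exact moments, and the empirical sup deviation by simulation (the replication count is an assumption):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
C, n, reps = 0.4748, 100, 100_000

cases = {
    # name: (sampler, mu, sigma, rho = E|X - mu|^3)
    "Uniform(0,1)":   (lambda s: rng.uniform(0, 1, s), 0.5, np.sqrt(1/12), 1/32),
    "Exponential(1)": (lambda s: rng.exponential(1, s), 1.0, 1.0, 12/np.e - 2),
}

for name, (sampler, mu, sigma, rho) in cases.items():
    bound = C * rho / (sigma**3 * np.sqrt(n))
    x = sampler((reps, n))
    z = (x.mean(axis=1) - mu) / (sigma / np.sqrt(n))
    emp = stats.kstest(z, "norm").statistic  # empirical sup |F_n - Phi|
    print(f"{name:15s} bound = {bound:.4f}   empirical sup dev ≈ {emp:.4f}")
```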
The original bound had $C \le 7.59$ (Esseen, 1942). Over eighty years of refinement, the constant has come down to $C \le 0.4748$ (Shevtsova, 2011). The true value is known to satisfy $C \ge \frac{\sqrt{10}+3}{6\sqrt{2\pi}} \approx 0.4097$ by a sharp Bernoulli($p$) example. The gap between $0.4097$ and $0.4748$ is the open problem.
11.8 The Multivariate CLT
Sample means of random vectors converge to the multivariate Normal. The statement is a verbatim translation of the univariate CLT with $\sigma^2$ replaced by the covariance matrix $\Sigma$.
Let $\mathbf{X}_1, \mathbf{X}_2, \dots$ be iid random vectors in $\mathbb{R}^d$ with $E[\mathbf{X}_i] = \boldsymbol{\mu}$ and finite covariance matrix $\Sigma$. Let $\bar{\mathbf{X}}_n = \frac{1}{n}\sum_{i=1}^n \mathbf{X}_i$. Then

$$\sqrt{n}\,\big(\bar{\mathbf{X}}_n - \boldsymbol{\mu}\big) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \Sigma).$$
The proof reduces to the univariate case by the Cramér–Wold device: $\mathbf{Y}_n \xrightarrow{d} \mathbf{Y}$ in $\mathbb{R}^d$ if and only if $\mathbf{t}^\top \mathbf{Y}_n \xrightarrow{d} \mathbf{t}^\top \mathbf{Y}$ for every $\mathbf{t} \in \mathbb{R}^d$. The linear projection $\mathbf{t}^\top \mathbf{X}_i$ is a scalar iid sum, to which the univariate CLT applies, giving $\sqrt{n}\,\mathbf{t}^\top(\bar{\mathbf{X}}_n - \boldsymbol{\mu}) \xrightarrow{d} \mathcal{N}(0, \mathbf{t}^\top \Sigma\, \mathbf{t})$ — which is exactly the law of $\mathbf{t}^\top \mathcal{N}(\mathbf{0}, \Sigma)$.
For a concrete instance, take $\mathbf{X}_i = (U_i, U_i^2)$ with $U_i \sim \text{Uniform}(0,1)$, so $\boldsymbol{\mu} = \big(\tfrac{1}{2}, \tfrac{1}{3}\big)$ and

$$\Sigma = \begin{pmatrix} \tfrac{1}{12} & \tfrac{1}{12} \\ \tfrac{1}{12} & \tfrac{4}{45} \end{pmatrix}.$$

(The cross-covariance is $\mathrm{Cov}(U, U^2) = E[U^3] - E[U]\,E[U^2] = \tfrac{1}{4} - \tfrac{1}{6} = \tfrac{1}{12}$.) By the multivariate CLT, for large $n$ the sample mean vector is approximately $\mathcal{N}\big(\boldsymbol{\mu}, \Sigma/n\big)$. The Mahalanobis distance satisfies $n\,(\bar{\mathbf{X}}_n - \boldsymbol{\mu})^\top \Sigma^{-1} (\bar{\mathbf{X}}_n - \boldsymbol{\mu}) \xrightarrow{d} \chi^2_2$ — the standard recipe for multivariate hypothesis tests.
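A simulation sketch of the $\chi^2_2$ limit for this bivariate example (the Uniform-based vectors are the illustrative choice made above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
reps, n = 20_000, 200

mu = np.array([1/2, 1/3])
Sigma = np.array([[1/12, 1/12], [1/12, 4/45]])
Sigma_inv = np.linalg.inv(Sigma)

u = rng.uniform(0, 1, (reps, n))
xbar = np.stack([u.mean(axis=1), (u**2).mean(axis=1)], axis=1)  # sample mean vectors

d = xbar - mu
mahalanobis = n * np.einsum("ri,ij,rj->r", d, Sigma_inv, d)  # n * d^T Sigma^{-1} d

# Compare with chi^2_2: the 95th percentile should be near 5.99
print("empirical 95th pct:", np.quantile(mahalanobis, 0.95))
print("chi^2_2 95th pct:  ", stats.chi2.ppf(0.95, df=2))
```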
The Cramér–Wold device reduces every multivariate convergence problem to a family of univariate problems. It does for convergence in distribution what linearity does for expectation: convert a $d$-dimensional question into a one-dimensional question indexed by the unit sphere. This is why the multivariate Normal is characterized by its one-dimensional projections, why quadratic forms of multivariate Normals are $\chi^2$-distributed, and why multivariate Slutsky works.
11.9 The Delta Method Revisited
Topic 9 proved the delta method as Theorems 9.11–9.12. With the CLT now established, its most useful form — which is how every statistics textbook actually applies it — becomes rigorous.
Let $T_n$ be a sequence of estimators with $\sqrt{n}\,(T_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$, and let $g$ be differentiable at $\mu$ with $g'(\mu) \ne 0$. Then

$$\sqrt{n}\,\big(g(T_n) - g(\mu)\big) \xrightarrow{d} \mathcal{N}\big(0,\, [g'(\mu)]^2 \sigma^2\big).$$
The statement follows directly from Topic 9, Theorem 9.11, with the CLT supplying the root-$n$ normality hypothesis. The key consequence for applied statistics: any smooth transformation of an asymptotically Normal estimator is itself asymptotically Normal, with variance multiplied by $[g'(\mu)]^2$.
Suppose $X_1, \dots, X_n \sim \text{Exponential}(\lambda)$ iid, and we want a 95% CI for the rate parameter $\lambda$. The MLE is $\hat{\lambda} = 1/\bar{X}_n$. By the CLT, $\sqrt{n}\,(\bar{X}_n - 1/\lambda) \xrightarrow{d} \mathcal{N}(0, 1/\lambda^2)$. Apply the delta method with $g(x) = 1/x$, so $g'(x) = -1/x^2$.
The key subtlety: $g'$ must be evaluated at the true mean $\mu = 1/\lambda$, not left as a function of $x$. A common pitfall is to mechanically plug $\bar{X}_n$ into $g'$ with the symbol $x$ still floating, producing an expression like $-1/x^2$ that has no stable interpretation. Evaluate first:

$$g'(1/\lambda) = -\lambda^2.$$
With $g'$ pinned to $-\lambda^2$, the delta method gives

$$\sqrt{n}\,\big(\hat{\lambda} - \lambda\big) \xrightarrow{d} \mathcal{N}\big(0,\, \lambda^4 \cdot 1/\lambda^2\big) = \mathcal{N}(0, \lambda^2).$$
The 95% CI is $\hat{\lambda} \pm 1.96\,\hat{\lambda}/\sqrt{n}$. For skewed estimators like this, a log transform often gives a better small-$n$ CI: $h(\lambda) = \log\lambda$ has derivative $1/\lambda$, so $\sqrt{n}\,\big(\log\hat{\lambda} - \log\lambda\big) \xrightarrow{d} \mathcal{N}(0, 1)$. Build the CI on the log scale, then exponentiate: $\hat{\lambda}\, e^{\pm 1.96/\sqrt{n}}$. This symmetric-on-log-scale CI respects the non-negativity of $\lambda$ (unlike the naive linear CI, which can be negative for small $n$).
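A coverage-comparison sketch for the two CIs at small $n$ (the true rate, sample size, and replication count are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
lam_true, n, reps = 2.0, 20, 50_000

x = rng.exponential(1 / lam_true, (reps, n))
lam_hat = 1 / x.mean(axis=1)  # MLE of the rate
half = 1.96 / np.sqrt(n)

# Linear-scale CI: lam_hat * (1 +/- 1.96 / sqrt(n))
lin_cov = np.mean((lam_hat * (1 - half) <= lam_true)
                  & (lam_true <= lam_hat * (1 + half)))

# Log-scale CI: lam_hat * exp(+/- 1.96 / sqrt(n))
log_cov = np.mean((lam_hat * np.exp(-half) <= lam_true)
                  & (lam_true <= lam_hat * np.exp(half)))

print(f"linear CI coverage: {lin_cov:.3f}")  # typically below the nominal 0.95
print(f"log CI coverage:    {log_cov:.3f}")  # typically closer to 0.95
```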
If $g'(\mu) = 0$, the first-order delta method gives a degenerate limit $\mathcal{N}(0, 0)$ — which is useless. The fix is a second-order delta method: if $g''(\mu) \ne 0$, then

$$n\,\big(g(T_n) - g(\mu)\big) \xrightarrow{d} \frac{g''(\mu)}{2}\,\sigma^2\,\chi^2_1.$$

Now the convergence rate is $1/n$, not $1/\sqrt{n}$, and the limit is a (shifted, scaled) chi-squared, not a Normal. This happens, for example, with $g(x) = x^2$ at $\mu = 0$: the sample mean is centered at zero, so $n\bar{X}_n^2$ has a non-Normal limit.
11.10 Connections to Machine Learning
1. Confidence intervals for model accuracy. After evaluating a classifier on $n$ test examples, the error rate $\hat{p}$ is a Binomial average. By de Moivre–Laplace, $\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n}$ is an approximate 95% CI for the true error rate. Every reported accuracy on a test set has Binomial error bars whether the reporter acknowledges them or not — and those bars are non-trivial on a 1000-example test set even for a perfectly classified model (with zero observed errors, the 95% upper bound on the true error rate is still about $3/n \approx 0.3\%$), growing as the test set shrinks. (A CI sketch follows this list.)
2. SGD mini-batch gradient noise. The stochastic gradient $\hat{g} = \frac{1}{B}\sum_{i=1}^{B} \nabla\ell_i(\theta)$ is an average of $B$ iid gradient samples (assuming the batch is drawn uniformly at random). By the multivariate CLT, $\sqrt{B}\,\big(\hat{g} - \nabla L(\theta)\big) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \Sigma)$, where $\Sigma$ is the per-sample gradient covariance. The optimization noise is Gaussian to leading order — this is why SGD trajectories look like Brownian motion on an energy landscape, and why the $\eta_t \propto 1/t$ learning-rate schedule is the canonical rate. formalML: Stochastic Gradient Descent has the full story.
3. Bayesian posteriors (Bernstein–von Mises). The Bernstein–von Mises theorem is the CLT for posteriors: under regularity, $\pi(\theta \mid X_{1:n}) \approx \mathcal{N}\big(\hat{\theta}_{\text{MLE}},\, (n I(\hat{\theta}_{\text{MLE}}))^{-1}\big)$, where $I$ is the Fisher information. The posterior concentrates around the MLE at rate $1/\sqrt{n}$ with a Gaussian shape — regardless of the prior. This is why Bayesian credible intervals and frequentist confidence intervals agree asymptotically, and why casual users can get away with ignoring prior choice on large datasets. See formalML: Bayesian Neural Networks for where this machinery is indispensable (and where it breaks — small data, non-identifiable models).
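Returning to item 1, a sketch of the test-accuracy error bars: the de Moivre–Laplace (Wald) interval next to the exact Clopper–Pearson interval, which handles the zero-error edge case where the Wald bar collapses; the error counts are assumptions:

```python
import numpy as np
from scipy import stats

def wald_ci(errors, n, level=0.95):
    # De Moivre–Laplace (Wald) interval for a Binomial proportion
    p_hat = errors / n
    z = stats.norm.ppf(1 - (1 - level) / 2)
    half = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def clopper_pearson_ci(errors, n, level=0.95):
    # Exact Binomial interval via Beta quantiles; handles errors = 0 gracefully
    a = (1 - level) / 2
    lo = stats.beta.ppf(a, errors, n - errors + 1) if errors > 0 else 0.0
    hi = stats.beta.ppf(1 - a, errors + 1, n - errors)
    return lo, hi

for errors, n in [(130, 1000), (13, 100), (0, 1000)]:
    w_lo, w_hi = wald_ci(errors, n)
    e_lo, e_hi = clopper_pearson_ci(errors, n)
    print(f"errors={errors:>3}/{n}:  Wald=({w_lo:.4f}, {w_hi:.4f})  "
          f"exact=({e_lo:.4f}, {e_hi:.4f})")
```

At zero observed errors the Wald interval collapses to a point — the exact interval still shows a ≈0.4% upper bound on the true error rate at $n = 1000$, which is the honest error bar.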
The Bernstein–von Mises theorem formalizes a result practitioners use implicitly every day: on large-enough data, frequentist and Bayesian inference coincide. The proof (Van der Vaart 1998, Chapter 10) uses the CLT via the score function — the derivative of the log-likelihood at the MLE is asymptotically Normal by the CLT, and Laplace's method extracts a Gaussian posterior from the likelihood. The $1/\sqrt{n}$ rate is inherited directly from the CLT rate. When Bernstein–von Mises fails — heavy-tailed priors, non-identifiable models, high-dimensional parameters with $d$ growing with $n$ — it is precisely because one of the CLT's hypotheses fails for the score.
11.11 Summary
Two centuries of probability, compressed:
| Theorem | Year | Says | Requires |
|---|---|---|---|
| De Moivre–Laplace | 1733 | Binomial → Normal | Bernoulli sum |
| Lindeberg–Lévy | 1922 | $\sqrt{n}(\bar{X}_n - \mu)/\sigma \xrightarrow{d} \mathcal{N}(0,1)$ | iid, $\sigma^2 < \infty$ |
| Lindeberg | 1922 | Same for non-identical summands | Lindeberg condition |
| Lyapunov | 1901 | Sufficient version of Lindeberg | $(2+\delta)$-moment bound |
| Lévy CF continuity | 1925 | CF convergence ⟺ convergence in distribution | CF continuous at $0$ |
| Berry–Esseen | 1941 | Rate $\sup_x \lvert F_n - \Phi\rvert \le C\rho/(\sigma^3\sqrt{n})$ | Finite third moment |
| Multivariate CLT | — | $\sqrt{n}(\bar{\mathbf{X}}_n - \boldsymbol{\mu}) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \Sigma)$ | iid vectors, finite $\Sigma$ |
| Delta method | — | Nonlinear transforms inherit CLT normality | $g$ differentiable, $g'(\mu) \ne 0$ |
Experiment with the CLT for any of nine underlying distributions below. Toggle between distributions to see the Berry–Esseen rate in action; toggle between $n$ values to watch the QQ plot go from curved (non-Normal) to straight (Normal).
What comes next. Large Deviations & Tail Bounds complements the CLT: where the CLT gives the shape of fluctuations at the $1/\sqrt{n}$ scale, large-deviations theory gives exponential rates at all scales, including the tails where the CLT is weakest. Together, they give a complete picture of how sample means behave. Beyond Track 3, the CLT is the foundation for confidence intervals, hypothesis testing, the bootstrap, and every asymptotic result in the statistical-inference tracks.
References
- Durrett, R. (2019). Probability: Theory and Examples (5th ed.). Cambridge University Press.
- Billingsley, P. (2012). Probability and Measure (Anniversary ed.). Wiley.
- Casella, G., & Berger, R. L. (2002). Statistical Inference (2nd ed.). Cengage.
- Wasserman, L. (2004). All of Statistics. Springer.
- Feller, W. (1971). An Introduction to Probability Theory and Its Applications, Vol. 2 (2nd ed.). Wiley.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.