intermediate 45 min read · April 15, 2026

Central Limit Theorem

Why normality emerges from chaos — the shape of fluctuations, the rate of convergence, and why almost all of classical statistics works.

11.1 Why the LLN Isn’t Enough

Topic 10 closed with a qualitative statement: the sample mean $\bar{X}_n$ converges to the population mean $\mu$. The Strong Law of Large Numbers promises it happens almost surely, the Weak Law in probability, and the law of the iterated logarithm even pins down the precise a.s. oscillation rate $\sigma\sqrt{2\log\log n / n}$. What none of these tell us is the shape of the fluctuations at any given $n$.

That gap matters. When we report $\hat{p} = 0.42$ from a survey of $n = 1000$, we want to say how confident we are that $p$ is near $0.42$ — not merely that $\hat{p} \to p$ eventually. Confidence requires a distribution on $\hat{p} - p$, and the LLN gives us nothing of the kind.

The Central Limit Theorem fills that gap with one of the most surprising results in probability. Standardize the sample mean: subtract the true mean $\mu$, divide by the true standard error $\sigma/\sqrt{n}$. Call the result $Z_n$. Then — regardless of whether $X_i$ is Bernoulli, Exponential, Poisson, Uniform, or almost anything else with finite variance —

$$Z_n \;=\; \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \;\xrightarrow{d}\; \mathcal{N}(0, 1).$$

The limiting distribution doesn’t depend on the shape of $X_i$. Skewed, symmetric, heavy-tailed, bounded, discrete — they all converge to the same Gaussian. The CLT is the mathematical reason confidence intervals, $z$-tests, $p$-values, and the entire apparatus of frequentist inference work.

Three-panel CLT overview: (left) sample means shrinking toward μ (LLN), (center) standardized means converging in distribution to N(0,1), (right) convergence speed across distributions

Here is the roadmap:

| Section | Result | What it says |
| --- | --- | --- |
| §2 | De Moivre–Laplace | The Binomial becomes Normal — the CLT’s historical root |
| §3 | Classical CLT (Lindeberg–Lévy) | iid + finite variance ⟹ $\sqrt{n}(\bar{X}_n - \mu)/\sigma \xrightarrow{d} \mathcal{N}(0,1)$ |
| §4 | MGF proof | Taylor-expand $\log M(t/\sigma\sqrt{n})$, apply Lévy continuity |
| §5 | CF proof | Same structure with characteristic functions; no MGF required |
| §6 | Lindeberg CLT | iid is overkill — only need no single summand to dominate |
| §7 | Berry–Esseen | The rate is $O(1/\sqrt{n})$, and skewness controls the constant |
| §8 | Multivariate CLT | Random vectors converge to $\mathcal{N}(\mathbf{0}, \Sigma)$ |
| §9 | Delta method | Nonlinear transformations inherit the CLT with variance $[g'(\mu)]^2 \sigma^2$ |
| §10 | ML connections | Confidence intervals, SGD noise, Bayesian CLT |
| §11 | Summary | Interactive explorer and reference table |

Throughout, we use $\Phi$ for the standard Normal CDF, $\varphi$ for characteristic functions (not to be confused with $\Phi$), $M_X(t) = \mathbb{E}[e^{tX}]$ for moment-generating functions, and $\xrightarrow{d}$ for convergence in distribution (Topic 9, Definition 9.6).


11.2 De Moivre–Laplace: The First CLT

The first CLT predates the general theory by almost two centuries. In 1733, Abraham de Moivre — working on a problem of fair games of chance — proved that the Binomial distribution, properly standardized, approaches the Normal. Laplace generalized it to arbitrary $p$ in 1812, and the result became known as the de Moivre–Laplace theorem. Every bell-curve-from-coin-flips demonstration the reader has ever seen is a visualization of this theorem.

Theorem 1 De Moivre–Laplace

Let $S_n \sim \text{Binomial}(n, p)$ for fixed $p \in (0, 1)$. Then

$$\frac{S_n - np}{\sqrt{np(1-p)}} \;\xrightarrow{d}\; \mathcal{N}(0, 1).$$

Equivalently, for any $a < b$,

$$\mathbb{P}\!\left(a \le \frac{S_n - np}{\sqrt{np(1-p)}} \le b\right) \;\longrightarrow\; \Phi(b) - \Phi(a).$$
Proof

Write $S_n = X_1 + \cdots + X_n$ with $X_i \sim \text{Bernoulli}(p)$ iid. Let $q = 1 - p$. The Binomial PMF is

$$\mathbb{P}(S_n = k) \;=\; \binom{n}{k} p^k q^{n-k}.$$

Near the mode $k \approx np$ we apply Stirling’s approximation, $n! \sim \sqrt{2\pi n}\,(n/e)^n$, to the three factorials:

$$\binom{n}{k} \;=\; \frac{n!}{k!\,(n-k)!} \;\sim\; \frac{\sqrt{2\pi n}\,(n/e)^n}{\sqrt{2\pi k}\,(k/e)^k \cdot \sqrt{2\pi(n-k)}\,((n-k)/e)^{n-k}}.$$

Simplify the prefactor:

$$\binom{n}{k} \;\sim\; \sqrt{\frac{n}{2\pi k (n-k)}} \cdot \frac{n^n}{k^k (n-k)^{n-k}}.$$

Change variables to $x = (k - np)/\sqrt{npq}$, so $k = np + x\sqrt{npq}$ and $n - k = nq - x\sqrt{npq}$. The prefactor becomes

$$\sqrt{\frac{n}{2\pi k(n-k)}} \;=\; \sqrt{\frac{n}{2\pi \cdot np \cdot nq \cdot (1 + O(1/\sqrt{n}))}} \;=\; \frac{1}{\sqrt{2\pi npq}} \cdot (1 + o(1)).$$

For the exponential factor, write $k = np(1 + x\sqrt{q/np})$ and $n-k = nq(1 - x\sqrt{p/nq})$. Then

$$\log\!\left(\frac{p^k q^{n-k} \cdot n^n}{k^k(n-k)^{n-k}}\right) \;=\; k\log\frac{np}{k} + (n-k)\log\frac{nq}{n-k}.$$

Expand each logarithm using $\log(1 + u) = u - u^2/2 + O(u^3)$ with $u_1 = x\sqrt{q/np}$ and $u_2 = -x\sqrt{p/nq}$. The first-order terms cancel (the mean of $S_n$ is $np$ by construction), and the second-order terms combine to

$$-\tfrac{1}{2}\,x^2\,(q + p) \;=\; -\tfrac{1}{2}\,x^2.$$

Higher-order terms are $O(1/\sqrt{n})$ and vanish in the limit. Multiplying the prefactor and the exponential factor:

$$\mathbb{P}(S_n = k) \;\sim\; \frac{1}{\sqrt{2\pi npq}} \, e^{-x^2/2}, \qquad x = \frac{k - np}{\sqrt{npq}}.$$

The factor $1/\sqrt{npq}$ is the spacing between standardized values — it converts the PMF to a Riemann approximation of the density. Summing over $k$ with $a \le x \le b$ gives

$$\mathbb{P}\!\left(a \le \frac{S_n - np}{\sqrt{npq}} \le b\right) \;\longrightarrow\; \int_a^b \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\, dx \;=\; \Phi(b) - \Phi(a),$$

which is exactly convergence in distribution to $\mathcal{N}(0,1)$. ◼

Three-panel de Moivre–Laplace: Binomial PMF bars approaching the Normal density as n grows; standardized Binomial CDF vs Φ; continuity correction effect
Example 1 Binomial(100, 0.3): the 95 percent band

With $n = 100$, $p = 0.3$: $np = 30$, $\sqrt{npq} = \sqrt{21} \approx 4.58$. The standardization says $S_{100}$ is approximately $\mathcal{N}(30, 21)$. A 95% Normal band gives

$$30 \pm 1.96 \cdot 4.58 \;\approx\; [21.02, 38.98].$$

The exact Binomial probability $\mathbb{P}(21 \le S_{100} \le 39)$ is about $0.961$, while the plain Normal band gives $0.950$. For discrete-data accuracy, a continuity correction replaces the integer endpoints by half-integer ones: $\mathbb{P}(a \le S_n \le b) \approx \Phi((b + 0.5 - np)/\sqrt{npq}) - \Phi((a - 0.5 - np)/\sqrt{npq})$. With the correction the approximation is about $0.962$, within roughly $0.001$ of the exact value, while the uncorrected band is off by about $0.01$.
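The numbers in this example are easy to verify numerically. A minimal sketch, assuming Python with scipy is available; the parameters are the ones used above:

```python
# Sketch: Normal approximation to Binomial(100, 0.3), with and without
# the continuity correction.
import numpy as np
from scipy import stats

n, p = 100, 0.3
mu, sd = n * p, np.sqrt(n * p * (1 - p))
a, b = 21, 39

exact = stats.binom.cdf(b, n, p) - stats.binom.cdf(a - 1, n, p)   # P(a <= S_n <= b)
plain = stats.norm.cdf((b - mu) / sd) - stats.norm.cdf((a - mu) / sd)
corrected = (stats.norm.cdf((b + 0.5 - mu) / sd)
             - stats.norm.cdf((a - 0.5 - mu) / sd))

print(f"exact Binomial        : {exact:.4f}")
print(f"Normal, no correction : {plain:.4f}")
print(f"Normal, corrected     : {corrected:.4f}")
```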

Remark 1 Historical arc

De Moivre (1733) proved the $p = 1/2$ case. Laplace (1812) extended to general $p$. Lyapunov (1901) gave the first general CLT under a third-moment condition. Lindeberg (1922) gave the sharp necessary-and-sufficient condition. Lévy (1925) supplied the characteristic function machinery, and Berry (1941) and Esseen (1942) proved the rate. Two centuries of incremental sharpening separate de Moivre’s coin-flip argument from the modern graduate-level CLT.


11.3 The Classical CLT (Lindeberg–Lévy)

The general CLT removes the Bernoulli restriction and asks only that the summands be iid with finite variance.

Theorem 2 Classical CLT (Lindeberg–Lévy)

Let $X_1, X_2, \ldots$ be iid with $\mathbb{E}[X_1] = \mu$ and $0 < \text{Var}(X_1) = \sigma^2 < \infty$. Let $\bar{X}_n = (1/n)\sum_{i=1}^n X_i$. Then

$$Z_n \;=\; \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \;\xrightarrow{d}\; \mathcal{N}(0, 1).$$

Two conditions, and only two: independence + identical distribution, and finite variance. No shape restrictions, no moment conditions beyond $\sigma^2 < \infty$. The theorem is blind to whether $X_1$ is discrete or continuous, bounded or unbounded, symmetric or skewed.

Example 2 Five distributions, one limit

We simulate $M = 5000$ replications of the standardized mean $Z_n$ at several values of $n$:

| Distribution | $n = 5$ | $n = 10$ | $n = 30$ | $n = 100$ |
| --- | --- | --- | --- | --- |
| Uniform(0,1) | KS ≈ 0.02 | KS ≈ 0.01 | KS ≈ 0.01 | KS < 0.01 |
| Exponential(1) | KS ≈ 0.13 | KS ≈ 0.08 | KS ≈ 0.04 | KS ≈ 0.02 |
| Bernoulli(0.3) | KS ≈ 0.09 | KS ≈ 0.05 | KS ≈ 0.03 | KS ≈ 0.02 |
| Poisson(5) | KS ≈ 0.06 | KS ≈ 0.04 | KS ≈ 0.02 | KS ≈ 0.01 |
| Chi²(3) | KS ≈ 0.18 | KS ≈ 0.11 | KS ≈ 0.06 | KS ≈ 0.03 |

(KS = Kolmogorov–Smirnov distance from $\mathcal{N}(0,1)$; smaller is better.)

Symmetric, bounded distributions (Uniform) hit target accuracy almost immediately. Skewed distributions (Chi², Exponential) take longer. The rate is $O(1/\sqrt{n})$ uniformly — §7 will pin it down.
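A sketch of the simulation behind a table like this, assuming numpy and scipy; the exact KS values depend on the seed and the number of replications:

```python
# Sketch: Monte Carlo estimate of the KS distance between the standardized
# mean Z_n and N(0,1) for two underlying distributions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
M = 5000  # replications

def ks_to_normal(sampler, mu, sigma, n):
    x = sampler(size=(M, n))                       # M replications of n draws
    z = np.sqrt(n) * (x.mean(axis=1) - mu) / sigma
    return stats.kstest(z, "norm").statistic

for n in (5, 10, 30, 100):
    ks_unif = ks_to_normal(rng.uniform, 0.5, np.sqrt(1 / 12), n)
    ks_expo = ks_to_normal(rng.exponential, 1.0, 1.0, n)
    print(f"n={n:4d}  Uniform KS={ks_unif:.3f}  Exponential KS={ks_expo:.3f}")
```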

Remark 2 Not actually about iid

The Lindeberg–Lévy hypothesis is restrictive — real data rarely comes iid. But the iid assumption is not necessary; §6 replaces it with the Lindeberg condition, which is the true mechanism making the CLT work: no single summand should contribute an appreciable fraction of the total variance. iid is simply the simplest setting in which this happens automatically.


11.4 Proof via Moment-Generating Functions

The MGF proof is the most concrete. It assumes $M_X(t) = \mathbb{E}[e^{tX}]$ exists in a neighborhood of zero (which excludes heavy-tailed distributions like Cauchy and Pareto without moments) but is otherwise a calculus exercise. The structure is identical to the Poisson limit proof of Theorem 9.13 — only the Taylor expansion differs.

Proof

Without loss of generality take $\mu = 0$ and $\sigma = 1$ (otherwise work with $Y_i = (X_i - \mu)/\sigma$). Then $Z_n = (X_1 + \cdots + X_n)/\sqrt{n}$. By independence,

$$M_{Z_n}(t) \;=\; \mathbb{E}\!\left[\exp\!\left(\tfrac{t}{\sqrt{n}}\sum_{i=1}^n X_i\right)\right] \;=\; \prod_{i=1}^n \mathbb{E}\!\left[e^{tX_i/\sqrt{n}}\right] \;=\; M_X(t/\sqrt{n})^n.$$

Take logarithms:

$$\log M_{Z_n}(t) \;=\; n \log M_X(t/\sqrt{n}).$$

Since $M_X$ is smooth near $0$ and $M_X(0) = 1$, we can Taylor-expand $M_X$ around zero. Using $M_X'(0) = \mathbb{E}[X] = 0$ and $M_X''(0) = \mathbb{E}[X^2] = \sigma^2 = 1$:

$$M_X(s) \;=\; 1 + 0 \cdot s + \tfrac{1}{2} s^2 + O(s^3) \;=\; 1 + \tfrac{1}{2} s^2 + O(s^3) \qquad \text{as } s \to 0.$$

Substitute $s = t/\sqrt{n}$:

$$M_X(t/\sqrt{n}) \;=\; 1 + \frac{t^2}{2n} + O(n^{-3/2}).$$

Now apply $\log(1 + u) = u - u^2/2 + O(u^3)$ with $u = t^2/(2n) + O(n^{-3/2})$:

$$\log M_X(t/\sqrt{n}) \;=\; \frac{t^2}{2n} + O(n^{-3/2}).$$

Multiply by $n$:

$$\log M_{Z_n}(t) \;=\; n \log M_X(t/\sqrt{n}) \;=\; \frac{t^2}{2} + O(n^{-1/2}) \;\longrightarrow\; \frac{t^2}{2}.$$

Exponentiating gives the pointwise limit $M_{Z_n}(t) \to e^{t^2/2}$, which is the MGF of $\mathcal{N}(0, 1)$. By MGF uniqueness (Expectation, Variance & Moments: Theorem 17) plus the MGF version of Lévy’s continuity theorem, $Z_n \xrightarrow{d} \mathcal{N}(0, 1)$. ◼

Three-panel MGF proof: residual of log M(t/√n) vs t²/(2n); n·log M(t/√n) approaching t²/2 for several distributions; MGF of Zₙ approaching e^{t²/2}
Example 3 Exponential(1): explicit Taylor expansion

For $X \sim \text{Exp}(1)$, $\mu = 1$ and $\sigma = 1$. Work with the centered variable $Y = X - 1$. Then

$$M_Y(t) \;=\; \mathbb{E}[e^{t(X-1)}] \;=\; e^{-t} \cdot \frac{1}{1 - t} \qquad (t < 1).$$

Expand: $e^{-t} = 1 - t + t^2/2 - t^3/6 + O(t^4)$ and $1/(1-t) = 1 + t + t^2 + t^3 + O(t^4)$. Multiply:

$$M_Y(t) \;=\; 1 + 0\cdot t + \tfrac{1}{2} t^2 + \tfrac{1}{3} t^3 + O(t^4).$$

The coefficient of $t^2$ is $1/2$ — consistent with $\sigma^2 = 1$. Substituting $s = t/\sqrt{n}$ and taking logs:

$$n\log M_Y(t/\sqrt{n}) \;=\; \frac{t^2}{2} + \frac{t^3}{3\sqrt{n}} + O(n^{-1}).$$

The cubic term is $O(1/\sqrt{n})$ — it vanishes as $n \to \infty$, giving the $\mathcal{N}(0,1)$ limit as expected. But the cubic coefficient is the reason Exponential converges slower than Uniform (where the cubic term vanishes by symmetry). This is the Berry–Esseen rate effect of §7.
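Because the centered Exponential has a closed-form MGF, the expansion can be checked numerically. A small sketch in Python (numpy assumed):

```python
# Sketch: n * log M_Y(t/sqrt(n)) for the centered Exponential(1), compared
# with the Gaussian limit t^2/2 and the predicted cubic correction t^3/(3*sqrt(n)).
import numpy as np

def log_mgf_centered_exp(t):
    # M_Y(t) = e^{-t} / (1 - t) for t < 1, where Y = X - 1 and X ~ Exp(1)
    return -t - np.log1p(-t)

t = 1.0
for n in (10, 100, 1000, 10000):
    value = n * log_mgf_centered_exp(t / np.sqrt(n))
    limit = t**2 / 2
    cubic = t**3 / (3 * np.sqrt(n))
    print(f"n={n:6d}  n*logM={value:.5f}  t^2/2={limit:.5f}  "
          f"gap={value - limit:.5f}  predicted cubic={cubic:.5f}")
```

As $n$ grows, the gap between $n\log M_Y(t/\sqrt{n})$ and $t^2/2$ shrinks and tracks the $t^3/(3\sqrt{n})$ term ever more closely.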

Remark 3 The Poisson proof is the prototype

Topic 9, Theorem 9.13 proved the Poisson limit theorem by the same five-step recipe: (1) compute the MGF of the standardized sum, (2) take the log, (3) Taylor-expand in powers of $1/n$, (4) show the $n \to \infty$ limit is the target MGF, (5) invoke Lévy continuity and MGF uniqueness. Only step (3) differs between the two proofs: for Poisson, the Taylor expansion gives $\lambda(e^t - 1)$ in the limit; for the CLT, it gives $t^2/2$.


11.5 Proof via Characteristic Functions

MGFs have one flaw: they may not exist. The MGF $M_X(t) = \mathbb{E}[e^{tX}]$ is an integral that can diverge — Cauchy-distributed variables, Pareto, and anything with heavier-than-exponential tails break it. The Fourier-analytic cousin, the characteristic function, is bulletproof: the integrand $e^{itX}$ has modulus one, so the integral always converges. This generality is why the textbook CLT proof is via characteristic functions.

Definition 1 Characteristic function

For a random variable $X$, the characteristic function is

$$\varphi_X(t) \;=\; \mathbb{E}[e^{itX}] \;=\; \mathbb{E}[\cos(tX)] + i\,\mathbb{E}[\sin(tX)], \qquad t \in \mathbb{R},$$

where $i = \sqrt{-1}$. Since $|e^{itX}| = 1$, the expectation exists for every distribution. The characteristic function uniquely determines the distribution: $\varphi_X = \varphi_Y$ implies $X \stackrel{d}{=} Y$.

The CF of $\mathcal{N}(0, 1)$ is $\varphi(t) = e^{-t^2/2}$ (compute by contour integration or Hermite polynomials — standard). The CLT target is therefore $\varphi_{Z_n}(t) \to e^{-t^2/2}$ pointwise, and we need a continuity theorem to lift pointwise CF convergence to convergence in distribution.

Theorem 3 Lévy's continuity theorem (CF version)

Let $X_n, X$ be random variables with characteristic functions $\varphi_n, \varphi$. Then $X_n \xrightarrow{d} X$ if and only if $\varphi_n(t) \to \varphi(t)$ for every $t \in \mathbb{R}$ and $\varphi$ is continuous at $0$.

The CF version is strictly stronger than the MGF version in Topic 9, Remark 2: no moment conditions required, the result always applies. The continuity-at-zero requirement on $\varphi$ is usually automatic — any CF of a proper probability distribution is continuous.

Proof

Take $\mu = 0$, $\sigma = 1$, so $Z_n = (X_1 + \cdots + X_n)/\sqrt{n}$. By independence,

$$\varphi_{Z_n}(t) \;=\; \varphi_X(t/\sqrt{n})^n.$$

Since $\mathbb{E}[X] = 0$ and $\mathbb{E}[X^2] = 1$, Taylor-expand $\varphi_X$ around zero using $\varphi_X'(0) = i\,\mathbb{E}[X] = 0$ and $\varphi_X''(0) = -\mathbb{E}[X^2] = -1$:

$$\varphi_X(s) \;=\; 1 - \tfrac{1}{2} s^2 + o(s^2) \qquad \text{as } s \to 0.$$

This expansion holds whenever $\mathbb{E}[X^2] < \infty$ — no higher moments needed. Substitute $s = t/\sqrt{n}$:

$$\varphi_X(t/\sqrt{n}) \;=\; 1 - \frac{t^2}{2n} + o(1/n).$$

Take logarithms (the principal branch is well-defined once $|t|/\sqrt{n}$ is small, which it is for all fixed $t$ and large enough $n$):

$$\log \varphi_X(t/\sqrt{n}) \;=\; -\frac{t^2}{2n} + o(1/n).$$

Multiply by $n$:

$$n \log \varphi_X(t/\sqrt{n}) \;=\; -\frac{t^2}{2} + o(1) \;\longrightarrow\; -\frac{t^2}{2}.$$

Exponentiate: $\varphi_{Z_n}(t) \to e^{-t^2/2} = \varphi_{\mathcal{N}(0,1)}(t)$, pointwise in $t$. Since $\varphi_{\mathcal{N}(0,1)}$ is continuous at zero, Lévy’s continuity theorem (Theorem 3) gives $Z_n \xrightarrow{d} \mathcal{N}(0, 1)$. ◼

Three-panel CF proof: real and imaginary parts of φ_Zn(t); |φn(t) − e^{−t²/2}| shrinking with n; side-by-side comparison of CF vs MGF convergence
Remark 4 CFs vs MGFs — when to use which

MGFs are real-valued, which makes Taylor expansion concrete and helps with first intuition. CFs are complex-valued but always exist. The MGF proof assumes $M_X(t)$ finite in a neighborhood of zero — a nontrivial restriction (no Cauchy, no power-law tails without moments). The CF proof requires only $\mathbb{E}[X^2] < \infty$. In graduate probability, the CF proof is standard; in a first exposure, the MGF proof is more transparent. We do both.
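For a numerical view of the CF route, one can compare the empirical characteristic function of $Z_n$ with the target $e^{-t^2/2}$. A sketch assuming numpy; the grid of $t$ values and the sample sizes are arbitrary illustrative choices:

```python
# Sketch: empirical characteristic function of Z_n versus the N(0,1)
# target e^{-t^2/2}, for a skewed underlying distribution.
import numpy as np

rng = np.random.default_rng(1)
M, t_grid = 20000, np.linspace(-3, 3, 61)

for n in (5, 30, 200):
    x = rng.exponential(size=(M, n))                 # X_i ~ Exp(1): mu = sigma = 1
    z = np.sqrt(n) * (x.mean(axis=1) - 1.0)          # standardized means
    phi_hat = np.exp(1j * np.outer(t_grid, z)).mean(axis=1)   # empirical CF
    err = np.max(np.abs(phi_hat - np.exp(-t_grid**2 / 2)))
    print(f"n={n:4d}  max_t |phi_hat(t) - exp(-t^2/2)| = {err:.4f}")
```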


11.6 The Lindeberg CLT

The iid assumption is unnecessarily strong. What the CLT actually needs is that no single summand contributes an outsize share of the total variance. Lindeberg (1922) pinned this down with a truncation condition that is both sufficient and — by a theorem of Feller — necessary.

Definition 2 Lindeberg condition

Let $X_1, X_2, \ldots$ be independent (not necessarily identically distributed) with $\mathbb{E}[X_k] = 0$ and $\text{Var}(X_k) = \sigma_k^2 < \infty$. Set $s_n^2 = \sum_{k=1}^n \sigma_k^2$. The Lindeberg condition is

$$L_n(\varepsilon) \;=\; \frac{1}{s_n^2} \sum_{k=1}^n \mathbb{E}\!\left[X_k^2 \cdot \mathbf{1}\{|X_k| > \varepsilon s_n\}\right] \;\longrightarrow\; 0 \qquad \text{as } n \to \infty, \text{ for every } \varepsilon > 0.$$

The condition asks that the fraction of variance carried by summands that are individually large compared with $s_n$ vanishes. It does not ask any $X_k$ to be bounded, only that tails are “not too concentrated in a few terms.”

Theorem 4 Lindeberg CLT

Under the setup of Definition 2, if the Lindeberg condition holds, then

$$\frac{S_n}{s_n} \;=\; \frac{X_1 + \cdots + X_n}{s_n} \;\xrightarrow{d}\; \mathcal{N}(0, 1).$$

Proof

(Outline — the full proof is technical. See Durrett 2019, §3.4 for details.)

Define Gaussian surrogates $Z_1, \ldots, Z_n$ independent of each other and of $X_1, \ldots, X_n$, with $Z_k \sim \mathcal{N}(0, \sigma_k^2)$. Let $T_n = \sum Z_k$, which is exactly $\mathcal{N}(0, s_n^2)$ by independence. The strategy is to show that the CFs of $S_n/s_n$ and $T_n/s_n$ differ by $o(1)$.

Writing $\varphi_X(t) = 1 - \tfrac{1}{2}\sigma^2 t^2 + r(t)$ where $|r(t)| \le t^2 \cdot \min(\sigma^2, t \cdot \mathbb{E}[|X|^3])$ for the Taylor remainder of a mean-zero variable (Feller 1971, Lemma XV.4.1), one bounds the difference $\varphi_{X_k}(t/s_n) - \varphi_{Z_k}(t/s_n)$ by the $k$-th Lindeberg contribution $\mathbb{E}[X_k^2 \mathbf{1}\{|X_k| > \varepsilon s_n\}]/s_n^2$ plus a term of order $\varepsilon \sigma_k^2/s_n^2$.

Summing and using $\prod_k \varphi_{Z_k}(t/s_n) = e^{-t^2/2}$ gives

$$\left|\varphi_{S_n/s_n}(t) - e^{-t^2/2}\right| \;\le\; \tfrac{1}{2} t^2 \varepsilon + t^2 L_n(\varepsilon) + o(1).$$

Letting $n \to \infty$ first (so $L_n(\varepsilon) \to 0$) and then $\varepsilon \downarrow 0$ gives pointwise CF convergence to $e^{-t^2/2}$. Lévy continuity finishes. ◼

The Lindeberg condition is often inconvenient to check because it involves truncated second moments. A sufficient condition using a higher moment — easier in practice — is due to Lyapunov.

Theorem 5 Lyapunov CLT

Under the Lindeberg setup, if there exists $\delta > 0$ such that

$$\frac{1}{s_n^{2+\delta}} \sum_{k=1}^n \mathbb{E}[|X_k|^{2+\delta}] \;\longrightarrow\; 0,$$

then $S_n/s_n \xrightarrow{d} \mathcal{N}(0, 1)$.

Corollary 1 Lyapunov implies Lindeberg

The Lyapunov condition implies the Lindeberg condition.

Proof

On the event $\{|X_k| > \varepsilon s_n\}$, $|X_k|^{2+\delta} = |X_k|^2 \cdot |X_k|^\delta > |X_k|^2 \cdot (\varepsilon s_n)^\delta$. So

$$\mathbb{E}[X_k^2 \mathbf{1}\{|X_k| > \varepsilon s_n\}] \;\le\; \frac{1}{(\varepsilon s_n)^\delta}\, \mathbb{E}[|X_k|^{2+\delta}].$$

Summing over $k$ and dividing by $s_n^2$:

$$L_n(\varepsilon) \;\le\; \frac{1}{\varepsilon^\delta} \cdot \frac{\sum_k \mathbb{E}[|X_k|^{2+\delta}]}{s_n^{2+\delta}} \;\longrightarrow\; 0. \qquad \text{◼}$$

Three-panel Lindeberg: CLT holds for mixed distributions; fails when one variable dominates; truncation fraction L_n(ε) as a diagnostic
Example 4 Dominance breaks the CLT

Take $X_1 \sim \mathcal{N}(0, n^2)$ and $X_2, \ldots, X_n \sim \mathcal{N}(0, 1)$, all independent. Then

$$s_n^2 \;=\; n^2 + (n-1) \;\approx\; n^2, \qquad \frac{\sigma_1^2}{s_n^2} \;\to\; 1.$$

The Lindeberg condition fails because $X_1$ carries essentially all the variance. And sure enough, $S_n/s_n \approx X_1/n \sim \mathcal{N}(0, 1)$ already — but not because of the CLT. It’s simply inheriting the single Gaussian’s distribution. Replace $X_1$ with $X_1 \sim n \cdot t(3)$ (Student $t$ with three degrees of freedom scaled by $n$), still with variance $O(n^2)$, and $S_n/s_n$ will be approximately $t(3)$ (up to scale), not Normal — the CLT truly breaks.
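A simulation sketch of the dominance failure, assuming numpy and scipy; the sample size and replication count are illustrative choices:

```python
# Sketch: when one summand carries almost all the variance, S_n/s_n inherits
# its shape instead of becoming Normal. Here X_1 = n*T with T ~ t(3), so
# S_n/s_n stays close to T/sqrt(3) rather than converging to N(0,1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
M, n = 20000, 50

dominant = n * rng.standard_t(3, size=M)                 # Var = 3 n^2
others = rng.standard_normal(size=(M, n - 1)).sum(axis=1)
s_n = np.sqrt(3 * n**2 + (n - 1))
z = (dominant + others) / s_n

ks_normal = stats.kstest(z, "norm").statistic
ks_scaled_t = stats.kstest(z, lambda x: stats.t.cdf(x, df=3, scale=1 / np.sqrt(3))).statistic
print(f"KS vs N(0,1)       : {ks_normal:.3f}   (stays bounded away from 0 as n grows)")
print(f"KS vs t(3)/sqrt(3) : {ks_scaled_t:.3f}   (small: the t shape survives)")
```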

Remark 5 Necessary, not just sufficient

Feller (1935) proved the converse: if $S_n/s_n \xrightarrow{d} \mathcal{N}(0, 1)$ and $\max_{k \le n} \sigma_k^2 / s_n^2 \to 0$ (a “negligibility” condition), then the Lindeberg condition holds. Together these constitute an if-and-only-if characterization: under negligibility, the Lindeberg condition is the precise mechanism making the CLT work. The Lyapunov condition is merely a more tractable sufficient one — it gives up some generality for ease of verification.


11.7 Berry–Esseen: How Fast Is Convergence?

The CLT is a limit theorem — it tells us where we end up, not how fast we get there. For applied work, the rate matters. A 95% confidence interval is meaningless if at our actual $n$ the standardized mean is only vaguely Normal. Berry (1941) and Esseen (1942) gave the definitive answer: the CLT converges at rate $1/\sqrt{n}$, with a constant driven by the absolute third moment.

Theorem 6 Berry–Esseen

Let $X_1, X_2, \ldots$ be iid with $\mathbb{E}[X_1] = \mu$, $\text{Var}(X_1) = \sigma^2 > 0$, and finite absolute third moment $\mathbb{E}[|X_1 - \mu|^3] < \infty$. Set $\rho = \mathbb{E}[|X_1 - \mu|^3]/\sigma^3$. Let $F_n$ be the CDF of $Z_n = \sqrt{n}(\bar{X}_n - \mu)/\sigma$. Then there is an absolute constant $C$ such that

$$\sup_{x \in \mathbb{R}} |F_n(x) - \Phi(x)| \;\le\; \frac{C\,\rho}{\sqrt{n}} \qquad \text{for every } n \ge 1.$$

The best known bound is $C \le 0.4748$ (Shevtsova, 2011).

Two reads:

  • Rate. The $1/\sqrt{n}$ factor is the headline: double your sample size, cut the worst-case deviation from the Normal approximation by a factor of $\sqrt{2} \approx 1.41$.
  • Constant. $\rho$ is the absolute third moment normalized by $\sigma^3$. For symmetric distributions $\rho = \mathbb{E}[|X - \mu|^3]/\sigma^3$ is modest (about $1.30$ for Uniform); for right-skewed distributions like Exponential(1) ($\rho \approx 2.41$) or Chi²(1) ($\rho \approx 3$), it is larger and the approximation is correspondingly slower. Skewness is the main enemy of normality.
Three-panel Berry–Esseen: sup|Fₙ − Φ| for Uniform vs Exponential; convergence rate 1/√n with Berry–Esseen envelope; skewness ρ as the driver of the constant
Example 5 Uniform vs Exponential at the same n

At $n = 50$ with $C = 0.4748$:

  • Uniform(0, 1): $\rho \approx 1.30$, bound $= 0.4748 \cdot 1.30/\sqrt{50} \approx 0.087$. Empirical sup deviation: $\approx 0.02$.
  • Exponential(1): $\rho \approx 2.41$, bound $= 0.4748 \cdot 2.41/\sqrt{50} \approx 0.162$. Empirical sup deviation: $\approx 0.05$.

The bound is an upper envelope — the empirical deviation is typically several times smaller than the bound. But the relative ordering is preserved: the ratio of Uniform to Exponential sup deviations tracks the ratio of their $\rho$ values. The bound is not tight in general (it overstates the deviation), but it is tight in the worst case — there exist distributions (the Bernoulli family) for which the constant $C$ cannot be improved.
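A sketch that estimates $\rho$ by Monte Carlo and compares the Berry–Esseen bound with the simulated sup-distance (numpy and scipy assumed; the printed values will vary slightly with the seed):

```python
# Sketch: Berry-Esseen bound C*rho/sqrt(n) versus the simulated sup-distance
# sup_x |F_n(x) - Phi(x)|, approximated by the KS statistic over M replications.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
C, M, n = 0.4748, 50000, 50

cases = {
    # name: (sampler, mean, sd)
    "Uniform(0,1)": (rng.uniform, 0.5, np.sqrt(1 / 12)),
    "Exponential(1)": (rng.exponential, 1.0, 1.0),
}
for name, (sampler, mu, sigma) in cases.items():
    x = sampler(size=(M, n))
    rho = np.mean(np.abs(x - mu) ** 3) / sigma**3       # Monte Carlo estimate of rho
    z = np.sqrt(n) * (x.mean(axis=1) - mu) / sigma
    sup_dev = stats.kstest(z, "norm").statistic
    print(f"{name:15s} rho~{rho:.2f}  bound={C * rho / np.sqrt(n):.3f}  "
          f"simulated sup|F_n - Phi|={sup_dev:.3f}")
```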

Remark 6 The Berry–Esseen constant saga

The original bound had $C \le 7.59$ (Esseen, 1942). Over eighty years of refinement, the constant has come down to $C \le 0.4748$ (Shevtsova, 2011). The true value is known to satisfy $C \ge (\sqrt{10} + 3)/(6\sqrt{2\pi}) \approx 0.40973$ by a sharp Bernoulli($p$) example. The gap between $0.40973$ and $0.4748$ is the open problem.


11.8 The Multivariate CLT

Sample means of random vectors converge to the multivariate Normal. The statement is a verbatim translation of the univariate CLT with $\sigma^2$ replaced by the covariance matrix $\Sigma$.

Theorem 7 Multivariate CLT

Let $\mathbf{X}_1, \mathbf{X}_2, \ldots \in \mathbb{R}^d$ be iid with $\mathbb{E}[\mathbf{X}_1] = \boldsymbol{\mu}$ and finite covariance matrix $\Sigma = \text{Cov}(\mathbf{X}_1)$. Let $\bar{\mathbf{X}}_n = (1/n)\sum_{i=1}^n \mathbf{X}_i$. Then

$$\sqrt{n}(\bar{\mathbf{X}}_n - \boldsymbol{\mu}) \;\xrightarrow{d}\; \mathcal{N}_d(\mathbf{0}, \Sigma).$$

The proof reduces to the univariate case by the Cramér–Wold device: $\mathbf{Y}_n \xrightarrow{d} \mathbf{Y}$ in $\mathbb{R}^d$ if and only if for every $\mathbf{a} \in \mathbb{R}^d$, $\mathbf{a}^\top \mathbf{Y}_n \xrightarrow{d} \mathbf{a}^\top \mathbf{Y}$. The linear projection $\mathbf{a}^\top \sqrt{n}(\bar{\mathbf{X}}_n - \boldsymbol{\mu})$ is a scalar iid sum, to which the univariate CLT applies, giving $\mathcal{N}(0, \mathbf{a}^\top \Sigma \mathbf{a})$ — which is exactly $\mathbf{a}^\top \mathcal{N}_d(\mathbf{0}, \Sigma)$.

Three-panel multivariate CLT: bivariate sample means converging to point; standardized means forming 2D Gaussian cloud; Mahalanobis distance distribution approaching χ²(2)
Example 6 Bivariate example

Take $\mathbf{X}_i = (X_i, Y_i)$ with $X_i \sim \text{Exp}(1)$ and $Y_i = X_i^2 - \mathbb{E}[X_i^2]$. Then

$$\boldsymbol{\mu} = (1, 0), \qquad \Sigma = \begin{pmatrix} 1 & 4 \\ 4 & 20 \end{pmatrix}.$$

(The cross-covariance is $\mathbb{E}[(X - 1)(X^2 - 2)] = \mathbb{E}[X^3 - X^2 - 2X + 2] = 6 - 2 - 2 + 2 = 4$.) By the multivariate CLT, for $n = 100$ the sample mean vector is approximately $\mathcal{N}_2(\boldsymbol{\mu}, \Sigma/100)$. The Mahalanobis distance $n(\bar{\mathbf{X}}_n - \boldsymbol{\mu})^\top \Sigma^{-1} (\bar{\mathbf{X}}_n - \boldsymbol{\mu}) \xrightarrow{d} \chi^2(2)$ — the standard recipe for multivariate hypothesis tests.
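A simulation sketch of this bivariate example, assuming numpy and scipy; it checks how close the Mahalanobis distances are to the $\chi^2(2)$ limit at $n = 100$:

```python
# Sketch: Mahalanobis distance of the bivariate sample mean versus chi^2(2),
# for X_i ~ Exp(1) and Y_i = X_i^2 - 2 as in the example above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
M, n = 10000, 100
mu = np.array([1.0, 0.0])
Sigma = np.array([[1.0, 4.0], [4.0, 20.0]])
Sigma_inv = np.linalg.inv(Sigma)

x = rng.exponential(size=(M, n))
vecs = np.stack([x, x**2 - 2.0], axis=-1)                  # shape (M, n, 2)
means = vecs.mean(axis=1)                                  # (M, 2) sample mean vectors
diff = means - mu
d2 = n * np.einsum("mi,ij,mj->m", diff, Sigma_inv, diff)   # Mahalanobis distances

# Compare with the chi^2(2) limit via a KS statistic
print("KS distance to chi^2(2):", stats.kstest(d2, "chi2", args=(2,)).statistic)
```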

Remark 7 Cramér–Wold is the workhorse

The Cramér–Wold device reduces every multivariate convergence problem to a family of univariate problems. It does for convergence in distribution what linearity does for expectation: convert a $d$-dimensional question into a one-dimensional question indexed by the unit sphere. This is why the multivariate Normal is characterized by its one-dimensional projections, why quadratic forms of multivariate Normals are $\chi^2$-distributed, and why multivariate Slutsky works.


11.9 The Delta Method Revisited

Topic 9 proved the delta method as Theorems 9.11–9.12. With the CLT now established, its most useful form — which is how every statistics textbook actually applies it — becomes rigorous.

Theorem 8 Delta method (CLT form)

Let $\hat{\theta}_n$ be a sequence of estimators with $\sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$, and let $g : \mathbb{R} \to \mathbb{R}$ be differentiable at $\theta$ with $g'(\theta) \ne 0$. Then

$$\sqrt{n}\,(g(\hat{\theta}_n) - g(\theta)) \;\xrightarrow{d}\; \mathcal{N}\!\left(0, [g'(\theta)]^2 \sigma^2\right).$$

The statement follows directly from Topic 9, Theorem 9.11, with the CLT supplying the root-$n$ normality hypothesis. The key consequence for applied statistics: any smooth transformation of an asymptotically Normal estimator is itself asymptotically Normal, with variance multiplied by $[g'(\theta)]^2$.

Three-panel delta method: CLT around μ; CLT around g(μ) with transformed variance; asymptotic variance [g'(μ)]²σ²/n matching direct simulation
Example 7 Log-transform confidence interval

Suppose $X_i \sim \text{Exp}(\lambda)$ iid, and we want a 95% CI for the rate parameter $\lambda$. The MLE is $\hat{\lambda}_n = 1/\bar{X}_n$. By the CLT, $\sqrt{n}(\bar{X}_n - 1/\lambda) \xrightarrow{d} \mathcal{N}(0, 1/\lambda^2)$. Apply the delta method with $g(x) = 1/x$, so $g'(x) = -1/x^2$.

The key subtlety: $g'$ must be evaluated at the true mean $\mu = 1/\lambda$, not left as a function of $x$. A common pitfall is to mechanically plug $g'(x) = -1/x^2$ into $[g'(\mu)]^2 \sigma^2$ with the symbol $x$ still floating, producing an expression like $\lambda^{-4} \cdot \lambda^{-2}$ that has no stable interpretation. Evaluate first:

$$g'(1/\lambda) \;=\; -\lambda^2, \qquad [g'(1/\lambda)]^2 \,\sigma^2 \;=\; \lambda^4 \cdot \lambda^{-2} \;=\; \lambda^2.$$

With $g'$ pinned to $\mu$, the delta method gives

$$\sqrt{n}\!\left(\hat{\lambda}_n - \lambda\right) \;\xrightarrow{d}\; \mathcal{N}(0, \lambda^2).$$

The 95% CI is $\hat{\lambda}_n \pm 1.96 \,\hat{\lambda}_n/\sqrt{n}$. For skewed estimators like this, a log transform often gives a better small-$n$ CI: $g(x) = \log(1/x) = -\log x$ has derivative $-1/x$, so $\sqrt{n}(\log \hat{\lambda}_n - \log \lambda) \xrightarrow{d} \mathcal{N}(0, 1)$. Build the CI on the log scale, then exponentiate: $\exp(\log \hat{\lambda}_n \pm 1.96/\sqrt{n})$. This symmetric-on-log-scale CI respects the non-negativity of $\lambda$ (unlike the naive linear CI, which can be negative for small $n$).
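A quick sketch that checks the delta-method variance $\lambda^2$ against simulation (numpy and scipy assumed; $\lambda = 2$ and $n = 200$ are illustrative choices):

```python
# Sketch: checking the delta-method prediction for lambda_hat = 1/X_bar with
# X_i ~ Exp(lambda). Theory: sqrt(n)*(lambda_hat - lambda) ~ N(0, lambda^2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
lam, n, M = 2.0, 200, 20000

x = rng.exponential(scale=1 / lam, size=(M, n))    # numpy uses scale = 1/rate
lam_hat = 1.0 / x.mean(axis=1)                     # MLE of the rate
z = np.sqrt(n) * (lam_hat - lam)

print(f"theoretical variance lambda^2    : {lam**2:.3f}")
print(f"simulated variance of z          : {z.var():.3f}")
print(f"KS distance of z/lambda to N(0,1): {stats.kstest(z / lam, 'norm').statistic:.3f}")
```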

Remark 8 When the delta method fails

If $g'(\theta) = 0$, the first-order delta method gives a degenerate $\mathcal{N}(0, 0)$ limit — which is useless. The fix is a second-order delta method: if $g''(\theta) \ne 0$, then $n(g(\hat{\theta}_n) - g(\theta)) \xrightarrow{d} \tfrac{1}{2} g''(\theta)\, \sigma^2\, \chi^2(1)$. Now the convergence rate is $1/n$, not $1/\sqrt{n}$, and the limit is a (shifted, scaled) chi-squared, not a Normal. This happens, for example, with $g(x) = x^2$ at $\theta = 0$: the sample mean is centered at zero, so $(\bar{X}_n)^2$ has a non-Normal limit.


11.10 Connections to Machine Learning

Three-panel ML: confidence interval coverage simulation; SGD mini-batch gradient Gaussian noise; Bayesian posterior shrinking at 1/√n
Example 8 Three places the CLT shows up in ML

1. Confidence intervals for model accuracy. After evaluating a classifier on $n$ test examples, the error rate $\hat{p} = (1/n)\sum \mathbf{1}\{\hat{y}_i \ne y_i\}$ is a Binomial average. By de Moivre–Laplace, $\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n}$ is an approximate 95% CI for the true error rate. Every reported accuracy on a test set has Binomial error bars whether the reporter acknowledges them or not — and those bars can be as wide as $\approx 0.03$ on a 1000-example test set (for error rates near $0.5$), growing to $\approx 0.05$ at $n = 400$.

2. SGD mini-batch gradient noise. The stochastic gradient $\hat{g}_B = (1/B)\sum_{i \in B} \nabla \ell(\theta; x_i)$ is an average of $B$ iid gradient samples (assuming the batch is drawn uniformly at random). By the multivariate CLT, $\hat{g}_B - \nabla L(\theta) \approx \mathcal{N}(\mathbf{0}, \Sigma/B)$ where $\Sigma$ is the per-sample gradient covariance; a numerical sketch follows this list. The optimization noise is Gaussian to leading order — this is why SGD trajectories look like Brownian motion on an energy landscape, and why the learning-rate schedule $\eta \propto 1/\sqrt{t}$ is the canonical rate. formalML: Stochastic Gradient Descent has the full story.

3. Bayesian posteriors (Bernstein–von Mises). The Bernstein–von Mises theorem is the CLT for posteriors: under regularity, $\sqrt{n}(\theta - \theta_0) \mid X_1, \ldots, X_n \xrightarrow{d} \mathcal{N}(0, I(\theta_0)^{-1})$ where $I(\theta_0)$ is the Fisher information. The posterior concentrates around the MLE at rate $1/\sqrt{n}$ with a Gaussian shape — regardless of the prior. This is why Bayesian credible intervals and frequentist confidence intervals agree asymptotically, and why casual users can get away with ignoring prior choice on large datasets. See formalML: Bayesian Neural Networks for where this machinery is indispensable (and where it breaks — small data, non-identifiable models).
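A minimal sketch of point 2, checking that mini-batch gradient noise matches the $\mathcal{N}(\mathbf{0}, \Sigma/B)$ prediction on a toy squared loss; the synthetic data, linear model, and batch size are illustrative assumptions, not anyone's actual training setup:

```python
# Sketch: mini-batch gradient noise for a toy squared loss, checking the
# multivariate-CLT prediction  g_hat_B - full_grad ~ N(0, Sigma/B).
import numpy as np

rng = np.random.default_rng(6)
N, d, B, M = 50000, 5, 64, 4000

X = rng.standard_normal((N, d))
y = X @ np.ones(d) + rng.standard_normal(N)
theta = np.zeros(d)                                    # evaluate gradients at theta = 0

per_sample = (X @ theta - y)[:, None] * X              # per-sample gradients of 0.5*(x'theta - y)^2
full_grad = per_sample.mean(axis=0)
Sigma = np.cov(per_sample, rowvar=False)               # per-sample gradient covariance

# Draw M mini-batches (with replacement, for simplicity) and scale the noise by sqrt(B)
idx = rng.integers(0, N, size=(M, B))
noise = np.sqrt(B) * (per_sample[idx].mean(axis=1) - full_grad)

print("per-sample covariance Sigma, first row :", np.round(Sigma[0], 2))
print("covariance of sqrt(B)*noise, first row :", np.round(np.cov(noise, rowvar=False)[0], 2))
```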

Remark 9 Bernstein–von Mises — the Bayesian CLT

The Bernstein–von Mises theorem formalizes a result practitioners use implicitly every day: on large-enough data, frequentist and Bayesian inference coincide. The proof (Van der Vaart 1998, Chapter 10) uses the CLT via the score function — the derivative of the log-likelihood at the MLE is asymptotically Normal by the CLT, and Laplace’s method extracts a Gaussian posterior from the likelihood. The rate $1/\sqrt{n}$ is inherited directly from the CLT rate. When Bernstein–von Mises fails — heavy-tailed priors, non-identifiable models, high-dimensional parameters with $d \sim n$ — it is precisely because one of the CLT’s hypotheses fails for the score.
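A small sketch of Bernstein–von Mises in a case where everything is available in closed form: the Bernoulli model with a conjugate Beta(1, 1) prior (scipy assumed; the prior and sample sizes are illustrative choices):

```python
# Sketch: Bernstein-von Mises for Bernoulli(p) data with a Beta(1,1) prior.
# The exact posterior Beta(1+k, 1+n-k) is compared with the Gaussian
# approximation N(p_hat, p_hat*(1-p_hat)/n) predicted by the theorem.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
p_true = 0.3

for n in (20, 200, 2000):
    k = rng.binomial(n, p_true)
    p_hat = k / n
    posterior = stats.beta(1 + k, 1 + n - k)
    gaussian = stats.norm(p_hat, np.sqrt(p_hat * (1 - p_hat) / n))
    grid = np.linspace(1e-4, 1 - 1e-4, 2001)
    sup_cdf_gap = np.max(np.abs(posterior.cdf(grid) - gaussian.cdf(grid)))
    print(f"n={n:5d}  p_hat={p_hat:.3f}  sup|posterior CDF - Gaussian CDF| = {sup_cdf_gap:.4f}")
```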


11.11 Summary

Two centuries of probability, compressed:

| Theorem | Year | Says | Requires |
| --- | --- | --- | --- |
| De Moivre–Laplace | 1733 | Binomial → Normal | Bernoulli sum |
| Lindeberg–Lévy | 1922 | $\sqrt{n}(\bar{X}_n - \mu)/\sigma \xrightarrow{d} \mathcal{N}(0,1)$ | iid, $\sigma^2 < \infty$ |
| Lindeberg | 1922 | Same for non-identical summands | Lindeberg condition |
| Lyapunov | 1901 | Sufficient version of Lindeberg | $\mathbb{E}[|X_k|^{2+\delta}]$ bound |
| Lévy CF continuity | 1925 | CF convergence ⟺ convergence in distribution | CF continuous at $0$ |
| Berry–Esseen | 1941 | $\sup|F_n - \Phi| \le C\rho/\sqrt{n}$ | Finite third moment |
| Multivariate CLT | — | $\sqrt{n}(\bar{\mathbf{X}}_n - \boldsymbol{\mu}) \xrightarrow{d} \mathcal{N}_d(\mathbf{0}, \Sigma)$ | iid vectors, $\Sigma$ finite |
| Delta method | — | Nonlinear transforms inherit CLT normality | $g$ differentiable, $g'(\theta) \ne 0$ |
Convergence speed comparison across distributions — skewness-driven ordering of the Berry–Esseen rate

Experiment with the CLT for any of nine underlying distributions below. Toggle between distributions to see the Berry–Esseen rate in action; toggle between $n$ values to watch the QQ plot go from curved (non-Normal) to straight (Normal).


What comes next. Large Deviations & Tail Bounds complements the CLT: where the CLT gives the shape of fluctuations at scale $1/\sqrt{n}$, large-deviations theory gives exponential rates at all scales, including the tails where the CLT is weakest. Together, they give a complete picture of how sample means behave. Beyond Track 3, the CLT is the foundation for confidence intervals, hypothesis testing, the bootstrap, and every asymptotic result in the statistical-inference tracks.


References

  1. Durrett, R. (2019). Probability: Theory and Examples (5th ed.). Cambridge University Press.
  2. Billingsley, P. (2012). Probability and Measure (Anniversary ed.). Wiley.
  3. Casella, G., & Berger, R. L. (2002). Statistical Inference (2nd ed.). Cengage.
  4. Wasserman, L. (2004). All of Statistics. Springer.
  5. Feller, W. (1971). An Introduction to Probability Theory and Its Applications, Vol. 2 (2nd ed.). Wiley.
  6. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.