intermediate · 50 min read · April 16, 2026

Large Deviations & Tail Bounds

From Chebyshev to Hoeffding to sub-Gaussian — exponential-rate concentration and the finite-sample tools that power statistical learning theory.

12.1 Why Chebyshev Isn’t Enough

The Law of Large Numbers tells us that $\bar{X}_n \to \mu$, and Chebyshev’s inequality gives us a rate: $P(|\bar{X}_n - \mu| \geq \varepsilon) \leq \sigma^2/(n\varepsilon^2)$. That rate is $O(1/n)$ — polynomial, not exponential. For an ML practitioner designing a model that has to perform within $\varepsilon = 0.01$ of its expected risk with failure probability $\delta = 0.05$, Chebyshev says you need roughly $\sigma^2/(\varepsilon^2\delta) \approx 50{,}000$ samples. For a bounded random variable on $[0, 1]$ with $\sigma^2 \leq 1/4$, the true answer (we’ll see) is closer to $15{,}000$. Chebyshev is off by more than a factor of three — and it gets worse as $\delta$ shrinks, because $\delta$ enters polynomially rather than logarithmically.

The gap matters. A factor of three in sample complexity is the difference between running an experiment overnight and running it for a week. A factor of twenty (at $\delta = 10^{-3}$) is the difference between running it at all and giving up. The fundamental tension is this: Chebyshev uses only the variance, and the variance is the coarsest summary of a distribution that has any connection to its tails. The true tail probability depends on the whole shape of the distribution — and for many distributions that shape is much better than what the variance suggests.

Concentration inequalities are the tools that exploit this gap. They trade additional structural assumptions (boundedness, sub-Gaussian behavior, a known variance bound) for exponentially tighter rates. The payoff is not incremental: it transforms the rate from $O(1/n)$ to $O(e^{-cn\varepsilon^2})$, which at $n = 100$, $\varepsilon = 0.1$ is $e^{-2} \approx 0.14$ versus Chebyshev’s $\sigma^2/(n\varepsilon^2) = 0.25/1 = 0.25$ (for Bernoulli). At $n = 1000$, it’s $e^{-20} \approx 2 \times 10^{-9}$ versus $0.025$.
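
To see the gap numerically, here is a minimal sketch (standard library only; the helper names are ours) that evaluates both bounds for a mean of $[0, 1]$-valued samples at $\varepsilon = 0.1$:

```python
import math

def chebyshev_bound(n, eps, var=0.25):
    """Chebyshev tail bound for a sample mean with per-sample variance <= var."""
    return min(1.0, var / (n * eps**2))

def hoeffding_bound(n, eps, a=0.0, b=1.0):
    """Two-sided Hoeffding bound for the mean of [a, b]-valued samples."""
    return min(1.0, 2 * math.exp(-2 * n * eps**2 / (b - a)**2))

for n in (100, 1000, 10000):
    print(f"n={n:>5}  Chebyshev={chebyshev_bound(n, 0.1):.3g}  "
          f"Hoeffding={hoeffding_bound(n, 0.1):.3g}")
```

At $n = 1000$ the Hoeffding column already reads about $4 \times 10^{-9}$ (the two-sided version of the $e^{-20}$ above), while the Chebyshev column still reads $0.025$.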

Three-panel overview: (left) Chebyshev σ²/(nε²) vs Hoeffding 2e^(−2nε²) showing polynomial vs exponential decay; (center) sample-size requirements — how many samples for P<0.05 under each bound; (right) CLT shape at 1/√n vs concentration rate at e^(−cn) — the complementary-theorems story

The structural thesis of this topic. Where the CLT (Topic 11) describes the shape of the distribution of $\bar{X}_n$ at scale $1/\sqrt{n}$ — it’s Gaussian — concentration inequalities describe the rate at which the tails decay, at all scales. These are complementary: the CLT tells you “for $n = 100$, the histogram of replications looks roughly Gaussian with spread $\sigma/\sqrt{n}$”; concentration tells you “the probability of a deviation of size $\varepsilon$ is at most $2e^{-cn\varepsilon^2}$, and this is non-asymptotic — it holds for every $n$, not just in the limit.” Together, they give a complete picture of how sample means behave.

The rest of this topic develops four families of exponential-rate bounds, in increasing order of specificity: Chernoff (generic, MGF-based), Hoeffding (bounded), Bernstein (bounded + known variance), and sub-Gaussian (modern parametrization). Each uses more information about the distribution than the last, and each gives a correspondingly tighter bound. Then we step up to the Cramér rate function — the optimal exponent, captured as a Legendre transform — and McDiarmid’s inequality for functions of independent variables. The topic closes with PAC learning, Johnson–Lindenstrauss, and the other ML applications that made these tools indispensable.

12.2 The Chernoff Bounding Technique

The engine behind every exponential tail bound is a single idea: Markov’s inequality applied to the exponential of the random variable. We take a positive function of $X$ whose expectation we can control — namely $e^{tX}$ — and let Markov do the rest. Optimizing over the free parameter $t$ gives the tightest bound in the family.

Three-panel Chernoff technique: (left) upper bound e^(−ta)·M(t) as a function of t for Bernoulli(0.3) with optimal t* marked; (center) Chernoff bound vs true tail probability for the Binomial, showing tightness; (right) geometric interpretation — the exponential tilt e^(tx) reweights the distribution
Theorem 1 Chernoff bound (generic)

Let $X$ be a random variable whose moment-generating function $M_X(t) = \mathbb{E}[e^{tX}]$ is finite in a neighborhood of $0$. Then for any $a \in \mathbb{R}$:

$$P(X \geq a) \;\leq\; \inf_{t > 0} \, e^{-ta} \, M_X(t).$$

For sums of independent random variables $S_n = X_1 + \cdots + X_n$ with iid summands, this becomes

$$P(S_n \geq na) \;\leq\; \inf_{t > 0} \, e^{-tna} \, M_X(t)^n.$$
Proof.

The core step uses Markov’s inequality applied to the non-negative random variable $e^{tX}$. For any $t > 0$, the event $\{X \geq a\}$ is identical to the event $\{e^{tX} \geq e^{ta}\}$ because the exponential is strictly increasing. So

$$P(X \geq a) \;=\; P(e^{tX} \geq e^{ta}).$$

Markov’s inequality applied to $e^{tX}$ with threshold $e^{ta}$ gives

$$P(e^{tX} \geq e^{ta}) \;\leq\; \frac{\mathbb{E}[e^{tX}]}{e^{ta}} \;=\; e^{-ta} \, M_X(t).$$

This holds for every $t > 0$, so we may take the infimum over $t$:

$$P(X \geq a) \;\leq\; \inf_{t > 0} \, e^{-ta} \, M_X(t).$$

For the iid sum: by independence, the MGF factorizes, $M_{S_n}(t) = M_X(t)^n$. Apply the single-variable result to $S_n$ with threshold $na$ and factor the MGF. $\blacksquare$

The Chernoff bound is not one bound — it is a family of bounds parametrized by $t$. The art is choosing $t$ to make the bound as tight as possible, which is a convex optimization problem: the log of the objective, $\log(e^{-ta} M_X(t)) = -ta + \log M_X(t)$, is convex in $t$ (the cumulant-generating function $\Lambda(t) = \log M_X(t)$ is convex, as follows from Hölder’s inequality applied to the MGF). The minimum is interior whenever the first-order condition $\Lambda'(t^*) = a$ has a solution in the MGF’s domain. This $t^*$ is called the tilting parameter: it defines a new probability measure under which $X$ has mean $a$ instead of $\mu$, making the event $\{X \geq a\}$ typical rather than rare.
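
The optimization is easy to carry out numerically. A minimal sketch for the Bernoulli case (standard library only; function names are ours), using the closed-form $t^*$ derived in Example 1 below:

```python
import math

def chernoff_exponent(t, a, p):
    """Per-sample log-bound -t*a + log M(t) for a single Bernoulli(p)."""
    return -t * a + math.log(1 - p + p * math.exp(t))

def t_star(a, p):
    """Minimizer from the first-order condition Lambda'(t) = a (needs p < a < 1)."""
    return math.log(a * (1 - p) / (p * (1 - a)))

p, a = 0.3, 0.45                       # threshold a = p(1 + delta) with delta = 0.5
t = t_star(a, p)                       # ~0.647
print(t, chernoff_exponent(t, a, p))   # exponent ~ -0.0498; bound is exp(n * exponent)

# A grid scan over t confirms the objective is convex with its minimum near t*:
grid = [k / 100 for k in range(1, 200)]
print(min(grid, key=lambda s: chernoff_exponent(s, a, p)))   # ~0.65
```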

Interactive: Chernoff optimization explorer. Optimal $t^* = 0.442$; Chernoff bound at $t^*$ is $3.233 \times 10^{-1}$. The tilted distribution shifts mass toward the threshold $\mu + \varepsilon$, making the rare event typical.
Example 1 Chernoff bound for coin flips

Let $X_1, \ldots, X_n$ be iid Bernoulli$(p)$ and $S_n = \sum X_i$. The MGF of a single Bernoulli is $M(t) = 1 - p + pe^t$, so

$$e^{-tnp(1+\delta)} M(t)^n \;=\; \exp\!\left(-tnp(1+\delta) + n\log(1 - p + pe^t)\right),$$

where we’ve written the threshold as $np(1+\delta)$ for $\delta > 0$. The first-order condition $\Lambda'(t) = p(1+\delta)$ solves explicitly:

$$e^{t^*} \;=\; \frac{(1+\delta)(1-p)}{1 - p(1+\delta)}.$$

Substituting back gives the optimized bound $P(S_n \geq np(1+\delta)) \leq e^{-n\,D_{\text{KL}}(p(1+\delta)\,\|\,p)}$, which is the rate-function exponent of Remark 1 (computed explicitly in Example 12.6). Weakening the MGF via $1 - p + pe^t \leq e^{p(e^t - 1)}$ and choosing $t = \log(1+\delta)$ yields the familiar multiplicative form

$$P(S_n \geq np(1+\delta)) \;\leq\; \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{np}.$$

For $p = 0.3$, $n = 100$, $\delta = 0.5$ (so threshold $= 45$), the optimized bound is about $0.0069$ and the multiplicative form about $0.039$. The exact probability from the Binomial CDF is about $0.0011$ — the optimized Chernoff bound is loose by about a factor of six, while Chebyshev for the same event gives a bound of $\sigma^2/(n(p\delta)^2) = (0.21/100)/(0.15)^2 \approx 0.093$, almost $14\times$ worse.
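
A quick check of these numbers (standard library only; names are ours): the optimized bound, the multiplicative form, and the exact tail via `math.comb`.

```python
import math

n, p, delta = 100, 0.3, 0.5
a = p * (1 + delta)                      # deviated mean 0.45; threshold na = 45

kl = a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))
optimized = math.exp(-n * kl)                                            # ~0.0069
multiplicative = (math.exp(delta) / (1 + delta)**(1 + delta))**(n * p)   # ~0.039
exact = sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
            for k in range(int(n * a), n + 1))                           # ~0.0011

print(f"optimized={optimized:.4f}  multiplicative={multiplicative:.4f}  "
      f"exact={exact:.4f}")
```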

Remark 1 The optimal t and the rate function

The Chernoff bound is tight in a precise asymptotic sense. Substituting the optimal $t^*$ gives the exponent

$$n \cdot \big(t^* a - \Lambda(t^*)\big) \;=\; n \cdot I(a),$$

where $I(a) = \sup_t (ta - \Lambda(t))$ is the Cramér rate function. So the Chernoff bound for the iid sum is

$$P(\bar{X}_n \geq a) \;\leq\; e^{-n I(a)},$$

and Cramér’s theorem (§12.6) says the bound is actually achieved at the exponential rate: $\lim_{n \to \infty} \frac{1}{n} \log P(\bar{X}_n \geq a) = -I(a)$. The Chernoff technique is the optimal exponential bound — nothing better is possible using the MGF alone.

12.3 Hoeffding’s Inequality

For bounded random variables we can control the MGF without computing it exactly. This is the content of Hoeffding’s lemma — a statement about the MGF of a bounded, centered random variable — and it’s the key step that makes the Chernoff technique yield a clean, distribution-free bound. Combining Hoeffding’s lemma with the Chernoff technique and independence gives Hoeffding’s inequality, which is the workhorse of statistical learning theory.

Three-panel Hoeffding proof: (left) the MGF of a bounded centered variable vs the Gaussian MGF envelope e^(t²(b−a)²/8); (center) Hoeffding bound for [0,1] Bernoulli at several p values; (right) Chebyshev vs Hoeffding comparison extending the convergence-rates figure from Topic 10
Theorem 2 Hoeffding's lemma

Let $X$ be a random variable with $a \leq X \leq b$ almost surely and $\mathbb{E}[X] = 0$. Then for every $t \in \mathbb{R}$:

$$\mathbb{E}[e^{tX}] \;\leq\; \exp\!\left(\frac{t^2 (b - a)^2}{8}\right).$$
Proof.

The key observation is that convexity of $e^{tx}$ lets us bound $e^{tX}$ above by its chord on $[a, b]$. For any $x \in [a, b]$, write

$$x \;=\; \frac{b - x}{b - a} \cdot a \;+\; \frac{x - a}{b - a} \cdot b,$$

a convex combination of the endpoints with weights summing to $1$. By convexity of the exponential,

$$e^{tx} \;\leq\; \frac{b - x}{b - a} \, e^{ta} \;+\; \frac{x - a}{b - a} \, e^{tb}.$$

Taking expectations, using linearity and $\mathbb{E}[X] = 0$:

$$\mathbb{E}[e^{tX}] \;\leq\; \frac{b}{b - a} \, e^{ta} \;-\; \frac{a}{b - a} \, e^{tb}.$$

It remains to bound this right-hand side by a Gaussian-style quantity. Define

$$\varphi(u) \;=\; \log\!\left(\frac{b \, e^{ua} - a \, e^{ub}}{b - a}\right),$$

writing $u$ in place of $t$ for bookkeeping. We will show $\varphi(u) \leq u^2 (b-a)^2/8$ for all $u \in \mathbb{R}$. Direct computation gives $\varphi(0) = 0$ and $\varphi'(0) = 0$ (check: $\varphi'(u) = (ab \, e^{ua} - ab \, e^{ub})/(b e^{ua} - a e^{ub})$, which vanishes at $u = 0$), and — after a short algebraic manipulation letting $p = -a/(b-a)$ and $q = 1 - p = b/(b-a)$:

$$\varphi''(u) \;=\; \frac{pq(b-a)^2 \, e^{u(a+b)}}{(q \, e^{ua} + p \, e^{ub})^2}.$$

To bound $\varphi''$, apply the arithmetic–geometric mean inequality to the denominator: $q e^{ua} + p e^{ub} \geq 2\sqrt{pq} \, e^{u(a+b)/2}$. Squaring:

$$(q e^{ua} + p e^{ub})^2 \;\geq\; 4pq \, e^{u(a+b)}.$$

Therefore

$$\varphi''(u) \;\leq\; \frac{pq(b-a)^2 e^{u(a+b)}}{4pq \, e^{u(a+b)}} \;=\; \frac{(b-a)^2}{4}.$$

Now use Taylor’s theorem with the Lagrange remainder at $u = 0$: for some $\xi$ between $0$ and $u$,

$$\varphi(u) \;=\; \varphi(0) + u \varphi'(0) + \frac{u^2}{2} \varphi''(\xi) \;\leq\; 0 + 0 + \frac{u^2}{2} \cdot \frac{(b-a)^2}{4} \;=\; \frac{u^2 (b-a)^2}{8}.$$

Exponentiating and reading off $u = t$ gives

$$\mathbb{E}[e^{tX}] \;\leq\; e^{\varphi(t)} \;\leq\; \exp\!\left(\frac{t^2 (b-a)^2}{8}\right). \quad \blacksquare$$

Hoeffding’s lemma says a centered bounded random variable has a sub-Gaussian MGF — its MGF is at most the MGF of a Gaussian with variance proxy $(b-a)^2/4$. This is the key fact. Every subsequent proof in this section just plugs this MGF bound into the Chernoff technique.

Theorem 3 Hoeffding's inequality

Let $X_1, \ldots, X_n$ be independent with $X_i \in [a_i, b_i]$ almost surely. Write $\mu_i = \mathbb{E}[X_i]$ and $\mu = \frac{1}{n}\sum_i \mu_i$. Then for every $\varepsilon > 0$:

$$P\!\left(\bar{X}_n - \mu \geq \varepsilon\right) \;\leq\; \exp\!\left(-\frac{2 n^2 \varepsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\right),$$

and by symmetry

$$P\!\left(\left|\bar{X}_n - \mu\right| \geq \varepsilon\right) \;\leq\; 2 \exp\!\left(-\frac{2 n^2 \varepsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$

In the iid case with $X_i \in [a, b]$ for all $i$, this simplifies to the form previewed in Theorem 10.7:

$$P\!\left(\left|\bar{X}_n - \mu\right| \geq \varepsilon\right) \;\leq\; 2 \exp\!\left(-\frac{2n\varepsilon^2}{(b - a)^2}\right).$$
Proof.

We bound the upper tail $P(\bar{X}_n - \mu \geq \varepsilon)$; the lower tail follows by symmetry applied to $-X_i$, and the two-sided bound by the union bound.

Let $Y_i = X_i - \mu_i$, so $Y_i \in [a_i - \mu_i, b_i - \mu_i]$ and $\mathbb{E}[Y_i] = 0$. The centered range is the same: $(b_i - \mu_i) - (a_i - \mu_i) = b_i - a_i$. The sum $\sum_i Y_i = n(\bar{X}_n - \mu)$, so

$$P(\bar{X}_n - \mu \geq \varepsilon) \;=\; P\!\left(\sum_{i=1}^n Y_i \geq n\varepsilon\right).$$

Apply the Chernoff technique (Theorem 12.1) to $\sum Y_i$. For any $t > 0$:

$$P\!\left(\sum_i Y_i \geq n\varepsilon\right) \;\leq\; e^{-tn\varepsilon} \, \mathbb{E}\!\left[\exp\!\left(t \sum_i Y_i\right)\right].$$

By independence, the joint MGF factorizes:

$$\mathbb{E}\!\left[e^{t \sum_i Y_i}\right] \;=\; \prod_{i=1}^n \mathbb{E}[e^{t Y_i}].$$

Applying Hoeffding’s lemma (Theorem 12.2) to each $Y_i$:

$$\mathbb{E}[e^{t Y_i}] \;\leq\; \exp\!\left(\frac{t^2 (b_i - a_i)^2}{8}\right),$$

so the product is

$$\prod_{i=1}^n \mathbb{E}[e^{t Y_i}] \;\leq\; \exp\!\left(\frac{t^2}{8} \sum_{i=1}^n (b_i - a_i)^2\right).$$

Combining:

$$P\!\left(\sum_i Y_i \geq n\varepsilon\right) \;\leq\; \exp\!\left(-tn\varepsilon + \frac{t^2}{8} \sum_{i=1}^n (b_i - a_i)^2\right).$$

This is a quadratic in $t$ that opens upward. Minimizing over $t > 0$ by setting the derivative to zero:

$$-n\varepsilon + \frac{t}{4} \sum_{i=1}^n (b_i - a_i)^2 \;=\; 0 \quad \Longrightarrow \quad t^* \;=\; \frac{4 n \varepsilon}{\sum_{i=1}^n (b_i - a_i)^2}.$$

Substituting $t^*$ back into the exponent:

$$-t^* n \varepsilon + \frac{(t^*)^2}{8} \sum_i (b_i - a_i)^2 \;=\; -\frac{2 n^2 \varepsilon^2}{\sum_i (b_i - a_i)^2}.$$

Therefore

$$P(\bar{X}_n - \mu \geq \varepsilon) \;\leq\; \exp\!\left(-\frac{2 n^2 \varepsilon^2}{\sum_i (b_i - a_i)^2}\right).$$

Applying the same argument to $-Y_i$ (which has the same centered range) yields the lower tail, and adding the two bounds gives the two-sided statement. In the iid case, $\sum_i (b_i - a_i)^2 = n(b-a)^2$, so the exponent becomes $-2n\varepsilon^2/(b-a)^2$ as claimed. $\blacksquare$

This completes the proof deferred in Theorem 10.7. The bound is remarkably clean: no variance appears, only the range. That’s both Hoeffding’s strength (you don’t need to know the variance) and its weakness (you can’t benefit from knowing the variance is small).

Example 2 Hoeffding and survey sampling

You want to estimate the fraction of voters supporting candidate A to within $\varepsilon = 0.03$ with confidence $1 - \delta = 0.95$. Each response is a Bernoulli$(p)$ with unknown $p \in [0, 1]$. Hoeffding says:

$$n \;\geq\; \frac{(b - a)^2 \ln(2/\delta)}{2\varepsilon^2} \;=\; \frac{1 \cdot \ln(40)}{2 \cdot 0.0009} \;\approx\; 2{,}050.$$

That matches the “margin of error $\pm 3\%$ with 95% confidence, $n \approx 2000$” rule of thumb reported by polling organizations. The beauty of the Hoeffding derivation is that it’s distribution-free: it works for any $p$, doesn’t require estimating variance, and doesn’t rely on a normal approximation. Compare with Chebyshev, which requires $n \geq \sigma^2/(\varepsilon^2\delta)$; with the worst-case $\sigma^2 = 1/4$ this is $0.25/(0.0009 \cdot 0.05) \approx 5{,}556$ — the roughly factor-of-three gap we advertised in §12.1.
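
The two sample-size formulas invert directly into code (a minimal sketch; names are ours):

```python
import math

def hoeffding_n(eps, delta, a=0.0, b=1.0):
    """Smallest n with 2 exp(-2 n eps^2 / (b - a)^2) <= delta."""
    return math.ceil((b - a)**2 * math.log(2 / delta) / (2 * eps**2))

def chebyshev_n(eps, delta, var=0.25):
    """Smallest n with var / (n eps^2) <= delta."""
    return math.ceil(var / (eps**2 * delta))

print(hoeffding_n(0.03, 0.05))   # 2050
print(chebyshev_n(0.03, 0.05))   # 5556
```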

Remark 2 Hoeffding is distribution-free, and that's the trade

Hoeffding’s inequality uses only the range $[a, b]$, not the variance. For a Bernoulli$(p)$ with $p$ small, this is wasteful: the variance is $p(1-p)$, which can be much smaller than $(b-a)^2/4 = 1/4$. Hoeffding doesn’t care — it treats the worst-case range as though it always applied. This is sometimes exactly what you want (in PAC-learning scenarios where the distribution is unknown) and sometimes a real loss. Bernstein’s inequality (§12.4) fixes this by letting the bound shrink when the variance shrinks.

12.4 Bernstein’s Inequality

Hoeffding wastes information when the variance is small. For a rare event like Bernoulli$(0.01)$, the variance is $0.0099$ while the worst-case range-squared is $1$ — two orders of magnitude. Bernstein’s inequality recovers this loss by using both the variance and the range, and it produces a two-regime bound: sub-Gaussian behavior for small $\varepsilon$ (dominated by variance) and sub-exponential behavior for large $\varepsilon$ (dominated by the range).

Three-panel Bernstein comparison: (left) Bernstein vs Hoeffding for Bernoulli(0.01) — low variance makes Bernstein much tighter; (center) two-regime behavior — sub-Gaussian for small ε, sub-exponential for large ε; (right) Bernstein condition region where Bernstein beats Hoeffding as a function of σ²/(b−a)²
Theorem 4 Bernstein's inequality

Let $X_1, \ldots, X_n$ be independent with $\mathbb{E}[X_i] = \mu$, $\text{Var}(X_i) = \sigma^2$, and $|X_i - \mu| \leq M$ almost surely. Then for every $\varepsilon > 0$:

$$P\!\left(\left|\bar{X}_n - \mu\right| \geq \varepsilon\right) \;\leq\; 2 \exp\!\left(-\frac{n \varepsilon^2}{2\sigma^2 + \tfrac{2 M \varepsilon}{3}}\right).$$
Proof.

As in Hoeffding, center the variables: let $Y_i = X_i - \mu$ so $|Y_i| \leq M$ and $\mathbb{E}[Y_i] = 0$, $\mathbb{E}[Y_i^2] = \sigma^2$. We bound the upper tail via the Chernoff technique and treat the lower tail symmetrically.

The key step is a sharper MGF bound than Hoeffding’s lemma, exploiting the variance. Expand $e^{tY_i}$ as a power series:

$$e^{tY_i} \;=\; 1 + tY_i + \sum_{k=2}^{\infty} \frac{t^k Y_i^k}{k!}.$$

Taking expectations (since $\mathbb{E}[Y_i] = 0$):

$$\mathbb{E}[e^{tY_i}] \;=\; 1 + \sum_{k=2}^{\infty} \frac{t^k \mathbb{E}[Y_i^k]}{k!}.$$

For $k \geq 2$, the boundedness $|Y_i| \leq M$ gives $|Y_i^k| \leq M^{k-2} Y_i^2$, so $|\mathbb{E}[Y_i^k]| \leq M^{k-2} \sigma^2$. Therefore

$$\mathbb{E}[e^{tY_i}] \;\leq\; 1 + \sigma^2 \sum_{k=2}^{\infty} \frac{t^k M^{k-2}}{k!} \;=\; 1 + \frac{\sigma^2}{M^2}\left(e^{tM} - 1 - tM\right).$$

For $0 < tM < 3$, the elementary inequality $e^{tM} - 1 - tM \leq \frac{(tM)^2}{2(1 - tM/3)}$ holds (compare the two power series term by term, using $k! \geq 2 \cdot 3^{k-2}$). Combining with $\log(1 + x) \leq x$:

$$\log \mathbb{E}[e^{tY_i}] \;\leq\; \frac{\sigma^2 t^2 / 2}{1 - tM/3}, \qquad 0 < t < 3/M.$$

This is the sub-exponential MGF bound: it grows like a sub-Gaussian MGF for small $t$ but blows up as $t \to 3/M$. Summing over independent variables:

$$\log \mathbb{E}\!\left[\exp\!\left(t \sum_i Y_i\right)\right] \;=\; \sum_i \log \mathbb{E}[e^{tY_i}] \;\leq\; \frac{n \sigma^2 t^2 / 2}{1 - tM/3}.$$

Chernoff: for $t \in (0, 3/M)$,

$$P\!\left(\sum Y_i \geq n\varepsilon\right) \;\leq\; \exp\!\left(-tn\varepsilon + \frac{n \sigma^2 t^2 / 2}{1 - tM/3}\right).$$

Optimizing over $t$: setting the derivative of the exponent to zero gives

$$t^* \;=\; \frac{\varepsilon}{\sigma^2 + M \varepsilon / 3}.$$

Substituting back (a calculation the reader is welcome to verify term-by-term):

$$-t^* n\varepsilon + \frac{n\sigma^2 (t^*)^2 / 2}{1 - t^* M/3} \;=\; -\frac{n \varepsilon^2}{2\sigma^2 + 2M\varepsilon/3}.$$

Therefore

$$P\!\left(\bar{X}_n - \mu \geq \varepsilon\right) \;\leq\; \exp\!\left(-\frac{n\varepsilon^2}{2\sigma^2 + 2M\varepsilon/3}\right).$$

The two-sided version follows by applying the same argument to $-Y_i$ and union-bounding. $\blacksquare$

Now the comparison between Chebyshev, Hoeffding, Bernstein, sub-Gaussian, and the exact tail — all on a single chart — makes the full progression concrete.

Interactive: concentration-bound comparison. Curves shown: Markov ($\geq 1$), Chebyshev, Hoeffding, Bernstein, sub-Gaussian, and the Monte Carlo (exact) tail. Rare-event setting: low variance ($\sigma^2 = 0.09$) makes Bernstein much tighter than Hoeffding.
Example 3 Bernstein vs Hoeffding for rare events

Take $X_i \sim$ Bernoulli$(0.01)$, iid, $n = 1000$. Then $\mu = 0.01$, $\sigma^2 = 0.0099$, $M = 0.99$ (the deviation $|X_i - \mu|$ is at most $1 - 0.01 = 0.99$), $b - a = 1$. For $\varepsilon = 0.01$:

  • Hoeffding: $2\exp(-2 \cdot 1000 \cdot 0.0001/1) = 2e^{-0.2} \approx 1.64$ — clamped to $1$; useless.
  • Bernstein: $2\exp\!\big(-\frac{1000 \cdot 0.0001}{2 \cdot 0.0099 + 2 \cdot 0.99 \cdot 0.01/3}\big) = 2\exp(-0.1/0.0264) \approx 2e^{-3.79} \approx 0.045$.

Bernstein gives a useful bound while Hoeffding gives nothing. The reason is the variance: when $p$ is small, $\sigma^2 = p(1-p) \approx p$ is much smaller than $(b-a)^2/4 = 0.25$. Bernstein exploits this; Hoeffding cannot.
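
A sketch of the comparison (standard library only; names are ours), clamping both bounds at $1$:

```python
import math

def hoeffding(n, eps, span=1.0):
    return min(1.0, 2 * math.exp(-2 * n * eps**2 / span**2))

def bernstein(n, eps, var, M):
    return min(1.0, 2 * math.exp(-n * eps**2 / (2 * var + 2 * M * eps / 3)))

n, p, eps = 1000, 0.01, 0.01
var, M = p * (1 - p), 1 - p              # variance 0.0099, worst deviation 0.99
print(hoeffding(n, eps))                 # 1.0 (clamped; the raw bound exceeds 1)
print(bernstein(n, eps, var, M))         # ~0.045
```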

Remark 3 Two regimes: sub-Gaussian and sub-exponential

Bernstein’s denominator $2\sigma^2 + 2M\varepsilon/3$ interpolates two behaviors. For small $\varepsilon$ (the sub-Gaussian regime), the $2\sigma^2$ term dominates and the exponent is $\approx -n\varepsilon^2/(2\sigma^2)$ — Gaussian-like tail decay with variance proxy $\sigma^2$. For large $\varepsilon$ (the sub-exponential regime), the $2M\varepsilon/3$ term dominates and the exponent is $\approx -3n\varepsilon/(2M)$ — exponential (not Gaussian) tail decay. The crossover happens around $\varepsilon \approx 3\sigma^2/M$. This two-regime behavior is exactly what we mean by a sub-exponential random variable, which we define next.

12.5 Sub-Gaussian and Sub-Exponential Random Variables

The modern approach to concentration is not to prove each bound separately but to classify random variables by how their MGF behaves near $t = 0$. Sub-Gaussian variables have MGFs bounded by a Gaussian MGF for all $t$; sub-exponential variables satisfy the same bound, but only in a neighborhood of the origin. Each class has a clean tail bound and closure properties (sums, products, powers) that let you combine simple building blocks into sophisticated bounds.

Three-panel sub-Gaussian classification: (left) tail decay comparison — Gaussian vs χ²(1) vs Cauchy with sub-Gaussian envelope; (center) Orlicz norm ψ₂ and ψ₁ for common distributions; (right) hierarchy — bounded ⊂ sub-Gaussian ⊂ sub-exponential ⊂ finite-variance
Definition 1 Sub-Gaussian random variable

A random variable $X$ with $\mathbb{E}[X] = \mu$ is sub-Gaussian with parameter $\sigma \geq 0$ if for every $t \in \mathbb{R}$:

$$\mathbb{E}\!\left[e^{t(X - \mu)}\right] \;\leq\; \exp\!\left(\frac{\sigma^2 t^2}{2}\right).$$

The parameter $\sigma$ is sometimes called the variance proxy. It need not equal the actual standard deviation of $X$.

Definition 2 Sub-exponential random variable

A random variable $X$ with $\mathbb{E}[X] = \mu$ is sub-exponential with parameters $(\nu^2, b)$ (both positive) if for every $|t| < 1/b$:

$$\mathbb{E}\!\left[e^{t(X - \mu)}\right] \;\leq\; \exp\!\left(\frac{\nu^2 t^2}{2}\right).$$

Compared to the sub-Gaussian definition, the MGF bound only holds in a neighborhood of $t = 0$, not for all $t$. This allows heavier tails, but the tail probability decays only exponentially (not Gaussian) past a certain scale.

Theorem 5 Sub-Gaussian equivalences

Let $X$ be a random variable with $\mathbb{E}[X] = \mu$. The following are equivalent (up to universal constants in the parameter):

  1. MGF condition: $X$ is sub-Gaussian with parameter $\sigma$: $\mathbb{E}[e^{t(X - \mu)}] \leq e^{\sigma^2 t^2/2}$ for all $t$.
  2. Tail condition: $P(|X - \mu| \geq u) \leq 2e^{-u^2/(2\sigma^2)}$ for all $u \geq 0$.
  3. Moment condition: $(\mathbb{E}[|X - \mu|^p])^{1/p} \leq C\sigma\sqrt{p}$ for all $p \geq 1$, with $C$ a universal constant.
  4. Orlicz norm condition: $\|X - \mu\|_{\psi_2} < \infty$, where $\|Y\|_{\psi_2} = \inf\{c > 0 : \mathbb{E}[\exp(Y^2/c^2)] \leq 2\}$.

The proof of Theorem 12.5 is an exercise in manipulating the MGF, the tail, and the moments against each other — the key tools are Markov’s inequality (MGF ⟹ tail, via the Chernoff argument) and integration by parts (tail ⟹ moments) — and the universal-constant factors relating the four parameters are small and can be read off from Vershynin (2018, Chapter 2). We will not reproduce the proof here; the practically important thing is the equivalence itself, which means that “sub-Gaussian” is a single property that can be checked in whichever formulation is most convenient.

Interactive: sub-Gaussian classifier. Empirical tail of 100,000 samples against the sub-Gaussian envelope $2e^{-t^2/(2\sigma^2)}$, here with $\sigma_{\mathrm{sg}} = 1$.
Example 4 Classification of common distributions
  • Normal$(\mu, \sigma^2)$: The MGF is exactly $e^{\mu t + \sigma^2 t^2/2}$, so it is sub-Gaussian with parameter $\sigma$ — the variance proxy equals the actual standard deviation.
  • Bounded $X \in [a, b]$: By Hoeffding’s lemma, $X$ centered is sub-Gaussian with parameter $(b-a)/2$. So every bounded random variable is sub-Gaussian, though the variance proxy may be loose.
  • Rademacher $X \in \{-1, +1\}$: Sub-Gaussian with $\sigma = 1$. The key building block of Rademacher complexity.
  • Exponential$(\lambda)$ centered at $1/\lambda$: Sub-exponential with parameters $(\nu^2, b) = (4/\lambda^2, 2/\lambda)$, but not sub-Gaussian — the right tail $P(X > t) = e^{-\lambda t}$ decays like $e^{-\lambda t}$ rather than $e^{-t^2}$.
  • $\chi^2(1) - 1$: Sub-exponential but not sub-Gaussian. This matters: $\chi^2$ variables arise as squared Gaussians, and the square of a sub-Gaussian is always sub-exponential.
  • Poisson$(\lambda)$ centered at $\lambda$: Sub-exponential, not sub-Gaussian.
  • Cauchy$(0, 1)$: Has no finite mean, let alone a sub-Gaussian or sub-exponential MGF. Heavy-tailed. Concentration inequalities do not apply; even the LLN fails for Cauchy.
  • Student-$t(\nu)$ for $\nu \geq 3$: Has finite variance but only polynomial tail decay. Not sub-exponential.
Example 5 Sums of sub-Gaussians are sub-Gaussian

If $X_1, \ldots, X_n$ are independent and sub-Gaussian with parameters $\sigma_1, \ldots, \sigma_n$, then $\sum_i X_i$ is sub-Gaussian with parameter $\sqrt{\sum_i \sigma_i^2}$, and $\bar{X}_n$ is sub-Gaussian with parameter $\sigma/\sqrt{n}$ (in the iid case with common $\sigma$). The proof is a one-line MGF computation:

$$\mathbb{E}\!\left[e^{t \sum_i (X_i - \mu_i)}\right] \;=\; \prod_i \mathbb{E}[e^{t(X_i - \mu_i)}] \;\leq\; \prod_i e^{\sigma_i^2 t^2/2} \;=\; \exp\!\left(\frac{t^2 \sum_i \sigma_i^2}{2}\right).$$

This is why sub-Gaussian is the right abstraction for statistical learning: the class is closed under finite sums of independent variables, so sample means inherit the variance proxy $\sigma^2/n$ and their deviations decay at the correct $1/\sqrt{n}$ scale. Apply the tail condition to $\bar{X}_n$ with variance proxy $\sigma^2/n$:

$$P(|\bar{X}_n - \mu| \geq \varepsilon) \;\leq\; 2 \exp\!\left(-\frac{n\varepsilon^2}{2\sigma^2}\right).$$

This is the sub-Gaussian form of the bound — cleaner than Hoeffding because it parametrizes by $\sigma$ rather than by $(b - a)$.
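
A quick Monte Carlo sanity check of the averaged bound, using Rademacher variables (sub-Gaussian with $\sigma = 1$); the constants, seed, and names are ours:

```python
import math
import random

random.seed(0)
n, sigma, eps, trials = 200, 1.0, 0.2, 10000

# The sample mean of n Rademacher variables is sub-Gaussian with variance
# proxy sigma^2 / n, so its tail is bounded by 2 exp(-n eps^2 / (2 sigma^2)).
exceed = 0
for _ in range(trials):
    xbar = sum(random.choice((-1, 1)) for _ in range(n)) / n
    exceed += abs(xbar) >= eps

print(exceed / trials, 2 * math.exp(-n * eps**2 / (2 * sigma**2)))
```

The empirical frequency (about $0.005$ here) sits comfortably below the bound of about $0.037$, as it must: the bound is worst-case over all distributions with variance proxy $1$.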

Remark 4 The hierarchy and the Orlicz norm

There is a strict hierarchy of tail classes:

$$\text{bounded} \;\subset\; \text{sub-Gaussian} \;\subset\; \text{sub-exponential} \;\subset\; \{\text{finite variance}\} \;\subset\; \{\text{all}\}.$$

Each inclusion is strict: a Normal is sub-Gaussian but not bounded; an Exponential is sub-exponential but not sub-Gaussian; a Student-$t(3)$ has finite variance but is not sub-exponential; a Cauchy has neither. Closure properties sit nicely on top: products of sub-Gaussians are sub-exponential (as in $\chi^2 = Z^2$), sums of sub-Gaussians are sub-Gaussian, maxima of finitely many sub-Gaussians are sub-Gaussian (with parameter scaled by $\sqrt{\log n}$). The Orlicz norm $\|X\|_{\psi_2}$ is a single number that controls the full tail behavior — Vershynin’s Chapter 2 develops this formalism in detail, and it’s the vocabulary of choice in high-dimensional statistics.

12.6 The Cramér Rate Function and Large Deviations

We’ve seen that the Chernoff bound gives an exponential rate of decay for tail probabilities, and that the optimal exponent is $I(a) = \sup_t (ta - \Lambda(t))$. This exponent deserves a name and a theory of its own. The study of exponential rates — of the form $P(\bar{X}_n \in A) \approx e^{-n I(A)}$ for sets $A$ far from the mean — is the subject of large deviations theory. Its central result for iid sample means is Cramér’s theorem: the exponent is exactly the rate function, not just an upper bound.

Three-panel rate function: (left) Cramér rate I(x) for Bernoulli(p) = KL divergence for several p; (center) I(x) for N(μ, σ²) as the parabola (x−μ)²/(2σ²); (right) geometric interpretation — I(x) = sup_t(tx − Λ(t)) is the gap between the tangent line and the cumulant generating function
Definition 3 Cramér rate function

For a random variable $X$ with cumulant-generating function $\Lambda(t) = \log \mathbb{E}[e^{tX}]$ (finite in a neighborhood of $0$), the Cramér rate function is the Legendre transform of $\Lambda$:

$$I(x) \;=\; \sup_{t \in \mathbb{R}} \, \big(tx - \Lambda(t)\big).$$

The rate function is non-negative, convex, and vanishes at $x = \mathbb{E}[X]$.

The non-negativity and the zero at the mean are both consequences of the definition: choose $t = 0$ to get $I(x) \geq 0 \cdot x - \Lambda(0) = 0$, and at $x = \mu$ the function $tx - \Lambda(t) = t\mu - \Lambda(t)$ is maximized at $t = 0$ (since $\Lambda'(0) = \mu$), giving $I(\mu) = 0$. Convexity follows because $I$ is a supremum of affine functions of $x$.

Theorem 6 Cramér's theorem (LDP for iid sample means)

Let $X_1, X_2, \ldots$ be iid with cumulant-generating function $\Lambda(t)$ finite in a neighborhood of $0$, and let $\mu = \mathbb{E}[X_1]$. Then for every $a > \mu$:

$$\lim_{n \to \infty} \frac{1}{n} \log P(\bar{X}_n \geq a) \;=\; -I(a),$$

and for every $a < \mu$:

$$\lim_{n \to \infty} \frac{1}{n} \log P(\bar{X}_n \leq a) \;=\; -I(a).$$

Equivalently, $P(\bar{X}_n \geq a)$ decays exponentially at the rate $I(a)$.

Cramér’s theorem says the Chernoff upper bound is exponentially tight: there is no better exponent than $I(a)$. The proof of the upper bound is exactly the Chernoff argument (Theorem 12.1). The proof of the matching lower bound — that $P(\bar{X}_n \geq a) \geq e^{-n(I(a) + o(1))}$ — uses a change of measure: tilt the distribution so that the mean becomes $a$, the event $\{\bar{X}_n \geq a\}$ becomes typical under the tilted measure, and then bound the likelihood ratio. We refer the reader to Dembo & Zeitouni (2010, Theorem 2.2.3) for the full argument.

Interactive: the Cramér rate function $I(x)$.
Example 6 The Bernoulli rate function is the KL divergence

For $X_i \sim$ Bernoulli$(p)$, $\Lambda(t) = \log(1 - p + pe^t)$. The first-order condition $\Lambda'(t^*) = x$ becomes $pe^{t^*}/(1 - p + pe^{t^*}) = x$, which solves to $e^{t^*} = x(1-p)/[p(1-x)]$. Substituting:

$$I(x) \;=\; t^* x - \Lambda(t^*) \;=\; x \log\!\frac{x}{p} + (1-x) \log\!\frac{1-x}{1-p} \;=\; D_{\text{KL}}\!\big(\text{Ber}(x) \,\|\, \text{Ber}(p)\big).$$

So the large-deviation rate for a Bernoulli average deviating to $x$ is the Kullback–Leibler divergence between the empirical distribution Ber$(x)$ and the true distribution Ber$(p)$. This is not a coincidence — it is the Bernoulli case of Sanov’s theorem (next remark), which says the rate for the empirical measure is always the KL divergence.

For example, at $p = 0.3$, $x = 0.5$: $I(0.5) = 0.5\log(5/3) + 0.5\log(5/7) \approx 0.0872$, so $P(\bar{X}_n \geq 0.5) \lesssim e^{-0.0872 n}$. For $n = 100$ this is about $1.6 \times 10^{-4}$; the true binomial probability at $p = 0.3$ is about $2.2 \times 10^{-5}$. The Cramér rate captures the exponential order correctly; the sub-exponential prefactor (which Cramér does not capture) accounts for the remaining factor of about seven.
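
A minimal sketch computing the rate bound and the exact binomial tail (standard library only; names are ours):

```python
import math

def bernoulli_rate(x, p):
    """Cramér rate I(x) = KL(Ber(x) || Ber(p))."""
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

n, p, x = 100, 0.3, 0.5
rate_bound = math.exp(-n * bernoulli_rate(x, p))
exact = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(50, n + 1))
print(rate_bound, exact)   # ~1.6e-4 vs ~2.2e-5
```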

Remark 5 Sanov's theorem — the rate is KL divergence

Sanov’s theorem generalizes Cramér from the scalar sample mean to the full empirical distribution. If $X_1, \ldots, X_n$ are iid from a distribution $P$ and $\hat{P}_n$ is the empirical measure (point masses at the observations), then for a “nice” set of distributions $\mathcal{Q}$ far from $P$:

$$\lim_{n \to \infty} \frac{1}{n} \log P(\hat{P}_n \in \mathcal{Q}) \;=\; -\inf_{Q \in \mathcal{Q}} D_{\text{KL}}(Q \,\|\, P).$$

In words: the rate at which the empirical distribution deviates to any fixed target $Q$ is the KL divergence between $Q$ and $P$. This is the information-theoretic interpretation of large deviations — rare events are improbable in proportion to how much information they would contain about the wrong distribution. The proof uses the method of types or topological machinery that we do not develop here; Dembo & Zeitouni (Chapter 6) is the standard reference. For our purposes, the Bernoulli case (Example 12.6) is the concrete instance that matters.

Remark 6 Exponential families and the Legendre transform

The rate function $I(x)$ is the Legendre transform of the cumulant-generating function $\Lambda(t) = \log M_X(t)$. For exponential families, $\Lambda(t)$ is the log-partition function (up to a shift in parameterization). So the rate function of an exponential-family sample mean is the conjugate of the log-partition function — a direct mathematical connection between large deviations and the exponential-family machinery of Topic 7. In Bayesian inference, this same conjugacy structure appears: the posterior concentrates around the MLE at the rate given by the KL divergence, and the quadratic approximation to $I(x)$ at the MLE is the Fisher information — the curvature of the log-likelihood. All of these are manifestations of the same Legendre duality.

12.7 Concentration for Functions of Independent Variables

So far we’ve concentrated sample means — linear functions of the data. But many quantities in statistics and ML are non-linear functions of the data: the empirical CDF $F_n(x)$, the sample median, the kernel density estimator, the supremum deviation $\sup_x |F_n(x) - F(x)|$, the Rademacher complexity of a function class. We need tools that apply to any function of $(X_1, \ldots, X_n)$ that doesn’t change too much when one input is changed.

The machinery is McDiarmid’s inequality, a direct corollary of the Azuma–Hoeffding inequality for martingales. We state Azuma–Hoeffding without proving it (the proof requires some martingale theory we don’t develop) and use it to prove McDiarmid.

Three-panel McDiarmid: (left) Doob martingale construction for f(X₁,...,Xₙ) — each step reveals one variable; (center) McDiarmid applied to the empirical mean recovering Hoeffding; (right) McDiarmid applied to the kernel density estimator — bounded differences with cᵢ = sup K(x)/(nh)

A martingale $(M_0, M_1, \ldots, M_n)$ is a sequence of random variables with the property that each $M_k$ equals the conditional expectation of $M_{k+1}$ given the history: $\mathbb{E}[M_{k+1} \mid M_0, \ldots, M_k] = M_k$. Intuitively, it’s a “fair game” process: knowing the past doesn’t help you predict whether the next step will go up or down. The increments $M_k - M_{k-1}$ have mean zero given the past, and they can be arbitrarily dependent — what matters is only that they are not too large in magnitude.

Theorem 7 Azuma–Hoeffding inequality

Let $(M_0, M_1, \ldots, M_n)$ be a martingale with $|M_k - M_{k-1}| \leq c_k$ almost surely for each $k = 1, \ldots, n$. Then for every $t \geq 0$:

$$P\!\left(|M_n - M_0| \geq t\right) \;\leq\; 2 \exp\!\left(-\frac{t^2}{2 \sum_{k=1}^n c_k^2}\right).$$

Azuma–Hoeffding is Hoeffding for martingale increments — the same sub-Gaussian MGF argument goes through, replacing independence with the martingale property (conditional expectation in place of unconditional). The full proof is in Boucheron, Lugosi & Massart (2013, §2.5) or Dembo & Zeitouni (Chapter 2).

Theorem 8 McDiarmid's inequality (bounded differences)

Let $f : \mathcal{X}^n \to \mathbb{R}$ be a function of $n$ variables satisfying the bounded-differences property with constants $c_1, \ldots, c_n$: for every $i$ and every $x_1, \ldots, x_n, x_i' \in \mathcal{X}$,

$$\left|f(x_1, \ldots, x_i, \ldots, x_n) - f(x_1, \ldots, x_i', \ldots, x_n)\right| \;\leq\; c_i.$$

If $X_1, \ldots, X_n$ are independent random variables, then for every $t \geq 0$:

$$P\!\left(\left|f(X_1, \ldots, X_n) - \mathbb{E}[f(X_1, \ldots, X_n)]\right| \geq t\right) \;\leq\; 2 \exp\!\left(-\frac{2 t^2}{\sum_{i=1}^n c_i^2}\right).$$
Proof.

The proof uses the Doob martingale construction: a sequence of conditional expectations that reveals the inputs one at a time. Define

$$M_k \;=\; \mathbb{E}[f(X_1, \ldots, X_n) \mid X_1, \ldots, X_k], \qquad k = 0, 1, \ldots, n.$$

Then $M_0 = \mathbb{E}[f]$ (no conditioning), $M_n = f(X_1, \ldots, X_n)$ (fully conditioned), and by the tower property $\mathbb{E}[M_{k+1} \mid M_0, \ldots, M_k] = M_k$ — so $(M_k)$ is a martingale.

The key step is bounding the increments. Because the $X_i$ are independent, we can integrate out $X_{k+1}$ conditional on $(X_1, \ldots, X_k)$ to express both $M_k$ and $M_{k+1}$ as integrals over $X_{k+1}$. Using the bounded-differences property:

$$|M_{k+1} - M_k| \;=\; \left|\mathbb{E}[f \mid X_1, \ldots, X_{k+1}] - \mathbb{E}[f \mid X_1, \ldots, X_k]\right|.$$

We can write this as $\left|\mathbb{E}\big[f(X_1, \ldots, X_{k+1}, X'_{k+2}, \ldots, X'_n) - f(X_1, \ldots, X_k, X'_{k+1}, \ldots, X'_n)\big]\right|$ where the $X'_j$ are independent copies integrated out. The bounded-differences assumption in coordinate $k+1$ gives

$$|M_{k+1} - M_k| \;\leq\; c_{k+1}.$$

Applying Azuma–Hoeffding (Theorem 12.7) with these increment bounds would give $P(|M_n - M_0| \geq t) \leq 2\exp(-t^2/(2\sum_k c_k^2))$ — the Azuma constant. The sharper McDiarmid constant ($2t^2$ in the numerator rather than $t^2/2$) comes from a better sub-Gaussian MGF bound on each increment: given the filtration up to step $k$, the increment $M_{k+1} - M_k$ is a mean-zero random variable that lies in an interval of length at most $c_{k+1}$ (because varying $X_{k+1}$ over its support changes $f$ by at most $c_{k+1}$, so the conditional range of $M_{k+1}$ given the past is $\leq c_{k+1}$). Hoeffding’s lemma (Theorem 12.2) then gives

$$\mathbb{E}\!\left[e^{s(M_{k+1} - M_k)} \mid X_1, \ldots, X_k\right] \;\leq\; \exp\!\left(\frac{s^2 c_{k+1}^2}{8}\right).$$

Applying the Chernoff technique with this conditional MGF bound — exactly as in Hoeffding’s inequality proof, iterating the conditional-expectation argument over $k$ — gives

$$P(|M_n - M_0| \geq t) \;\leq\; 2 \exp\!\left(-\frac{2 t^2}{\sum_k c_k^2}\right).$$

The key point: Azuma’s generic constant uses $|M_{k+1} - M_k| \leq c$ as a worst-case absolute bound, which is equivalent to sub-Gaussian with variance proxy $c^2$; but for the Doob martingale of a bounded-differences function, we know the stronger fact that the increment’s conditional range is $\leq c_{k+1}$, which by Hoeffding’s lemma gives sub-Gaussian variance proxy $c_{k+1}^2/4$. The factor of $4$ in the variance proxy is exactly the factor-of-$4$ improvement in the exponent. $\blacksquare$

McDiarmid is the Swiss army knife of concentration. It applies to any function of independent inputs that has bounded differences, regardless of the function’s structure. The variance of $f$ doesn’t enter — only the coordinate-wise Lipschitz constants $c_i$.

Example 7 The empirical CDF satisfies bounded differences

For iid samples $X_1, \ldots, X_n$ from distribution $F$, the empirical CDF is $F_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_i \leq x\}$. For any fixed $x$, changing one input $X_i$ changes $F_n(x)$ by at most $1/n$ (it either adds or removes one indicator). So $F_n(x)$ satisfies bounded differences with $c_i = 1/n$ for each $i$, and McDiarmid gives

$$P\!\left(\left|F_n(x) - F(x)\right| \geq t\right) \;\leq\; 2 \exp\!\left(-2 n t^2\right).$$

This is pointwise in $x$. The supremum $\sup_x |F_n(x) - F(x)|$ requires the stronger Dvoretzky–Kiefer–Wolfowitz inequality (Theorem 10.9), which gives the same exponential rate but uniformly over $x$.

A more substantial example: the kernel density estimator $\hat{f}_h(x) = \frac{1}{nh}\sum_i K((x - X_i)/h)$. Changing one $X_i$ changes $\hat{f}_h(x)$ by at most $\sup_u K(u)/(nh)$, so $c_i = \|K\|_\infty/(nh)$, and McDiarmid gives exponential concentration of $\hat{f}_h(x)$ around its expectation at rate $2n^2h^2t^2/(n\|K\|_\infty^2) = 2nh^2t^2/\|K\|_\infty^2$. This is the starting point for uniform convergence of density estimators.
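
A small simulation of the pointwise bound, for Uniform$(0, 1)$ samples at $x_0 = 0.25$ (so $F(x_0) = 0.25$); the parameters, seed, and names are ours:

```python
import math
import random

random.seed(1)
n, x0, t, trials = 500, 0.25, 0.05, 5000

exceed = 0
for _ in range(trials):
    fn = sum(random.random() <= x0 for _ in range(n)) / n   # empirical CDF at x0
    exceed += abs(fn - x0) >= t

print(exceed / trials, 2 * math.exp(-2 * n * t**2))   # empirical freq vs McDiarmid
```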

Remark 7 Why McDiarmid rather than Hoeffding for sums of Lipschitz-like functionals

When $f$ is a linear function — the sample mean — McDiarmid with $c_i = (b-a)/n$ recovers Hoeffding exactly: $P(|\bar{X}_n - \mu| \geq t) \leq 2\exp(-2nt^2/(b-a)^2)$. But McDiarmid goes much further: it concentrates nonlinear functionals (empirical CDFs, kernel densities, Rademacher complexities, the maximum discrepancy of a VC class over a sample) with no additional work beyond checking the bounded-differences property. This is why it dominates statistical learning theory. The cost — relative to a dedicated argument for a specific functional — is that the bound is always worst-case in a coordinate-wise sense; variance-sensitive refinements exist (Bousquet’s inequality, Talagrand’s inequality) but require substantially more machinery.

12.8 From Bounds to Algorithms: PAC, Johnson–Lindenstrauss, Differential Privacy

The payoff of the concentration machinery is immediate and broad. Three examples illustrate the range.

Three-panel ML applications: (left) PAC bound — sample complexity n(ε, δ, |H|) for several |H|; (center) Johnson–Lindenstrauss — target dimension k vs number of points n for several ε; (right) differential privacy — noise scale σ vs privacy parameter ε for several δ
Theorem 9 PAC bound for finite hypothesis classes

Let $\mathcal{H}$ be a finite set of hypotheses $h : \mathcal{X} \to \{0, 1\}$, each with $0$–$1$ loss on the data distribution $P$: $R(h) = \mathbb{E}[L(h(X), Y)]$. Given iid training samples $(X_1, Y_1), \ldots, (X_n, Y_n)$, let $\hat{R}_n(h)$ be the training error. Then for every $\varepsilon > 0$, $\delta \in (0, 1)$:

$$P\!\left(\sup_{h \in \mathcal{H}} \big|R(h) - \hat{R}_n(h)\big| \geq \varepsilon\right) \;\leq\; 2|\mathcal{H}| \exp(-2n\varepsilon^2),$$

and therefore $n \geq \frac{1}{2\varepsilon^2}\log\!\frac{2|\mathcal{H}|}{\delta}$ samples suffice to guarantee that the training error approximates the true error uniformly over $\mathcal{H}$ within $\varepsilon$, with probability at least $1 - \delta$.

Proof.

For each fixed $h \in \mathcal{H}$, the loss $L(h(X_i), Y_i)$ is bounded in $[0, 1]$, so Hoeffding’s inequality gives

$$P\!\left(|R(h) - \hat{R}_n(h)| \geq \varepsilon\right) \;\leq\; 2\exp(-2n\varepsilon^2).$$

By the union bound over $\mathcal{H}$:

$$P\!\left(\exists h \in \mathcal{H} : |R(h) - \hat{R}_n(h)| \geq \varepsilon\right) \;\leq\; \sum_{h \in \mathcal{H}} P(|R(h) - \hat{R}_n(h)| \geq \varepsilon) \;\leq\; 2|\mathcal{H}| \exp(-2n\varepsilon^2).$$

Setting this $\leq \delta$ and solving for $n$ gives the sample complexity bound. $\blacksquare$
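
The sample-complexity formula in code (a minimal sketch; names are ours), making the logarithmic dependence on $|\mathcal{H}|$ visible:

```python
import math

def pac_sample_complexity(eps, delta, H_size):
    """n >= log(2|H|/delta) / (2 eps^2), from Hoeffding plus the union bound."""
    return math.ceil(math.log(2 * H_size / delta) / (2 * eps**2))

for H_size in (10, 10**3, 10**6):
    print(H_size, pac_sample_complexity(0.05, 0.05, H_size))
# |H| grows by five orders of magnitude; n roughly triples (1199 -> 2120 -> 3501).
```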

Interactive: sample-size calculator. With $\varepsilon = 0.05$, $\delta = 0.05$, and $X \in [0, 1]$ with the worst-case $\sigma^2 = 1/4$ (so $M = 1/2$), it reports $n \geq 385$ under the tightest applicable bound (CLT approx):

| Bound | Required $n$ |
| --- | --- |
| Chebyshev | 2,000 |
| Bernstein | 763 |
| Hoeffding | 738 |
| CLT (approx) | 385 |

The CLT row uses the asymptotic $z = \Phi^{-1}(1 - \delta/2)$ formula and is approximate; the others are non-asymptotic.
Example 8 Empirical risk minimization is consistent

Let $\hat{h}_n = \arg\min_{h \in \mathcal{H}} \hat{R}_n(h)$ be the empirical risk minimizer. Theorem 12.9 implies $R(\hat{h}_n) \leq R^* + 2\varepsilon$ with probability $\geq 1 - \delta$, where $R^* = \min_{h \in \mathcal{H}} R(h)$ is the best possible risk in the class. This is the Probably Approximately Correct (PAC) guarantee: with high probability (at least $1 - \delta$), the learned hypothesis is approximately optimal (within $2\varepsilon$ of the best). The sample complexity $n \propto \log|\mathcal{H}|/\varepsilon^2$ is what powers VC-dimension bounds, Rademacher-complexity bounds, and the modern statistical learning theory of Vapnik, Bousquet, and others.

Example 9 Johnson–Lindenstrauss: random projections preserve distances

Given $n$ points in $\mathbb{R}^d$, project them into $\mathbb{R}^k$ using a random Gaussian matrix $A \in \mathbb{R}^{k \times d}$ with iid $\mathcal{N}(0, 1/k)$ entries. The Johnson–Lindenstrauss lemma says that if

$$k \;\geq\; \frac{8\ln(n/\delta)}{\varepsilon^2},$$

then with probability at least $1 - \delta$, every pairwise distance $\|x_i - x_j\|$ is preserved within a factor of $(1 \pm \varepsilon)$ after projection: $\|A(x_i - x_j)\|^2 \in \big[(1-\varepsilon)\|x_i - x_j\|^2, \,(1+\varepsilon)\|x_i - x_j\|^2\big]$.

The proof: fix a single pair $x, y$ and their difference $v = x - y$. The random variable $\|Av\|^2/\|v\|^2$ is a scaled chi-squared (a sum of $k$ squared Gaussians, divided by $k$), and concentration of sub-exponential random variables (Bernstein’s inequality applied to a chi-squared, or a direct MGF calculation) gives $P(\|Av\|^2/\|v\|^2 \notin [1 - \varepsilon, 1 + \varepsilon]) \leq 2e^{-k\varepsilon^2/8}$. Union-bounding over all $\binom{n}{2}$ pairs gives the claimed target dimension.

The punchline: $k = O(\varepsilon^{-2}\log n)$ dimensions suffice — independent of the original dimension $d$. This is the entire theoretical foundation of dimensionality reduction by random projection, which underlies sketching algorithms, fast nearest-neighbor search, and compressed sensing.
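
A minimal sketch of the construction with NumPy (the dimensions, seed, and names are ours; for illustration we take $k = \lceil 8\ln n/\varepsilon^2 \rceil$, dropping the $\delta$ refinement):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 50, 10_000, 0.25
k = int(np.ceil(8 * np.log(n) / eps**2))           # target dimension, ~501

X = rng.normal(size=(n, d))                        # n points in R^d
A = rng.normal(scale=1 / np.sqrt(k), size=(k, d))  # iid N(0, 1/k) projection matrix
Y = X @ A.T

def sq_dists(Z):
    """All pairwise squared Euclidean distances."""
    G = (Z**2).sum(axis=1)
    return G[:, None] + G[None, :] - 2 * Z @ Z.T

iu = np.triu_indices(n, 1)
ratio = sq_dists(Y)[iu] / sq_dists(X)[iu]
print(k, ratio.min(), ratio.max())   # ratios should lie within [1 - eps, 1 + eps]
```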

Remark 8 Differential privacy: noise calibration via sub-Gaussian tails

The Gaussian mechanism for differential privacy releases a query $f(D)$ on a dataset $D$ with added noise: $f(D) + Z$ where $Z \sim \mathcal{N}(0, \sigma^2)$. To achieve $(\varepsilon, \delta)$-differential privacy, we need to control how much the output distribution shifts when one person’s data changes — which is controlled by the sensitivity $\Delta_f = \sup_{D, D'} |f(D) - f(D')|$, where the sup is over datasets differing in one record.

A calculation using the sub-Gaussian tail bound shows that the noise scale

$$\sigma \;=\; \frac{\Delta_f \sqrt{2\ln(1.25/\delta)}}{\varepsilon}$$

is sufficient. The derivation: the privacy loss random variable $\log(P_D(o)/P_{D'}(o))$ is sub-Gaussian with parameter $\Delta_f/\sigma$, and the $(\varepsilon, \delta)$-privacy condition $P(\log(P_D/P_{D'}) > \varepsilon) \leq \delta$ becomes $2\exp(-\varepsilon^2\sigma^2/(2\Delta_f^2)) \leq \delta$, which inverts to the $\sigma$ formula above. This is why every differentially private algorithm uses Gaussian or Laplace noise calibrated to a tail-bound quantity — the concentration machinery is the privacy machinery.
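
The calibration in code: a minimal sketch of the mechanism (the query value, parameters, and names are ours):

```python
import math
import random

def gaussian_mechanism(value, sensitivity, eps, delta):
    """Release value + N(0, sigma^2) with sigma calibrated for (eps, delta)-DP."""
    sigma = sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / eps
    return value + random.gauss(0.0, sigma), sigma

random.seed(2)
# A counting query (sensitivity 1) released with (1, 1e-5)-differential privacy:
noisy, sigma = gaussian_mechanism(4213, sensitivity=1.0, eps=1.0, delta=1e-5)
print(noisy, sigma)   # sigma ~ 4.84
```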

12.9 Connections to Machine Learning

The five families of bounds developed here appear everywhere in ML, in ways that are easy to miss if you haven’t seen them. Here are the most load-bearing applications.

Three-panel ML connections: (left) Rademacher complexity symmetrization — ghost sample argument with McDiarmid controlling the deviation; (center) SGD convergence with Azuma–Hoeffding confidence bands; (right) non-asymptotic Monte Carlo — Hoeffding band vs CLT band as a function of n
Remark 9 The information hierarchy of concentration

There is a clean progression from the weakest to the strongest concentration bound, parametrized by how much structural information you know about the distribution:

  • Markov (Topic 4, Theorem 11): uses only the mean. Gives $O(1/\varepsilon)$ tail decay, typically useless for finite-sample guarantees.
  • Chebyshev (Topic 4, Theorem 12): uses the variance. Gives $O(1/(n\varepsilon^2))$ decay — polynomial in $n$.
  • Hoeffding (Theorem 12.3): uses boundedness. Gives $O(e^{-cn\varepsilon^2})$ — exponential in $n$, distribution-free within bounded classes.
  • Bernstein (Theorem 12.4): uses variance + boundedness. Two-regime: sub-Gaussian for small $\varepsilon$, sub-exponential for large $\varepsilon$.
  • Sub-Gaussian (Definition 12.1, Theorem 12.5): uses the full sub-Gaussian MGF structure. Unifies bounded, Gaussian, and many other families under a single framework.
  • Chernoff (Theorem 12.1): uses the full MGF. The optimal exponent is the Cramér rate function $I(x)$.

More information ⟹ tighter bounds. This is the no-free-lunch principle applied to tail estimation: to get exponential rates, you need to assume something about the distribution beyond finite variance. The beauty of the modern theory is that sub-Gaussian is a natural, checkable condition that captures the right cases (sums of bounded or Gaussian-like increments) without requiring you to know the distribution exactly.

The ML applications map onto this hierarchy precisely:

  • Generalization bounds (formalML, coming soon): Hoeffding gives the $O(\sqrt{\log|\mathcal{H}|/n})$ deviation for finite classes. McDiarmid + Rademacher symmetrization extend this to infinite classes with bounded-differences functionals.
  • SGD convergence (formalML, coming soon): Azuma–Hoeffding applied to the martingale of cumulative gradient errors gives non-asymptotic bounds $O(1/\sqrt{T})$ on the optimization gap, provided gradient noise is sub-Gaussian.
  • Dimensionality reduction (formalML, coming soon): sub-Gaussian concentration of random-projection norms gives Johnson–Lindenstrauss and its descendants (sketching, compressive sensing).
  • Differential privacy (formalML, coming soon): sub-Gaussian tail calibration of the Gaussian mechanism sets the noise scale. Concentrated DP (zCDP) uses the sub-Gaussian MGF directly.
  • Non-asymptotic Monte Carlo (formalML, coming soon): Hoeffding gives explicit sample-size formulas for bounded integrands, complementing the CLT’s asymptotic $\pm 1.96\sigma/\sqrt{n}$ bands.

In every case, the story is the same: the CLT tells you what the distribution of $\bar{X}_n$ looks like; concentration tells you how many samples you need.

12.10 Summary

Four families of exponential-rate bounds. One master technique. One optimal exponent.

Single wide panel: all bounds plotted together on log scale — Markov, Chebyshev, Hoeffding, Bernstein, sub-Gaussian, exact tail — for Bernoulli(0.3), iid, X̄ₙ with n=100, as a function of ε from 0 to 0.5. The headline figure showing the progression from loose to tight.

The master technique is Chernoff’s: apply Markov to $e^{tX}$, optimize over $t$. Every exponential tail bound in this topic — Hoeffding, Bernstein, sub-Gaussian, Azuma, McDiarmid — is a specialization.

The optimal exponent is the Cramér rate function $I(x) = \sup_t (tx - \Lambda(t))$, the Legendre transform of the log-MGF. For Bernoulli it’s the KL divergence; for Normal it’s a parabola; for exponential-family distributions in general, it’s the conjugate of the log-partition function. Cramér’s theorem says no better exponent is possible.

The information hierarchy that organizes the bounds:

| Theorem | Assumption | Bound | Rate | Information used |
| --- | --- | --- | --- | --- |
| Markov (4.11) | $X \geq 0$, finite mean | $\mathbb{E}[X]/a$ | $O(1/\varepsilon)$ | mean |
| Chebyshev (4.12) | finite variance | $\sigma^2/\varepsilon^2$ | $O(1/(n\varepsilon^2))$ | mean + variance |
| Hoeffding (12.3) | bounded $[a_i, b_i]$ | $2e^{-2n^2\varepsilon^2/\sum(b_i-a_i)^2}$ | $O(e^{-cn\varepsilon^2})$ | mean + range |
| Bernstein (12.4) | bounded, variance known | $2e^{-n\varepsilon^2/(2\sigma^2 + 2M\varepsilon/3)}$ | sub-Gauss small $\varepsilon$; sub-exp large | mean + variance + range |
| Sub-Gaussian (12.5) | sub-Gaussian($\sigma$) | $2e^{-n\varepsilon^2/(2\sigma^2)}$ | $O(e^{-cn\varepsilon^2})$ | sub-Gaussian parameter |
| Chernoff (12.1) | finite MGF near $0$ | $\inf_t e^{-ta} M_X(t)$ | $O(e^{-nI(a)})$ | full MGF |
| McDiarmid (12.8) | bounded differences $c_i$ | $2e^{-2t^2/\sum c_i^2}$ | $O(e^{-ct^2})$ | bounded-differences constants |
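
The table’s rows can be evaluated at one concrete setting. A minimal sketch (standard library only; names are ours) for the Bernoulli$(0.3)$ sample mean at $n = 100$, $\varepsilon = 0.1$; the exact entry is the one-sided tail $P(\bar{X}_n \geq p + \varepsilon)$, while the listed bounds are two-sided (except Chernoff/KL):

```python
import math

n, eps, p = 100, 0.1, 0.3
var, span, M = p * (1 - p), 1.0, max(p, 1 - p)
a = p + eps                              # one-sided threshold 0.4

kl = a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))
bounds = {
    "Chebyshev":   var / (n * eps**2),
    "Hoeffding":   2 * math.exp(-2 * n * eps**2 / span**2),
    "Bernstein":   2 * math.exp(-n * eps**2 / (2 * var + 2 * M * eps / 3)),
    "Chernoff/KL": math.exp(-n * kl),
    "exact tail":  sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
                       for k in range(math.ceil(n * a), n + 1)),
}
for name, val in bounds.items():
    print(f"{name:>12}: {val:.4g}")     # 0.21, 0.27, 0.23, 0.10, ~0.024
```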

The applications that made these tools indispensable: PAC bounds for supervised learning, Rademacher complexity for uniform convergence, the Johnson–Lindenstrauss lemma for dimensionality reduction, the Gaussian mechanism for differential privacy, Azuma–Hoeffding for stochastic gradient descent, and non-asymptotic confidence bands for Monte Carlo. Every finite-sample guarantee in modern ML passes through this machinery.

Track 3 is now complete. The arc was: modes of convergence (Topic 9) gave the vocabulary; the law of large numbers (Topic 10) gave the qualitative statement $\bar{X}_n \to \mu$; the central limit theorem (Topic 11) gave the shape of fluctuations at the $1/\sqrt{n}$ scale; and this topic gives the rate — exponential concentration at all scales, with explicit non-asymptotic constants. Where the LLN says “eventually close”, the CLT says “Gaussian-shaped”, and concentration says “with probability at least $1 - 2e^{-cn\varepsilon^2}$, within $\varepsilon$, for every $n$.”

What comes next. Confidence intervals turn the concentration bounds into explicit interval estimates: every $1 - \delta$ confidence interval for a bounded mean is a direct application of Hoeffding. Hypothesis testing inverts this: the rejection region of any test is a tail event, and the power of a test is controlled by concentration under the alternative. The bootstrap and other resampling methods give empirical analogs of the CLT and concentration. Further out, empirical processes and Bayesian computation build on the sub-Gaussian framework to handle infinite function classes and posterior concentration. Across formalML, the tools developed here power every generalization bound, every privacy guarantee, and every non-asymptotic analysis of stochastic optimization.

The reader who has made it this far has the full non-asymptotic toolkit of probability theory: modes of convergence, the LLN, the CLT, and concentration. That is the foundation on which all of statistical inference and statistical learning theory is built.


References

  1. Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.
  2. Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications to Data Science. Cambridge University Press.
  3. Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press.
  4. Dembo, A., & Zeitouni, O. (2010). Large Deviations Techniques and Applications (2nd ed.). Springer.
  5. Durrett, R. (2019). Probability: Theory and Examples (5th ed.). Cambridge University Press.
  6. Billingsley, P. (2012). Probability and Measure (Anniversary ed.). Wiley.
  7. Casella, G., & Berger, R. L. (2002). Statistical Inference (2nd ed.). Cengage.
  8. Wasserman, L. (2004). All of Statistics. Springer.
  9. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.