Large Deviations & Tail Bounds
From Chebyshev to Hoeffding to sub-Gaussian — exponential-rate concentration and the finite-sample tools that power statistical learning theory.
12.1 Why Chebyshev Isn’t Enough
The Law of Large Numbers tells us that $\bar{X}_n \to \mu$, and Chebyshev’s inequality gives us a rate: $P(|\bar{X}_n - \mu| \ge \varepsilon) \le \sigma^2/(n\varepsilon^2)$. That rate is $O(1/n)$ — polynomial, not exponential. For an ML practitioner designing a model that has to perform within $\varepsilon$ of its expected risk with failure probability $\delta$, Chebyshev says you need roughly $n \approx \sigma^2/(\varepsilon^2\delta)$ samples. For a bounded random variable on $[0,1]$ with $\delta = 0.05$, Chebyshev’s worst case is about $5/\varepsilon^2$, while the true answer (we’ll see) is closer to $n \approx \ln(2/\delta)/(2\varepsilon^2) \approx 1.8/\varepsilon^2$. Chebyshev is off by roughly a factor of three — and it gets worse as $\delta$ shrinks, because $1/\delta$ enters polynomially rather than logarithmically.
The gap matters. A factor of three in sample complexity is the difference between running an experiment overnight and running it for a week. A factor of twenty (at a more stringent $\delta$) is the difference between running it at all and giving up. The fundamental tension is this: Chebyshev uses only the variance, and the variance is the coarsest summary of a distribution that has any connection to its tails. The true tail probability depends on the whole shape of the distribution — and for many distributions that shape is much better than what the variance suggests.
Concentration inequalities are the tools that exploit this gap. They trade additional structural assumptions (boundedness, sub-Gaussian behavior, a known variance bound) for exponentially tighter rates. The payoff is not incremental: it transforms the rate from $\sigma^2/(n\varepsilon^2)$ to $e^{-2n\varepsilon^2}$, which at $n = 100$, $\varepsilon = 0.1$ is $e^{-2} \approx 0.14$ versus Chebyshev’s $0.25$ (for Bernoulli($1/2$)). At $n = 1{,}000$, it’s $e^{-20} \approx 2\times 10^{-9}$ versus $0.025$. That’s the difference between “maybe safe” and “safe to a billion trials.”
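To see the gap numerically, here is a minimal sketch in plain Python (the values of $n$, $\varepsilon = 0.1$, and the Bernoulli($1/2$) variance $1/4$ are the illustrative choices from the paragraph above, not anything canonical) that evaluates the two tail bounds side by side.

```python
import math

def chebyshev_tail(n, eps, var=0.25):
    # Chebyshev for a sample mean: P(|mean - mu| >= eps) <= var / (n * eps^2).
    return min(1.0, var / (n * eps ** 2))

def hoeffding_tail(n, eps):
    # One-sided Hoeffding for [0, 1]-bounded variables: P(mean - mu >= eps) <= exp(-2 n eps^2).
    return min(1.0, math.exp(-2 * n * eps ** 2))

for n in (100, 1_000, 10_000):
    print(f"n = {n:6d}   Chebyshev {chebyshev_tail(n, 0.1):.2e}   Hoeffding {hoeffding_tail(n, 0.1):.2e}")
```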
The structural thesis of this topic. Where the CLT (Topic 11) describes the shape of the distribution of $\bar{X}_n$ at the $1/\sqrt{n}$ scale — it’s Gaussian — concentration inequalities describe the rate at which the tails decay, at all scales. These are complementary: the CLT tells you “for large $n$, the histogram of replications of $\bar{X}_n$ looks roughly Gaussian with spread $\sigma/\sqrt{n}$”; concentration tells you “the probability of a deviation of size $\varepsilon$ is at most $e^{-2n\varepsilon^2}$ (for variables bounded in $[0,1]$), and this is non-asymptotic — it holds for every $n$, not just in the limit.” Together, they give a complete picture of how sample means behave.
The rest of this topic develops four families of exponential-rate bounds, in increasing order of specificity: Chernoff (generic, MGF-based), Hoeffding (bounded), Bernstein (bounded + known variance), and sub-Gaussian (modern parametrization). Each uses more information about the distribution than the last, and each gives a correspondingly tighter bound. Then we step up to the Cramér rate function — the optimal exponent, captured as a Legendre transform — and McDiarmid’s inequality for functions of independent variables. The topic closes with PAC learning, Johnson–Lindenstrauss, and the other ML applications that made these tools indispensable.
12.2 The Chernoff Bounding Technique
The engine behind every exponential tail bound is a single idea: Markov’s inequality applied to the exponential of the random variable. We take a positive function of $X$ whose expectation we can control — namely $e^{\lambda X}$ — and let Markov do the rest. Optimizing over the free parameter $\lambda > 0$ gives the tightest bound in the family.
Theorem 12.1 (Chernoff bound). Let $X$ be a random variable whose moment-generating function $M_X(\lambda) = \mathbb{E}[e^{\lambda X}]$ is finite in a neighborhood of $0$. Then for any $t$:

$$P(X \ge t) \;\le\; \inf_{\lambda > 0}\, e^{-\lambda t}\, M_X(\lambda).$$

For sums $S_n = X_1 + \dots + X_n$ of independent random variables with iid summands, this becomes

$$P(S_n \ge t) \;\le\; \inf_{\lambda > 0}\, e^{-\lambda t}\, \bigl[M_{X_1}(\lambda)\bigr]^n.$$
Proof.
The core step uses Markov’s inequality applied to the non-negative random variable $e^{\lambda X}$. For any $\lambda > 0$, the event $\{X \ge t\}$ is identical to the event $\{e^{\lambda X} \ge e^{\lambda t}\}$ because the exponential is strictly increasing. So

$$P(X \ge t) = P\bigl(e^{\lambda X} \ge e^{\lambda t}\bigr).$$

Markov’s inequality applied to $e^{\lambda X}$ with threshold $e^{\lambda t}$ gives

$$P\bigl(e^{\lambda X} \ge e^{\lambda t}\bigr) \;\le\; \frac{\mathbb{E}\bigl[e^{\lambda X}\bigr]}{e^{\lambda t}} = e^{-\lambda t}\, M_X(\lambda).$$

This holds for every $\lambda > 0$, so we may take the infimum over $\lambda$:

$$P(X \ge t) \;\le\; \inf_{\lambda > 0}\, e^{-\lambda t}\, M_X(\lambda).$$

For the iid sum: by independence, the MGF factorizes, $M_{S_n}(\lambda) = \prod_{i=1}^n M_{X_i}(\lambda) = \bigl[M_{X_1}(\lambda)\bigr]^n$. Apply the single-variable result to $S_n$ with threshold $t$ and factor the MGF.
The Chernoff bound is not one bound — it is a family of bounds parametrized by $\lambda$. The art is choosing $\lambda$ to make the bound as tight as possible, which is a convex optimization problem: the log of the objective, $-\lambda t + \log M_X(\lambda)$, is convex in $\lambda$ (the cumulant-generating function $\psi(\lambda) = \log M_X(\lambda)$ is convex, as follows from Hölder’s inequality applied to the MGF). The minimum is interior whenever the first-order condition $\psi'(\lambda^*) = t$ has a solution in the MGF’s domain. The optimizer $\lambda^*$ is called the tilting parameter: it defines a new probability measure under which $X$ has mean $t$ instead of $\mu$, making the event $\{X \ge t\}$ typical rather than rare.
Let $X_1, \dots, X_n$ be iid Bernoulli($p$) and $S_n = \sum_{i=1}^n X_i$. The MGF of a single Bernoulli is $M(\lambda) = 1 - p + p e^{\lambda}$, so

$$P(S_n \ge na) \;\le\; \inf_{\lambda > 0}\, e^{-\lambda n a}\, \bigl(1 - p + p e^{\lambda}\bigr)^n,$$

where we’ve written the threshold as $t = na$ for $a > p$. The first-order condition solves explicitly:

$$e^{\lambda^*} = \frac{a(1-p)}{p(1-a)}.$$

Substituting back, the Chernoff bound for the upper tail of a sum of iid Bernoullis is

$$P(S_n \ge na) \;\le\; \exp\bigl(-n\,\mathrm{KL}(a \,\|\, p)\bigr), \qquad \mathrm{KL}(a \,\|\, p) = a\log\frac{a}{p} + (1-a)\log\frac{1-a}{1-p}.$$

For $p = 1/2$, $n = 100$, and $a = 0.75$ (so threshold $S_n \ge 75$), this gives a bound of about $2\times 10^{-6}$. The exact probability from the Binomial CDF is about $3\times 10^{-7}$ — the Chernoff bound is loose only by a modest sub-exponential prefactor, while Chebyshev for the same event gives a bound of $0.04$, off by more than five orders of magnitude.
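A short numerical sketch (Python with numpy/scipy; the values $n = 100$, $p = 0.5$, $a = 0.75$ are just the worked numbers above) that minimizes the Chernoff bound over $\lambda$ and compares it with the exact Binomial tail and with Chebyshev:

```python
import numpy as np
from scipy.stats import binom
from scipy.optimize import minimize_scalar

n, p, a = 100, 0.5, 0.75                     # threshold t = n*a = 75

def log_chernoff(lam):
    # log of exp(-lam*n*a) * (1 - p + p*exp(lam))^n
    return -lam * n * a + n * np.log(1 - p + p * np.exp(lam))

res = minimize_scalar(log_chernoff, bounds=(1e-6, 10), method="bounded")
kl = a * np.log(a / p) + (1 - a) * np.log((1 - a) / (1 - p))

print("optimal lambda :", res.x, " closed form:", np.log(a * (1 - p) / (p * (1 - a))))
print("Chernoff bound :", np.exp(res.fun), " = exp(-n*KL):", np.exp(-n * kl))
print("exact tail     :", binom.sf(n * a - 1, n, p))        # P(S_n >= 75)
print("Chebyshev      :", n * p * (1 - p) / (n * a - n * p) ** 2)
```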
The Chernoff bound is tight in a precise asymptotic sense. Substituting the optimal $\lambda^*$ gives the exponent

$$\sup_{\lambda > 0}\bigl(\lambda a - \log M_{X_1}(\lambda)\bigr) \;=\; I(a),$$

where $I$ is the Cramér rate function. So the Chernoff bound for the iid sum is

$$P(\bar{X}_n \ge a) \;\le\; e^{-n I(a)},$$

and Cramér’s theorem (§12.6) says the bound is actually achieved at the exponential rate: $\tfrac{1}{n}\log P(\bar{X}_n \ge a) \to -I(a)$. The Chernoff technique is the optimal exponential bound — nothing better is possible using the MGF alone.
12.3 Hoeffding’s Inequality
For bounded random variables we can control the MGF without computing it exactly. This is the content of Hoeffding’s lemma — a statement about the MGF of a bounded, centered random variable — and it’s the key step that makes the Chernoff technique yield a clean, distribution-free bound. Combining Hoeffding’s lemma with the Chernoff technique and independence gives Hoeffding’s inequality, which is the workhorse of statistical learning theory.
Theorem 12.2 (Hoeffding’s lemma). Let $X$ be a random variable with $X \in [a, b]$ almost surely and $\mathbb{E}[X] = 0$. Then for every $\lambda \in \mathbb{R}$:

$$\mathbb{E}\bigl[e^{\lambda X}\bigr] \;\le\; \exp\!\left(\frac{\lambda^2 (b - a)^2}{8}\right).$$
Proof.
The key observation is that convexity of $x \mapsto e^{\lambda x}$ lets us bound $e^{\lambda X}$ above by its chord on $[a, b]$. For any $x \in [a, b]$, write

$$x = \frac{b - x}{b - a}\, a + \frac{x - a}{b - a}\, b,$$

a convex combination of the endpoints with weights summing to $1$. By convexity of the exponential,

$$e^{\lambda x} \;\le\; \frac{b - x}{b - a}\, e^{\lambda a} + \frac{x - a}{b - a}\, e^{\lambda b}.$$

Taking expectations, using linearity and $\mathbb{E}[X] = 0$:

$$\mathbb{E}\bigl[e^{\lambda X}\bigr] \;\le\; \frac{b}{b - a}\, e^{\lambda a} - \frac{a}{b - a}\, e^{\lambda b}.$$

It remains to bound this right-hand side by a Gaussian-style quantity. Define

$$L(h) \;=\; -qh + \log\bigl(1 - q + q e^{h}\bigr), \qquad q = \frac{-a}{b - a},$$

where we’ve substituted $h = \lambda(b - a)$ for bookkeeping, so that the right-hand side above equals $e^{L(h)}$. We will show $L(h) \le h^2/8$ for all $h$. Direct computation gives $L(0) = 0$, $L'(h) = -q + \dfrac{q e^{h}}{1 - q + q e^{h}}$ (check: $L'(0) = -q + q$, which vanishes at $0$), and — after a short algebraic manipulation letting $u = q e^{h}$ and $v = 1 - q$:

$$L''(h) \;=\; \frac{q e^{h}(1 - q)}{\bigl(1 - q + q e^{h}\bigr)^2} \;=\; \frac{uv}{(u + v)^2}.$$

By the arithmetic–geometric mean inequality, $(u + v)^2 \ge 4uv$. Therefore

$$L''(h) \;\le\; \frac{1}{4} \quad \text{for all } h.$$

Now use Taylor’s theorem with the Lagrange remainder at $0$: for some $\xi$ between $0$ and $h$,

$$L(h) = L(0) + L'(0)\,h + \tfrac{1}{2}L''(\xi)\,h^2 \;\le\; \frac{h^2}{8}.$$

Exponentiating and reading off $h = \lambda(b - a)$ gives

$$\mathbb{E}\bigl[e^{\lambda X}\bigr] \;\le\; e^{L(h)} \;\le\; \exp\!\left(\frac{\lambda^2 (b - a)^2}{8}\right).$$
Hoeffding’s lemma says a centered bounded random variable has a sub-Gaussian MGF — its MGF is at most the MGF of a Gaussian with variance proxy $(b - a)^2/4$. This is the key fact. Every subsequent proof in this section just plugs this MGF bound into the Chernoff technique.
Theorem 12.3 (Hoeffding’s inequality). Let $X_1, \dots, X_n$ be independent with $X_i \in [a_i, b_i]$ almost surely. Write $S_n = \sum_{i=1}^n X_i$ and $\mu = \mathbb{E}[S_n]$. Then for every $t > 0$:

$$P(S_n - \mu \ge t) \;\le\; \exp\!\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right),$$

and by symmetry

$$P\bigl(|S_n - \mu| \ge t\bigr) \;\le\; 2\exp\!\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$

In the iid case with $X_i \in [a, b]$ for all $i$, this simplifies to the form previewed in Theorem 10.7:

$$P\bigl(|\bar{X}_n - \mathbb{E}[X_1]| \ge \varepsilon\bigr) \;\le\; 2\exp\!\left(-\frac{2n\varepsilon^2}{(b - a)^2}\right).$$
Proof.
We bound the upper tail $P(S_n - \mu \ge t)$; the lower tail follows by symmetry applied to $-X_i$, and the two-sided bound by the union bound.

Let $Y_i = X_i - \mathbb{E}[X_i]$, so $\mathbb{E}[Y_i] = 0$ and $Y_i \in [a_i - \mathbb{E}[X_i],\, b_i - \mathbb{E}[X_i]]$. The centered range is the same: $b_i - a_i$. The sum $\sum_i Y_i = S_n - \mu$, so

$$P(S_n - \mu \ge t) = P\Bigl(\sum_i Y_i \ge t\Bigr).$$

Apply the Chernoff technique (Theorem 12.1) to $\sum_i Y_i$. For any $\lambda > 0$:

$$P\Bigl(\sum_i Y_i \ge t\Bigr) \;\le\; e^{-\lambda t}\, \mathbb{E}\Bigl[e^{\lambda \sum_i Y_i}\Bigr].$$

By independence, the joint MGF factorizes:

$$\mathbb{E}\Bigl[e^{\lambda \sum_i Y_i}\Bigr] = \prod_{i=1}^n \mathbb{E}\bigl[e^{\lambda Y_i}\bigr].$$

Applying Hoeffding’s lemma (Theorem 12.2) to each $Y_i$:

$$\mathbb{E}\bigl[e^{\lambda Y_i}\bigr] \;\le\; \exp\!\left(\frac{\lambda^2 (b_i - a_i)^2}{8}\right),$$

so the product is

$$\prod_{i=1}^n \mathbb{E}\bigl[e^{\lambda Y_i}\bigr] \;\le\; \exp\!\left(\frac{\lambda^2}{8}\sum_{i=1}^n (b_i - a_i)^2\right).$$

Combining:

$$P(S_n - \mu \ge t) \;\le\; \exp\!\left(-\lambda t + \frac{\lambda^2}{8}\sum_{i=1}^n (b_i - a_i)^2\right).$$

The exponent is a quadratic in $\lambda$ that opens upward. Minimizing over $\lambda$ by setting the derivative to zero:

$$\lambda^* = \frac{4t}{\sum_{i=1}^n (b_i - a_i)^2}.$$

Substituting back into the exponent:

$$-\lambda^* t + \frac{(\lambda^*)^2}{8}\sum_{i=1}^n (b_i - a_i)^2 = -\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}.$$

Therefore

$$P(S_n - \mu \ge t) \;\le\; \exp\!\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$

Applying the same argument to $-Y_i$ (which has the same centered range) yields the lower tail, and adding the two bounds gives the two-sided statement. In the iid case, $\sum_{i=1}^n (b_i - a_i)^2 = n(b-a)^2$ and $t = n\varepsilon$, so the exponent becomes $-2n\varepsilon^2/(b-a)^2$ as claimed.
This completes the proof deferred in Theorem 10.7. The bound is remarkably clean: no variance appears, only the range. That’s both Hoeffding’s strength (you don’t need to know the variance) and its weakness (you can’t benefit from knowing the variance is small).
You want to estimate the fraction $p$ of voters supporting candidate A to within $\varepsilon = 0.03$ with confidence $1 - \delta = 0.95$. Each response is a Bernoulli($p$) with unknown $p$. Hoeffding says:

$$P\bigl(|\hat{p}_n - p| \ge \varepsilon\bigr) \;\le\; 2e^{-2n\varepsilon^2} \;\le\; \delta \quad\Longleftrightarrow\quad n \;\ge\; \frac{\ln(2/\delta)}{2\varepsilon^2} \;\approx\; 2{,}050.$$

That is the conservative cousin of the “margin of error $\pm 3\%$ with 95% confidence, $n \approx 1{,}000$” rule of thumb reported by polling organizations. The beauty of the Hoeffding derivation is that it’s distribution-free: it works for any $p$, doesn’t require estimating variance, and doesn’t rely on a normal approximation. Compare with Chebyshev, which would give $n \ge 1/(4\varepsilon^2\delta) \approx 5{,}600$ — the roughly factor-of-three difference we advertised in §12.1.
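A minimal calculator sketch in plain Python (the polling values $\varepsilon = 0.03$, $\delta = 0.05$ are the ones used above):

```python
import math

def hoeffding_n(eps, delta):
    # Smallest n with 2*exp(-2*n*eps^2) <= delta  (two-sided Hoeffding, [0, 1]-bounded).
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def chebyshev_n(eps, delta, var=0.25):
    # Smallest n with var/(n*eps^2) <= delta  (worst-case Bernoulli variance 1/4).
    return math.ceil(var / (delta * eps ** 2))

eps, delta = 0.03, 0.05
print("Hoeffding n :", hoeffding_n(eps, delta))    # ~2,050
print("Chebyshev n :", chebyshev_n(eps, delta))    # ~5,600
```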
Hoeffding’s inequality uses only the range $(b - a)$, not the variance. For a Bernoulli($p$) with $p$ small, this is wasteful: the variance is $p(1-p)$, which can be much smaller than the worst-case $1/4$. Hoeffding doesn’t care — it treats the worst-case range as though it always applied. This is sometimes exactly what you want (in PAC-learning scenarios where the distribution is unknown) and sometimes a real loss. Bernstein’s inequality (§12.4) fixes this by letting the bound shrink when the variance shrinks.
12.4 Bernstein’s Inequality
Hoeffding wastes information when the variance is small. For a rare event like Bernoulli($0.01$), the variance is about $0.01$ while the worst-case range-squared is $1$ — two orders of magnitude. Bernstein’s inequality recovers this loss by using both the variance and the range, and it produces a two-regime bound: sub-Gaussian behavior for small deviations (dominated by variance) and sub-exponential behavior for large deviations (dominated by the range).

Theorem 12.4 (Bernstein’s inequality). Let $X_1, \dots, X_n$ be independent with $\mathbb{E}[X_i] = \mu_i$, $\operatorname{Var}(X_i) = \sigma_i^2$, and $|X_i - \mu_i| \le M$ almost surely. Then for every $t > 0$:

$$P\Bigl(\sum_{i=1}^n (X_i - \mu_i) \ge t\Bigr) \;\le\; \exp\!\left(-\frac{t^2/2}{\sum_{i=1}^n \sigma_i^2 + Mt/3}\right).$$
Proof.
As in Hoeffding, center the variables: let $Y_i = X_i - \mu_i$, so $\mathbb{E}[Y_i] = 0$, $\operatorname{Var}(Y_i) = \sigma_i^2$, and $|Y_i| \le M$. We bound the upper tail via the Chernoff technique and treat the lower tail symmetrically.

The key step is a sharper MGF bound than Hoeffding’s lemma, exploiting the variance. Expand $e^{\lambda Y_i}$ as a power series:

$$e^{\lambda Y_i} = 1 + \lambda Y_i + \sum_{k \ge 2} \frac{\lambda^k Y_i^k}{k!}.$$

Taking expectations (since $\mathbb{E}[Y_i] = 0$):

$$\mathbb{E}\bigl[e^{\lambda Y_i}\bigr] = 1 + \sum_{k \ge 2} \frac{\lambda^k\, \mathbb{E}[Y_i^k]}{k!}.$$

For $k \ge 2$, the boundedness gives $|Y_i|^k \le Y_i^2 M^{k-2}$, so $\mathbb{E}\bigl[|Y_i|^k\bigr] \le \sigma_i^2 M^{k-2}$. Therefore, for $0 < \lambda < 3/M$,

$$\mathbb{E}\bigl[e^{\lambda Y_i}\bigr] \;\le\; 1 + \sigma_i^2 \sum_{k \ge 2} \frac{\lambda^k M^{k-2}}{k!} \;\le\; 1 + \frac{\sigma_i^2 \lambda^2}{2} \sum_{j \ge 0} \Bigl(\frac{\lambda M}{3}\Bigr)^j \;=\; 1 + \frac{\sigma_i^2 \lambda^2 / 2}{1 - \lambda M/3},$$

where the middle step uses $k! \ge 2 \cdot 3^{k-2}$. Using $1 + x \le e^{x}$:

$$\mathbb{E}\bigl[e^{\lambda Y_i}\bigr] \;\le\; \exp\!\left(\frac{\sigma_i^2 \lambda^2/2}{1 - \lambda M/3}\right).$$

This is the sub-exponential MGF bound: it grows like a sub-Gaussian MGF for small $\lambda$ but blows up as $\lambda \uparrow 3/M$. Summing over independent variables (write $v = \sum_{i=1}^n \sigma_i^2$):

$$\mathbb{E}\Bigl[e^{\lambda \sum_i Y_i}\Bigr] \;\le\; \exp\!\left(\frac{v \lambda^2/2}{1 - \lambda M/3}\right).$$

Chernoff: for $0 < \lambda < 3/M$,

$$P\Bigl(\sum_i Y_i \ge t\Bigr) \;\le\; \exp\!\left(-\lambda t + \frac{v \lambda^2/2}{1 - \lambda M/3}\right).$$

Optimizing over $\lambda$: the choice $\lambda = \dfrac{t}{v + Mt/3}$ lies in $(0, 3/M)$, and substituting back (a calculation the reader is welcome to verify term-by-term):

$$-\lambda t + \frac{v \lambda^2/2}{1 - \lambda M/3} \;=\; -\frac{t^2/2}{v + Mt/3}.$$

Therefore

$$P\Bigl(\sum_i Y_i \ge t\Bigr) \;\le\; \exp\!\left(-\frac{t^2/2}{v + Mt/3}\right).$$

The two-sided version follows by applying the same argument to $-Y_i$ and union-bounding.
A numerical comparison between Hoeffding and Bernstein on the same event makes the full progression concrete.

Take $X_i \sim$ Bernoulli($p$) with $p = 0.01$, iid, $n = 1{,}000$. Then $\mu = 0.01$, $\sigma^2 = p(1-p) \approx 0.0099$, $M = 1$ (the deviation $|X_i - p|$ is at most $1$), and $v = n\sigma^2 \approx 9.9$. For $\varepsilon = 0.01$, i.e. $t = n\varepsilon = 10$:

- Hoeffding: $2\exp(-2n\varepsilon^2) = 2e^{-0.2} \approx 1.6$ — clamped to $1$; useless.
- Bernstein: $\exp\!\bigl(-\tfrac{t^2/2}{v + Mt/3}\bigr) = \exp(-50/13.2) \approx 0.02$.

Bernstein gives a useful bound while Hoeffding gives nothing. The reason is the variance: when $p$ is small, $p(1-p)$ is much smaller than the worst-case range-squared of $1$. Bernstein exploits this; Hoeffding cannot.
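The same comparison as a runnable sketch (plain Python; $p = 0.01$, $n = 1{,}000$, $\varepsilon = 0.01$ are the values from the example above):

```python
import math

def hoeffding_bound(n, eps):
    # Two-sided Hoeffding for [0, 1]-bounded variables.
    return min(1.0, 2 * math.exp(-2 * n * eps ** 2))

def bernstein_bound(n, eps, var, M=1.0):
    # One-sided Bernstein: P(sum(X_i - mu_i) >= t) <= exp(-(t^2/2)/(n*var + M*t/3)).
    t = n * eps
    return min(1.0, math.exp(-(t ** 2 / 2) / (n * var + M * t / 3)))

p, n, eps = 0.01, 1_000, 0.01
print("Hoeffding :", hoeffding_bound(n, eps))               # ~1.0 (vacuous)
print("Bernstein :", bernstein_bound(n, eps, p * (1 - p)))  # ~0.02
```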
Bernstein’s denominator interpolates two behaviors. For small $t$ (the sub-Gaussian regime), the $\sum_i \sigma_i^2$ term dominates and the exponent is $\approx -t^2/(2\sum_i \sigma_i^2)$ — Gaussian-like tail decay with variance proxy $\sum_i \sigma_i^2$. For large $t$ (the sub-exponential regime), the $Mt/3$ term dominates and the exponent is $\approx -3t/(2M)$ — exponential (not Gaussian) tail decay. The crossover happens around $t \approx 3\sum_i \sigma_i^2 / M$. This two-regime behavior is exactly what we mean by a sub-exponential random variable, which we define next.
12.5 Sub-Gaussian and Sub-Exponential Random Variables
The modern approach to concentration is not to prove each bound separately but to classify random variables by how their MGF behaves near $0$. Sub-Gaussian variables have MGFs bounded by a Gaussian MGF for all $\lambda$; sub-exponential variables have MGFs bounded by a Gaussian MGF only in a neighborhood of $0$. Each class has a clean tail bound and closure properties (sums, products, powers) that let you combine simple building blocks into sophisticated bounds.
Definition 12.1 (sub-Gaussian). A random variable $X$ with $\mathbb{E}[X] = \mu$ is sub-Gaussian with parameter $\sigma^2$ if for every $\lambda \in \mathbb{R}$:

$$\mathbb{E}\bigl[e^{\lambda(X - \mu)}\bigr] \;\le\; e^{\lambda^2 \sigma^2 / 2}.$$

The parameter $\sigma$ is sometimes called the variance proxy. It need not equal the actual standard deviation of $X$.

Definition (sub-exponential). A random variable $X$ with $\mathbb{E}[X] = \mu$ is sub-exponential with parameters $(\nu, b)$ (both positive) if for every $|\lambda| < 1/b$:

$$\mathbb{E}\bigl[e^{\lambda(X - \mu)}\bigr] \;\le\; e^{\lambda^2 \nu^2 / 2}.$$

Compared to the sub-Gaussian definition, the MGF bound only holds in a neighborhood of $0$, not for all $\lambda$. This allows heavier tails, but the tail probability then decays only exponentially (not Gaussian-fast) past a certain scale.
Theorem 12.5 (equivalent characterizations). Let $X$ be a random variable with $\mathbb{E}[X] = 0$. The following are equivalent (up to universal constants in the parameter):

- MGF condition: $X$ is sub-Gaussian with parameter $\sigma^2$: $\mathbb{E}[e^{\lambda X}] \le e^{\lambda^2 \sigma^2/2}$ for all $\lambda \in \mathbb{R}$.
- Tail condition: $P(|X| \ge t) \le 2e^{-t^2/(2\sigma^2)}$ for all $t > 0$.
- Moment condition: $\bigl(\mathbb{E}|X|^k\bigr)^{1/k} \le C\sigma\sqrt{k}$ for all $k \ge 1$, with $C$ a universal constant.
- Orlicz norm condition: $\|X\|_{\psi_2} \le C\sigma$, where $\|X\|_{\psi_2} = \inf\bigl\{c > 0 : \mathbb{E}\,e^{X^2/c^2} \le 2\bigr\}$.
The proof of Theorem 12.5 is an exercise in manipulating the MGF, the tail, and the moments against each other — the key tools are Markov’s inequality (MGF ⟹ tail, via the Chernoff argument) and integration-by-parts (tail ⟹ moments) — and the universal-constant factors relating the four parameters are small and can be read off from Vershynin (2018, Chapter 2). We will not reproduce the proof here; the practically important thing is the equivalence itself, which means that “sub-Gaussian” is a single property that can be checked in whichever formulation is most convenient.
- Normal $N(0, \sigma^2)$: The MGF is exactly $e^{\lambda^2\sigma^2/2}$, so it is sub-Gaussian with parameter $\sigma^2$ — the variance proxy equals the actual standard deviation.
- Bounded $X \in [a, b]$: By Hoeffding’s lemma, centered $X$ is sub-Gaussian with parameter $(b-a)^2/4$. So every bounded random variable is sub-Gaussian, though the variance proxy may be loose.
- Rademacher ($\pm 1$ with probability $1/2$ each): Sub-Gaussian with $\sigma = 1$. The key building block of Rademacher complexity.
- Exponential($1$) centered at its mean: Sub-exponential with constant-order parameters, but not sub-Gaussian — the right tail decays like $e^{-t}$ rather than $e^{-t^2}$.
- $\chi^2_k$: Sub-exponential but not sub-Gaussian. This matters: $\chi^2$ variables arise as squared Gaussians, and the square of a sub-Gaussian is always sub-exponential.
- Poisson centered at its mean: Sub-exponential, not sub-Gaussian.
- Cauchy: Has no finite mean, let alone a sub-Gaussian or sub-exponential MGF. Heavy-tailed. Concentration inequalities do not apply; even the LLN fails for Cauchy.
- Student-$t$ for $\nu > 2$: Has finite variance but only polynomial tail decay. Not sub-exponential.
If $X_1, \dots, X_n$ are independent and sub-Gaussian with parameters $\sigma_1^2, \dots, \sigma_n^2$, then $\sum_i X_i$ is sub-Gaussian with parameter $\sum_i \sigma_i^2$, and $\bar{X}_n$ is sub-Gaussian with parameter $\sigma^2/n$ (in the iid case with common $\sigma^2$). The proof is a one-line MGF computation:

$$\mathbb{E}\Bigl[e^{\lambda \sum_i (X_i - \mu_i)}\Bigr] = \prod_{i=1}^n \mathbb{E}\bigl[e^{\lambda (X_i - \mu_i)}\bigr] \;\le\; \prod_{i=1}^n e^{\lambda^2 \sigma_i^2/2} \;=\; e^{\lambda^2 \sum_i \sigma_i^2 / 2}.$$

This is why sub-Gaussian is the right abstraction for statistical learning: the class is closed under finite sums, and the sample mean’s variance proxy decays at the correct $1/n$ scale. Apply the tail condition to $\bar{X}_n$ with variance proxy $\sigma^2/n$:

$$P\bigl(|\bar{X}_n - \mu| \ge \varepsilon\bigr) \;\le\; 2\exp\!\left(-\frac{n\varepsilon^2}{2\sigma^2}\right).$$

This is the sub-Gaussian form of the bound — cleaner than Hoeffding because it parametrizes by $\sigma^2$ rather than by the range $(b - a)^2$.
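A quick simulation sketch (numpy; the choices $n = 200$, $\varepsilon = 0.15$, and Rademacher inputs are arbitrary illustrations) checking the sub-Gaussian tail bound for a sample mean against Monte Carlo frequency:

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, reps = 200, 0.15, 20_000

# Rademacher variables are sub-Gaussian with variance proxy sigma^2 = 1, so the
# sample mean satisfies P(|mean| >= eps) <= 2*exp(-n*eps^2/2).
signs = rng.choice([-1.0, 1.0], size=(reps, n))
means = signs.mean(axis=1)

print("empirical tail frequency:", np.mean(np.abs(means) >= eps))
print("sub-Gaussian bound      :", 2 * np.exp(-n * eps ** 2 / 2))
```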
There is a strict hierarchy of tail classes:

$$\text{bounded} \;\subset\; \text{sub-Gaussian} \;\subset\; \text{sub-exponential} \;\subset\; \text{finite variance}.$$

Each inclusion is strict: a Normal is sub-Gaussian but not bounded; an Exponential is sub-exponential but not sub-Gaussian; a Student-$t$ (with $\nu > 2$) has finite variance but is not sub-exponential; a Cauchy belongs to none of these classes. Closure properties sit nicely on top: products of sub-Gaussians are sub-exponential (as in $\chi^2$ = squared Gaussian), sums of sub-Gaussians are sub-Gaussian, and maxima of finitely many ($m$) sub-Gaussians are well controlled, with typical size growing only like $\sigma\sqrt{2\log m}$. The Orlicz norm $\|\cdot\|_{\psi_2}$ is a single number that controls the full tail behavior — Vershynin’s Chapter 2 develops this formalism in detail, and it’s the vocabulary of choice in high-dimensional statistics.
12.6 The Cramér Rate Function and Large Deviations
We’ve seen that the Chernoff bound gives an exponential rate of decay for tail probabilities, and that the optimal exponent is $I(t) = \sup_{\lambda}\bigl(\lambda t - \log M_X(\lambda)\bigr)$. This exponent deserves a name and a theory of its own. The study of exponential rates — statements of the form $P(\bar{X}_n \in A) \approx e^{-n \inf_{t \in A} I(t)}$ for sets $A$ far from the mean — is the subject of large deviations theory. Its central result for iid sample means is Cramér’s theorem: the exponent is exactly the rate function, not just an upper bound.
For a random variable $X$ with cumulant-generating function $\psi(\lambda) = \log \mathbb{E}\bigl[e^{\lambda X}\bigr]$ (finite in a neighborhood of $0$), the Cramér rate function is the Legendre transform of $\psi$:

$$I(t) \;=\; \sup_{\lambda \in \mathbb{R}}\bigl(\lambda t - \psi(\lambda)\bigr).$$

The rate function is non-negative, convex, and vanishes at $t = \mu = \mathbb{E}[X]$.

The non-negativity and the zero at the mean are both consequences of the definition: choosing $\lambda = 0$ gives $I(t) \ge 0$, and at $t = \mu$ the function $\lambda \mapsto \lambda\mu - \psi(\lambda)$ is maximized at $\lambda = 0$ (since $\psi(\lambda) \ge \lambda\mu$ by Jensen, with equality at $\lambda = 0$), giving $I(\mu) = 0$. Convexity follows because $I$ is a supremum of affine functions of $t$.
Let $X_1, X_2, \dots$ be iid with cumulant-generating function finite in a neighborhood of $0$, and let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. Then for every $t > \mu$:

$$P(\bar{X}_n \ge t) \;\le\; e^{-n I(t)},$$

and for every $t > \mu$:

$$\lim_{n \to \infty} \frac{1}{n} \log P(\bar{X}_n \ge t) \;=\; -I(t).$$

Equivalently, $P(\bar{X}_n \ge t)$ decays exponentially at the rate $e^{-nI(t) + o(n)}$.

Cramér’s theorem says the Chernoff upper bound is exponentially tight: there is no better exponent than $I(t)$. The proof of the upper bound is exactly the Chernoff argument (Theorem 12.1). The proof of the matching lower bound — that $\liminf_n \frac{1}{n}\log P(\bar{X}_n \ge t) \ge -I(t)$ — uses a change of measure: tilt the distribution so that the mean becomes $t$, the event becomes typical under the tilted measure, and then bound the likelihood ratio. We refer the reader to Dembo & Zeitouni (2010, Theorem 2.2.3) for the full argument.
Example 12.6 (Bernoulli rate function). For $X \sim$ Bernoulli($p$), $\psi(\lambda) = \log\bigl(1 - p + p e^{\lambda}\bigr)$. The first-order condition becomes $\dfrac{p e^{\lambda}}{1 - p + p e^{\lambda}} = t$, which solves to $e^{\lambda^*} = \dfrac{t(1-p)}{p(1-t)}$. Substituting:

$$I(t) \;=\; t \log\frac{t}{p} + (1 - t)\log\frac{1 - t}{1 - p} \;=\; \mathrm{KL}\bigl(\mathrm{Ber}(t)\,\|\,\mathrm{Ber}(p)\bigr).$$

So the large-deviation rate for a Bernoulli average deviating to $t$ is the Kullback–Leibler divergence between the empirical distribution Ber($t$) and the true distribution Ber($p$). This is not a coincidence — it is the Bernoulli case of Sanov’s theorem (next remark), which says the rate for the empirical measure is always the KL divergence.

For example, at $p = 0.5$ and $t = 0.75$: $I(0.75) = \mathrm{KL}(0.75 \,\|\, 0.5) \approx 0.131$, so the rate is $e^{-0.131\, n}$. For $n = 100$ this is about $2\times 10^{-6}$; the true binomial probability of $\bar{X}_{100} \ge 0.75$ is about $3\times 10^{-7}$. The Cramér rate captures the exponential order correctly; the sub-exponential prefactor (which Cramér does not capture) accounts for the remaining factor of roughly $7$.
Sanov’s theorem generalizes Cramér from the scalar sample mean to the full empirical distribution. If $X_1, \dots, X_n$ are iid from a distribution $P$ and $\hat{P}_n$ is the empirical measure (point masses at the observations), then for a “nice” set $\mathcal{A}$ of distributions far from $P$:

$$\frac{1}{n}\log P\bigl(\hat{P}_n \in \mathcal{A}\bigr) \;\to\; -\inf_{Q \in \mathcal{A}} \mathrm{KL}(Q \,\|\, P).$$

In words: the rate at which the empirical distribution deviates to any fixed target $Q$ is the KL divergence between $Q$ and $P$. This is the information-theoretic interpretation of large deviations — rare events are improbable in proportion to how much information they would contain about the wrong distribution. The proof uses the method of types or topological machinery that we do not develop here; Dembo & Zeitouni (Chapter 6) is the standard reference. For our purposes, the Bernoulli case (Example 12.6) is the concrete instance that matters.
The rate function $I$ is the Legendre transform of the cumulant-generating function $\psi$. For exponential families, $\psi$ is the log-partition function (up to a shift in parameterization). So the rate function of an exponential-family sample mean is the conjugate of the log-partition function — a direct mathematical connection between large deviations and the exponential-family machinery of Topic 7. In Bayesian inference, this same conjugacy structure appears: the posterior concentrates around the MLE at the rate given by the KL divergence, and the quadratic approximation to that KL divergence at the MLE is governed by the Fisher information — the curvature of the log-likelihood. All of these are manifestations of the same Legendre duality.
12.7 Concentration for Functions of Independent Variables
So far we’ve concentrated sample means — linear functions of the data. But many quantities in statistics and ML are non-linear functions of the data: the empirical CDF $\hat{F}_n(x)$, the sample median, the kernel density estimator, the supremum deviation $\sup_x |\hat{F}_n(x) - F(x)|$, the Rademacher complexity of a function class. We need tools that apply to any function of $X_1, \dots, X_n$ that doesn’t change too much when one input is changed.
The machinery is McDiarmid’s inequality, which is the direct corollary of the Azuma–Hoeffding inequality for martingales. We state Azuma–Hoeffding without proving it (the proof requires some martingale theory we don’t develop) and use it to prove McDiarmid.
A martingale is a sequence of random variables $M_0, M_1, M_2, \dots$ with the property that each term equals the conditional expectation of the next given the history: $\mathbb{E}[M_{k+1} \mid M_0, \dots, M_k] = M_k$. Intuitively, it’s a “fair game” process: knowing the past doesn’t help you predict whether the next step will go up or down. The increments $D_k = M_k - M_{k-1}$ have mean zero given the past, and they can be arbitrarily dependent — what matters is only that they are not too large in magnitude.
Theorem 12.7 (Azuma–Hoeffding). Let $M_0, M_1, \dots, M_n$ be a martingale with $|M_k - M_{k-1}| \le c_k$ almost surely for each $k$. Then for every $t > 0$:

$$P(M_n - M_0 \ge t) \;\le\; \exp\!\left(-\frac{t^2}{2\sum_{k=1}^n c_k^2}\right).$$

Azuma–Hoeffding is Hoeffding for martingale increments — the same sub-Gaussian MGF argument goes through, replacing independence with the martingale property (conditional expectation in place of unconditional). The full proof is in Boucheron, Lugosi & Massart (2013, §2.5) or Dembo & Zeitouni (Chapter 2).
Theorem 12.8 (McDiarmid’s inequality). Let $f : \mathcal{X}^n \to \mathbb{R}$ be a function of $n$ variables satisfying the bounded-differences property with constants $c_1, \dots, c_n$: for every $i$ and every $x_1, \dots, x_n, x_i' \in \mathcal{X}$,

$$\bigl|f(x_1, \dots, x_i, \dots, x_n) - f(x_1, \dots, x_i', \dots, x_n)\bigr| \;\le\; c_i.$$

If $X_1, \dots, X_n$ are independent random variables, then for every $t > 0$:

$$P\bigl(f(X_1, \dots, X_n) - \mathbb{E}[f(X_1, \dots, X_n)] \ge t\bigr) \;\le\; \exp\!\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right).$$
Proof.
The proof uses the Doob martingale construction: a sequence of conditional expectations that reveals the inputs one at a time. Define

$$M_k \;=\; \mathbb{E}\bigl[f(X_1, \dots, X_n) \,\big|\, X_1, \dots, X_k\bigr], \qquad k = 0, 1, \dots, n.$$

Then $M_0 = \mathbb{E}[f]$ (no conditioning), $M_n = f(X_1, \dots, X_n)$ (fully conditioned), and $\mathbb{E}[M_{k+1} \mid X_1, \dots, X_k] = M_k$ by the tower property — so $(M_k)$ is a martingale.

The key step is bounding the increments $D_k = M_k - M_{k-1}$. Because the $X_i$ are independent, conditioning on $X_1, \dots, X_k$ amounts to integrating $f$ over independent copies of the remaining coordinates: $M_k = g_k(X_1, \dots, X_k)$ where $g_k(x_1, \dots, x_k) = \mathbb{E}\bigl[f(x_1, \dots, x_k, X_{k+1}, \dots, X_n)\bigr]$. The bounded-differences assumption in coordinate $k$ gives, for any two values $x_k, x_k'$,

$$\bigl|g_k(x_1, \dots, x_{k-1}, x_k) - g_k(x_1, \dots, x_{k-1}, x_k')\bigr| \;\le\; c_k.$$

Applying Azuma–Hoeffding (Theorem 12.7) with the crude increment bound $|D_k| \le c_k$ would give the exponent $-t^2/(2\sum_k c_k^2)$ — the Azuma constant. The sharper McDiarmid constant ($2t^2$ in the numerator rather than $t^2/2$) comes from a better sub-Gaussian MGF bound on each increment: given the filtration up to step $k-1$, the increment $D_k$ is a mean-zero random variable that lies in an interval of length at most $c_k$ (because varying $X_k$ over its support changes $g_k$ by at most $c_k$, so the conditional range of $M_k$ given the past is $c_k$). Hoeffding’s lemma (Theorem 12.2) then gives

$$\mathbb{E}\bigl[e^{\lambda D_k} \,\big|\, X_1, \dots, X_{k-1}\bigr] \;\le\; \exp\!\left(\frac{\lambda^2 c_k^2}{8}\right).$$

Applying the Chernoff technique with this conditional MGF bound — exactly as in Hoeffding’s inequality proof, iterating the conditional-expectation argument over $k = n, n-1, \dots, 1$ — gives

$$P\bigl(f - \mathbb{E}[f] \ge t\bigr) \;\le\; \exp\!\left(-\frac{2t^2}{\sum_{k=1}^n c_k^2}\right).$$

The key point: Azuma’s generic constant uses $|D_k| \le c_k$ as a worst-case absolute bound, which corresponds to a sub-Gaussian variance proxy of $c_k^2$; but for the Doob martingale of a bounded-differences function, we know the stronger fact that the increment’s conditional range is $c_k$, which by Hoeffding’s lemma gives variance proxy $c_k^2/4$. The factor of $4$ in the variance proxy is exactly the factor-of-$4$ improvement in the exponent.
McDiarmid is the Swiss army knife of concentration. It applies to any function of independent inputs that has bounded differences, regardless of the function’s structure. The variance of $f$ doesn’t enter — only the coordinate-wise Lipschitz constants $c_i$.
For iid samples $X_1, \dots, X_n$ from distribution $F$, the empirical CDF is $\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_i \le x\}$. For any fixed $x$, changing one input changes $\hat{F}_n(x)$ by at most $1/n$ (it either adds or removes one indicator). So $\hat{F}_n(x)$ satisfies bounded differences with $c_i = 1/n$ for each $i$, and McDiarmid gives

$$P\bigl(|\hat{F}_n(x) - F(x)| \ge \varepsilon\bigr) \;\le\; 2e^{-2n\varepsilon^2}.$$

This is pointwise in $x$. The supremum $\sup_x |\hat{F}_n(x) - F(x)|$ requires the stronger Dvoretzky–Kiefer–Wolfowitz inequality (Theorem 10.9), which gives the same exponential rate but uniformly over $x$.

A more substantial example: the kernel density estimator $\hat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^n K\bigl(\frac{x - X_i}{h}\bigr)$. Changing one $X_i$ changes $\hat{f}_h(x)$ by at most $2\|K\|_\infty/(nh)$, so $c_i = 2\|K\|_\infty/(nh)$, and McDiarmid gives exponential concentration of $\hat{f}_h(x)$ around its expectation at rate $\exp\bigl(-nh^2\varepsilon^2/(2\|K\|_\infty^2)\bigr)$. This is the starting point for uniform convergence of density estimators.
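A quick simulation sketch (numpy; the Exponential(1) distribution, the evaluation point $x = 1$, and $n = 500$ are arbitrary choices for illustration) comparing the empirical frequency of a pointwise CDF deviation with the $2e^{-2n\varepsilon^2}$ bound:

```python
import numpy as np

rng = np.random.default_rng(1)
n, eps, reps = 500, 0.05, 10_000
x = 1.0
F_x = 1 - np.exp(-x)                      # true CDF of Exponential(1) at x

samples = rng.exponential(scale=1.0, size=(reps, n))
Fhat_x = (samples <= x).mean(axis=1)      # empirical CDF at x, one value per replication

print("empirical deviation frequency:", np.mean(np.abs(Fhat_x - F_x) >= eps))
print("McDiarmid/Hoeffding bound    :", 2 * np.exp(-2 * n * eps ** 2))
```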
When $f$ is a linear function — the sample mean $f(x_1, \dots, x_n) = \frac{1}{n}\sum_i x_i$ of $[a, b]$-valued inputs — McDiarmid with $c_i = (b-a)/n$ recovers Hoeffding exactly: $P(|\bar{X}_n - \mu| \ge \varepsilon) \le 2\exp\bigl(-2n\varepsilon^2/(b-a)^2\bigr)$. But McDiarmid goes much further: it concentrates nonlinear functionals (empirical CDFs, kernel densities, Rademacher complexities, the maximum discrepancy of a VC class over a sample) with no additional work beyond checking the bounded-differences property. This is why it dominates statistical learning theory. The cost — relative to a dedicated argument for a specific functional — is that the bound is always worst-case in a coordinate-wise sense; variance-sensitive refinements exist (Bousquet’s inequality, Talagrand’s inequality) but require substantially more machinery.
12.8 From Bounds to Algorithms: PAC, Johnson–Lindenstrauss, Differential Privacy
The payoff of the concentration machinery is immediate and broad. Three examples illustrate the range.
Theorem 12.9 (uniform convergence for a finite class). Let $\mathcal{H}$ be a finite set of hypotheses $h : \mathcal{X} \to \{0, 1\}$, each with $0$–$1$ loss on the data distribution $\mathcal{D}$: $R(h) = P_{(x, y)\sim\mathcal{D}}\bigl(h(x) \ne y\bigr)$. Given $n$ iid training samples $(x_1, y_1), \dots, (x_n, y_n)$, let $\hat{R}_n(h) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{h(x_i) \ne y_i\}$ be the training error. Then for every $\varepsilon > 0$:

$$P\Bigl(\max_{h \in \mathcal{H}} \bigl|\hat{R}_n(h) - R(h)\bigr| \ge \varepsilon\Bigr) \;\le\; 2\,|\mathcal{H}|\, e^{-2n\varepsilon^2},$$

and therefore $n \ge \dfrac{\ln(2|\mathcal{H}|/\delta)}{2\varepsilon^2}$ samples suffice to guarantee that the training error approximates the true error uniformly over $\mathcal{H}$ within $\varepsilon$, with probability at least $1 - \delta$.
Proof.
For each fixed $h$, the loss $\mathbf{1}\{h(x_i) \ne y_i\}$ is bounded in $[0, 1]$, so Hoeffding’s inequality gives

$$P\bigl(|\hat{R}_n(h) - R(h)| \ge \varepsilon\bigr) \;\le\; 2e^{-2n\varepsilon^2}.$$

By the union bound over $\mathcal{H}$:

$$P\Bigl(\max_{h \in \mathcal{H}} |\hat{R}_n(h) - R(h)| \ge \varepsilon\Bigr) \;\le\; \sum_{h \in \mathcal{H}} P\bigl(|\hat{R}_n(h) - R(h)| \ge \varepsilon\bigr) \;\le\; 2\,|\mathcal{H}|\, e^{-2n\varepsilon^2}.$$

Setting this $\le \delta$ and solving for $n$ gives the sample complexity bound.
Let $\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{R}_n(h)$ be the empirical risk minimizer. Theorem 12.9 implies $R(\hat{h}) \le \min_{h \in \mathcal{H}} R(h) + 2\varepsilon$ with probability at least $1 - \delta$, where $\min_h R(h)$ is the best possible risk in the class. This is the Probably Approximately Correct (PAC) guarantee: with high probability (at least $1 - \delta$), the learned hypothesis is approximately optimal (within $2\varepsilon$ of the best). The sample complexity $n = O\bigl(\varepsilon^{-2}\log(|\mathcal{H}|/\delta)\bigr)$ is what powers VC-dimension bounds, Rademacher-complexity bounds, and the modern statistical learning theory of Vapnik, Bousquet, and others.
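A one-function sketch of the finite-class sample complexity (plain Python; the example values $|\mathcal{H}| = 10^6$, $\varepsilon = 0.05$, $\delta = 0.01$ are arbitrary):

```python
import math

def pac_sample_complexity(num_hypotheses, eps, delta):
    # n >= ln(2*|H|/delta) / (2*eps^2): uniform convergence within eps with
    # probability >= 1 - delta (finite class, 0-1 loss, Hoeffding + union bound).
    return math.ceil(math.log(2 * num_hypotheses / delta) / (2 * eps ** 2))

print(pac_sample_complexity(10**6, eps=0.05, delta=0.01))   # ~3,800
```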
Given $N$ points in $\mathbb{R}^d$, project them into $\mathbb{R}^k$ using a random Gaussian matrix $A \in \mathbb{R}^{k \times d}$ with iid $N(0, 1/k)$ entries. The Johnson–Lindenstrauss lemma says that if

$$k \;\ge\; \frac{C \log N}{\varepsilon^2}$$

for a suitable universal constant $C$, then with high probability every pairwise distance is preserved within a factor of $1 \pm \varepsilon$ after projection: $(1 - \varepsilon)\|x_i - x_j\|^2 \le \|A x_i - A x_j\|^2 \le (1 + \varepsilon)\|x_i - x_j\|^2$ for all $i, j$.

The proof: fix a single pair and their difference $v = x_i - x_j$. The random variable $\|Av\|^2 / \|v\|^2$ is a scaled chi-squared (a sum of $k$ squared Gaussians, divided by $k$), and concentration of sub-exponential random variables (Bernstein’s inequality applied to a chi-squared, or direct MGF calculations) gives $P\bigl(\bigl|\|Av\|^2/\|v\|^2 - 1\bigr| \ge \varepsilon\bigr) \le 2e^{-ck\varepsilon^2}$ for a universal constant $c$. Union-bounding over all $\binom{N}{2}$ pairs gives the claimed target dimension.

The punchline: $k = O(\varepsilon^{-2}\log N)$ dimensions suffice — independent of the original dimension $d$. This is the entire theoretical foundation of dimensionality reduction by random projection, which underlies sketching algorithms, fast nearest-neighbor search, and compressed sensing.
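A small random-projection sketch (numpy/scipy; $N = 200$, $d = 10{,}000$, $\varepsilon = 0.25$, and the constant $C = 8$ are illustrative assumptions, not canonical choices):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
N, d, eps = 200, 10_000, 0.25
k = int(np.ceil(8 * np.log(N) / eps ** 2))            # target dimension, constant C = 8 assumed

X = rng.normal(size=(N, d))                           # N arbitrary points in R^d
A = rng.normal(scale=1.0 / np.sqrt(k), size=(k, d))   # Gaussian projection, entries N(0, 1/k)
Y = X @ A.T                                           # projected points in R^k

distortion = pdist(Y, "sqeuclidean") / pdist(X, "sqeuclidean")
print(f"k = {k}, max |distortion - 1| = {np.max(np.abs(distortion - 1)):.3f}")
```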
The Gaussian mechanism for differential privacy releases a query $f$ on a dataset $D$ with added noise: $\mathcal{M}(D) = f(D) + Z$ where $Z \sim N(0, \sigma^2)$. To achieve $(\varepsilon, \delta)$-differential privacy, we need to control how much the output distribution shifts when one person’s data changes — which is controlled by the sensitivity $\Delta = \sup_{D \sim D'} |f(D) - f(D')|$, where the sup is over datasets differing in one record.

A calculation using the sub-Gaussian tail bound shows that the noise scale

$$\sigma \;\ge\; \frac{\Delta \sqrt{2 \ln(1.25/\delta)}}{\varepsilon}$$

is sufficient (for $\varepsilon < 1$). The derivation: the privacy loss random variable is Gaussian — hence sub-Gaussian — with mean $\Delta^2/(2\sigma^2)$ and variance $\Delta^2/\sigma^2$, and the $(\varepsilon, \delta)$-privacy condition becomes a Gaussian tail bound, $P(\text{privacy loss} > \varepsilon) \le \delta$, which inverts to the formula above. This is why every differentially private algorithm uses Gaussian or Laplace noise calibrated to a tail-bound quantity — the concentration machinery is the privacy machinery.
12.9 Connections to Machine Learning
The five families of bounds developed here appear everywhere in ML, in ways that are easy to miss if you haven’t seen them. Here are the most load-bearing applications.
There is a clean progression from the weakest to the strongest concentration bound, parametrized by how much structural information you know about the distribution:
- Markov (Topic 4, Theorem 11): uses only the mean. Gives $O(1/t)$ tail decay, typically useless for finite-sample guarantees.
- Chebyshev (Topic 4, Theorem 12): uses the variance. Gives $O\bigl(1/(n\varepsilon^2)\bigr)$ decay — polynomial in $n$.
- Hoeffding (Theorem 12.3): uses boundedness. Gives $2e^{-2n\varepsilon^2/(b-a)^2}$ — exponential in $n$, distribution-free within bounded classes.
- Bernstein (Theorem 12.4): uses variance + boundedness. Two-regime: sub-Gaussian for small deviations, sub-exponential for large deviations.
- Sub-Gaussian (Definition 12.1, Theorem 12.5): uses the full sub-Gaussian MGF structure. Unifies bounded, Gaussian, and many other families under a single framework.
- Chernoff (Theorem 12.1): uses the full MGF. The optimal exponent is the Cramér rate function $I(t)$.
More information ⟹ tighter bounds. This is the no-free-lunch principle applied to tail estimation: to get exponential rates, you need to assume something about the distribution beyond finite variance. The beauty of the modern theory is that sub-Gaussian is a natural, checkable condition that captures the right cases (sums of bounded or Gaussian-like increments) without requiring you to know the distribution exactly.
The ML applications map onto this hierarchy precisely:
- Generalization bounds (formalML, coming soon): Hoeffding gives the $O\bigl(\sqrt{\log|\mathcal{H}|/n}\bigr)$ deviation for finite classes. McDiarmid + Rademacher symmetrization extend this to infinite classes with bounded-differences functionals.
- SGD convergence (formalML, coming soon): Azuma–Hoeffding applied to the martingale of cumulative gradient errors gives non-asymptotic bounds on the optimization gap, provided gradient noise is sub-Gaussian.
- Dimensionality reduction (formalML, coming soon): sub-Gaussian concentration of random-projection norms gives Johnson–Lindenstrauss and its descendants (sketching, compressive sensing).
- Differential privacy (formalML, coming soon): Sub-Gaussian tail calibration of the Gaussian mechanism sets the noise scale. Concentrated DP (zCDP) uses the sub-Gaussian MGF directly.
- Non-asymptotic Monte Carlo (formalML, coming soon): Hoeffding gives explicit sample-size formulas for bounded integrands, complementing the CLT’s asymptotic bands.
In every case, the story is the same: the CLT tells you what the distribution of $\bar{X}_n$ looks like; concentration tells you how many samples you need.
12.10 Summary
Four families of exponential-rate bounds. One master technique. One optimal exponent.
The master technique is Chernoff’s: apply Markov to $e^{\lambda X}$, optimize over $\lambda$. Every exponential tail bound in this topic — Hoeffding, Bernstein, sub-Gaussian, Azuma, McDiarmid — is a specialization.

The optimal exponent is the Cramér rate function $I(t) = \sup_{\lambda}\bigl(\lambda t - \log M_X(\lambda)\bigr)$, the Legendre transform of the log-MGF. For Bernoulli it’s the KL divergence; for Normal it’s a parabola; for exponential-family distributions in general, it’s the conjugate of the log-partition function. Cramér’s theorem says no better exponent is possible.
The information hierarchy that organizes the bounds:
| Theorem | Assumption | Bound | Rate | Information used |
|---|---|---|---|---|
| Markov (4.11) | $X \ge 0$, finite mean | $P(X \ge t) \le \mathbb{E}[X]/t$ | $O(1/t)$ | mean |
| Chebyshev (4.12) | finite variance | $P(\lvert\bar{X}_n - \mu\rvert \ge \varepsilon) \le \sigma^2/(n\varepsilon^2)$ | $O(1/n)$ | mean + variance |
| Hoeffding (12.3) | bounded | $2\exp\bigl(-2n\varepsilon^2/(b-a)^2\bigr)$ | exponential in $n$ | mean + range |
| Bernstein (12.4) | bounded, variance known | $\exp\bigl(-\tfrac{t^2/2}{\sum_i \sigma_i^2 + Mt/3}\bigr)$ | sub-Gauss small $t$; sub-exp large $t$ | mean + variance + range |
| Sub-Gaussian (12.5) | sub-Gaussian($\sigma^2$) | $2\exp\bigl(-n\varepsilon^2/(2\sigma^2)\bigr)$ | exponential in $n$ | sub-Gaussian parameter |
| Chernoff (12.1) | finite MGF near $0$ | $\exp\bigl(-nI(\varepsilon)\bigr)$ | optimal exponent | full MGF |
| McDiarmid (12.8) | bounded differences | $\exp\bigl(-2t^2/\sum_i c_i^2\bigr)$ | exponential in $n$ for $c_i = O(1/n)$ | bounded-differences constants |
The applications that made these tools indispensable: PAC bounds for supervised learning, Rademacher complexity for uniform convergence, the Johnson–Lindenstrauss lemma for dimensionality reduction, the Gaussian mechanism for differential privacy, Azuma–Hoeffding for stochastic gradient descent, and non-asymptotic confidence bands for Monte Carlo. Every finite-sample guarantee in modern ML passes through this machinery.
Track 3 is now complete. The arc was: modes of convergence (Topic 9) gave the vocabulary; the law of large numbers (Topic 10) gave the qualitative statement $\bar{X}_n \to \mu$; the central limit theorem (Topic 11) gave the shape of fluctuations at the $1/\sqrt{n}$ scale; and this topic gives the rate — exponential concentration at all scales, with explicit non-asymptotic constants. Where the LLN says “eventually close”, the CLT says “Gaussian-shaped”, and concentration says “with probability at least $1 - \delta$, within $\varepsilon$, for every $n$.”
What comes next. Confidence intervals turn the concentration bounds into explicit interval estimates: every confidence interval for a bounded mean is a direct application of Hoeffding. Hypothesis testing inverts this: the rejection region of any test is a tail event, and the power of a test is controlled by concentration under the alternative. The bootstrap and other resampling methods give empirical analogs of the CLT and concentration. Further out, empirical processes and Bayesian computation build on the sub-Gaussian framework to handle infinite function classes and posterior concentration. Across formalML, the tools developed here power every generalization bound, every privacy guarantee, and every non-asymptotic analysis of stochastic optimization.
The reader who has made it this far has the full non-asymptotic toolkit of probability theory: modes of convergence, the LLN, the CLT, and concentration. That is the foundation on which all of statistical inference and statistical learning theory is built.
References
- Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.
- Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications to Data Science. Cambridge University Press.
- Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press.
- Dembo, A., & Zeitouni, O. (2010). Large Deviations Techniques and Applications (2nd ed.). Springer.
- Durrett, R. (2019). Probability: Theory and Examples (5th ed.). Cambridge University Press.
- Billingsley, P. (2012). Probability and Measure (Anniversary ed.). Wiley.
- Casella, G., & Berger, R. L. (2002). Statistical Inference (2nd ed.). Cengage.
- Wasserman, L. (2004). All of Statistics. Springer.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.