intermediate 58 min read · April 23, 2026

The Bootstrap

Efron's nonparametric bootstrap, the Bickel–Freedman consistency theorem, four confidence-interval constructions (percentile, basic, BCa, studentized), Hall's second-order accuracy, bootstrap hypothesis tests, parametric and smooth-bootstrap variants, and bootstrap bias correction. Track 8, topic 3 of 4.

31.1 Motivation: the plug-in principle, extended

Topic 29 built inference on a single load-bearing object: the empirical CDF $F_n(x) = n^{-1}\sum_{i=1}^n \mathbf{1}\{X_i \le x\}$, which Glivenko–Cantelli (Topic 10) guarantees converges uniformly to $F$. Topic 30 smoothed $F_n$ into a density estimator $\hat f_h$ and studied its bias–variance trade-off. Topic 31 now asks the question those two topics were building toward: if we can estimate $F$, can we estimate the sampling distribution of a statistic $T_n = T(X_1, \dots, X_n)$ whose analytical distribution we can't write down?

The bootstrap answer is disarmingly simple: treat $F_n$ as if it were $F$ and Monte-Carlo everything else. Draw resamples $X^\ast_1, \dots, X^\ast_n$ with replacement from $F_n$; compute the statistic on each resample to get $T^\ast_n$; repeat many times; use the empirical distribution of the $T^\ast_n$ values as an approximation to the sampling distribution of $T_n$. This is the plug-in principle: wherever the true CDF $F$ appears in a formula for a distributional quantity, substitute $F_n$. Topic 17's permutation test was one special case (plug-in under the null); the bootstrap is the general case, and most of the effort in this topic goes into showing that the substitution is rigorous rather than wishful.

Definition 1 The plug-in principle

Let $\theta(F)$ be a functional of the unknown distribution $F$ — for example, $\theta(F) = \mathrm{Var}_F(T_n)$, the variance of a statistic under repeated sampling from $F$. The plug-in estimator of $\theta$ is $\theta(F_n)$: the same functional evaluated at the empirical distribution. When $\theta$ is sufficiently smooth as a functional of its CDF argument, $\theta(F_n) \to \theta(F)$ — a functional Glivenko–Cantelli. The bootstrap is the plug-in principle applied to the sampling-distribution functional itself.

Three-panel narrative figure. Left: the true population distribution F with ten highlighted sample points. Middle: the empirical CDF F_n as a step function overlaid on F, matching closely at the highlighted sample points. Right: a bootstrap resample X-star with several points repeated (ties visualised as stacked markers).

Figure 1. The bootstrap idea in three panels. Left: the unknown population distribution $F$ with a sample of size $n = 10$ highlighted. Middle: the empirical CDF $F_n$ as a step function, close to $F$ by Glivenko–Cantelli. Right: a bootstrap resample $X^\ast_1, \dots, X^\ast_{10}$ drawn iid from $F_n$ — some points repeat (stacked markers), which is the ordinary consequence of sampling with replacement from a discrete distribution on the sample points.

Example 1 Why plug-in works when we can't write the answer

Suppose $T_n = \bar X_n$ is the sample mean. Classical theory tells us $\mathrm{Var}_F(\bar X_n) = \sigma^2 / n$, where $\sigma^2 = \mathrm{Var}_F(X_1)$. The plug-in answer is $\mathrm{Var}_{F_n}(\bar X_n) = \hat\sigma^2_n / n$, where $\hat\sigma^2_n = n^{-1}\sum_i (X_i - \bar X_n)^2$ is the sample variance (non-Bessel-corrected — it's the variance of $F_n$, which places mass $1/n$ at each sample point). Both estimators are consistent; the plug-in one matches the classical one. Now replace $\bar X_n$ by the sample median. Classical theory says $\mathrm{Var}_F(\mathrm{median}) \approx (4 n f(\xi_{0.5})^2)^{-1}$, requiring the population density $f$ at the median — an object Topic 29 §29.6 struggled with. The plug-in answer is $\mathrm{Var}_{F_n}(\mathrm{median})$, which the bootstrap computes by Monte Carlo resampling. No density estimate needed; the resampled medians handle everything.

Remark 1 Two sources of error, cleanly separated

The bootstrap introduces two distinct approximations: (i) using $F_n$ instead of $F$ — this is the asymptotic error that vanishes as $n \to \infty$, and Theorem 3 in §31.3 controls it; and (ii) using a finite number $B$ of Monte Carlo resamples instead of the exact plug-in answer $\mathrm{Var}_{F_n}(\cdot)$ — this is the Monte-Carlo error that vanishes as $B \to \infty$, independently of $n$. In practice we fix $B$ large (say $B = 2000$) and treat the MC error as negligible; the asymptotic error is the object of theoretical study.

Remark 2 The bootstrap's scope vs. the parametric program

Track 4 (Topics 17–20) built hypothesis tests and CIs on parametric models — assume $X_i \sim F_\theta$ for some family $\{F_\theta : \theta \in \Theta\}$, derive the sampling distribution from the model, use likelihood ratios or pivots for inference. The bootstrap drops the family assumption entirely. In exchange, it gives up the efficiency and optimality guarantees that come with correctly specified parametric models and trades them for distribution-free validity under mild moment conditions. When you don't know the model, or when you know the standard family is wrong (fat tails, mixtures, skew), the bootstrap is the non-negotiable answer.

31.2 The nonparametric bootstrap

Make the resampling operation precise, state the two consistency results whose proofs live in §31.3, and check how fast the Monte-Carlo error decays so the reader can calibrate $B$.

Definition 2 Nonparametric bootstrap

Given an iid sample $X_1, \dots, X_n \sim F$, the nonparametric bootstrap draws a resample $X^\ast_1, \dots, X^\ast_n$ iid from the empirical distribution $F_n$:

$$X^\ast_i \mid X_1, \dots, X_n \overset{\text{iid}}{\sim} F_n, \qquad i = 1, \dots, n.$$

Equivalently, each $X^\ast_i$ selects an index $J_i \sim \mathrm{Uniform}\{1, \dots, n\}$ independently and sets $X^\ast_i = X_{J_i}$. We draw $B$ independent resamples $X^{\ast(1)}, \dots, X^{\ast(B)}$, compute the statistic $T^{\ast(b)} = T(X^{\ast(b)})$ on each, and take the empirical distribution of $\{T^{\ast(b)}\}_{b=1}^B$ as the bootstrap estimate of the sampling distribution of $T_n$. Write $P^\ast, E^\ast$ for probability and expectation conditional on the observed data $X_1, \dots, X_n$ — the bootstrap world.
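A minimal NumPy sketch of Definition 2; the helper name bootstrap_replicates, the Normal$(0,1)$ fixture with $n = 100$, and the seeds are illustrative choices, not part of the topic.

```python
import numpy as np

def bootstrap_replicates(x, stat, B=2000, seed=None):
    """Draw B nonparametric bootstrap replicates of stat(x).

    Each row of idx picks n indices uniformly with replacement, i.e. an
    iid sample from the empirical distribution F_n of the observed data.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    idx = rng.integers(0, n, size=(B, n))      # J_i ~ Uniform{1, ..., n}
    return np.array([stat(row) for row in x[idx]])

rng = np.random.default_rng(0)
x = rng.normal(size=100)                        # observed sample, F = N(0, 1)

t_star = bootstrap_replicates(x, np.mean, B=2000, seed=1)
print("bootstrap SE of the mean:", t_star.std(ddof=1))
print("plug-in reference       :", x.std(ddof=0) / np.sqrt(len(x)))
```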

Theorem 1 Bootstrap SE consistency (stated)

Under finite-second-moment regularity, the bootstrap estimator of $\mathrm{Var}(T_n)$,

$$\widehat{\mathrm{Var}}^\ast(T_n) = \frac{1}{B-1}\sum_{b=1}^B \bigl(T^{\ast(b)} - \bar T^\ast\bigr)^2,$$

satisfies $\widehat{\mathrm{Var}}^\ast(T_n) \to \mathrm{Var}(T_n)$ in probability as $n, B \to \infty$. The rate is controlled by $n^{-1/2}$ for the asymptotic component and $B^{-1/2}$ for the Monte-Carlo component.

Theorem 2 Bootstrap quantile consistency (stated)

Under the same regularity, the bootstrap quantile $\hat q^\ast_p = \inf\{t : \hat F^\ast_B(t) \ge p\}$ of the bootstrap empirical CDF $\hat F^\ast_B(t) = B^{-1}\sum_b \mathbf{1}\{T^{\ast(b)} \le t\}$ satisfies $\hat q^\ast_p \to q_p$ in probability, where $q_p$ is the $p$-quantile of the true sampling distribution of $T_n$.

Bootstrap standard-error estimate of the sample mean as B increases on a log scale. Six points at B = 50, 100, 500, 1000, 5000, 10000 with plus-or-minus one Monte-Carlo standard-error bands. The estimate settles around 0.1 by B = 1000 with MC error shrinking as 1 over root B.

Figure 2. Bootstrap SE of the sample mean on a Normal$(0, 1)$ fixture, $n = 100$, with $\pm 1$-MC-SE bands. The estimate stabilises by $B \approx 1000$; at that point the Monte-Carlo error is under 1 %. The curve illustrates the $O(B^{-1/2})$ MC-error decay — the more expensive $O(n^{-1/2})$ asymptotic error stays fixed.

Example 2 Bootstrap SE for the median, no density estimate required

On a Normal$(0, 1)$ sample of size $n = 100$ and $B = 10{,}000$, the bootstrap SE of the sample median is approximately $0.125$. The asymptotic formula $(4 n f(\xi_{0.5})^2)^{-1/2} \approx (4 \cdot 100 \cdot \varphi(0)^2)^{-1/2} = (4 \cdot 100 \cdot (2\pi)^{-1})^{-1/2} \approx 0.125$ matches to two digits. The bootstrap recovered the asymptotic answer without requiring $f(0)$ — it did the density estimation implicitly through resampling.
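A quick numerical check of Example 2, under the same fixture; the seed is arbitrary and the printed values will vary slightly with it.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=100)                        # one Normal(0, 1) sample, n = 100

idx = rng.integers(0, 100, size=(10_000, 100))  # B = 10,000 resamples
med_star = np.median(x[idx], axis=1)            # bootstrap medians
print("bootstrap SE of the median      :", med_star.std(ddof=1))
print("asymptotic (4 n phi(0)^2)^(-1/2):", (4 * 100 / (2 * np.pi)) ** -0.5)  # ~0.125
```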

Example 3 Bootstrap distribution of a ratio statistic

Two samples $(X_i, Y_i)_{i=1}^n$ iid from Exponential$(1) \times$ Exponential$(1)$, $n = 50$. The statistic $T_n = \bar X_n / \bar Y_n$ has no convenient closed-form sampling distribution — it is a ratio of two independent Gamma-distributed sample means, heavy in the right tail at small $n$. Classical delta-method intervals rely on a Taylor expansion around $E[X]/E[Y] = 1$ that becomes unstable for small $\bar Y_n$. Bootstrap: generate $B = 2000$ resamples, compute $T^{\ast(b)}$ on each, use the empirical quantiles for a CI. Topic 19's Wald CI gives $(0.74, 1.38)$; the bootstrap percentile CI (coming in §31.4) gives $(0.79, 1.35)$. Both are close; the bootstrap's advantage is that it doesn't depend on the delta-method expansion.

Example 4 Cross-validation variance estimation

Cross-validation estimates out-of-sample risk by holding out folds, but the CV estimate itself has variance that depends on how the folds partition the data. Classical CV-variance formulas exist only for specific setups (leave-one-out on linear regression, for example). Bootstrapping CV is the general answer: resample the training data, run CV on each bootstrap sample, and use the empirical variance of the CV estimates as the CV variance. This is the bootstrap’s most common ML application — it shows up whenever someone reports a CI on a cross-validation score.

Remark 3 Monte-Carlo error vs. asymptotic error

Pick a single $n$ and let $B \to \infty$: the bootstrap's answer converges to $\mathrm{Var}_{F_n}(T_n)$ — the exact plug-in answer, which still differs from $\mathrm{Var}_F(T_n)$ by the $O(n^{-1/2})$ asymptotic gap. No amount of MC refinement can close that gap; it's a property of using $F_n$ instead of $F$. The practical consequence: $B$ should be large enough to make MC error negligible relative to asymptotic error, but beyond that, increasing $B$ buys nothing. Topic 29 §29.5's DKW band gives a coarse lower bound on the asymptotic error that can guide $B$-selection.

Remark 4 Why the bootstrap trains cross-validation intuition

Every ML practitioner who has stared at a cross-validation score and wondered “how much should I trust this number?” is asking a bootstrap question. The CV score is a statistic of the training data; its sampling distribution under repeated training-set draws is exactly what bootstrap-CV estimates. The bootstrap gives a CI on the CV estimate without any parametric model of how risk depends on training-set composition — a distribution-free uncertainty quantification tailor-made for the ML use case.

31.3 Bootstrap consistency (Efron–Bickel–Freedman)

This is the featured theorem. Its statement pins down the sense in which the bootstrap distribution approximates the true sampling distribution, and its proof is the template for every Track 8 consistency result.

Start with a lemma we’ll need inside the main proof.

Lemma 1 Kolmogorov-distance upgrade via Pólya

Let $G_n, G$ be CDFs with $G$ continuous. If $G_n(x) \to G(x)$ pointwise for every $x$, then $\sup_x |G_n(x) - G(x)| \to 0$.

Proof 1 sketch

Pointwise convergence of monotone functions, plus continuity of the limit, upgrades to uniform convergence via a partition argument. Fix $\varepsilon > 0$; pick $-\infty = x_0 < x_1 < \dots < x_k = \infty$ with $G(x_{j+1}) - G(x_j) < \varepsilon / 2$ for every $j$ (possible by continuity of $G$). For $x \in [x_j, x_{j+1}]$,

$$G_n(x) - G(x) \le G_n(x_{j+1}) - G(x_j) = \bigl[G_n(x_{j+1}) - G(x_{j+1})\bigr] + \bigl[G(x_{j+1}) - G(x_j)\bigr],$$

and symmetrically for the lower bound. The first bracket vanishes at each of the $k+1$ grid points as $n \to \infty$; the second is at most $\varepsilon / 2$ by construction. Hence $\limsup_n \sup_x |G_n - G| \le \varepsilon / 2 < \varepsilon$. Since $\varepsilon$ was arbitrary, uniform convergence holds.

$\blacksquare$ — using Pólya 1920 as stated in van der Vaart 2000, Lemma 2.11.

Now the main theorem. Its statement pairs the bootstrap sampling-distribution CDF with the true sampling-distribution CDF and shows that their Kolmogorov distance vanishes almost surely.

Theorem 3 Efron–Bickel–Freedman bootstrap consistency

Let $X_1, \dots, X_n$ be iid with CDF $F$ satisfying $E[X_1^2] < \infty$. Write $\mu = E[X_1]$, $\sigma^2 = \mathrm{Var}(X_1) > 0$. Let $\bar X_n$ be the sample mean and $\bar X^\ast_n$ the bootstrap-sample mean — conditional on the data, this is the mean of $n$ iid draws from $F_n$. Define

$$H_n(x) = P\bigl(\sqrt{n}(\bar X_n - \mu) \le x\bigr), \qquad H^\ast_n(x) = P\bigl(\sqrt{n}(\bar X^\ast_n - \bar X_n) \le x \,\big|\, X_1, \dots, X_n\bigr).$$

Then $\sup_x |H^\ast_n(x) - H_n(x)| \to 0$ almost surely as $n \to \infty$.

Proof 2

Set $Y^\ast_i = X^\ast_i - \bar X_n$ so that, conditional on the data, $Y^\ast_1, \dots, Y^\ast_n$ are iid draws from the empirical distribution of the centred data $X_i - \bar X_n$. They have conditional mean $0$ and conditional variance

$$\hat\sigma^2_n := \frac{1}{n}\sum_{i=1}^n (X_i - \bar X_n)^2.$$

By the strong law applied to $n^{-1}\sum X_i^2$ and to $\bar X_n^2$, we have $\hat\sigma^2_n \to \sigma^2$ almost surely. Work on the almost-sure event where this convergence holds.

Step 1 — conditional Lindeberg. Conditional on the data, the array $\{Y^\ast_i / \sqrt{n}\}_{i=1,\dots,n}$ is a row of a triangular array of iid variables with variance $\hat\sigma^2_n / n$. The Lindeberg condition (Topic 11 §11.6) requires, for every $\varepsilon > 0$,

$$\frac{1}{\hat\sigma^2_n} E^\ast\bigl[(X^\ast_1 - \bar X_n)^2 \,\mathbf{1}\{|X^\ast_1 - \bar X_n| > \varepsilon\sqrt{n}\,\hat\sigma_n\}\bigr] \to 0.$$

The conditional expectation equals $n^{-1}\sum_{i=1}^n (X_i - \bar X_n)^2 \,\mathbf{1}\{|X_i - \bar X_n| > \varepsilon\sqrt{n}\,\hat\sigma_n\}$. Every indicator vanishes for $n$ large: under $E[X_1^2] < \infty$ the maximum $\max_{i \le n} |X_i - \bar X_n|$ is $o(\sqrt{n})$ almost surely, while $\sqrt{n}\,\hat\sigma_n$ grows like $\sqrt{n}\,\sigma$. Since the summands are dominated by $(X_i - \bar X_n)^2$, whose average $\hat\sigma^2_n$ is almost surely bounded, the truncated average converges to $0$.

Step 2 — apply the triangular-array CLT. Topic 11 Theorem 4 (Lindeberg–Feller) yields, conditionally on the data on the same full-probability event,

$$\sqrt{n}(\bar X^\ast_n - \bar X_n) = \frac{1}{\sqrt{n}}\sum_{i=1}^n Y^\ast_i \xrightarrow{d} \mathcal{N}(0, \sigma^2).$$

Marginally (unconditionally), Topic 11 Theorem 3 gives the classical CLT $\sqrt{n}(\bar X_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$ with the same limit variance. Thus $H^\ast_n(x) \to \Phi(x / \sigma)$ pointwise almost surely, and $H_n(x) \to \Phi(x / \sigma)$ pointwise.

Step 3 — upgrade to Kolmogorov distance. The limit $\Phi(\cdot / \sigma)$ is continuous; apply Lemma 1 to each sequence:

$$\sup_x |H^\ast_n(x) - \Phi(x/\sigma)| \to 0 \ \text{a.s.}, \qquad \sup_x |H_n(x) - \Phi(x/\sigma)| \to 0.$$

The triangle inequality closes the argument: $\sup_x |H^\ast_n(x) - H_n(x)| \to 0$ almost surely.

$\blacksquare$ — using Bickel–Freedman 1981 Thm 2.1, Topic 11 Thm 4 (Lindeberg–Feller), and Lemma 1 above.

Featured two-panel figure. Left panel: three overlaid bootstrap histograms at B = 100, 500, and 2000 against a high-precision Monte-Carlo reference of the true sampling distribution, for fixed n = 50 and the sample-mean statistic on N(0, 1); the histograms progressively match the reference as B grows. Right panel: Kolmogorov distance sup over x of the absolute difference between H-n-star and H-n as n increases through 20, 50, and 200, showing the expected decay.

Figure 3. Featured. Theorem 3 in action. Left: fixed $n = 50$, Monte-Carlo error shrinks as $B$ grows — the bootstrap histogram matches the reference sampling distribution arbitrarily well given enough replicates. Right: fixed $B = 2000$, asymptotic error shrinks as $n$ grows — the Kolmogorov distance between $H^\ast_n$ and $H_n$ decays at roughly the $n^{-1/2}$ rate (the Singh 1981 refinement of Theorem 3's almost-sure convergence). Source: Normal$(0, 1)$, statistic $\bar X_n$.

Interactive component: Bootstrap distribution vs. sampling distribution — the bootstrap replaces repeated sampling from the true distribution with repeated resampling from one observed sample. Controls: preset, statistic, sample size $n$, replicate count $B$; three linked panels show the sampling distribution $H_n$, the bootstrap distribution $H^\ast_n$, and the KS distance between them as $B$ grows.

Pick a preset, a statistic, and a sample size. Panel 1 shows the true sampling distribution (from 10 000 Monte-Carlo draws); Panel 2 shows the bootstrap distribution built from one observed sample of size n. Panel 3 tracks the KS distance between them as B grows. Watch it decay at rate 1/√B toward a floor that depends only on (preset, statistic, n) — the floor is the gap Theorem 3 shrinks to zero as n → ∞.

Example 5 Normal-mean case: analytic reference

When $X_i \sim \mathcal{N}(0, 1)$ and $T_n = \bar X_n$, the sampling distribution $H_n$ is analytic: $\sqrt{n}\,\bar X_n \sim \mathcal{N}(0, 1)$ exactly. So the "true" reference curve in the featured component's first panel is the standard Normal density at sample size $n$, no Monte Carlo needed. The KS-distance panel shows $D_{KS}(H^\ast_n, \Phi) \to 0$ at rate $n^{-1/2}$, exactly the $1/\sqrt{n}$ envelope one would expect from the CLT remainder.
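A self-contained sketch of Example 5's check, assuming NumPy and SciPy: for Normal$(0,1)$ data the true sampling distribution of $\sqrt{n}(\bar X_n - \mu)$ is exactly $\mathcal{N}(0,1)$, so the Kolmogorov distance of Theorem 3 can be computed against $\Phi$ directly. The function name and seeds are illustrative.

```python
import numpy as np
from scipy.stats import norm

def ks_bootstrap_vs_truth(n, B=2000, seed=0):
    """KS distance between H_n* (bootstrap law of sqrt(n)(Xbar* - Xbar)) and
    H_n, which for N(0, 1) data is exactly the N(0, 1) CDF."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                              # one observed sample
    boot_means = x[rng.integers(0, n, size=(B, n))].mean(axis=1)
    z = np.sort(np.sqrt(n) * (boot_means - x.mean()))   # support of H_n*
    ecdf = np.arange(1, B + 1) / B
    d_plus = np.max(ecdf - norm.cdf(z))
    d_minus = np.max(norm.cdf(z) - (ecdf - 1.0 / B))
    return max(d_plus, d_minus)

for n in (20, 50, 200, 1000):
    print(n, round(ks_bootstrap_vs_truth(n), 3))        # shrinks roughly like 1/sqrt(n)
```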

Remark 5 Bahadur-linearization: the Track 8 template

The three-step structure of the proof — (i) write the bootstrap statistic as an empirical average of iid-conditional-on-data terms, (ii) apply a triangular-array CLT to the linear part, (iii) upgrade pointwise convergence to uniform via Polya — is exactly the same template as Topic 29’s Bahadur representation of the sample quantile and Topic 30’s AMISE derivation for KDE. Every major Track 8 result reduces to this linearization pattern. Topic 29 §29.6 Rem 13 called this out as the unifying thread; Theorem 3’s proof makes it literally visible. Topic 32’s empirical-process generalization lifts the same structure one level: the “empirical average” becomes a stochastic integral against a sample-path-continuous limit process, the triangular-array CLT becomes Donsker’s theorem, and Polya’s upgrade becomes the uniform continuity of the limit Gaussian process.

Remark 6 Bootstrapped model uncertainty

The natural ML descendant of Theorem 3: treat a model’s predictions as a statistic, bag multiple bootstrap replicates of the training set, refit on each, and use the distribution of test-time predictions as a proxy for posterior uncertainty. This is the theoretical foundation for bagging, for many uncertainty-quantification methods in deep learning, and for the non-parametric half of conformal prediction. Theorem 3 guarantees that as $n$ grows, the bagged-prediction distribution matches the sampling distribution of the model’s prediction under re-draw of the training set — which is what honest uncertainty quantification actually asks for.

31.4 Bootstrap confidence intervals

Theorem 3 tells us the bootstrap sampling distribution approximates the true one. The remaining question is how to convert that approximation into a CI. Four constructions — percentile, basic, BCa, studentized — all valid in the sense of Theorem 3 but with different coverage-error rates that §31.5 will analyse.

Fix notation: let $T_n$ be the estimator, $\theta = \theta(F)$ the target parameter, $T^\ast_{(b)}$ the $b$-th bootstrap replicate, and $\hat F^\ast_B$ the empirical CDF of $\{T^\ast_{(b)}\}_{b=1}^B$. All four definitions assume the replicates are sorted so that $T^\ast_{(1)} \le T^\ast_{(2)} \le \dots \le T^\ast_{(B)}$, and we take quantiles of the bootstrap empirical distribution by linear interpolation on the sorted order statistics.

Definition 3 Percentile CI (Efron 1979)

The percentile CI at level $1 - \alpha$ is the pair of $\alpha / 2$ and $1 - \alpha/2$ quantiles of the bootstrap replicates:

$$\mathrm{CI}^{\mathrm{pct}}_{1-\alpha} = \bigl[\hat q^\ast_{\alpha/2},\ \hat q^\ast_{1-\alpha/2}\bigr].$$

This is the intuitive construction — the one that inverts the bootstrap empirical CDF without further adjustment. It's exact under symmetry about $\theta$ but under-covers when the sampling distribution is asymmetric (§31.5 makes this precise).

Definition 4 Basic (Hall) CI

The basic CI reflects the percentile endpoints around the observed $T_n$:

$$\mathrm{CI}^{\mathrm{bsc}}_{1-\alpha} = \bigl[2 T_n - \hat q^\ast_{1-\alpha/2},\ 2 T_n - \hat q^\ast_{\alpha/2}\bigr].$$

The motivation: treat $T^\ast_n - T_n$ as a pivot whose distribution mimics that of $T_n - \theta$; invert the pivot. Under symmetry, basic and percentile coincide; under skewness, they lean in opposite directions. Hall 1992 §3.3 explains why basic is often the better default.
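A minimal sketch of Definitions 3 and 4, assuming NumPy; t_star stands for a replicate array such as the one produced by the bootstrap_replicates sketch in §31.2, and t_obs is the statistic on the observed sample.

```python
import numpy as np

def percentile_ci(t_star, alpha=0.05):
    """Percentile CI: invert the bootstrap empirical CDF directly (Definition 3)."""
    return tuple(np.quantile(t_star, [alpha / 2, 1 - alpha / 2]))

def basic_ci(t_star, t_obs, alpha=0.05):
    """Basic (Hall) CI: reflect the percentile endpoints around T_n (Definition 4)."""
    lo, hi = np.quantile(t_star, [alpha / 2, 1 - alpha / 2])
    return (2 * t_obs - hi, 2 * t_obs - lo)
```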

Definition 5 BCa CI (Efron 1987)

The bias-corrected and accelerated CI uses two plug-in constants to adjust the percentile endpoints. Let $\hat z_0 = \Phi^{-1}\bigl(B^{-1}\sum_b \mathbf{1}\{T^\ast_{(b)} < T_n\}\bigr)$ — the bias correction. Let $\hat a$ be the jackknife acceleration,

$$\hat a = \frac{\sum_{i=1}^n (T_{(\cdot)} - T_{(i)})^3}{6 \bigl[\sum_{i=1}^n (T_{(\cdot)} - T_{(i)})^2\bigr]^{3/2}},$$

where $T_{(i)}$ is the leave-one-out estimate and $T_{(\cdot)} = n^{-1}\sum_i T_{(i)}$. The adjusted quantile level for $p \in \{\alpha/2,\ 1-\alpha/2\}$ is

$$\tilde p = \Phi\!\left(\hat z_0 + \frac{\hat z_0 + z_p}{1 - \hat a(\hat z_0 + z_p)}\right), \qquad z_p = \Phi^{-1}(p).$$

The BCa CI is $[\hat q^\ast_{\tilde{p}_\mathrm{lo}},\ \hat q^\ast_{\tilde{p}_\mathrm{hi}}]$, with the adjusted quantile levels substituted for $\alpha/2$ and $1-\alpha/2$.
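A sketch of Definition 5, assuming NumPy and SciPy for $\Phi$ and $\Phi^{-1}$; the jackknife loop recomputes the statistic on each leave-one-out sample, so it costs $n$ extra statistic evaluations.

```python
import numpy as np
from scipy.stats import norm

def bca_ci(x, stat, t_star, alpha=0.05):
    """BCa CI (Definition 5) from data x, statistic stat, and replicates t_star."""
    n, t_obs = len(x), stat(x)
    z0 = norm.ppf(np.mean(t_star < t_obs))                      # bias correction z_0
    jack = np.array([stat(np.delete(x, i)) for i in range(n)])  # leave-one-out estimates
    d = jack.mean() - jack                                      # T_(.) - T_(i)
    a = (d ** 3).sum() / (6 * ((d ** 2).sum()) ** 1.5)          # acceleration a_hat
    levels = [norm.cdf(z0 + (z0 + norm.ppf(p)) / (1 - a * (z0 + norm.ppf(p))))
              for p in (alpha / 2, 1 - alpha / 2)]              # adjusted quantile levels
    return tuple(np.quantile(t_star, levels))
```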

Definition 6 Studentized (bootstrap-$t$) CI

The studentized CI builds a pivot from the studentized statistic. For each outer bootstrap replicate, draw an inner bootstrap to estimate $\mathrm{SE}^\ast_b$, then form

$$T^\ast_b = \frac{T^{\ast(b)} - T_n}{\mathrm{SE}^\ast_b}.$$

Let $\hat t^\ast_p$ be the $p$-quantile of $\{T^\ast_b\}_{b=1}^B$. The CI is

$$\mathrm{CI}^{\mathrm{stu}}_{1-\alpha} = \bigl[T_n - \hat t^\ast_{1-\alpha/2} \cdot \widehat{\mathrm{SE}}(T_n),\ T_n - \hat t^\ast_{\alpha/2} \cdot \widehat{\mathrm{SE}}(T_n)\bigr],$$

where $\widehat{\mathrm{SE}}(T_n)$ is the analytic standard-error estimate on the observed sample. The inversion mimics Topic 19's $t$-CI construction — hence the name "bootstrap-$t$."
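A sketch of Definition 6, assuming NumPy. One simplification relative to the definition: the observed-sample SE here is taken from the outer replicates rather than from an analytic formula, which is the common fallback when no analytic SE is available.

```python
import numpy as np

def studentized_ci(x, stat, B=1000, B_inner=30, alpha=0.05, seed=0):
    """Bootstrap-t CI (Definition 6): studentize each outer replicate with an
    inner-bootstrap SE; the observed-sample SE is also bootstrap-based here."""
    rng = np.random.default_rng(seed)
    n, t_obs = len(x), stat(x)
    t_star = np.empty(B)
    z_star = np.empty(B)
    for b in range(B):
        xb = x[rng.integers(0, n, size=n)]                        # outer resample
        inner = np.array([stat(xb[rng.integers(0, n, size=n)])    # inner resamples
                          for _ in range(B_inner)])
        t_star[b] = stat(xb)
        z_star[b] = (t_star[b] - t_obs) / inner.std(ddof=1)       # studentized pivot
    se_obs = t_star.std(ddof=1)
    t_lo, t_hi = np.quantile(z_star, [alpha / 2, 1 - alpha / 2])
    return (t_obs - t_hi * se_obs, t_obs - t_lo * se_obs)
```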

The next theorem says all four constructions deliver asymptotically valid coverage. The sketch-proof uses Theorem 3 and the continuous-mapping theorem; §31.5’s Theorem 5 will refine the result by computing the coverage-error rate.

Theorem 4 Asymptotic validity of bootstrap CIs

Under the regularity of Theorem 3, each of the four CI constructions above achieves asymptotic coverage $1 - \alpha$ as $n \to \infty$: for each CI type $\mathrm{CI}^{(\bullet)}_{1-\alpha}$,

$$P\bigl(\theta \in \mathrm{CI}^{(\bullet)}_{1-\alpha}\bigr) \to 1 - \alpha.$$
Proof 3 sketch

The percentile and basic CIs follow directly from Theorem 3: the bootstrap empirical CDF $\hat F^\ast_B(\cdot)$ approximates the true sampling CDF of $T_n - \theta$ uniformly, so inverting at $\alpha/2$ and $1-\alpha/2$ recovers the asymptotic quantiles of that sampling distribution. The studentized CI adds one layer: Theorem 3 applied to the studentized statistic $Z_n = \sqrt{n}(T_n - \theta)/\hat\tau_n$ yields uniform approximation of its sampling CDF by the bootstrap analogue. The BCa CI uses the smoothness of the normal-quantile transform and $\hat z_0, \hat a \to 0$ in probability to show the adjusted quantiles $\tilde p$ converge to the target levels, and percentile consistency carries over.

$\blacksquare$ — using Theorem 3, the continuous mapping theorem, and Lehmann 1998 §7.3 for the studentized upgrade.

Coverage and length of five 95% confidence interval methods across sample sizes n in 20, 50, 100, 200, 500. Two rows: row one for sample mean of Normal(0,1); row two for sample mean of Student-t with 3 degrees of freedom. Five curves per panel: percentile, basic, BCa, studentized, and Wald-t. On the Normal row all methods converge to the nominal 95% by n=50. On the t-3 row the bootstrap methods track 95% while the Wald-t curve falls short.

Figure 4. Five CI methods compared across sample sizes. Top row: Normal$(0, 1)$ — all methods converge quickly to nominal coverage. Bottom row: Student-$t_3$ — the Wald-$t$ baseline degrades because its Normal-tail assumption is wrong, while the bootstrap methods retain validity because they don't make the assumption. The BCa method's slightly faster convergence at small $n$ previews the second-order accuracy §31.5 will prove.

Interactive component: Five 95 % CIs side by side — draw samples to watch how percentile, basic (Hall), BCa, studentized, and Wald-$t$ intervals cover the true mean. Controls: preset and $n$; a rolling table tracks each method's coverage. Nominal coverage 95 %, $B = 1000$, inner $B = 30$.

On right-skewed parents (Exp, Beta(2,5)), BCa and Studentized should track nominal coverage better than Percentile / Basic / Wald-t at small n — the second-order payoff. On Normal and t_3, all five converge to ≈ 95 % by n = 50. The Wald-t interval is symmetric around θ̂ regardless of the underlying distribution, which is its liability when the sampling distribution is asymmetric.

Example 6 Sample median CI, Exp(1) fixture

$X_1, \dots, X_{100} \sim \mathrm{Exp}(1)$; target $\theta = \log 2$, the population median. Four bootstrap CIs at $\alpha = 0.05$ with $B = 2000$: percentile $[0.52, 0.88]$, basic $[0.54, 0.90]$, BCa $[0.54, 0.91]$, studentized $[0.53, 0.92]$. All four cover $\log 2 \approx 0.693$; length differences are under 5 %. At $n = 30$ the gap between percentile and BCa widens — percentile falls below nominal, BCa holds — which is the small-sample signature of the skewness correction.

Example 7 Ratio of means, small $n$, skewed

Same ratio-of-means statistic as Example 3, but at $n = 30$. Wald CI $(0.43, 1.61)$ with 92 % coverage over 10 000 simulated datasets (under-coverage from delta-method bias). Bootstrap percentile $(0.51, 1.48)$, 94 %. BCa $(0.56, 1.52)$, 95 %. The BCa correction closes the gap exactly where the Wald method falls short.

Example 8 Correlation coefficient CI

The classical Fisher $z$-transform CI for the sample correlation works only under bivariate Normality — it fails badly outside that family. Bootstrap percentile on $\hat\rho_n$ directly (no transform, no Normal assumption) tracks nominal coverage for almost any joint distribution. The bootstrap won't rescue us from tiny $n$ or from near-perfect-correlation degeneracy, but it covers the "my data isn't bivariate Normal" case that kills the $z$-transform.

Example 9 Prediction-interval construction

The bootstrap's workhorse role in ML. Given a trained model $\hat m$ and test point $x_0$, the prediction $\hat y_0 = \hat m(x_0)$ has two sources of uncertainty: model-estimation uncertainty (how would $\hat m$ differ under a re-drawn training set?) and irreducible noise (how would $y_0$ differ given $\hat m$ fixed?). Bootstrap the training set to capture the first; bootstrap the residuals on a holdout to capture the second; the 2.5 % and 97.5 % percentiles of $\hat m^\ast(x_0) + \epsilon^\ast$ are a valid 95 % prediction interval under mild regularity. Conformal prediction's non-residual precursors all descend from this recipe.
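A hedged sketch of Example 9's recipe, with a plain linear model standing in for $\hat m$; for brevity it reuses in-sample residuals where the text prescribes holdout residuals, and all names are illustrative.

```python
import numpy as np

def bootstrap_prediction_interval(x, y, x0, B=2000, alpha=0.05, seed=0):
    """Example 9's recipe with a simple linear model: resample training pairs for
    model-estimation uncertainty, resample residuals for the irreducible noise."""
    rng = np.random.default_rng(seed)
    n = len(x)
    resid = y - np.polyval(np.polyfit(x, y, deg=1), x)   # in-sample residuals (see lead-in)
    preds = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)                 # bootstrap the training pairs
        coef_b = np.polyfit(x[idx], y[idx], deg=1)       # refit the model
        eps_b = resid[rng.integers(0, n)]                # one resampled noise draw
        preds[b] = np.polyval(coef_b, x0) + eps_b
    return tuple(np.quantile(preds, [alpha / 2, 1 - alpha / 2]))
```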

Remark 7 Which CI to pick in practice

When the statistic is roughly symmetric and the sample size is moderate, percentile is fine — it's cheapest and the interpretation is the most direct. When the statistic is skewed or the sample is moderately small, BCa is the right default. Studentized is the gold standard for refined small-sample inference but requires an SE estimator and an inner bootstrap, which can be expensive. Basic is a theoretical bridge more than a practical choice. The BootstrapCIComparator above shows all five side by side; picking the right one is mostly a matter of which small-sample distortion you're most worried about.

Remark 8 CI↔test duality

Topic 19 §19.7 established the CI↔test duality: inverting a family of level-$\alpha$ tests produces a $1-\alpha$ CI, and a $1-\alpha$ CI corresponds to a test that rejects $H_0: \theta = \theta_0$ if $\theta_0 \notin \mathrm{CI}$. The duality survives the nonparametric setting: every bootstrap CI induces a bootstrap test of the same form, and §31.6 develops the test-first framing explicitly. The BootstrapCIComparator's coverage panel can be read as a size-calibration panel for the dual tests — a 95 % CI covers under the null 95 % of the time iff the dual test has size 5 %.

31.5 BCa second-order accuracy

Theorem 4 says all four CIs cover asymptotically. But how fast? The question matters in practice — a CI with $O(n^{-1/2})$ coverage error can be off by 5–10 percentage points at $n = 50$, while one with $O(n^{-1})$ error is typically off by under one percentage point. Hall's theorem, specialized to the BCa and studentized adjustments, gives the sharp rate.

Theorem 5 Hall 1992 second-order accuracy

Under smooth-functional regularity (two-term Edgeworth expansion valid for $T_n$ and its studentized version), the percentile and basic bootstrap CIs have coverage error $O(n^{-1/2})$, while the BCa and studentized CIs achieve coverage error $O(n^{-1})$.

Proof 4 sketch

Denote the studentized statistic $Z_n = \sqrt{n}(T_n - \theta)/\hat\tau_n$. Under smoothness (Cramér conditions on the underlying distribution, finite third moment, smooth functional), $Z_n$ admits a one-term Edgeworth expansion

$$P(Z_n \le z) = \Phi(z) + n^{-1/2} p_1(z) \phi(z) + O(n^{-1}),$$

where $p_1$ is an even polynomial depending on the cumulants of $T_n$. The bootstrap studentized statistic $Z^\ast_n$ admits the analogous conditional expansion; Singh 1981 / Hall 1992 show that $p_1^\ast - p_1 = O_P(n^{-1/2})$ under plug-in moment estimation, yielding second-order agreement:

$$\sup_z \bigl|P(Z^\ast_n \le z \mid \text{data}) - P(Z_n \le z)\bigr| = O_P(n^{-1}).$$

Inverting the studentized pivot at $z_{\alpha/2}$ gives a studentized CI with $O(n^{-1})$ coverage error, matching the Edgeworth-correction order.

The percentile CI inverts the wrong pivot — the raw $\sqrt{n}(T_n - \theta)$ rather than its studentized version — so its coverage-error expansion retains an uncorrected $n^{-1/2}$ term from skewness; the basic CI shares this defect. The BCa adjustment estimates the bias correction $\hat z_0 = \Phi^{-1}(\hat F^\ast(T_n))$ and acceleration $\hat a$ (a jackknife skewness estimator) precisely to cancel this skewness term; Hall 1992 Ch. 3 verifies the cancellation algebraically.

$\blacksquare$ — using Hall 1992 Thm 3.2, Singh 1981, and the Edgeworth machinery of Topic 11 §11.7.

Log-log plot of coverage error versus sample size n in 20, 50, 100, 200, 500, 1000 for two CI methods. Percentile curve declines with slope approximately minus one half. BCa curve declines with slope approximately minus one. Statistic is sample mean on a right-skewed exponential-translated distribution.

Figure 5. Numerical verification of Theorem 5. Log-log plot of coverage error vs. $n$ for the percentile CI (slope $\approx -1/2$) and BCa CI (slope $\approx -1$) on a sample-mean statistic from a right-skewed, exponential-translated distribution. The BCa line is roughly one decade below the percentile line throughout — the practical cash-out of second-order vs. first-order accuracy.

31.6 Bootstrap hypothesis tests

By the CI↔test duality, every bootstrap CI already defines a bootstrap test. But the test-first perspective exposes a subtle issue: to test $H_0: \theta = \theta_0$, we need the sampling distribution under the null, not under the observed data. Define the bootstrap test properly, then state its size-control theorem, then work examples.

Definition 7 Bootstrap hypothesis test

To test $H_0: \theta(F) = \theta_0$ against $H_1: \theta(F) \neq \theta_0$, construct a transformation $F^{(0)}_n$ of the empirical distribution under which $\theta(F^{(0)}_n) = \theta_0$ exactly — typically by shifting, re-centring, or re-scaling $F_n$. Draw $B$ resamples from $F^{(0)}_n$, compute the test statistic $T^{\ast(b)}$ on each, and reject $H_0$ at level $\alpha$ if the observed $T_n$ exceeds the $1 - \alpha$ quantile of the null-resample distribution. One-sided and other alternatives adapt the rejection region accordingly; the $p$-value is $B^{-1}\sum_b \mathbf{1}\{T^{\ast(b)} \ge T_n\}$, plus continuity corrections.

Theorem 6 Bootstrap test size control (stated)

Under the regularity of Theorem 3 plus uniform-in-$\theta_0$ Edgeworth validity, the bootstrap test defined above has size $\alpha + O(n^{-1/2})$ (first-order accurate) or $\alpha + O(n^{-1})$ (second-order, when paired with studentized or BCa-adjusted test statistics). Power matches the parametric competitor's to leading order; the bootstrap's advantage shows up when the parametric assumption is violated.

Size and power curves for three tests: bootstrap t-test, permutation test, and parametric t-test. Two-sample setup with n1 = n2 = 30; alternative values mu1 minus mu2 in negative one, negative a half, zero, a half, and one. Bootstrap and permutation curves track each other; parametric t-test curve dips below the others at the large-alternative extremes.

Figure 6. Two-sample test comparison: bootstrap-$t$, permutation, and parametric $t$-test, $n_1 = n_2 = 30$. Size (centre point at $\mu_1 - \mu_2 = 0$) is controlled at 0.05 for all three. Power matches for moderate alternatives; the parametric $t$-test loses ground at the extremes where its Normal-tail assumption breaks.

Example 10 Null-resample construction for a difference of means

Two-sample test of $H_0: \mu_X = \mu_Y$ against $H_1: \mu_X \neq \mu_Y$. Under the null the two populations share a common mean, estimated by the pooled value $\hat\mu = (\bar X_n + \bar Y_m)/2$ for equal sample sizes (or the sample-size-weighted pooled mean for unequal $n, m$). Centre both samples to this common mean: $\tilde X_i = X_i - \bar X_n + \hat\mu$, $\tilde Y_j = Y_j - \bar Y_m + \hat\mu$. Draw bootstrap resamples from $\tilde X$ and $\tilde Y$ separately; compute the difference of resample means; compare to the observed $\bar X_n - \bar Y_m$. This is the bootstrap analogue of the permutation test's label-shuffling — same null calibration, different mechanics.
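A sketch of Example 10's recentring construction, assuming NumPy; the pooled mean is sample-size weighted so the same code covers equal and unequal $n, m$, and the add-one continuity correction on the $p$-value is one conventional choice.

```python
import numpy as np

def bootstrap_two_sample_pvalue(x, y, B=10_000, seed=0):
    """Null-resample bootstrap test of H0: mu_X = mu_Y via Example 10's recentring."""
    rng = np.random.default_rng(seed)
    n, m = len(x), len(y)
    mu_pool = (n * x.mean() + m * y.mean()) / (n + m)    # sample-size-weighted pooled mean
    x0 = x - x.mean() + mu_pool                          # recentred samples satisfy H0
    y0 = y - y.mean() + mu_pool
    t_obs = abs(x.mean() - y.mean())
    t_null = np.abs(x0[rng.integers(0, n, size=(B, n))].mean(axis=1)
                    - y0[rng.integers(0, m, size=(B, m))].mean(axis=1))
    return (1 + np.sum(t_null >= t_obs)) / (B + 1)       # add-one p-value
```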

Example 11 A/B test significance

Conversion rates $X \in \{0, 1\}$ with $n_A = n_B = 5000$, observed rates $0.043$ vs. $0.051$. Parametric $z$-test $p$-value: $0.041$. Bootstrap-$t$ null-resample $p$-value: $0.038$ ($B = 10\,000$). The two agree to within MC error because $n$ is large and the Normal approximation is excellent. Now imagine $X$ is a heavy-tailed revenue-per-user metric — the $z$-test's Normal approximation is wrong and the parametric $p$-value drifts; the bootstrap-$t$ tracks the truth because it doesn't assume Normality.

Remark 9 Bootstrap tests vs. permutation tests

Permutation tests are the classical distribution-free alternative. They’re exact (no asymptotic error) but require exchangeability under the null — often a stronger assumption than what bootstrap tests need. Bootstrap tests are asymptotic but more flexible: they handle non-exchangeable nulls (e.g., paired-design settings with different variances), and they generalize to multi-parameter nulls where permutation doesn’t naturally apply. In the two-sample equal-variance case, the two are equivalent to leading order.

Remark 10 A/B testing with non-Normal metrics

Revenue-per-user, time-on-site, and other heavy-tailed ML metrics routinely violate the Normal-tail assumption that the standard $z$- or $t$-test relies on. Bootstrap tests are the production-grade answer: they give valid $p$-values under the actual metric distribution without requiring the analyst to specify what that distribution is. Every A/B-testing platform that reports a "robust significance" score is running a variant of this construction.

31.7 Parametric bootstrap

When we do have a parametric model — even a misspecified one — we can resample from the fitted model instead of from $F_n$. The parametric bootstrap has lower MC variance at small $n$ (no ties, no discreteness in the resample) but inherits the model's correctness or incorrectness.

Definition 8 Parametric bootstrap

Fit a parametric family $\{F_\theta : \theta \in \Theta\}$ to the observed sample via MLE or another estimator, obtaining $\hat\theta_n$. Draw bootstrap resamples $X^\ast_1, \dots, X^\ast_n$ iid from $F_{\hat\theta_n}$ — not from $F_n$. Compute the statistic on each resample and proceed as in the nonparametric bootstrap.
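A sketch of Definition 8 in the Normal-location setting of Example 12, assuming NumPy; swap the fitted-parameter line and the rng.normal call for any other family.

```python
import numpy as np

def parametric_bootstrap_se(x, stat, B=2000, seed=0):
    """Parametric bootstrap SE under a fitted Normal model (the Example 12 setting):
    resample from N(mu_hat, sigma_hat^2) rather than from the empirical F_n."""
    rng = np.random.default_rng(seed)
    mu_hat, sigma_hat = x.mean(), x.std(ddof=0)          # Normal-family MLEs
    t_star = np.array([stat(rng.normal(mu_hat, sigma_hat, size=len(x)))
                       for _ in range(B)])
    return t_star.std(ddof=1)
```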

Theorem 7 Parametric bootstrap consistency (stated)

Under the regularity of Topic 14 (parametric MLE consistency) plus Topic 11's Edgeworth validity, the parametric bootstrap distribution $H^{\ast,\mathrm{par}}_n$ converges in Kolmogorov distance to the true sampling distribution $H_n$ at rate $O(n^{-1/2})$, matching the nonparametric bootstrap in first-order accuracy. When the parametric model is correctly specified, the parametric bootstrap achieves the same second-order accuracy as the studentized or BCa versions without needing the studentization step.

Example 12 Normal-location bootstrap, right vs. wrong model

Data: $n = 100$ iid from Student-$t_3$ (heavy tails). Parametric bootstrap under a Normal model: fit $\hat\mu_n = \bar X_n$, $\hat\sigma^2_n = n^{-1}\sum_i (X_i - \bar X_n)^2$; resample from $\mathcal{N}(\hat\mu_n, \hat\sigma^2_n)$. The resulting CI on $\mu$ is essentially the Wald interval $\bar X_n \pm 1.96\,\hat\sigma_n / \sqrt{n}$ — and it under-covers because the Normal tails are wrong. The nonparametric bootstrap on the same data gives an interval that correctly includes the $t_3$ tail contribution; its coverage matches nominal within MC error. Misspecification matters for the parametric bootstrap; the nonparametric bootstrap is immune.

Remark 11 Parametric bootstrap in ML

When a neural network or probabilistic model supplies an explicit likelihood $p(y \mid x; \theta)$, the parametric bootstrap is the natural uncertainty-quantification tool: sample training data from the fitted model, refit, observe variability. For a correctly specified generative model this recovers posterior-like uncertainty without running MCMC. When the generative model is wrong, so is the bootstrap uncertainty — which is why the nonparametric bootstrap remains the honest default for model-free uncertainty in ML.

Remark 12 Hybrid (semiparametric) variants

Residual bootstrap for regression is the canonical hybrid: parametric for the mean function (linear regression’s fitted line), nonparametric for the error distribution (resample residuals with replacement). Wild bootstrap extends this to heteroscedastic errors. These variants are deferred to §31.10’s forward-pointing remarks — all live under the same §29–§31 resampling framework but with different assumptions on which part of the model is fitted vs. empirical.

31.8 Smooth bootstrap (Topic 30 bridge)

The nonparametric bootstrap resamples from $F_n$ — a discrete distribution on the sample points. For statistics that depend smoothly on $F$ (the mean, most moments), the discreteness is invisible. For statistics that depend on continuity (the median, any quantile, density-ratio estimators), the discreteness causes lattice artifacts: a bootstrap median must equal one of the observed sample points (for odd $n$), so its distribution is supported on at most $n$ atoms. The smooth bootstrap fixes the artifact by resampling from $\hat f_h$ — the KDE from Topic 30 — instead of from $F_n$.

Definition 9 Smooth bootstrap (Silverman–Young 1987)

Let $\hat f_h$ be the Gaussian-kernel KDE from Topic 30 with bandwidth $h$ (typically $h$ = Silverman's rule from Topic 30 §30.9). The smooth bootstrap resamples from $\hat f_h$ rather than from $F_n$:

$$X^\ast_i = X_{J_i} + h Z_i, \qquad J_i \sim \mathrm{Uniform}\{1, \dots, n\}, \quad Z_i \sim \mathcal{N}(0, 1), \quad i = 1, \dots, n.$$

The $J_i$ pick a sample point and the Gaussian $h Z_i$ adds a small kernel-shaped jitter. The resulting sample is iid from $\hat f_h$ by construction.
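A sketch of Definition 9, assuming NumPy; Silverman's rule-of-thumb bandwidth is used as the default, matching Remark 13's recommendation, and the fixture below is illustrative.

```python
import numpy as np

def smooth_bootstrap_replicates(x, stat, B=2000, h=None, seed=0):
    """Smooth bootstrap (Definition 9): resample from the Gaussian KDE f_hat_h."""
    rng = np.random.default_rng(seed)
    n = len(x)
    if h is None:
        h = 1.06 * x.std(ddof=1) * n ** (-1 / 5)        # Silverman's rule of thumb
    picks = x[rng.integers(0, n, size=(B, n))]          # choose sample points ...
    jitter = h * rng.normal(size=(B, n))                # ... and add kernel-shaped noise
    return np.array([stat(row) for row in picks + jitter])

x = np.random.default_rng(3).normal(size=50)
med_star = smooth_bootstrap_replicates(x, np.median)
print("smooth-bootstrap SE of the median:", med_star.std(ddof=1))
```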

Theorem 8 Smooth-bootstrap consistency (stated)

Under the regularity of Topic 30 Thm 6 (pointwise KDE consistency) plus Theorem 3's moment condition, the smooth-bootstrap sampling distribution converges to the true sampling distribution in Kolmogorov distance, at the same $O(n^{-1/2})$ rate as the nonparametric bootstrap. For density-functional statistics (median, quantiles, integrals of $f$), the smooth bootstrap strictly dominates: its MC variance is smaller at every $B$, and the $O(n^{-1/2})$ rate constant is smaller too.

Two histograms of B = 2000 bootstrap medians on the same Normal(0,1) sample of size n = 50. The naive histogram (resampling from F_n) shows spikes at the original sample points; the smooth histogram is continuous. True sampling-distribution reference curve (from high-precision Monte Carlo) overlaid on both; smooth histogram matches the reference closely, naive histogram approximates it piecewise.

Figure 7. Naïve vs. smooth bootstrap of the sample median, $n = 50$, $B = 2000$. The naïve histogram's spikes are the observed sample points — the median of any resample must be one of them. The smooth bootstrap fills the gaps with kernel jitter and recovers the true sampling-distribution shape. The smooth SE is roughly 3 % smaller than the naïve SE at this $n$; the gap grows at smaller $n$.

Interactive component: Smooth bootstrap of the median — a naïve resample of the median is supported on at most $n$ points; Gaussian jitter of bandwidth $h$ smooths it. Controls: preset and sample size $n$; panels show the bootstrap distribution of the median (naïve vs. smooth, $B = 2000$) and the bandwidth sensitivity of the smooth-bootstrap SE.
Silverman's rule (h ≈ 0.419) is the vertical dashed line. SE is nearly flat across ~½× to 2× that choice — smooth bootstrap is forgiving of mild bandwidth mis-specification.

Naïve bootstrap of the median can only return one of the observed sample points, so its histogram is spiky. Smooth bootstrap resamples from the Gaussian KDE f̂_h instead — the jitter fills in the gaps. The two SE estimates agree to leading order for n large; the visible difference at small n is smooth bootstrap fixing the discreteness artifact.

Example 13 Simultaneous envelopes via smooth bootstrap

Topic 30 §30.5 Rem 15 promised the smooth bootstrap as a simultaneous-envelope tool. Given $\hat f_h$, draw $B$ smooth-bootstrap samples; on each, compute $\hat f^\ast_h$; form the pointwise 2.5 % and 97.5 % envelopes across the $B$ replicates. Under mild regularity this recovers a valid 95 % uniform confidence band for $f$ — the KDE analogue of the DKW band from Topic 29 §29.5. For the median fixture above, the envelope construction on a moderate-$n$ sample gives a smoothly varying band that the naïve bootstrap can't produce, because the naïve bootstrap places mass only at sample points.

Remark 13 Bandwidth as a tuning knob

The smooth bootstrap introduces the bandwidth $h$ as a new free parameter. Topic 30's data-driven selectors — Silverman (§30.9), Scott (§30.9), Sheather–Jones — all carry over. Silverman's rule is the practical default; the SE is nearly bandwidth-invariant over the range $0.5\,h_\mathrm{Silverman}$ to $2\,h_\mathrm{Silverman}$ (the right panel of SmoothBootstrapDemo shows this). Under-smoothing ($h \to 0$) recovers the naïve bootstrap; over-smoothing ($h$ large) over-regularizes and biases the smooth-bootstrap SE upward.

Remark 14 Kernel-based uncertainty for nonparametric regressors

Nadaraya–Watson and local-polynomial regressors (forward to formalML) are kernel-weighted averages of $Y_i$. Their prediction intervals are the smooth-bootstrap generalization of §31.4's prediction intervals: smooth-bootstrap the covariates $X_i$, refit the regressor on each bootstrap sample, and take the pointwise percentile envelopes of the predictions. The construction is the Topic 30 → Topic 31 bridge made fully operational for the regression setting.

31.9 Bias correction

Statisticians think of bias as a nuisance to estimate and correct. The bootstrap gives both in one pass.

Definition 10 Bootstrap bias and bias correction

The bootstrap bias estimator for $T_n$ as an estimator of $\theta$ is

$$\widehat{\mathrm{bias}}^\ast(T_n) = \bar T^\ast - T_n, \qquad \bar T^\ast = \frac{1}{B}\sum_{b=1}^B T^{\ast(b)}.$$

The bias-corrected estimator is

$$\tilde T_n = T_n - \widehat{\mathrm{bias}}^\ast(T_n) = 2 T_n - \bar T^\ast.$$

In words: reflect the bootstrap mean around the observed statistic. The motivation is the decomposition $T_n - \theta = (T_n - E[T_n]) + \mathrm{bias}(T_n)$; subtracting the bootstrap-estimated bias gives a reduced-bias estimator at the cost of slightly inflated variance.
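A sketch of Definition 10, assuming NumPy; the usage lines preview Example 14 below by bias-correcting the plug-in variance, and the seeds are arbitrary.

```python
import numpy as np

def bias_corrected(x, stat, B=2000, seed=0):
    """Bootstrap bias estimate and bias-corrected estimator (Definition 10)."""
    rng = np.random.default_rng(seed)
    n, t_obs = len(x), stat(x)
    t_star = np.array([stat(x[rng.integers(0, n, size=n)]) for _ in range(B)])
    bias_hat = t_star.mean() - t_obs
    return t_obs - bias_hat, bias_hat                   # (2*T_n - mean(T*), bias estimate)

# Preview of Example 14: the plug-in variance has bias -sigma^2/n; the bootstrap finds it.
x = np.random.default_rng(5).normal(size=30)
corrected, bias_hat = bias_corrected(x, lambda s: s.var(ddof=0), B=5000)
print("bootstrap bias estimate :", bias_hat)            # ~ -sigma_hat^2 / 30
print("bias-corrected variance :", corrected, " vs. Bessel:", x.var(ddof=1))
```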

Theorem 9 Bias-correction variance trade-off (stated)

Under the regularity of Theorem 3, the bias-corrected estimator satisfies $\mathrm{bias}(\tilde T_n) = O(n^{-2})$ — one order better than the original $T_n$'s $O(n^{-1})$ bias. The variance inflation is $\mathrm{Var}(\tilde T_n) = \mathrm{Var}(T_n)\,(1 + O(n^{-1}))$, negligible at moderate $n$. MSE-wise, bias correction dominates when the leading-order bias is large relative to the SE — typically at small $n$ or for highly nonlinear statistics.

MSE of two estimators of a ratio-of-means parameter across sample sizes n in 20, 50, 100, 200, 500. Top curve: naive ratio estimator. Bottom curve: bias-corrected ratio estimator. Both curves decay at roughly the same rate but the bias-corrected curve is uniformly lower, with the gap narrowing as n grows.

Figure 8. Bias correction in action: MSE of the naïve vs. bias-corrected ratio-of-means estimator on Exponential$(1)$-by-Exponential$(1)$ fixtures. The bias-corrected estimator's MSE is strictly lower at every $n$, with the gap shrinking as $n$ grows. At $n = 20$ the correction is worth a factor of $1.4$ in MSE; by $n = 200$ the two are within 5 %.

Example 14 Bias correction for the sample variance

The sample variance $\hat\sigma^2_n = n^{-1}\sum_i (X_i - \bar X_n)^2$ has bias $-\sigma^2/n$ relative to the population variance $\sigma^2$. The Bessel-corrected $s^2_n = \hat\sigma^2_n \cdot n/(n-1)$ fixes this analytically — a well-known first-order correction. The bootstrap bias correction recovers the same fix without the analytic insight: $\widehat{\mathrm{bias}}^\ast(\hat\sigma^2_n) = -\hat\sigma^2_n / n + O_P(n^{-3/2})$, matching the analytic bias. Subtracting gives $\tilde\sigma^2_n = \hat\sigma^2_n (1 + 1/n) + O_P(n^{-3/2})$, first-order equivalent to $s^2_n$. The bootstrap recovered Bessel's correction by Monte Carlo.

Remark 15 When not to bias-correct

Bias correction inflates variance. At small $n$ the bias dominates and the correction helps; at large $n$ the bias is negligible and the variance inflation hurts. A rule of thumb: bias-correct when $|\widehat{\mathrm{bias}}^\ast(T_n)| > \widehat{\mathrm{SE}}^\ast(T_n) / 4$. Below that threshold, leave $T_n$ alone.

Remark 16 Debiasing cross-validated risk

Cross-validation is known to underestimate test risk by the optimism: the gap between the training-set risk and the test-set risk. Bootstrap bias-correction of the CV risk is the standard debiasing technique — resample the training set, recompute CV on each resample, and use the difference $\bar{\mathrm{CV}}^\ast - \mathrm{CV}_{\mathrm{obs}}$ as the optimism estimate. The .632+ bootstrap (Efron–Tibshirani 1997) refines this with a weighted combination of resubstitution and out-of-bag risk; it's a direct descendant of the bias-correction construction above.

31.10 Scope boundaries & Track 8 spine

Five remarks close out the topic. Each names an important variant the bootstrap world has produced over the decades, and marks it for forward treatment in the formalML track. No derivations — the point is to orient.

Remark 17 Out of scope: block bootstrap for dependent data

Künsch 1989 extended the bootstrap to stationary time-series data by resampling blocks of consecutive observations instead of individual points. The block length $\ell$ is a new tuning parameter — too small and the autocorrelation structure is lost; too large and the number of blocks is too small for MC convergence. Variants: overlapping blocks (Künsch), moving blocks, circular blocks, and the stationary bootstrap (Politis–Romano). The entire family lives in the dependent-observations regime that Topic 31 excluded; formalML's time-series inference chapters will treat it in full.

Remark 18 Out of scope: subsampling

Politis–Romano 1994 showed that subsampling — resampling without replacement, at a size $m < n$ — achieves asymptotic validity under milder conditions than the bootstrap. Where the bootstrap requires $E[X_1^2] < \infty$ for the sample mean, subsampling gets by with much weaker tail conditions. The trade-off is a smaller effective sample size $m$ and a tuning choice for $m$. Subsampling is the right tool for genuinely heavy-tailed distributions where the bootstrap can fail; Topic 31 assumes the moment conditions hold, so subsampling is a §31.10 footnote rather than a §31.x section.

Remark 19 Out of scope: Bayesian bootstrap

Rubin 1981 replaced the bootstrap's multinomial resample weights $\{1/n\}$ with Dirichlet-distributed random weights, producing a Bayesian-flavoured construction that behaves asymptotically like the nonparametric bootstrap but has a posterior-like interpretation. The full treatment belongs with Track 7's Dirichlet-process machinery — the Bayesian bootstrap is the Dirichlet-process posterior for the special case of a vague Dirichlet prior.

Remark 20 Out of scope: wild / residual bootstrap for regression

Residual bootstrap is the regression-specific variant: fit the regression, compute residuals, resample residuals with replacement, and add back to the fitted mean to get new response values. Wild bootstrap generalizes to heteroscedastic errors by rescaling each residual by an independent mean-zero multiplier. Both are indispensable for valid inference on regression coefficients in misspecified settings — and both are deferred to formalML’s regression-inference chapters because they require the regression machinery Topic 31 deliberately didn’t build.

Remark 21 Track 8 spine — 3 of 4

Topic 29 built the empirical CDF machinery. Topic 30 smoothed it into densities. Topic 31 — this topic — resampled from it. Topic 32 closes the track by embedding all three into the empirical-process framework: the centred and scaled $F_n$ converges to a sample-path-continuous limit process (the Brownian bridge), $\hat f_h$ becomes a smoothed version of that process, and the bootstrap becomes a resampling operation inside a function space. Donsker's theorem is the functional CLT that unifies the three; the bootstrap-consistency proof of §31.3 is a finite-dimensional shadow of Donsker. The empirical-process chapters of Topic 32 are the on-ramp to the uniform convergence and stochastic equicontinuity that underwrite modern high-dimensional statistics.

Horizontal spine figure with four topic markers labelled 29, 30, 31, 32. Topics 29 and 30 marked with checkmarks; topic 31 highlighted as the current topic with a filled marker; topic 32 shown as forthcoming.

Figure 9. Track 8 spine, updated. Topics 29 (ECDF & order statistics) and 30 (kernel density estimation) are published; Topic 31 (the bootstrap — this topic) is newly published; Topic 32 (empirical processes) is forthcoming and will close the curriculum. Together the four topics build a complete nonparametric-inference toolkit anchored in the empirical distribution.


References

  1. Efron, Bradley. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics, 7(1), 1–26.
  2. Bickel, Peter J., and David A. Freedman. (1981). Some Asymptotic Theory for the Bootstrap. The Annals of Statistics, 9(6), 1196–1217.
  3. Singh, Kesar. (1981). On the Asymptotic Accuracy of Efron’s Bootstrap. The Annals of Statistics, 9(6), 1187–1195.
  4. Efron, Bradley. (1987). Better Bootstrap Confidence Intervals. Journal of the American Statistical Association, 82(397), 171–185.
  5. Efron, Bradley, and Robert J. Tibshirani. (1993). An Introduction to the Bootstrap. Chapman & Hall/CRC.
  6. Hall, Peter. (1992). The Bootstrap and Edgeworth Expansion. Springer.
  7. Silverman, Bernard W., and G. Alastair Young. (1987). The Bootstrap: To Smooth or Not to Smooth? Biometrika, 74(3), 469–479.
  8. Politis, Dimitris N., and Joseph P. Romano. (1994). Large Sample Confidence Regions Based on Subsamples under Minimal Assumptions. The Annals of Statistics, 22(4), 2031–2050.
  9. Rubin, Donald B. (1981). The Bayesian Bootstrap. The Annals of Statistics, 9(1), 130–134.
  10. Künsch, Hans R. (1989). The Jackknife and the Bootstrap for General Stationary Observations. The Annals of Statistics, 17(3), 1217–1241.
  11. van der Vaart, Aad W. (2000). Asymptotic Statistics. Cambridge University Press.
  12. Serfling, Robert J. (1980). Approximation Theorems of Mathematical Statistics. Wiley.
  13. Lehmann, Erich L. (1998). Elements of Large-Sample Theory. Springer.