intermediate 55 min read · April 18, 2026

Confidence Intervals & Duality

Every family of level-α tests is a (1−α) confidence procedure, and every confidence procedure is a family of tests. The z, t, χ², F, Wald, Score, LRT, Wilson, Clopper–Pearson, and profile-likelihood intervals are all instances of one pattern: test inversion.

19.1 What a Confidence Interval Is (and Is Not)

Imagine running the same experiment a hundred times. Each run produces a data sample $X^{(i)}$ and, from it, an interval $C(X^{(i)}) = [L_i, U_i]$. If the interval procedure has 95% coverage, then roughly ninety-five of those hundred intervals — not any particular one — contain the true parameter. That’s the claim a confidence interval makes. It is not the claim that the parameter has 95% probability of lying in the particular interval you happened to compute.

The distinction is the single most important idea in this topic. It is also the one most often mangled in practice, so we stage it carefully before building machinery on top of it.

One hundred simulated 95% z-CIs from iid Normal samples. Roughly five of the intervals miss the true μ — the procedure delivers on its 95% promise over the long run, but any individual interval either contains μ or it doesn't. Right panel: the Bayesian posterior for one sample — a different probabilistic object (distribution over μ given data), with a credible interval built from its quantiles.
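The repeated-sampling picture is easy to reproduce. The sketch below uses illustrative values (true μ = 0, σ = 1, n = 25 — not from the text) and counts how often the known-σ z-interval catches μ over many replications.

```python
# Minimal sketch of the repeated-sampling claim: build many 95% z-CIs
# from iid Normal samples and count how often they catch the true mu.
# mu, sigma, n, and the number of runs are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 0.0, 1.0, 25
alpha, runs = 0.05, 100_000
z = 1.959963984540054          # z_{alpha/2} for alpha = 0.05

samples = rng.normal(mu, sigma, size=(runs, n))
xbar = samples.mean(axis=1)
half = z * sigma / np.sqrt(n)  # known-sigma z half-width
covered = (xbar - half <= mu) & (mu <= xbar + half)
print(f"empirical coverage: {covered.mean():.3f}")  # close to 0.95
```

Any single interval either contains μ or it doesn’t; only the long-run fraction is pinned down.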

Definition 1 Confidence procedure

Let $\{P_\theta : \theta \in \Theta\}$ be a parametric family. A $(1-\alpha)$-confidence procedure for $\theta$ is a data-measurable set-valued map $C : \mathcal{X} \to 2^\Theta$ — usually written $X \mapsto C(X)$ — satisfying

$$P_\theta(\theta \in C(X)) \ge 1 - \alpha \qquad \text{for every } \theta \in \Theta.$$

When $C(X) = [L(X), U(X)]$ is an interval, we call it a $(1-\alpha)$ confidence interval. The probability $P_\theta(\theta \in C(X))$ is the coverage of the procedure at parameter value $\theta$.

Definition 2 Nominal vs actual coverage

The quantity $1 - \alpha$ is the nominal coverage — the level the procedure advertises. The function $\theta \mapsto P_\theta(\theta \in C(X))$ is the actual coverage, and it depends on both $\theta$ and the sample size $n$.

A procedure is exact if actual coverage equals nominal at every $\theta$; anti-conservative (or liberal) if actual coverage is below nominal at some $\theta$; conservative if actual coverage is above nominal at every $\theta$.

The three adjectives name the three failure modes — and the coverage calibration of §19.8 is the diagnostic for which mode a given procedure falls into.

Remark 1 Historical origin — Neyman 1937

The confidence-interval concept is due to Neyman (1937). Neyman was reacting to Fisher’s fiducial distribution framework, which had produced confusing results in multi-parameter problems. His innovation was to strip probability statements of all posterior-like interpretation and define them purely as frequency properties of the procedure. This is why the probability $P_\theta(\theta \in C(X))$ in Definition 1 is indexed by a fixed $\theta$ — the randomness sits entirely in $X$ and therefore in $C(X)$. The true parameter $\theta$ is a constant, not a random variable.

The philosophical move was radical enough that it took decades to settle into standard teaching. Fiducial intervals lingered in some applied literature until the 1960s. Bayesian credible intervals, built from a genuine posterior distribution over $\theta$, solve the “what is the probability” question differently and are the subject of Track 7 — not a competing frequentist procedure but a different inference paradigm (Rem 3).

Remark 2 The '95% probability' trap — the #1 pedagogical anchor

Read Definition 1 again. The probability statement is $P_\theta(\theta \in C(X)) \ge 1 - \alpha$, where the subscript tells us that $\theta$ is held fixed and $X$ is the random variable. So the event $\{\theta \in C(X)\}$ is an event about $X$ (whether the random interval catches the fixed true value), not an event about $\theta$.

Once the data are in hand — say $C(X) = [0.20, 0.34]$ — the parameter $\theta$ either lies in $[0.20, 0.34]$ or it doesn’t. The frequentist framework does not assign a probability to that specific fact. What it says is: “The procedure I just used generates intervals that catch the true parameter 95 times out of 100 on repeated experiments.” That’s a guarantee about the procedure’s long-run error rate, not a posterior probability on any one output.

A careful locution: “I am 95% confident that $\theta$ lies in $[0.20, 0.34]$” is a statement about the procedure’s reliability, not about the probability of this one interval being right. Every introductory statistics textbook says this at least once; the fact that practitioners continue to slip into the posterior-probability reading anyway is why we belabor the point here. Every result in the rest of this topic lives or dies with this distinction — in particular, the coverage diagnostics of §19.8 are meaningful only if “coverage” means the procedure’s long-run error rate.

Remark 3 Frequentist coverage vs Bayesian posterior credibility

A Bayesian credible interval starts from a posterior distribution $\pi(\theta \mid X)$ — the distribution of $\theta$ given the observed data — and defines a $(1-\alpha)$ credible interval as any set $C$ with $\pi(\theta \in C \mid X) = 1 - \alpha$. Here $\theta$ is treated as a random variable (with a prior that gets updated to a posterior), so the posterior probability of a specific interval containing $\theta$ makes sense directly.

Frequentist coverage and Bayesian credibility answer different questions about different objects. A frequentist CI guarantees a long-run error rate over repeated experiments but says nothing about this specific dataset’s θ\theta. A Bayesian credible interval gives a probability for this specific dataset’s θ\theta but depends on the chosen prior (with different priors giving different intervals). Neither is “right” — they’re answering different questions — but confusing them is the source of the “95% probability” trap in Remark 2.

Under a flat (improper) prior for a Normal mean with known variance, the two intervals coincide numerically: $\bar X \pm z_{\alpha/2}\,\sigma/\sqrt n$ is both the 95% frequentist CI and the 95% credible interval. This numerical coincidence is the reason the confusion is so persistent — and why the distinction matters more the further one moves from the symmetric Normal case. Topic 25 develops the Bayesian perspective, including credible intervals and the flat-prior coincidence with z-CIs for Normal means; here we stay frequentist.


19.2 The Test–CI Duality Theorem

Every hypothesis test at level $\alpha$ generates a $(1-\alpha)$ confidence set, and every confidence set generates a family of hypothesis tests. The correspondence is exact — the two procedures carry the same information, just organized around different questions. This is the organizing principle of Topic 19, and it makes every CI construction in this topic a test inversion in disguise.

Two-plane diagram of the duality. The shaded region is the joint acceptance set of (θ₀, T) pairs where the test at θ₀ does not reject at data T. Horizontal slicing at fixed T_obs (left) gives the CI — the set of θ₀ the data does not reject. Vertical slicing at fixed θ₀ (right) gives the test's acceptance region — the set of T values the test at θ₀ does not reject. One object, two slicings.

Theorem 1 Test–CI duality

Fix $\alpha \in (0, 1)$ and a parametric family $\{P_\theta : \theta \in \Theta\}$. Suppose that for every $\theta_0 \in \Theta$ we have a level-$\alpha$ test $\varphi_{\theta_0}(X) \in \{0, 1\}$ of $H_0: \theta = \theta_0$ (reject iff $\varphi_{\theta_0}(X) = 1$), so that $P_{\theta_0}(\varphi_{\theta_0}(X) = 1) \le \alpha$. Define

$$C(X) = \{\theta_0 \in \Theta : \varphi_{\theta_0}(X) = 0\}.$$

Then $C(X)$ is a $(1-\alpha)$ confidence set for $\theta$: $P_\theta(\theta \in C(X)) \ge 1 - \alpha$ for every $\theta \in \Theta$.

Conversely, given a $(1-\alpha)$ confidence set $C(X)$, the collection $\{\varphi_{\theta_0}(X) = \mathbf{1}\{\theta_0 \notin C(X)\} : \theta_0 \in \Theta\}$ is a family of level-$\alpha$ tests.

[Interactive figure.] Horizontal slicing → CI: at $T_{\rm obs} = 0.250$, the set of $\theta_0$ the test does not reject is $[-0.108, 0.608]$ — the z-CI for $\mu$. Vertical slicing → acceptance region: at $\theta_0 = 0$, the test does not reject for $T \in [-0.358, 0.358]$; outside that range, $H_0: \theta = 0$ is rejected at level $\alpha$. Same shaded region, two slicings.

Proof.

Step 1 — Forward direction (tests → CI). Fix $\theta \in \Theta$. By the construction of $C(X)$, the event $\{\theta \in C(X)\}$ equals the event $\{\varphi_\theta(X) = 0\}$ — the non-rejection event of the level-$\alpha$ test of $H_0: \theta = \theta$ (null value coinciding with the true value).

$$P_\theta(\theta \in C(X)) = P_\theta(\varphi_\theta(X) = 0) = 1 - P_\theta(\varphi_\theta(X) = 1) \ge 1 - \alpha.$$

The inequality is the size constraint of the test at the true parameter, applied with $\theta_0 = \theta$. Since $\theta$ was arbitrary, the coverage bound holds uniformly.

Step 2 — Converse direction (CI → tests). Fix $\theta_0 \in \Theta$ and define $\varphi_{\theta_0}(X) = \mathbf{1}\{\theta_0 \notin C(X)\}$. Under $H_0: \theta = \theta_0$,

$$P_{\theta_0}(\varphi_{\theta_0}(X) = 1) = P_{\theta_0}(\theta_0 \notin C(X)) = 1 - P_{\theta_0}(\theta_0 \in C(X)) \le 1 - (1 - \alpha) = \alpha.$$

The inequality is the $(1-\alpha)$ coverage of $C$ at $\theta = \theta_0$. Hence $\{\varphi_{\theta_0}\}$ is a family of level-$\alpha$ tests.

∎ — by indicator algebra and the size constraint; see NEY1937 for the original formulation.

Example 1 z-test inversion → z-CI

Let $X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ with $\sigma$ known. The two-sided z-test of $H_0: \mu = \mu_0$ at level $\alpha$ rejects iff $|\sqrt n (\bar X - \mu_0)/\sigma| > z_{\alpha/2}$. By Theorem 1, the CI is the set of null values the test does not reject:

$$C(X) = \{\mu_0 : |\sqrt n (\bar X - \mu_0)/\sigma| \le z_{\alpha/2}\} = [\bar X - z_{\alpha/2}\sigma/\sqrt n,\; \bar X + z_{\alpha/2}\sigma/\sqrt n].$$

This is the textbook z-interval. The duality makes it an automatic consequence of the z-test’s acceptance region.

Example 2 t-test inversion → t-CI

Same setup but $\sigma$ unknown; replace $\sigma$ by the sample standard deviation $S$. The two-sided t-test rejects iff $|\sqrt n (\bar X - \mu_0)/S| > t_{n-1, \alpha/2}$. Inverting:

$$C(X) = [\bar X - t_{n-1, \alpha/2} S/\sqrt n,\; \bar X + t_{n-1, \alpha/2} S/\sqrt n].$$

Exactness — meaning the CI has coverage exactly $1-\alpha$ at every $\mu$ — is inherited from the exactness of the t-test, which comes from Basu's theorem: the pivot $\sqrt n (\bar X - \mu)/S$ is distribution-free because $\bar X$ and $S^2$ are independent under normality. The same Basu argument reappears in §19.3.
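As a sketch, the t-CI computes in a few lines with SciPy's t quantiles; the data values here are made up for illustration.

```python
# t-CI for a Normal mean with unknown sigma (Example 2); illustrative data.
import numpy as np
from scipy import stats

x = np.array([4.9, 5.3, 4.7, 5.1, 5.6, 4.8, 5.0, 5.2])
n, alpha = len(x), 0.05
xbar, s = x.mean(), x.std(ddof=1)            # ddof=1 gives the sample S
tcrit = stats.t.ppf(1 - alpha / 2, df=n - 1) # t_{n-1, alpha/2}
lo = xbar - tcrit * s / np.sqrt(n)
hi = xbar + tcrit * s / np.sqrt(n)
print((lo, hi))
```

The same numbers come out of `scipy.stats.t.interval(0.95, df=n-1, loc=xbar, scale=s/np.sqrt(n))`, which packages the pivot inversion.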

Remark 4 Duality as organizing principle

Every CI construction in this topic is a test inversion. The z-CI, t-CI, χ²-CI, F-CI of §19.3 come from the four pivotal tests. The Wald/Score/LRT CIs of §19.4 come from inverting the Topic 18 asymptotic trio. The Wilson interval of §19.5 is the score-test inversion for binomial $p$. The Clopper–Pearson interval of §19.6 is the exact test inversion via the beta–binomial CDF identity. The profile-likelihood CI of §19.7 is the generalized-LRT inversion with Wilks as the asymptotic engine. The TOST equivalence procedure of §19.9 is two one-sided tests run in parallel. Every time we write down a confidence interval, we are cashing in the duality theorem — and the fact that it is the same theorem in every case is the topic’s unifying thread.

Remark 5 Vector-θ extension

Theorem 1 extends without change to vector $\theta \in \mathbb{R}^k$: a level-$\alpha$ test at every $\theta_0$ yields a $(1-\alpha)$ confidence set (not necessarily an interval; possibly a region in $\mathbb{R}^k$). The Wald CI for a vector GLM coefficient is an ellipsoid; the profile-likelihood CI for a vector parameter with nuisance components profiled out is a region whose boundary is the $\chi^2_k$-threshold contour. Simultaneous CIs for multiple parameters and confidence ellipsoids for Hotelling’s $T^2$ are developed in a later topic; the scalar case of Topic 19 captures every main idea, with the vector-case bookkeeping involving matrix Fisher information in place of scalar $I(\theta_0)$.


19.3 Pivotal Quantities

A pivotal quantity is a function of data and parameter whose distribution does not depend on the parameter. When one exists, the CI construction is one-line: compute the quantiles of the pivot, rearrange to solve for the parameter, done. The z-CI, t-CI, χ²-CI for variance, and F-CI for variance ratio are the four main exact-small-sample CIs, and all four come from pivots.

Four classical pivots. (a) z-pivot √n(x̄ − μ)/σ ∼ Normal(0, 1); (b) t-pivot √n(x̄ − μ)/S ∼ t with n−1 df, via Basu independence; (c) χ²-pivot (n−1)S²/σ² ∼ χ² with n−1 df; (d) F-pivot (S₁²/σ₁²)/(S₂²/σ₂²) ∼ F with n₁−1 and n₂−1 df. In each panel, the CI for the parameter of interest is the algebraic rearrangement of the pivot's symmetric quantile bracket.

Definition 3 Pivotal quantity

A pivot for a parameter $\theta$ is a random function $Q(X, \theta)$ of data $X$ and parameter $\theta$ whose distribution does not depend on $\theta$ — that is, the law of $Q(X, \theta)$ is the same under every $P_\theta$.

Given a pivot with known distribution $F_Q$ and quantiles $q_{\alpha/2}, q_{1-\alpha/2}$, a $(1-\alpha)$ CI for $\theta$ is obtained by solving

$$\{\theta : q_{\alpha/2} \le Q(X, \theta) \le q_{1-\alpha/2}\}$$

for $\theta$. The inversion is algebra; the probability content is packed into the pivot.

Example 3 z-pivot and t-pivot for a Normal mean

For iid $\mathcal{N}(\mu, \sigma^2)$ data with $\sigma$ known, $Q(X, \mu) = \sqrt n (\bar X - \mu)/\sigma$ is a pivot with $Q \sim \mathcal{N}(0, 1)$. Inverting $-z_{\alpha/2} \le Q \le z_{\alpha/2}$ gives $\bar X - z_{\alpha/2}\sigma/\sqrt n \le \mu \le \bar X + z_{\alpha/2}\sigma/\sqrt n$ — the z-CI of Example 1.

For $\sigma$ unknown, $Q(X, \mu) = \sqrt n (\bar X - \mu)/S$ is a pivot with $Q \sim t_{n-1}$ — this is where Basu’s independence theorem enters: the distribution of $\sqrt n (\bar X - \mu)/S$ is free of both $\mu$ and $\sigma$ precisely because $\bar X \perp\!\!\!\perp S^2$ under normality. Inversion gives the t-CI of Example 2.

Example 4 χ²-pivot for Normal variance

For iid $\mathcal{N}(\mu, \sigma^2)$ data with $\mu$ unknown, $Q(X, \sigma^2) = (n-1)S^2/\sigma^2$ is a pivot with $Q \sim \chi^2_{n-1}$. Invert $\chi^2_{n-1, \alpha/2} \le Q \le \chi^2_{n-1, 1-\alpha/2}$:

$$\frac{(n-1)S^2}{\chi^2_{n-1, 1-\alpha/2}} \le \sigma^2 \le \frac{(n-1)S^2}{\chi^2_{n-1, \alpha/2}}.$$

The interval is asymmetric in $\sigma^2$ — the larger quantile appears in the denominator of the lower endpoint — a direct consequence of the χ²’s skew. Unlike the z/t intervals, there is no “plus or minus” formulation; the asymmetry is real and reflects the distributional shape.
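The χ²-pivot inversion is a two-line computation; a sketch on illustrative data (values are made up), showing the asymmetry about $S^2$ directly:

```python
# chi-square pivot CI for a Normal variance (Example 4); illustrative data.
import numpy as np
from scipy import stats

x = np.array([2.1, 1.8, 2.6, 2.0, 2.4, 1.9, 2.3, 2.2, 2.5, 1.7])
n, alpha = len(x), 0.05
s2 = x.var(ddof=1)  # sample variance S^2
# Larger chi-square quantile in the denominator of the LOWER endpoint:
lo = (n - 1) * s2 / stats.chi2.ppf(1 - alpha / 2, df=n - 1)
hi = (n - 1) * s2 / stats.chi2.ppf(alpha / 2, df=n - 1)
print((lo, s2, hi))  # interval is visibly asymmetric about s2
```

The upper arm $hi - S^2$ is much longer than the lower arm $S^2 - lo$, mirroring the right skew of $\chi^2_{n-1}$.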

Example 5 F-pivot for variance ratio

For two independent Normal samples with sample variances $S_1^2, S_2^2$ and sizes $n_1, n_2$, the ratio $(S_1^2/\sigma_1^2)/(S_2^2/\sigma_2^2)$ is distributed as $F_{n_1-1, n_2-1}$ — a pivot for $\sigma_1^2/\sigma_2^2$. Inverting gives a CI for the variance ratio:

$$\frac{S_1^2/S_2^2}{F_{n_1-1, n_2-1, 1-\alpha/2}} \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{S_1^2/S_2^2}{F_{n_1-1, n_2-1, \alpha/2}}.$$

This is the two-sample variance-ratio CI — the standard tool for testing equality of Normal variances and, by inversion, for quantifying how different they might be.

Remark 6 Pivots are rare; test inversion is the general tool

The z, t, χ², and F pivots exhaust essentially every classical example of exact small-sample CIs. For non-Normal families — Bernoulli, Poisson, Exponential, Gamma, Weibull, and every GLM — no exact pivot exists for the parameter of interest. That is why the rest of Topic 19 develops the asymptotic and exact-by-inversion constructions: Wald, Score, LRT, Wilson, Clopper–Pearson, profile likelihood. All of them are test inversions via Theorem 1, not pivot manipulations — and test inversion is the general-purpose tool. Pivots are the special case where the algebra gives a closed form.


19.4 Wald, Score, LRT Confidence Intervals

Topic 18 (§18.7, Thm 5) proved that the Wald, Score, and LRT statistics all converge in distribution to $\chi^2_1$ under the null for regular parametric families. Theorem 1 now turns each test into a CI. The three CIs coincide asymptotically — they share the same leading-order $\chi^2_1$ coverage — but differ at finite $n$ in ways that matter for rare-event regimes and boundary parameters. §19.5 handles the Bernoulli boundary case; the rest of §19.4 catalogues the trio.

Log-likelihood and three CIs for Bernoulli p̂ = 0.3, n = 50. Wald (amber) is the symmetric interval from the quadratic approximation at p̂. Wilson (purple; score-test inversion) follows the likelihood curvature. LRT (green) bisects the likelihood's χ²₁-threshold contour. All three agree asymptotically; at finite n the asymmetric LRT and Wilson stay further from the boundary.

Definition 4 Wald confidence interval

For a regular parametric family with scalar $\theta$, MLE $\hat\theta_n$, and Fisher information $I(\theta)$, the Wald CI at level $1-\alpha$ is

$$C_{\rm Wald}(X) = \left[\hat\theta_n - \frac{z_{\alpha/2}}{\sqrt{n\,I(\hat\theta_n)}},\; \hat\theta_n + \frac{z_{\alpha/2}}{\sqrt{n\,I(\hat\theta_n)}}\right].$$

Equivalent description: invert the Wald test, keeping the $\theta_0$ with $W_n(\theta_0) = n(\hat\theta_n - \theta_0)^2 I(\hat\theta_n) \le z^2_{\alpha/2}$. The interval is symmetric around $\hat\theta_n$ — the quadratic approximation to the log-likelihood at the MLE imposes symmetry regardless of the underlying likelihood’s actual shape.

Definition 5 Score confidence interval

The Score CI is the set of null values the score (Rao) test does not reject:

$$C_{\rm Score}(X) = \{\theta_0 : S_n(\theta_0) \le z^2_{\alpha/2}\}, \qquad S_n(\theta_0) = \frac{U_n(\theta_0)^2}{n\,I(\theta_0)},$$

where $U_n(\theta_0) = \partial \ell_n(\theta_0)/\partial \theta$ is the score at the null. Because the variance $I(\theta_0)$ is evaluated at the null, the resulting endpoints solve a quadratic-in-$\theta_0$ inequality — generally asymmetric around $\hat\theta_n$, and always within the parameter space for Bernoulli (§19.5).

Definition 6 LRT confidence interval

The likelihood-ratio confidence interval is the set of null values the LRT does not reject:

$$C_{\rm LRT}(X) = \{\theta_0 : -2\log\Lambda_n(\theta_0) \le \chi^2_{1, 1-\alpha}\},$$

where $-2\log\Lambda_n(\theta_0) = 2[\ell_n(\hat\theta_n) - \ell_n(\theta_0)]$. Endpoint computation is a bisection on the log-likelihood’s χ²-threshold contour around the MLE — asymmetric whenever the log-likelihood is non-quadratic, which is the generic case at moderate $n$.

Theorem 2 Asymptotic coverage of the Wald/Score/LRT CIs

For a regular parametric family with scalar $\theta$ and Fisher information $I(\theta)$ continuous in $\theta$, each of $C_{\rm Wald}, C_{\rm Score}, C_{\rm LRT}$ has asymptotic coverage $1 - \alpha$:

$$\lim_{n\to\infty} P_{\theta_0}(\theta_0 \in C_\bullet(X)) = 1 - \alpha, \qquad \bullet \in \{\text{Wald}, \text{Score}, \text{LRT}\}.$$

Proof. Each CI is the non-rejection region of a test with asymptotic $\chi^2_1$ null distribution. By Topic 18 §18.7 Thm 5, the Wald/Score/LRT statistics each converge in distribution to $\chi^2_1$ under $H_0: \theta = \theta_0$. Since the $\chi^2_1$ CDF is continuous at the threshold, $P_{\theta_0}(\text{statistic} \le \chi^2_{1, 1-\alpha}) \to 1 - \alpha$. Hence coverage converges to $1 - \alpha$ by Theorem 1 applied at level $\alpha$. ∎

Wald boundary pathology. At p̂ = 1/30 ≈ 0.033, the Wald CI lower endpoint is below zero — outside the parameter space — because the quadratic approximation to the log-likelihood extrapolates through the boundary. Wilson, by evaluating the variance at the null p₀ rather than at p̂, stays inside [0, 1] for every x.

Example 6 Bernoulli — all three CIs at p̂ = 0.3, n = 50

For iid Bernoulli$(p)$ with $\hat p = 0.3$, $n = 50$, at $\alpha = 0.05$:

  • Wald: $\hat p \pm z_{0.025}\sqrt{\hat p(1-\hat p)/n} = 0.3 \pm 1.96 \cdot 0.0648 = [0.173, 0.427]$.
  • Score (Wilson): closed-form quadratic inversion (§19.5 Proof 2); here $[0.191, 0.438]$.
  • LRT: bisection on $-2\log\Lambda_n(p_0) = \chi^2_{1, 0.95}$; here $[0.185, 0.435]$.

All three are close — agreement to ≈ 0.02 in each endpoint — because $n = 50$ is moderate and $\hat p = 0.3$ is not near a boundary. Agreement deteriorates as $n$ shrinks toward 20 or $\hat p$ approaches 0; §19.5 and §19.8 quantify this.
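The three Bernoulli intervals of Example 6 can be reproduced directly. The sketch below uses root-finding (`scipy.optimize.brentq`) in place of hand bisection for the LRT endpoints; the helper name `neg2loglam` is my own.

```python
# Reproduce Example 6: Wald, Wilson, LRT CIs for Bernoulli, phat = 0.3, n = 50.
import numpy as np
from scipy import stats
from scipy.optimize import brentq

n, phat, alpha = 50, 0.3, 0.05
z = stats.norm.ppf(1 - alpha / 2)

# Wald: symmetric about phat.
se = np.sqrt(phat * (1 - phat) / n)
wald = (phat - z * se, phat + z * se)

# Wilson: closed-form score-test inversion (Theorem 3).
denom = 1 + z**2 / n
center = (phat + z**2 / (2 * n)) / denom
half = z * np.sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2)) / denom
wilson = (center - half, center + half)

# LRT: find where -2 log Lambda(p0) crosses the chi2_{1,0.95} threshold
# on each side of phat.
def neg2loglam(p0):
    return 2 * n * (phat * np.log(phat / p0)
                    + (1 - phat) * np.log((1 - phat) / (1 - p0)))

thresh = stats.chi2.ppf(1 - alpha, df=1)
lrt = (brentq(lambda p: neg2loglam(p) - thresh, 1e-9, phat),
       brentq(lambda p: neg2loglam(p) - thresh, phat, 1 - 1e-9))

print("Wald  ", wald)
print("Wilson", wilson)
print("LRT   ", lrt)
```

The printed endpoints match the text's $[0.173, 0.427]$, $[0.191, 0.438]$, $[0.185, 0.435]$ to three decimals.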

Example 7 Poisson rate — all three CIs at λ̂ = 2, n = 30

For iid Poisson$(\lambda)$ with $\hat\lambda = 2$, $n = 30$, $\alpha = 0.05$:

  • Wald: $\hat\lambda \pm z_{0.025}\sqrt{\hat\lambda/n} = 2 \pm 1.96 \cdot 0.258 = [1.494, 2.506]$.
  • Score: invert $n(\hat\lambda - \lambda_0)^2/\lambda_0 = z^2$; the quadratic in $\lambda_0$ gives $[1.554, 2.574]$.
  • LRT: bisection on $2n[\hat\lambda\log(\hat\lambda/\lambda_0) - (\hat\lambda - \lambda_0)] = \chi^2_{1, 0.95}$; here $[1.536, 2.550]$.

Wald is symmetric about $\hat\lambda$; Score and LRT both shift right, tracking the right skew of the Poisson log-likelihood at $\hat\lambda = 2$ that the quadratic approximation behind Wald ignores. Score and LRT agree with each other to about 0.02 in each endpoint.
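The same computation for the Poisson example, again with root-finding standing in for bisection (the helper name `neg2loglam` is my own):

```python
# Reproduce Example 7: Wald, Score, LRT CIs for a Poisson rate,
# lambda-hat = 2, n = 30, alpha = 0.05.
import numpy as np
from scipy import stats
from scipy.optimize import brentq

n, lam, alpha = 30, 2.0, 0.05
z = stats.norm.ppf(1 - alpha / 2)

# Wald
se = np.sqrt(lam / n)
wald = (lam - z * se, lam + z * se)

# Score: n (lam - l0)^2 / l0 = z^2 rearranges to a quadratic in l0:
# l0^2 - (2 lam + z^2/n) l0 + lam^2 = 0.
b = -(2 * lam + z**2 / n)
disc = np.sqrt(b**2 - 4 * lam**2)
score = ((-b - disc) / 2, (-b + disc) / 2)

# LRT: cross the chi2_{1,0.95} threshold on each side of lam.
def neg2loglam(l0):
    return 2 * n * (lam * np.log(lam / l0) - (lam - l0))

thresh = stats.chi2.ppf(1 - alpha, df=1)
lrt = (brentq(lambda l: neg2loglam(l) - thresh, 1e-6, lam),
       brentq(lambda l: neg2loglam(l) - thresh, lam, 10 * lam))

print("Wald ", wald)
print("Score", score)
print("LRT  ", lrt)
```

The Score and LRT intervals both sit to the right of Wald's symmetric bracket, as the example describes.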

Remark 7 CRLB as CI-width envelope

The Cramér–Rao lower bound says that every unbiased estimator $\hat\theta_n$ satisfies $\mathrm{Var}_\theta(\hat\theta_n) \ge 1/(n I(\theta))$. Dualized: no Wald-type CI can have width less than $2 z_{\alpha/2} / \sqrt{n I(\theta)}$ at its leading-order rate. The CRLB is thus the asymptotic width envelope for confidence intervals — the same Fisher information that bounds estimator variance bounds CI width. All three asymptotic CIs of §19.4 achieve this envelope to leading order; the finite-sample corrections are where they differ.

Remark 8 Finite-sample divergence and reparameterization

Topic 18 §18.8 showed that Wald, Score, and LRT differ at finite samples, with the Wald test sensitive to reparameterization: a logit transform and its Wald CI back-transform do not equal the $p$-scale Wald CI. The same is true for Wald CIs; the LRT CI, by contrast, is invariant — under a bijection $g$, the set $\{\theta_0 : -2\log\Lambda_n(\theta_0) \le \chi^2_{1, 1-\alpha}\}$ simply maps to its image $\{g(\theta_0) : -2\log\Lambda_n(\theta_0) \le \chi^2_{1, 1-\alpha}\}$, because $-2\log\Lambda_n$ depends only on the likelihood values, not on how $\theta$ is parameterized. This is why production GLM libraries default to LRT (a.k.a. “deviance”) CIs for coefficients whose null effect is at a boundary (logistic regression on rare events, Poisson on small counts); the reparameterization-dependent Wald CI gives qualitatively different answers depending on the scale chosen. Topic 18 §18.8 has the proof for the test-statistic version; dualization via Theorem 1 transfers it verbatim to CIs.


19.5 The Wilson Interval

The Wald CI for Bernoulli $p$ fails spectacularly near the boundary: at $\hat p = 0$ it collapses to $[0, 0]$; at $\hat p = 1/30 \approx 0.033$ its lower endpoint is below zero, outside the parameter space. The fix is to invert the score test rather than the Wald test — evaluating the variance $p_0(1-p_0)/n$ at the null $p_0$ rather than at the MLE $\hat p$. The resulting closed-form CI stays in $[0, 1]$ automatically: the Wilson interval of Wilson (1927), the industry default for A/B-test conversion-rate confidence intervals.

Theorem 3 Wilson interval

Let $X_1, \ldots, X_n$ be iid Bernoulli$(p)$ with $\hat p_n = \bar X_n$. The asymptotic level-$\alpha$ score test of $H_0: p = p_0$ rejects iff $|Z_n(p_0)| > z_{\alpha/2}$, where $Z_n(p_0) = \sqrt n (\hat p_n - p_0)/\sqrt{p_0(1-p_0)}$. The test-inversion CI is the Wilson interval

$$C_{\rm Wilson}(X) = \frac{\hat p_n + \frac{z^2}{2n} \pm z\sqrt{\frac{\hat p_n(1-\hat p_n)}{n} + \frac{z^2}{4n^2}}}{1 + z^2/n},$$

where $z = z_{\alpha/2}$.

Proof.

Step 1 — Set up the inversion. By Theorem 1, $p_0 \in C(X)$ iff the score test fails to reject at $p_0$: $Z_n(p_0)^2 \le z^2$, which is

$$(\hat p_n - p_0)^2 \le z^2 \frac{p_0(1 - p_0)}{n}.$$

Step 2 — Rearrange as a quadratic in $p_0$. Expand both sides and collect terms in $p_0$:

$$\hat p_n^2 - 2\hat p_n p_0 + p_0^2 \le \frac{z^2 p_0}{n} - \frac{z^2 p_0^2}{n}.$$

Group into the quadratic inequality $A p_0^2 + B p_0 + C \le 0$ with

$$A = 1 + \frac{z^2}{n}, \qquad B = -\left(2\hat p_n + \frac{z^2}{n}\right), \qquad C = \hat p_n^2.$$

Step 3 — Solve. The coefficient $A > 0$, so the inequality $A p_0^2 + B p_0 + C \le 0$ defines the interval $[p_-, p_+]$ between the roots of the quadratic equation. By the quadratic formula,

$$p_\pm = \frac{-B \pm \sqrt{B^2 - 4AC}}{2A} = \frac{2\hat p_n + z^2/n \pm \sqrt{(2\hat p_n + z^2/n)^2 - 4(1 + z^2/n)\hat p_n^2}}{2(1 + z^2/n)}.$$

Step 4 — Simplify the discriminant. Expanding $(2\hat p_n + z^2/n)^2 - 4(1 + z^2/n)\hat p_n^2$ and cancelling the $4\hat p_n^2$ terms gives $4z^2 \hat p_n/n + z^4/n^2 - 4z^2\hat p_n^2/n = 4z^2\hat p_n(1 - \hat p_n)/n + z^4/n^2$. Factoring $4z^2/n^2$ from under the square root:

$$\sqrt{(2\hat p_n + z^2/n)^2 - 4(1 + z^2/n)\hat p_n^2} = 2z \sqrt{\frac{\hat p_n(1-\hat p_n)}{n} + \frac{z^2}{4n^2}}.$$

Step 5 — Assemble. Substituting back and dividing numerator and denominator by 2:

$$p_\pm = \frac{\hat p_n + z^2/(2n) \pm z\sqrt{\hat p_n(1-\hat p_n)/n + z^2/(4n^2)}}{1 + z^2/n}.$$

This is the stated Wilson interval. Note that the $z^2/(2n)$ shift in the numerator — the regularizing ingredient that keeps the endpoints inside $[0, 1]$ for every $\hat p_n$ — comes from evaluating the variance at $p_0$ rather than at $\hat p_n$ in Step 1. Wald’s boundary pathology (Rem 8 of Topic 18 §18.8) is exactly the absence of this shift.

∎ — using score-test inversion (Topic 18 §18.7) and Theorem 1; WIL1927.

Example 8 Boundary example: p̂ = 0

At $n = 20$, $x = 0$, $\alpha = 0.05$: $\hat p_n = 0$, $z = 1.96$.

  • Wald: $[0, 0]$ — a point, coverage 0 at every $p > 0$.
  • Wilson: center $= (0 + 1.96^2/40)/(1 + 1.96^2/20) = 0.0805$; half-width $= 1.96\sqrt{0 + 1.96^2/1600}/1.192 = 0.0805$. At $\hat p_n = 0$ the center and half-width coincide, so the lower endpoint is exactly 0 and the upper is $0.0805 + 0.0805 \approx 0.161$. A proper interval.
  • Clopper–Pearson (next section): $[0, 0.168]$.

The Wilson and Clopper–Pearson upper endpoints agree to $\approx 5\%$; Wald is catastrophic. This is the concrete content of Topic 18 §18.8 Rem 16: for rare-event A/B tests, Wald under-covers at the boundary, and Wilson is the practical default.
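A small Wilson-interval function, checked at the $x = 0$ boundary of Example 8 (the function name is my own):

```python
# Wilson interval (Theorem 3) as a function of counts, checked at x = 0.
import numpy as np
from scipy import stats

def wilson(x, n, alpha=0.05):
    z = stats.norm.ppf(1 - alpha / 2)
    phat = x / n
    denom = 1 + z**2 / n
    center = (phat + z**2 / (2 * n)) / denom
    half = z * np.sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lo, hi = wilson(0, 20)
print(lo, hi)  # lower endpoint is (numerically) zero at x = 0
```

At $x = 0$ the center and half-width are both $z^2/(2n)$ over the common denominator, so the lower endpoint lands on 0 and the upper near 0.161 — inside the parameter space, unlike Wald.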

Remark 9 Agresti–Coull 'plus-4' approximation

Agresti and Coull (1998) proposed an easy-to-remember approximation to Wilson: add $z^2/2 \approx 2$ successes and $z^2/2 \approx 2$ failures to the observed counts, then apply the Wald formula to the inflated sample. At $\alpha = 0.05$, $z^2/2 \approx 1.92 \approx 2$, so the popular form is “add 2 successes, add 2 failures, Wald the result.” The resulting interval usually matches Wilson within $0.01$ and is a reasonable hand-calculator substitute. The pedagogical slogan — approximate is better than exact for interval estimation of binomial proportions — captures the takeaway in five words: approximate is better than exact.

Remark 10 BRO2001 — coverage calibration quantified

Brown, Cai, and DasGupta (2001) — BRO2001 — gave the systematic diagnostic for binomial CI coverage: actual coverage as a function of true $p$ across a grid of $n$. Their Table 1 is the authority for test cases (§19.8 Ex 12); their key finding is that Wald’s actual coverage oscillates around $1-\alpha$ but dips below nominal — sometimes by $0.1$ or more — at moderate $p$ and small $n$. Wilson’s coverage oscillates around $1-\alpha$ with much smaller amplitude; Clopper–Pearson is always at or above nominal (conservative), often by $0.03$ or more. The paper’s recommendation for practical work: Wilson as default; Agresti–Coull as hand-calculator substitute; Clopper–Pearson only when a strict lower bound on coverage matters (regulatory submissions, conservative monitoring).


19.6 Clopper–Pearson Exact Intervals

The Wald and Wilson CIs for binomial $p$ are asymptotic — derived from the $\chi^2_1$ null distribution of the score test. The Clopper–Pearson interval (Clopper and Pearson, 1934) is exact: it guarantees coverage $\ge 1 - \alpha$ for every $p \in [0, 1]$, no matter how small $n$. The price is conservatism — actual coverage strictly exceeds nominal at most $p$ — and the mechanism is the discreteness of the binomial: its CDF jumps, so a test can only be sized at or below $\alpha$, never exactly at $\alpha$.

Clopper–Pearson via the beta–binomial CDF identity. Left: Beta(x, n−x+1) density with shaded α/2 lower tail; the quantile at that tail is p_L. Right: Beta(x+1, n−x) density with shaded α/2 upper tail; the quantile is p_U. Two beta quantiles = one CI endpoint pair.

Theorem 4 Clopper–Pearson interval

Let $X \sim \text{Binomial}(n, p)$ with observed value $x \in \{0, 1, \ldots, n\}$. The Clopper–Pearson $(1-\alpha)$ confidence interval for $p$ is

$$C_{\rm CP}(x) = [p_L, p_U], \qquad p_L = B^{-1}(\alpha/2;\, x,\, n-x+1), \qquad p_U = B^{-1}(1-\alpha/2;\, x+1,\, n-x),$$

where $B^{-1}(q; a, b)$ is the $q$-quantile of the Beta$(a, b)$ distribution, with the conventions $p_L = 0$ when $x = 0$ and $p_U = 1$ when $x = n$. Coverage $P_p(p \in C_{\rm CP}(X)) \ge 1 - \alpha$ for every $p \in [0, 1]$.
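The two beta quantiles translate directly into code. A sketch with `scipy.stats.beta.ppf`, including the stated boundary conventions (the function name is my own):

```python
# Clopper-Pearson endpoints via the two beta quantiles of Theorem 4.
from scipy import stats

def clopper_pearson(x, n, alpha=0.05):
    lo = 0.0 if x == 0 else stats.beta.ppf(alpha / 2, x, n - x + 1)
    hi = 1.0 if x == n else stats.beta.ppf(1 - alpha / 2, x + 1, n - x)
    return lo, hi

print(clopper_pearson(0, 20))  # upper endpoint near 0.168, as in Example 8
```

At $x = 0$ the upper endpoint is the Beta$(1, n)$ quantile, which has the closed form $1 - (\alpha/2)^{1/n}$.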

[Interactive coverage plot; nominal level 0.95, true $p$ on the horizontal axis.] Mean coverage over the $p$ grid: Wald 0.9255, Wilson 0.9503, Clopper–Pearson 0.9643.

Coverage computed exactly by summing P_p(X = x) over all outcomes whose CI contains p. Wald oscillates below nominal; Wilson hugs nominal with small amplitude; Clopper–Pearson stays at or above nominal (conservative). BRO2001 Table 1 is the authoritative reference.
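The caption's exact-coverage computation — sum $P_p(X = x)$ over the outcomes $x$ whose interval contains $p$ — is a short loop. Shown here for the Wald interval as an illustration; the function names are my own.

```python
# Exact coverage of a binomial CI procedure at one true p: sum the pmf of
# every outcome x whose interval contains p (no simulation needed).
import numpy as np
from scipy import stats

def wald_ci(x, n, z=1.959963984540054):
    phat = x / n
    half = z * np.sqrt(phat * (1 - phat) / n)
    return phat - half, phat + half

def coverage(ci, n, p):
    xs = np.arange(n + 1)
    pmf = stats.binom.pmf(xs, n, p)
    hit = np.array([lo <= p <= hi for lo, hi in (ci(x, n) for x in xs)])
    return float(pmf[hit].sum())

print(coverage(wald_ci, 50, 0.3))  # Wald often sits below the nominal 0.95
```

Swapping in a Wilson or Clopper–Pearson function for `wald_ci` reproduces the other two coverage curves point by point.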

Proof [show]

Step 1 — Exact two-sided test inversion. The exact two-sided test of H0:p=p0H_0: p = p_0 at level α\alpha (Topic 17 §17.6 Ex 11) fails to reject at p0p_0 iff both tail probabilities exceed α/2\alpha/2:

Pp0(Xx)α/2andPp0(Xx)α/2.P_{p_0}(X \ge x) \ge \alpha/2 \qquad \text{and} \qquad P_{p_0}(X \le x) \ge \alpha/2.

By Theorem 1, the non-rejection set is the CI [pL,pU][p_L, p_U]: pLp_L is the unique p0p_0 solving Pp0(Xx)=α/2P_{p_0}(X \ge x) = \alpha/2 — unique because the tail probability is continuous and strictly increasing in p0p_0 for x1x \ge 1, so the first constraint holds exactly on [pL,1][p_L, 1] — and pUp_U is the unique p0p_0 solving Pp0(Xx)=α/2P_{p_0}(X \le x) = \alpha/2, that tail being continuous and strictly decreasing in p0p_0 for xn1x \le n - 1, so the second constraint holds on [0,pU][0, p_U].

Step 2 — Beta–binomial identity. The master identity — provable by repeated integration by parts — is

Pp(Xk)=j=kn(nj)pj(1p)nj=Ip(k,nk+1),P_p(X \ge k) = \sum_{j=k}^n \binom{n}{j} p^j(1-p)^{n-j} = I_p(k, n-k+1),

where Ix(a,b)=B(x;a,b)/B(a,b)I_x(a, b) = B(x; a, b)/B(a, b) is the regularized incomplete beta (the Beta(a,b)(a, b) CDF at xx). Equivalently, Pp(Xk)=1Ip(k+1,nk)P_p(X \le k) = 1 - I_p(k+1, n-k).
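The identity is easy to spot-check numerically before building on it; a quick SciPy check at arbitrary (n, k, p) of our choosing:

```python
from scipy.stats import beta, binom

n, k, p = 20, 7, 0.3
lhs = binom.sf(k - 1, n, p)            # P_p(X >= k)
rhs = beta.cdf(p, k, n - k + 1)        # I_p(k, n-k+1): the Beta(k, n-k+1) CDF at p
comp = 1 - beta.cdf(p, k + 1, n - k)   # the equivalent form for P_p(X <= k)
print(lhs, rhs, binom.cdf(k, n, p), comp)
```

Both pairs agree to floating-point precision, for any integer k between 1 and n − 1.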

Step 3 — Solve for pLp_L. Setting PpL(Xx)=α/2P_{p_L}(X \ge x) = \alpha/2 and applying the identity with k=xk = x:

IpL(x,nx+1)=α/2pL=B1(α/2;x,nx+1).I_{p_L}(x, n-x+1) = \alpha/2 \quad\Longleftrightarrow\quad p_L = B^{-1}(\alpha/2;\, x, n-x+1).

The second equivalence identifies the inverse regularized incomplete beta with the Beta(x,nx+1)(x, n-x+1) inverse CDF.

Step 4 — Solve for pUp_U. Symmetrically, setting PpU(Xx)=α/2P_{p_U}(X \le x) = \alpha/2 and using Pp(Xx)=1Ip(x+1,nx)P_p(X \le x) = 1 - I_p(x+1, n-x):

1IpU(x+1,nx)=α/2IpU(x+1,nx)=1α/2,1 - I_{p_U}(x+1, n-x) = \alpha/2 \quad\Longleftrightarrow\quad I_{p_U}(x+1, n-x) = 1 - \alpha/2,

hence pU=B1(1α/2;x+1,nx)p_U = B^{-1}(1-\alpha/2;\, x+1, n-x).

Step 5 — Boundary conventions. At x=0x = 0, Pp(X0)=1α/2P_p(X \ge 0) = 1 \ge \alpha/2 for every pp — the “lower tail” constraint is vacuous — so the CI extends down to pL=0p_L = 0. Similarly x=nx = n gives pU=1p_U = 1. The formulas with x+1x + 1 and nxn - x swapped in the second Beta keep the quantile formulas meaningful at the boundaries (Beta(1,n)(1, n) for x=0x = 0, Beta(n,1)(n, 1) for x=nx = n).

Step 6 — Coverage bound. Because the binomial is discrete, the exact tail probabilities Pp(Xx)P_p(X \ge x) and Pp(Xx)P_p(X \le x) are step functions of pp with jumps at the n+1n + 1 possible values of XX. Enforcing the α/2\alpha/2-size constraint at equality in Step 1 means the test’s actual size is α\le \alpha (over-controlled at most p0p_0); by Step 2 of Proof 1, the CI’s coverage is 1α\ge 1 - \alpha. Equality is attained only at boundary points where the discrete CDF achieves α/2\alpha/2 exactly — hence the CI is exact but generically conservative, with actual coverage strictly exceeding nominal at most pp.

∎ — by the beta–binomial CDF identity and exact test inversion (Theorem 1); CLO1934.

Example 9 Clopper–Pearson at n = 20, x = 3

At α=0.05\alpha = 0.05:

  • pL=B1(0.025;3,18)0.032p_L = B^{-1}(0.025;\, 3, 18) \approx 0.032,
  • pU=B1(0.975;4,17)0.379p_U = B^{-1}(0.975;\, 4, 17) \approx 0.379.

So the 95% Clopper–Pearson CI is [0.032,0.379][0.032, 0.379]. The Wilson CI at the same (n,x,α)(n, x, \alpha) is [0.052,0.360][0.052, 0.360]; Wald is [−0.007,0.307][-0.007, 0.307] before clamping — the lower endpoint is negative, reflecting the same pathology as Example 8 at a less extreme level. Clopper–Pearson’s conservatism shows up as the widest of the three intervals: it buys guaranteed coverage at the cost of width.
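A sketch reproducing the three intervals at n = 20, x = 3 (SciPy for the beta quantiles; note Wald’s negative lower endpoint before clamping):

```python
import numpy as np
from scipy.stats import beta

n, x, z = 20, 3, 1.959964
ph = x / n

# Wald: symmetric normal approximation around p-hat
w = z * np.sqrt(ph * (1 - ph) / n)
wald = (ph - w, ph + w)

# Wilson: invert the score test; the 1 + z^2/n denominator recenters and shrinks
den = 1 + z**2 / n
center = (ph + z**2 / (2 * n)) / den
half = z * np.sqrt(ph * (1 - ph) / n + z**2 / (4 * n**2)) / den
wilson = (center - half, center + half)

# Clopper-Pearson: the beta quantiles of Theorem 4
cp = (beta.ppf(0.025, x, n - x + 1), beta.ppf(0.975, x + 1, n - x))

for name, ci in [("Wald", wald), ("Wilson", wilson), ("CP", cp)]:
    print(f"{name:6s} [{ci[0]:+.3f}, {ci[1]:.3f}]")
```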

Remark 11 Why conservatism — discreteness of the binomial

A discrete CDF cannot assign mass exactly α/2\alpha/2 to a tail unless α/2\alpha/2 happens to coincide with a cumulative PMF value at some integer cutoff. Generically it doesn’t, so the exact tail test must reject only when the cumulative mass is at or below α/2\alpha/2, which gives actual size strictly less than α\alpha at most pp. Dualized via Theorem 1, this yields actual coverage strictly above 1α1 - \alpha — conservatism. The same phenomenon reappears for Poisson, negative binomial, hypergeometric — every discrete family. The only way to achieve exact coverage from a discrete test is to use a randomized test, which no one does in practice because it delivers different answers on the same data.

Remark 12 When to choose Clopper–Pearson

Default choice for binomial CIs is Wilson (Rem 10). Clopper–Pearson is preferred when:

  1. Regulatory requirement of strict coverage. FDA submissions, clinical trial monitoring, and other contexts where “actual coverage 1α\ge 1 - \alpha” must be certifiable regardless of pp.
  2. Very small nn or x=0x = 0 / x=nx = n. The boundary conventions of Theorem 4 extend to the extremes; Wilson’s closed form degenerates at x=0x = 0 (it returns [0,z2/(n+z2)][0, z^2/(n + z^2)] — well-defined, but with no finite-sample coverage guarantee behind it).
  3. Low tolerance for any coverage dip. In monitoring applications where even a 2% under-coverage is unacceptable, Clopper–Pearson’s always-at-or-above-nominal guarantee is worth the width penalty.

The trade-off is explicit: buy coverage guarantees with width. Wilson gives tighter intervals at roughly-nominal (but sometimes slightly-under) coverage.


19.7 Profile Likelihood Confidence Intervals

Every CI so far has been for a one-parameter family — a clean setup where the Fisher information and test statistic are scalar. In practice, nearly every inference problem has nuisance parameters: in a Normal with unknown variance, the mean is of interest and σ2\sigma^2 is nuisance; in a Gamma, the shape is of interest and the rate is nuisance; in logistic regression, one coefficient is of interest and the rest are nuisance. The profile likelihood handles this by profiling out the nuisance at each value of the parameter of interest, reducing the problem to a scalar LRT. This fulfills the forward pointer of Topic 18 §18.10 Rem 22.

Profile likelihood for Gamma shape α with rate β profiled out. Left: joint log-likelihood surface with the profile ridge β̂(α) = α/x̄ traced in green. Right: the 1D profile ℓ_P(α) with the χ²₁-threshold line (red dashed); the CI is the band between the two intersections. The profile recovers Wilks' test applied to the 1D restriction — asymptotic χ²₁ coverage at n → ∞, finite-n gap quantified by BRO2001-style calibration.

Definition 7 Profile log-likelihood and profile CI

Let {Pθ,ψ:θΘR,ψΨRk1}\{P_{\theta, \psi} : \theta \in \Theta \subset \mathbb{R}, \psi \in \Psi \subset \mathbb{R}^{k-1}\} be a regular parametric family with scalar parameter of interest θ\theta and nuisance ψ\psi. The profile log-likelihood is

P(θ)=supψ(θ,ψ)=(θ,ψ^(θ)),\ell_P(\theta) = \sup_\psi \ell(\theta, \psi) = \ell(\theta, \hat\psi(\theta)),

where ψ^(θ)\hat\psi(\theta) is the conditional MLE of ψ\psi at fixed θ\theta. The profile-likelihood confidence interval at level 1α1 - \alpha is

CP(X)={θ:2[P(θ^n)P(θ)]χ1,1α2},C_P(X) = \{\theta : 2\,[\ell_P(\hat\theta_n) - \ell_P(\theta)] \le \chi^2_{1, 1-\alpha}\},

where θ^n\hat\theta_n is the profile MLE (equivalently, the θ\theta-coordinate of the joint MLE).

Theorem 5 Profile-likelihood CI asymptotic coverage

Under Wilks’ regularity conditions (see Topic 18 §18.6),

P(θ0,ψ0)(θ0CP(X))1αas n,P_{(\theta_0, \psi_0)}(\theta_0 \in C_P(X)) \to 1 - \alpha \qquad \text{as } n \to \infty,

for every (θ0,ψ0)Θ×Ψ(\theta_0, \psi_0) \in \Theta \times \Psi.

Profile log-likelihood ℓ_P(θ) vs μ (Normal mean), with the threshold line at ℓ_P(θ̂) − χ²₁₋α/2 (interactive panel). Profile CI (Wilks): [−0.137, 0.986] — the threshold-crossings of the χ²₁ level curve. Wald (quadratic): [−0.111, 0.959]; exact t-CI: [−0.162, 1.010]. The Wald-to-t gap shrinks as 1/n; at large n all three agree. The profile curve is ℓ_P(θ) = sup_ψ ℓ(θ, ψ); the shaded band is {θ : 2[ℓ_P(θ̂) − ℓ_P(θ)] ≤ χ²₁₋α} — the CI obtained by inverting the generalized LRT with ψ profiled out (Wilks' theorem applies).

Proof [show]

Step 1 — Profile LRT statistic. Fix the true parameter (θ0,ψ0)(\theta_0, \psi_0). The test-inversion construction of CP(X)C_P(X) is precisely the inversion of the generalized LRT (Topic 18) for the composite null H0:θ=θ0H_0: \theta = \theta_0 with ψ\psi nuisance:

2logΛn(θ0)=2[(θ^n,ψ^n)supψ(θ0,ψ)]=2[P(θ^n)P(θ0)].-2\log\Lambda_n(\theta_0) = 2\,[\ell(\hat\theta_n, \hat\psi_n) - \sup_\psi \ell(\theta_0, \psi)] = 2\,[\ell_P(\hat\theta_n) - \ell_P(\theta_0)].

The equality uses the definition of the profile: supψ(θ0,ψ)=P(θ0)\sup_\psi \ell(\theta_0, \psi) = \ell_P(\theta_0) and (θ^n,ψ^n)=P(θ^n)\ell(\hat\theta_n, \hat\psi_n) = \ell_P(\hat\theta_n) by construction.

Step 2 — Wilks’ asymptotic null. By Wilks’ theorem (Topic 18 §18.6 Thm 4), under the regular-parametric assumptions and with scalar θ\theta restricted under H0H_0,

2logΛn(θ0)dχ12under (θ0,ψ0).-2\log\Lambda_n(\theta_0) \xrightarrow{d} \chi^2_1 \qquad \text{under } (\theta_0, \psi_0).

Step 3 — Coverage by inversion. The event {θ0CP(X)}\{\theta_0 \in C_P(X)\} is the event {2logΛn(θ0)χ1,1α2}\{-2\log\Lambda_n(\theta_0) \le \chi^2_{1, 1-\alpha}\} — the non-rejection event for the LRT at θ0\theta_0. By Step 2 and the continuous mapping theorem,

P(θ0,ψ0)(θ0CP(X))=P(θ0,ψ0)(2logΛn(θ0)χ1,1α2)P(χ12χ1,1α2)=1α.P_{(\theta_0, \psi_0)}(\theta_0 \in C_P(X)) = P_{(\theta_0, \psi_0)}(-2\log\Lambda_n(\theta_0) \le \chi^2_{1, 1-\alpha}) \to P(\chi^2_1 \le \chi^2_{1, 1-\alpha}) = 1 - \alpha.

∎ — using Wilks’ theorem (Topic 18 §18.6 Thm 4), continuous mapping, and Theorem 1.

Example 10 Normal mean with unknown variance — profile recovers the t-CI

For XiN(μ,σ2)X_i \sim \mathcal{N}(\mu, \sigma^2) iid, the conditional MLE of σ\sigma at fixed μ\mu is σ^(μ)=n1(Xiμ)2\hat\sigma(\mu) = \sqrt{n^{-1}\sum (X_i - \mu)^2}. Plugging into the Normal log-likelihood and simplifying, the profile for μ\mu becomes

P(μ)=n2log(2πn(Xiμ)2)n2.\ell_P(\mu) = -\frac{n}{2}\log\left(\frac{2\pi}{n}\sum(X_i - \mu)^2\right) - \frac{n}{2}.

The profile MLE is μ^n=Xˉ\hat\mu_n = \bar X. Expanding 2[P(Xˉ)P(μ)]χ1,1α22\,[\ell_P(\bar X) - \ell_P(\mu)] \le \chi^2_{1, 1-\alpha} and rearranging:

n(Xˉμ)2σ^2(Xˉ)exp ⁣(χ1,1α2n)nn.\frac{n(\bar X - \mu)^2}{\hat\sigma^2(\bar X)} \le \exp\!\left(\frac{\chi^2_{1, 1-\alpha}}{n}\right) \cdot n - n.

At nn large this is nearly n(Xˉμ)2/S2χ1,1α2n(\bar X - \mu)^2/S^2 \le \chi^2_{1, 1-\alpha} — the asymptotic Wald-CI form. The exact t-CI is n(Xˉμ)/Stn1,α/2|\sqrt n(\bar X - \mu)/S| \le t_{n-1, \alpha/2}; the ratio tn1,α/22/χ1,1α2t_{n-1, \alpha/2}^2 / \chi^2_{1, 1-\alpha} is the source of the asymptotic-vs-exact gap. At n=30n = 30, t29,0.02524.18t_{29, 0.025}^2 \approx 4.18 vs χ1,0.9523.84\chi^2_{1, 0.95} \approx 3.84, so the profile CI is slightly narrower than the exact t-CI — slightly anti-conservative at finite nn — a 9% gap in the squared threshold that shrinks as n1n^{-1}.
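The threshold comparison is a two-liner in SciPy:

```python
from scipy.stats import chi2, t

t2 = t.ppf(0.975, df=29) ** 2    # exact squared t threshold at n = 30
c2 = chi2.ppf(0.95, df=1)        # asymptotic Wilks / profile threshold
print(round(t2, 3), round(c2, 3), round(t2 / c2, 3))   # 4.183 3.841 1.089
```

Raising `df` sends the ratio to 1, which is exactly the sense in which the profile CI is asymptotically correct.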

Example 11 Gamma shape with unknown rate

For XiGamma(α,β)X_i \sim \text{Gamma}(\alpha, \beta) iid, the conditional MLE of β\beta at fixed α\alpha has a closed form: the rate MLE score equation is nα/βXi=0n\alpha/\beta - \sum X_i = 0, giving β^(α)=α/Xˉ\hat\beta(\alpha) = \alpha/\bar X. The profile for α\alpha is

P(α)=nαlog(α/Xˉ)nαnlogΓ(α)+(α1)logXi.\ell_P(\alpha) = n\alpha\log(\alpha/\bar X) - n\alpha - n\log\Gamma(\alpha) + (\alpha - 1)\sum \log X_i.

The profile MLE α^n\hat\alpha_n is the solution of P(α)=0\ell_P'(\alpha) = 0 — no closed form; solve numerically. The profile CI is then the χ12\chi^2_1-threshold contour around α^n\hat\alpha_n. For a seeded sample of n=30n = 30 from Gamma(2,1)\text{Gamma}(2, 1), the profile CI for the shape contains the true value 2.02.0 at approximately the nominal 95% rate across replications — confirmed empirically in ProfileLikelihoodExplorer and test 19J of the testing harness.
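A sketch of the full numerical pipeline for Example 11 — our own seed and search bounds, not the text’s seeded sample or its ProfileLikelihoodExplorer artifact: maximize the profile, then bracket the two χ²₁-threshold crossings.

```python
import numpy as np
from scipy.optimize import brentq, minimize_scalar
from scipy.special import gammaln
from scipy.stats import chi2

rng = np.random.default_rng(0)                 # illustrative seed
x = rng.gamma(shape=2.0, scale=1.0, size=30)
n, xbar, sumlog = len(x), x.mean(), np.log(x).sum()

def ell_p(a):
    """Profile log-likelihood of the shape a, rate profiled out at beta_hat(a) = a / xbar."""
    return n * a * np.log(a / xbar) - n * a - n * gammaln(a) + (a - 1) * sumlog

opt = minimize_scalar(lambda a: -ell_p(a), bounds=(1e-3, 50.0), method="bounded")
a_hat, ell_max = opt.x, -opt.fun

# CI endpoints: where the profile drops by chi2_{1,0.95}/2 from its maximum
drop = chi2.ppf(0.95, df=1) / 2
g = lambda a: ell_p(a) - (ell_max - drop)
lo = brentq(g, 1e-3, a_hat)
hi = brentq(g, a_hat, 50.0)
print(f"alpha_hat = {a_hat:.3f}, 95% profile CI = [{lo:.3f}, {hi:.3f}]")
```

`brentq` needs a sign change on each bracket, which the concave profile guarantees once the drop threshold sits below the maximum.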

Remark 13 Wilks as the engine, not a lemma

Proof 4 is three lines long. The heavy lifting — the Taylor expansion of the log-likelihood, remainder control, the asymptotic-normality-to-chi-squared step — all lives in the Wilks proof at Topic 18 §18.6, which Topic 19 consumes as a black-box engine. This is characteristic of how testing-theoretic machinery propagates: once Wilks is established, every composite-LRT and every profile-CI coverage fact follows from a one-line invocation. The pedagogical cost of not re-deriving Wilks at each use is zero; the reader who wants the details goes back to Topic 18, and the new material of Topic 19 stays about CIs rather than recycling Wilks.

Remark 14 Why profile CIs are the GLM default

In generalized linear models — logistic, Poisson, gamma regression, log-linear models — the workhorse CI for a single coefficient is the profile-likelihood CI, computed by refitting the model with the coefficient fixed at a grid of values and tracing the deviance drop. R’s confint() on a glm object defaults to this; the car::Confint extension makes it the explicit recommendation. The reason is the combination of (i) reparameterization invariance (Rem 8), (ii) asymptotic efficiency via Wilks (this proof), and (iii) correct coverage at boundaries (no Wald-type pathology). The cost is computational: each CI endpoint requires a refit. For large models this can be prohibitive, and practitioners fall back to Wald with sandwich variance estimators — a deliberate trade of statistical precision for compute.

Remark 15 Vector-θ profile CIs via envelope theorem

When θ\theta is vector-valued and we want a CI for a scalar function g(θ)g(\theta), the profile is over {ψ:g(ψ)=c}\{\psi : g(\psi) = c\} at each target value cc. Differentiability of P\ell_P in cc follows from the Danskin envelope theorem: dP/dcd\ell_P/dc equals /c\partial \ell / \partial c evaluated at the conditional MLE, under regularity. The vector-θ\theta profile-CI theory is deferred to a later topic; for Topic 19 we stay scalar.


19.8 Coverage Diagnostics: Actual vs Nominal

Every CI procedure in §19.3–§19.7 advertises nominal coverage 1α1 - \alpha. The question §19.8 asks is: what is the actual coverage at every true parameter value? For discrete families — binomial, Poisson — the answer is not “1α1-\alpha everywhere.” It oscillates with pp, sometimes dips below 1α1 - \alpha (Wald’s catastrophic failure), sometimes stays safely above (Clopper–Pearson’s conservatism). Getting the diagnostic right is the difference between a procedure you can trust and one you can’t.

Binomial coverage at n = 20, 100, 500 with α = 0.05. Wald (amber) oscillates below nominal across p ∈ [0.005, 0.995] and crashes near the boundary. Wilson (purple) hovers around 0.95 with modest sawtooth. Clopper–Pearson (green) is always at or above 0.95 but often substantially — the price of exact coverage. Horizontal dashed line at the nominal 0.95 level.

Definition 8 Actual coverage

For a CI procedure CC at nominal level 1α1 - \alpha, the actual coverage at parameter θ\theta is

Cov(θ;C,n,α)=Pθ(θC(X)),\text{Cov}(\theta; C, n, \alpha) = P_\theta(\theta \in C(X)),

where XX is a sample of size nn from PθP_\theta. The nominal level is advertised; the actual level is the performance. For continuous families and asymptotic CIs, Cov(θ;C,n,α)1α\text{Cov}(\theta; C, n, \alpha) \to 1 - \alpha as nn \to \infty; for discrete families, the approach to nominal is one-sided from above (conservative procedures like Clopper–Pearson) or oscillating, with dips below nominal (anti-conservative procedures like Wald).

Theorem 6 Wald CI under-coverage at the Bernoulli boundary

For iid Bernoulli(p)(p) with nn fixed, the Wald CI CWald(X)=p^n±zα/2p^n(1p^n)/nC_{\rm Wald}(X) = \hat p_n \pm z_{\alpha/2}\sqrt{\hat p_n(1-\hat p_n)/n} satisfies

limp0+Cov(p;CWald,n,α)=0,\lim_{p \to 0^+}\text{Cov}(p; C_{\rm Wald}, n, \alpha) = 0,

for every n<n < \infty. That is, the Wald interval has actual coverage going to zero as pp approaches the boundary. Proof sketch. As p0+p \to 0^+, the probability that X=0X = 0 in all nn draws is (1p)n1(1-p)^n \to 1; conditional on X=0X = 0, p^n=0\hat p_n = 0 and the Wald CI is [0,0][0, 0] — which does not contain any p>0p > 0. The remaining events have probability 0\to 0. Formally provable via BRO2001’s exact-coverage calculation for p0p \downarrow 0. ∎
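The boundary collapse is cheap to verify exactly — no Monte Carlo needed — by summing the binomial pmf over covering outcomes; a sketch (helper name ours):

```python
import numpy as np
from scipy.stats import binom

def wald_coverage(n, p, z=1.959964):
    """Exact P_p(p in Wald CI): sum the pmf over outcomes whose interval covers p."""
    x = np.arange(n + 1)
    ph = x / n
    half = z * np.sqrt(ph * (1 - ph) / n)
    covers = (ph - half <= p) & (p <= ph + half)
    return binom.pmf(x[covers], n, p).sum()

for p in (0.10, 0.01, 0.001):   # coverage collapses as p approaches 0 (n = 20 fixed)
    print(p, round(wald_coverage(20, p), 4))
```

At p = 0.001 the X = 0 outcome, with its degenerate [0, 0] interval, carries almost all the probability mass, so coverage is essentially P(X ≥ 1) ≈ np.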

Example 12 Monte Carlo coverage at n ∈ {20, 100, 500}, BRO2001 Table 1

At α=0.05\alpha = 0.05, for pp varying over a fine grid:

| (n, p) | Wald | Wilson | Clopper–Pearson |
| --- | --- | --- | --- |
| (20, 0.05) | 0.368 (under) | 0.921 (under slightly) | 0.990 (conservative) |
| (20, 0.10) | 0.878 (under) | 0.961 (good) | 0.997 (conservative) |
| (20, 0.50) | 0.952 (good) | 0.959 (good) | 0.988 (conservative) |
| (100, 0.05) | 0.873 (under) | 0.936 (under slightly) | 0.972 (conservative) |
| (500, 0.05) | 0.941 (near) | 0.952 (good) | 0.959 (near) |

Source: BRO2001 Table 1, extended to n=500n = 500. The pattern: Wald under-covers across small nn and small pp; Wilson oscillates around nominal with small amplitude; Clopper–Pearson is always at or above nominal but often notably so. Numerical reproduction via actualCoverageBinomial in the testing harness (test 19D).

Remark 16 Anti-conservative vs conservative — why both are bad

Three coverage failure modes, three implications:

  • Anti-conservative (actual < nominal). Your 95% CI covers less than 95% of the time. Inferences drawn from it are over-confident — reject the null too often (size inflation); treat confidence bands as narrower than they actually are. Worst case for hypothesis testing, because false positives are expensive.

  • Conservative (actual > nominal). Your 95% CI covers more than 95% of the time. Inferences are under-powered — fail to reject when the effect is real (power loss); present intervals wider than necessary. Bad for A/B test throughput; OK for regulatory submissions where guaranteed coverage matters more than efficiency.

  • Correct (actual ≈ nominal). Gold standard; what asymptotic theory promises.

Wald’s failure mode is anti-conservative (undersizing), which is the worse failure for testing applications. Clopper–Pearson’s conservatism is the “safe” failure — it costs power but doesn’t inflate Type I error. Wilson threads the needle by oscillating around nominal with small amplitude.

Remark 17 Coverage as a misspecification tool

Running a coverage simulation — generate many samples from a known model, apply your CI procedure, count coverage — is the quickest diagnostic for whether your asymptotic theory is accurate at your actual nn. If the simulated coverage is far from nominal, something is wrong: the sample size is too small for asymptotic validity, the parametric model is misspecified, or the CI procedure doesn’t match the data-generating process. Each of these has a different fix, but all start by seeing that coverage is wrong. In practice, running a coverage simulation on your specific setup before trusting the default CI is cheap and often illuminating — especially for GLMs at small cell counts or for hierarchical models where nominal coverage can be off by 510%5-10\%.


19.9 One-Sided CIs and TOST Equivalence Testing

The CIs so far have been two-sided — sets [L(X),U(X)][L(X), U(X)] bounded on both sides. Two variants matter in practice. One-sided CIs bound the parameter only from above (or only from below) — useful when you care about a worst-case guarantee on one side (toxicity rate, contamination level). TOST (two one-sided tests) flips the question: rather than testing “is θ\theta equal to θ0\theta_0?” it tests “is θ\theta equivalent to θ0\theta_0 to within margin δ\delta?” TOST is the FDA-standard framework for bioequivalence trials.

TOST geometry. Left: conventional test of H₀: μ = 0 with a 95% z-CI. Failing to reject H₀ does NOT establish equivalence — wide CIs can fail to reject everything. Right: TOST framing. Two one-sided tests of H₀^L: μ ≤ −δ and H₀^U: μ ≥ +δ at level α each. Equivalence is established iff the (1 − 2α) = 90% two-sided CI fits entirely inside the equivalence region [−δ, +δ]. FDA bioequivalence: ±δ = log 1.25 on log-ratio scale (the "80/125 rule").

Definition 9 One-sided confidence interval

A one-sided upper-bound confidence interval for θ\theta at level 1α1-\alpha is a data-measurable U(X)U(X) satisfying Pθ(θU(X))1αP_\theta(\theta \le U(X)) \ge 1 - \alpha for every θ\theta. Equivalent to inverting a one-sided level-α\alpha test of H0:θ=θ0H_0: \theta = \theta_0 vs H1:θ<θ0H_1: \theta < \theta_0 (rejecting for small values of the estimator). A one-sided lower-bound CI is defined symmetrically with L(X)L(X) satisfying Pθ(θL(X))1αP_\theta(\theta \ge L(X)) \ge 1 - \alpha.

Geometrically: (,U(X)](-\infty, U(X)] or [L(X),)[L(X), \infty) as the confidence region. One-sided CIs use zαz_\alpha (the upper-α\alpha quantile), not zα/2z_{\alpha/2} — all the “missing” coverage goes on the bounded side.

Theorem 7 TOST equivalence procedure

Fix θ0\theta_0 and an equivalence margin δ>0\delta > 0. The non-equivalence null is H0:θθ0δH_0: |\theta - \theta_0| \ge \delta; the equivalence alternative is H1:θθ0<δH_1: |\theta - \theta_0| < \delta. The Two One-Sided Tests (TOST) procedure rejects H0H_0 iff both

φL=1{reject H0L:θθ0δ at level α},φU=1{reject H0U:θθ0+δ at level α}\varphi^L = \mathbf{1}\{\text{reject } H_0^L: \theta \le \theta_0 - \delta \text{ at level } \alpha\}, \qquad \varphi^U = \mathbf{1}\{\text{reject } H_0^U: \theta \ge \theta_0 + \delta \text{ at level } \alpha\}

reject, i.e. φL=φU=1\varphi^L = \varphi^U = 1. TOST rejects the non-equivalence null iff the conventional (12α)(1 - 2\alpha) two-sided confidence interval for θ\theta is contained entirely within [θ0δ,θ0+δ][\theta_0 - \delta, \theta_0 + \delta].

Level α\alpha. Each of the two one-sided tests has size α\le \alpha; the intersection has size at most α\alpha under any single point in H0H_0. Attribution: SCH1987 (Schuirmann 1987), the foundational paper introducing the two one-sided tests procedure for FDA bioequivalence.

Example 13 One-sided upper bound on a toxicity rate

A Phase I safety trial in 50 patients observes 2 serious adverse events (p^=0.04\hat p = 0.04). Regulatory requirement: a one-sided upper 97.5% bound on the true AE rate. Using the Wilson construction inverted to one-sided:

pU=p^+z2/(2n)+zp^(1p^)/n+z2/(4n2)1+z2/n0.135,p_U = \frac{\hat p + z^2/(2n) + z\sqrt{\hat p(1-\hat p)/n + z^2/(4n^2)}}{1 + z^2/n} \approx 0.135,

with z=z0.025=1.96z = z_{0.025} = 1.96 and n=50n = 50. Interpretation: we can conclude with 97.5% confidence that the true AE rate is at most 13.5%13.5\%. The report does not mention a lower bound because the regulator doesn’t care — only the worst case matters for safety.
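Recomputing the upper bound from the displayed formula (a sketch; `z` is the 97.5% standard-normal quantile):

```python
import numpy as np

x, n = 2, 50
z = 1.959964                      # one-sided 97.5% standard-normal quantile
ph = x / n
den = 1 + z**2 / n
p_upper = (ph + z**2 / (2 * n)
           + z * np.sqrt(ph * (1 - ph) / n + z**2 / (4 * n**2))) / den
print(round(p_upper, 3))          # one-sided 97.5% Wilson upper bound on the AE rate
```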

Example 14 FDA bioequivalence — the 80/125 rule

A generic drug is “bioequivalent” to the reference if the ratio of mean drug concentrations (test / reference) is between 0.800.80 and 1.251.25. On the log scale, this is ±log(1.25)=±0.2231\pm \log(1.25) = \pm 0.2231. A bioequivalence trial measures log(test/reference)=μ\log(\text{test} / \text{reference}) = \mu with n=24n = 24 paired subjects, observing μ^=0.05\hat\mu = 0.05, S=0.18S = 0.18, so SE(μ^)=0.0367\text{SE}(\hat\mu) = 0.0367. TOST at α=0.05\alpha = 0.05:

  • H0L:μ0.2231H_0^L: \mu \le -0.2231 rejected iff μ^(0.2231)>t23,0.05SE\hat\mu - (-0.2231) > t_{23, 0.05} \cdot \text{SE}: 0.2731>1.7140.0367=0.06300.2731 > 1.714 \cdot 0.0367 = 0.0630 — rejected.
  • H0U:μ0.2231H_0^U: \mu \ge 0.2231 rejected iff 0.2231μ^>t23,0.05SE0.2231 - \hat\mu > t_{23, 0.05} \cdot \text{SE}: 0.1731>0.06300.1731 > 0.0630 — rejected.

Both rejected \Rightarrow equivalence established. Equivalently, the 90% (=12α= 1 - 2\alpha) CI for μ\mu is μ^±t23,0.05SE=[0.013,0.113]\hat\mu \pm t_{23, 0.05} \cdot \text{SE} = [-0.013, 0.113], which fits inside [0.2231,0.2231][-0.2231, 0.2231]. Both formulations agree.

Remark 18 Noninferiority vs equivalence

Three related frameworks with similar arithmetic but different stakes:

  • Equivalence (TOST above). Reject non-equivalence iff [L,U][θ0δ,θ0+δ][L, U] \subset [\theta_0 - \delta, \theta_0 + \delta]. Symmetric.

  • Noninferiority. One-sided: reject non-inferiority iff the upper (or lower) bound of the CI is below (or above) the threshold. Used when the question is “is the new treatment at worst only δ\delta worse than the reference?” — a weaker claim than equivalence.

  • Superiority. Classical test: reject null iff the CI excludes θ0\theta_0. The default when you want to show the new treatment is better, not just equivalent or noninferior.

Confusing the three is common; each is a different level-α\alpha claim. FDA’s preferred default for generics is bioequivalence (TOST); for biosimilars and branded-drug effectiveness claims, noninferiority and superiority trials are standard — and the TOST framework generalizes to all three by choice of which tail(s) to test.


19.10 Limitations and Forward Look

Topic 19 built the frequentist confidence-interval framework for scalar θ\theta. Four directions for deeper study, all deferred to specific later topics or tracks.

Topic 19's forward connections. Central: test-CI duality (Theorem 1). Outward arrows to Track 6 (regression CIs via Wald-t per §21.8 Thm 8 and LRT/profile-likelihood per §21.8 Rem 15), Track 7 (Bayesian credible intervals — different semantics, same numerics under flat priors), Track 8 (bootstrap CIs — distribution-free extension), and Topic 20 (simultaneous CIs with FWER/FDR control). Backward arrow: Topic 18 §18.10 Rem 22 — the organizing-principle remark that Topic 19 fulfilled.

Remark 19 Scope boundary — what Topic 19 did not cover

Six topics stated in pointers, none proved in Topic 19.

  1. Bootstrap CIs. Percentile, BCa, and studentized bootstrap intervals are the nonparametric analog of the pivotal-CI machinery of §19.3. Track 8 develops the theory — Efron 1979, Efron & Tibshirani 1993, Hall 1992 — with the key asymptotic correctness result (Hall’s second-order accuracy for BCa) as the anchor.

  2. Bayesian credible intervals. Track 7’s territory. HPD (highest-posterior-density) intervals, posterior-probability bands, Lindley’s paradox. Rem 3 of §19.1 sets the frequentist/Bayesian contrast; the full Bayesian theory is Track 7.

  3. Simultaneous CIs, confidence ellipsoids, Hotelling’s T2T^2. Vector-θ\theta and multiple-parameter confidence regions require test-family inversion with FWER control (Bonferroni, Scheffé, Tukey, Working–Hotelling). LEH2005 Ch. 7 and Ch. 8 are the canonical references; Topic 20 §20.9 delivers the simultaneous-CI construction as the dual of the §20.4 FWER procedures — Thm 8 plus the SimultaneousCIBands interactive artifact.

  4. Fieller’s theorem, ratio CIs. The CI for a ratio μ1/μ2\mu_1/\mu_2 of Normal means requires Fieller 1954’s machinery — the “confidence set” can be unbounded, disconnected, or empty depending on the sign of the denominator. Niche but important in some drug-discovery contexts; LEH2005 §9.2 has the derivation.

  5. Permutation-based CIs. Invert permutation tests to get distribution-free CIs for location, scale, and dispersion parameters. Fisher’s original exact test framework, extended by Romano 1989. Track 8.

  6. Sequential CIs and confidence sequences. Modern always-valid inference: intervals that remain valid under optional stopping. See the always-valid-inference literature and the contemporaneous A/B-testing platform work (Optimizely, LinkedIn, Airbnb) — critical for modern online experimentation. Not covered; formalml.com/ab-testing-platforms gives the production angle.

Remark 20 Bootstrap — the nonparametric extension

The bootstrap (Track 8) replaces the assumption of a known parametric family with resampling-with-replacement from the empirical distribution. The percentile bootstrap CI is the (α/2,1α/2)(\alpha/2, 1 - \alpha/2) quantiles of the bootstrap distribution of the statistic; the BCa (bias-corrected accelerated) bootstrap corrects for first-order bias and skewness. Under regularity BCa achieves second-order accuracy — coverage error O(n1)O(n^{-1}) vs percentile’s O(n1/2)O(n^{-1/2}). Track 8 develops the full theory, including the conditions under which the bootstrap fails (sample extrema, non-regular estimands). The key modern application: deriving confidence intervals for ML model performance metrics (test accuracy, ROC AUC, precision at kk) without assuming a specific distributional form.

Remark 21 Bayesian credible intervals — different question, related answer

A (1α)(1-\alpha) Bayesian credible interval is a set CC with posterior probability π(θCX)=1α\pi(\theta \in C \mid X) = 1 - \alpha. For Normal data with known variance and a flat improper prior, the credible interval coincides numerically with the z-CI — same endpoints, different interpretation. For skewed posteriors (Beta, Gamma, mixture) the credible interval and the frequentist CI diverge: the credible interval can be asymmetric in ways the frequentist CI cannot capture unless explicitly constructed (e.g., LRT CIs are asymmetric too). Topic 25 develops the theory; the key takeaways for Topic 19 are (1) frequentist coverage \ne Bayesian credibility in general, (2) the two coincide under specific prior choices (flat improper prior for Normal mean — §25.8 Rem 17), and (3) frequentist guarantees are over data, Bayesian guarantees are over parameter. Topic 25 §25.8 Thm 5 (Bernstein–von Mises) proves the two frameworks agree asymptotically. Different questions, compatible answers under compatibility conditions.

Remark 22 Cheat sheet — CI choice by situation
| Situation | Choice | Rationale |
| --- | --- | --- |
| Normal mean, σ known | z-CI (§19.3 Ex 3) | Exact pivot |
| Normal mean, σ unknown | t-CI (§19.3 Ex 3) | Exact pivot via Basu |
| Normal variance | χ²-CI (§19.3 Ex 4) | Exact pivot, asymmetric |
| Binomial proportion, not near boundary | Wilson (§19.5) | Asymptotic, stays in [0, 1] |
| Binomial proportion, rare event | Wilson or Clopper–Pearson | Wilson default; CP if strict coverage |
| Binomial proportion, regulatory | Clopper–Pearson (§19.6) | Exact conservative |
| Binomial, quick hand calculation | Agresti–Coull plus-4 (Rem 9) | Approximates Wilson |
| Poisson rate | Wald or Score (§19.4) | Asymptotic; Wilson analog works |
| GLM coefficient | Profile likelihood (§19.7) | Reparameterization invariant |
| GLM coefficient, large model | Wald (§19.4) | Speed — refit cost of profile |
| One-sided bound on risk / rate | One-sided Wilson (§19.9) | Worst-case guarantee |
| Bioequivalence / noninferiority | TOST (§19.9) | FDA standard |
| Nonparametric, distribution unknown | Order-statistic CIs for quantiles (§29.7); bootstrap (Track 8) | No parametric assumption |
| Posterior-probability claim needed | Bayesian credible (Topic 25) | Different framework |

Default: Wilson for binomial, profile-LRT for GLM coefficients, t-CI for Normal mean, bootstrap BCa for everything else.

Remark 23 Track 5 closing — where this framework lives in ML

Topic 17 built the framework; Topic 18 delivered optimality; Topic 19 is the CI dual. Topic 20 closes the track with multiple testing — every technique of Topics 17–19 applied to many hypotheses simultaneously with FWER/FDR control, culminating in the featured Benjamini–Hochberg proof and the Bonferroni / Šidák simultaneous CIs that dualize the FWER procedures.

Beyond Track 5, the framework continues to matter. Track 6 (Regression). Every GLM coefficient CI is a Wald or LRT CI of §19.4; the F-test for linear regression is Wilks (§21.8 Thm 9 sharpens Topic 18’s χk2\chi^2_k limit to the exact Fk,np1F_{k, n-p-1} distribution). Track 7 (Bayesian). The contrast with frequentist coverage is where Bayesian inference earns its keep. Topic 25 §25.6 introduces credible intervals; §25.8 shows their asymptotic numerical agreement with Wald CIs under BvM. Track 8 (Nonparametric). Bootstrap CIs are the distribution-free extension; permutation tests invert to distribution-free CIs.

On formalML: A/B test confidence intervals on conversion rates use Wilson by default; always-valid confidence sequences extend the framework to sequential monitoring; conformal prediction extends it to distribution-free predictive intervals; causal-inference packages (DoubleML, causalml) report sandwich-Wald CIs on treatment effects; PAC-Bayes generalization bounds are uniform confidence statements on the hypothesis class. The test-CI duality of §19.2 is the statistical grammar underlying all of it — one theorem, a hundred specializations, one pedagogical frame.


References

  1. Wilson, Edwin B. (1927). Probable Inference, the Law of Succession, and Statistical Inference. Journal of the American Statistical Association, 22(158), 209–212.

  2. Clopper, Charles J., and Egon S. Pearson. (1934). The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial. Biometrika, 26(4), 404–413.

  3. Neyman, Jerzy. (1937). Outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability. Philosophical Transactions of the Royal Society A, 236(767), 333–380.

  4. Wilks, Samuel S. (1938). The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses. Annals of Mathematical Statistics, 9(1), 60–62.

  5. Rao, C. Radhakrishna. (1948). Large Sample Tests of Statistical Hypotheses Concerning Several Parameters with Applications to Problems of Estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 44(1), 50–57.

  6. Schuirmann, Donald J. (1987). A Comparison of the Two One-Sided Tests Procedure and the Power Approach for Assessing the Equivalence of Average Bioavailability. Journal of Pharmacokinetics and Biopharmaceutics, 15(6), 657–680.

  7. Agresti, Alan, and Brent A. Coull. (1998). Approximate Is Better than ‘Exact’ for Interval Estimation of Binomial Proportions. The American Statistician, 52(2), 119–126.

  8. Brown, Lawrence D., T. Tony Cai, and Anirban DasGupta. (2001). Interval Estimation for a Binomial Proportion. Statistical Science, 16(2), 101–133.

  9. Casella, George, and Roger L. Berger. (2002). Statistical Inference (2nd ed.). Pacific Grove, CA: Duxbury.

  10. Lehmann, Erich L., and Joseph P. Romano. (2005). Testing Statistical Hypotheses (3rd ed.). Springer Texts in Statistics. New York: Springer.