intermediate 60 min read · April 26, 2026

Model Selection & Information Criteria

When every candidate model fits the training data differently, a principled ranking criterion is the only honest arbiter. AIC (Akaike) estimates out-of-sample log-likelihood asymptotically; BIC (Schwarz) approximates the Bayesian marginal likelihood; Mallows' Cp is the Gaussian special case; Stone's 1977 theorem identifies LOO-CV with AIC; Yang 2005 proves that selection consistency and prediction efficiency are mutually incompatible. Track 6 closes here.

24.1 The model-selection problem

Given data $\mathbf{y}$ and a candidate family $\{\mathcal{M}_1, \mathcal{M}_2, \ldots, \mathcal{M}_M\}$ of statistical models — each with parameter space $\Theta_k$ and likelihood $L_k(\theta_k; \mathbf{y})$ — the model-selection problem is to choose one model from the family by a procedure that has a defensible asymptotic justification. Maximum log-likelihood $\hat\ell$ alone is not such a procedure: $\hat\ell$ is monotone in model complexity, so it always picks the largest candidate.

Definition 1 Prediction risk

For an estimator $\hat\theta$ trained on $\mathbf{y} = (Y_1, \ldots, Y_n)$ and a fresh observation $Y_{\text{new}} \sim f_0$ from the same data-generating distribution, the prediction risk under loss $L$ is

$$R(\hat\theta) := \mathbb{E}_{Y_{\text{new}}}\!\left[L(\hat\theta;\, Y_{\text{new}})\right].$$

For log-likelihood loss $L(\hat\theta; y) = -\log f(y; \hat\theta)$, the risk is the negative expected log-predictive

$$R(\hat\theta) = -\mathbb{E}_{Y_{\text{new}}}\!\left[\log f(Y_{\text{new}}; \hat\theta)\right] = -\mathrm{EL}(\hat\theta).$$

Definition 2 Optimism gap

The training empirical risk is $\hat R(\hat\theta) = (1/n)\sum_{i=1}^n L(\hat\theta; Y_i)$. The optimism gap measures how much in-sample loss underestimates out-of-sample loss in expectation:

$$\mathrm{opt} := \mathbb{E}\!\left[R(\hat\theta)\right] - \mathbb{E}\!\left[\hat R(\hat\theta)\right].$$

For correctly specified parametric models satisfying Topic 14's regularity conditions, $\mathrm{opt} = k/n + o(1/n)$ where $k = \dim(\theta)$ — the optimism is exactly the parameter count divided by sample size, to leading order.

Example 1 Polynomial fit on sin(2πx): in-sample lies, out-of-sample doesn't

Generate $n=80$ observations from $y = \sin(2\pi x) + \mathcal{N}(0, 0.25^2)$ with $x \sim \mathcal{U}(0, 1)$. Fit polynomials of degree $d = 0, 1, \ldots, 12$ by ordinary least squares. The training $\mathrm{RSS}$ is monotone decreasing in $d$ (every extra coefficient can only reduce the in-sample residual sum of squares). The out-of-sample prediction risk, however, follows a U-shape: it falls as $d$ rises from $0$ to $\sim 6$ (the polynomial approximates the sine well), then climbs as $d$ exceeds $\sim 8$ (the polynomial wiggles to fit noise). Figure 1 shows the gap.

Two-panel figure. Left: training RSS vs polynomial degree d on the sin DGP, monotone-decreasing from d=0 to d=12. Right: out-of-sample prediction risk on a fresh test set of size 1000, U-shaped with minimum near d=6. Vertical reference line at d=6 in both panels. Title: 'In-sample fit lies; out-of-sample risk does not'.
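A minimal sketch of this experiment in Python, assuming numpy and a fresh draw from the stated DGP (the course's pinned POLY_DGP numbers come from its own implementation and RNG stream, so exact values will differ):

```python
# Sketch of Example 1: train RSS vs out-of-sample MSE over polynomial degree.
import numpy as np

rng = np.random.default_rng(4242)          # seed is an assumption for illustration
n, sigma = 80, 0.25
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(0, sigma, n)
x_test = rng.uniform(0, 1, 1000)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, sigma, 1000)

for d in range(13):
    coef = np.polyfit(x, y, d)                               # OLS fit of a degree-d polynomial
    rss_train = np.sum((y - np.polyval(coef, x)) ** 2)       # monotone decreasing in d
    mse_test = np.mean((y_test - np.polyval(coef, x_test)) ** 2)  # U-shaped, min near d = 6
    print(f"d={d:2d}  train RSS={rss_train:7.3f}  test MSE={mse_test:.4f}")
```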
Remark 1 Selection as a decision problem

Model selection is a decision problem over the candidate set $\{\mathcal{M}_1, \ldots, \mathcal{M}_M\}$ — distinct from parameter estimation, which decides over $\Theta_k$ for a fixed $\mathcal{M}_k$. AIC, BIC, and cross-validation correspond to three different loss functions on the model space, each with its own asymptotic guarantee.

Remark 2 Three question families: consistency, efficiency, sparsity recovery

Three asymptotic targets organize the literature: selection consistency ($\mathbb{P}(\hat{\mathcal{M}} = \mathcal{M}_*) \to 1$ when the truth is in the candidate set — BIC's target), prediction efficiency (minimax-rate-optimal $L^2$ risk in the misspecified, nonparametric regime — AIC and CV's target), and sparsity recovery (correctly identifying $\mathrm{supp}(\boldsymbol\beta_*)$ — Topic 23's lasso target). §24.6 Thm 5 shows the first two are formally incompatible.

24.2 Mallows' $C_p$ — the Gaussian-linear predecessor

Before Akaike’s information-theoretic framework, Mallows (1973) proposed a Gaussian-linear-specific criterion that estimates expected scaled prediction risk via a complexity penalty calibrated against a reference variance.

Definition 3 Mallows' $C_p$

For an OLS fit of a candidate model with $k$ free parameters (intercept + slopes + $\sigma^2$) to $n$ observations under the Gaussian linear model $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$, with residual sum of squares $\mathrm{RSS}_k$, Mallows' $C_p$ is

$$C_p := \frac{\mathrm{RSS}_k}{\hat\sigma^2_{\text{ref}}} + 2k - n,$$

where $\hat\sigma^2_{\text{ref}}$ is the MLE error variance from the largest candidate model (the conventional reference).

Example 2 $C_p$ unbiasedness for scaled prediction risk (embedded derivation)

Under the Gaussian linear model with $\hat\sigma^2_{\text{ref}}$ known (or treated as known via the reference convention), the expected scaled in-sample residual sum is

$$\mathbb{E}\!\left[\mathrm{RSS}_k / \sigma^2\right] = n - k.$$

Adding $2k - n$ to both sides:

$$\mathbb{E}[C_p] = (n - k) + 2k - n = k.$$

A separate calculation (Mallows 1973 §3) gives the expected scaled prediction risk on a fresh test set of size $n$:

$$\mathbb{E}\!\left[\mathrm{R}_{\text{scaled}}\right] = k + (\text{model bias term}).$$

Under correct specification (model bias $= 0$), $\mathbb{E}[C_p] = \mathbb{E}[\mathrm{R}_{\text{scaled}}] = k$. Otherwise $C_p$ overestimates the parameter count by exactly the model bias — making $C_p$ minus its parameter count a calibrated estimator of model bias, the original use case Mallows 1973 §3 emphasized.

Remark 3 $C_p$ ≡ AIC under Gaussian-linear errors

Under the Gaussian linear model, $\mathrm{AIC} = n\log(\mathrm{RSS}/n) + 2k$ (dropping additive constants), and a Taylor expansion of $\log(\mathrm{RSS}/n)$ around $\sigma^2_{\text{ref}}$ recovers $C_p$ up to higher-order terms. The two criteria rank candidates identically under Gaussian homoscedastic errors; AIC is the strict generalization to other exponential families (§24.3 Thm 1).

Remark 4 Choice of $\hat\sigma^2_{\text{ref}}$

The convention $\hat\sigma^2_{\text{ref}} = \mathrm{RSS}_{\text{full}}/(n - k_{\text{full}})$ — using the unbiased estimator from the largest candidate — has the cleanest theory: $\hat\sigma^2_{\text{ref}}$ is unbiased for $\sigma^2$ if the largest model is correctly specified, and $C_p$'s argmin then targets the bias-variance-optimal submodel. T10.8 in regression.test.ts pins $\hat\sigma^2_{\text{ref}} = 0.0497586381$ for the canonical POLY_DGP at $n=80$, giving $\arg\min_d C_p = 6$.

Example 3 $C_p$ on the polynomial DGP (T10 pinned)

On the canonical POLY_DGP ($n=80$, $\sigma=0.25$, seed 4242), with $\hat\sigma^2_{\text{ref}} = 0.0497586381$ from the $d = 12$ fit, the Mallows $C_p$ values for $d = 0, 1, \ldots, 12$ have their argmin at $d = 6$ with $C_p = 21.4569$ (T10.8). The argmin coincides with AIC's argmin (T10.3), illustrating Rem 3's $C_p \equiv \mathrm{AIC}$ equivalence on the Gaussian linear model. Figure 2 plots the $C_p$ curve alongside the in-sample $\mathrm{RSS}/\hat\sigma^2_{\text{ref}}$ to make the $+2k - n$ correction visually concrete.

Two curves over polynomial degree d=0..12 on the POLY_DGP. Bottom curve: scaled RSS (RSS/sigma^2_ref) — monotone-decreasing. Top curve: Mallows Cp = scaled RSS + 2k - n — U-shaped with argmin at d=6. Vertical reference at argmin Cp. Title: 'Mallows Cp adds the +2k - n optimism correction'.
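A minimal sketch of the $C_p$ computation, reusing the x, y arrays from the Example 1 sketch; the parameter count $k = d + 2$ and the Rem 4 reference convention are assumptions copied from the text, and the pinned T10 values will not reproduce exactly:

```python
# Sketch of Def 3 / Ex 3: Mallows' C_p over polynomial degree.
import numpy as np

def poly_rss(x, y, d):
    coef = np.polyfit(x, y, d)
    return np.sum((y - np.polyval(coef, x)) ** 2)

n = len(y)
k_full = 12 + 2                                    # intercept + 12 slopes + sigma^2 (Def 3's count)
sigma2_ref = poly_rss(x, y, 12) / (n - k_full)     # Rem 4's reference convention
cp = {d: poly_rss(x, y, d) / sigma2_ref + 2 * (d + 2) - n for d in range(13)}
print(min(cp, key=cp.get))                         # argmin expected near d = 6 on the POLY_DGP
```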

24.3 Akaike's information criterion

Akaike (1974) showed that the optimism gap of Def 2 is exactly $k/n$ to leading order under correct specification and regularity, and that $-2\hat\ell + 2k$ is therefore an asymptotically unbiased estimator of $-2n$ times the expected out-of-sample log-likelihood. This is AIC — and the heart of the modern model-selection framework.

Definition 4 KL divergence and expected log-likelihood

For two densities $f_0$ (truth) and $f(\cdot;\theta)$ (model), the Kullback–Leibler divergence is

$$D(f_0 \,\|\, f_\theta) := \mathbb{E}_{Y \sim f_0}\!\left[\log\frac{f_0(Y)}{f(Y;\theta)}\right] = \mathbb{E}_{f_0}[\log f_0(Y)] - \mathrm{EL}(\theta),$$

where the expected log-likelihood under the truth is

$$\mathrm{EL}(\theta) := \mathbb{E}_{Y \sim f_0}[\log f(Y;\theta)].$$

Since $\mathbb{E}_{f_0}[\log f_0(Y)]$ does not depend on $\theta$, maximizing $\mathrm{EL}(\theta)$ is equivalent to minimizing $D(f_0 \,\|\, f_\theta)$.

Theorem 1 AIC bias-correction (Akaike 1974)

Let $Y_1, \ldots, Y_n$ be iid from an unknown density $f_0$, and let $\{f(\cdot;\theta) : \theta \in \Theta \subset \mathbb{R}^k\}$ be a parametric model satisfying Topic 14's regularity conditions: smooth log-likelihood, positive-definite Fisher information, MLE asymptotic normality. Let $\hat\theta$ be the MLE, $\hat\ell = \ell(\hat\theta;\mathbf{y})$, and define $\mathrm{AIC} := -2\hat\ell + 2k$. Then under correct specification ($f_0 \in \{f(\cdot;\theta)\}$),

$$\mathbb{E}\!\left[\mathrm{AIC}\right] = -2n\cdot\mathbb{E}\!\left[\mathrm{EL}(\hat\theta)\right] + o(1).$$

That is, $\mathrm{AIC}$ is asymptotically unbiased for $-2n$ times the expected out-of-sample log-predictive evaluated at the MLE.

Proof 1 Akaike's bias-correction theorem

Setup. Let $Y_1, \dots, Y_n$ be iid from an unknown density $f_0$; let $\{f(\cdot;\theta) : \theta \in \Theta \subset \mathbb{R}^k\}$ be a parametric model satisfying Topic 14's regularity conditions (smooth $\ell$, positive-definite Fisher information, MLE asymptotic normality). We want to rank candidate models by the expected out-of-sample log-predictive

$$\mathrm{EL}(\theta) := \mathbb{E}_{Y_{\text{new}} \sim f_0}[\log f(Y_{\text{new}}; \theta)],$$

evaluated at the plug-in estimator $\hat\theta = \hat\theta(Y_1, \dots, Y_n)$. The target is $\mathbb{E}[\mathrm{EL}(\hat\theta)]$; the naive estimator is $\hat\ell/n$. We show the gap is $k/n$ to leading order.

Step 1 — KL-projected parameter. Define

$$\theta_0 := \arg\min_{\theta \in \Theta} D(f_0 \,\|\, f(\cdot;\theta)) = \arg\max_{\theta \in \Theta} \mathrm{EL}(\theta).$$

Under correct specification, $f_0 = f(\cdot; \theta_0)$. Under misspecification, $\theta_0$ is the best-in-family parameter.

Step 2 — MLE convergence. Under regularity,

$$\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{d} \mathcal{N}\bigl(\mathbf{0}, \;\mathcal{K}(\theta_0)^{-1} \mathcal{J}(\theta_0) \mathcal{K}(\theta_0)^{-1}\bigr),$$

where $\mathcal{J} = \operatorname{Var}_{f_0}[\nabla\log f]$ and $\mathcal{K} = -\mathbb{E}_{f_0}[\nabla^2\log f]$. Under correct specification, $\mathcal{J} = \mathcal{K} =: \mathcal{I}$ (Fisher identity, Topic 14 Thm 3), and the sandwich collapses to $\mathcal{I}^{-1}$.

Step 3 — Taylor-expand $\hat\ell(\hat\theta)$ around $\theta_0$.

$$\hat\ell(\hat\theta) = \hat\ell(\theta_0) + (\hat\theta - \theta_0)^\top \nabla\hat\ell(\theta_0) + \tfrac12 (\hat\theta - \theta_0)^\top \nabla^2\hat\ell(\theta_0)(\hat\theta - \theta_0) + o_p(1).$$

The score $\nabla\hat\ell(\theta_0) = \sum_i \nabla\log f(Y_i;\theta_0)$ has mean zero under $f_0$ at $\theta_0$ (stationarity of $\mathrm{EL}$). By the LLN, $n^{-1}\nabla^2\hat\ell(\theta_0) \to -\mathcal{K}(\theta_0)$.

Step 4 — Taylor-expand $\mathrm{EL}(\hat\theta)$ around $\theta_0$.

$$\mathrm{EL}(\hat\theta) = \mathrm{EL}(\theta_0) - \tfrac12 (\hat\theta - \theta_0)^\top \mathcal{K}(\theta_0)(\hat\theta - \theta_0) + o_p(1).$$

The first-order term vanishes (stationarity of $\mathrm{EL}$); the second-order term is negative since $\mathcal{K}$ is positive-definite.

Step 5 — Take expectations under $f_0^n$. Using Step 3 plus the asymptotic covariance:

$$\mathbb{E}[\hat\ell(\hat\theta)] = n\cdot\mathrm{EL}(\theta_0) + \tfrac12\mathbb{E}\!\left[(\hat\theta - \theta_0)^\top \nabla^2\hat\ell(\theta_0)(\hat\theta - \theta_0)\right] + o(1).$$

With $n^{-1}\nabla^2\hat\ell(\theta_0) \to -\mathcal{K}(\theta_0)$ and $\operatorname{Cov}(\sqrt n(\hat\theta - \theta_0)) \to \mathcal{K}^{-1}\mathcal{J}\mathcal{K}^{-1}$:

$$\mathbb{E}[\hat\ell(\hat\theta)] = n\cdot\mathrm{EL}(\theta_0) - \tfrac12\operatorname{tr}(\mathcal{K}^{-1}\mathcal{J}) + o(1).$$

Similarly, from Step 4:

$$\mathbb{E}[\mathrm{EL}(\hat\theta)] = \mathrm{EL}(\theta_0) - \tfrac{1}{2n}\operatorname{tr}(\mathcal{K}^{-1}\mathcal{J}) + o(1/n).$$

Step 6 — Combine. Multiply the $\mathrm{EL}$ expression by $n$ and subtract:

$$\mathbb{E}[\hat\ell(\hat\theta)] - n\cdot\mathbb{E}[\mathrm{EL}(\hat\theta)] = \operatorname{tr}(\mathcal{K}^{-1}\mathcal{J}) + o(1).$$

Step 7 — Specialize. Under correct specification, $\mathcal{J} = \mathcal{K}$, so $\operatorname{tr}(\mathcal{K}^{-1}\mathcal{J}) = \operatorname{tr}(\mathbf{I}_k) = k$. Multiplying by $-2$:

$$-2\mathbb{E}[\hat\ell] + 2k = -2n\cdot\mathbb{E}[\mathrm{EL}(\hat\theta)] + o(1).$$

The left side is $\mathbb{E}[\mathrm{AIC}]$. The right side is $-2n$ times the expected log-predictive — what we wanted to estimate. AIC is asymptotically unbiased for $-2n\cdot\mathbb{E}[\mathrm{EL}(\hat\theta)]$ under correct specification.

Under misspecification ($\mathcal{J} \neq \mathcal{K}$), the correct penalty is $2\operatorname{tr}(\hat{\mathcal{K}}^{-1}\hat{\mathcal{J}})$: Takeuchi's TIC (Rem 6). ∎ — using Topic 14 Thm 6 (MLE asymptotic normality), Topic 14 Thm 3 (Fisher identity), and the multivariate delta method (Topic 6).

Example 4 AIC on the polynomial DGP (T10 pinned)

On POLY_DGP ($n=80$, $\sigma=0.25$, seed 4242), the $\mathrm{AIC}(d)$ values for $d = 0, 1, \ldots, 12$ have their argmin at $d = 6$ with $\mathrm{AIC} = 8.2633$ (T10.3). The $\mathrm{AIC}$ curve is U-shaped: $\mathrm{AIC}(d=0) = 179.0128$ (extreme underfit, T10.1); $\mathrm{AIC}(d=3) = 13.5077$ (still underfitting, T10.2); $\mathrm{AIC}(d=6) = 8.2633$ (argmin, recovering the prediction-risk argmin from §24.1 Ex 1); $\mathrm{AIC}(d=12) = 14.9845$ (overfit, T10.4). Figure 3 overlays AIC and the bias-correction term $+2k$ to separate the empirical log-likelihood from the optimism penalty.

Three-curve plot over polynomial degree d=0..12 on POLY_DGP. Curve A: -2 log-likelihood, monotone-decreasing. Curve B: +2k bias-correction penalty, linear in d. Curve C: AIC = sum of A and B, U-shaped with argmin at d=6 (vertical reference). Title: 'AIC = -2 log-lik + 2k separates fit from complexity'.
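A minimal sketch of the Gaussian AIC/AICc computation, reusing the x, y arrays from the Example 1 sketch; the $k = d + 2$ convention (coefficients plus $\sigma^2$) follows the text, and the pinned values will only be reproduced approximately on a fresh draw:

```python
# Sketch of Thm 1 / Ex 4 / Rem 5: AIC and AICc for degree-d polynomial fits.
import numpy as np

def gaussian_aic(x, y, d):
    n = len(y)
    coef = np.polyfit(x, y, d)
    sigma2_mle = np.mean((y - np.polyval(coef, x)) ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2_mle) + 1)   # maximized Gaussian log-likelihood
    k = d + 2
    aic = -2 * loglik + 2 * k
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)                  # Rem 5 small-sample correction
    return aic, aicc

scores = {d: gaussian_aic(x, y, d) for d in range(13)}
print(min(scores, key=lambda d: scores[d][0]))                  # AIC argmin, expected near d = 6
```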
Remark 5 Corrected AIC for small samples (AICc)

AICc corrects AIC's small-sample bias when $n/k$ is not large (Hurvich & Tsai 1989; Sugiura 1978):

$$\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1}.$$

The correction is exact for Gaussian linear models and an asymptotic correction otherwise. The extra penalty $2k(k+1)/(n-k-1) \to 0$ as $n \to \infty$, so AICc and AIC agree asymptotically. Burnham & Anderson 2002 §2.4 recommend AICc as the default whenever $n/k < 40$.

Remark 6 Takeuchi information criterion (TIC) and information geometry

Takeuchi (1976) generalized AIC to misspecified parametric models:

$$\mathrm{TIC} = -2\hat\ell + 2\operatorname{tr}(\hat{\mathcal{K}}^{-1}\hat{\mathcal{J}}),$$

with $\hat{\mathcal{J}}$ the empirical outer product of the score and $\hat{\mathcal{K}}$ the observed Hessian of $-\ell$. Under correct specification $\hat{\mathcal{K}}^{-1}\hat{\mathcal{J}} \to \mathbf{I}_k$ and TIC $\to$ AIC. TIC sees little practical use because $\hat{\mathcal{J}}$ and $\hat{\mathcal{K}}$ are hard to estimate stably at moderate $n$; the broader information-geometric framing lives at formalml.

Interactive IC selector (POLY_DGP: $\sin(2\pi x) + \mathcal{N}(0, 0.25^2)$, $n = 80$): sliding over polynomial degree shows AIC, AICc, BIC, Mallows' $C_p$, and 10-fold CV curves, each shifted so its minimum is 0 for comparability of shape, with argmin markers showing each criterion's choice.

24.4 Schwarz’s BIC and the Bayesian bridge

Where AIC asks “which model has the best expected out-of-sample predictive accuracy?”, BIC asks the parallel Bayesian question: “which model has the highest posterior probability given the data, under a uniform prior over the model space?”. The answer reduces to a Laplace approximation of the marginal likelihood, and the $-2\hat\ell + k\log n$ form follows.

Definition 5 Bayesian marginal likelihood

For a model $\mathcal{M}$ with parameter $\theta \in \Theta$, prior $\pi(\theta)$, and likelihood $L(\theta;\mathbf{y})$, the marginal likelihood of the data under $\mathcal{M}$ is

$$m(\mathbf{y}) := \int_\Theta \pi(\theta)\, L(\theta;\mathbf{y})\,\mathrm d\theta.$$

The Bayes factor comparing two models $\mathcal{M}_1, \mathcal{M}_2$ is the ratio $m_1(\mathbf{y})/m_2(\mathbf{y})$. Combined with the prior model odds $\pi(\mathcal{M}_1)/\pi(\mathcal{M}_2)$, it gives the posterior model odds.

Theorem 2 BIC as Laplace approximation (Schwarz 1978)

Let $f(\mathbf{y};\theta)$ be a smooth parametric model with $\Theta \subset \mathbb{R}^k$ and prior $\pi(\theta)$ continuous and strictly positive at the MLE $\hat\theta$. Let $\hat\ell = \ell(\hat\theta;\mathbf{y})$ and define $\mathrm{BIC} := -2\hat\ell + k\log n$. Then under standard regularity conditions,

$$-2\log m(\mathbf{y}) = \mathrm{BIC} + O_p(1).$$

The leading-order discrepancy between $-2\log m(\mathbf{y})$ and $\mathrm{BIC}$ is constant in $n$ (depending on the prior and the Fisher information), so model rankings by BIC and by $-2\log m(\mathbf{y})$ agree asymptotically.

Proof 2 Schwarz's BIC as Laplace approximation

Setup. Let $f(\mathbf{y};\theta)$ be a smooth parametric model with $\Theta \subset \mathbb{R}^k$ and prior $\pi(\theta)$ continuous and strictly positive at the MLE $\hat\theta$. The marginal likelihood is

$$m(\mathbf{y}) = \int_\Theta \pi(\theta) L(\theta;\mathbf{y})\,\mathrm d\theta = \int_\Theta \exp\{\ell(\theta;\mathbf{y}) + \log\pi(\theta)\}\,\mathrm d\theta.$$

Step 1 — Laplace approximation. Expand $\ell$ to second order around $\hat\theta$ (where $\nabla\ell(\hat\theta) = 0$):

$$\ell(\theta) = \ell(\hat\theta) - \tfrac12(\theta - \hat\theta)^\top \hat{\mathcal{K}}_n (\theta - \hat\theta) + o(\|\theta - \hat\theta\|^2),$$

with $\hat{\mathcal{K}}_n = -\nabla^2\ell(\hat\theta)$ the observed Fisher information. For iid data, $\hat{\mathcal{K}}_n = n\hat{\mathcal{K}}_1$ (per-observation information, consistent for $\mathcal{K}(\theta_0)$). The quadratic approximation is tight on a shrinking $n^{-1/2}$-neighborhood of $\hat\theta$; the error is $O_p(n^{-1/2})$.

Step 2 — Gaussian integral. Treating $\pi(\theta) \approx \pi(\hat\theta)$ on the shrinking neighborhood (valid by continuity plus posterior concentration):

$$m(\mathbf{y}) \approx \pi(\hat\theta)\exp(\hat\ell)\int\exp\!\left\{-\tfrac12(\theta - \hat\theta)^\top \hat{\mathcal{K}}_n (\theta - \hat\theta)\right\}\,\mathrm d\theta.$$

The Gaussian integral evaluates to $(2\pi)^{k/2}|\hat{\mathcal{K}}_n|^{-1/2}$. Substituting $|\hat{\mathcal{K}}_n| = n^k|\hat{\mathcal{K}}_1|$:

$$m(\mathbf{y}) \approx \pi(\hat\theta)\exp(\hat\ell)(2\pi)^{k/2} n^{-k/2} |\hat{\mathcal{K}}_1|^{-1/2}.$$

Step 3 — Take logs. Applying $-2\log$:

$$-2\log m(\mathbf{y}) \approx -2\hat\ell + k\log n - k\log(2\pi) + \log|\hat{\mathcal{K}}_1| - 2\log\pi(\hat\theta).$$

Step 4 — Drop $O_p(1)$ terms. Under fixed $k$, as $n \to \infty$: $\hat{\mathcal{K}}_1 \to \mathcal{K}(\theta_0)$ (nonrandom); $\pi(\hat\theta) \to \pi(\theta_0)$; both are $O(1)$. Therefore

$$-2\log m(\mathbf{y}) = -2\hat\ell + k\log n + O_p(1) = \mathrm{BIC} + O_p(1).$$

∎ — using Topic 14 Thm 2 (observed/expected information consistency) and the multivariate Laplace approximation (CAS2002 §7.2.3, CLA2008 §3.3). The $O_p(1)$ gap is why BIC is prior-free: the prior's contribution is constant across candidate models and cancels in the ranking.

Theorem 3 BIC selection consistency (stated; CLA2008 §3.2, HAS2009 §7.7)

Suppose the true model $\mathcal{M}_*$ is in the candidate set $\{\mathcal{M}_1, \ldots, \mathcal{M}_M\}$ and standard regularity conditions hold (smooth log-likelihood, identifiable parameter, positive-definite Fisher information at $\theta_0$). Let $\hat{\mathcal{M}}_{\text{BIC}} = \arg\min_k \mathrm{BIC}(\mathcal{M}_k)$ be the BIC-selected model. Then

$$\mathbb{P}\!\left(\hat{\mathcal{M}}_{\text{BIC}} = \mathcal{M}_*\right) \longrightarrow 1 \quad \text{as } n \to \infty.$$

BIC is selection-consistent: it identifies the true model with probability tending to $1$ in the large-$n$ limit, provided the true model is in the candidate set. Proof outline in CLA2008 §3.2; it uses BIC's $O_p(1)$ gap to the log marginal likelihood (Thm 2) plus the asymptotic comparison of nested-model marginal likelihoods.

Example 5 BIC on the polynomial DGP (T10 pinned)

On the canonical POLY_DGP, the BIC values for $d = 0, 1, \ldots, 12$ have their argmin at $d = 3$ with $\mathrm{BIC} = 25.4179$ (T10.6) — strictly below AIC's argmin at $d = 6$. The shift reflects BIC's $\log n$ versus AIC's $2$ per-parameter penalty: at $n = 80$, $\log 80 \approx 4.38$, more than double AIC's penalty, so BIC favors a sparser model. This is a core part of the AIC/BIC tension that §24.6 Thm 5 will formalize. Figure 4 visualizes the Laplace approximation underlying BIC.

Two-panel figure. Left: 1D toy log-posterior (orange curve) overlaid with its quadratic Laplace approximation (dashed) at the MAP, with the Gaussian-integral interpretation of BIC as -2 log m(y) ≈ -2 log f(MAP) + k log n. Right: AIC vs BIC curves on POLY_DGP, BIC penalty (k log n) plotted as a faint background line — argmin BIC at d=3, argmin AIC at d=6. Title: 'BIC = Laplace approximation to -2 log marginal likelihood'.
Remark 7 Bayes factors, Akaike weights, and BMA (forward to §24.10 Rem 23)

$\exp(-\mathrm{BIC}/2)$ is proportional (asymptotically, by Thm 2) to the unnormalized posterior model probability; normalizing across the candidate set gives the BIC weights $w_k^{\text{BIC}} = \exp(-\Delta\mathrm{BIC}_k/2) / \sum_j \exp(-\Delta\mathrm{BIC}_j/2)$. Burnham & Anderson 2002 §2.6 popularized the AIC analog $w_k^{\text{AIC}} = \exp(-\Delta\mathrm{AIC}_k/2) / \sum_j \exp(-\Delta\mathrm{AIC}_j/2)$ as Akaike weights. Both are the discrete-model special case of Bayesian model averaging (BMA), where predictions are weighted by posterior model mass instead of conditioning on a single $\hat{\mathcal{M}}$. Full forward pointer to BMA at §24.10 Rem 23 and Track 7.
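A minimal helper for the weights in this remark; the function name is hypothetical and the example reuses the pinned AIC values from Ex 4:

```python
# Akaike / BIC weights (Rem 7): normalize exp(-Delta_k / 2) across a candidate set.
import numpy as np

def ic_weights(ic_values):
    ic = np.asarray(list(ic_values.values()), dtype=float)
    delta = ic - ic.min()                      # Delta_k relative to the best model
    w = np.exp(-delta / 2)
    return dict(zip(ic_values.keys(), w / w.sum()))

print(ic_weights({0: 179.0128, 3: 13.5077, 6: 8.2633, 12: 14.9845}))
# Nearly all weight concentrates on d = 6, with a small share on d = 3 and d = 12.
```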

Remark 8 Priors on the model space — Track 7 territory

Thm 2's BIC derivation implicitly assumes a uniform prior on the candidate set $\{\mathcal{M}_1, \ldots, \mathcal{M}_M\}$. Bayesian model comparison admits richer priors — reference priors (Berger–Pericchi), spike-and-slab priors (George–McCulloch 1993), intrinsic priors (Berger–Pericchi 1996) — each tilting the ranking toward sparser or denser models. Full development at Track 7 (Bayesian Foundations).

Remark 9 AIC vs BIC: the $n = e^2 \approx 7.39$ crossover

The per-parameter AIC penalty is $2$; the per-parameter BIC penalty is $\log n$. They equalize at $n = e^2 \approx 7.39$. For all practical sample sizes ($n \geq 10$), BIC penalizes complexity more aggressively than AIC; at $n = 80$ (the POLY_DGP), the BIC per-parameter penalty $\log 80 \approx 4.38$ is more than double AIC's, explaining T10.6's argmin shift from $d = 6$ (AIC) to $d = 3$ (BIC).

Remark 10 Computing the exact marginal likelihood is hard

The Laplace approximation underlying Thm 2 has $O_p(1)$ error — fine for ranking (the $\arg\min$ is stable under monotone transforms) but not for reporting a numerical posterior probability. Exact marginal-likelihood computation requires nested sampling (Skilling 2006), thermodynamic integration, or bridge sampling (Meng & Wong 1996); each is a Track 7 topic on its own. BIC's appeal is that the asymptotic approximation is prior-free and computationally trivial — only $\hat\ell$ and $k$ are needed.

24.5 Stone’s CV ≡ AIC equivalence

Stone (1977) proved that leave-one-out cross-validation and AIC select the same model asymptotically under Gaussian homoscedastic errors. The result is a tight identification: $n\log\mathrm{LOO\text{-}CV}$ is not just similar to AIC but equals it up to a model-invariant additive constant and a vanishing remainder — the two frequentist procedures collapse into one.

Definition 6 Leave-one-out cross-validation

For data $(\mathbf{X}, \mathbf{y})$ with $n$ rows, let $\hat\theta^{(-i)}$ be the estimator fit on the $n - 1$ rows excluding observation $i$, and let $\hat y_i^{(-i)}$ be its prediction at $\mathbf{x}_i$. The leave-one-out cross-validation estimate of mean squared prediction error is

$$\mathrm{LOO\text{-}CV} := \frac{1}{n}\sum_{i=1}^n \left(y_i - \hat y_i^{(-i)}\right)^2.$$

For Gaussian OLS with hat-matrix diagonal $h_{ii}$, the hat-matrix shortcut (the PRESS statistic; Allen 1974) avoids the $n$ refits:

$$y_i - \hat y_i^{(-i)} = \frac{y_i - \hat y_i}{1 - h_{ii}}.$$

The shortcut requires $h_{ii} < 1$ for all $i$; looCV in regression.ts throws when $\max_i h_{ii} \geq 0.999$.
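A minimal sketch of the PRESS shortcut (an assumed helper, not the course's regression.ts implementation), reusing the x, y arrays from the Example 1 sketch:

```python
# Def 6's hat-matrix shortcut: LOO-CV MSE for OLS from a single fit.
import numpy as np

def loo_cv_press(X, y):
    H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
    h = np.diag(H)
    if h.max() >= 0.999:                       # mirrors the guard described in Def 6
        raise ValueError("leverage too close to 1; shortcut unreliable")
    resid = y - H @ y                          # ordinary residuals
    return np.mean((resid / (1 - h)) ** 2)     # PRESS / n

# Usage on the degree-6 polynomial design (columns 1, x, ..., x^6):
X6 = np.vander(x, 7, increasing=True)
print(loo_cv_press(X6, y))                     # ~0.064 on the canonical POLY_DGP
```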

Theorem 4 Stone's CV ≡ AIC equivalence (Stone 1977)

Consider the Gaussian linear model $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon$ with $\boldsymbol\varepsilon \sim \mathcal{N}_n(\mathbf{0}, \sigma^2\mathbf{I})$ and a fixed full-rank design $\mathbf{X}$. Under regularity (balanced design, fixed $p$, $\max_i h_{ii} = O(\log n / n)$),

$$n\log\mathrm{LOO\text{-}CV} = \mathrm{AIC}^* + O_p(n^{-1}),$$

where $\mathrm{AIC}^* := n\log(\mathrm{RSS}/n) + 2k$ is AIC up to model-invariant additive constants. Consequently,

$$\arg\min_{\mathcal{M}} \mathrm{LOO\text{-}CV}(\mathcal{M}) = \arg\min_{\mathcal{M}} \mathrm{AIC}(\mathcal{M}) + o_p(1).$$

LOO-CV and AIC asymptotically select the same model from any nested family under the Gaussian linear model.

Proof 3 Stone's cross-validation–AIC equivalence

Setup. Consider the Gaussian linear model $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon$ with $\boldsymbol\varepsilon \sim \mathcal{N}_n(\mathbf{0}, \sigma^2\mathbf{I})$ and a fixed full-rank design $\mathbf{X}$ ($p+1$ columns). Let $\hat{\boldsymbol\beta}$ be the OLS estimator, $\hat y_i = (\mathbf{X}\hat{\boldsymbol\beta})_i$, and $h_{ii}$ the diagonal of the hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$. Write $k = p + 2$ (coefficients plus $\sigma^2$).

Step 1 — Hat-matrix shortcut. The leave-one-out fit satisfies

$$y_i - \hat y_i^{(-i)} = \frac{y_i - \hat y_i}{1 - h_{ii}}.$$

(A standard OLS identity via Sherman–Morrison; HAS2009 §7.10 gives the full five-line derivation.) Squaring and summing:

$$\mathrm{LOO\text{-}CV} = \frac{1}{n}\sum_{i=1}^n\frac{(y_i - \hat y_i)^2}{(1 - h_{ii})^2}.$$

Step 2 — Uniform leverage. Under regularity (balanced design, fixed $p$), $\sum_i h_{ii} = \operatorname{tr}(\mathbf{H}) = p+1$ and $\max_i h_{ii} = O(\log n / n)$. So

$$\frac{1}{(1 - h_{ii})^2} = 1 + 2h_{ii} + 3h_{ii}^2 + \cdots = 1 + \frac{2(p+1)}{n} + O(n^{-2})$$

uniformly in $i$.

Step 3 — Substitute.

$$\mathrm{LOO\text{-}CV} = \left(1 + \frac{2(p+1)}{n} + O(n^{-2})\right) \hat\sigma^2_{\text{MLE}} = \hat\sigma^2_{\text{MLE}}\left(1 + \frac{2(p+1)}{n}\right) + O(n^{-2}),$$

with $\hat\sigma^2_{\text{MLE}} = \mathrm{RSS}/n$.

Step 4 — AIC on the same model. Dropping the model-invariant constant $n\log(2\pi) + n$:

$$\mathrm{AIC}^* = n\log\hat\sigma^2_{\text{MLE}} + 2(p+2) = n\log\hat\sigma^2_{\text{MLE}} + 2k.$$

Using $\log\bigl(\hat\sigma^2(1 + 2(p+1)/n)\bigr) = \log\hat\sigma^2 + 2(p+1)/n + O(n^{-2})$ on the Step 3 expression:

$$n\log\mathrm{LOO\text{-}CV} = n\log\hat\sigma^2_{\text{MLE}} + 2(p+1) + O(n^{-1}) = \mathrm{AIC}^* - 2 + O(n^{-1}),$$

using $k = p+2$, so $2k - 2 = 2(p+1)$.

Step 5 — Equivalence. The $-2$ is model-invariant (it depends only on whether $\sigma^2$ is counted in $k$, a family-wide convention). Therefore

$$\arg\min_{\mathcal{M}} n\log\mathrm{LOO\text{-}CV}(\mathcal{M}) = \arg\min_{\mathcal{M}} \mathrm{AIC}^*(\mathcal{M}) + o_p(1).$$

LOO-CV and AIC select the same model asymptotically. ∎ — using Topic 21 §21.7 hat-matrix structure and $\log(1+x) = x - x^2/2 + O(x^3)$.

Example 6 Stone equivalence empirically (T10.26 + T10.27)

On POLY_DGP, both $\arg\min_d \mathrm{LOO\text{-}CV} = 6$ and $\arg\min_d \mathrm{AIC} = 6$ over $d \in [0, 11]$ (T10.26 — $d=12$ is excluded because the monomial Vandermonde is too ill-conditioned for the hat-matrix shortcut; the argmin lies safely inside the range). At the joint argmin $d = 6$, $\mathrm{LOO\text{-}CV} = 0.063835$ (T10.10) and $\mathrm{AIC}^* = -218.6963$ (computed from the full Gaussian AIC by stripping the $n(\log 2\pi + 1)$ constant); the gap $|n\log\mathrm{LOO\text{-}CV} - \mathrm{AIC}^*| < 2.5$ (T10.27) is the order-one constant Proof 3 Step 4 predicts. Figure 5 overlays the LOO-CV curve, the 5-fold and 10-fold CV curves, and the AIC curve over $d$.

Multi-curve plot over polynomial degree d=0..12 on POLY_DGP. Curves: LOO-CV, 5-fold CV, 10-fold CV, AIC (rescaled to match LOO-CV's vertical scale). All four curves U-shaped with argmin at d=6 (vertical reference). Legend bottom-left. Title: 'Stone 1977: LOO-CV and AIC select the same model asymptotically'.
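A short empirical check of the Stone gap, assuming the x, y arrays and the loo_cv_press helper from the earlier sketches (exact numbers depend on the draw; the order-one behavior is the point):

```python
# Thm 4 / Ex 6 check: n*log(LOO-CV) should track AIC* within an O(1) constant.
import numpy as np

n = len(y)
for d in range(0, 12):                          # d = 12 excluded (ill-conditioned Vandermonde)
    X = np.vander(x, d + 1, increasing=True)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    aic_star = n * np.log(rss / n) + 2 * (d + 2)
    gap = n * np.log(loo_cv_press(X, y)) - aic_star
    print(f"d={d:2d}  gap={gap:+.2f}")           # stays order-1 across degrees
```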
Remark 11 Nested cross-validation (discharges Topic 23 §23.8 Rem 19)

The single-loop CV that Topic 23 §23.8 used to select $\lambda$ should not also serve as the test-error estimator: reporting $\min_\lambda \mathrm{CV}(\lambda)$ as test error leaks tuning information into the test estimate (selection bias). Nested cross-validation fixes this: an outer CV loop holds out folds for honest test-error estimation, and within each outer fold an inner CV loop selects $\lambda$. Bates, Hastie & Tibshirani (2024) give the recent rigorous treatment. The result is an asymptotically unbiased generalization-error estimate at the cost of $K_\text{outer} \cdot K_\text{inner}$ refits.

Remark 12 $k$-fold CV: bias-variance of $k = 5$ vs $k = 10$

$k$-fold CV is the finite-sample analog of LOO: each fold holds out $n/k$ observations. As $k \to n$, the procedure converges to LOO. Smaller $k$ has lower computational cost but higher bias (a larger held-out fold per refit means more train-set shrinkage). Hastie et al. 2009 §7.10 and Claeskens & Hjort 2008 §4.3 recommend $k = 10$ as the modern default — slightly biased upward relative to LOO but with lower Monte Carlo variance.

Remark 13 Stone's result is AIC-specific

Thm 4 identifies LOO-CV with AIC; no analogous result exists for BIC at fixed $k$. The mismatch is asymptotic: BIC's $k\log n$ penalty grows with $n$, while CV's effective penalty grows like $2k$. To recover a BIC-like criterion from CV, one has to hold out almost everything — keeping only a training set of vanishing fraction ($n_\text{train} = o(n)$) — not standard $k$-fold (Shao 1993).

Interactive CV-vs-IC comparator (Stone's equivalence on POLY_DGP): each criterion's curve over polynomial degree $d$ is shifted to its own minimum so LOO-CV, 5-fold CV, 10-fold CV, AIC, and BIC sit on a common visual scale. Stone's 1977 theorem predicts that the LOO-CV and AIC argmins coincide asymptotically; agreement between the 5-fold and 10-fold argmins signals a fold-stable CV estimate, while disagreement signals high CV variance.

24.6 Yang’s incompatibility theorem

The three procedures of §§24.3–24.5 carry different asymptotic guarantees: BIC is selection-consistent (Thm 3), while AIC and CV are minimax-rate-optimal for prediction (under regularity). Yang (2005) proved these properties are not just different — they are formally incompatible: no single procedure can deliver both. The asymptotic philosophy must choose.

Theorem 5 Yang's incompatibility (Yang 2005, stated)

Let $\hat{\mathcal{M}}$ be a model-selection procedure operating on a candidate family $\{\mathcal{M}_1, \ldots, \mathcal{M}_M\}$. Suppose $\hat{\mathcal{M}}$ is selection-consistent in the well-specified regime: when $\mathcal{M}_*$ is in the candidate set, $\mathbb{P}(\hat{\mathcal{M}} = \mathcal{M}_*) \to 1$. Then $\hat{\mathcal{M}}$ is not minimax-rate-optimal in the misspecified, nonparametric regime: there exists a family of underlying truths and a sample-size sequence on which the prediction risk of $\hat{\mathcal{M}}$ achieves a strictly worse rate than the minimax-optimal procedures (e.g., AIC or CV).

Proof: Yang 2005 §2–3, via a minimax lower-bound argument outside the scope of Topic 24.

Example 7 Two-regime simulation: BIC consistent, AIC efficient

The Yang race compares procedures across two DGP regimes:

  • Tab A — well-specified. The truth is a degree-3 polynomial; the candidate set $d \in [0, 8]$ contains the truth. As $n$ grows, BIC's selection frequency at $d = 3$ tends to $1$ (consistency); AIC's selection frequency at $d \geq 4$ persists at $\sim 25\%$ (asymptotic over-fit). Prediction risks are similar at any finite $n$.
  • Tab B — misspecified. The truth is $\sin(2\pi x)$; the candidate set is polynomials $d \in [0, 12]$ (truth not in the candidate set). AIC's minimax-optimal selection adapts $d$ to $n$; BIC's $\log n$ penalty over-shrinks toward sparsity, giving worse prediction risk for the truth's smoothness class. The risks diverge in $n$.

Figure 6 plots both regimes side by side.

Two-panel race figure. Left (Tab A, truth = polynomial-d3): selection frequency of BIC at d=3 climbs from ~80% at n=50 to ~99% at n=5000, AIC plateaus at ~75% (with persistent tail at d≥4). Right (Tab B, truth = sin, polynomial-misspecified): prediction risk vs n for BIC, AIC, and 10-fold CV — BIC plateaus higher than AIC and CV, the gap widening with n. Title: 'Selection consistency and prediction efficiency are incompatible (Yang 2005)'.
Remark 14 Philosophical implication: a value judgment, not an empirical question

Thm 5 forces a choice the practitioner cannot duck: either prioritize identifying the right model (BIC) or prioritize accurate predictions (AIC, CV). The choice is not an empirical question to be settled by simulation but a value judgment about the use case — interpretive parsimony vs predictive accuracy. Shao 1997 and Burnham & Anderson 2002 §2.10 give complementary discussions of how to pick a side in a given application context.

Remark 15 Shao 1993's leave-$n_v$-out CV — a limited positive result

Shao (1993) showed that leaving out $n_v = n - n^{1/2}$ observations per split — so the training fraction, not the held-out fraction, shrinks to zero asymptotically — gives a CV variant that is selection-consistent on Gaussian linear models with bounded $p$. The result is a limited positive: Shao's CV variant is not minimax-rate-optimal for prediction in the nonparametric regime, so it lives on the BIC side of Yang's split rather than circumventing the incompatibility.

Remark 16 Practical recommendation: report both

Standard practice: report both AIC and BIC. Agreement is reassuring; disagreement is informative — it tells the reader which side of the consistency-vs-efficiency tradeoff matters for the application. For prediction-focused use cases, prefer the AIC argmin and report 10-fold CV as a sanity check; for inference-focused applications where parsimony matters (e.g. model identification in epidemiology), prefer the BIC argmin.

Interactive consistency-vs-efficiency race (Thm 5): the truth is a degree-3 polynomial contained in the candidate set $d \in [0, 8]$; over a sample-size sweep $n = 50/100/200/500/1000$ with 25 Monte Carlo replicates per size, BIC's selection frequency at the correct $d$ climbs toward 1 while AIC and CV keep selecting $d \geq 4$ a persistent fraction of the time (asymptotic over-fit).

24.7 Nested-model selection: AIC ≡ LRT with default threshold

For nested models — the smaller obtained by setting $q$ parameters of the larger to zero — AIC's preference for the full model is equivalent to a likelihood-ratio test with a fixed default threshold of $2q$, irrespective of $\alpha$. This identifies AIC as a default-threshold LRT and connects Topic 18's Wilks machinery to the model-selection vocabulary of Topic 24.

Theorem 6 AIC ≡ LRT with default threshold (embedded derivation)

Let $\mathcal{M}_{\text{full}}$ and $\mathcal{M}_{\text{red}}$ be nested with $k_{\text{full}} = k_{\text{red}} + q$ free parameters ($q \geq 1$). Let $\hat\ell_{\text{full}}, \hat\ell_{\text{red}}$ be their maximized log-likelihoods on the same data. Then

$$\Delta\mathrm{AIC} := \mathrm{AIC}(\mathcal{M}_{\text{full}}) - \mathrm{AIC}(\mathcal{M}_{\text{red}}) = (-2\hat\ell_{\text{full}} + 2k_{\text{full}}) - (-2\hat\ell_{\text{red}} + 2k_{\text{red}}) = 2q - 2\Delta\hat\ell,$$

with $\Delta\hat\ell = \hat\ell_{\text{full}} - \hat\ell_{\text{red}}$. AIC prefers the full model iff $\Delta\mathrm{AIC} < 0$, i.e. iff $2\Delta\hat\ell > 2q$ — exactly the LRT rejection rule with threshold $2q$ in place of the chi-square critical value $\chi^2_{q,1-\alpha}$. BIC uses the analogous threshold $q\log n$ in place of $2q$: BIC prefers the full model iff $2\Delta\hat\ell > q\log n$.

Example 8 Nested Poisson GLM (T10.12–T10.19 pinned)

On the nested-Poisson DGP ($n = 200$, $\eta_{\text{true}} = 1.0 + 0.8 x_1 - 0.5 x_2$, $x_3$ has no true effect, default_rng(123)), fit the reduced model with predictors $\{1, x_1, x_2\}$ and the full model with $\{1, x_1, x_2, x_3\}$. The pinned values: $\hat\beta_{\text{red}} \approx (0.9815, 0.8216, -0.5769)$ (T10.12–T10.14); $\hat\beta_{\text{full}}[3] \approx -0.0535$ (T10.15, near zero as expected). The likelihood ratio is

$$\mathrm{LR} = 2\Delta\hat\ell \approx 0.5547 \quad (\text{T10.16}).$$

Since $q = 1$ and the LRT critical value at $\alpha = 0.05$ is $\chi^2_{1, 0.95} \approx 3.84$, the LRT does NOT reject (p-value $\approx 0.46$).

Now apply Thm 6:

$$\Delta\mathrm{AIC} = 2(1) - \mathrm{LR} = 2 - 0.5547 \approx 1.4453 > 0 \quad (\text{T10.17}),$$

$$\Delta\mathrm{BIC} = (1)\log(200) - \mathrm{LR} \approx 5.298 - 0.5547 \approx 4.7436 > 0 \quad (\text{T10.18}).$$

Both AIC and BIC prefer the reduced model. T10.19 verifies the algebraic identity $\Delta\mathrm{AIC} = 2 - \mathrm{LR}$ to within $10^{-10}$. Figure 7 plots the chi-square null density with the observed LR and both AIC/BIC thresholds for visual comparison.

Two-panel figure. Top: scatter of fitted μ_i vs observed y_i for the reduced Poisson model on n=200, with y=μ̂ reference line. Bottom: chi-square_1 null density with observed LR=0.5547 marked (in the bulk, p≈0.46 — not rejected); vertical reference lines at AIC threshold (LR=2) and BIC threshold (LR=log 200 ≈ 5.30). Title: 'Nested LRT chi^2_1 null vs observed LR; AIC/BIC thresholds — both prefer reduced'.
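A minimal statsmodels sketch of the nested comparison; the covariate distribution is an assumption (only the linear predictor and seed are stated in the text), so the pinned T10 values will not reproduce exactly:

```python
# Sketch of Ex 8: nested Poisson GLMs, LR statistic, and the Thm 6 identity.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(123)
n = 200
X = rng.normal(size=(n, 3))                    # x1, x2, x3 (x3 has no true effect) - assumed N(0,1)
eta = 1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1]
y = rng.poisson(np.exp(eta))

fit_red = sm.GLM(y, sm.add_constant(X[:, :2]), family=sm.families.Poisson()).fit()
fit_full = sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson()).fit()

lr = 2 * (fit_full.llf - fit_red.llf)          # likelihood-ratio statistic
delta_aic = fit_full.aic - fit_red.aic         # equals 2*q - LR by Thm 6 (q = 1 here)
delta_bic = 1 * np.log(n) - lr                 # BIC analog: q*log(n) - LR
print(f"LR={lr:.4f}  dAIC={delta_aic:.4f}  dBIC={delta_bic:.4f}")
```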
Remark 17 AIC's effective $\alpha$ depends on $q$

Thm 6's identification of AIC with a fixed-threshold LRT means AIC has an effective $\alpha$ that depends on $q$. For $q = 1$, $\mathbb{P}(\chi^2_1 \geq 2) \approx 0.157$; for $q = 5$, $\mathbb{P}(\chi^2_5 \geq 10) \approx 0.075$; for $q = 10$, $\mathbb{P}(\chi^2_{10} \geq 20) \approx 0.029$. AIC is more liberal (admits more parameters) at small $q$ and more conservative at large $q$ — the opposite of the LRT-with-fixed-$\alpha$ rule, which has fixed type-I error regardless of $q$.

Remark 18 Nested-only caveat — non-nested requires direct IC comparison

Thm 6 applies only to nested comparisons. For non-nested candidates (e.g. a degree-5 polynomial vs a degree-3 spline with the same effective dimension), the likelihood-ratio statistic has no $\chi^2_q$ null distribution and the LRT framework breaks down. The IC framework still applies: compute $\mathrm{AIC}$ or $\mathrm{BIC}$ for each candidate and rank by the smaller value; no chi-square reference is needed.

24.8 Effective degrees of freedom for penalized estimators

Topic 23's penalized estimators don't have an integer parameter count: ridge shrinks every coefficient by an amount that depends on $\lambda$ and the design's singular values; lasso zeros out a data-dependent subset. Effective degrees of freedom generalizes the integer $k$ to a continuous notion of “how many parameters' worth of freedom did the fit actually use”, letting AIC/BIC/$C_p$ apply to ridge and lasso paths. This section discharges Topic 23 §23.8 Rem 20.

Definition 7 Effective degrees of freedom (Efron 2004)

For a fitting procedure $\hat{\mathbf{f}}$ that maps data $\mathbf{y}$ to fitted values $\hat{\mathbf{f}}(\mathbf{y})$ on the same $n$ rows, the effective degrees of freedom is

$$\mathrm{df}(\hat{\mathbf{f}}) := \frac{1}{\sigma^2}\sum_{i=1}^n \operatorname{Cov}\!\left(\hat f_i, y_i\right).$$

For OLS, $\mathrm{df} = p + 1$ (intercept plus $p$ slopes — exactly the parameter count). For penalized estimators, $\mathrm{df}$ can be non-integer and depends continuously on the regularization parameter.

Theorem 7 Ridge effective DOF via SVD (embedded derivation)

For ridge regression on a centered design $\tilde{\mathbf{X}}$ with SVD $\tilde{\mathbf{X}} = \mathbf{U}\mathbf{D}\mathbf{V}^\top$ (singular values $d_1, \ldots, d_p > 0$), the smoother matrix is

$$\mathbf{H}_\lambda = \tilde{\mathbf{X}}(\tilde{\mathbf{X}}^\top\tilde{\mathbf{X}} + \lambda\mathbf{I})^{-1}\tilde{\mathbf{X}}^\top.$$

Substituting the SVD and using $\tilde{\mathbf{X}}^\top\tilde{\mathbf{X}} = \mathbf{V}\mathbf{D}^2\mathbf{V}^\top$ plus the rotational invariance of the trace:

$$\mathrm{df}(\lambda) = \operatorname{tr}(\mathbf{H}_\lambda) = \sum_{j=1}^p \frac{d_j^2}{d_j^2 + \lambda}.$$

The DOF is monotone decreasing in $\lambda$: $\mathrm{df}(0) = p$ (unpenalized OLS) and $\mathrm{df}(\lambda) \to 0$ as $\lambda \to \infty$. Equivalently, and computationally cheaper, $\mathrm{df}(\lambda) = p - \lambda \cdot \operatorname{tr}\bigl((\tilde{\mathbf{X}}^\top\tilde{\mathbf{X}} + \lambda\mathbf{I})^{-1}\bigr)$ via Cholesky inversion (the form regression.ts's hatMatrixTrace uses).
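A minimal sketch of the SVD formula, reusing the x array from the Example 1 sketch; whether the intercept is added on top of the penalized columns (as the Ex 9 table's $\mathrm{df}(0) = 11$ suggests) is an assumption here, and the pinned values depend on the exact x draw:

```python
# Thm 7: ridge effective DOF via the singular values of the centered design.
import numpy as np

X = np.vander(x, 11, increasing=True)[:, 1:]    # columns x, x^2, ..., x^10 (no intercept)
Xc = X - X.mean(axis=0)                         # centre so the intercept is unpenalized
d_sv = np.linalg.svd(Xc, compute_uv=False)      # singular values d_1..d_p

def ridge_df(lam):
    return 1 + np.sum(d_sv**2 / (d_sv**2 + lam))  # +1 for the unpenalized intercept (assumed)

for lam in [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]:
    print(f"lambda={lam:7.2f}  df={ridge_df(lam):.4f}")
# df(0) = 11 exactly; df drops sharply as lambda grows, as in the Ex 9 table.
```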

Theorem 8 Lasso effective DOF (stated; Tibshirani–Taylor 2012)

For lasso regression on a centered, orthonormal design (or under the more general restricted-eigenvalue conditions of WAI2019 Ch. 7), the active set $\mathcal{A}(\hat{\boldsymbol\beta})$ — the indices $j$ with $\hat\beta_j \neq 0$ — satisfies

$$\mathbb{E}\!\left[\mathrm{df}(\hat{\boldsymbol\beta}_\lambda)\right] = \mathbb{E}\!\left[|\mathcal{A}(\hat{\boldsymbol\beta}_\lambda)|\right].$$

The active-set size is an unbiased estimator of the expected effective DOF. Caveats: the result fails at “knots” of the lasso path where the active set changes discontinuously, and ties in the optimization can make $|\mathcal{A}|$ data-dependent in a non-smooth way. Tibshirani & Taylor 2012 give the precise regularity statement.

Example 9 Ridge effective DOF on poly $d=10$ design (T10.20–T10.25 pinned)

For the polynomial $d = 10$ design on $x \sim \mathcal{U}(0,1)$ at $n = 80$ (the canonical POLY_DGP $x$-values), the ridge effective DOF as $\lambda$ varies:

λ        df(λ) = tr(H_λ)    Test
0.0      11.000000          T10.20
0.01     4.559080           T10.21
0.1      3.775848           T10.22
1.0      2.927366           T10.23
10.0     1.961883           T10.24
100.0    0.851771           T10.25

At $\lambda = 0$, $\mathrm{df} = 11$ exactly (intercept plus 10 polynomial coefficients). Even modest regularization ($\lambda = 0.01$) drops the effective DOF by more than half — the high-order polynomial columns have small singular values and shrink rapidly under the ridge penalty.

Example 10 AIC/BIC overlay on the prostate-cancer lasso path (discharges R2)

Refit the prostate-cancer lasso path of Topic 23 §23.9 Ex 14 ($n = 97$, $p = 8$, response lpsa) on a 100-point log-grid of $\lambda$. For each $\lambda$, compute $\mathrm{df}(\lambda) = |\mathcal{A}(\hat{\boldsymbol\beta}_\lambda)|$ (Thm 8) and apply

$$\mathrm{AIC}(\lambda) = -2\hat\ell(\lambda) + 2\,(\mathrm{df}(\lambda) + 1), \qquad \mathrm{BIC}(\lambda) = -2\hat\ell(\lambda) + (\mathrm{df}(\lambda) + 1)\log n,$$

with the $+1$ counting $\sigma^2$ per the parameter convention. Overlay the AIC, BIC, and 10-fold CV curves on the lasso path. The AIC argmin coincides with CV's $\hat\lambda_{\min}$ in the high-signal regime; BIC favors a sparser model with a larger $\hat\lambda_{\text{BIC}} > \hat\lambda_{\min}$, recovering Yang's (Thm 5) consistency-vs-efficiency split on a worked example. This example is the practical fulfillment of Topic 23 §23.8 Rem 20.

Two-panel figure. Top: prostate-cancer lasso coefficient path over log λ — eight coefficient lines (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45), lcavol and lweight strongest. Bottom: AIC, BIC, and 10-fold CV curves over the same log-λ axis. Vertical reference lines at AIC argmin, BIC argmin (sparser, more right), and CV λ_min. Legend at bottom-right. Title: 'AIC and BIC overlay on the prostate-cancer lasso path'.
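A minimal sketch of the overlay computation with sklearn's lasso_path; loading the prostate data into X, y is assumed (it is not bundled with sklearn), and the DOF and $+1$ conventions follow Thm 8 / Ex 10:

```python
# Sketch of Ex 10: AIC(lambda) and BIC(lambda) along a lasso path via active-set DOF.
import numpy as np
from sklearn.linear_model import lasso_path

def ic_along_path(X, y, n_lambdas=100):
    n = len(y)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()                                      # centre; intercept handled separately
    lambdas, coefs, _ = lasso_path(Xc, yc, n_alphas=n_lambdas)
    out = []
    for j, lam in enumerate(lambdas):
        beta = coefs[:, j]
        sigma2 = np.mean((yc - Xc @ beta) ** 2)
        loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
        df = np.count_nonzero(beta)                        # active-set size (Thm 8)
        aic = -2 * loglik + 2 * (df + 1)                   # +1 counts sigma^2, as in Ex 10
        bic = -2 * loglik + (df + 1) * np.log(n)
        out.append((lam, df, aic, bic))
    return out                                             # overlay against 10-fold CV(lambda)
```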
Remark 19 Lasso DOF caveats: ties, knots, non-smoothness

Thm 8's $\mathbb{E}[\mathrm{df}] = \mathbb{E}[|\mathcal{A}|]$ assumes the active set is well defined — fine away from path knots, but at $\lambda$ values where a coefficient enters or leaves the active set, the cardinality is data-dependent in a non-smooth way. Practical AIC/BIC computations on lasso paths should evaluate at $\lambda$ values away from knots, or use a smoothed DOF estimator (e.g., the debiased-lasso DOF; Wainwright 2019 Ch. 11).

Remark 20 AIC for ridge — Li 1986 asymptotic equivalence to CV

Li (1986) proved an analog of Stone's Thm 4 for ridge: AIC computed with the effective DOF $\operatorname{tr}(\mathbf{H}_\lambda)$ asymptotically agrees with LOO-CV for $\lambda$ selection in the Gaussian linear model. The practical implication is that AIC-ridge is a valid (and computationally cheap) alternative to CV-ridge: one fit per $\lambda$, no $n$-fold refits.

24.9 Worked examples

Three end-to-end workflows pull the §§24.1–24.8 machinery into runnable applied form: a polynomial-degree comparison on simulated data (the Topic-24 canonical example), a nested-Poisson GLM (the §24.7 LRT-as-IC special case), and a lasso-path AIC/BIC overlay on the prostate-cancer dataset (§24.8 worked example).

Example 11 Polynomial sin(2πx): full ranking table

On POLY_DGP, the criterion-by-degree ranking over $d \in [0, 12]$, abridged here to four representative degrees:

d     AIC        AICc       BIC        Cp        LOO-CV      10-fold CV
0     179.0128   —          183.7768   ~859      ~0.522      —
3     13.5077    14.3185    25.4179*   28.3534   0.067936    ~0.069
6     8.2633*    10.2915*   27.3196    21.4569*  0.063835*   ~0.066
12    14.9845    ~21.45     48.3328    ~26.4     n/a         n/a

(Starred entries are the per-criterion argmins; the ~ entries are notebook-computed approximations not pinned in T10; — entries are not reported.) The argmin pattern is the canonical Yang signature: AIC, AICc, $C_p$, LOO-CV, and 10-fold CV all select $d = 6$; BIC alone selects the sparser $d = 3$. The pinned values come from T10.1–T10.11 (regression.test.ts).

Example 12 Nested Poisson GLM walkthrough (T10.12–T10.19)

The nested-Poisson example of §24.7 Ex 8 is the canonical use case for the AIC ≡ LRT identification. Both AIC and BIC prefer the reduced model ($\Delta\mathrm{AIC} = 1.4453$, $\Delta\mathrm{BIC} = 4.7436$, both positive), and the LRT does not reject ($\mathrm{LR} = 0.5547$, $p \approx 0.46$). Reporting practice: state all three numbers — LR with its p-value, $\Delta\mathrm{AIC}$, and $\Delta\mathrm{BIC}$ — alongside the candidate-set description, so the reader can apply their own selection criterion.

Example 13 Prostate lasso-path with CV vs AIC vs BIC argmins

Combining §24.8 Ex 10 with Topic 23 §23.9 Ex 14: on the 100-point log-grid lasso path of the prostate-cancer dataset ($n = 97$, $p = 8$), three argmin $\lambda$ values emerge:

  • $\hat\lambda_{\text{AIC}}$ — typically aligns with $\hat\lambda_{\min}$ from 10-fold CV (Stone–Li equivalence; §24.8 Rem 20).
  • $\hat\lambda_{\text{1SE}}$ — Topic 23's one-SE-rule choice; sparser than $\hat\lambda_{\min}$.
  • $\hat\lambda_{\text{BIC}}$ — sparser still; lies between $\hat\lambda_{\min}$ and the empty-active-set $\lambda_{\max}$.

The corresponding active-set sizes order as $|\mathcal{A}(\hat\lambda_{\text{AIC}})| \geq |\mathcal{A}(\hat\lambda_{\text{1SE}})| \geq |\mathcal{A}(\hat\lambda_{\text{BIC}})|$, recovering the consistency-vs-efficiency tradeoff (§24.6 Thm 5) on a real dataset.

Remark 21 Production tooling

Standard implementations: R — stats::AIC, stats::BIC, MASS::stepAIC; Python — statsmodels GenericLikelihoodModel.aic / .bic, sklearn.model_selection.cross_val_score; Julia — StatsBase.aic, StatsBase.bic, MLBase.cross_validate. All are wrappers around $-2\hat\ell + \text{penalty}$ with the parameter-count convention from the underlying fit object.

Remark 22 Reporting standards: criterion + candidate set must both be named

A reported “AIC $= X$” without the candidate set is uninterpretable — AIC is meaningful only relative to the comparison family. Standard practice: state the candidate family, the criterion, the argmin index, and the full ranking (or at least the $\Delta$ values), so the reader can audit the selection. BUR2002 §2.6 and CLA2008 §1.2 give detailed reporting templates.

24.10 Forward map

Topic 24 closes Track 6’s classical-regression toolkit and opens onto eight forward-pointing developments — each gets a one-paragraph remark below. The arc moves Bayesian (Track 7), then sparsity-aware (Track 8), then ML-native (formalml).

Remark 23 Bayesian model averaging — full pointer (forward from §24.4 Rem 7)

Bayesian model averaging (BMA) averages predictions over the candidate family, weighted by posterior model probabilities. Using Thm 2's BIC approximation, $w_k^{\text{BIC}} \propto \exp(-\Delta\mathrm{BIC}_k/2)$ are the BMA weights for predictive averaging:

$$p(y_{\text{new}} \mid \mathbf{y}) = \sum_{k=1}^M p(y_{\text{new}} \mid \mathcal{M}_k, \mathbf{y}) \cdot w_k^{\text{BMA}}.$$

Hoeting et al. (1999) is the canonical methodology survey; Track 7 develops the full Bayesian framework (priors, MCMC for $p(y_{\text{new}} \mid \mathcal{M}_k, \mathbf{y})$, posterior model probabilities); formalml's Bayesian Model Averaging topic covers ML-scale BMA over deep-learning architectures and ensemble approaches.

Remark 24 Post-selection inference — full pointer

A confidence interval reported after a model-selection step has no honest coverage guarantee under the standard frequentist framework — the selection event is data-dependent, so $\mathbb{P}(\theta \in \hat{\mathrm{CI}})$ is not the nominal $1 - \alpha$. Post-selection inference restores validity through several routes: PoSI (Berk–Brown–Buja 2013, simultaneous over all submodels), selective conditioning (Lee–Sun–Sun–Taylor 2016, conditioning on the selection event), the debiased lasso (Zhang–Zhang 2014; Javanmard–Montanari 2014, a one-step Newton correction), and cross-fitting / double ML (Chernozhukov et al. 2018, sample splitting for valid causal inference after ML-selected nuisance models). All four directions live at formalml's Post-Selection Inference and Cross-Fitting topics.

Remark 25 Stepwise / forward-backward selection — dismissed with citations

Stepwise selection (forward, backward, or bidirectional, by AIC or by p-value threshold) is widely available in legacy tooling but is no longer recommended methodology. Harrell (2015) §4.3 and Heinze–Wallisch–Dunkler (2018) document the failure modes: biased coefficient estimates, invalid confidence intervals, and unstable selection across resamples. Modern best practice replaces stepwise with the lasso (Topic 23) for the selection step, optionally followed by debiasing (Rem 24) for inference. We omit stepwise from Topic 24's main exposition for this reason.

Remark 26 DIC / WAIC / PSIS-LOO — Bayesian information criteria

Three Bayesian analogs of AIC have emerged. DIC (Spiegelhalter et al. 2002) estimates the expected predictive log-likelihood using posterior samples and an effective parameter count $p_D$. WAIC (Watanabe 2010) replaces the plug-in $p_D$ with a per-observation variance term that is invariant to reparameterization. PSIS-LOO (Vehtari–Gelman–Gabry 2017) computes leave-one-out cross-validation via Pareto-smoothed importance sampling on existing MCMC draws — the de facto standard in modern Bayesian model comparison. Track 7 develops all three; this single remark is the forward pointer.

Remark 27 MDL — minimum description length (Rissanen 1978)

Minimum description length (Rissanen 1978) frames model selection as a data-compression problem: the best model is the one that gives the shortest joint description of (model, data $\mid$ model) using a universal code. Under regularity, the universal-code length of the data is asymptotically $-\hat\ell + (k/2)\log n + O(1)$, recovering BIC up to additive constants — MDL and BIC give the same ranking. Grünwald (2007) is the canonical textbook treatment.

Remark 28 Time-series and graphical-model selection

Two domain-specific extensions: HQIC (Hannan–Quinn 1979) replaces $\log n$ with $2\log\log n$ for time-series model selection, giving a penalty that grows more slowly than BIC's but faster than AIC's. Graphical-model / DAG selection uses BIC with tree-structured priors over DAGs (Heckerman–Geiger–Chickering 1995); for protein-network and gene-regulatory inference, a substantial methodology has emerged on top of this core idea.

Remark 29 High-dimensional IC: Extended BIC, stability selection, knockoffs

When $p \gg n$, classical IC degenerates: the candidate set has $2^p$ subsets, and BIC's $k\log n$ penalty no longer compensates for the multiplicity. Three modern extensions: the Extended BIC (Chen–Chen 2008) adds a $\gamma k \log p$ term to penalize the candidate-set size; stability selection (Meinshausen–Bühlmann 2010) bootstraps the lasso to control variable-selection frequency under finite-sample bounds; knockoffs (Barber–Candès 2015) provide finite-sample FDR control on the selected variable set. All three live at formalml's High-Dimensional Regression topic.

Remark 30 Structural risk minimization (VC theory, Rademacher complexity)

Vapnik's structural risk minimization generalizes information criteria from parametric models to function classes via complexity measures like the VC dimension or Rademacher complexity. The penalty term $2k$ in AIC is replaced by a complexity-based bound on the gap between empirical and population risk; Bartlett & Mendelson (2002) give the Rademacher-complexity foundation. Track 8 develops nonparametric model selection in this language; formalml's Structural Risk Minimization topic covers the classical Vapnik–Chervonenkis theory.

Remark 31 Track 7 on-ramp: BIC → marginal likelihood → BMA → MCMC

Thm 2's BIC–Laplace derivation is the gateway to the full Bayesian model-comparison machinery. Topic 25 (Bayesian Foundations) opens Track 7, and the subsequent Track 7 topics develop:

  • Priors on $\Theta$ and on the model space (Rem 8): conjugate, weakly informative, reference, intrinsic.
  • Posterior computation via MCMC: Metropolis–Hastings (Topic 26 §26.2), Hamiltonian Monte Carlo (§26.4), NUTS (§26.5; Hoffman & Gelman 2014).
  • Exact marginal likelihood: nested sampling (Skilling 2006), bridge sampling (Meng–Wong 1996), thermodynamic integration.
  • Predictive averaging via posterior predictive checks: BMA (Rem 23), DIC / WAIC / PSIS-LOO (Rem 26).

Topic 24 §24.4’s BIC is the asymptotic shorthand for these computationally heavier procedures.

Forward-map diagram for Topic 24. Central hub of model selection (AIC, BIC, CV, IC ranking) with arrows out to Track 7 (Bayesian model averaging, MCMC, DIC/WAIC/PSIS-LOO, marginal likelihood), Track 8 (structural risk minimization, nonparametric selection, Rademacher complexity), and formalml.com (post-selection inference, debiased lasso, cross-fitting, double ML, high-dimensional selection, knockoffs). Back-arrows to Topic 14 (MLE), Topic 18 (LRT/Wilks), Topic 21 (linear regression), Topic 22 (GLM), Topic 23 (regularization). Track-color coded with Track 6 in blue, Track 7 in amber, Track 8 in green, formalml.com in purple.

Topic 24 closes Track 6. Topic 21 was OLS as orthogonal projection; Topic 22 was IRLS on the exponential family; Topic 23 was penalized estimation as the rescue when those frameworks break; Topic 24 is the model-selection layer above all three. Reciprocal framing of Topic 23: with the effective-DOF generalization of §24.8, Topic 23's $\lambda$-indexed family becomes a continuous model space — every $\lambda > 0$ gives a model with effective parameter count $\operatorname{tr}(\mathbf{H}_\lambda)$, and Topic 23's CV-driven $\lambda$ selection is a special case of Topic 24's IC-driven model-selection framework. Topic 23 selects within a one-parameter family; Topic 24 selects across discrete or continuous candidate spaces, with a richer asymptotic theory. Track 6 ends here; the next topic shipped is Topic 25 — the Track 7 opener — Bayesian Foundations.


References

  1. Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723.
  2. Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.
  3. Mallows, C. L. (1973). Some comments on Cp. Technometrics, 15(4), 661–675.
  4. Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike’s criterion. Journal of the Royal Statistical Society, Series B, 39(1), 44–47.
  5. Yang, Y. (2005). Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika, 92(4), 937–950.
  6. Burnham, K. P. & Anderson, D. R. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach (2nd ed.). Springer.
  7. Claeskens, G. & Hjort, N. L. (2008). Model Selection and Model Averaging (1st ed.). Cambridge University Press.
  8. Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.
  9. Lehmann, E. L. & Romano, J. P. (2005). Testing Statistical Hypotheses (3rd ed.). Springer.
  10. Casella, G. & Berger, R. L. (2002). Statistical Inference (2nd ed.). Duxbury.
  11. Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint (1st ed.). Cambridge University Press.