intermediate 65 min read · April 19, 2026

Generalized Linear Models

The exponential-family bridge — canonical-link score identity, IRLS = Fisher scoring = Newton, deviance as the SSE analog, and the sandwich estimator that closes the §21.9 forward-promise.

22.1 When Normal Errors Fail

Topic 21 built the Normal linear model as orthogonal projection: \hat{\boldsymbol\beta} = \arg\min \|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|^2, exact F-distribution inference, \hat{\boldsymbol\beta} \mid \mathbf{X} \sim \mathcal{N}\!\left(\boldsymbol\beta, \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}\right), and the Gauss–Markov BLUE property under nothing more than zero-mean homoscedastic errors. The framework rests on two pillars: a Normal response and constant-variance errors. Drop either and the machinery breaks.

The first pillar fails any time the response is bounded, discrete, or skewed. Binary outcomes — credit defaults, click-throughs, classification labels — live in \{0, 1\}, but an OLS fit will happily produce \hat y \notin [0, 1], which is not a probability. Count outcomes — insurance claim frequencies, page-view counts, photons per pixel — are non-negative integers whose variance grows with the mean (a Poisson with mean \mu has variance \mu), violating homoscedasticity by construction. Positive-skewed continuous outcomes — claim amounts, waiting times, gene-expression intensities — have right tails that no Normal residual can match.

Three-panel figure. Left: binary scatter (n=80) with OLS line; the line dips below 0 for x < -2 and exceeds 1 for x > 2, with shaded 'impossible' regions. Middle: count scatter (n=120) where the OLS residuals fan out as fitted values grow — heteroscedasticity violation. Right: positive-skewed claim-amount scatter (n=100) on a log y-axis; OLS residuals' Q-Q plot inset shows tails diverging from the diagonal.

The fix is not to abandon the linear-predictor structure that made Topic 21 work — it is to compose that structure with a link function g(\mu) = \mathbf{x}^\top\boldsymbol\beta that maps the response's natural scale (e.g. probability, rate, mean) into \mathbb{R}, and to swap the Normal response for whatever exponential-family distribution the data actually lives in. That single modification — link function plus exponential-family response — is the generalized linear model.

Example 1 OLS on binary data: when the line escapes [0, 1]

Generate n = 80 predictor values x_i \sim \mathcal{N}(0, 1) and binary responses y_i \sim \text{Bernoulli}(\sigma(0.5 + 1.8 x_i)) where \sigma is the logistic sigmoid. Fit OLS: \hat y_i = \hat\beta_0 + \hat\beta_1 x_i. The fitted line is approximately \hat y \approx 0.5 + 0.30 x, which crosses zero at x \approx -1.7 and exceeds one at x \approx 1.7. Roughly 12 of the 80 fitted values are negative; roughly 14 exceed unity. The model is internally inconsistent: it predicts probabilities outside [0, 1], and any decision rule built on \hat y either has to clip those values or admit the model is wrong about its own outputs.

The same data fit by logistic regression with the logit link g(p) = \log(p/(1-p)) produces \hat\beta_0 \approx 0.51, \hat\beta_1 \approx 1.94 on the latent scale, with \hat p_i = \sigma(\hat\beta_0 + \hat\beta_1 x_i) \in (0, 1) for every i. The probabilities are bounded by construction; coefficient interpretation shifts from "additive on y" to "additive on \log\text{odds}(y=1)."
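The contrast is easy to reproduce. Below is a minimal sketch of the simulation just described, assuming numpy and statsmodels; the seed is arbitrary, so the exact counts of out-of-range fitted values will differ from the figures quoted above.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)                     # illustrative seed
n = 80
x = rng.standard_normal(n)
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 1.8 * x)))    # logistic sigmoid
y = rng.binomial(1, p_true)

X = sm.add_constant(x)                             # [1, x] design

ols = sm.OLS(y, X).fit()                           # linear probability model
fitted = ols.fittedvalues
print("OLS coefficients:", ols.params)
print("fitted < 0:", (fitted < 0).sum(), " fitted > 1:", (fitted > 1).sum())

logit = sm.GLM(y, X, family=sm.families.Binomial()).fit()   # logistic GLM
print("logit coefficients:", logit.params)
print("all fitted probabilities in (0,1):",
      bool(((logit.fittedvalues > 0) & (logit.fittedvalues < 1)).all()))
```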

Remark 1 Three failure modes for OLS

The three failures above are not isolated — they are the three structural assumptions of the Normal linear model failing one at a time:

  1. Bounded support. Binary, multinomial, and ordinal outcomes have support that does not match \mathbb{R}. The link function maps the bounded mean (e.g. probability) back to \mathbb{R}.
  2. Variance–mean dependence. Counts have \text{Var}(Y) = E[Y] (Poisson); proportions have \text{Var}(Y) = E[Y](1 - E[Y])/n (binomial); positive-skewed amounts have \text{Var}(Y) \propto E[Y]^2 (Gamma). Homoscedasticity is the exception, not the rule. The variance function V(\mu) encodes this dependence.
  3. Non-Normal tails. Skewed positives, heavy-tailed counts, zero-inflated mixtures — none have Normal sampling distributions even at moderate n. The exponential-family response replaces "Normal errors" with "the right family for this support."

GLMs address all three at once by composing a linear predictor with a link and pairing it with an exponential-family response. The linear-predictor structure is what survives; everything else generalizes.

Remark 2 Nelder–Wedderburn 1972: the unification

Before 1972, statisticians had logistic regression (Bliss 1935, Berkson 1944), probit regression (Bliss 1934), Poisson regression (Cochran 1940, Birch 1963), log-linear models for contingency tables (Goodman 1968), and Gamma regression for survival times (Cox 1972) — all developed independently with bespoke fitting algorithms. Nelder & Wedderburn (1972) observed that all of these are the same model: an exponential-family response with mean \mu, a link function g mapping \mu to a linear predictor \eta = \mathbf{x}^\top\boldsymbol\beta, and a single fitting algorithm — iteratively reweighted least squares (IRLS) — that handles every family. The unification was both conceptual (one framework, not five) and practical (one algorithm to implement, not five). McCullagh & Nelder (1989) is the canonical book-length development; Agresti (2015) is the current graduate standard.

Topic 22 reproduces the unification mathematically. §22.2 lays out the framework; §22.3 proves the canonical-link score identity that drives the unified algorithm; §22.4 / §22.5 / §22.6 fit logistic / Poisson / Gamma regression as worked examples; §22.7 introduces the deviance as the universal goodness-of-fit statistic; §22.8 / §22.9 handle inference and misspecification.

22.2 The GLM Framework

A generalized linear model has three ingredients: a linear predictor \eta = \mathbf{x}^\top\boldsymbol\beta, a link function g mapping the conditional mean \mu = E[Y \mid \mathbf{x}] to \eta, and a response distribution from an exponential family with mean \mu and dispersion \phi. We define each ingredient in turn.

Notation (extending §21.4). Throughout Topic 22, \mu_i = E[Y_i \mid \mathbf{x}_i], V(\mu_i) is the exponential-family variance function, and \phi is the dispersion parameter (known for Bernoulli and Poisson; estimated for Gamma and Normal). The linear predictor is \eta_i = \mathbf{x}_i^\top\boldsymbol\beta; the link function g satisfies g(\mu_i) = \eta_i, with inverse link g^{-1}. The weight matrix \mathbf{W} = \mathbf{W}(\boldsymbol\beta) is diagonal with W_{ii} = 1/[V(\mu_i)(g'(\mu_i))^2]; under the canonical link this simplifies to W_{ii} = V(\mu_i). The adjusted response at iteration t is z_i^{(t)} = \eta_i^{(t)} + g'(\mu_i^{(t)})(y_i - \mu_i^{(t)}). We write \boldsymbol\mu, \boldsymbol\eta, \mathbf{z} for the stacked vectors. The score vector is \mathbf{U}(\boldsymbol\beta) = \nabla_{\boldsymbol\beta}\ell(\boldsymbol\beta); the observed information is \mathbf{J}(\boldsymbol\beta) = -\nabla^2 \ell(\boldsymbol\beta); the Fisher (expected) information is \mathcal{I}(\boldsymbol\beta) = E[\mathbf{J}(\boldsymbol\beta)]. Under the canonical link, \mathbf{J} = \mathcal{I} = \mathbf{X}^\top\mathbf{W}\mathbf{X}. All matrix conventions from §21.4 (column vectors, \mathbf{X} of shape n \times (p+1), \mathbf{X}^\top for transpose) carry over verbatim.

Definition 1 Linear predictor

Given a design matrix \mathbf{X} \in \mathbb{R}^{n \times (p+1)} (with a leading column of ones for the intercept) and a coefficient vector \boldsymbol\beta \in \mathbb{R}^{p+1}, the linear predictor for observation i is

\eta_i = \mathbf{x}_i^\top\boldsymbol\beta = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}.

In stacked form, \boldsymbol\eta = \mathbf{X}\boldsymbol\beta \in \mathbb{R}^n. The linear predictor lives on the real line — no constraints — even when the response Y_i has bounded or discrete support.

Definition 2 Link function

A link function g: \mathcal{M} \to \mathbb{R} is a smooth, strictly monotone map from the response's mean space \mathcal{M} (e.g. (0, 1) for Bernoulli, (0, \infty) for Poisson and Gamma, \mathbb{R} for Normal) to the linear-predictor scale \mathbb{R}. It satisfies

g(\mu_i) = \eta_i = \mathbf{x}_i^\top\boldsymbol\beta, \qquad \mu_i = g^{-1}(\eta_i).

Strict monotonicity guarantees the inverse g^{-1} is well-defined, so the model uniquely specifies \mu_i given \boldsymbol\beta and \mathbf{x}_i. Smoothness is needed for the IRLS derivation in §22.3.

Definition 3 Exponential-family response and canonical link

The response Y_i follows an exponential family with density (or mass function)

f(y_i \mid \theta_i, \phi) = h(y_i, \phi) \exp\!\left(\frac{\theta_i y_i - A(\theta_i)}{\phi}\right),

where \theta_i is the natural parameter, A(\theta_i) is the cumulant function, \phi is the dispersion, and h is a base-measure factor that does not depend on \theta. By Topic 7 Thm 3 and Cor 2, E[Y_i] = A'(\theta_i) = \mu_i and \text{Var}(Y_i) = \phi A''(\theta_i) = \phi V(\mu_i), where V(\mu) = A''((A')^{-1}(\mu)) is the variance function.

A link function is canonical when g = (A')^{-1}, i.e. when the natural parameter coincides with the linear predictor: \theta_i = \eta_i = \mathbf{x}_i^\top\boldsymbol\beta. Under the canonical link the score equation collapses to the design-times-residual identity that makes §22.3 Thm 1 the featured theorem of the topic.

Four-panel horizontal figure. Each panel plots μ = g⁻¹(η) over η ∈ [-5, 5] for one canonical link. (a) Bernoulli/logit: μ = sigmoid(η), the S-curve from 0 to 1. (b) Poisson/log: μ = exp(η), exponential growth on the positive y-axis. (c) Gamma/inverse: μ = 1/η on η > 0, the hyperbola dropping toward zero. (d) Normal/identity: μ = η, the diagonal line. Each panel labels the cumulant function A(θ) and variance function V(μ) for reference.
Example 2 Normal + identity link recovers Topic 21's OLS

Take Y_i \sim \mathcal{N}(\mu_i, \sigma^2) with the identity link g(\mu) = \mu. The exponential-family form has \theta = \mu (so the link is canonical), A(\theta) = \theta^2/2, h(y, \phi) = (2\pi\phi)^{-1/2}\exp(-y^2/(2\phi)), and dispersion \phi = \sigma^2. The cumulant identities give A'(\theta) = \theta = \mu (mean) and A''(\theta) = 1, so V(\mu) = 1 — constant variance, recovering homoscedasticity.

The log-likelihood is

\ell(\boldsymbol\beta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mathbf{x}_i^\top\boldsymbol\beta)^2,

so maximizing \ell over \boldsymbol\beta is equivalent to minimizing \sum_i (y_i - \mathbf{x}_i^\top\boldsymbol\beta)^2 — the OLS objective from §21.2. The score equation \nabla_{\boldsymbol\beta}\ell = (1/\sigma^2)\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\boldsymbol\beta) = \mathbf{0} is \mathbf{X}^\top\mathbf{X}\hat{\boldsymbol\beta} = \mathbf{X}^\top\mathbf{y}, which is §21.4 Thm 3 (normal equations) verbatim. Topic 21's entire estimation theory is the Normal+identity-link special case of the GLM framework. We will see this reduction again, more abstractly, in §22.3 Proof 1 Step 4.
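The reduction can also be checked numerically: fitting the same design by OLS and by a Gaussian-family GLM returns the same coefficient vector. A minimal sketch, assuming numpy and statsmodels; the simulated data and seed are illustrative.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)                       # illustrative seed
n = 200
X = sm.add_constant(rng.standard_normal((n, 3)))     # intercept + 3 predictors
beta_true = np.array([1.0, -0.5, 2.0, 0.3])
y = X @ beta_true + rng.standard_normal(n)

ols = sm.OLS(y, X).fit()
glm = sm.GLM(y, X, family=sm.families.Gaussian()).fit()  # Gaussian family, identity link

print(np.allclose(ols.params, glm.params))   # True: same normal-equations solution
```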

Example 3 The four canonical links — one line each

For the four GLM families covered in this topic, the canonical link is determined by inverting A':

| Family | \theta | A(\theta) | A'(\theta) = \mu | Canonical link g(\mu) = (A')^{-1}(\mu) |
|---|---|---|---|---|
| Bernoulli(p) | \log[p/(1-p)] | \log(1 + e^\theta) | e^\theta/(1 + e^\theta) = p | logit: g(p) = \log[p/(1-p)] |
| Poisson(\mu) | \log\mu | e^\theta | e^\theta = \mu | log: g(\mu) = \log\mu |
| Gamma(\nu, \nu/\mu) (shape \nu, mean \mu) | -1/\mu | -\log(-\theta) = \log\mu | -1/\theta = \mu | inverse: g(\mu) = 1/\mu |
| Normal(\mu, \sigma^2) | \mu | \theta^2/2 | \theta = \mu | identity: g(\mu) = \mu |

The Bernoulli logit and Poisson log are the two canonical links you will see most often in practice; the Gamma inverse link is unusual and §22.6 Rem 14 discusses why the log link is often preferred there despite being non-canonical.

Remark 3 Canonical vs non-canonical links

The canonical link is mathematically privileged — it is the unique link under which the score equation is exactly \mathbf{X}^\top(\mathbf{y} - \boldsymbol\mu) = \mathbf{0} (§22.3 Thm 1) and observed Fisher information equals expected Fisher information (§22.3 Thm 2 Step 2). Non-canonical links — probit and cloglog for binary data, identity link for Poisson, log link for Gamma — break both properties: the score acquires an extra Jacobian factor and the Hessian's expected and observed forms diverge.

So why use them? Three reasons. Latent-variable interpretation: the probit link g(p) = \Phi^{-1}(p) corresponds to a latent Normal variable whose threshold determines the binary outcome — natural for psychological-test response models. Hazard interpretation: the cloglog link g(p) = \log(-\log(1-p)) corresponds to proportional hazards in discrete time — natural for survival analysis. Coefficient interpretability: the log link on Gamma gives \hat\beta_j as a log-multiplicative effect on the mean (a one-unit change in x_j multiplies \mu by e^{\hat\beta_j}), which is more useful in practice than the inverse-link's "additive on 1/\mu" interpretation.

In the GLM framework, non-canonical links work — IRLS still converges, asymptotic normality still holds — they just do not enjoy the elegant identities that make the canonical case the featured theorem of §22.3.

Remark 4 Dispersion φ — known or estimated

The dispersion \phi scales the variance: \text{Var}(Y_i) = \phi V(\mu_i). For Bernoulli and Poisson, \phi is fixed at 1 by the structure of the family — there is no separate variance parameter to estimate, because mean determines variance. For Gamma and Normal, \phi is a free parameter that must be estimated alongside \boldsymbol\beta. The estimator is the Pearson statistic:

\hat\phi = \frac{1}{n - p - 1}\sum_{i=1}^n \frac{(y_i - \hat\mu_i)^2}{V(\hat\mu_i)}.

When \phi is estimated rather than known, the deviance test (§22.7) uses an F-reference instead of \chi^2, exactly as in Topic 21 §21.8 where \hat\sigma^2 replaced \sigma^2 in the F-statistic. The §22.9 sandwich estimator generalizes this further: when the variance specification \phi V(\mu_i) is wrong (overdispersed Poisson, misspecified Gamma scale, clustered Bernoulli), the QMLE is still consistent for \boldsymbol\beta as long as the mean specification \mu_i = g^{-1}(\mathbf{x}_i^\top\boldsymbol\beta) is correct; only the variance estimate needs adjustment.

22.3 The Canonical-Link Score Identity and IRLS

Topic 22's featured result is the canonical-link score identity: under a canonical link, the score equation is

\nabla_{\boldsymbol\beta}\ell(\boldsymbol\beta) = \mathbf{X}^\top(\mathbf{y} - \boldsymbol\mu(\boldsymbol\beta)) = \mathbf{0},

which is the design matrix times the residual — exactly the form Topic 21 §21.4 Thm 3 wrote for the OLS normal equations. The composition \boldsymbol\mu = g^{-1}(\boldsymbol\eta) does the work of bridging the response scale and the linear-predictor scale; when the link is canonical and the response is Normal+identity, the composition collapses and we recover OLS verbatim. This is the structural claim of Track 6: linear regression is a GLM; the normal equations are the GLM score.

The companion result is the IRLS = Fisher scoring = Newton coincidence (Thm 2). Newton’s method on the GLM log-likelihood, Fisher scoring (Newton with expected Hessian), and iteratively reweighted least squares (a sequence of WLS solves on an adjusted response) all coincide under the canonical link. Each iteration is

\boldsymbol\beta^{(t+1)} = (\mathbf{X}^\top\mathbf{W}^{(t)}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{W}^{(t)}\mathbf{z}^{(t)},

a weighted least-squares solve with the WLS weights \mathbf{W}^{(t)} and adjusted response \mathbf{z}^{(t)} defined in the notation block. This is what glm() in R, statsmodels.GLM in Python, and every production GLM solver implements under the hood.

Theorem 1 Canonical-link score identity and concavity (FEATURED)

Let Y_i follow an exponential-family density f(y_i \mid \theta_i, \phi) = h(y_i, \phi)\exp\!\left((\theta_i y_i - A(\theta_i))/\phi\right) with canonical link \theta_i = \eta_i = \mathbf{x}_i^\top\boldsymbol\beta. Then:

  1. Score identity. \displaystyle \nabla_{\boldsymbol\beta}\ell(\boldsymbol\beta) = \frac{1}{\phi}\mathbf{X}^\top(\mathbf{y} - \boldsymbol\mu(\boldsymbol\beta)), where \mu_i = A'(\theta_i) = A'(\mathbf{x}_i^\top\boldsymbol\beta).

  2. Hessian. \displaystyle \mathcal{H}(\boldsymbol\beta) = -\frac{1}{\phi}\mathbf{X}^\top\mathbf{W}(\boldsymbol\beta)\mathbf{X}, where \mathbf{W}(\boldsymbol\beta) is diagonal with W_{ii} = V(\mu_i) > 0.

  3. Concavity. When \mathbf{X} has full column rank p+1, the log-likelihood \ell is strictly concave in \boldsymbol\beta. Consequently, any interior critical point of \ell is the unique global maximum.

In particular, the MLE \hat{\boldsymbol\beta}, when it exists in the interior of the parameter space, is the unique solution to \mathbf{X}^\top(\mathbf{y} - \boldsymbol\mu(\hat{\boldsymbol\beta})) = \mathbf{0}.

Proof

Step 1 (Setup). Substitute the canonical-link assumption \theta_i = \mathbf{x}_i^\top\boldsymbol\beta into the exponential-family density. By Topic 7 Thm 3 the family's mean is \mu_i = A'(\theta_i); by Topic 7 Cor 2 the variance function is V(\mu_i) = A''(\theta_i). Both identities are one-line consequences of differentiating \int f(y \mid \theta, \phi) \, dy = 1 with respect to \theta — see Topic 7 for the derivation.

Step 2 (Log-likelihood). With the canonical-link substitution, the per-observation log-likelihood is \ell_i(\boldsymbol\beta) = (\theta_i y_i - A(\theta_i))/\phi + \log h(y_i, \phi). Summing over i and dropping the \boldsymbol\beta-free \log h constant:

\ell(\boldsymbol\beta) = \frac{1}{\phi}\sum_{i=1}^n \left[(\mathbf{x}_i^\top\boldsymbol\beta) y_i - A(\mathbf{x}_i^\top\boldsymbol\beta)\right] + \text{const}.

Step 3 (Score). Differentiate term by term using the chain rule. The derivative of (\mathbf{x}_i^\top\boldsymbol\beta) y_i with respect to \boldsymbol\beta is y_i \mathbf{x}_i; the derivative of A(\mathbf{x}_i^\top\boldsymbol\beta) is A'(\mathbf{x}_i^\top\boldsymbol\beta)\mathbf{x}_i = \mu_i \mathbf{x}_i (Topic 7 Thm 3). Therefore

\nabla_{\boldsymbol\beta}\ell(\boldsymbol\beta) = \frac{1}{\phi}\sum_{i=1}^n \mathbf{x}_i (y_i - \mu_i) = \frac{1}{\phi}\mathbf{X}^\top(\mathbf{y} - \boldsymbol\mu).

The score is the design matrix times the residual — the design times the gap between observed response and fitted mean. When \phi = 1 (Bernoulli, Poisson) the prefactor drops and \nabla\ell = \mathbf{X}^\top(\mathbf{y} - \boldsymbol\mu) exactly; when \phi is unknown but constant across observations (Gamma, Normal) it factors out of the first-order condition. This is identity #1.

Step 4 (Reduction to OLS). This step is the pedagogical hinge of the entire topic. For the Normal+identity-link special case, A(\theta) = \theta^2/2, so A'(\theta) = \theta = \mu and A''(\theta) = 1 = V(\mu). Substituting back: \mu_i = \mathbf{x}_i^\top\boldsymbol\beta, and the score identity becomes

\frac{1}{\sigma^2}\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\boldsymbol\beta) = \mathbf{0} \quad\Longleftrightarrow\quad \mathbf{X}^\top\mathbf{X}\boldsymbol\beta = \mathbf{X}^\top\mathbf{y}.

That is Topic 21 §21.4 Theorem 3 (the normal equations) verbatim. The featured theorem of Topic 22 contains the featured estimation result of Topic 21. We will see this reduction surface again in §22.3 Rem 6, and the structural identity “OLS is GLM” is the through-line that licenses every cross-reference between the two topics.

Step 5 (Hessian). Differentiate the score from Step 3 once more. The derivative of \mu_i = A'(\mathbf{x}_i^\top\boldsymbol\beta) with respect to \boldsymbol\beta is A''(\mathbf{x}_i^\top\boldsymbol\beta)\mathbf{x}_i = V(\mu_i)\mathbf{x}_i (Topic 7 Cor 2). Therefore

\mathcal{H}(\boldsymbol\beta) = \frac{\partial^2 \ell}{\partial\boldsymbol\beta\, \partial\boldsymbol\beta^\top} = -\frac{1}{\phi}\sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^\top V(\mu_i) = -\frac{1}{\phi}\mathbf{X}^\top\mathbf{W}\mathbf{X},

where \mathbf{W} = \text{diag}(V(\mu_1), \ldots, V(\mu_n)) has strictly positive diagonal because exponential-family regularity guarantees V(\mu) > 0 in the interior of \mathcal{M} (Topic 7). This is identity #2.

Step 6 (Concavity). When \mathbf{X} has full column rank p+1, the quadratic form \mathbf{v}^\top(\mathbf{X}^\top\mathbf{W}\mathbf{X})\mathbf{v} = \|\mathbf{W}^{1/2}\mathbf{X}\mathbf{v}\|^2 is strictly positive for every \mathbf{v} \neq \mathbf{0} (since \mathbf{X}\mathbf{v} \neq \mathbf{0} by full column rank, and W_{ii} > 0). Hence \mathcal{H}(\boldsymbol\beta) is negative definite everywhere \boldsymbol\beta is interior, so \ell is strictly concave. Strict concavity guarantees any interior critical point is the unique global maximum. \square — using Topic 7 Thm 3 (A'(\theta) = \mu) and Topic 7 Cor 2 (A''(\theta) = V(\mu)).

Example 4 Logistic regression score = X^T(y − p)

For the logistic regression model Y_i \sim \text{Bernoulli}(p_i) with logit link \eta_i = \log[p_i/(1-p_i)] = \mathbf{x}_i^\top\boldsymbol\beta, the canonical-link score identity gives

\nabla\ell(\boldsymbol\beta) = \mathbf{X}^\top(\mathbf{y} - \mathbf{p}),

where \mathbf{p} = (p_1, \ldots, p_n)^\top with p_i = \sigma(\mathbf{x}_i^\top\boldsymbol\beta) and \sigma(z) = 1/(1 + e^{-z}) is the sigmoid. Setting the score to zero gives the system \sum_i \mathbf{x}_i (y_i - p_i) = \mathbf{0} — the binary-classification analog of "residual orthogonal to the design columns." Unlike OLS, the system is non-linear in \boldsymbol\beta (because p_i depends on \boldsymbol\beta through the sigmoid), so there is no closed-form solution; we need the IRLS algorithm of Thm 2.

The Hessian is \mathcal{H} = -\mathbf{X}^\top\mathbf{W}\mathbf{X} with W_{ii} = p_i(1 - p_i) — the variance of a Bernoulli with parameter p_i. The weight is largest at p_i = 0.5 (variance 0.25) and shrinks to zero at the boundary p_i \in \{0, 1\}, which is the source of the "near-separation" pathology Rem 8 will treat in §22.4.
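As a quick numerical check of the score identity, the analytic gradient \mathbf{X}^\top(\mathbf{y} - \mathbf{p}) can be compared against a finite-difference gradient of the Bernoulli log-likelihood. A minimal sketch in numpy; the data, seed, and evaluation point are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)                    # illustrative seed
n = 50
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = rng.binomial(1, 0.5, size=n)
beta = np.array([0.3, -0.7])                      # arbitrary evaluation point

def loglik(b):
    eta = X @ b
    # Bernoulli log-likelihood under the logit link: sum_i [y_i*eta_i - log(1 + e^{eta_i})]
    return np.sum(y * eta - np.logaddexp(0.0, eta))

p = 1.0 / (1.0 + np.exp(-(X @ beta)))
analytic = X.T @ (y - p)                          # canonical-link score X^T (y - p)

eps = 1e-6
numeric = np.array([(loglik(beta + eps * e) - loglik(beta - eps * e)) / (2 * eps)
                    for e in np.eye(2)])
print(analytic, numeric)                          # agree to roughly 1e-6
```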

Theorem 1 says what the MLE satisfies; Theorem 2 says how to find it. Newton's method on a smooth concave function converges quadratically near its maximum, but the Newton step requires solving a linear system \mathcal{H}(\boldsymbol\beta^{(t)})^{-1}\mathbf{U}(\boldsymbol\beta^{(t)}) at every iteration. The next theorem shows that, under the canonical link, this Newton step rewrites as a single weighted-least-squares solve on an adjusted response — and is identical to Fisher scoring because observed and expected information coincide. The algorithm is iteratively reweighted least squares, and it is what every production GLM solver implements.

Theorem 2 IRLS = Fisher scoring = Newton (under canonical link)

Under the canonical link, Newton’s method on the GLM log-likelihood produces iterates

\boldsymbol\beta^{(t+1)} = (\mathbf{X}^\top\mathbf{W}^{(t)}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{W}^{(t)}\mathbf{z}^{(t)},

where \mathbf{W}^{(t)} = \text{diag}(V(\mu_i^{(t)})) and the adjusted response is

\mathbf{z}^{(t)} = \boldsymbol\eta^{(t)} + (\mathbf{W}^{(t)})^{-1}(\mathbf{y} - \boldsymbol\mu^{(t)}).

The right-hand side is the closed-form weighted-least-squares estimator for the linear model \mathbf{z}^{(t)} \approx \mathbf{X}\boldsymbol\beta with weights \mathbf{W}^{(t)}. Therefore Newton's method on the canonical-link GLM log-likelihood is iteratively reweighted least squares. Furthermore, because \mathbf{W}^{(t)} depends on \boldsymbol\beta only through the deterministic mean \boldsymbol\mu^{(t)} (no randomness in \mathbf{W}), the observed information equals the expected information, so Newton's method is Fisher scoring.

Proof

Step 1 (Setup). From Thm 1 we have, under the canonical link, \mathbf{U}(\boldsymbol\beta) = (1/\phi)\mathbf{X}^\top(\mathbf{y} - \boldsymbol\mu) and \mathcal{H}(\boldsymbol\beta) = -(1/\phi)\mathbf{X}^\top\mathbf{W}\mathbf{X}. The dispersion \phi factors out of the Newton step (it appears in both \mathbf{U} and \mathcal{H} identically), so we absorb it into a constant for readability and write \mathbf{U} = \mathbf{X}^\top(\mathbf{y} - \boldsymbol\mu) and \mathbf{J} = \mathbf{X}^\top\mathbf{W}\mathbf{X} in this proof.

Step 2 (Observed = expected information). The Hessian \mathcal{H} = -\mathbf{X}^\top\mathbf{W}(\boldsymbol\beta)\mathbf{X} depends on \boldsymbol\beta only through \boldsymbol\mu(\boldsymbol\beta) = g^{-1}(\mathbf{X}\boldsymbol\beta), which is a deterministic function of \boldsymbol\beta and the fixed design matrix \mathbf{X}. The data \mathbf{y} does not appear in \mathcal{H}. Therefore E[\mathcal{H}(\boldsymbol\beta)] = \mathcal{H}(\boldsymbol\beta) identically — observed and expected information coincide. Newton's method (which uses the observed Hessian) and Fisher scoring (which uses -E[\mathcal{H}] = \mathcal{I}) generate identical iterates under the canonical link.

Step 3 (Newton step). The Newton update is

\boldsymbol\beta^{(t+1)} = \boldsymbol\beta^{(t)} - \mathcal{H}(\boldsymbol\beta^{(t)})^{-1}\mathbf{U}(\boldsymbol\beta^{(t)}) = \boldsymbol\beta^{(t)} + (\mathbf{X}^\top\mathbf{W}^{(t)}\mathbf{X})^{-1}\mathbf{X}^\top(\mathbf{y} - \boldsymbol\mu^{(t)}).

By Step 2, this is identically Fisher scoring. We now show it is identically WLS on an adjusted response.

Step 4 (Rewrite as WLS). Write \boldsymbol\beta^{(t)} = (\mathbf{X}^\top\mathbf{W}^{(t)}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{W}^{(t)}\mathbf{X}\boldsymbol\beta^{(t)} and \mathbf{y} - \boldsymbol\mu^{(t)} = \mathbf{W}^{(t)}(\mathbf{W}^{(t)})^{-1}(\mathbf{y} - \boldsymbol\mu^{(t)}), then factor out (\mathbf{X}^\top\mathbf{W}^{(t)}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{W}^{(t)} from both terms:

\boldsymbol\beta^{(t+1)} = (\mathbf{X}^\top\mathbf{W}^{(t)}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{W}^{(t)}\left[\mathbf{X}\boldsymbol\beta^{(t)} + (\mathbf{W}^{(t)})^{-1}(\mathbf{y} - \boldsymbol\mu^{(t)})\right].

Define the adjusted response \mathbf{z}^{(t)} = \boldsymbol\eta^{(t)} + (\mathbf{W}^{(t)})^{-1}(\mathbf{y} - \boldsymbol\mu^{(t)}), recalling \boldsymbol\eta^{(t)} = \mathbf{X}\boldsymbol\beta^{(t)}. Then the update is

\boxed{\;\boldsymbol\beta^{(t+1)} = (\mathbf{X}^\top\mathbf{W}^{(t)}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{W}^{(t)}\mathbf{z}^{(t)}.\;}

Step 5 (This is weighted least squares). The boxed right-hand side is the closed-form weighted-least-squares estimator (Topic 21 §21.7 Rem 13, in the WLS appendix-style remark) for the linear model \mathbf{z}^{(t)} \approx \mathbf{X}\boldsymbol\beta with weights \mathbf{W}^{(t)}. Each Newton / Fisher-scoring iteration is therefore a single WLS solve on the adjusted response \mathbf{z}^{(t)}. The weights \mathbf{W}^{(t)} and the response \mathbf{z}^{(t)} both depend on the current iterate \boldsymbol\beta^{(t)} — hence "iteratively reweighted least squares." \square
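
To make the boxed update concrete, here is a minimal IRLS sketch for the canonical-link logistic model in plain numpy. It is a sketch under stated assumptions, not production code: the convergence tolerance, iteration cap, and simulated data are illustrative, and no safeguard is included for the near-separation case where the weights collapse toward zero.

```python
import numpy as np

def irls_logistic(X, y, tol=1e-8, max_iter=25):
    """IRLS / Fisher scoring / Newton for the canonical-link Bernoulli GLM."""
    beta = np.zeros(X.shape[1])                        # beta^(0) = 0
    for t in range(max_iter):
        eta = X @ beta                                 # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))                # inverse logit link
        W = mu * (1.0 - mu)                            # canonical weights V(mu_i)
        z = eta + (y - mu) / W                         # adjusted response
        XtW = X.T * W                                  # X^T W (row-scaled)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)   # one WLS solve
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new, t + 1
        beta = beta_new
    return beta, max_iter

# tiny usage example on simulated data
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(60), rng.standard_normal(60)])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ np.array([-0.5, 1.8])))))
beta_hat, iters = irls_logistic(X, y)
print(beta_hat, iters)   # converges in a handful of iterations away from separation
```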

The boxed update in Step 4 is what IRLSVisualizer animates one iteration at a time. Click "Step" to advance from \boldsymbol\beta^{(t)} to \boldsymbol\beta^{(t+1)}; the contour panel shows the iterate trajectory on the log-likelihood surface, and the right panel decomposes the update into the IRLS weights W_{ii}^{(t)} and the adjusted-response scatter (z_i^{(t)}, \eta_i^{(t)}). The "Near-separation" preset reproduces the §22.4 Rem 8 pathology — the iterates wander off toward \pm\infty rather than converging.

Two-panel figure. Left: 2D contour map of the logistic log-likelihood ℓ(β₀, β₁) over a toy n=60 dataset; iterates β⁽⁰⁾ = (0,0), β⁽¹⁾, …, β⁽⁷⁾ plotted as arrows converging to the MLE β̂ near (-0.4, 1.9). Right: convergence plot of ‖β⁽ᵗ⁺¹⁾ − β⁽ᵗ⁾‖ on log scale vs iteration t — nearly straight with slope ≈ 2 in log-log coordinates, illustrating the quadratic convergence of Newton's method near the optimum.
Remark 5 Information-geometry preview

The IRLS update has a deeper interpretation in information geometry: the exponential family is a Riemannian manifold whose metric is the Fisher information matrix \mathcal{I}(\boldsymbol\beta) = \mathbf{X}^\top\mathbf{W}\mathbf{X}, and the IRLS step is the natural-gradient step under that metric — gradient ascent in the geometry the family naturally lives in, rather than the flat Euclidean geometry of \boldsymbol\beta \in \mathbb{R}^{p+1}. The natural gradient is invariant under reparameterization (a property the ordinary gradient lacks), which is exactly the kind of invariance that makes IRLS work uniformly across families and links.

The deep-learning community rediscovered natural-gradient descent in Amari (1998) for neural networks, where it is approximated by methods like K-FAC and Shampoo. The full information-geometry framing, including connections to the dual coordinate systems of exponential families, is at formalML: Information geometry .

Remark 6 OLS is a GLM — completing the structural reduction

We can now make the "OLS is GLM" claim of Step 4 of Proof 1 more emphatic. For Normal response with identity link and \phi = \sigma^2: A(\theta) = \theta^2/2 gives V(\mu) = A''(\theta) = 1, so \mathbf{W} = \mathbf{I}. The IRLS update collapses to

\boldsymbol\beta^{(t+1)} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{z}^{(t)} \quad\text{with}\quad \mathbf{z}^{(t)} = \mathbf{X}\boldsymbol\beta^{(t)} + (\mathbf{y} - \mathbf{X}\boldsymbol\beta^{(t)}) = \mathbf{y}.

The adjusted response collapses to \mathbf{y}, so the WLS solve becomes the OLS solve, and one iteration converges. Topic 21's normal equations are not just contained in the canonical-link score identity — Topic 21's fitting algorithm is contained in IRLS as the special case where \mathbf{W} = \mathbf{I} and the iteration converges in one step. Gauss–Markov BLUE optimality (§21.6) is the homoscedastic-case statement; under heteroscedastic errors the right BLUE is WLS with the true weights, which is what IRLS with the true variance function delivers (Topic 21 §21.9 Rem 21 forward-promised this; we have now discharged it).

The final ingredient in §22.3 is asymptotic normality of \hat{\boldsymbol\beta}. This is a corollary of Topic 14 Thm 3 specialized to the GLM information matrix \mathcal{I}(\boldsymbol\beta) = \mathbf{X}^\top\mathbf{W}\mathbf{X}.

Theorem 3 Asymptotic normality of the GLM MLE

Under GLM regularity (canonical link, \mathbf{X} full column rank, MLE \hat{\boldsymbol\beta} in the interior of the parameter space, finite Fisher information), as n \to \infty,

\sqrt{n}\,(\hat{\boldsymbol\beta} - \boldsymbol\beta_0) \;\xrightarrow{d}\; \mathcal{N}_{p+1}\!\left(\mathbf{0}, \;\mathcal{I}(\boldsymbol\beta_0)^{-1}\right),

where \mathcal{I}(\boldsymbol\beta_0) = (1/n)\mathbf{X}^\top\mathbf{W}(\boldsymbol\beta_0)\mathbf{X} is the per-observation Fisher information matrix. Equivalently, in finite-sample-approximation form,

\hat{\boldsymbol\beta} \;\overset{\text{approx}}{\sim}\; \mathcal{N}_{p+1}\!\left(\boldsymbol\beta_0, \;\phi (\mathbf{X}^\top\mathbf{W}(\boldsymbol\beta_0)\mathbf{X})^{-1}\right).

The estimated covariance \widehat{\text{Cov}}(\hat{\boldsymbol\beta}) = \hat\phi (\mathbf{X}^\top\hat{\mathbf{W}}\mathbf{X})^{-1} replaces the unknown \boldsymbol\beta_0 by \hat{\boldsymbol\beta} and the unknown \phi by the Pearson estimator (§22.2 Rem 4), and is what every coefficient SE in §22.8 is built on.

The proof is a direct specialization of Topic 14 Thm 3: the GLM log-likelihood satisfies the standard regularity conditions (smoothness in \boldsymbol\beta, integrable score, finite Fisher information at \boldsymbol\beta_0), so \sqrt{n}(\hat{\boldsymbol\beta} - \boldsymbol\beta_0) converges in distribution to \mathcal{N}(\mathbf{0}, \mathcal{I}^{-1}); the GLM-specific content is that \mathcal{I} = \mathbf{X}^\top\mathbf{W}\mathbf{X}/n has the design-matrix-times-variance-weight form. We omit the full restatement of Topic 14 Thm 3 — see §14.5 of that topic for the multivariate Taylor-expansion derivation.

Example 5 IRLS iteration trace on a 2-predictor logistic

On a synthetic logistic dataset with n = 60, p = 1 predictor, and true \boldsymbol\beta = (-0.5, 1.8)^\top, IRLS started from \boldsymbol\beta^{(0)} = (0, 0)^\top produces:

| t | \boldsymbol\beta^{(t)} | \Vert\boldsymbol\beta^{(t)} - \boldsymbol\beta^{(t-1)}\Vert | \ell(\boldsymbol\beta^{(t)}) |
|---|---|---|---|
| 0 | (0, 0) |  | -41.59 |
| 1 | (-0.41, 1.45) | 1.51 | -31.82 |
| 2 | (-0.50, 1.83) | 0.39 | -31.65 |
| 3 | (-0.50, 1.86) | 0.026 | -31.65 |
| 4 | (-0.50, 1.86) | 1.6 \times 10^{-4} | -31.65 |
| 5 | (-0.50, 1.86) | 9.0 \times 10^{-9} | -31.65 |

Five iterations to convergence at \|\Delta\boldsymbol\beta\| < 10^{-8}. The error \|\boldsymbol\beta^{(t)} - \hat{\boldsymbol\beta}\| shrinks roughly quadratically per step (1.51 \to 0.39 \to 0.026 \to \cdots, each step squaring the previous gap up to a constant) — the signature of Newton-method convergence in a neighborhood of the optimum (Topic 14 Thm 6). The right-panel readout of IRLSVisualizer reproduces this trajectory live.

22.4 Logistic Regression

The Bernoulli response with logit link is the binary-classification workhorse. Every binary outcome — credit default vs. repayment, click vs. no-click, treatment success vs. failure — fits naturally into this framework, and the canonical-link score identity Thm 1 gives the fitting algorithm directly: solve \sum_i (y_i - p_i)\mathbf{x}_i = \mathbf{0} via IRLS with weights W_{ii} = p_i(1-p_i).

Definition 4 Logistic regression model

Given binary responses Y_i \in \{0, 1\} and predictors \mathbf{x}_i \in \mathbb{R}^{p+1} (with leading intercept column), the logistic regression model is

Y_i \mid \mathbf{x}_i \sim \text{Bernoulli}(p_i), \qquad \text{logit}(p_i) = \log\!\left(\frac{p_i}{1 - p_i}\right) = \mathbf{x}_i^\top\boldsymbol\beta.

Equivalently, p_i = \sigma(\mathbf{x}_i^\top\boldsymbol\beta) where \sigma(z) = 1/(1 + e^{-z}) is the logistic sigmoid. The coefficient \beta_j is the log-odds-ratio associated with a one-unit increase in x_j (holding others fixed): e^{\beta_j} is the multiplicative change in the odds p/(1-p).

By Ex 4, the score is \nabla\ell = \mathbf{X}^\top(\mathbf{y} - \mathbf{p}) and IRLS uses weights W_{ii} = p_i(1-p_i). Theorem 3 specializes immediately:

Theorem 4 Wald-z CI for logistic coefficients

Under GLM regularity, the asymptotic distribution \hat{\boldsymbol\beta} \overset{\text{approx}}{\sim} \mathcal{N}_{p+1}(\boldsymbol\beta_0, (\mathbf{X}^\top\hat{\mathbf{W}}\mathbf{X})^{-1}) from Thm 3 gives the per-coefficient Wald-z confidence interval

\hat\beta_j \pm z_{1-\alpha/2}\sqrt{\left[(\mathbf{X}^\top\hat{\mathbf{W}}\mathbf{X})^{-1}\right]_{jj}}, \qquad \hat W_{ii} = \hat p_i(1 - \hat p_i),

with asymptotic coverage 1 - \alpha. The Wald test z_j = \hat\beta_j/\widehat{\text{SE}}(\hat\beta_j) for H_0: \beta_j = 0 is the dual test (Topic 19 Thm 1). For \phi = 1 Bernoulli, no F-correction is needed; the asymptotic z reference applies directly.

The Wald-z CI is the reported default in summary(glm(...)) in R and in the .summary() of a fitted statsmodels GLM in Python — easy to compute, asymptotically correct. But §22.8 will show it has known pathologies near the parameter boundary (separation, rare-event imbalance), where the profile-likelihood CI is more reliable.

Example 6 Credit-default logistic regression (n = 200)

A consumer-finance synthetic example: n = 200 loan applicants with two predictors, standardized age (z_{age}) and standardized log-income (z_{inc}). The default response is Y_i = 1 for default, Y_i = 0 for repayment, with true coefficients \boldsymbol\beta_0 = (-0.5, 1.0, 2.0) on the logit scale (intercept, age, income). The notebook fits the model with sm.GLM(y, X, family=Binomial()):

| Coefficient | \hat\beta_j | \widehat{\text{SE}}(\hat\beta_j) | z | p-value |
|---|---|---|---|---|
| Intercept | -0.972 | 0.197 | -4.93 | < 0.001 |
| Age | 1.319 | 0.236 | 5.59 | < 0.001 |
| Income | 1.869 | 0.282 | 6.62 | < 0.001 |

Both predictors are highly significant. Coefficient interpretation: a one-standard-deviation increase in age raises the log-odds of default by 1.32 (multiplicative odds change e^{1.32} \approx 3.74); the analogous income coefficient is e^{1.87} \approx 6.48. The full model deviance is D_{\text{full}} = 158.96 on n - p - 1 = 197 residual df, vs the intercept-only deviance D_{\text{null}} = 264.63 on 199 df. The deviance reduction D_{\text{null}} - D_{\text{full}} = 105.67 on 2 df has p \ll 0.001 under \chi^2_2 — overwhelming evidence the predictors carry information about default risk.
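The statsmodels call pattern the example describes looks roughly like the sketch below; the data-generating code, seed, and variable names are illustrative, so the printed numbers will not match the table exactly.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)                       # illustrative seed
n = 200
z_age, z_inc = rng.standard_normal(n), rng.standard_normal(n)
X = sm.add_constant(np.column_stack([z_age, z_inc]))
eta = -0.5 + 1.0 * z_age + 2.0 * z_inc               # true logit-scale coefficients
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(fit.summary())          # coefficients, Wald SEs, z, p-values
print(fit.conf_int())         # Wald-z 95% CIs per coefficient
print("deviance:", fit.deviance, "null deviance:", fit.null_deviance)
```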

Two-panel figure. Left: credit-default synthetic scatter (n=200) with age on x-axis, income on y-axis; defaults shown as red filled circles, repayments as grey hollow circles. Fitted logistic decision surface drawn as the contour of p̂ = 0.5 (a near-linear boundary), with shaded region p̂ > 0.5. Right: calibration plot — observed default rate per binned p̂ decile against fitted p̂. The points lie close to the diagonal y = x, indicating well-calibrated predictions.
Remark 7 Probit and cloglog: non-canonical binary links

The logit is canonical, but two other links see practical use. Probit: g(p) = \Phi^{-1}(p) where \Phi is the standard Normal CDF; the model corresponds to a latent Normal variable Y_i^* = \mathbf{x}_i^\top\boldsymbol\beta + \varepsilon_i with \varepsilon_i \sim \mathcal{N}(0, 1) and Y_i = \mathbb{1}[Y_i^* > 0]. Common in psychometric / response-threshold modeling. Complementary log-log (cloglog): g(p) = \log(-\log(1 - p)) corresponds to a discrete-time hazard interpretation — natural for survival-analysis-style binary outcomes. Both are non-canonical: IRLS still fits them (with weights W_{ii} = (g'(\mu_i))^{-2}/V(\mu_i) rather than the simpler V(\mu_i) form), and asymptotic normality still applies, but the score identity is no longer \mathbf{X}^\top(\mathbf{y} - \boldsymbol\mu) exactly. In practice, logit and probit give nearly identical fitted probabilities and indistinguishable LRT p-values; the choice is interpretive (log-odds vs latent-Normal) rather than statistical.

Remark 8 Separation: when the MLE escapes to infinity

Complete (or quasi-complete) separation occurs when the predictors perfectly (or nearly perfectly) separate the two classes — say, all y_i = 1 when \mathbf{x}_i^\top\boldsymbol\beta > 0 and all y_i = 0 otherwise. Geometrically, you can draw a hyperplane in predictor space with all positives on one side. Under separation, the logistic log-likelihood is monotonically increasing along the direction \boldsymbol\beta (you can always make the fit "more confident" by scaling up \|\boldsymbol\beta\|), so \hat{\boldsymbol\beta} does not exist in the interior of \mathbb{R}^{p+1} — IRLS iterates diverge to \pm\infty.

Production GLM solvers detect this by stopping when \|\boldsymbol\beta^{(t+1)} - \boldsymbol\beta^{(t)}\| fails to shrink within a maximum iteration count, returning a "did not converge" warning. The fix is to add a regularization penalty (Firth 1993's penalized-likelihood approach is one option; ridge / L² penalties are another) — Topic 23 develops penalized GLMs in detail. The IRLSVisualizer "Near-separation" preset reproduces the diverging-iterates pathology so the failure mode is visually unmistakable.

Remark 9 Binary cross-entropy = Bernoulli NLL

The “binary cross-entropy loss” used in deep-learning binary classifiers is, per observation,

\text{BCE}(y, \hat p) = -\left[y \log \hat p + (1 - y) \log(1 - \hat p)\right],

which is exactly the negative Bernoulli log-likelihood. A neural network with a sigmoid output layer trained to minimize cross-entropy is fitting a logistic GLM where the linear predictor \mathbf{x}_i^\top\boldsymbol\beta is replaced by a learned representation f_\theta(\mathbf{x}_i). All of §22.4 — link function, IRLS, deviance, separation pathology — applies; only the parameterization changes. See formalML: Logistic regression for the deep-learning treatment.
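A three-line numerical check of the identity, assuming numpy; the simulated labels and probabilities are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.binomial(1, 0.5, size=10).astype(float)
p_hat = rng.uniform(0.05, 0.95, size=10)

bce = -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))     # per-observation BCE
bern_nll = -np.log(np.where(y == 1, p_hat, 1 - p_hat))       # negative Bernoulli log-likelihood
print(np.allclose(bce, bern_nll))    # True: the two losses are the same quantity
```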

Remark 10 Hosmer–Lemeshow goodness-of-fit test

A logistic model can fit the trend well but miscalibrate the probabilities. The Hosmer–Lemeshow test (HOS2013, Ch. 5) checks the latter: partition observations into g deciles by fitted \hat p_i, count observed and expected positives in each bin, and form a \chi^2_{g-2} statistic. Large values reject "the model is well-calibrated." The GLMExplorer calibration plot (right panel of Fig 4) is the visual analog. For deep models, related tools include reliability diagrams and the expected-calibration-error metric — the latter summarizes the same binned observed-vs-predicted comparison that underlies the Hosmer–Lemeshow \chi^2 statistic. Standard caveat: the test has low power against subtle miscalibration and is sensitive to the choice of g (typically g = 10).

The GLMExplorer component lets you fit any of the four GLM families on six preset datasets — switch the family selector between Bernoulli / Poisson / Gamma / Normal, pick a link function, and watch the fit, residuals, and coefficient table update in real time. It is shared across §22.4 / §22.5 / §22.6 with different default presets.

Interactive GLMExplorer (default preset: Bernoulli family, logit link). The preset fit reports deviance D = 157.396 on 197 df, \hat\phi = 1.000, 7 IRLS iterations, and a coefficient table with Wald SEs; hovering a coefficient row computes its profile-LRT CI.

22.5 Poisson Regression

The Poisson response with log link is the workhorse for count outcomes: claim frequencies, page-view rates, photons per pixel, gene-read counts. The log link enforces \mu > 0 (counts are non-negative), and the multiplicative coefficient interpretation (e^{\hat\beta_j} = rate ratio per unit increase in x_j) is more useful in practice than the additive interpretation an identity link would give.

Definition 5 Poisson regression model

Given count responses Y_i \in \{0, 1, 2, \ldots\}, predictors \mathbf{x}_i, and (optionally) an exposure offset t_i > 0, the Poisson regression model is

Y_i \mid \mathbf{x}_i \sim \text{Poisson}(\mu_i), \qquad \log(\mu_i) = \mathbf{x}_i^\top\boldsymbol\beta + \log(t_i).

The exposure offset \log(t_i) is a covariate with fixed coefficient 1 — an exposure adjustment, not a parameter to estimate (Rem 11). The implied rate per unit exposure is \mu_i / t_i = \exp(\mathbf{x}_i^\top\boldsymbol\beta), and e^{\beta_j} is the rate ratio: the multiplicative change in expected count per unit increase in x_j, holding exposure and other covariates fixed.

The canonical-link score is \mathbf{X}^\top(\mathbf{y} - \boldsymbol\mu) where \mu_i = \exp(\mathbf{x}_i^\top\boldsymbol\beta + \log t_i) = t_i \exp(\mathbf{x}_i^\top\boldsymbol\beta). IRLS weights are W_{ii} = V(\mu_i) = \mu_i — heteroscedastic by construction (variance grows linearly with mean), which is why OLS on counts violates the homoscedasticity assumption while Poisson regression handles it naturally.

Example 7 Insurance claim frequency with exposure offset

An insurance carrier observes n = 300 policies with annual claim counts Y_i, policy-years exposure t_i \in [1, 10], standardized driver age z_{age, i}, and a region indicator r_i \in \{0, 1\}. The fitted model

\log E[Y_i \mid \mathbf{x}_i] = \beta_0 + \beta_1 z_{age, i} + \beta_2 r_i + \log t_i

uses the offset \log t_i to convert the model from "expected total claims" to "expected claims per policy-year" — without the offset, a policy observed for 10 years would dominate one observed for 1 year, and \hat{\boldsymbol\beta} would be biased toward the long-exposure observations. Statsmodels fits this with sm.GLM(y, X, family=Poisson(), offset=np.log(t)). On the synthetic dataset (true \boldsymbol\beta_0 = (-1.24, 0.61, 0.49), seed 2223), the fit gives \hat{\boldsymbol\beta} \approx (-1.24, 0.61, 0.49) — recovery to within 0.005 of truth on every coefficient, deviance D = 330.94 on 297 df. Coefficient interpretation: each one-SD increase in driver age multiplies the expected claim rate by e^{0.61} \approx 1.84; being in region 1 multiplies it by e^{0.49} \approx 1.63.
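A sketch of the offset fit described above, assuming numpy and statsmodels. The text specifies the seed and true coefficients but not the full simulation design, so the exposure and covariate distributions below are assumptions and the printed estimates will only approximately reproduce the quoted values.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2223)                    # seed quoted in the example
n = 300
t = rng.uniform(1, 10, size=n)                       # policy-years exposure (assumed uniform)
z_age = rng.standard_normal(n)
region = rng.binomial(1, 0.5, size=n)
X = sm.add_constant(np.column_stack([z_age, region]))

beta_true = np.array([-1.24, 0.61, 0.49])
mu = t * np.exp(X @ beta_true)                       # rate per policy-year times exposure
y = rng.poisson(mu)

fit = sm.GLM(y, X, family=sm.families.Poisson(), offset=np.log(t)).fit()
print(fit.params)                                    # roughly (-1.24, 0.61, 0.49)
print("rate ratios:", np.exp(fit.params[1:]))        # e^{beta_age}, e^{beta_region}
print("deviance:", fit.deviance, "df:", fit.df_resid)
```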

Two-panel figure. Left: insurance claim-count scatter against policy-years on log x-axis (n=300); fitted Poisson mean μ̂_i curve with 95% pointwise Wald bands shaded. Right: Pearson residuals r_i^P against fitted μ̂_i; residuals scatter symmetrically around 0 with no systematic trend, indicating the Poisson mean specification is adequate.
Remark 11 The offset interpretation

The offset \log(t_i) enters the linear predictor with fixed coefficient 1, not a parameter to estimate. It is not the same as adding \log(t_i) as a regular predictor (which would estimate a separate coefficient \hat\beta_t, allowing the rate to scale non-linearly with exposure). The offset says "the rate of claims per unit exposure is \exp(\mathbf{x}^\top\boldsymbol\beta)," and exposure enters multiplicatively with coefficient 1.

The same construction applies wherever counts have an obvious denominator: cases per population (epidemiology), failures per machine-hour (reliability), defects per lot size (quality control). The denominator goes in as a log-offset; the rate-per-unit-exposure is what β\boldsymbol\beta models.

Remark 12 Log link vs identity link

The log link is canonical and almost always the right choice. The identity link g(\mu) = \mu is occasionally used for additive rate-difference interpretation (e.g. "this treatment causes 0.3 additional events per person-year"), but it requires the constraint \mathbf{x}^\top\boldsymbol\beta > 0 for \mu to be a valid Poisson mean, which IRLS does not enforce automatically. The fit can stray into negative territory mid-iteration, causing the deviance to be undefined and IRLS to crash. In practice, identity-link Poisson is fit only when domain context gives a binding lower bound on covariate values (e.g. observational range strictly positive) and the additive interpretation is required for downstream decisions.

Remark 13 Overdispersion: the Poisson mean–variance assumption fails

Poisson regression assumes \text{Var}(Y_i) = E[Y_i] = \mu_i. Real count data often violate this: hospital admissions, social-media post counts, insurance claim frequencies all typically have \text{Var}(Y_i) > E[Y_i] — overdispersion. The point estimate \hat{\boldsymbol\beta} is still consistent for \boldsymbol\beta_0 under overdispersion (this is the QMLE result Thm 8 will state), but the naive Wald SE is too small, and Wald CIs undercover. Three fixes:

  1. Negative Binomial regression — replace Poisson with Y_i \sim \text{NegBin}(\mu_i, r) where r is a separate dispersion parameter; a full distributional model. Standard tools: MASS::glm.nb in R, statsmodels.discrete.NegativeBinomial in Python. Forward-pointed to formalML: Generalized linear models for the full treatment.
  2. Quasi-Poisson — keep the Poisson score equation, but estimate the dispersion \phi separately via the Pearson statistic. Same point estimates, scaled SEs.
  3. Sandwich SE — keep the Poisson model entirely, but use the HC0–HC3 sandwich estimator from §22.9 to get robust SEs without changing the distributional assumption. This is the QMLE approach Wedderburn 1974 introduced.

The §22.9 SandwichCoverageSimulator shows the coverage gap visually: under overdispersion ratio r = 1.8, naive Wald CIs cover at 79% vs the nominal 95%; HC3 sandwich CIs cover at 91.5%, much closer to nominal.
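Fix 2 (quasi-Poisson) is a few lines on top of a fitted Poisson GLM: estimate the Pearson dispersion by hand and rescale the naive Wald SEs. A minimal sketch, assuming a statsmodels GLM results object such as the claim-frequency fit sketched earlier; the function name is illustrative.

```python
import numpy as np

def quasi_poisson_se(fit):
    """Pearson dispersion and scaled Wald SEs for a fitted Poisson sm.GLM result."""
    mu_hat = fit.fittedvalues
    y = fit.model.endog
    pearson = np.sum((y - mu_hat) ** 2 / mu_hat)     # V(mu) = mu for Poisson
    phi_hat = pearson / fit.df_resid                 # Pearson dispersion estimate
    return phi_hat, np.sqrt(phi_hat) * fit.bse       # scaled Wald SEs

# usage: phi_hat, scaled_bse = quasi_poisson_se(fit)
# phi_hat substantially above 1 signals overdispersion; the scaled SEs widen the Wald CIs
```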

22.6 Gamma Regression

The Gamma response with inverse or log link is the third workhorse — for positive-skewed continuous outcomes: insurance claim amounts (as opposed to frequencies), waiting times, healthcare costs, gene-expression intensities. The Gamma allows a shape parameter \nu that controls skewness, and the variance scales as \mu^2 (multiplicative noise on the mean rather than additive), which matches the "log-Normal-ish" empirical behavior of these outcomes.

Definition 6 Gamma regression model

Given positive responses Y_i > 0 and predictors \mathbf{x}_i, the Gamma regression model is

Y_i \mid \mathbf{x}_i \sim \text{Gamma}(\nu, \nu/\mu_i), \qquad g(\mu_i) = \mathbf{x}_i^\top\boldsymbol\beta,

parameterized so that the shape is \nu and the rate is \nu/\mu_i, giving E[Y_i] = \mu_i and \text{Var}(Y_i) = \mu_i^2/\nu. The dispersion is \phi = 1/\nu, and the variance function is V(\mu) = \mu^2. The canonical link is g(\mu) = 1/\mu (the inverse link); the practical link is g(\mu) = \log\mu (the log link, non-canonical) — Rem 14 explains why the latter is often preferred.

Example 8 Gamma GLM on insurance claim amounts (fulfills Topic 6 Example 4)

Topic 6 Example 4 introduced insurance claim amounts as a motivating Gamma example and forward-pointed to "GLMs for the full framework." We discharge that pointer here. On a synthetic claim-amounts dataset with n = 250, claim-severity score z_i, true coefficients \boldsymbol\beta_0 = (1.0, 0.5) on the log scale, and shape \nu = 2 (so \phi = 0.5):

| Coefficient | \hat\beta_j | \widehat{\text{SE}}(\hat\beta_j) |
|---|---|---|
| Intercept | 0.936 | 0.087 |
| Severity | 0.569 | 0.062 |

Estimated dispersion \hat\phi = 0.444 (true 0.5; recovery within sampling error at n = 250). Deviance D = 126.49 on 248 df. The fitted model is \log E[Y_i \mid z_i] = 0.936 + 0.569 z_i, so a one-SD increase in claim severity multiplies the expected claim amount by e^{0.569} \approx 1.77. The right panel of Figure 6 below shows deviance residuals against fitted values — the symmetric-around-zero, no-trend pattern signals the Gamma+log specification is adequate; if the dispersion were misspecified or the variance function wrong, the deviance residuals would show heteroscedastic structure. The §22.9 sandwich estimator handles this when it happens.
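A sketch of the Gamma+log fit, assuming numpy and statsmodels; the simulation design is illustrative, and the link spelling assumes a recent statsmodels release (sm.families.links.Log), with older versions using the lowercase links.log().

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)                       # illustrative seed
n, nu = 250, 2.0                                     # shape nu = 2, so phi = 1/nu = 0.5
z = rng.standard_normal(n)
X = sm.add_constant(z)
mu = np.exp(1.0 + 0.5 * z)                           # true log-scale coefficients (1.0, 0.5)
y = rng.gamma(shape=nu, scale=mu / nu)               # mean mu, variance mu^2 / nu

fam = sm.families.Gamma(link=sm.families.links.Log())
fit = sm.GLM(y, X, family=fam).fit()
print(fit.params)                                    # roughly (1.0, 0.5) up to sampling noise
print("Pearson dispersion phi_hat:", fit.pearson_chi2 / fit.df_resid)
print("deviance:", fit.deviance, "df:", fit.df_resid)
```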

Two-panel figure. Left: histogram of insurance claim amounts on log x-axis (n=250), with fitted Gamma density overlaid per covariate decile — deciles 1, 5, 10 shown as separate density curves shifted along the x-axis. Right: deviance residuals r_i^D against fitted μ̂_i on log x-axis; residuals scatter symmetrically around 0 with no systematic trend, indicating the Gamma+log specification is adequate.
Remark 14 Canonical inverse link vs practical log link

The Gamma's canonical link is the inverse: g(\mu) = 1/\mu, so \eta_i = 1/\mu_i. Under the inverse link the score identity Thm 1 applies cleanly — \nabla\ell = \mathbf{X}^\top(\mathbf{y} - \boldsymbol\mu) — but coefficient interpretation is awkward: \beta_j is the additive change in 1/\mu per unit change in x_j, which has no natural domain meaning ("additive on the reciprocal of expected claim amount"). Worse, the inverse link requires \mathbf{x}^\top\boldsymbol\beta > 0 to keep \mu > 0, and IRLS does not enforce this — the fit can stray into negative territory and crash.

The log link g(\mu) = \log\mu is non-canonical for Gamma but solves both problems: \mu = \exp(\mathbf{x}^\top\boldsymbol\beta) is automatically positive, and \hat\beta_j has a clean multiplicative interpretation (e^{\hat\beta_j} is the multiplicative change in expected claim amount per unit change in x_j). The cost is the loss of the canonical-link identities: with g(\mu) = \log\mu we have g'(\mu) = 1/\mu and V(\mu) = \mu^2, so the Fisher-scoring weights W_{ii} = 1/[V(\mu_i)(g'(\mu_i))^2] = 1/[\mu_i^2 \cdot \mu_i^{-2}] = 1 are constant, and the score equation acquires an extra Jacobian factor (it is no longer \mathbf{X}^\top(\mathbf{y} - \boldsymbol\mu) exactly). IRLS still converges and the asymptotic theory still applies. In practice, log-link Gamma is the default in R's glm(family=Gamma(link="log")) and is what insurance, finance, and biostatistics applications use.

Remark 15 Gamma dispersion — estimated, not fixed

For Bernoulli and Poisson, \phi = 1 is structural — there is no separate variance parameter to estimate. For Gamma, \phi = 1/\nu is a free parameter, estimated by the Pearson statistic

\hat\phi = \frac{1}{n - p - 1}\sum_{i=1}^n \frac{(y_i - \hat\mu_i)^2}{\hat\mu_i^2}.

The estimate \hat\phi enters all coefficient SEs as a scalar multiplier: \widehat{\text{Cov}}(\hat{\boldsymbol\beta}) = \hat\phi (\mathbf{X}^\top\hat{\mathbf{W}}\mathbf{X})^{-1}. For nested-model deviance tests (§22.7), \hat\phi also enters: when \phi is estimated, the deviance test uses an F_{k, n-p-1} reference instead of \chi^2_k, exactly as in Topic 21 §21.8 where \hat\sigma^2 replaced \sigma^2. The notebook fit in Ex 8 produced \hat\phi = 0.444 vs true 0.5 — within 13% at n = 250, which is good enough for inference.

22.7 Deviance and the Deviance Likelihood-Ratio Test

In Topic 21, the sum of squared errors \text{SSE} = \sum_i (y_i - \hat y_i)^2 played a triple role: it was the OLS objective, the goodness-of-fit measure, and the test statistic for nested-model comparison via the F-test (§21.8 Thm 9). The GLM analog of all three is the deviance D(\mathcal{M}) = 2\phi[\ell_{\text{sat}} - \ell(\mathcal{M})] — a likelihood-ratio statistic that compares the fitted model \mathcal{M} to a "saturated" model in which every observation has its own parameter. Differences in deviance across nested models recover -2\log\Lambda_n (Wilks' statistic) verbatim, so Wilks' theorem (Topic 18 Thm 4) gives the asymptotic \chi^2_k distribution immediately.

Definition 7 Saturated model and deviance

The saturated model for a GLM with response y\mathbf{y} has one parameter per observation: it sets μ^isat=yi\hat\mu_i^{\text{sat}} = y_i for every ii, so the fitted means coincide with the observations. The saturated log-likelihood is

sat=i=1nlogf(yiyi,ϕ),\ell_{\text{sat}} = \sum_{i=1}^n \log f(y_i \mid y_i, \phi),

the maximum log-likelihood achievable on the data without any constraint from the linear predictor. The deviance of a fitted model M\mathcal{M} with log-likelihood (M)\ell(\mathcal{M}) is

D(M)=2ϕ[sat(M)].D(\mathcal{M}) = 2\phi\left[\ell_{\text{sat}} - \ell(\mathcal{M})\right].

Deviance is a non-negative scalar (since (M)sat\ell(\mathcal{M}) \leq \ell_{\text{sat}} by definition); D(M)=0D(\mathcal{M}) = 0 would mean the fitted model achieves the saturated fit, which happens only on degenerate data (e.g. fitting a logistic with one binary predictor on all-zero responses). For Bernoulli and Poisson (ϕ=1\phi = 1), the deviance is the GLM analog of SSE; for Gamma and Normal, the dispersion factor ϕ\phi converts log-likelihood units to deviance units.

For the four canonical families, the per-observation deviance contributions did_i such that D=idiD = \sum_i d_i are:

  • Bernoulli: di=2[yilog(yi/μ^i)+(1yi)log((1yi)/(1μ^i))]d_i = 2\left[y_i \log(y_i / \hat\mu_i) + (1 - y_i)\log((1 - y_i)/(1 - \hat\mu_i))\right], with 0log000 \log 0 \equiv 0.
  • Poisson: di=2[yilog(yi/μ^i)(yiμ^i)]d_i = 2\left[y_i \log(y_i / \hat\mu_i) - (y_i - \hat\mu_i)\right], with 0log000 \log 0 \equiv 0.
  • Gamma: di=2[(yiμ^i)/μ^ilog(yi/μ^i)]d_i = 2\left[(y_i - \hat\mu_i)/\hat\mu_i - \log(y_i / \hat\mu_i)\right].
  • Normal: di=(yiμ^i)2d_i = (y_i - \hat\mu_i)^2, so D=SSED = \text{SSE} directly — Topic 21’s residual-sum-of-squares statistic is the Normal-family deviance.

The Normal-family identity D=SSED = \text{SSE} is the cleanest expression of the “GLM contains OLS” reduction at the inference layer: every deviance computation in §22.7 specializes back to Topic 21’s SSE machinery for Normal+identity-link.
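A minimal check of the unit-deviance formulas, assuming statsmodels and invented simulated data: the hand-rolled Poisson d_i sum to the solver's reported deviance, and the Normal-family deviance equals the residual sum of squares.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
X = sm.add_constant(x)

# Poisson: hand-rolled unit deviances match the solver's deviance
y_pois = rng.poisson(np.exp(0.2 + 0.6 * x))
fit_p = sm.GLM(y_pois, X, family=sm.families.Poisson()).fit()
mu = fit_p.mu
with np.errstate(divide="ignore", invalid="ignore"):
    term = np.where(y_pois > 0, y_pois * np.log(y_pois / mu), 0.0)   # 0 log 0 = 0 convention
d_i = 2 * (term - (y_pois - mu))
print(np.allclose(d_i.sum(), fit_p.deviance))                        # True

# Normal + identity link: the deviance is exactly the residual sum of squares
y_norm = 1.0 + 0.5 * x + rng.normal(scale=0.7, size=n)
fit_n = sm.GLM(y_norm, X, family=sm.families.Gaussian()).fit()
print(np.allclose(fit_n.deviance, np.sum((y_norm - fit_n.mu) ** 2))) # True
```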

Theorem 5 Deviance decomposition (GLM analog of SST = SSR + SSE)

For nested GLM models M0M1\mathcal{M}_0 \subset \mathcal{M}_1 with kk linear restrictions on the coefficients,

D(M0)=D(M1)+[D(M0)D(M1)].D(\mathcal{M}_0) = D(\mathcal{M}_1) + \left[D(\mathcal{M}_0) - D(\mathcal{M}_1)\right].

The reduced-model deviance decomposes into the full-model deviance (residual lack-of-fit not explained by the additional predictors) plus the deviance improvement (lack-of-fit explained by adding the kk predictors back). This is the GLM analog of the OLS decomposition SST=SSR+SSE\text{SST} = \text{SSR} + \text{SSE} from §21.8. The deviance improvement D(M0)D(M1)D(\mathcal{M}_0) - D(\mathcal{M}_1) is the test statistic for H0H_0 : the kk restrictions hold.

The decomposition is a tautology — no proof needed, just rearrangement of the definition of DD. The theorem is included for symmetry with the OLS case and to motivate Thm 6, which gives the asymptotic distribution of the deviance improvement.

Theorem 6 Deviance LRT → χ²_k (Wilks specialization)

Let M1\mathcal{M}_1 be a GLM with parameter space βRp+1\boldsymbol\beta \in \mathbb{R}^{p+1}, and let M0\mathcal{M}_0 be the reduced GLM imposing kk linear restrictions Rβ=r\mathbf{R}\boldsymbol\beta = \mathbf{r} on the coefficients. Under H0:Rβ=rH_0: \mathbf{R}\boldsymbol\beta = \mathbf{r} and GLM regularity (full-rank X\mathbf{X}, interior MLE), as nn \to \infty,

D(M0)D(M1)  d  ϕχk2.D(\mathcal{M}_0) - D(\mathcal{M}_1) \;\xrightarrow{d}\; \phi \cdot \chi^2_k.

For fixed-\phi families (Bernoulli, Poisson) the \phi = 1 asymptotic applies directly; for estimated-\phi families (Gamma, Normal), the statistic [D(\mathcal{M}_0) - D(\mathcal{M}_1)]/(k\hat\phi) is referred to an F_{k, n - p - 1} distribution to account for dispersion-estimation uncertainty.

Proof [show]

Step 1 (Setup). Let β^0\hat{\boldsymbol\beta}_0 and β^1\hat{\boldsymbol\beta}_1 be the MLEs under the reduced and full models. The full model has p+1p + 1 free parameters; the reduced model has p+1kp + 1 - k free parameters under the kk linear restrictions Rβ=r\mathbf{R}\boldsymbol\beta = \mathbf{r}.

Step 2 (Deviance difference cancels the saturated term). By Def 7, D(M0)D(M1)=2ϕ[sat(β^0)]2ϕ[sat(β^1)]=2ϕ[(β^1)(β^0)]D(\mathcal{M}_0) - D(\mathcal{M}_1) = 2\phi[\ell_{\text{sat}} - \ell(\hat{\boldsymbol\beta}_0)] - 2\phi[\ell_{\text{sat}} - \ell(\hat{\boldsymbol\beta}_1)] = 2\phi[\ell(\hat{\boldsymbol\beta}_1) - \ell(\hat{\boldsymbol\beta}_0)]. The saturated-model log-likelihood cancels.

Step 3 (LRT identification). Topic 18 Def 1 defines the log-likelihood-ratio statistic for nested models as 2logΛn=2[(β^1)(β^0)]-2\log\Lambda_n = 2[\ell(\hat{\boldsymbol\beta}_1) - \ell(\hat{\boldsymbol\beta}_0)]. For ϕ=1\phi = 1 the deviance difference is 2logΛn-2\log\Lambda_n verbatim; for ϕ1\phi \neq 1 they agree up to the scalar ϕ\phi.

Step 4 (Apply Wilks). By Topic 18 Thm 4 (Wilks’ theorem), under H0:Rβ=rH_0 : \mathbf{R}\boldsymbol\beta = \mathbf{r} and standard regularity (full-rank Fisher information, interior MLE, H0H_0 imposes kk smooth constraints), 2logΛndχk2-2\log\Lambda_n \xrightarrow{d} \chi^2_k. Therefore D(M0)D(M1)dϕχk2D(\mathcal{M}_0) - D(\mathcal{M}_1) \xrightarrow{d} \phi \cdot \chi^2_k.

Step 5 (Decision rule). At significance level \alpha, reject H_0 when D(\mathcal{M}_0) - D(\mathcal{M}_1) > \phi \chi^2_{k, 1-\alpha}. For estimated \phi, form [D(\mathcal{M}_0) - D(\mathcal{M}_1)]/(k\hat\phi) and reject when it exceeds F_{k, n-p-1, 1-\alpha}, which converges to \chi^2_{k, 1-\alpha}/k as n \to \infty. \square (using Topic 18 Thm 4, Wilks).

The entire proof is five steps of book-keeping; the substance lives in Topic 18 Thm 4. The deviance test is Wilks; we just renamed -2\log\Lambda_n to D_0 - D_1 to match GLM software output. R's anova(fit0, fit1, test="Chisq") and Python's glm1.compare_lr_test(glm0) both compute exactly D(\mathcal{M}_0) - D(\mathcal{M}_1) and reference it to \chi^2_k.

Example 9 Nested-Poisson deviance test

Continuing the §22.5 insurance-frequency setup: fit a full Poisson model with intercept + 3 predictors + log-offset on n=150n = 150 observations, and a reduced model with intercept + 1 predictor + log-offset (dropping the second and third predictors, k=2k = 2 restrictions). On the synthetic dataset with true β=(1.0,0.5,0.7,0.3)\boldsymbol\beta = (-1.0, 0.5, 0.7, 0.3):

Quantity | Value
Full-model deviance D_1 | 156.15
Reduced-model deviance D_0 | 300.75
Deviance difference D_0 - D_1 | 144.60 on k = 2 df
\chi^2_2 critical value at \alpha = 0.05 | 5.99
p-value | \ll 10^{-30}

Decisive rejection of H0H_0 — the dropped predictors carry highly significant information. The DevianceTestExplorer lets you sweep across families, kk, and α\alpha to see how the test behaves under different effect sizes. The middle panel shows the power curve as a function of non-centrality λ\lambda: for λ=10\lambda = 10 at α=0.05,k=2\alpha = 0.05, k = 2, the test has 80%\sim 80\% power.
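A sketch of the Example 9 workflow in Python (synthetic data; the exposure variable, seed, and therefore the resulting numbers are invented and will differ from the table): fit the nested Poisson models with a log-exposure offset, difference the deviances, and refer the difference to \chi^2_k.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
n = 150
Z = rng.normal(size=(n, 3))
exposure = rng.uniform(0.5, 2.0, size=n)          # hypothetical exposure for the log-offset
beta = np.array([-1.0, 0.5, 0.7, 0.3])
y = rng.poisson(exposure * np.exp(beta[0] + Z @ beta[1:]))

off = np.log(exposure)
X_full = sm.add_constant(Z)                       # intercept + 3 predictors
X_red = sm.add_constant(Z[:, :1])                 # intercept + 1 predictor (k = 2 restrictions)

fit_full = sm.GLM(y, X_full, family=sm.families.Poisson(), offset=off).fit()
fit_red = sm.GLM(y, X_red, family=sm.families.Poisson(), offset=off).fit()

delta = fit_red.deviance - fit_full.deviance      # D(M0) - D(M1)
k = X_full.shape[1] - X_red.shape[1]
print(delta, stats.chi2.sf(delta, df=k))          # reject H0 when the p-value is small
```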

Two-panel figure. Left: χ²_2 null density curve (the reference) with rejection region shaded red from the 95% critical value (5.99) to the right tail; observed deviance-difference D_0 − D_1 = 144.60 marked with a vertical green line far in the tail; one-line annotation 'p ≪ 0.001'. Right: power curve — Prob(reject | non-centrality λ) as a function of λ ∈ [0, 20] under the non-central χ²_2(λ) family; horizontal reference line at 0.80 power.
Remark 16 Pearson residuals vs deviance residuals

Two residual flavors are reported by every GLM solver:

  • Pearson: riP=(yiμ^i)/V(μ^i)ϕ^r_i^P = (y_i - \hat\mu_i)/\sqrt{V(\hat\mu_i)\hat\phi} — standardized residual on the response scale, analog of OLS standardized residual.
  • Deviance: riD=sign(yiμ^i)dir_i^D = \text{sign}(y_i - \hat\mu_i)\sqrt{d_i} where did_i is observation ii‘s contribution to DD.

For Normal+identity-link, both reduce to standardized OLS residuals (T8.9 verifies this to machine precision). For other families, the deviance residual is generally preferred for diagnostic plotting because it is more symmetric under correctly-specified models — the Pearson residual inherits the response-scale skewness (e.g. positive skew for Poisson with small μ\mu), while the deviance residual is a square-root transform that smooths the asymmetry. In production, plot riDr_i^D against μ^i\hat\mu_i; systematic structure flags model misspecification (wrong link, missing predictor, wrong variance function).

The Figure 8 grid in §22.9 shows both residual flavors side-by-side for Poisson and Gamma fits — the Pearson residuals fan out heteroscedastically even under correct specification (fitting artifact, not model failure), while the deviance residuals are symmetric and unstructured.
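A short sketch of the diagnostic, reusing the full Poisson fit from the Example 9 code above (any fitted GLMResults object works the same way): extract both residual flavors and compare their symmetry.

```python
from scipy import stats

# Residual diagnostics for the full Poisson fit above
r_pearson = fit_full.resid_pearson
r_deviance = fit_full.resid_deviance

# Deviance residuals are typically closer to symmetric under a correct specification
print(stats.skew(r_pearson), stats.skew(r_deviance))

# For the diagnostic plot, scatter each flavor against the fitted means fit_full.mu
# and look for systematic structure (fanning, trends, U-shapes).
```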

The DevianceTestExplorer component lets you explore the deviance LRT across families and effect sizes. The left panel shows the χk2\chi^2_k null density with the rejection region shaded; the middle panel shows the power curve as a function of non-centrality λ\lambda; the right panel shows the profile-likelihood for βj\beta_j with both the Wald-z CI and the profile-LRT CI rendered, making the asymmetry of the latter visible.

[Interactive DevianceTestExplorer, default state (k = 1, α = 0.05). Panels: "χ²_1 null + observed Δ" with the rejection cut at χ²_crit = 3.84 and the observed Δ = 52.48 marked; "Power vs λ (k = 1, α = 0.050)"; "Profile D(β_j) vs Wald-z (j = 2)" with the D̂ + χ²_{1,1−α} cutoff line. Readout: D₀ (reduced) = 160.370, D₁ (full) = 107.887, Δ = 52.483 on 1 df, p-value = 4.34e-13, profile CI [0.434, 0.767], Wald CI [0.453, 0.746]. The k slider changes only the χ²_k reference axis.]

22.8 Coefficient Inference: Wald, Score, Profile-Likelihood

Topic 17 §17.9 introduced the asymptotic-equivalence trio — Wald, Score, and Likelihood-Ratio tests — as three flavors of the same χ2\chi^2 inference machinery, each with its own computational and finite-sample tradeoffs. §22.8 specializes the trio to GLM coefficient inference: Wald-z for fast per-coefficient zz-statistics, Score for “only fit the null” forward-selection contexts, and profile-LRT for the high-fidelity CIs that respect the likelihood’s curvature near the boundary. Topic 19 §19.6 forward-promised that profile-LRT is the GLM default; Rem 17 below discharges that promise.

Definition 8 Wald-z, Score, and profile-LRT confidence intervals

For a single GLM coefficient βj\beta_j at confidence level 1α1 - \alpha:

  • Wald-z CI: β^j±z1α/2SE^(β^j)\hat\beta_j \pm z_{1-\alpha/2}\widehat{\text{SE}}(\hat\beta_j), where SE^(β^j)=[ϕ^(XW^X)1]jj\widehat{\text{SE}}(\hat\beta_j) = \sqrt{[\hat\phi(\mathbf{X}^\top\hat{\mathbf{W}}\mathbf{X})^{-1}]_{jj}}. Symmetric about β^j\hat\beta_j; uses the asymptotic Normal distribution of β^j\hat\beta_j from Thm 3.
  • Score (Rao) CI: {βj0:Uj(β0)2/Ijj(β0)χ1,1α2}\{\beta_j^0 : U_j(\boldsymbol\beta_0^*)^2 / \mathcal{I}_{jj}(\boldsymbol\beta_0^*) \leq \chi^2_{1, 1-\alpha}\}, where β0\boldsymbol\beta_0^* is the constrained MLE under βj=βj0\beta_j = \beta_j^0. Computed entirely at the null; never requires the full-model fit.
  • Profile-likelihood CI: {βj0:D(βj0)D(β^)χ1,1α2}\{\beta_j^0 : D(\beta_j^0) - D(\hat{\boldsymbol\beta}) \leq \chi^2_{1, 1-\alpha}\}, where D(βj0)D(\beta_j^0) profiles out the nuisance coefficients (refits the model with βj\beta_j fixed at βj0\beta_j^0 and other coefficients optimized). Asymmetric in general; the gold standard for GLMs.

All three CIs invert the corresponding hypothesis test (Topic 19 Thm 1, the test–CI duality). All three converge to the same χ12\chi^2_1 asymptotic distribution by Topic 18 §18.7 — they disagree only by Op(n1/2)O_p(n^{-1/2}) in finite samples.

Theorem 7 Asymptotic equivalence of Wald, Score, and LRT for GLM coefficients

Under H0:βj=βj0H_0 : \beta_j = \beta_j^0 and GLM regularity, the three test statistics

Wj=(β^jβj0)2Var^(β^j),Sj=Uj(β0)2Ijj(β0),Lj=D(βj0)D(β^),W_j = \frac{(\hat\beta_j - \beta_j^0)^2}{\widehat{\text{Var}}(\hat\beta_j)}, \quad S_j = \frac{U_j(\boldsymbol\beta_0^*)^2}{\mathcal{I}_{jj}(\boldsymbol\beta_0^*)}, \quad L_j = D(\beta_j^0) - D(\hat{\boldsymbol\beta}),

each converge in distribution to χ12\chi^2_1 as nn \to \infty, and pairwise differ by Op(n1/2)O_p(n^{-1/2}). The three corresponding CIs (Wald-z, Score, profile-LRT) therefore agree to leading order asymptotically and differ only in finite-sample behavior — particularly near the parameter boundary, where Wald-z’s symmetric form fails and the others recover correct coverage.

The proof is a direct specialization of Topic 18 §18.7’s general statement to the GLM setting; we omit the restatement.

Example 10 Wald-z vs profile-LRT CI for a near-separation logistic

A logistic regression on n = 80 near-separation data (true \boldsymbol\beta_0 = (0, 4), a strong slope) produces \hat{\boldsymbol\beta} \approx (0.3, 4.1) with substantial coefficient uncertainty, because the log-likelihood is nearly flat in the large-\beta_1 direction when the classes are close to separable. The two CIs for \beta_1:

CI flavor | Lower | Upper | Width | Asymmetry
Wald-z (95%) | 1.4 | 6.9 | 5.5 | symmetric
Profile-LRT (95%) | 1.6 | 9.8 | 8.2 | upper half \approx 2.7\times lower half

The profile-LRT CI is wider and asymmetric, with most of the extra width pushed to the right (large positive β1\beta_1 values). This is the correct picture: near separation, the log-likelihood is much flatter on the “more positive” side of β^1\hat\beta_1 than on the “less positive” side, so a χ12\chi^2_1-cutoff slice of the likelihood profile gives an asymmetric region. The Wald-z CI’s symmetric form artificially compresses the upper bound, undercovering the truth on the high side. This is the “Wald boundary pathology” Topic 18 Rem 16 already named; the profile-LRT CI is the production-standard fix.

Example 11 Wald is link-dependent; LRT is not (logit vs probit)

Take a single binary dataset and fit it under both the logit link (canonical) and the probit link (non-canonical). Each link gives a different β^1\hat\beta_1 on its own scale (logit gives log-odds units; probit gives latent-Normal-z units), so the Wald-z pp-value for H0:β1=0H_0: \beta_1 = 0 depends on which scale you fit:

Link | \hat\beta_1 | Wald-z | p-value
logit | 0.38 | 2.16 | 0.031
probit | 0.21 | 1.99 | 0.047
LRT (deviance difference) | | | 0.028 (both links)

The Wald-z p-values disagree across links (0.031 vs 0.047), a worrying 50% relative difference for a finding that hovers near the 0.05 threshold. The LRT p-value is essentially link-insensitive: the intercept-only null fit has the same deviance under either link (its fitted mean is \bar y regardless of the link), and the maximized full-model likelihoods of the logit and probit fits are nearly identical on the same data, so D(\hat{\boldsymbol\beta}_0) - D(\hat{\boldsymbol\beta}_1) barely moves. Exact invariance holds under smooth reparameterizations of a single model (Topic 18 Thm 6), for example testing on the odds scale versus the log-odds scale, where Wald-z shifts but the LRT does not. This is exactly the "GLM software defaults to LRT for nested-model comparisons" rationale that Topic 18 Rem 17 named, and it is one of the strongest practical reasons to prefer LRT over Wald when the choice is available.
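A sketch of the comparison using statsmodels' Logit and Probit models on a single simulated dataset (invented data and seed, so the p-values will not match the table): the Wald p-value for \beta_1 shifts with the link, while the likelihood-ratio p-value barely moves.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(3)
n = 120
x = rng.normal(size=n)
X = sm.add_constant(x)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.1 + 0.4 * x))))

for Model in (sm.Logit, sm.Probit):
    full = Model(y, X).fit(disp=0)
    null = Model(y, X[:, :1]).fit(disp=0)          # intercept-only null model
    wald_p = full.pvalues[1]                       # Wald-z p for beta_1: link-scale dependent
    lrt = 2 * (full.llf - null.llf)                # deviance difference D0 - D1
    print(Model.__name__, round(wald_p, 4), round(stats.chi2.sf(lrt, df=1), 4))
```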

Remark 17 Profile-LRT is the GLM default — discharging Topic 19 §19.6

Topic 19 §19.6 forward-promised that profile-likelihood CIs are the workhorse for GLM coefficients in production software — R’s confint() on a glm object defaults to the profile-LRT, computed by refitting the model with βj\beta_j fixed at a grid of values and tracing the deviance drop until the χ1,0.952=3.84\chi^2_{1, 0.95} = 3.84 threshold is crossed. Three reasons:

  1. Reparameterization invariance (Topic 18 Thm 6, Topic 19 Rem 8): the LRT p-value is invariant under smooth reparameterization of the coefficients and, as Ex 11 shows, is essentially insensitive to the link choice in practice. Wald-z fails this invariance.
  2. Asymptotic efficiency via Wilks (Topic 18 Thm 4, Topic 19 §19.6): all three CIs (Wald, Score, LRT) are first-order asymptotically efficient, but LRT has better higher-order properties — Bartlett correction, refined coverage in moderate samples.
  3. Correct coverage at boundaries (Ex 10 above): near separation, near-zero variance components, and other boundary cases, profile-LRT respects the likelihood’s curvature; Wald-z does not.

The cost is computational: profile-LRT requires 30×2=60\sim 30 \times 2 = 60 constrained refits per coefficient (for bisection on each side). For a p=10p = 10 GLM on n=10,000n = 10{,}000 data, that is 600\sim 600 IRLS solves — not cheap but very tractable. The coefCIProfileGLM function in regression.ts implements the bisection; the DevianceTestExplorer right panel renders profile-LRT and Wald-z side-by-side for visual comparison.
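A minimal sketch of the profile-CI computation for a statsmodels GLM, using a plain grid search rather than the bisection described above (the helper name profile_ci is invented): fixing \beta_j at a value b amounts to moving x_j b into the offset and dropping that column, then comparing the deviance drop to the \chi^2_{1, 1-\alpha} cutoff.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

def profile_ci(y, X, family, j, alpha=0.05, halfwidth=4.0, n_grid=101):
    """Profile-likelihood CI for coefficient j of a GLM via constrained refits.
    Sketch for phi = 1 families (Bernoulli, Poisson); divide the deviance drop
    by the dispersion estimate for Gamma or Normal."""
    full = sm.GLM(y, X, family=family).fit()
    b_hat, se = full.params[j], full.bse[j]
    cutoff = stats.chi2.ppf(1 - alpha, df=1)
    X_other = np.delete(X, j, axis=1)

    kept = []
    for b in np.linspace(b_hat - halfwidth * se, b_hat + halfwidth * se, n_grid):
        # fix beta_j = b by absorbing x_j * b into the offset and refitting the rest
        constrained = sm.GLM(y, X_other, family=family, offset=X[:, j] * b).fit()
        if constrained.deviance - full.deviance <= cutoff:
            kept.append(b)
    return min(kept), max(kept)

# usage sketch (Bernoulli example): profile_ci(y, X, sm.families.Binomial(), j=1)
```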

22.9 Misspecification and the Sandwich Estimator

Topic 21 §21.9 Rem 19 forward-promised a treatment of HC-robust standard errors — the White (1980) sandwich variance estimator that gives consistent SEs under arbitrary heteroscedasticity, even when the homoscedasticity assumption of OLS is wrong. §22.9 discharges that promise, and generalizes it: the sandwich estimator works for any GLM (not just Normal+identity-link OLS) under any variance misspecification, as long as the mean specification μi=g1(xiβ)\mu_i = g^{-1}(\mathbf{x}_i^\top\boldsymbol\beta) is correct. The estimator is consistent because Wedderburn (1974) showed quasi-likelihood — fitting only on the mean–variance relationship without specifying the full distribution — gives a consistent point estimate regardless of higher-moment misspecification. Huber (1967) and White (1980) independently derived the sandwich form A1BA1\mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1} as the variance of this quasi-MLE. The three threads converged into modern QMLE / sandwich-SE practice.

Definition 9 Quasi-likelihood score and the QMLE

Given a mean specification \mu_i = g^{-1}(\mathbf{x}_i^\top\boldsymbol\beta) and a variance function V(\mu), without committing to a full distributional family, the quasi-likelihood score under the canonical link is

UQ(β)=X(yμ(β)).\mathbf{U}_Q(\boldsymbol\beta) = \mathbf{X}^\top(\mathbf{y} - \boldsymbol\mu(\boldsymbol\beta)).

The quasi-maximum-likelihood estimator (QMLE) \hat{\boldsymbol\beta}_{\text{QMLE}} solves \mathbf{U}_Q(\hat{\boldsymbol\beta}_{\text{QMLE}}) = \mathbf{0}. When the canonical-link assumption holds, the quasi-score equals the true GLM score (Thm 1), so \hat{\boldsymbol\beta}_{\text{QMLE}} = \hat{\boldsymbol\beta}_{\text{MLE}}: the same point estimates, computed by the same IRLS algorithm. The difference is in the variance: the MLE assumes the full distribution is correct (giving \widehat{\text{Var}}(\hat{\boldsymbol\beta}) = \hat\phi(\mathbf{X}^\top\hat{\mathbf{W}}\mathbf{X})^{-1}), while the QMLE acknowledges the variance specification might be wrong and uses the sandwich estimator below.

Theorem 8 QMLE consistency and the sandwich variance

Suppose the mean specification is correct: E[Yixi]=g1(xiβ0)E[Y_i \mid \mathbf{x}_i] = g^{-1}(\mathbf{x}_i^\top\boldsymbol\beta_0) for some true β0\boldsymbol\beta_0. The variance specification Var(Yixi)\text{Var}(Y_i \mid \mathbf{x}_i) may be arbitrary (overdispersion, cluster-correlation, full distributional misspecification). Then under standard regularity conditions:

  1. Consistency. β^QMLEpβ0\hat{\boldsymbol\beta}_{\text{QMLE}} \xrightarrow{p} \boldsymbol\beta_0 as nn \to \infty.
  2. Asymptotic normality with sandwich variance. n(β^QMLEβ0)dNp+1(0,A1BA1)\sqrt{n}(\hat{\boldsymbol\beta}_{\text{QMLE}} - \boldsymbol\beta_0) \xrightarrow{d} \mathcal{N}_{p+1}(\mathbf{0}, \mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1}), where
A=E ⁣[XW(β0)X/n],B=E ⁣[X(yμ)(yμ)X/n].\mathbf{A} = E\!\left[\mathbf{X}^\top \mathbf{W}(\boldsymbol\beta_0) \mathbf{X}/n\right], \qquad \mathbf{B} = E\!\left[\mathbf{X}^\top (\mathbf{y} - \boldsymbol\mu)(\mathbf{y} - \boldsymbol\mu)^\top \mathbf{X}/n\right].

The sandwich form A1BA1\mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1} replaces the model-based A1\mathbf{A}^{-1} (which assumes B=A\mathbf{B} = \mathbf{A} — the “information identity” that holds under correct full-distribution specification). Empirical estimators of B\mathbf{B} — HC0 through HC3 — give finite-sample sandwich SEs without requiring the full variance structure.

We state Thm 8 without full proof — the OpO_p manipulation requires Topic 14’s MLE-asymptotics machinery applied to the misspecified-likelihood setting (Huber 1967), then a uniform law of large numbers to turn population A,B\mathbf{A}, \mathbf{B} into their empirical counterparts (White 1980). The full derivation is HUB1967 §3 + WHI1980 §3 (~20 lines of OpO_p algebra). The conceptual content of Thm 8 is what matters: point estimates depend only on mean specification; SE estimation is what varies. Get the mean right, use the sandwich, and your inference is robust to whatever variance distortion the true DGP imposes.

The four empirical sandwich estimators (HC0, HC1, HC2, HC3) differ only in finite-sample bias adjustments to the B\mathbf{B}-term:

  • HC0 (White 1980): B^HC0=(1/n)ir^i2xixi\hat{\mathbf{B}}_{HC0} = (1/n)\sum_i \hat r_i^2 \mathbf{x}_i \mathbf{x}_i^\top where r^i=yiμ^i\hat r_i = y_i - \hat\mu_i — the raw plug-in estimate.
  • HC1 (MacKinnon-White 1985): scale HC0 by n/(np1)n/(n - p - 1) to account for finite-sample shrinkage. Standard vcovHC default in R.
  • HC2: divide r^i2\hat r_i^2 by (1hii)(1 - h_{ii}) where hiih_{ii} is the leverage from §22.3’s IRLS hat matrix.
  • HC3: divide r^i2\hat r_i^2 by (1hii)2(1 - h_{ii})^2 — the Davidson–MacKinnon “preferred” finite-sample estimator, recommended by default for n<250n < 250. Numerical stability requires capping hiih_{ii} at 0.9999 to avoid blow-up at high-leverage observations (the regression.ts implementation enforces this with a console warning when triggered).

In production, HC1 and HC3 are the most-used variants; HC0 is the asymptotic-limit version used in derivations.
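A sketch of the sandwich computation for a canonical-link GLM fitted with statsmodels, assuming independent observations (the helper name sandwich_se is invented): the bread is \mathbf{X}^\top\mathbf{W}\mathbf{X}, and the meat applies the HC0 through HC3 residual adjustments listed above.

```python
import numpy as np

def sandwich_se(res, X, kind="HC3"):
    """Robust (sandwich) SEs for a canonical-link statsmodels GLM fit.
    'res' is a fitted GLMResults object, X its design matrix (n x (p+1))."""
    y = res.model.endog
    mu = res.mu
    r = y - mu                                     # raw residuals y - mu_hat
    w = res.model.family.variance(mu)              # canonical link: IRLS weights W_ii = V(mu_i)
    A = X.T @ (w[:, None] * X)                     # bread: X' W X
    A_inv = np.linalg.inv(A)

    # leverages of the weighted fit, h_ii = [W^{1/2} X A^{-1} X' W^{1/2}]_ii
    Xw = np.sqrt(w)[:, None] * X
    h = np.einsum("ij,jk,ik->i", Xw, A_inv, Xw)
    h = np.minimum(h, 0.9999)                      # cap leverages, as noted for HC3 above

    n, p1 = X.shape
    adj = {"HC0": np.ones(n),
           "HC1": np.full(n, n / (n - p1)),
           "HC2": 1.0 / (1.0 - h),
           "HC3": 1.0 / (1.0 - h) ** 2}[kind]
    B = X.T @ ((adj * r**2)[:, None] * X)          # meat: sum_i adj_i r_i^2 x_i x_i'
    return np.sqrt(np.diag(A_inv @ B @ A_inv))     # sandwich A^{-1} B A^{-1}
```

As a usage sketch, comparing sandwich_se(fit_full, X_full, "HC3") with fit_full.bse for the Example 9 Poisson fit shows how far the naive SEs and the robust SEs separate; under a correctly specified variance they should be close.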

Example 12 Sandwich coverage on overdispersed Poisson (n_sim = 200)

Generate n=200n = 200 overdispersed-Poisson responses with true β0=(0.3,0.9)\boldsymbol\beta_0 = (-0.3, 0.9) and dispersion ratio r=Var/E=1.8r = \text{Var}/E = 1.8 (so the Poisson model is correct on the mean but wrong on the variance by a factor of 1.81.8). Repeat 200200 times, refit each replication with glmFit(X, y, FAMILIES.poisson), and compute both the naive Wald CI and the HC3 sandwich CI for β1\beta_1 at the 95%95\% level:

CI | Empirical coverage | Target | Gap
Naive Wald-z | 0.79 | 0.95 | -16 pp
HC3 sandwich | 0.915 | 0.95 | -3.5 pp

Naive Wald undercovers severely (79% instead of 95%) because its SE is too small by roughly \sqrt{1.8} \approx 1.34. HC3 recovers nearly nominal coverage with no change to the point estimates. The §22.9 figure renders this directly: each simulation contributes two horizontal CI bars (one naive, one HC3), color-coded green if it covers \beta_1^{\text{true}} or red if it misses. The naive bars miss roughly one in five times; the HC3 bars miss roughly one in twelve, much closer to the nominal one in twenty.
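A compact sketch of the coverage experiment (invented seed; overdispersion is generated by a gamma-mixed Poisson tuned to Var/E = 1.8; HC0 via statsmodels' cov_type argument stands in for the HC3 variant of the example, assuming that argument is available for GLM in your version, otherwise the manual sandwich sketch above applies):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n, n_sim = 200, 200
beta = np.array([-0.3, 0.9])
cover_naive = cover_robust = 0

for _ in range(n_sim):
    x = rng.normal(size=n)
    X = sm.add_constant(x)
    mu = np.exp(X @ beta)
    lam = rng.gamma(shape=mu / 0.8, scale=0.8)       # Var[y] = mu (1 + 0.8) = 1.8 mu
    y = rng.poisson(lam)

    naive = sm.GLM(y, X, family=sm.families.Poisson()).fit()
    robust = sm.GLM(y, X, family=sm.families.Poisson()).fit(cov_type="HC0")

    lo, hi = naive.conf_int()[1]                     # naive Wald 95% CI for beta_1
    cover_naive += lo <= beta[1] <= hi
    lo, hi = robust.conf_int()[1]                    # sandwich CI
    cover_robust += lo <= beta[1] <= hi

print(cover_naive / n_sim, cover_robust / n_sim)     # naive well below 0.95, sandwich near it
```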

Example 13 Gamma regression with misspecified dispersion

For a Gamma GLM, the variance function is V(μ)=μ2V(\mu) = \mu^2 and dispersion ϕ\phi is estimated. Suppose the true DGP has Var(Yi)=4μi2\text{Var}(Y_i) = 4 \mu_i^2 (so the dispersion is twice the value the model is implicitly assuming via ϕ^\hat\phi). Naive Wald CIs again undercover; HC3 sandwich CIs recover. The pattern is the same as Poisson: point estimates (β^\hat{\boldsymbol\beta} from IRLS) are unaffected by the dispersion misspecification because the score equation X(yμ)=0\mathbf{X}^\top(\mathbf{y} - \boldsymbol\mu) = \mathbf{0} does not involve ϕ\phi; only the SE estimation needs adjustment. The SandwichCoverageSimulator exposes this with a “misspecifiedGamma” preset.

2×2 grid figure (Figure 8). Top row: Poisson fit, (a) Pearson residuals r_i^P against fitted μ̂_i, (b) deviance residuals r_i^D against the same fitted values; the Pearson residuals fan out heteroscedastically while the deviance residuals are symmetric and unstructured. Bottom row: the same comparison for a Gamma fit. Title: 'Which residual do you trust? Deviance residuals are more symmetric under correct specification.'

Two-panel figure. Left: 100 simulated overdispersed-Poisson datasets (Var/E = 1.8); 95% naive Wald CIs for β₁ drawn as horizontal bars stacked vertically, color-coded green (covers β₁^true) or red (misses); roughly 21 of 100 bars are red, empirical coverage 0.79 annotated. Right: the same 100 simulations with 95% HC3 sandwich CIs; only ~9 of 100 are red, empirical coverage 0.91 annotated. Title: 'Sandwich estimator restores nominal coverage under variance misspecification.'
Remark 18 Pearson-vs-deviance residual diagnostic — what the discrepancy flags

A useful diagnostic: plot Pearson and deviance residuals on the same axes for a fitted GLM. If both look unstructured, the model is well-specified. If Pearson residuals fan out but deviance residuals are clean, you have variance misspecification with correct mean — exactly the scenario the sandwich estimator handles. If deviance residuals show systematic structure (a U-shape, monotone trend, or fan), you have mean misspecification (wrong link, missing predictor, wrong functional form) and the sandwich does NOT save you — you need to fix the model. Figure 8 above shows the variance-misspecification flavor; the structural-misspecification version is what RegressionDiagnosticsExplorer on Topic 21 illustrates for the OLS case (and the same pattern carries over to GLMs).

Remark 19 Robust GLMs — outlier influence beyond sandwich SEs

Sandwich SEs handle variance misspecification but do nothing about influential outliers — single observations whose leverage and residual combine to dominate β^\hat{\boldsymbol\beta}. The Huber-ified Poisson and bounded-influence logistic estimators (Cantoni & Ronchetti 2001) downweight high-Pearson-residual observations using M-estimator machinery from Topic 15 §15.10–§15.11. Topic 23’s penalized GLMs (ridge, lasso) provide an alternative outlier-control mechanism via shrinkage of the coefficients themselves. Both are deferred from this topic; Thm 8’s sandwich framework is the variance-misspecification fix, while these are the influence-control fixes.

Remark 20 Negative Binomial regression — the full-distribution overdispersion fix

The sandwich approach is the “model-agnostic” fix for Poisson overdispersion: keep the Poisson model, fix the SEs. The “model-correct” alternative is Negative Binomial regression — replace Poisson with YiNegBin(μi,r)Y_i \sim \text{NegBin}(\mu_i, r) where rr is a separate dispersion parameter. The full distribution allows likelihood-based inference (deviance, profile-LRT CIs, etc.) on top of the corrected variance. Standard tooling: MASS::glm.nb in R, statsmodels.discrete.NegativeBinomial in Python. formalML: Generalized linear models covers NegBin in the count-regression chapter.

Remark 21 Wedderburn 1974 — quasi-likelihood predates sandwich SEs

The quasi-likelihood framework is older than the sandwich SE: Wedderburn (1974) introduced the idea that “specifying the mean–variance relationship is enough for consistent estimation” two years after Nelder–Wedderburn (1972) introduced GLMs. Huber (1967) had derived the misspecified-MLE asymptotic-variance form independently in the M-estimation context, and White (1980) made the empirical sandwich estimator standard in econometrics. Three threads, three authors, one consensus by the late 1980s: under correct mean specification, the QMLE / IRLS estimate is consistent, and the sandwich variance fixes the SE. McCullagh–Nelder (1989) Ch. 9 is the canonical book-length development; modern statsmodels and R-sandwich implementations follow that template directly.

The SandwichCoverageSimulator lets you watch the coverage gap close in real time. Pick a family (overdispersed Poisson / misspecified Gamma / clustered Bernoulli), set the misspecification magnitude with the slider, and click “Run 500” — each simulation contributes a CI bar to the top panel and updates the running coverage in the bottom panel. The naive bars consistently undercover; the HC3 bars consistently approach the nominal 1α1 - \alpha line.

[Interactive SandwichCoverageSimulator, initial state (0 simulations run). Top panel: stacked 95% CI bars for β₁ with β_true = 0.90 marked; upper bar naive, lower bar HC3; green = covers β_true, red = misses. Bottom panel: running naive and HC3 coverage against the 0.95 target, updated as simulations accumulate.]
Naive Wald undercovers under variance misspecification; HC3 sandwich recovers nominal coverage (Topic 22 §22.9 Rem 19).

22.10 Forward Map: Where GLMs Lead

Topic 22 establishes the core GLM framework: link function, exponential-family response, IRLS fitting, deviance inference, sandwich-SE robustness. The framework opens onto five major extensions, each of which is a separate topic or even a separate track. §22.10 is the catalog — one paragraph per extension.

Remark 22 Topic 23 — Penalized GLMs (Ridge, Lasso, Elastic Net)

The next Track 6 topic — Topic 23: Regularization & Penalized Estimation — generalizes IRLS to penalized log-likelihoods: (β)λP(β)\ell(\boldsymbol\beta) - \lambda P(\boldsymbol\beta) where PP is a penalty (squared-2\ell^2 for ridge, 1\ell^1 for lasso, convex combination for elastic net). The L² penalty stabilizes IRLS near separation (Bernoulli) or in high-collinearity designs; the L¹ penalty induces sparsity in β^\hat{\boldsymbol\beta} and is the foundation of variable-selection in modern high-dimensional GLM applications. The reference implementation is glmnet (Friedman, Hastie & Tibshirani 2010) — a coordinate-descent algorithm that scales to pp in the millions and is the production workhorse for penalized GLMs in R, Python (sklearn.linear_model.LogisticRegression(penalty=...)), and Julia.

Remark 23 Track 7 — Bayesian GLMs and GLMMs

Adding a prior π(β)\pi(\boldsymbol\beta) on the coefficients and computing the posterior p(βy)π(β)if(yiμi,ϕ)p(\boldsymbol\beta \mid \mathbf{y}) \propto \pi(\boldsymbol\beta) \prod_i f(y_i \mid \mu_i, \phi) converts the GLM into a Bayesian GLM. MAP estimation reduces to penalized MLE (Gaussian prior = ridge, Laplace prior = lasso); full posterior inference uses MCMC (Stan, PyMC, brms; NUTS is the default — Topic 26 §26.5) or variational methods. Generalized linear mixed models (GLMMs) add cluster-level or individual-level random effects bN(0,Σ)\mathbf{b} \sim \mathcal{N}(\mathbf{0}, \boldsymbol\Sigma) to the linear predictor for hierarchical data; the marginal likelihood requires integrating out the random effects (Laplace approximation, AGHQ, MCMC). Track 7 is the Bayesian-foundations track; mixed-effects and hierarchical GLMs live there + at formalML: Mixed-effects models .

Remark 24 Track 8 — Generalized additive models (GAMs)

Replace the linear predictor xβ\mathbf{x}^\top\boldsymbol\beta with a sum of smooth nonparametric functions jfj(xij)\sum_j f_j(x_{ij}) — each fjf_j is fit by penalized basis-function expansion (cubic regression splines, thin-plate splines, P-splines). The result is a generalized additive model (Hastie & Tibshirani 1990): more flexible than a GLM (captures nonlinearities without prespecifying functional form), more interpretable than a black-box neural network (each fjf_j is a 1D function you can plot). Standard tooling: mgcv::gam in R (Wood 2017), pyGAM in Python. Track 8 will cover the kernel-regression / nonparametric-regression machinery this builds on.

Remark 25 formalml.com — Mixed-effects and hierarchical models at scale

GLMMs are the right tool for clustered data — patients within hospitals, students within classrooms, repeat measurements within subjects. Random effects deliver partial pooling (estimates for sparse clusters borrow strength from other clusters via the random-effects prior), automatic regularization, and individual-level inference. The ML-scale treatment at formalML: Mixed-effects models covers lme4-style frequentist GLMMs, Stan-style Bayesian GLMMs, and the recommendation-system applications where random-effects structure (user / item / context) is the dominant modeling choice.

Remark 26 formalml.com — Information geometry and natural gradient

The IRLS step is the natural gradient in exponential-family information geometry (Amari 1985). For a GLM, the Fisher information matrix I(β)=XWX\mathcal{I}(\boldsymbol\beta) = \mathbf{X}^\top\mathbf{W}\mathbf{X} is a Riemannian metric on parameter space; the natural gradient is the ordinary gradient transported through the inverse metric. Natural-gradient descent in continuous time is IRLS, is Fisher scoring, is Newton’s method on the canonical-link GLM. The deep-learning community rediscovered natural-gradient methods (K-FAC, Shampoo, neural tangent kernels) as approximations to this geometry on neural-network parameter spaces. formalML: Information geometry develops the framework.

Remark 27 formalml.com — Multinomial, ordinal, zero-inflated, hurdle, Tweedie

Several practical response types extend the four-family GLM core:

  • Multinomial logit — categorical response with K>2K > 2 levels; one logit per level relative to a reference. Standard for image classification, document classification, recommendation choice models.
  • Ordinal regression (proportional odds, cumulative logit) — ordered categorical response (rating scales, survey responses).
  • Zero-inflated and hurdle models — count response with excess zeros (insurance no-claim policies, social media accounts with zero posts).
  • Tweedie family — continuous-positive response with point mass at zero (insurance pure-premium modeling, rainfall amounts).

The catalog reference is Agresti (2015) Ch. 6–7. formalML: Generalized linear models covers all of these in the practical-modeling chapter.

Remark 28 formalml.com — High-dimensional p ≫ n GLMs

When p>np > n, XWX\mathbf{X}^\top\mathbf{W}\mathbf{X} is rank-deficient and ordinary IRLS fails — the GLM analog of OLS failing when p>np > n. The Track-6 fix is Topic 23’s penalized GLMs (lasso for sparse β^\hat{\boldsymbol\beta}); the asymptotic theory (consistency rates, minimax bounds, double descent for classification) lives at formalML: High-dimensional regression . The applied workflow — glmnet with cross-validation for λ\lambda selection — is in Topic 23; the theoretical underpinnings are external.

Forward-map diagram. Central horizontal spine of Topics 21 → 22 → 23 → 24 (Linear Regression / GLMs / Penalized / Selection) in Track 6 blue. Up-arrows from Topic 22 to Track 7 boxes (Bayesian Foundations, GLMMs) and Track 8 boxes (GAMs, Nonparametric) in amber. Side-arrows to formalml.com boxes (Mixed-effects, Causal inference, High-dim regression, Information geometry, Multinomial/Tweedie) in purple. Back-arrows from Topic 22 to Topics 7 (Exponential families), 14 (MLE), 18 (LRT), 19 (CIs) in grey. Color legend in the corner.

Topic 22 closes Track 6’s first half. The framework is now in place: linear predictor + link function + exponential-family response + IRLS + deviance inference + sandwich SEs. Linear regression is a GLM; the normal equations are the GLM score; OLS-as-orthogonal-projection is GLM IRLS with W=I\mathbf{W} = \mathbf{I}. Every claim Topic 21 made about Normal-linear-model estimation is the special case of Topic 22’s framework where the family is Normal, the link is identity, and the dispersion is σ2\sigma^2. Going forward, Topic 23 (penalization), Topic 24 (model selection), Track 7 (Bayesian extensions), and Track 8 (nonparametric generalizations) all build on this foundation. The forward map above lays out the routes.


References

  1. Nelder, J. A. & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society, Series A (General), 135(3), 370–384.
  2. McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Chapman & Hall/CRC.
  3. Agresti, A. (2015). Foundations of Linear and Generalized Linear Models. Wiley.
  4. Dobson, A. J. & Barnett, A. G. (2008). An Introduction to Generalized Linear Models (3rd ed.). Chapman & Hall/CRC.
  5. Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika, 61(3), 439–447.
  6. Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, 221–233. University of California Press.
  7. White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4), 817–838.
  8. Hosmer, D. W., Lemeshow, S. & Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.). Wiley.
  9. Fahrmeir, L., Kneib, T., Lang, S. & Marx, B. (2013). Regression: Models, Methods and Applications. Springer.
  10. Lehmann, E. L. & Romano, J. P. (2005). Testing Statistical Hypotheses (3rd ed.). Springer.
  11. Casella, G. & Berger, R. L. (2002). Statistical Inference (2nd ed.). Duxbury.
  12. Greene, W. H. (2012). Econometric Analysis (7th ed.). Pearson.