Generalized Linear Models
The exponential-family bridge — canonical-link score identity, IRLS = Fisher scoring = Newton, deviance as the SSE analog, and the sandwich estimator that closes the §21.9 forward-promise.
22.1 When Normal Errors Fail
Topic 21 built the Normal linear model as orthogonal projection: $\hat\beta = (X^\top X)^{-1}X^\top y$, exact $t$- and $F$-distribution inference, and the Gauss–Markov BLUE property under nothing more than zero-mean homoscedastic errors. The framework rests on two pillars: a Normal response and constant-variance errors. Drop either and the machinery breaks.

The first pillar fails any time the response is bounded, discrete, or skewed. Binary outcomes — credit defaults, click-throughs, classification labels — live in $\{0,1\}$, but an OLS fit will happily produce fitted values outside $[0,1]$, which are not probabilities. Count outcomes — insurance claim frequencies, page-view counts, photons per pixel — are non-negative integers whose variance grows with the mean (a Poisson with mean $\lambda$ has variance $\lambda$), violating homoscedasticity by construction. Positive-skewed continuous outcomes — claim amounts, waiting times, gene-expression intensities — have right tails that no Normal residual can match.

The fix is not to abandon the linear-predictor structure that made Topic 21 work — it is to compose that structure with a link function that maps the response's natural scale (e.g. probability, rate, mean) into $\mathbb{R}$, and to swap the Normal response for whatever exponential-family distribution the data actually lives in. That single modification — link function plus exponential-family response — is the generalized linear model.

Generate $n = 80$ predictor values $x_i$ and binary responses $y_i \sim \mathrm{Bernoulli}\big(\sigma(\beta_0 + \beta_1 x_i)\big)$, where $\sigma$ is the logistic sigmoid. Fit OLS: $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$. The fitted line crosses zero and exceeds one inside the observed predictor range. Roughly 12 of the 80 fitted values are negative; roughly 14 exceed unity. The model is internally inconsistent: it predicts probabilities outside $[0,1]$, and any decision rule built on $\hat y$ either has to clip those values or admit the model is wrong about its own outputs.

The same data fit by logistic regression with the logit link produces $\hat\eta_i = \hat\beta_0 + \hat\beta_1 x_i$ on the latent scale, with $\hat p_i = \sigma(\hat\eta_i) \in (0,1)$ for every $i$. The probabilities are bounded by construction; coefficient interpretation shifts from "additive on $\mathbb{E}[y]$" to "additive on the log-odds $\log\frac{p}{1-p}$."
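A minimal sketch of this comparison in Python, assuming $n = 80$, a uniform predictor, and illustrative true coefficients (the notebook's exact values are not given in the text): fit both the linear probability model and the logistic GLM, then count the out-of-range OLS fitted values.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 80
x = rng.uniform(-3, 3, size=n)
eta = -1.0 + 2.0 * x                       # assumed true (beta0, beta1), illustration only
p = 1.0 / (1.0 + np.exp(-eta))             # logistic sigmoid
y = rng.binomial(1, p)
X = sm.add_constant(x)                     # [1, x] design matrix

ols = sm.OLS(y, X).fit()                   # linear probability model
print("OLS fitted < 0:", int((ols.fittedvalues < 0).sum()),
      "| fitted > 1:", int((ols.fittedvalues > 1).sum()))

glm = sm.GLM(y, X, family=sm.families.Binomial()).fit()   # logistic GLM
print("GLM fitted range:", glm.fittedvalues.min(), glm.fittedvalues.max())  # inside (0, 1)
```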
The three failures above are not isolated — they are the three structural assumptions of the Normal linear model failing one at a time:
- Bounded support. Binary, multinomial, and ordinal outcomes have support that does not match $\mathbb{R}$. The link function maps the bounded mean (e.g. probability) back to $\mathbb{R}$.
- Variance–mean dependence. Counts have $\operatorname{Var}(y) = \mu$ (Poisson); proportions have $\operatorname{Var}(y) \propto \mu(1-\mu)$ (binomial); positive-skewed amounts have $\operatorname{Var}(y) \propto \mu^2$ (Gamma). Homoscedasticity is the exception, not the rule. The variance function $V(\mu)$ encodes this dependence.
- Non-Normal tails. Skewed positives, heavy-tailed counts, zero-inflated mixtures — none have Normal sampling distributions even at moderate $n$. The exponential-family response replaces "Normal errors" with "the right family for this support."
GLMs address all three at once by composing a linear predictor with a link and pairing it with an exponential-family response. The linear-predictor structure is what survives; everything else generalizes.
Before 1972, statisticians had logistic regression (Bliss 1935, Berkson 1944), probit regression (Bliss 1934), Poisson regression (Cochran 1940, Birch 1963), log-linear models for contingency tables (Goodman 1968), and Gamma regression for survival times (Cox 1972) — all developed independently with bespoke fitting algorithms. Nelder & Wedderburn (1972) observed that all of these are the same model: an exponential-family response with mean $\mu_i$, a link function $g$ mapping $\mu_i$ to a linear predictor $\eta_i = x_i^\top\beta$, and a single fitting algorithm — iteratively reweighted least squares (IRLS) — that handles every family. The unification was both conceptual (one framework, not five) and practical (one algorithm to implement, not five). McCullagh & Nelder (1989) is the canonical book-length development; Agresti (2015) is the current graduate standard.
Topic 22 reproduces the unification mathematically. §22.2 lays out the framework; §22.3 proves the canonical-link score identity that drives the unified algorithm; §22.4 / §22.5 / §22.6 fit logistic / Poisson / Gamma regression as worked examples; §22.7 introduces the deviance as the universal goodness-of-fit statistic; §22.8 / §22.9 handle inference and misspecification.
22.2 The GLM Framework
A generalized linear model has three ingredients: a linear predictor $\eta_i = x_i^\top\beta$, a link function $g$ mapping the conditional mean $\mu_i = \mathbb{E}[y_i \mid x_i]$ to $\eta_i$, and a response distribution from an exponential family with mean $\mu_i$ and dispersion $\phi$. We develop each ingredient in turn.

Notation (extending §21.4). Throughout Topic 22, $\mu_i = \mathbb{E}[y_i \mid x_i]$, $V(\mu)$ is the exponential-family variance function, and $\phi$ is the dispersion parameter (known for Bernoulli and Poisson; estimated for Gamma and Normal). The linear predictor is $\eta_i = x_i^\top\beta$; the link function satisfies $g(\mu_i) = \eta_i$, with inverse link $\mu_i = g^{-1}(\eta_i)$. The weight matrix $W$ is diagonal with $W_{ii} = 1/\big(\phi\,V(\mu_i)\,g'(\mu_i)^2\big)$; under the canonical link this simplifies to $W_{ii} = V(\mu_i)/\phi$. The adjusted response at iteration $t$ is $z_i^{(t)} = \eta_i^{(t)} + g'(\mu_i^{(t)})\,(y_i - \mu_i^{(t)})$. We write $y, \mu, \eta, z$ for the stacked $n$-vectors. The score vector is $U(\beta) = \nabla_\beta \ell(\beta)$; the observed information is $J(\beta) = -\nabla^2_\beta \ell(\beta)$; the Fisher (expected) information is $\mathcal{I}(\beta) = \mathbb{E}[J(\beta)]$. Under the canonical link, $J(\beta) = \mathcal{I}(\beta) = X^\top W X$. All matrix conventions from §21.4 (column vectors, $X$ of shape $n \times p$, $\top$ for transpose) carry over verbatim.
Given a design matrix $X \in \mathbb{R}^{n \times p}$ (with a leading column of ones for the intercept) and a coefficient vector $\beta \in \mathbb{R}^p$, the linear predictor for observation $i$ is $\eta_i = x_i^\top\beta$. In stacked form, $\eta = X\beta$. The linear predictor lives on the real line — no constraints — even when the response has bounded or discrete support.
A link function is a smooth, strictly monotone map from the response's mean space (e.g. $(0,1)$ for Bernoulli, $(0,\infty)$ for Poisson and Gamma, $\mathbb{R}$ for Normal) to the linear-predictor scale $\mathbb{R}$. It satisfies $g(\mu_i) = \eta_i = x_i^\top\beta$.

Strict monotonicity guarantees the inverse $g^{-1}$ is well-defined, so the model uniquely specifies $\mu_i$ given $x_i$ and $\beta$. Smoothness is needed for the IRLS derivation in §22.3.
The response follows an exponential family with density (or mass function)

$$f(y_i;\,\theta_i,\phi) = \exp\!\left(\frac{y_i\theta_i - b(\theta_i)}{\phi} + c(y_i,\phi)\right),$$

where $\theta_i$ is the natural parameter, $b(\cdot)$ is the cumulant function, $\phi$ is the dispersion, and $c(y_i,\phi)$ is a base-measure factor that does not depend on $\theta_i$. By Topic 7 Thm 3 and Cor 2, $\mathbb{E}[y_i] = b'(\theta_i) = \mu_i$ and $\operatorname{Var}(y_i) = \phi\,b''(\theta_i) = \phi\,V(\mu_i)$, where $V$ is the variance function.
A link function is canonical when $g = (b')^{-1}$, i.e. when the natural parameter coincides with the linear predictor: $\theta_i = \eta_i = x_i^\top\beta$. Under the canonical link the score equation collapses to the design-times-residual identity that makes §22.3 Thm 1 the featured theorem of the topic.
Take $y_i \sim \mathcal{N}(\mu_i, \sigma^2)$ with the identity link $g(\mu) = \mu$. The exponential-family form has $\theta_i = \mu_i$ (so the identity link is canonical), $b(\theta) = \theta^2/2$, $c(y,\phi) = -\tfrac{y^2}{2\phi} - \tfrac12\log(2\pi\phi)$, and dispersion $\phi = \sigma^2$. The cumulant identities give $b'(\theta) = \theta = \mu$ (mean) and $b''(\theta) = 1$, so $\operatorname{Var}(y_i) = \phi \cdot 1 = \sigma^2$ — constant variance, recovering homoscedasticity.

The log-likelihood is

$$\ell(\beta) = -\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - x_i^\top\beta)^2 + \text{const},$$

so maximizing over $\beta$ is equivalent to minimizing $\sum_i (y_i - x_i^\top\beta)^2$ — the OLS objective from §21.2. The score equation is $X^\top(y - X\beta) = 0$, which is §21.4 Thm 3 (normal equations) verbatim. Topic 21's entire estimation theory is the Normal+identity-link special case of the GLM framework. We will see this reduction again, more abstractly, in §22.3 Proof 1 Step 4.
For the four GLM families covered in this topic, the canonical link is determined by inverting $b'$:

| Family | Canonical link $g(\mu)$ | Inverse link $\mu = g^{-1}(\eta)$ | $V(\mu)$ | $\phi$ |
|---|---|---|---|---|
| Bernoulli | logit: $\log\frac{\mu}{1-\mu}$ | $1/(1+e^{-\eta})$ | $\mu(1-\mu)$ | $1$ |
| Poisson | log: $\log\mu$ | $e^{\eta}$ | $\mu$ | $1$ |
| Gamma (shape $\alpha$, mean $\mu$) | inverse: $1/\mu$ | $1/\eta$ | $\mu^2$ | $1/\alpha$ |
| Normal | identity: $\mu$ | $\eta$ | $1$ | $\sigma^2$ |
The Bernoulli logit and Poisson log are the two canonical links you will see most often in practice; the Gamma inverse link is unusual and §22.6 Rem 14 discusses why the log link is often preferred there despite being non-canonical.
The canonical link is mathematically privileged — it is the unique link under which the score equation is exactly $X^\top(y - \mu) = 0$ (§22.3 Thm 1) and observed Fisher information equals expected Fisher information (§22.3 Thm 2 Step 2). Non-canonical links — probit and cloglog for binary data, identity link for Poisson, log link for Gamma — break both properties: the score acquires an extra Jacobian factor and the Hessian's expected and observed forms diverge.

So why use them? Three reasons. Latent-variable interpretation: the probit link corresponds to a latent Normal variable whose threshold determines the binary outcome — natural for psychological-test response models. Hazard interpretation: the cloglog link corresponds to proportional hazards in discrete time — natural for survival analysis. Coefficient interpretability: the log link on Gamma gives $\beta_j$ as a log-multiplicative effect on the mean (a one-unit change in $x_j$ multiplies $\mu$ by $e^{\beta_j}$), which is more useful in practice than the inverse-link's "additive on $1/\mu$" interpretation.
In the GLM framework, non-canonical links work — IRLS still converges, asymptotic normality still holds — they just do not enjoy the elegant identities that make the canonical case the featured theorem of §22.3.
The dispersion scales the variance: $\operatorname{Var}(y_i) = \phi\,V(\mu_i)$. For Bernoulli and Poisson, $\phi$ is fixed at 1 by the structure of the family — there is no separate variance parameter to estimate, because mean determines variance. For Gamma and Normal, $\phi$ is a free parameter that must be estimated alongside $\beta$. The estimator is the Pearson statistic:

$$\hat\phi = \frac{1}{n-p}\sum_{i=1}^n \frac{(y_i - \hat\mu_i)^2}{V(\hat\mu_i)}.$$

When $\phi$ is estimated rather than known, the deviance test (§22.7) uses an $F$ reference instead of $\chi^2$, exactly as in Topic 21 §21.8 where $\hat\sigma^2$ replaced $\sigma^2$ in the $F$-statistic. The §22.9 sandwich estimator generalizes this further: when the variance specification is wrong (overdispersed Poisson, misspecified Gamma scale, clustered Bernoulli), the QMLE is still consistent for $\beta$ as long as the mean specification is correct; only the variance estimate needs adjustment.
22.3 Featured: The Canonical-Link Score Identity and IRLS
Topic 22's featured result is the canonical-link score identity: under a canonical link, the score equation is

$$U(\beta) = \frac{1}{\phi}\,X^\top\big(y - \mu\big) = 0,$$

which is the design matrix times the residual $y - \mu$ — exactly the form Topic 21 §21.4 Thm 3 wrote for the OLS normal equations. The composition $\mu = g^{-1}(X\beta)$ does the work of bridging the response scale and the linear-predictor scale; when the link is canonical and the response is Normal+identity, the composition collapses and we recover OLS verbatim. This is the structural claim of Track 6: linear regression is a GLM; the normal equations are the GLM score.

The companion result is the IRLS = Fisher scoring = Newton coincidence (Thm 2). Newton's method on the GLM log-likelihood, Fisher scoring (Newton with expected Hessian), and iteratively reweighted least squares (a sequence of WLS solves on an adjusted response) all coincide under the canonical link. Each iteration is

$$\beta^{(t+1)} = \big(X^\top W^{(t)} X\big)^{-1} X^\top W^{(t)} z^{(t)},$$

a weighted least-squares solve with the WLS weights $W^{(t)}$ and adjusted response $z^{(t)}$ defined in the notation block. This is what glm() in R, statsmodels.GLM in Python, and every production GLM solver implements under the hood.
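A minimal IRLS sketch for the canonical Bernoulli/logit case, following the update above; the function name and convergence rule are illustrative, not the text's library code. Each pass computes $\mu$, the weights $W_{ii} = \mu_i(1-\mu_i)$, the adjusted response $z$, and one weighted least-squares solve.

```python
import numpy as np

def irls_logistic(X, y, max_iter=25, tol=1e-10):
    """Fit a logistic GLM by IRLS (canonical link, phi = 1)."""
    beta = np.zeros(X.shape[1])                 # conventional starting point
    for _ in range(max_iter):
        eta = X @ beta                          # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))         # inverse link (sigmoid)
        W = mu * (1.0 - mu)                     # canonical-link weights V(mu)
        z = eta + (y - mu) / W                  # adjusted response
        XtW = X.T * W                           # X^T W without forming diag(W)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)   # one WLS solve
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta                                 # may not have converged (e.g. separation)
```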
Let $y_i \mid x_i$ follow an exponential-family density with canonical link, so $\theta_i = \eta_i = x_i^\top\beta$. Then:

- Score identity. $U(\beta) = \nabla_\beta\ell(\beta) = \frac{1}{\phi}X^\top(y - \mu)$, where $\mu_i = b'(x_i^\top\beta)$.
- Hessian. $\nabla^2_\beta\ell(\beta) = -X^\top W X$, where $W$ is diagonal with $W_{ii} = V(\mu_i)/\phi$.
- Concavity. When $X$ has full column rank $p$, the log-likelihood is strictly concave in $\beta$. Consequently, any interior critical point of $\ell$ is the unique global maximum.

In particular, the MLE $\hat\beta$, when it exists in the interior of the parameter space, is the unique solution to $X^\top\big(y - \mu(\beta)\big) = 0$.
Proof.
Step 1 (Setup). Substitute the canonical-link assumption $\theta_i = x_i^\top\beta$ into the exponential-family density. By Topic 7 Thm 3 the family's mean is $\mu_i = b'(\theta_i)$; by Topic 7 Cor 2 the variance function is $V(\mu_i) = b''(\theta_i)$. Both identities are one-line consequences of differentiating the cumulant function $b$ with respect to $\theta$ — see Topic 7 for the derivation.

Step 2 (Log-likelihood). With the canonical-link substitution, the per-observation log-likelihood is $\frac{y_i\theta_i - b(\theta_i)}{\phi} + c(y_i,\phi)$. Summing over $i$ and dropping the $\beta$-free constant:

$$\ell(\beta) = \frac{1}{\phi}\sum_{i=1}^n\Big[y_i\,x_i^\top\beta - b(x_i^\top\beta)\Big].$$

Step 3 (Score). Differentiate term by term using the chain rule. The derivative of $y_i\,x_i^\top\beta$ with respect to $\beta$ is $y_i x_i$; the derivative of $b(x_i^\top\beta)$ is $b'(x_i^\top\beta)\,x_i = \mu_i x_i$ (Topic 7 Thm 3). Therefore

$$U(\beta) = \frac{1}{\phi}\sum_{i=1}^n (y_i - \mu_i)\,x_i = \frac{1}{\phi}X^\top(y - \mu).$$

The score is the design matrix times the residual — the design times the gap between observed response and fitted mean. When $\phi = 1$ (Bernoulli, Poisson) the prefactor drops and $U(\beta) = X^\top(y - \mu)$ exactly; when $\phi$ is unknown but constant across observations (Gamma, Normal) it factors out of the first-order condition. This is identity #1.

Step 4 (Reduction to OLS). This step is the pedagogical hinge of the entire topic. For the Normal+identity-link special case, $b(\theta) = \theta^2/2$, so $b'(\theta) = \theta$ and $\mu_i = x_i^\top\beta$. Substituting back: $\mu = X\beta$, and the score identity becomes

$$X^\top(y - X\beta) = 0.$$

That is Topic 21 §21.4 Theorem 3 (the normal equations) verbatim. The featured theorem of Topic 22 contains the featured estimation result of Topic 21. We will see this reduction surface again in §22.3 Rem 6, and the structural identity "OLS is GLM" is the through-line that licenses every cross-reference between the two topics.

Step 5 (Hessian). Differentiate the score from Step 3 once more. The derivative of $\mu_i = b'(x_i^\top\beta)$ with respect to $\beta$ is $b''(x_i^\top\beta)\,x_i = V(\mu_i)\,x_i$ (Topic 7 Cor 2). Therefore

$$\nabla^2_\beta\ell(\beta) = -\frac{1}{\phi}\sum_{i=1}^n V(\mu_i)\,x_i x_i^\top = -X^\top W X, \qquad W_{ii} = V(\mu_i)/\phi,$$

where $W$ has strictly positive diagonal because exponential-family regularity guarantees $b''(\theta) > 0$ in the interior of the natural-parameter space (Topic 7). This is identity #2.

Step 6 (Concavity). When $X$ has full column rank $p$, the quadratic form $v^\top X^\top W X\,v = \|W^{1/2}Xv\|^2$ is strictly positive for every $v \neq 0$ (since $Xv \neq 0$ by full column rank, and $W \succ 0$). Hence $\nabla^2_\beta\ell$ is negative definite wherever $\beta$ is interior, so $\ell$ is strictly concave. Strict concavity guarantees any interior critical point is the unique global maximum. $\square$ — using Topic 7 Thm 3 ($\mu = b'(\theta)$) and Topic 7 Cor 2 ($V(\mu) = b''(\theta)$).
For the logistic regression model $y_i \sim \mathrm{Bernoulli}(p_i)$ with logit link $g(p) = \log\frac{p}{1-p}$, the canonical-link score identity gives

$$U(\beta) = X^\top(y - p),$$

where $p = (p_1,\dots,p_n)^\top$ with $p_i = \sigma(x_i^\top\beta)$ and $\sigma(t) = 1/(1+e^{-t})$ is the sigmoid. Setting the score to zero gives the system $X^\top(y - p) = 0$ — the binary-classification analog of "residual orthogonal to the design columns." Unlike OLS, the system is non-linear in $\beta$ (because $p$ depends on $\beta$ through the sigmoid), so there is no closed-form solution; we need the IRLS algorithm of Thm 2.

The Hessian is $-X^\top W X$ with $W_{ii} = p_i(1-p_i)$ — the variance of a Bernoulli with parameter $p_i$. The weight is largest at $p_i = 1/2$ (variance $1/4$) and shrinks to zero at the boundary $p_i \in \{0,1\}$, which is the source of the "near-separation" pathology Rem 8 will treat in §22.4.
Theorem 1 says what the MLE satisfies; Theorem 2 says how to find it. Newton’s method on a smooth concave function converges quadratically near its maximum, but the Newton step requires solving a linear system at every iteration. The next theorem shows that, under the canonical link, this Newton step rewrites as a single weighted-least-squares solve on an adjusted response — and is identical to Fisher scoring because observed and expected information coincide. The algorithm is iteratively reweighted least squares, and it is what every production GLM solver implements.
Under the canonical link, Newton's method on the GLM log-likelihood produces iterates

$$\beta^{(t+1)} = \big(X^\top W^{(t)} X\big)^{-1} X^\top W^{(t)} z^{(t)},$$

where $W^{(t)}_{ii} = V(\mu_i^{(t)})/\phi$ and the adjusted response is

$$z_i^{(t)} = \eta_i^{(t)} + g'(\mu_i^{(t)})\,\big(y_i - \mu_i^{(t)}\big).$$

The right-hand side is the closed-form weighted-least-squares estimator for the linear model $z^{(t)} \approx X\beta$ with weights $W^{(t)}$. Therefore Newton's method on the canonical-link GLM log-likelihood is iteratively reweighted least squares. Furthermore, because the Hessian depends only on the deterministic mean $\mu(\beta)$ and not on the data $y$ (no randomness in $-X^\top W X$), the observed information equals the expected information, so Newton's method is Fisher scoring.
Proof.
Step 1 (Setup). From Thm 1 we have, under the canonical link, $U(\beta) = \frac{1}{\phi}X^\top(y - \mu)$ and $\nabla^2_\beta\ell(\beta) = -X^\top W X$ with $W_{ii} = V(\mu_i)/\phi$. The dispersion factors out of the Newton step (it appears in both the score and the Hessian identically), so we absorb it and write $U(\beta) = X^\top(y - \mu)$ and $W_{ii} = V(\mu_i)$ in this proof.

Step 2 (Observed = expected information). The Hessian $-X^\top W X$ depends on $\beta$ only through $W$, which is a deterministic function of $\mu(\beta)$ and the fixed design matrix $X$. The data $y$ does not appear in it. Therefore $J(\beta) = \mathcal{I}(\beta)$ identically — observed and expected information coincide. Newton's method (which uses the observed Hessian) and Fisher scoring (which uses $\mathcal{I}(\beta)$) generate identical iterates under the canonical link.

Step 3 (Newton step). The Newton update is

$$\beta^{(t+1)} = \beta^{(t)} + \big(X^\top W^{(t)} X\big)^{-1} X^\top\big(y - \mu^{(t)}\big).$$

By Step 2, this is identically Fisher scoring. We now show it is identically WLS on an adjusted response.

Step 4 (Rewrite as WLS). Write $\beta^{(t)} = (X^\top W^{(t)}X)^{-1}X^\top W^{(t)}X\beta^{(t)}$ so that both terms of the Newton update share the prefactor $(X^\top W^{(t)}X)^{-1}$, then factor $X^\top W^{(t)}$ out of the bracket:

$$\beta^{(t+1)} = \big(X^\top W^{(t)} X\big)^{-1}\Big[X^\top W^{(t)} X\beta^{(t)} + X^\top\big(y - \mu^{(t)}\big)\Big] = \big(X^\top W^{(t)} X\big)^{-1} X^\top W^{(t)}\Big[X\beta^{(t)} + (W^{(t)})^{-1}\big(y - \mu^{(t)}\big)\Big].$$

Define the adjusted response $z^{(t)} = \eta^{(t)} + (W^{(t)})^{-1}(y - \mu^{(t)}) = \eta^{(t)} + g'(\mu^{(t)})(y - \mu^{(t)})$, recalling that under the canonical link $g'(\mu_i) = 1/V(\mu_i) = 1/W^{(t)}_{ii}$. Then the update is

$$\boxed{\ \beta^{(t+1)} = \big(X^\top W^{(t)} X\big)^{-1} X^\top W^{(t)} z^{(t)}.\ }$$

Step 5 (This is weighted least squares). The boxed right-hand side is the closed-form weighted-least-squares estimator (Topic 21 §21.7 Rem 13, in the WLS appendix-style remark) for the linear model $z^{(t)} \approx X\beta$ with weights $W^{(t)}$. Each Newton / Fisher-scoring iteration is therefore a single WLS solve on the adjusted response $z^{(t)}$. The weights and the response both depend on the current iterate $\beta^{(t)}$ — hence "iteratively reweighted least squares." $\square$
The boxed update in Step 4 is what IRLSVisualizer animates one iteration at a time. Click "Step" to advance from $\beta^{(t)}$ to $\beta^{(t+1)}$; the contour panel shows the iterate trajectory on the log-likelihood surface, and the right panel decomposes the update into the IRLS weights $W^{(t)}$ and the adjusted-response scatter $z^{(t)}$. The "Near-separation" preset reproduces the §22.4 Rem 8 pathology — the iterates wander off toward $\|\beta\| \to \infty$ rather than converging.

Figure (IRLSVisualizer): left panel, log-likelihood $\ell(\beta_0,\beta_1)$ (click to set $\beta^{(0)}$); right panel, adjusted response $z = \eta + g'(\mu)(y - \mu)$ and weights $W$.
The IRLS update has a deeper interpretation in information geometry: the exponential family is a Riemannian manifold whose metric is the Fisher information matrix $\mathcal{I}(\beta)$, and the IRLS step is the natural-gradient step under that metric — gradient ascent in the geometry the family naturally lives in, rather than the flat Euclidean geometry of $\mathbb{R}^p$. The natural gradient is invariant under reparameterization (a property the ordinary gradient lacks), which is exactly the kind of invariance that makes IRLS work uniformly across families and links.
The deep-learning community rediscovered natural-gradient descent in Amari (1998) for neural networks, where it is approximated by methods like K-FAC and Shampoo. The full information-geometry framing, including connections to the dual coordinate systems of exponential families, is at formalML: Information geometry .
We can now make the "OLS is GLM" claim of Step 4 of Proof 1 more emphatic. For Normal response with identity link and $\phi = \sigma^2$: $V(\mu) = 1$ gives $W = \frac{1}{\sigma^2}I$, so $X^\top W X = \frac{1}{\sigma^2}X^\top X$. The IRLS update collapses to

$$\beta^{(t+1)} = \big(X^\top X\big)^{-1} X^\top z^{(t)}.$$

The adjusted response collapses to $z^{(t)} = X\beta^{(t)} + (y - X\beta^{(t)}) = y$, so the WLS solve becomes the OLS solve $\hat\beta = (X^\top X)^{-1}X^\top y$, and one iteration converges. Topic 21's normal equations are not just contained in the canonical-link score identity — Topic 21's fitting algorithm is contained in IRLS as the special case where $W \propto I$ and the iteration converges in one step. Gauss–Markov BLUE optimality (§21.6) is the homoscedastic-case statement; under heteroscedastic errors the right BLUE is WLS with the true weights, which is what IRLS with the true variance function delivers (Topic 21 §21.9 Rem 21 forward-promised this; we have now discharged it).

The final ingredient in §22.3 is asymptotic normality of $\hat\beta$. This is a corollary of Topic 14 Thm 3 specialized to the GLM information matrix $\mathcal{I}(\beta) = X^\top W X$.
Under GLM regularity (canonical link, full column rank, MLE in the interior of the parameter space, finite Fisher information), as $n \to \infty$,

$$\sqrt{n}\,\big(\hat\beta - \beta_0\big) \xrightarrow{d} \mathcal{N}\big(0,\ \mathcal{I}_1(\beta_0)^{-1}\big),$$

where $\mathcal{I}_1(\beta_0)$ is the per-observation Fisher information matrix. Equivalently, in finite-sample-approximation form,

$$\hat\beta \;\overset{\text{approx}}{\sim}\; \mathcal{N}\big(\beta_0,\ (X^\top W X)^{-1}\big).$$

The estimated covariance $(X^\top \widehat W X)^{-1}$ replaces the unknown $\mu_i$ by $\hat\mu_i$ and the unknown $\phi$ by the Pearson estimator (§22.2 Rem 4), and is what every coefficient SE in §22.8 is built on.

The proof is a direct specialization of Topic 14 Thm 3: the GLM log-likelihood satisfies the standard regularity conditions (smoothness in $\beta$, integrable score, finite Fisher information at $\beta_0$), so $\sqrt{n}(\hat\beta - \beta_0)$ converges in distribution to $\mathcal{N}(0, \mathcal{I}_1(\beta_0)^{-1})$; the GLM-specific content is that $\mathcal{I}(\beta)$ has the design-matrix-times-variance-weight form $X^\top W X$. We omit the full restatement of Topic 14 Thm 3 — see §14.5 of that topic for the multivariate Taylor-expansion derivation.
On a synthetic logistic dataset with a single predictor and known true coefficients, IRLS produces the iteration trace below:

(Iteration table, $t = 0,\dots,5$: the iterate $\beta^{(t)}$ and the error $\|\beta^{(t)} - \hat\beta\|$ at each step; the numeric values are reproduced in the notebook output.)

Five iterations to convergence. The error $\|\beta^{(t)} - \hat\beta\|$ shrinks roughly quadratically per step (each step squaring the previous gap up to a constant) — the signature of Newton-method convergence in a neighborhood of the optimum (Topic 14 Thm 6). The right-panel readout of IRLSVisualizer reproduces this trajectory live.
22.4 Logistic Regression
The Bernoulli response with logit link is the binary-classification workhorse. Every binary outcome — credit default vs. repayment, click vs. no-click, treatment success vs. failure — fits naturally into this framework, and the canonical-link score identity Thm 1 gives the fitting algorithm directly: minimize the negative Bernoulli log-likelihood (the cross-entropy) via IRLS with weights $W_{ii} = p_i(1-p_i)$.

Given binary responses $y_i \in \{0,1\}$ and predictors $x_i \in \mathbb{R}^p$ (with leading intercept column), the logistic regression model is

$$\log\frac{p_i}{1-p_i} = x_i^\top\beta, \qquad y_i \mid x_i \sim \mathrm{Bernoulli}(p_i).$$

Equivalently, $p_i = \sigma(x_i^\top\beta)$ where $\sigma(t) = 1/(1+e^{-t})$ is the logistic sigmoid. The coefficient $\beta_j$ is the log-odds-ratio associated with a one-unit increase in $x_j$ (holding others fixed): $e^{\beta_j}$ is the multiplicative change in the odds $p/(1-p)$.

By Ex 4, the score is $X^\top(y - p)$ and IRLS uses weights $W_{ii} = p_i(1-p_i)$. Theorem 3 specializes immediately:

Under GLM regularity, the asymptotic distribution from Thm 3 gives the per-coefficient Wald-z confidence interval

$$\hat\beta_j \pm z_{1-\alpha/2}\,\widehat{\mathrm{SE}}(\hat\beta_j), \qquad \widehat{\mathrm{SE}}(\hat\beta_j) = \sqrt{\big[(X^\top \widehat W X)^{-1}\big]_{jj}},$$

with asymptotic coverage $1-\alpha$. The Wald test for $H_0\!: \beta_j = 0$ is the dual test (Topic 19 Thm 1). For Bernoulli, $\phi = 1$ is fixed, so no F-correction is needed; the asymptotic Normal reference applies directly.
The Wald-z CI is the reported default in summary(glm(...)) in R and in the .summary() output of a fitted statsmodels GLM in Python — easy to compute, asymptotically correct. But §22.8 will show it has known pathologies near the parameter boundary (separation, rare-event imbalance), where the profile-likelihood CI is more reliable.

A consumer-finance synthetic example: loan applicants with two predictors, standardized age $x_1$ and standardized log-income $x_2$. The default response is $y_i = 1$ for default, $y_i = 0$ for repayment, with known true coefficients on the logit scale (intercept, age, income). The notebook fits the model with sm.GLM(y, X, family=Binomial()):
| Coefficient | -value | |||
|---|---|---|---|---|
| Intercept | ||||
| Age | ||||
| Income |
Both predictors are highly significant. Coefficient interpretation: a one-standard-deviation increase in age raises log-odds-of-default by (multiplicative odds change ); the analogous income coefficient is . The full model deviance is on residual df, vs the intercept-only deviance on df. The deviance reduction on 2 df has under — overwhelming evidence the predictors carry information about default risk.
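A sketch of the Ex 6 fit with statsmodels; the sample size and data-generating coefficients below are placeholder assumptions rather than the notebook's values. The structure (two standardized predictors, Binomial family, deviance comparison against the intercept-only model) is what matters.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000                                              # assumed sample size
age, income = rng.standard_normal(n), rng.standard_normal(n)
X = sm.add_constant(np.column_stack([age, income]))
beta_true = np.array([-2.0, 0.8, -0.6])               # assumed logit-scale coefficients
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta_true))))

fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(fit.summary())                                  # Wald-z coefficient table

null = sm.GLM(y, np.ones((n, 1)), family=sm.families.Binomial()).fit()
print("deviance reduction:", null.deviance - fit.deviance, "on 2 df")
```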
The logit is canonical, but two other links see practical use. Probit: $p_i = \Phi(x_i^\top\beta)$ where $\Phi$ is the standard Normal CDF; the model corresponds to a latent Normal variable $z_i = x_i^\top\beta + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0,1)$ and $y_i = \mathbf{1}\{z_i > 0\}$. Common in psychometric / response-threshold modeling. Complementary log-log (cloglog): $p_i = 1 - \exp(-e^{x_i^\top\beta})$ corresponds to a discrete-time hazard interpretation — natural for survival-analysis-style binary outcomes. Both are non-canonical: IRLS still fits them (with the general weights $W_{ii} = 1/\big(V(\mu_i)\,g'(\mu_i)^2\big)$ rather than the simpler canonical form), and asymptotic normality still applies, but the score identity is no longer $X^\top(y - p) = 0$ exactly. In practice, logit and probit give nearly identical fitted probabilities and indistinguishable LRT $p$-values; the choice is interpretive (log-odds vs latent-Normal) rather than statistical.

Quasi-complete separation occurs when the predictors perfectly (or nearly perfectly) separate the two classes — say, all $y_i = 1$ on one side of a threshold in predictor space and all $y_i = 0$ on the other. Geometrically, you can draw a hyperplane in predictor space with all positives on one side. Under separation, the logistic log-likelihood is monotonically increasing along the separating direction (you can always make the fit "more confident" by scaling up $\|\beta\|$), so the MLE does not exist in the interior of $\mathbb{R}^p$ — IRLS iterates diverge to infinity.

Production GLM solvers detect this by stopping when the step or deviance change fails to shrink within a maximum iteration count, returning a "did not converge" warning. The fix is to add a regularization penalty (Firth 1993's penalized-likelihood approach is one option; ridge / L² penalties are another) — Topic 23 develops penalized GLMs in detail. The IRLSVisualizer "Near-separation" preset reproduces the diverging-iterates pathology so the failure mode is visually unmistakable.
The "binary cross-entropy loss" used in deep-learning binary classifiers is, per observation,

$$-\big[\,y_i\log p_i + (1-y_i)\log(1-p_i)\,\big],$$

which is exactly the negative Bernoulli log-likelihood. A neural network with a sigmoid output layer trained to minimize cross-entropy is fitting a logistic GLM where the linear predictor $x_i^\top\beta$ is replaced by a learned representation $f_\theta(x_i)$. All of §22.4 — link function, IRLS, deviance, separation pathology — applies; only the parameterization changes. See formalML: Logistic regression for the deep-learning treatment.

A logistic model can fit the trend well but miscalibrate the probabilities. The Hosmer–Lemeshow test (Hosmer, Lemeshow & Sturdivant 2013, Ch. 5) checks the latter: partition observations into deciles by fitted $\hat p_i$, count observed and expected positives in each bin, and form a $\chi^2$-type statistic. Large values reject "the model is well-calibrated." The GLMExplorer calibration plot (right panel of Fig 4) is the visual analog. For deep models, related tools include reliability diagrams and the expected-calibration-error metric — the latter is the binary version of the Hosmer–Lemeshow statistic without the binning step. Standard caveat: the test has low power against subtle miscalibration and is sensitive to the choice of the number of bins $G$ (typically $G = 10$).
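A minimal Hosmer–Lemeshow sketch with $G = 10$ bins, assuming y and p_hat come from a fitted logistic model; the helper name is illustrative.

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, G=10):
    """Chi-square calibration statistic over G bins of fitted probability."""
    order = np.argsort(p_hat)
    stat = 0.0
    for idx in np.array_split(order, G):        # deciles of fitted probability
        obs = y[idx].sum()                      # observed positives in bin
        exp = p_hat[idx].sum()                  # expected positives in bin
        n_g = len(idx)
        stat += (obs - exp) ** 2 / (exp * (1.0 - exp / n_g))
    return stat, chi2.sf(stat, df=G - 2)        # reference: chi^2 with G-2 df
```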
The GLMExplorer component lets you fit any of the four GLM families on six preset datasets — switch the family selector between Bernoulli / Poisson / Gamma / Normal, pick a link function, and watch the fit, residuals, and coefficient table update in real time. It is shared across §22.4 / §22.5 / §22.6 with different default presets.
GLMExplorer default preset: Bernoulli family with logit link; panels show the fit and the residuals vs fitted values, with the coefficient table below (hovering a coefficient row computes its profile-LRT CI).

| name | β̂ | SE | z | p |
|---|---|---|---|---|
| intercept | -0.698 | 0.211 | -3.31 | 0.0009 |
| β1 | 1.293 | 0.244 | 5.31 | 1.1e-7 |
| β2 | 1.965 | 0.299 | 6.57 | 5.2e-11 |
22.5 Poisson Regression
The Poisson response with log link is the workhorse for count outcomes: claim frequencies, page-view rates, photons per pixel, gene-read counts. The log link enforces $\mu_i = e^{\eta_i} > 0$ (counts are non-negative), and the multiplicative coefficient interpretation ($e^{\beta_j}$ = rate ratio per unit increase in $x_j$) is more useful in practice than the additive interpretation an identity link would give.

Given count responses $y_i \in \{0,1,2,\dots\}$, predictors $x_i$, and (optionally) an exposure $t_i > 0$, the Poisson regression model is

$$y_i \sim \mathrm{Poisson}(\mu_i), \qquad \log\mu_i = \log t_i + x_i^\top\beta.$$

The exposure offset $\log t_i$ is a covariate with fixed coefficient 1 — exposure-adjustment, not a parameter to estimate (Rem 11). The implied rate per unit exposure is $\lambda_i = \mu_i / t_i = e^{x_i^\top\beta}$, and $e^{\beta_j}$ is the rate ratio: the multiplicative change in expected count per unit increase in $x_j$, holding exposure and other covariates fixed.

The canonical-link score is $X^\top(y - \mu)$ where $\mu_i = t_i\,e^{x_i^\top\beta}$. IRLS weights are $W_{ii} = \mu_i$ — heteroscedastic by construction (variance grows linearly with mean), which is why OLS on counts violates the homoscedasticity assumption while Poisson regression handles it naturally.
An insurance carrier observes $n$ policies with annual claim counts $y_i$, policy-years exposure $t_i$, standardized driver age $x_{i1}$, and a region indicator $x_{i2}$. The fitted model

$$\log \mathbb{E}[y_i] = \log t_i + \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}$$

uses the offset $\log t_i$ to convert the model from "expected total claims" to "expected claims per policy-year" — without the offset, a policy observed for 10 years would dominate one observed for 1 year, and $\hat\beta$ would be biased toward the long-exposure observations. Statsmodels fits this with sm.GLM(y, X, family=Poisson(), offset=np.log(t)). On the synthetic dataset (seed 2223), the fit recovers every true coefficient to within sampling error, with residual deviance close to its degrees of freedom. Coefficient interpretation: each one-SD increase in driver age multiplies the expected claim rate by $e^{\hat\beta_1}$; being in region 1 multiplies it by $e^{\hat\beta_2}$.

The offset enters the linear predictor with fixed coefficient 1, not a parameter to estimate. It is not the same as adding $\log t_i$ as a regular predictor (which would estimate a separate coefficient for it, allowing the rate to scale non-linearly with exposure). The offset says "the rate of claims per unit exposure is $e^{x_i^\top\beta}$," and exposure enters multiplicatively with coefficient 1.

The same construction applies wherever counts have an obvious denominator: cases per population (epidemiology), failures per machine-hour (reliability), defects per lot size (quality control). The denominator goes in as a log-offset; the rate per unit exposure is what $e^{x_i^\top\beta}$ models.
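A sketch of the Ex 7 claim-frequency fit; the exposure distribution and coefficient values below are assumptions chosen for illustration (only the seed 2223 is taken from the text).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2223)
n = 5000                                             # assumed number of policies
t = rng.uniform(0.5, 10.0, size=n)                   # policy-years exposure (assumed)
age = rng.standard_normal(n)
region = rng.binomial(1, 0.4, size=n)
X = sm.add_constant(np.column_stack([age, region]))
beta_true = np.array([-2.3, 0.25, 0.4])              # assumed true coefficients
y = rng.poisson(t * np.exp(X @ beta_true))           # expected claims = exposure * rate

fit = sm.GLM(y, X, family=sm.families.Poisson(), offset=np.log(t)).fit()
print(fit.params)                                    # should sit close to beta_true
print("rate ratio per SD of age:", np.exp(fit.params[1]))
```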
The log link is canonical and almost always the right choice. The identity link is occasionally used for additive rate-difference interpretation (e.g. "this treatment causes 0.3 additional events per person-year"), but it requires the constraint $x_i^\top\beta \ge 0$ for $\mu_i$ to be a valid Poisson mean, which IRLS does not enforce automatically. The fit can stray into negative territory mid-iteration, causing the deviance to be undefined and IRLS to crash. In practice, identity-link Poisson is fit only when domain context gives a binding lower bound on covariate values (e.g. an observational range that keeps the linear predictor strictly positive) and the additive interpretation is required for downstream decisions.

Poisson regression assumes $\operatorname{Var}(y_i) = \mathbb{E}[y_i] = \mu_i$. Real count data often violate this: hospital admissions, social-media post counts, insurance claim frequencies all typically have $\operatorname{Var}(y_i) > \mu_i$ — overdispersion. The point estimate $\hat\beta$ is still consistent under overdispersion (this is the QMLE result Thm 8 will state), but the naive Wald SE is too small, and Wald CIs undercover. Three fixes:

- Negative Binomial regression — replace the Poisson response with a Negative Binomial whose extra dispersion parameter is estimated alongside $\beta$; a full distribution. Standard tools: MASS::glm.nb in R, statsmodels.discrete.NegativeBinomial in Python. Forward-pointed to formalML: Generalized linear models for the full treatment.
- Quasi-Poisson — keep the Poisson score equation, but estimate the dispersion separately via the Pearson statistic. Same point estimates, scaled SEs.
- Sandwich SE — keep the Poisson model entirely, but use the HC0–HC3 sandwich estimator from §22.9 to get robust SEs without changing the distributional assumption. This is the QMLE approach Wedderburn (1974) introduced.

The §22.9 SandwichCoverageSimulator shows the coverage gap visually: under substantial overdispersion, naive Wald CIs cover well below the nominal 95%; HC3 sandwich CIs cover much closer to nominal.
22.6 Gamma Regression
The Gamma response with inverse or log link is the third workhorse — for positive-skewed continuous outcomes: insurance claim amounts (as opposed to frequencies), waiting times, healthcare costs, gene-expression intensities. The Gamma allows a shape parameter $\alpha$ that controls skewness, and the variance scales as $\mu^2$ (multiplicative noise on the mean rather than additive), which matches the "log-Normal-ish" empirical behavior of these outcomes.

Given positive responses $y_i > 0$ and predictors $x_i$, the Gamma regression model is

$$y_i \sim \mathrm{Gamma}(\alpha, \lambda_i), \qquad g(\mu_i) = x_i^\top\beta,$$

parameterized so that the shape is $\alpha$ and the rate is $\lambda_i = \alpha/\mu_i$, giving $\mathbb{E}[y_i] = \mu_i$ and $\operatorname{Var}(y_i) = \mu_i^2/\alpha$. The dispersion is $\phi = 1/\alpha$, and the variance function is $V(\mu) = \mu^2$. The canonical link is $g(\mu) = 1/\mu$ (the inverse link); the practical link is $g(\mu) = \log\mu$ (the log link, non-canonical) — Rem 14 explains why the latter is often preferred.

Topic 6 Example 4 introduced insurance claim amounts as a motivating Gamma example and forward-pointed to "GLMs for the full framework." We discharge that pointer here. On a synthetic claim-amounts dataset with a single claim-severity score $x_i$, known true coefficients on the log scale, and shape $\alpha = 2$ (so $\phi = 0.5$):
| Coefficient | Estimate | SE |
|---|---|---|
| Intercept | — | — |
| Severity | — | — |

Estimated dispersion $\hat\phi$ close to the true 0.5 (recovery within sampling error at this sample size). Deviance close to its residual df. The fitted model is $\hat\mu_i = \exp(\hat\beta_0 + \hat\beta_1 x_i)$, so a one-SD increase in claim severity multiplies the expected claim amount by $e^{\hat\beta_1}$. The right panel of Figure 6 below shows deviance residuals against fitted values — the symmetric-around-zero, no-trend pattern signals the Gamma+log specification is adequate; if the dispersion were misspecified or the variance function wrong, the deviance residuals would show heteroscedastic structure. The §22.9 sandwich estimator handles this when it happens.
The Gamma's canonical link is the inverse: $g(\mu) = 1/\mu$, so $\mu_i = 1/(x_i^\top\beta)$. Under the inverse link the score identity Thm 1 applies cleanly — the score is proportional to $X^\top(y - \mu)$ — but coefficient interpretation is awkward: $\beta_j$ is the additive change in $1/\mu$ per unit change in $x_j$, which has no natural domain meaning ("additive on the reciprocal of expected claim amount"). Worse, the inverse link requires $x_i^\top\beta > 0$ to keep $\mu_i > 0$, and IRLS does not enforce this — the fit can stray into negative territory and crash.

The log link is non-canonical for Gamma but solves both problems: $\mu_i = e^{x_i^\top\beta}$ is automatically positive, and $\beta_j$ has a clean multiplicative interpretation ($e^{\beta_j}$ is the multiplicative change in expected claim amount per unit change in $x_j$). The cost is modest: with $g(\mu) = \log\mu$ we have $g'(\mu) = 1/\mu$ and $V(\mu) = \mu^2$, so the Fisher-scoring weights $W_{ii} = 1/\big(\phi\,V(\mu_i)\,g'(\mu_i)^2\big) = 1/\phi$ are constant, and the score equation acquires an extra Jacobian factor (it is no longer $X^\top(y - \mu) = 0$ exactly), but IRLS still converges and the asymptotic theory still applies. In practice, log-link Gamma — R's glm(family=Gamma(link="log")) — is what insurance, finance, and biostatistics applications use.
For Bernoulli and Poisson, $\phi = 1$ is structural — there is no separate variance parameter to estimate. For Gamma, $\phi$ is a free parameter, estimated by the Pearson statistic

$$\hat\phi = \frac{1}{n-p}\sum_{i=1}^n \frac{(y_i - \hat\mu_i)^2}{V(\hat\mu_i)} = \frac{1}{n-p}\sum_{i=1}^n \frac{(y_i - \hat\mu_i)^2}{\hat\mu_i^2}.$$

The estimate enters all coefficient SEs as a scalar multiplier: $\widehat{\operatorname{Cov}}(\hat\beta) = \hat\phi\,\big(X^\top \widehat W X\big)^{-1}$ with $\widehat W$ computed at $\phi = 1$. For nested-model deviance tests (§22.7), $\hat\phi$ also enters: when $\phi$ is estimated, the deviance test uses an $F$ reference instead of $\chi^2$, exactly as in Topic 21 §21.8 where $\hat\sigma^2$ replaced $\sigma^2$. The notebook fit in Ex 8 produced a $\hat\phi$ close to the true 0.5 — within sampling error at this sample size, which is good enough for inference.
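A sketch of a Gamma/log fit with the Pearson dispersion estimate from Rem 15; the shape $\alpha = 2$ (so $\phi = 0.5$) matches the Ex 8 setup, while the mean curve and sample size are illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 1500
severity = rng.standard_normal(n)
X = sm.add_constant(severity)
mu = np.exp(7.0 + 0.35 * severity)                   # assumed true mean curve
alpha = 2.0                                          # shape, so phi = 1/alpha = 0.5
y = rng.gamma(shape=alpha, scale=mu / alpha)         # E[y] = mu, Var(y) = mu^2 / alpha

fit = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log())).fit()
mu_hat = fit.fittedvalues
phi_hat = np.sum((y - mu_hat) ** 2 / mu_hat ** 2) / (n - X.shape[1])   # Pearson dispersion
print(fit.params, "phi_hat:", phi_hat)               # phi_hat should land near 0.5
```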
22.7 Deviance and the Deviance Likelihood-Ratio Test
In Topic 21, the sum of squared errors played a triple role: it was the OLS objective, the goodness-of-fit measure, and the test statistic for nested-model comparison via the $F$-test (§21.8 Thm 9). The GLM analog of all three is the deviance — a likelihood-ratio statistic that compares the fitted model to a "saturated" model in which every observation has its own parameter. Differences in deviance across nested models recover the likelihood-ratio statistic $2[\ell_1 - \ell_0]$ (Wilks' statistic) verbatim, so Wilks' theorem (Topic 18 Thm 4) gives the asymptotic distribution immediately.

The saturated model for a GLM with response $y$ has one parameter per observation: it sets $\hat\mu_i = y_i$ for every $i$, so the fitted means coincide with the observations. The saturated log-likelihood is

$$\ell_{\mathrm{sat}} = \sum_{i=1}^n \log f\big(y_i;\,\theta(y_i),\phi\big),$$

the maximum log-likelihood achievable on the data without any constraint from the linear predictor. The deviance of a fitted model with log-likelihood $\ell(\hat\beta)$ is

$$D = 2\phi\,\big[\ell_{\mathrm{sat}} - \ell(\hat\beta)\big].$$

Deviance is a non-negative scalar (since $\ell_{\mathrm{sat}} \ge \ell(\hat\beta)$ by definition); $D = 0$ would mean the fitted model achieves the saturated fit, which happens only on degenerate data (e.g. fitting a logistic with one binary predictor on all-zero responses). For Bernoulli and Poisson ($\phi = 1$), the deviance is the GLM analog of SSE; for Gamma and Normal, the dispersion factor converts log-likelihood units to deviance units.
For the four canonical families, the per-observation deviance contributions $d_i$ such that $D = \sum_i d_i$ are:

- Bernoulli: $d_i = 2\Big[y_i\log\frac{y_i}{\hat\mu_i} + (1-y_i)\log\frac{1-y_i}{1-\hat\mu_i}\Big]$, with the convention $0\log 0 = 0$.
- Poisson: $d_i = 2\Big[y_i\log\frac{y_i}{\hat\mu_i} - (y_i - \hat\mu_i)\Big]$, with the convention $0\log 0 = 0$.
- Gamma: $d_i = 2\Big[-\log\frac{y_i}{\hat\mu_i} + \frac{y_i - \hat\mu_i}{\hat\mu_i}\Big]$.
- Normal: $d_i = (y_i - \hat\mu_i)^2$, so $D = \mathrm{SSE}$ directly — Topic 21's residual-sum-of-squares statistic is the Normal-family deviance.

The Normal-family identity $D = \mathrm{SSE}$ is the cleanest expression of the "GLM contains OLS" reduction at the inference layer: every deviance computation in §22.7 specializes back to Topic 21's SSE machinery for Normal+identity-link.
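The four unit deviances written out directly as a helper (a hedged sketch, not the text's library); scipy.special.xlogy handles the $0\log 0 = 0$ convention for the Bernoulli and Poisson cases.

```python
import numpy as np
from scipy.special import xlogy

def deviance_contrib(y, mu, family):
    """Per-observation deviance d_i; D = deviance_contrib(...).sum()."""
    if family == "bernoulli":
        return 2.0 * (xlogy(y, y / mu) + xlogy(1 - y, (1 - y) / (1 - mu)))
    if family == "poisson":
        return 2.0 * (xlogy(y, y / mu) - (y - mu))
    if family == "gamma":
        return 2.0 * (-np.log(y / mu) + (y - mu) / mu)
    if family == "normal":
        return (y - mu) ** 2            # sums to the OLS SSE
    raise ValueError(f"unknown family: {family}")
```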
For nested GLM models $M_0 \subset M_1$, where $M_0$ imposes $k$ linear restrictions on the coefficients,

$$D_0 = D_1 + (D_0 - D_1).$$

The reduced-model deviance $D_0$ decomposes into the full-model deviance $D_1$ (residual lack-of-fit not explained by the additional predictors) plus the deviance improvement $D_0 - D_1$ (lack-of-fit explained by adding the predictors back). This is the GLM analog of the OLS decomposition $\mathrm{SSE}_0 = \mathrm{SSE}_1 + (\mathrm{SSE}_0 - \mathrm{SSE}_1)$ from §21.8. The deviance improvement is the test statistic for $H_0$: the restrictions hold.

The decomposition is a tautology — no proof needed, just rearrangement of the definition of $D$. The theorem is included for symmetry with the OLS case and to motivate Thm 6, which gives the asymptotic distribution of the deviance improvement.

Let $M_1$ be a GLM with parameter space $\mathbb{R}^p$, and let $M_0$ be the reduced GLM imposing $k$ linear restrictions on the coefficients. Under $H_0$ and GLM regularity (full-rank $X$, interior MLE), as $n \to \infty$,

$$\frac{D_0 - D_1}{\phi} \xrightarrow{d} \chi^2_k.$$

For fixed-$\phi$ families (Bernoulli, Poisson) the asymptotic $\chi^2_k$ applies directly; for estimated-$\phi$ families (Gamma, Normal), the test statistic divided by $k\hat\phi$ uses an $F_{k,\,n-p}$ reference to account for dispersion-estimation uncertainty.
Proof.

Step 1 (Setup). Let $\hat\beta_0$ and $\hat\beta_1$ be the MLEs under the reduced and full models. The full model has $p$ free parameters; the reduced model has $p - k$ free parameters under the $k$ linear restrictions.

Step 2 (Deviance difference cancels the saturated term). By Def 7, $D_0 - D_1 = 2\phi\big[\ell_{\mathrm{sat}} - \ell(\hat\beta_0)\big] - 2\phi\big[\ell_{\mathrm{sat}} - \ell(\hat\beta_1)\big] = 2\phi\big[\ell(\hat\beta_1) - \ell(\hat\beta_0)\big]$. The saturated-model log-likelihood cancels.

Step 3 (LRT identification). Topic 18 Def 1 defines the log-likelihood-ratio statistic for nested models as $\Lambda = 2\big[\ell(\hat\beta_1) - \ell(\hat\beta_0)\big]$. For $\phi = 1$ the deviance difference is $\Lambda$ verbatim; for $\phi \neq 1$ they agree up to the scalar $\phi$.

Step 4 (Apply Wilks). By Topic 18 Thm 4 (Wilks' theorem), under $H_0$ and standard regularity (full-rank Fisher information, interior MLE, $H_0$ imposes $k$ smooth constraints), $\Lambda \xrightarrow{d} \chi^2_k$. Therefore $(D_0 - D_1)/\phi \xrightarrow{d} \chi^2_k$.

Step 5 (Decision rule). At significance level $\alpha$, reject $H_0$ when $(D_0 - D_1)/\phi > \chi^2_{k,\,1-\alpha}$. For estimated $\phi$, divide by $k\hat\phi$ and reject when the quotient exceeds $F_{k,\,n-p,\,1-\alpha}$ asymptotically. $\square$ — using Topic 18 Thm 4 (Wilks).

The entire proof is five steps of book-keeping — the substance lives in Topic 18 Thm 4. The deviance test is Wilks; we just renamed the statistic to the deviance difference $D_0 - D_1$ to match GLM software output. R's anova(fit0, fit1, test="Chisq") and Python's glm1.compare_lr_test(glm0) both compute $D_0 - D_1$ exactly and reference it to $\chi^2_k$.
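A hedged sketch of the nested-model deviance test with statsmodels, assuming the reduced design is a column subset of the full design and a fixed-$\phi$ family so the $\chi^2_k$ reference applies directly.

```python
import statsmodels.api as sm
from scipy.stats import chi2

def deviance_lrt(y, X_reduced, X_full, family, offset=None):
    """Deviance difference D0 - D1 for nested GLMs, referenced to chi^2_k."""
    fit0 = sm.GLM(y, X_reduced, family=family, offset=offset).fit()
    fit1 = sm.GLM(y, X_full, family=family, offset=offset).fit()
    delta = fit0.deviance - fit1.deviance
    k = X_full.shape[1] - X_reduced.shape[1]
    return delta, k, chi2.sf(delta, df=k)       # valid for phi = 1 families

# usage: delta, k, p = deviance_lrt(y, X_reduced, X_full, sm.families.Poisson())
```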
Continuing the §22.5 insurance-frequency setup: fit a full Poisson model with intercept + 3 predictors + log-offset, and a reduced model with intercept + 1 predictor + log-offset (dropping the second and third predictors, $k = 2$ restrictions). On the synthetic dataset:

| Quantity | Value |
|---|---|
| Full-model deviance $D_1$ | — |
| Reduced-model deviance $D_0$ | — |
| Deviance difference $D_0 - D_1$ (on 2 df) | — |
| $\chi^2_2$ critical value | — |
| $p$-value | — |

Decisive rejection of $H_0$ — the dropped predictors carry highly significant information. The DevianceTestExplorer lets you sweep across families, effect sizes, and sample sizes to see how the test behaves under different effect sizes. The middle panel shows the power curve as a function of non-centrality $\lambda$: power climbs toward 1 as $\lambda$ grows at fixed $k$ and $\alpha$.
Two residual flavors are reported by every GLM solver:
- Pearson: $r_i^{P} = \dfrac{y_i - \hat\mu_i}{\sqrt{V(\hat\mu_i)}}$ — standardized residual on the response scale, analog of the OLS standardized residual.
- Deviance: $r_i^{D} = \operatorname{sign}(y_i - \hat\mu_i)\,\sqrt{d_i}$, where $d_i$ is observation $i$'s contribution to $D$.

For Normal+identity-link, both reduce to standardized OLS residuals (T8.9 verifies this to machine precision). For other families, the deviance residual is generally preferred for diagnostic plotting because it is more symmetric under correctly-specified models — the Pearson residual inherits the response-scale skewness (e.g. positive skew for Poisson with small $\mu$), while the deviance residual is a square-root transform that smooths the asymmetry. In production, plot $r_i^{D}$ against $\hat\mu_i$; systematic structure flags model misspecification (wrong link, missing predictor, wrong variance function).
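Both residual flavors are available directly on a fitted statsmodels GLM; a small plotting sketch (matplotlib assumed) for the side-by-side diagnostic described above:

```python
import matplotlib.pyplot as plt

def residual_panel(fit):
    """Pearson vs deviance residuals against fitted values for a GLMResults object."""
    mu_hat = fit.fittedvalues
    fig, axes = plt.subplots(1, 2, figsize=(9, 3.5), sharex=True)
    axes[0].scatter(mu_hat, fit.resid_pearson, s=8)    # (y - mu) / sqrt(V(mu))
    axes[0].set_title("Pearson vs fitted")
    axes[1].scatter(mu_hat, fit.resid_deviance, s=8)   # sign(y - mu) * sqrt(d_i)
    axes[1].set_title("Deviance vs fitted")
    for ax in axes:
        ax.axhline(0.0, linewidth=0.8)
    return fig
```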
The Figure 8 grid in §22.9 shows both residual flavors side-by-side for Poisson and Gamma fits — the Pearson residuals fan out heteroscedastically even under correct specification (fitting artifact, not model failure), while the deviance residuals are symmetric and unstructured.
The DevianceTestExplorer component lets you explore the deviance LRT across families and effect sizes. The left panel shows the null $\chi^2_k$ density with the rejection region shaded; the middle panel shows the power curve as a function of non-centrality $\lambda$; the right panel shows the profile likelihood for $\beta_j$ with both the Wald-z CI and the profile-LRT CI rendered, making the asymmetry of the latter visible.

Figure (DevianceTestExplorer panels): left, the $\chi^2_1$ null density with the observed $\Delta D$; middle, power vs $\lambda$ ($k = 1$, $\alpha = 0.05$); right, profile $D(\beta_j)$ vs Wald-z ($j = 2$); the $k$ slider changes only the $\chi^2_k$ reference axis.
22.8 Coefficient Inference: Wald, Score, Profile-Likelihood
Topic 17 §17.9 introduced the asymptotic-equivalence trio — Wald, Score, and Likelihood-Ratio tests — as three flavors of the same inference machinery, each with its own computational and finite-sample tradeoffs. §22.8 specializes the trio to GLM coefficient inference: Wald-z for fast per-coefficient $z$-statistics, Score for "only fit the null" forward-selection contexts, and profile-LRT for the high-fidelity CIs that respect the likelihood's curvature near the boundary. Topic 19 §19.6 forward-promised that profile-LRT is the GLM default; Rem 17 below discharges that promise.
For a single GLM coefficient $\beta_j$ at confidence level $1-\alpha$:

- Wald-z CI: $\hat\beta_j \pm z_{1-\alpha/2}\,\widehat{\mathrm{SE}}(\hat\beta_j)$, where $\widehat{\mathrm{SE}}(\hat\beta_j) = \sqrt{\big[\widehat{\operatorname{Cov}}(\hat\beta)\big]_{jj}}$ from Thm 3. Symmetric about $\hat\beta_j$; uses the asymptotic Normal distribution of $\hat\beta$.
- Score (Rao) CI: the set of $b$ whose score statistic $U(\tilde\beta_b)^\top\,\mathcal{I}(\tilde\beta_b)^{-1}\,U(\tilde\beta_b)$ stays below $\chi^2_{1,\,1-\alpha}$, where $\tilde\beta_b$ is the constrained MLE under $\beta_j = b$. Computed entirely at the null; never requires the full-model fit.
- Profile-likelihood CI: $\big\{\,b : 2\big[\ell(\hat\beta) - \ell_{\mathrm{prof}}(b)\big] \le \chi^2_{1,\,1-\alpha}\,\big\}$, where $\ell_{\mathrm{prof}}(b)$ profiles out the nuisance coefficients (refits the model with $\beta_j$ fixed at $b$ and the other coefficients optimized). Asymmetric in general; the gold standard for GLMs.

All three CIs invert the corresponding hypothesis test (Topic 19 Thm 1, the test–CI duality). All three converge to the same asymptotic distribution by Topic 18 §18.7 — they disagree only by $O_p(n^{-1/2})$ in finite samples.
Under $H_0\!: \beta_j = \beta_j^0$ and GLM regularity, the three test statistics

$$W = \left(\frac{\hat\beta_j - \beta_j^0}{\widehat{\mathrm{SE}}(\hat\beta_j)}\right)^{\!2}, \qquad S = U(\tilde\beta)^\top\,\mathcal{I}(\tilde\beta)^{-1}\,U(\tilde\beta), \qquad \Lambda = 2\big[\ell(\hat\beta) - \ell(\tilde\beta)\big]$$

(with $\tilde\beta$ the constrained MLE under $H_0$) each converge in distribution to $\chi^2_1$ as $n \to \infty$, and pairwise differ by $O_p(n^{-1/2})$. The three corresponding CIs (Wald-z, Score, profile-LRT) therefore agree to leading order asymptotically and differ only in finite-sample behavior — particularly near the parameter boundary, where Wald-z's symmetric form fails and the others recover correct coverage.
The proof is a direct specialization of Topic 18 §18.7’s general statement to the GLM setting; we omit the restatement.
A logistic regression on near-separation data (strong slope, classes almost separable) produces a large $\hat\beta_1$ with substantial coefficient uncertainty due to the weak signal-to-noise at the boundary. The two CIs for $\beta_1$:

| CI flavor | Lower | Upper | Width | Asymmetry |
|---|---|---|---|---|
| Wald-z (95%) | — | — | — | symmetric |
| Profile-LRT (95%) | — | — | — | upper half wider than lower half |

The profile-LRT CI is wider and asymmetric, with most of the extra width pushed to the right (large positive values). This is the correct picture: near separation, the log-likelihood is much flatter on the "more positive" side of $\hat\beta_1$ than on the "less positive" side, so a $\chi^2_1$-cutoff slice of the likelihood profile gives an asymmetric region. The Wald-z CI's symmetric form artificially compresses the upper bound, undercovering the truth on the high side. This is the "Wald boundary pathology" Topic 18 Rem 16 already named; the profile-LRT CI is the production-standard fix.
Take a single binary dataset and fit it under both the logit link (canonical) and the probit link (non-canonical). Each link gives a different $\hat\beta_j$ on its own scale (logit gives log-odds units; probit gives latent-Normal-z units), so the Wald-z $p$-value for $H_0\!: \beta_j = 0$ depends on which scale you fit:

| Link | $\hat\beta_j$ | Wald-z | $p$-value |
|---|---|---|---|
| logit | — | — | — |
| probit | — | — | — |
| LRT (deviance difference) | — | — | — (same for both links) |

The Wald-z $p$-values disagree across links — a worrying relative difference for a finding that hovers near the significance threshold. The LRT $p$-value is link-invariant: the deviance difference is the same regardless of which link parameterization you fit, because it depends only on the maximized likelihoods, which are identical across smooth reparameterizations (Topic 18 Thm 6). This is exactly the "GLM software defaults to LRT for nested-model comparisons" rationale that Topic 18 Rem 17 named — and it is one of the strongest practical reasons to prefer LRT over Wald when the choice is available.
Topic 19 §19.6 forward-promised that profile-likelihood CIs are the workhorse for GLM coefficients in production software — R's confint() on a glm object defaults to the profile-LRT, computed by refitting the model with $\beta_j$ fixed at a grid of values and tracing the deviance drop until the $\chi^2_{1,\,1-\alpha}$ threshold is crossed. Three reasons:

- Reparameterization invariance (Topic 18 Thm 6, Topic 19 Rem 8): the LRT $p$-value is invariant under smooth reparameterization, including the link function (Ex 11 above). Wald-z fails this invariance.
- Asymptotic efficiency via Wilks (Topic 18 Thm 4, Topic 19 §19.6): all three CIs (Wald, Score, LRT) are first-order asymptotically efficient, but LRT has better higher-order properties — Bartlett correction, refined coverage in moderate samples.
- Correct coverage at boundaries (Ex 10 above): near separation, near-zero variance components, and other boundary cases, profile-LRT respects the likelihood’s curvature; Wald-z does not.
The cost is computational: profile-LRT requires repeated constrained refits per coefficient (bisection on each side of $\hat\beta_j$), so for a GLM on moderate-sized data it costs many additional IRLS solves — not cheap but very tractable. The coefCIProfileGLM function in regression.ts implements the bisection; the DevianceTestExplorer right panel renders profile-LRT and Wald-z side-by-side for visual comparison.
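A hedged Python sketch of the profile-LRT CI via the offset trick: fix $\beta_j = b$ by moving $b\,x_j$ into the offset, refit the remaining coefficients, and bisect on the $\chi^2_{1,\,0.95}$ deviance cutoff. This is not the regression.ts implementation; helper names and the search span are illustrative, and the $\phi = 1$ case is assumed.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def profile_ci(y, X, j, family, level=0.95, span=5.0, tol=1e-4):
    """Profile-likelihood CI for coefficient j of a GLM (phi = 1 families)."""
    full = sm.GLM(y, X, family=family).fit()
    cutoff = chi2.ppf(level, df=1)
    X_rest = np.delete(X, j, axis=1)

    def excess_dev(b):                          # deviance above the full-model optimum
        fixed = sm.GLM(y, X_rest, family=family, offset=b * X[:, j]).fit()
        return fixed.deviance - full.deviance   # = 2 * [l(beta_hat) - l_prof(b)]

    se = np.sqrt(full.cov_params()[j, j])
    bounds = []
    for sign in (-1.0, +1.0):                   # bisect on each side of beta_hat_j
        inside, outside = full.params[j], full.params[j] + sign * span * se
        while abs(outside - inside) > tol:
            mid = 0.5 * (inside + outside)
            if excess_dev(mid) < cutoff:
                inside = mid                    # mid still inside the CI
            else:
                outside = mid
        bounds.append(inside)
    return min(bounds), max(bounds)
```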
22.9 Misspecification and the Sandwich Estimator
Topic 21 §21.9 Rem 19 forward-promised a treatment of HC-robust standard errors — the White (1980) sandwich variance estimator that gives consistent SEs under arbitrary heteroscedasticity, even when the homoscedasticity assumption of OLS is wrong. §22.9 discharges that promise, and generalizes it: the sandwich estimator works for any GLM (not just Normal+identity-link OLS) under any variance misspecification, as long as the mean specification is correct. The estimator is consistent because Wedderburn (1974) showed quasi-likelihood — fitting only on the mean–variance relationship without specifying the full distribution — gives a consistent point estimate regardless of higher-moment misspecification. Huber (1967) and White (1980) independently derived the sandwich form as the variance of this quasi-MLE. The three threads converged into modern QMLE / sandwich-SE practice.
Given a mean specification $\mu_i(\beta) = g^{-1}(x_i^\top\beta)$ and a variance function $V(\mu)$ (without committing to a full distributional family), the quasi-likelihood score is

$$U_Q(\beta) = \sum_{i=1}^n \frac{\partial\mu_i}{\partial\beta}\;\frac{y_i - \mu_i}{\phi\,V(\mu_i)}.$$

The quasi-maximum-likelihood estimator (QMLE) solves $U_Q(\hat\beta_Q) = 0$. When the canonical-link assumption holds, the quasi-score equals the true GLM score (Thm 1), so $\hat\beta_Q = \hat\beta$ — the same estimate, computed by the same IRLS algorithm, with the same point estimates. The difference is in the variance: the MLE assumes the full distribution is correct (giving $\widehat{\operatorname{Cov}}(\hat\beta) = (X^\top W X)^{-1}$), while the QMLE acknowledges the variance specification might be wrong and uses the sandwich estimator below.

Suppose the mean specification is correct: $\mathbb{E}[y_i \mid x_i] = g^{-1}(x_i^\top\beta_0)$ for some true $\beta_0$. The variance specification may be arbitrary (overdispersion, cluster-correlation, full distributional misspecification). Then under standard regularity conditions:

- Consistency. $\hat\beta_Q \xrightarrow{p} \beta_0$ as $n \to \infty$.
- Asymptotic normality with sandwich variance. $\sqrt{n}\,(\hat\beta_Q - \beta_0) \xrightarrow{d} \mathcal{N}\big(0,\ A^{-1} B A^{-1}\big)$, where $A$ is the per-observation expected negative Jacobian of the quasi-score (the "bread") and $B$ is the per-observation variance of the quasi-score (the "meat").

The sandwich form $A^{-1} B A^{-1}$ replaces the model-based $A^{-1}$ (which assumes $B = A$ — the "information identity" that holds under correct full-distribution specification). Empirical estimators of $B$ — HC0 through HC3 — give finite-sample sandwich SEs without requiring the true variance structure.
We state Thm 8 without full proof — the manipulation requires Topic 14's MLE-asymptotics machinery applied to the misspecified-likelihood setting (Huber 1967), then a uniform law of large numbers to turn the population $A$ and $B$ into their empirical counterparts (White 1980). The full derivation is Huber (1967) §3 + White (1980) §3 (~20 lines of algebra). The conceptual content of Thm 8 is what matters: point estimates depend only on mean specification; SE estimation is what varies. Get the mean right, use the sandwich, and your inference is robust to whatever variance distortion the true DGP imposes.
The four empirical sandwich estimators (HC0, HC1, HC2, HC3) differ only in finite-sample bias adjustments to the $\widehat B$ term:

- HC0 (White 1980): $\widehat B = \frac{1}{n}\sum_i \hat u_i^2\,x_i x_i^\top$, where $\hat u_i = y_i - \hat\mu_i$ is the estimated score residual — the raw plug-in estimate.
- HC1 (MacKinnon–White 1985): scale HC0 by $n/(n-p)$ to account for finite-sample shrinkage. A standard choice in R's vcovHC.
- HC2: divide each observation's contribution by $1 - h_{ii}$, where $h_{ii}$ is the leverage from §22.3's IRLS hat matrix.
- HC3: divide by $(1 - h_{ii})^2$ — the Davidson–MacKinnon "preferred" finite-sample estimator, recommended by default in small-to-moderate samples. Numerical stability requires capping $h_{ii}$ at 0.9999 to avoid blow-up at high-leverage observations (the regression.ts implementation enforces this with a console warning when triggered).

In production, HC1 and HC3 are the most-used variants; HC0 is the asymptotic-limit version used in derivations.
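A hand-rolled HC0/HC3 sketch for a canonical-link, $\phi = 1$ GLM, following Thm 8: bread $(X^\top W X)^{-1}$, meat built from per-observation score contributions $(y_i - \hat\mu_i)\,x_i$, with HC3's leverage adjustment and the 0.9999 cap from Rem 20. The function name is illustrative; statsmodels and R's sandwich package provide production versions.

```python
import numpy as np

def sandwich_cov(fit, X, y, kind="HC3"):
    """Sandwich covariance A^{-1} B A^{-1} for a fitted statsmodels GLM (phi = 1)."""
    mu = fit.fittedvalues
    W = fit.family.variance(mu)                      # V(mu_i), the canonical-link weights
    bread = np.linalg.inv(X.T @ (W[:, None] * X))    # (X' W X)^{-1}
    u = y - mu                                       # score contribution is u_i * x_i
    if kind == "HC3":
        Xw = np.sqrt(W)[:, None] * X
        h = np.minimum(np.einsum("ij,jk,ik->i", Xw, bread, Xw), 0.9999)  # capped leverages
        u = u / (1.0 - h)
    meat = (X * (u ** 2)[:, None]).T @ X             # sum_i u_i^2 x_i x_i'
    return bread @ meat @ bread

# robust SEs: np.sqrt(np.diag(sandwich_cov(fit, X, y, kind="HC3")))
```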
Generate overdispersed-Poisson responses with fixed true $\beta$ and an overdispersion ratio well above 1 (so the Poisson model is correct on the mean but wrong on the variance by that factor). Repeat over many simulation replications, refit each replication with glmFit(X, y, FAMILIES.poisson), and compute both the naive Wald CI and the HC3 sandwich CI for the slope at the 95% level:

| CI | Empirical coverage | Target | Gap (pp) |
|---|---|---|---|
| Naive Wald-z | — | 0.95 | — |
| HC3 sandwich | — | 0.95 | — |

Naive Wald undercovers severely — well below the nominal 95% — because its SE is too small by roughly the square root of the overdispersion factor. HC3 recovers nearly nominal coverage with no change to the point estimates. The §22.9 figure renders this directly: each simulation contributes two horizontal CI bars (one naive, one HC3), color-coded green if it covers $\beta_{\text{true}}$ or red if it misses. The naive bars miss roughly one in five times; the HC3 bars miss roughly one in twenty.
For a Gamma GLM, the variance function is $V(\mu) = \mu^2$ and the dispersion $\phi$ is estimated. Suppose the true DGP has a conditional variance twice what the fitted variance function implies (the dispersion is twice the value the model is implicitly assuming). Naive Wald CIs again undercover; HC3 sandwich CIs recover. The pattern is the same as Poisson: point estimates ($\hat\beta$ from IRLS) are unaffected by the dispersion misspecification because the score equation does not involve $\phi$; only the SE estimation needs adjustment. The SandwichCoverageSimulator exposes this with a "misspecifiedGamma" preset.
A useful diagnostic: plot Pearson and deviance residuals on the same axes for a fitted GLM. If both look unstructured, the model is well-specified. If Pearson residuals fan out but deviance residuals are clean, you have variance misspecification with correct mean — exactly the scenario the sandwich estimator handles. If deviance residuals show systematic structure (a U-shape, monotone trend, or fan), you have mean misspecification (wrong link, missing predictor, wrong functional form) and the sandwich does NOT save you — you need to fix the model. Figure 8 above shows the variance-misspecification flavor; the structural-misspecification version is what RegressionDiagnosticsExplorer on Topic 21 illustrates for the OLS case (and the same pattern carries over to GLMs).
Sandwich SEs handle variance misspecification but do nothing about influential outliers — single observations whose leverage and residual combine to dominate $\hat\beta$. The Huber-ified Poisson and bounded-influence logistic estimators (Cantoni & Ronchetti 2001) downweight high-Pearson-residual observations using M-estimator machinery from Topic 15 §15.10–§15.11. Topic 23's penalized GLMs (ridge, lasso) provide an alternative outlier-control mechanism via shrinkage of the coefficients themselves. Both are deferred from this topic; Thm 8's sandwich framework is the variance-misspecification fix, while these are the influence-control fixes.

The sandwich approach is the "model-agnostic" fix for Poisson overdispersion: keep the Poisson model, fix the SEs. The "model-correct" alternative is Negative Binomial regression — replace the Poisson response with a Negative Binomial whose extra dispersion parameter is estimated from the data. The full distribution allows likelihood-based inference (deviance, profile-LRT CIs, etc.) on top of the corrected variance. Standard tooling: MASS::glm.nb in R, statsmodels.discrete.NegativeBinomial in Python. formalML: Generalized linear models covers NegBin in the count-regression chapter.
The quasi-likelihood framework is older than the sandwich SE: Wedderburn (1974) introduced the idea that “specifying the mean–variance relationship is enough for consistent estimation” two years after Nelder–Wedderburn (1972) introduced GLMs. Huber (1967) had derived the misspecified-MLE asymptotic-variance form independently in the M-estimation context, and White (1980) made the empirical sandwich estimator standard in econometrics. Three threads, three authors, one consensus by the late 1980s: under correct mean specification, the QMLE / IRLS estimate is consistent, and the sandwich variance fixes the SE. McCullagh–Nelder (1989) Ch. 9 is the canonical book-length development; modern statsmodels and R-sandwich implementations follow that template directly.
The SandwichCoverageSimulator lets you watch the coverage gap close in real time. Pick a family (overdispersed Poisson / misspecified Gamma / clustered Bernoulli), set the misspecification magnitude with the slider, and click “Run 500” — each simulation contributes a CI bar to the top panel and updates the running coverage in the bottom panel. The naive bars consistently undercover; the HC3 bars consistently approach the nominal line.
Figure (SandwichCoverageSimulator): each run adds a pair of 95% CI bars (upper bar naive, lower bar HC3; green = covers $\beta_{\text{true}}$, red = misses) and updates the running coverage against the 0.95 target. Naive Wald undercovers under variance misspecification; HC3 sandwich recovers nominal coverage (Topic 22 §22.9 Rem 19).
22.10 Forward Map: Where GLMs Lead
Topic 22 establishes the core GLM framework: link function, exponential-family response, IRLS fitting, deviance inference, sandwich-SE robustness. The framework opens onto five major extensions, each of which is a separate topic or even a separate track. §22.10 is the catalog — one paragraph per extension.
The next Track 6 topic — Topic 23: Regularization & Penalized Estimation — generalizes IRLS to penalized log-likelihoods $\ell(\beta) - \lambda\,P(\beta)$, where $P$ is a penalty (squared $\ell_2$ norm for ridge, $\ell_1$ norm for lasso, a convex combination for elastic net). The L² penalty stabilizes IRLS near separation (Bernoulli) or in high-collinearity designs; the L¹ penalty induces sparsity in $\hat\beta$ and is the foundation of variable-selection in modern high-dimensional GLM applications. The reference implementation is glmnet (Friedman, Hastie & Tibshirani 2010) — a coordinate-descent algorithm that scales to $p$ in the millions and is the production workhorse for penalized GLMs in R, Python (sklearn.linear_model.LogisticRegression(penalty=...)), and Julia.

Adding a prior $\pi(\beta)$ on the coefficients and computing the posterior $\pi(\beta \mid y)$ converts the GLM into a Bayesian GLM. MAP estimation reduces to penalized MLE (Gaussian prior = ridge, Laplace prior = lasso); full posterior inference uses MCMC (Stan, PyMC, brms; NUTS is the default — Topic 26 §26.5) or variational methods. Generalized linear mixed models (GLMMs) add cluster-level or individual-level random effects to the linear predictor for hierarchical data; the marginal likelihood requires integrating out the random effects (Laplace approximation, AGHQ, MCMC). Track 7 is the Bayesian-foundations track; mixed-effects and hierarchical GLMs live there and at formalML: Mixed-effects models.

Replace the linear predictor $x_i^\top\beta$ with a sum of smooth nonparametric functions $\sum_j f_j(x_{ij})$ — each $f_j$ is fit by penalized basis-function expansion (cubic regression splines, thin-plate splines, P-splines). The result is a generalized additive model (Hastie & Tibshirani 1990): more flexible than a GLM (captures nonlinearities without prespecifying functional form), more interpretable than a black-box neural network (each $f_j$ is a 1D function you can plot). Standard tooling: mgcv::gam in R (Wood 2017), pyGAM in Python. Track 8 will cover the kernel-regression / nonparametric-regression machinery this builds on.
GLMMs are the right tool for clustered data — patients within hospitals, students within classrooms, repeat measurements within subjects. Random effects deliver partial pooling (estimates for sparse clusters borrow strength from other clusters via the random-effects prior), automatic regularization, and individual-level inference. The ML-scale treatment at formalML: Mixed-effects models covers lme4-style frequentist GLMMs, Stan-style Bayesian GLMMs, and the recommendation-system applications where random-effects structure (user / item / context) is the dominant modeling choice.
The IRLS step is the natural gradient in exponential-family information geometry (Amari 1985). For a GLM, the Fisher information matrix $\mathcal{I}(\beta)$ is a Riemannian metric on parameter space; the natural gradient is the ordinary gradient preconditioned by the inverse metric $\mathcal{I}(\beta)^{-1}$. A full natural-gradient step is IRLS, is Fisher scoring, is Newton's method on the canonical-link GLM. The deep-learning community rediscovered natural-gradient methods (K-FAC, Shampoo, neural tangent kernels) as approximations to this geometry on neural-network parameter spaces. formalML: Information geometry develops the framework.
Several practical response types extend the four-family GLM core:
- Multinomial logit — categorical response with $K$ levels; one logit per level relative to a reference. Standard for image classification, document classification, recommendation choice models.
- Ordinal regression (proportional odds, cumulative logit) — ordered categorical response (rating scales, survey responses).
- Zero-inflated and hurdle models — count response with excess zeros (insurance no-claim policies, social media accounts with zero posts).
- Tweedie family — continuous-positive response with point mass at zero (insurance pure-premium modeling, rainfall amounts).
The catalog reference is Agresti (2015) Ch. 6–7. formalML: Generalized linear models covers all of these in the practical-modeling chapter.
When $p > n$, $X^\top W X$ is rank-deficient and ordinary IRLS fails — the GLM analog of OLS failing when $p > n$. The Track-6 fix is Topic 23's penalized GLMs (lasso for sparse $\beta$); the asymptotic theory (consistency rates, minimax bounds, double descent for classification) lives at formalML: High-dimensional regression. The applied workflow — glmnet with cross-validation for penalty selection — is in Topic 23; the theoretical underpinnings are external.

Topic 22 closes Track 6's first half. The framework is now in place: linear predictor + link function + exponential-family response + IRLS + deviance inference + sandwich SEs. Linear regression is a GLM; the normal equations are the GLM score; OLS-as-orthogonal-projection is GLM IRLS with constant weights, converging in one step. Every claim Topic 21 made about Normal-linear-model estimation is the special case of Topic 22's framework where the family is Normal, the link is identity, and the dispersion is $\sigma^2$. Going forward, Topic 23 (penalization), Topic 24 (model selection), Track 7 (Bayesian extensions), and Track 8 (nonparametric generalizations) all build on this foundation. The forward map above lays out the routes.
References
- Nelder, J. A. & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society, Series A (General), 135(3), 370–384.
- McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Chapman & Hall/CRC.
- Agresti, A. (2015). Foundations of Linear and Generalized Linear Models. Wiley.
- Dobson, A. J. & Barnett, A. G. (2008). An Introduction to Generalized Linear Models (3rd ed.). Chapman & Hall/CRC.
- Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika, 61(3), 439–447.
- Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, 221–233. University of California Press.
- White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4), 817–838.
- Hosmer, D. W., Lemeshow, S. & Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.). Wiley.
- Fahrmeir, L., Kneib, T., Lang, S. & Marx, B. (2013). Regression: Models, Methods and Applications. Springer.
- Lehmann, E. L. & Romano, J. P. (2005). Testing Statistical Hypotheses (3rd ed.). Springer.
- Casella, G. & Berger, R. L. (2002). Statistical Inference (2nd ed.). Duxbury.
- Greene, W. H. (2012). Econometric Analysis (7th ed.). Pearson.