intermediate 45 min read · April 14, 2026

Maximum Likelihood Estimation

Likelihood function, score equation, consistency, asymptotic normality, efficiency — the workhorse of parametric inference.

14.1 The Likelihood Function

Topic 13 built the machinery for evaluating estimators — bias, variance, MSE, consistency, asymptotic normality, Fisher information, the Cramér–Rao bound. It did not tell us where estimators come from. For specific families we pulled the sample mean and sample variance out of the air and verified their properties. What we need now is a principle — a recipe that takes a parametric model and spits out an estimator with good properties whenever the regularity conditions hold. The likelihood principle is that recipe, and the maximum-likelihood estimator is the estimator it produces.

The idea is disarmingly simple. Suppose we have a parametric family $\{f(x; \theta) : \theta \in \Theta\}$ and an iid sample $X_1, \ldots, X_n$. For any candidate value $\theta$, the joint density evaluated at the data — $\prod_{i=1}^n f(X_i; \theta)$ — measures how “plausible” that $\theta$ is as a generator of the data we actually observed. Among all candidates, the maximum-likelihood estimator picks the one that makes the data most probable. It is the peak of a landscape indexed by $\theta$, and everything else in this topic is about that landscape: its shape, its curvature, the rate at which it sharpens as data accumulates, and how to climb it when no closed form is available.

Two panels side by side. Left: the likelihood L(θ) for a Bernoulli(p) sample, a sharply peaked curve on (0, 1) with its maximum at p̂. Right: the log-likelihood ℓ(θ) of the same sample, a concave curve with its maximum at the same p̂ — the logarithm flattens the peak without moving it.
Definition 1 Likelihood Function

Let $\{f(x; \theta) : \theta \in \Theta\}$ be a parametric family of densities (continuous case) or PMFs (discrete case), and let $\mathbf{x} = (x_1, \ldots, x_n)$ be an observed sample. The likelihood function is the joint density or PMF evaluated at the observed sample, viewed as a function of the parameter $\theta$ with the data held fixed:

$$L(\theta; \mathbf{x}) \;=\; \prod_{i=1}^n f(x_i; \theta).$$

When the data are understood we write $L(\theta)$ for short.

The likelihood has the same algebraic form as the joint density, but its role is inverted. The joint density is a function of the data with the parameter fixed — what probability do we assign to different samples under this $\theta$? The likelihood is a function of the parameter with the data fixed — how plausible is each $\theta$ in light of the data we observed? The two objects are numerically equal after evaluation, but we read one “across samples” and the other “across parameters.” This inversion is the entire conceptual move of likelihood-based inference.

Two warnings. First, $L(\theta)$ is not a probability density over $\theta$. It is not normalized, and the parameter $\theta$ is not random in the frequentist setup — it is a fixed (unknown) constant. Calling $L(\theta)$ a “probability of $\theta$” is a category error that will mislead you in every Bayesian calculation that follows later. Second, we can always rescale: $L(\theta)$ and $c \cdot L(\theta)$ for any $c > 0$ identify the same maximizer. What matters about the likelihood is its shape, not its height.

Definition 2 Log-Likelihood Function

The log-likelihood is

$$\ell(\theta) \;=\; \log L(\theta) \;=\; \sum_{i=1}^n \log f(x_i; \theta).$$

It is the sum (not the product) of per-observation log-densities, and it has the same maximizer as $L(\theta)$ because $\log$ is strictly increasing.

We will essentially never work with $L(\theta)$ directly. The log-likelihood $\ell(\theta)$ is better behaved in every way that matters: products become sums, the derivative is the score function, the second derivative is the (negative) observed Fisher information, and for exponential families the whole object is concave in the natural parameter. All of the asymptotic theory — consistency, asymptotic normality, efficiency — lives on the log-likelihood, not on the likelihood.

Remark 1 Why the logarithm: three reasons in one

The switch from $L$ to $\ell$ is not cosmetic. Three things happen simultaneously, and each is load-bearing.

(1) Products become sums. For iid data $L(\theta) = \prod_i f(X_i; \theta)$ is a product of $n$ terms; $\ell(\theta) = \sum_i \log f(X_i; \theta)$ is a sum. Sums are what every limit theorem in probability speaks to — the law of large numbers for sample averages, the central limit theorem for standardized sums. The transition $L \to \ell$ is what lets consistency and asymptotic normality inherit directly from the LLN and CLT applied to $\log f(X; \theta)$.

(2) Concavity (often). For exponential families in the natural parameter, $\ell(\theta)$ is strictly concave: it has a unique maximizer, and any local maximum is the global one. The likelihood $L(\theta) = e^{\ell(\theta)}$ is not concave, just log-concave — a strictly weaker property. Working in log-space is what makes uniqueness and Newton-Raphson convergence arguments tractable.

(3) Numerical stability. Products of many small probabilities underflow to zero in floating point. A sample of $n = 1000$ Normal observations produces an $L$ that is effectively zero — but $\ell$ is a perfectly ordinary number on the order of $-n$. Anyone who has fit a model to more than a few hundred data points has implicitly used $\ell$ for this reason alone.
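The underflow point is easy to verify directly. A minimal stdlib sketch (the standard-Normal sample and the `norm_pdf` helper are ours, not from any library):

```python
import math
import random

random.seed(0)
n = 1000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]  # a standard-Normal sample

def norm_pdf(x, mu=0.0, sigma=1.0):
    # per-observation Normal density f(x; mu)
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Likelihood: a product of 1000 numbers, each below 0.4 -> underflows to exactly 0.0.
L = 1.0
for x in xs:
    L *= norm_pdf(x)

# Log-likelihood: an ordinary finite sum, around -1.42 * n for standard-Normal data.
ell = sum(math.log(norm_pdf(x)) for x in xs)

print(L, ell)
```

The product is lost to floating point long before $n = 1000$; the sum of logs is perfectly stable.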

Example 1 Likelihood and log-likelihood for Normal(μ, σ²) with σ² known

Let $X_1, \ldots, X_n$ be iid $\mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known. The per-observation density is $f(x; \mu) = (2\pi\sigma^2)^{-1/2} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$. Taking the product over $i$ and then the logarithm:

$$\ell(\mu) \;=\; -\frac{n}{2}\log(2\pi\sigma^2) \;-\; \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2.$$

The first term does not depend on $\mu$, so we can drop it when maximizing; what remains is the negative sum of squared residuals. Maximizing $\ell(\mu)$ is therefore the same as minimizing $\sum_i (X_i - \mu)^2$. Least squares and maximum likelihood, for Normal-error models, are the same procedure seen from two angles — one statistical, one geometric.

Example 2 Likelihood for Bernoulli(p)

Let $X_1, \ldots, X_n$ be iid $\operatorname{Bernoulli}(p)$. The PMF is $f(x; p) = p^x (1-p)^{1-x}$ for $x \in \{0, 1\}$. Writing $k = \sum_i X_i$ for the number of successes,

$$L(p) \;=\; p^k (1-p)^{n-k}, \qquad \ell(p) \;=\; k \log p + (n-k) \log(1-p).$$

The likelihood is a polynomial in $p$ of degree $n$; the log-likelihood is linear in $\log p$ and $\log(1-p)$. The shape of $\ell(p)$ — a single peak somewhere between $0$ and $1$, concave — is already visible in the formula without any calculus.
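The single concave peak is easy to see numerically. A stdlib sketch with hypothetical counts ($n = 50$, $k = 18$ are illustrative, not from the text), grid-maximizing $\ell(p)$:

```python
import math

n, k = 50, 18  # hypothetical data: 18 successes in 50 trials

def ell(p):
    # Bernoulli log-likelihood: k log p + (n - k) log(1 - p)
    return k * math.log(p) + (n - k) * math.log(1 - p)

# Grid search over (0, 1); strict concavity means the grid maximum is the global one.
grid = [i / 10000 for i in range(1, 10000)]
p_hat = max(grid, key=ell)
print(p_hat)  # 0.36, i.e. k/n — anticipating Example 5
```

No calculus was used here; the peak at the sample proportion falls out of the shape of $\ell$ alone.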

Example 3 Likelihood for Exponential(λ)

Let $X_1, \ldots, X_n$ be iid $\operatorname{Exponential}(\lambda)$ with density $f(x; \lambda) = \lambda e^{-\lambda x}$ for $x > 0$. Then

$$L(\lambda) \;=\; \lambda^n \exp\!\left(-\lambda \sum_{i=1}^n X_i\right), \qquad \ell(\lambda) \;=\; n \log \lambda - \lambda n \bar{X}_n.$$

Again a clean closed-form log-likelihood, linear in $\lambda$ apart from the $n \log \lambda$ growth term. For any sample with $\bar{X}_n > 0$, $\ell$ is strictly concave in $\lambda$ (its second derivative is $-n/\lambda^2 < 0$), so the maximizer is unique.

14.2 The Maximum Likelihood Estimator

With the log-likelihood in hand, the estimator defines itself.

Definition 3 Maximum Likelihood Estimator

The maximum-likelihood estimator of $\theta$ is any value $\hat{\theta}_n \in \Theta$ that maximizes $L(\theta)$ — equivalently, that maximizes $\ell(\theta)$:

$$\hat{\theta}_n \;\in\; \arg\max_{\theta \in \Theta} \ell(\theta).$$

When the maximizer is unique, we write $\hat{\theta}_n = \arg\max_\theta \ell(\theta)$. When multiple maxima exist — e.g., on the boundary of $\Theta$, or at non-smooth points — we pick any of them; the finite-sample behavior can differ, but the asymptotic theory is unaffected.

Two things to notice. The MLE is defined as a maximizer, not a root of the score equation: boundary solutions and points where the score is undefined are still MLEs when the global maximum lives there. For the well-behaved cases that dominate practice — compact or open parameter space, smooth $\ell$, unique interior maximum — the MLE is the unique root of the score equation, which is what Theorem 1 below captures. We will mostly work in this setting and flag boundary complications when they arise.

Two panels side by side. Left: the log-likelihood ℓ(θ) for a Bernoulli sample as a concave curve; a dashed vertical line marks its peak at p̂. Right: the score function S(θ) = ℓ'(θ) crossing zero at the same p̂, with positive values to the left and negative values to the right — the geometric content of 'zero-score = maximum-likelihood'.
Theorem 1 Score equation characterizes the interior MLE

Suppose $\Theta \subseteq \mathbb{R}$ is an open interval, $\ell(\theta)$ is differentiable on $\Theta$, and the global maximizer $\hat{\theta}_n$ lies in the interior of $\Theta$. Then $\hat{\theta}_n$ satisfies the score equation

$$S_n(\hat{\theta}_n) \;=\; \sum_{i=1}^n s(\hat{\theta}_n; X_i) \;=\; 0,$$

where $s(\theta; x) = \partial_\theta \log f(x; \theta)$ is the per-observation score.

This is calculus: at an interior maximum of a differentiable function, the derivative is zero. Definition 10 in Point Estimation defined $s$ and $S_n$; the novelty in Topic 14 is that we now solve the score equation for $\theta$ instead of studying its expectation at a fixed $\theta$. The reader should treat Theorem 1 as a practical recipe: to compute an MLE, write down $\ell(\theta)$, differentiate with respect to $\theta$, set the result to zero, and solve.

The next four examples run this recipe for the four simplest families. They are the workhorses of every applied statistics course, and they are worth deriving cleanly because Topic 14 will invoke them repeatedly.

Example 4 MLE of the Normal mean: θ̂ = X̄

From Example 1, $\ell(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i (X_i - \mu)^2$. Differentiating,

$$\frac{d\ell}{d\mu} \;=\; \frac{1}{\sigma^2} \sum_{i=1}^n (X_i - \mu).$$

Setting the derivative to zero gives $\sum_i X_i = n\mu$, so

$$\hat{\mu}_{\text{MLE}} \;=\; \bar{X}_n.$$

The second derivative $-n/\sigma^2 < 0$ confirms this is a maximum, not a minimum. The MLE of the Normal mean is the sample mean. Every other Topic 13 property of $\bar{X}_n$ — unbiasedness, variance $\sigma^2/n$, asymptotic normality with rate $\sqrt{n}$, efficiency — is now a property of the MLE.
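A quick numerical sanity check on this example, under simulated data (the choices $\mu = 10$, $\sigma^2 = 4$, $n = 200$ are arbitrary): the log-likelihood evaluated at $\bar{X}_n$ beats every perturbed value.

```python
import math
import random

random.seed(1)
sigma2 = 4.0  # known variance
xs = [random.gauss(10.0, math.sqrt(sigma2)) for _ in range(200)]
xbar = sum(xs) / len(xs)

def ell(mu):
    # Normal log-likelihood with sigma^2 known (Example 1)
    return (-len(xs) / 2 * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

# The sample mean maximizes ell: any perturbation lowers the log-likelihood.
assert all(ell(xbar) > ell(xbar + eps) for eps in (-1.0, -0.01, 0.01, 1.0))
print(xbar)
```

This is not a proof, of course — the second-derivative argument above is — but it is the kind of two-minute check worth running whenever a derivation produces a closed-form MLE.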

Example 5 MLE of the Bernoulli proportion: p̂ = k/n

From Example 2, $\ell(p) = k \log p + (n-k) \log(1-p)$ with $k = \sum_i X_i$. Differentiating,

$$\frac{d\ell}{dp} \;=\; \frac{k}{p} - \frac{n-k}{1-p}.$$

Setting this to zero: $k(1-p) = (n-k)p$, so $k - kp = np - kp$, and therefore $k = np$, giving

$$\hat{p}_{\text{MLE}} \;=\; \frac{k}{n} \;=\; \bar{X}_n.$$

The sample proportion is the MLE — unsurprising in hindsight, and intuitive. The second derivative is $-k/p^2 - (n-k)/(1-p)^2 < 0$ for $p \in (0, 1)$, confirming a maximum. At the boundary $k = 0$ or $k = n$ the MLE is $0$ or $1$, at the edge of the parameter space — a clean example of a boundary solution.

Example 6 MLE of the Exponential rate: λ̂ = 1/X̄

From Example 3, $\ell(\lambda) = n \log \lambda - \lambda n \bar{X}_n$. Differentiating,

$$\frac{d\ell}{d\lambda} \;=\; \frac{n}{\lambda} - n\bar{X}_n.$$

Setting this to zero gives $\lambda = 1/\bar{X}_n$:

$$\hat{\lambda}_{\text{MLE}} \;=\; \frac{1}{\bar{X}_n}.$$

Note that the MLE is a nonlinear function of $\bar{X}_n$. Because $1/x$ is convex, Jensen’s inequality implies $\mathbb{E}[1/\bar{X}_n] > 1/\mathbb{E}[\bar{X}_n] = \lambda$, so the MLE is biased for finite $n$. We will recover bias-free behavior asymptotically, but this is a useful reminder that MLEs are not generally unbiased; they are consistent and asymptotically efficient, not unbiased in finite samples.
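The Jensen bias is visible in a short simulation. For Exponential data the exact finite-sample mean happens to be available — $\sum_i X_i \sim \operatorname{Gamma}(n, \lambda)$ gives $\mathbb{E}[1/\bar{X}_n] = \lambda\, n/(n-1)$ — so the Monte-Carlo average can be checked against it. A sketch (the specific $\lambda$, $n$, and rep counts are arbitrary):

```python
import random

random.seed(2)
lam, n, reps = 2.0, 10, 20000  # deliberately small n to make the bias visible

# Monte Carlo estimate of E[1/Xbar_n]
total = 0.0
for _ in range(reps):
    xbar = sum(random.expovariate(lam) for _ in range(n)) / n
    total += 1 / xbar
mean_mle = total / reps

print(mean_mle)  # near lam * n/(n-1) = 2.222..., visibly above lam = 2.0
```

At $n = 10$ the upward bias is about 11%; by $n = 100$ it is about 1%, previewing the asymptotic story.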

Example 7 MLE of the Poisson rate: λ̂ = X̄

For iid $\operatorname{Poisson}(\lambda)$ data with PMF $f(x; \lambda) = \lambda^x e^{-\lambda}/x!$,

$$\ell(\lambda) \;=\; \Big(\sum_{i=1}^n X_i\Big) \log \lambda - n\lambda - \sum_{i=1}^n \log(X_i!).$$

The final $\log(X_i!)$ term does not depend on $\lambda$. Differentiating the rest,

$$\frac{d\ell}{d\lambda} \;=\; \frac{\sum_i X_i}{\lambda} - n,$$

and setting this to zero gives $\hat{\lambda}_{\text{MLE}} = \bar{X}_n$. The Poisson MLE is the sample mean — the same estimator as for the Normal mean and the Bernoulli proportion. This is not a coincidence: in each case the natural sufficient statistic is $\sum_i X_i$, and the MLE is its normalized version. Exponential families make this pattern universal (Exponential Families, §7.8).

Remark 2 When no closed-form MLE exists

The four examples above all give the MLE as a clean function of the sample mean. That is a feature of their exponential-family structure, not a general property of MLE. For most parametric families — Gamma shape, beta shape, Weibull, Cauchy location, logistic regression, hidden Markov models, neural networks — the score equation has no closed-form solution, and the MLE must be computed numerically. Section 14.7 develops Newton-Raphson for exactly this situation: write down the score and the observed Fisher information, pick a starting value, iterate $\theta^{(t+1)} = \theta^{(t)} + S(\theta^{(t)}) / J(\theta^{(t)})$, and stop when the steps become small. Every optimization method in modern machine learning — SGD, Adam, L-BFGS — is a variation on this theme applied to the log-likelihood or a regularized variant.
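The iteration in Remark 2 fits in a few lines. A minimal scalar sketch (the function names are ours, not from any library), tested on the Exponential family where the closed-form answer $1/\bar{X}_n$ is available to check against:

```python
def newton_raphson_mle(score, obs_info, theta0, tol=1e-10, max_iter=100):
    """Iterate theta <- theta + S(theta)/J(theta) until the step is tiny."""
    theta = theta0
    for _ in range(max_iter):
        step = score(theta) / obs_info(theta)
        theta += step
        if abs(step) < tol:
            return theta
    raise RuntimeError("Newton-Raphson did not converge")

# Exponential(lambda): S(l) = n/l - n*xbar; observed information J(l) = -ell''(l) = n/l^2.
n, xbar = 40, 0.8  # hypothetical sample summary
lam_hat = newton_raphson_mle(score=lambda l: n / l - n * xbar,
                             obs_info=lambda l: n / l ** 2,
                             theta0=0.5)
print(lam_hat)  # converges to the closed-form MLE 1/xbar = 1.25
```

For families with no closed form (Gamma shape, logistic regression), the same loop applies unchanged — only the `score` and `obs_info` callables differ.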

Before moving on, explore the log-likelihood landscape for these four families. The panel below draws a sample of size $n$ from the true parameter $\theta_0$, plots $\ell(\theta)$, and overlays the quadratic approximation $\ell(\hat\theta) - \tfrac{1}{2} n I(\hat\theta)(\theta - \hat\theta)^2$ together with the 95% Wald confidence interval $\hat\theta \pm 1.96/\sqrt{nI(\hat\theta)}$. Sweep $n$ from $5$ to $500$ and watch the likelihood sharpen — the curvature at the peak (Fisher information) grows linearly with $n$, so the Wald interval shrinks like $1/\sqrt{n}$.

Interactive: Likelihood Surface Explorer
Watch the log-likelihood sharpen around the MLE as n grows. The amber dashed curve is the quadratic approximation at θ̂, whose curvature is the observed Fisher information; the green bar on the right is the 95% Wald confidence interval.

14.3 Functional Invariance

The next property is the one the reader is most likely to have used without ever proving — or even articulating as a theorem.

Theorem 2 Functional invariance of the MLE

Let $\hat{\theta}_n$ be the MLE of $\theta$, and let $g : \Theta \to \mathcal{G}$ be any function (not necessarily one-to-one). Then the MLE of $\phi = g(\theta)$ is

$$\hat{\phi}_n \;=\; g(\hat{\theta}_n).$$

Invariance is what lets you say “the MLE of the standard deviation is the square root of the MLE of the variance” without a separate derivation. It is what lets a neural-network practitioner say “the MLE of the sigmoid-output probability is $\sigma(x^\top \hat\beta)$” once we have the MLE $\hat\beta$. It is the glue between the MLE in its natural parameterization and the MLE in whatever parameterization the application requires.

The proof for one-to-one $g$ is a change-of-variables argument. Let $\phi = g(\theta)$, so $\theta = g^{-1}(\phi)$. The likelihood in the new parameterization is $L^*(\phi) = L(g^{-1}(\phi))$, and its maximizer is the $\phi$ such that $g^{-1}(\phi) = \hat{\theta}_n$, i.e., $\phi = g(\hat{\theta}_n)$. For non-one-to-one $g$, the standard device is the profile likelihood: for each $\phi$, maximize $L(\theta)$ over $\{\theta : g(\theta) = \phi\}$, then maximize the result over $\phi$. Under this definition the invariance property extends without change. Casella & Berger (2002) §7.2.2 gives the detailed argument; the one-to-one case is enough for everything in this topic.

Two panels stacked vertically. Top: log-likelihood ℓ(σ²) as a function of the Normal variance σ², peaking at σ̂². Bottom: log-likelihood ℓ*(σ) for the same data after reparameterizing by σ = √(σ²), peaking at σ̂ = √(σ̂²). The two maximizers are connected by a dashed line showing that invariance transports one maximum directly to the other.
Example 8 Invariance: σ² → σ (standard deviation from variance)

If $\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_i (X_i - \bar{X}_n)^2$ is the MLE of the Normal variance $\sigma^2$ (as derived from the two-parameter Normal log-likelihood by zeroing the score in $\sigma^2$), then by invariance the MLE of the standard deviation $\sigma = \sqrt{\sigma^2}$ is

$$\hat{\sigma}_{\text{MLE}} \;=\; \sqrt{\hat{\sigma}^2_{\text{MLE}}} \;=\; \sqrt{\tfrac{1}{n}\sum_i (X_i - \bar{X}_n)^2}.$$

No separate maximization is needed. Notice that $\hat{\sigma}_{\text{MLE}}$ is biased even though $\hat{\sigma}^2_{\text{MLE}}$ would be unbiased if we used $n-1$ instead of $n$; invariance preserves the MLE but does not preserve unbiasedness under nonlinear transformations. The two properties simply track different things.

Example 9 Invariance: p → odds p/(1−p)

In Bernoulli data, the MLE of $p$ is $\hat{p} = \bar{X}_n$. The odds are $\phi(p) = p/(1-p)$, a strictly increasing transformation on $(0, 1)$ (so the one-to-one case applies). By invariance, the MLE of the odds is

$$\hat{\phi}_{\text{MLE}} \;=\; \frac{\hat{p}}{1 - \hat{p}} \;=\; \frac{\bar{X}_n}{1 - \bar{X}_n}.$$

The log-odds are the natural parameter of the Bernoulli family: $\eta = \log\!\frac{p}{1-p}$ is the logit link that logistic regression uses. Invariance is what makes “logistic regression on $\eta$” equivalent to “MLE of $p$, then take the log-odds” — the two routes agree at the estimate.
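Example 9 can be replayed numerically: maximize the likelihood on the $p$ scale and on the odds scale separately, and check that the two maximizers agree through $g$. A stdlib sketch with hypothetical counts ($n = 40$, $k = 25$ are illustrative):

```python
import math

n, k = 40, 25  # hypothetical Bernoulli data: 25 successes in 40 trials

def ell(p):
    # Bernoulli log-likelihood from Example 2
    return k * math.log(p) + (n - k) * math.log(1 - p)

# Base scale: grid-maximize ell(p) over (0, 1).
p_hat = max((i / 100000 for i in range(1, 100000)), key=ell)

# Odds scale: grid-maximize ell(g^{-1}(phi)) with g(p) = p/(1-p), i.e. p = phi/(1+phi).
phi_hat = max((i / 1000 for i in range(1, 20000)),
              key=lambda phi: ell(phi / (1 + phi)))

# Two separate maximizations, one answer: phi_hat matches g(p_hat) = (25/40)/(15/40) = 5/3.
print(p_hat / (1 - p_hat), phi_hat)
```

The second maximization is redundant in practice — that is exactly the content of Theorem 2 — but seeing the two grids agree makes the theorem concrete.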

The explorer below shows invariance live: choose a transformation gg, and watch the MLE on the base scale (left) transport to the MLE on the transformed scale (right). The two panels are two views of the same likelihood, not two separate maximizations.

Interactive: Invariance Explorer
The same likelihood on two parameter scales. Left panel: ℓ(θ) peaks at the MLE θ̂. Right panel: ℓ*(φ) = ℓ(g⁻¹(φ)) peaks at g(θ̂). The two maximizers are related by the transformation — you do not need a second maximization to transport the estimate.
Remark 3 Invariance is about the MLE, not its distribution

Invariance transports the point estimate under $g$, but it does not transport the sampling distribution. If $\sqrt{n}(\hat{\theta}_n - \theta_0) \overset{d}{\to} \mathcal{N}(0, 1/I(\theta_0))$, what happens to $\sqrt{n}(g(\hat{\theta}_n) - g(\theta_0))$? Theorem 7 in Topic 13 (delta method) gives the answer:

$$\sqrt{n}\,(g(\hat{\theta}_n) - g(\theta_0)) \;\overset{d}{\to}\; \mathcal{N}\!\left(0, \frac{[g'(\theta_0)]^2}{I(\theta_0)}\right).$$

So the asymptotic variance scales by $[g'(\theta_0)]^2$. A consequence: the Fisher information transforms tensorially, and efficient estimators in one parameterization remain efficient under smooth reparameterization. Invariance is about the point; the delta method is about the spread.
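Remark 3 in simulation: for Bernoulli data and $g(p) = \log\frac{p}{1-p}$, the delta method predicts asymptotic variance $[g'(p_0)]^2/I(p_0) = 1/(p_0(1-p_0))$, since $g'(p) = 1/(p(1-p))$ and $I(p) = 1/(p(1-p))$. A Monte-Carlo sketch (the parameters are illustrative choices):

```python
import math
import random

random.seed(3)
p0, n, reps = 0.3, 400, 5000

def logit(p):
    return math.log(p / (1 - p))

target = 1 / (p0 * (1 - p0))  # delta-method asymptotic variance, ~4.76

draws = []
for _ in range(reps):
    k = sum(random.random() < p0 for _ in range(n))
    draws.append(math.sqrt(n) * (logit(k / n) - logit(p0)))

m = sum(draws) / reps
mc_var = sum((d - m) ** 2 for d in draws) / reps
print(mc_var, target)  # the two should be close at this n
```

At $n = 400$ with $p_0 = 0.3$, boundary samples ($k = 0$ or $k = n$, where the logit is undefined) are vanishingly unlikely; for small $n$ or extreme $p_0$ the sketch would need to guard against them.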

14.4 Consistency

The first asymptotic property is the most basic. As the sample size grows, does the MLE settle on the true parameter?

Four overlaid log-likelihood curves ℓ(θ) normalized to their own maxima, for n = 10, 50, 200, 1000 samples from the same model. As n grows, the curve sharpens dramatically around the true parameter θ₀ (marked with a vertical line). The maximizers θ̂ₙ cluster ever more tightly around θ₀ as n increases — consistency visualized.

Before we can even state consistency, we need a condition that rules out a pathology: two different parameter values giving the same distribution. If $f(x; \theta_0) = f(x; \theta_1)$ for all $x$ with $\theta_0 \neq \theta_1$, no amount of data can tell the parameters apart, and no estimator can be consistent.

Definition 4 KL-identifiability

The family $\{f(x; \theta) : \theta \in \Theta\}$ is Kullback–Leibler identifiable at $\theta_0$ if for every $\theta \neq \theta_0$ in $\Theta$,

$$D_{\text{KL}}(f_{\theta_0} \,\|\, f_\theta) \;=\; \mathbb{E}_{\theta_0}\!\left[\log \frac{f(X; \theta_0)}{f(X; \theta)}\right] \;>\; 0.$$

Equivalently: $\theta = \theta_0$ is the unique maximizer of $\theta \mapsto \mathbb{E}_{\theta_0}[\log f(X; \theta)]$.

The KL divergence is zero only when the two distributions agree almost surely, so KL-identifiability is the minimal non-degeneracy we need. Every standard parametric family — Normal, Bernoulli, Exponential, Poisson, Gamma, Beta, exponential families in general — is KL-identifiable on its natural parameter space. Non-identifiability shows up in mixture models (label-switching), overparameterized neural networks (many weight configurations realize the same function), and non-identified latent-variable models — all genuine problems that the MLE machinery as stated below does not handle.

Theorem 3 Consistency of the MLE

Let $X_1, X_2, \ldots$ be iid from $f(x; \theta_0)$ where $\theta_0 \in \Theta \subseteq \mathbb{R}$. Assume (i) $\Theta$ is compact, (ii) the family is KL-identifiable at $\theta_0$, and (iii) $\theta \mapsto \log f(x; \theta)$ is continuous in $\theta$ for every $x$ and dominated by an integrable function. Then the MLE $\hat{\theta}_n$ is weakly consistent:

$$\hat{\theta}_n \;\overset{P}{\to}\; \theta_0.$$
Proof.

Define the population log-likelihood

$$M(\theta) \;=\; \mathbb{E}_{\theta_0}[\log f(X; \theta)],$$

the thing the sample log-likelihood $M_n(\theta) = \frac{1}{n}\ell(\theta) = \frac{1}{n}\sum_i \log f(X_i; \theta)$ is trying to estimate. By KL-identifiability, $\theta_0$ is the unique maximizer of $M$: for any $\theta \neq \theta_0$,

$$M(\theta_0) - M(\theta) \;=\; \mathbb{E}_{\theta_0}\!\left[\log \frac{f(X; \theta_0)}{f(X; \theta)}\right] \;=\; D_{\text{KL}}(f_{\theta_0} \,\|\, f_\theta) \;>\; 0.$$

This is the Gibbs inequality: the true density maximizes the expected log-likelihood among all densities. By the strong law of large numbers from Topic 10, applied pointwise to the iid sequence $\{\log f(X_i; \theta)\}_{i=1}^\infty$ at every fixed $\theta$,

$$M_n(\theta) \;=\; \frac{1}{n}\sum_{i=1}^n \log f(X_i; \theta) \;\overset{a.s.}{\to}\; M(\theta).$$

The next step is the argument that pointwise convergence plus compactness plus continuity upgrades to uniform convergence on $\Theta$:

$$\sup_{\theta \in \Theta} \lvert M_n(\theta) - M(\theta) \rvert \;\overset{P}{\to}\; 0.$$

This is a standard Glivenko–Cantelli-type argument: partition $\Theta$ into finitely many small neighborhoods, use the pointwise LLN on the centers, and bound oscillations within neighborhoods by the dominating function. Van der Vaart (1998) Theorem 5.7 gives the detailed execution; the only ingredients are the three assumptions above.

Now let $\varepsilon > 0$ and consider the open neighborhood $U_\varepsilon = \{\theta : |\theta - \theta_0| < \varepsilon\}$. The complement $\Theta \setminus U_\varepsilon$ is compact, so $M$ attains its maximum there at some $\theta^* \neq \theta_0$; by identifiability,

$$\delta \;:=\; M(\theta_0) - \sup_{\theta \in \Theta \setminus U_\varepsilon} M(\theta) \;>\; 0.$$

On the event $\{\sup_\theta |M_n - M| < \delta/2\}$ — which has probability tending to one by uniform convergence — any maximizer $\hat{\theta}_n$ of $M_n$ must lie in $U_\varepsilon$: if it lay outside, then

$$M_n(\hat{\theta}_n) \;\leq\; M(\hat{\theta}_n) + \delta/2 \;\leq\; M(\theta_0) - \delta + \delta/2 \;=\; M(\theta_0) - \delta/2 \;<\; M_n(\theta_0),$$

contradicting the maximality of $\hat{\theta}_n$ ($M_n(\hat\theta_n) \geq M_n(\theta_0)$ by definition of the maximizer). Therefore $P(\hat\theta_n \in U_\varepsilon) \to 1$, i.e., $\hat\theta_n \overset{P}{\to} \theta_0$. $\quad\blacksquare$

The two ingredients are worth abstracting. The Gibbs inequality — $\mathbb{E}_{\theta_0}[\log f(X; \theta_0)] \geq \mathbb{E}_{\theta_0}[\log f(X; \theta)]$, with equality only at $\theta = \theta_0$ — says that the true parameter is the unique maximizer of the population log-likelihood. The LLN says the sample log-likelihood converges to the population log-likelihood. Consistency is just “the argmax of a uniform limit is the argmax of the limit.” Every modern MLE consistency proof is this argument, with different amounts of care invested in the uniform-convergence step for complex parameter spaces.
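Theorem 3 in miniature: the mean absolute error of the MLE shrinks as $n$ grows. A Monte-Carlo sketch using the Exponential rate from Example 6 (the rep counts and sample sizes are arbitrary choices):

```python
import random

random.seed(4)
lam0, reps = 2.0, 2000

def mean_abs_error(n):
    # average |MLE - lam0| over many replications at sample size n
    total = 0.0
    for _ in range(reps):
        xbar = sum(random.expovariate(lam0) for _ in range(n)) / n
        total += abs(1 / xbar - lam0)
    return total / reps

errors = [mean_abs_error(n) for n in (10, 100, 1000)]
print(errors)  # strictly decreasing: the MLE concentrates around lam0 = 2.0
```

Consistency promises only this qualitative shrinkage; the rate of shrinkage ($1/\sqrt{n}$, as the printed errors suggest) is the subject of §14.5.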

Remark 4 Regularity conditions: what they buy you

The three conditions in Theorem 3 — compact $\Theta$, KL-identifiability, continuity with domination — are called regularity conditions. They are standard in most parametric problems, and they can be relaxed in a number of directions:

  • Compactness of $\Theta$: Usually replaced by “the MLE eventually lies in a compact subset of $\Theta$ with probability one,” which holds under consistency of a coarser preliminary estimator. For open $\Theta$ (e.g., $\Theta = \mathbb{R}$ for the Normal mean), this is almost always true, but the proof becomes messier.
  • Identifiability: Non-negotiable. Label-switching in mixture models, equivalent parameterizations in neural networks, and unobservable components in latent-variable models all violate identifiability in different ways. The MLE in these settings is consistent only for the identified quotient — e.g., the set of mixture distributions, not the parameter vector of component labels.
  • Continuity and domination: Continuity of $\log f$ in $\theta$ is almost always immediate. Domination by an integrable function (so expectations are finite and the uniform LLN applies) can fail for heavy-tailed families, which is why the MLE story becomes more delicate for Cauchy and similar distributions.

When the conditions hold, they buy you consistency for free — no explicit convergence rate, no finite-sample guarantees, but the estimator does eventually home in on the truth. Rates and finite-sample behavior are the job of asymptotic normality (§14.5) and large-deviations-type bounds (Topic 12).

14.5 Asymptotic Normality

Consistency says $\hat\theta_n$ converges to $\theta_0$. Asymptotic normality says how fast and with what shape. Under regularity, the answer is canonical — the sampling distribution, standardized by $\sqrt{n}$, is Gaussian with a variance we can write down exactly.

Three-panel figure. Left: histograms of MLE draws at n = 10, 100, 1000 sample sizes — increasingly concentrated around θ₀ and increasingly Gaussian. Middle: the same MLEs standardized by √n, overlaid with a N(0, 1/I(θ₀)) density; the standardized histograms collapse onto the limiting density. Right: a QQ-plot of standardized MLEs against Normal quantiles, essentially linear for n ≥ 100.
Theorem 4 Asymptotic normality of the MLE

Let $X_1, X_2, \ldots$ be iid from $f(x; \theta_0)$ on an open interval $\Theta$, with $\hat\theta_n$ the MLE. Assume the conditions of Theorem 3 for consistency, plus (iv) $\theta \mapsto \log f(x; \theta)$ is twice continuously differentiable in a neighborhood of $\theta_0$ with dominated second derivative, and (v) the Fisher information $I(\theta_0) \in (0, \infty)$. Then

$$\sqrt{n}\,(\hat\theta_n - \theta_0) \;\overset{d}{\to}\; \mathcal{N}\!\left(0, \frac{1}{I(\theta_0)}\right).$$
Proof.

Write $\ell_n(\theta)$ for the sample log-likelihood and $S_n(\theta) = \ell_n'(\theta)$ for the total score. Because $\hat\theta_n$ maximizes $\ell_n$ in the interior of $\Theta$ (which happens with probability tending to one by consistency) and $\ell_n$ is twice differentiable, $S_n(\hat\theta_n) = 0$. By the mean value theorem, expand $S_n$ around $\theta_0$: for some $\tilde\theta_n$ between $\hat\theta_n$ and $\theta_0$,

$$0 \;=\; S_n(\hat\theta_n) \;=\; S_n(\theta_0) + (\hat\theta_n - \theta_0)\, S_n'(\tilde\theta_n).$$

Solve for $\hat\theta_n - \theta_0$:

$$\hat\theta_n - \theta_0 \;=\; -\, \frac{S_n(\theta_0)}{S_n'(\tilde\theta_n)}.$$

Multiply both sides by $\sqrt{n}$ and reorganize:

$$\sqrt{n}\,(\hat\theta_n - \theta_0) \;=\; \frac{(1/\sqrt{n})\, S_n(\theta_0)}{-(1/n)\, S_n'(\tilde\theta_n)}.$$

The numerator is a standardized sum of iid scores. By Theorem 8 in Point Estimation, the score has mean zero and variance $I(\theta_0)$ under $\theta_0$, so the central limit theorem gives

$$\frac{1}{\sqrt{n}} S_n(\theta_0) \;=\; \frac{1}{\sqrt{n}} \sum_{i=1}^n s(\theta_0; X_i) \;\overset{d}{\to}\; \mathcal{N}\!\big(0, I(\theta_0)\big).$$

The denominator is the negative of the sample average of second derivatives of the log-density. By consistency $\tilde\theta_n \overset{P}{\to} \theta_0$; by the LLN applied to the iid sequence $\{\partial_\theta^2 \log f(X_i; \theta_0)\}$ and the dominated continuity of the second derivative,

$$-\frac{1}{n}\, S_n'(\tilde\theta_n) \;=\; -\frac{1}{n}\sum_{i=1}^n \partial_\theta^2 \log f(X_i; \tilde\theta_n) \;\overset{P}{\to}\; -\,\mathbb{E}_{\theta_0}\!\big[\partial_\theta^2 \log f(X; \theta_0)\big] \;=\; I(\theta_0).$$

The last equality is the “information equals negative expected Hessian” identity (Theorem 8 of Topic 13), which in turn relies on the regularity conditions permitting differentiation under the integral sign.

Put the two pieces together with Slutsky’s theorem: a sequence converging in distribution divided by a sequence converging in probability to a positive constant converges in distribution to the ratio of the limits. Thus

$$\sqrt{n}\,(\hat\theta_n - \theta_0) \;\overset{d}{\to}\; \frac{\mathcal{N}(0, I(\theta_0))}{I(\theta_0)} \;=\; \mathcal{N}\!\left(0, \frac{I(\theta_0)}{I(\theta_0)^2}\right) \;=\; \mathcal{N}\!\left(0, \frac{1}{I(\theta_0)}\right). \quad\blacksquare$$

The proof shows the MLE as a ratio of two averages — the score and the observed Hessian — each of which has a limit theorem. The CLT handles the numerator; the LLN handles the denominator; Slutsky combines them. This is the template for every M-estimator asymptotic-normality proof: identify the estimating equation, Taylor-expand, apply the CLT to the leading term, the LLN to the curvature term, and Slutsky to combine. The MLE uses the estimating equation $S_n(\theta) = 0$, but the same scaffolding works for robust M-estimators, GMM, quasi-likelihood, and many estimators in modern econometrics.
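The variance claim in Theorem 4 is easy to spot-check by Monte Carlo. For the Exponential rate, $I(\lambda) = 1/\lambda^2$, so the theorem predicts $\operatorname{Var}\big(\sqrt{n}(\hat\lambda_n - \lambda_0)\big) \to \lambda_0^2$. A sketch (sizes are arbitrary choices):

```python
import math
import random

random.seed(5)
lam0, n, reps = 2.0, 1000, 4000

draws = []
for _ in range(reps):
    xbar = sum(random.expovariate(lam0) for _ in range(n)) / n
    draws.append(math.sqrt(n) * (1 / xbar - lam0))  # standardized MLE

m = sum(draws) / reps
mc_var = sum((d - m) ** 2 for d in draws) / reps
print(mc_var, lam0 ** 2)  # MC variance vs. 1/I(lam0) = lam0^2 = 4.0
```

Swapping in the Bernoulli or Poisson MLE changes only the sampler and the target variance $1/I(\theta_0)$; the scaffolding is identical, which is the point of the template above.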

Example 10 Asymptotic distribution of the Normal MLE

For Normal(μ,σ2)(\mu, \sigma^2) with σ2\sigma^2 known and μ^n=Xˉn\hat\mu_n = \bar{X}_n, I(μ)=1/σ2I(\mu) = 1/\sigma^2 (Example 10 of Topic 13). Theorem 4 gives

n(Xˉnμ)  d  N(0,σ2).\sqrt{n}\,(\bar{X}_n - \mu) \;\overset{d}{\to}\; \mathcal{N}(0, \sigma^2).

This is the classical CLT for the sample mean, now recovered as a special case of the general MLE asymptotic-normality theorem. For Normal data, the “asymptotic” normality is actually exact at every finite nn: XˉnN(μ,σ2/n)\bar{X}_n \sim \mathcal{N}(\mu, \sigma^2/n) for any n1n \geq 1, not just in the limit. The MLE framework gives one unified result; for this particular family the result happens to hold non-asymptotically.

Example 11 Asymptotic distribution of the Bernoulli MLE

For Bernoulli(p)(p) with p^n=Xˉn\hat p_n = \bar{X}_n, the Fisher information is I(p)=1/(p(1p))I(p) = 1/(p(1-p)), so 1/I(p)=p(1p)1/I(p) = p(1-p). Theorem 4 gives

n(Xˉnp)  d  N ⁣(0,p(1p)).\sqrt{n}\,(\bar{X}_n - p) \;\overset{d}{\to}\; \mathcal{N}\!\big(0, p(1-p)\big).

This is the two-sided de Moivre–Laplace theorem in disguise. For a 95% Wald confidence interval we replace pp by p^\hat p in the asymptotic variance:

p^n±1.96p^n(1p^n)n.\hat p_n \,\pm\, 1.96\,\sqrt{\frac{\hat p_n (1 - \hat p_n)}{n}}.

This is the formula every introductory statistics class derives ad hoc — it is the MLE Wald interval applied to the Bernoulli family. The interval has known coverage issues near p=0p = 0 or p=1p = 1; the Wilson score interval and the Agresti–Coull interval are alternatives. What the MLE machinery gives us is the default construction; better intervals require more careful analysis of finite-sample behavior.
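The coverage issues near the boundary can be seen in a few lines of simulation. The sketch below compares the Wald and Wilson 95% intervals at an arbitrary boundary-adjacent choice $p_0 = 0.05$, $n = 50$ (these values, the seed, and the replicate count are illustrative, not from the text):

```python
import numpy as np

# Coverage of the 95% Wald interval vs. the Wilson score interval
# near the boundary.  (p₀, n, reps, seed) are arbitrary choices.
rng = np.random.default_rng(1)
p0, n, reps, z = 0.05, 50, 20000, 1.96

x = rng.binomial(n, p0, size=reps)
phat = x / n

# Wald: p̂ ± z √(p̂(1−p̂)/n)
half = z * np.sqrt(phat * (1 - phat) / n)
wald = (phat - half <= p0) & (p0 <= phat + half)

# Wilson: invert the score test instead of plugging p̂ into the variance
center = (phat + z**2 / (2 * n)) / (1 + z**2 / n)
half_w = z * np.sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)
wilson = (center - half_w <= p0) & (p0 <= center + half_w)

print(wald.mean(), wilson.mean())   # Wald undercovers; Wilson sits nearer 0.95
```

Note that when $x = 0$ the Wald interval degenerates to the single point $\{0\}$ and cannot cover any $p_0 > 0$ — one concrete mechanism behind the undercoverage.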

The Monte-Carlo explorer below makes Theorem 4 concrete. For each MC draw, we sample nn observations from the true parameter, compute the MLE, and add it to a running histogram. Toggle standardized to watch n(θ^θ0)\sqrt{n}(\hat\theta - \theta_0) collapse onto N(0,1/I(θ0))\mathcal{N}(0, 1/I(\theta_0)) — the asymptotic statement in pixels. The readout reports the MC efficiency CRLB/Var^\mathrm{CRLB}/\widehat{\mathrm{Var}}, which should approach 11 as nn grows.

Interactive: MLE Sampling-Distribution Explorer
Draw Monte-Carlo samples, compute the MLE on each, and watch the sampling distribution converge to the asymptotic Normal of Theorem 4. Toggle the standardized view to see √n(θ̂ − θ₀) collapse onto N(0, 1/I(θ₀)).
Remark 5 Finite-sample caution: asymptotics can mislead

The arrow d\overset{d}{\to} is deceptively clean. In finite samples, three things can go wrong:

(1) Slow convergence near boundaries. For Bernoulli with true pp near 00 or 11, the sampling distribution of p^\hat p is very skewed and discrete; the Wald interval undercovers dramatically. The convergence rate is still n\sqrt n, but the hidden constants are huge near boundary points.

(2) Bias at second order. The MLE is consistent but generally not unbiased; the bias shrinks at rate $1/n$, so it is negligible asymptotically but can matter at moderate sample sizes. The classical “Bartlett correction” rescales likelihood-ratio statistics to remove the analogous second-order error.

(3) Asymptotic efficiency ≠ finite-sample optimality. At fixed nn, a biased estimator with smaller variance (e.g., James–Stein in d3d \geq 3, or a shrinkage estimator) can dominate the MLE in MSE. Asymptotic efficiency says the MLE is optimal as nn \to \infty; for small nn you may well want to look elsewhere. Topic 13 §13.8 developed this point carefully.

A healthy reading of Theorem 4 is: “for large enough nn, with well-behaved parameters, you will not be led astray by the Gaussian approximation.” The phrases “large enough” and “well-behaved” are load-bearing and should be checked by simulation whenever nn is less than a few hundred or the parameter is near a boundary.

Multivariate extension (sidebar)

When θRd\theta \in \mathbb{R}^d is a parameter vector, the same argument runs with the score becoming a gradient n(θ)Rd\nabla \ell_n(\theta) \in \mathbb{R}^d and the Hessian becoming a d×dd \times d matrix. The Fisher information is the d×dd \times d positive semi-definite matrix

I(θ)  =  Eθ ⁣[logf(X;θ)logf(X;θ)]  =  Eθ ⁣[2logf(X;θ)],\mathbf{I}(\theta) \;=\; \mathbb{E}_\theta\!\left[\nabla \log f(X; \theta)\, \nabla \log f(X; \theta)^\top\right] \;=\; -\mathbb{E}_\theta\!\left[\nabla^2 \log f(X; \theta)\right],

and the asymptotic-normality statement becomes

n(θ^nθ0)  d  Nd ⁣(0,I(θ0)1).\sqrt{n}\,(\hat\theta_n - \theta_0) \;\overset{d}{\to}\; \mathcal{N}_d\!\left(\mathbf{0}, \mathbf{I}(\theta_0)^{-1}\right).

The component-wise Wald CI for the jj-th parameter is θ^n,j±zα/2[I(θ^n)1/n]jj\hat\theta_{n,j} \pm z_{\alpha/2} \sqrt{[\mathbf{I}(\hat\theta_n)^{-1}/n]_{jj}}, using the jj-th diagonal of the inverse Fisher information matrix evaluated at the MLE. The proof is notationally heavier — vector Taylor expansion, matrix Slutsky, continuous-mapping of the matrix inverse — but conceptually identical to the scalar case.

The full multivariate machinery, including the role of the Cramér–Rao matrix inequality and the block structure of information matrices for multi-parameter GLMs, is developed in Topic 21 (Linear Regression) and Topic 22 (Generalized Linear Models).

14.6 Asymptotic Efficiency

Consistency and asymptotic normality locate the MLE within the space of well-behaved estimators. Efficiency ranks it.

Two superimposed densities on the same axis — a narrow Gaussian centered at θ₀ (the MLE's asymptotic distribution) and a wider Gaussian also centered at θ₀ (a less efficient competitor, e.g., the sample median on Normal data). A vertical dotted line at θ₀; an annotation shows the ratio of asymptotic variances equal to the inverse Fisher information divided by the competitor's asymptotic variance, here about 0.64 — the sample median extracts ~64% of the information the mean extracts.
Theorem 5 Asymptotic efficiency of the MLE

Under the conditions of Theorem 4, the MLE achieves the Cramér–Rao asymptotic variance: its asymptotic variance 1/I(θ0)1/I(\theta_0) equals the Cramér–Rao lower bound for the class of regular estimators. In particular, for any other regular asymptotically-normal estimator θ~n\tilde\theta_n with n(θ~nθ0)dN(0,v(θ0))\sqrt{n}(\tilde\theta_n - \theta_0) \overset{d}{\to} \mathcal{N}(0, v(\theta_0)),

v(θ0)    1I(θ0).v(\theta_0) \;\geq\; \frac{1}{I(\theta_0)}.
Proof [show]

Theorem 9 of Topic 13 (Point Estimation) — the Cramér–Rao lower bound — proved that any unbiased estimator $\tilde\theta_n$ of $\theta_0$ satisfies $\operatorname{Var}(\tilde\theta_n) \geq 1/(n I(\theta_0))$, which in asymptotic-variance language means $v(\theta_0) \geq 1/I(\theta_0)$ for its limiting Gaussian spread. (Extending the bound from unbiased estimators to the full class of regular estimators is the content of the Hájek–Le Cam convolution theorem; we take that step on faith here.) The MLE, by Theorem 4 above, has asymptotic variance exactly $1/I(\theta_0)$, which is the CRLB. No regular estimator can do strictly better than the MLE asymptotically. $\quad\blacksquare$

Efficiency is the statement that no one can beat the MLE asymptotically. In the ML literature, this result is sometimes called the “Cramér–Rao bound saturates at the MLE”: the information inequality turns into an equality for the maximum-likelihood estimator. Every other regular estimator wastes information — extracts less precision from the data than the data actually contains — and the MLE is the unique one (within the regular class, at the first asymptotic order) that does not.

Example 12 Efficiency of Normal, Bernoulli, Exponential MLEs — all achieve the CRLB

Line up the three workhorse examples:

| Family | MLE $\hat\theta$ | $I(\theta)$ | CRLB $= 1/(nI(\theta))$ | $\operatorname{AsymVar}(\hat\theta) = 1/(nI(\theta))$ | Efficient? |
|---|---|---|---|---|---|
| Normal$(\mu, \sigma^2)$, $\sigma^2$ known | $\bar{X}_n$ | $1/\sigma^2$ | $\sigma^2/n$ | $\sigma^2/n$ | Yes (exact at every $n$) |
| Bernoulli$(p)$ | $\bar{X}_n$ | $1/(p(1-p))$ | $p(1-p)/n$ | $p(1-p)/n$ | Yes (asymptotically) |
| Exponential$(\lambda)$ | $1/\bar{X}_n$ | $1/\lambda^2$ | $\lambda^2/n$ | $\lambda^2/n$ | Yes (asymptotically) |

All three MLEs achieve the CRLB in the limit; the Normal mean achieves it exactly at every finite nn. Efficiency is not specific to exponential families — it is specific to the MLE — but exponential families make the arithmetic particularly clean because the score is linear in the sufficient statistic.
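The efficiency gap in the figure above — the sample median extracting roughly $2/\pi \approx 0.64$ of the information the mean extracts on Normal data — is easy to reproduce. A minimal simulation sketch (seed, sample size, and replicate count are arbitrary choices):

```python
import numpy as np

# Relative efficiency of the sample median vs. the sample mean (the MLE)
# on Normal data.  Asymptotically Var(median) ≈ (π/2)·σ²/n, so the ratio
# Var(mean)/Var(median) should sit near 2/π ≈ 0.637.
rng = np.random.default_rng(2)
n, reps = 400, 20000

x = rng.normal(0.0, 1.0, size=(reps, n))
eff = x.mean(axis=1).var() / np.median(x, axis=1).var()
print(eff)    # near 2/π ≈ 0.64
```

The median is a perfectly reasonable regular estimator — consistent, asymptotically normal — yet it wastes about a third of the sample, exactly the sense in which Theorem 5 ranks estimators.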

Remark 6 Hodges' superefficiency: an asymptotic pathology

Theorem 5 says no regular estimator can beat the MLE asymptotically. The catch is the word regular. Joseph Hodges in 1951 constructed the following superefficient estimator: shrink Xˉn\bar{X}_n all the way to 00 whenever Xˉn<n1/4|\bar{X}_n| < n^{-1/4}, and use Xˉn\bar{X}_n otherwise. At θ0=0\theta_0 = 0 the estimator has zero asymptotic variance; at every other θ0\theta_0 it behaves like Xˉn\bar{X}_n.

This does not violate Theorem 5 because the Hodges estimator is irregular at θ0=0\theta_0 = 0 — the sampling distribution does not converge uniformly in θ0\theta_0 near zero. More fundamentally, the maximum risk over a shrinking neighborhood of θ0=0\theta_0 = 0 diverges: the estimator pays for its superefficiency at one point with enormous variance at nearby points. Le Cam’s convolution theorem makes this precise: any estimator that is asymptotically better than the MLE at one point is asymptotically worse at nearby points, in a specific technical sense.

For practical purposes, the lesson is that superefficiency is a local, fragile phenomenon. The MLE is the default for good reason, and the escape routes — shrinkage, empirical Bayes, Stein-type estimators — deliver their gains through bias, regularization, or prior information, not through asymptotic efficiency.
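The Hodges construction fits in a few lines, and simulating it shows both halves of the story at once: superefficiency at $\theta_0 = 0$ and inflated risk at a nearby point. This sketch samples $\bar X_n \sim \mathcal{N}(\theta_0, 1/n)$ directly; $(n, \text{reps}, \text{seed})$ are arbitrary choices, not from the text:

```python
import numpy as np

# Hodges' estimator on Normal(θ, 1) data: shrink X̄ to 0 when |X̄| < n^(−1/4).
# The scaled risk n·E[(θ̃ − θ₀)²] is compared to the MLE's risk of 1.
rng = np.random.default_rng(3)
n, reps = 10_000, 20000

def scaled_risk(theta0):
    xbar = rng.normal(theta0, 1.0 / np.sqrt(n), size=reps)  # X̄ ~ N(θ₀, 1/n)
    tilde = np.where(np.abs(xbar) < n ** -0.25, 0.0, xbar)  # Hodges shrinkage
    return n * np.mean((tilde - theta0) ** 2)

r0 = scaled_risk(0.0)           # at θ₀ = 0: far below 1 (superefficiency)
r1 = scaled_risk(n ** -0.25)    # at the threshold point: far above 1
print(r0, r1)
```

At $\theta_0 = 0$ every draw lands inside the shrinkage window, so the risk collapses; at $\theta_0 = n^{-1/4}$ roughly half the draws are wrongly shrunk all the way to zero, and the scaled risk explodes — the local, fragile trade-off the remark describes.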

14.7 Computing the MLE: Newton-Raphson

For the Normal, Bernoulli, Exponential, and Poisson families of §14.2, the score equation has a closed-form solution. For most other families — Gamma shape, logistic regression, mixture models, neural networks — it does not, and the MLE must be computed by a root-finder applied to $S_n(\theta) = 0$. Newton-Raphson is the canonical choice. A Newton-type iteration (IRLS, a form of Fisher scoring) is what runs inside glm() in R and the GLM fitters of statsmodels in Python; understanding its rate of convergence is understanding why maximum-likelihood fits run as fast as they do.

Two panels side by side. Left: the log-likelihood ℓ(α) for the Gamma(3, 2) shape parameter with n = 50 data points. Four tangent-line steps are shown originating from α⁽⁰⁾ = 1, then α⁽¹⁾, α⁽²⁾, α⁽³⁾; each tangent crosses the α-axis at the next iterate; the sequence converges visibly to the MLE near α = 3. Right: a log-scale convergence plot of |α⁽ᵗ⁾ − α̂| against iteration t — nearly straight with slope approximately 2 on the log-log axes, illustrating quadratic convergence.
Definition 5 Newton-Raphson update rule

Given an iterate θ(t)\theta^{(t)}, the Newton-Raphson update linearizes the score equation around θ(t)\theta^{(t)} and solves the linear equation:

θ(t+1)  =  θ(t)    Sn(θ(t))Sn(θ(t))  =  θ(t)  +  Sn(θ(t))J(θ(t)),\theta^{(t+1)} \;=\; \theta^{(t)} \;-\; \frac{S_n(\theta^{(t)})}{S_n'(\theta^{(t)})} \;=\; \theta^{(t)} \;+\; \frac{S_n(\theta^{(t)})}{J(\theta^{(t)})},

where J(θ)=n(θ)=Sn(θ)J(\theta) = -\ell_n''(\theta) = -S_n'(\theta) is the observed Fisher information. The update is applied until θ(t+1)θ(t)|\theta^{(t+1)} - \theta^{(t)}| falls below a tolerance.

Geometrically, at each step we replace n\ell_n with its local quadratic approximation and jump to the maximum of that parabola. Where the log-likelihood is close to quadratic — and near the MLE it always is, by Taylor — the jump lands close to the true maximum and the next iterate does even better. This is the source of the algorithm’s speed: each iteration roughly squares the error, so 1010-digit precision is reached in a handful of steps.

Fisher scoring is a popular variant: replace the random J(θ(t))J(\theta^{(t)}) with its expectation nI(θ(t))nI(\theta^{(t)}), so the update becomes

θ(t+1)  =  θ(t)  +  Sn(θ(t))nI(θ(t)).\theta^{(t+1)} \;=\; \theta^{(t)} \;+\; \frac{S_n(\theta^{(t)})}{nI(\theta^{(t)})}.

For exponential families in the natural parameter, JJ and nInI coincide and Newton and Fisher scoring are identical. For other families, Fisher scoring often has better stability — the expected information is usually positive everywhere, while the observed information can become non-positive away from the MLE, causing Newton to take a step away from the solution.

Theorem 6 Quadratic convergence of Newton-Raphson

Suppose n\ell_n is three times continuously differentiable in a neighborhood of the MLE θ^\hat\theta, and J(θ^)>0J(\hat\theta) > 0. Then for any starting value θ(0)\theta^{(0)} sufficiently close to θ^\hat\theta, the Newton-Raphson iterates satisfy

θ(t+1)θ^    Cθ(t)θ^2|\theta^{(t+1)} - \hat\theta| \;\leq\; C \cdot |\theta^{(t)} - \hat\theta|^2

for some constant C>0C > 0 depending on θ^\hat\theta and the third derivative of n\ell_n. Equivalently, the number of correct digits roughly doubles per iteration.

Proof [show]

Let et=θ(t)θ^e_t = \theta^{(t)} - \hat\theta denote the error at step tt. Taylor-expand SnS_n around θ^\hat\theta:

Sn(θ(t))  =  Sn(θ^)  +  etSn(θ^)  +  et22Sn(ξt)S_n(\theta^{(t)}) \;=\; S_n(\hat\theta) \;+\; e_t\, S_n'(\hat\theta) \;+\; \frac{e_t^2}{2}\, S_n''(\xi_t)

for some ξt\xi_t between θ(t)\theta^{(t)} and θ^\hat\theta. Since θ^\hat\theta is the MLE, Sn(θ^)=0S_n(\hat\theta) = 0, so

Sn(θ(t))  =  etSn(θ^)  +  et22Sn(ξt).S_n(\theta^{(t)}) \;=\; e_t\, S_n'(\hat\theta) \;+\; \frac{e_t^2}{2}\, S_n''(\xi_t).

Also Taylor-expand the denominator Sn(θ(t))=Sn(θ^)+etSn(ηt)S_n'(\theta^{(t)}) = S_n'(\hat\theta) + e_t\, S_n''(\eta_t) for some ηt\eta_t between the two points. Substitute both expansions into the Newton update θ(t+1)=θ(t)Sn(θ(t))/Sn(θ(t))\theta^{(t+1)} = \theta^{(t)} - S_n(\theta^{(t)})/S_n'(\theta^{(t)}) and subtract θ^\hat\theta:

et+1  =  et    etSn(θ^)+12et2Sn(ξt)Sn(θ^)+etSn(ηt).e_{t+1} \;=\; e_t \;-\; \frac{e_t\, S_n'(\hat\theta) + \tfrac{1}{2} e_t^2\, S_n''(\xi_t)}{S_n'(\hat\theta) + e_t\, S_n''(\eta_t)}.

Combining the two numerator terms over the common denominator and expanding,

et+1  =  et[Sn(θ^)+etSn(ηt)][etSn(θ^)+12et2Sn(ξt)]Sn(θ^)+etSn(ηt)  =  et2[Sn(ηt)12Sn(ξt)]Sn(θ^)+etSn(ηt).e_{t+1} \;=\; \frac{e_t\, [S_n'(\hat\theta) + e_t\, S_n''(\eta_t)] - [e_t\, S_n'(\hat\theta) + \tfrac{1}{2} e_t^2\, S_n''(\xi_t)]}{S_n'(\hat\theta) + e_t\, S_n''(\eta_t)} \;=\; \frac{e_t^2\, [S_n''(\eta_t) - \tfrac{1}{2} S_n''(\xi_t)]}{S_n'(\hat\theta) + e_t\, S_n''(\eta_t)}.

As θ(t)θ^\theta^{(t)} \to \hat\theta, both ξt\xi_t and ηt\eta_t converge to θ^\hat\theta, so the bracketed expression in the numerator approaches 12Sn(θ^)\tfrac{1}{2} S_n''(\hat\theta), and the denominator approaches Sn(θ^)=J(θ^)<0S_n'(\hat\theta) = -J(\hat\theta) < 0. Thus, for et|e_t| small enough,

et+1    Sn(θ^)2Sn(θ^)et2(1+o(1))    Cet2,|e_{t+1}| \;\leq\; \left|\frac{S_n''(\hat\theta)}{2\, S_n'(\hat\theta)}\right| \cdot |e_t|^2 \cdot (1 + o(1)) \;\leq\; C \cdot |e_t|^2,

which is the quadratic-convergence claim. \quad\blacksquare

Quadratic convergence is a very strong statement. Starting from an error of 10210^{-2}, one Newton step drops it to roughly C104C \cdot 10^{-4}, the next to C3108C^3 \cdot 10^{-8}, the next to below machine precision. In practice 551010 iterations are enough for any likelihood that looks remotely parabolic near its maximum. This is why the “default solver” for smooth low-dimensional MLE problems is always some flavor of Newton.

Example 13 Newton-Raphson for the Gamma shape parameter

For iid Gamma(α,β)(\alpha, \beta) data with β\beta known, the log-likelihood is

(α)  =  nαlogβ    nlogΓ(α)  +  (α1)i=1nlogXi    βi=1nXi.\ell(\alpha) \;=\; n\alpha \log \beta \;-\; n \log \Gamma(\alpha) \;+\; (\alpha - 1)\sum_{i=1}^n \log X_i \;-\; \beta \sum_{i=1}^n X_i.

Differentiate once to get the score and once more to get the Hessian:

Sn(α)  =  nlogβnψ(α)+i=1nlogXi,Sn(α)  =  nψ(α),S_n(\alpha) \;=\; n \log \beta - n\psi(\alpha) + \sum_{i=1}^n \log X_i, \qquad S_n'(\alpha) \;=\; -n\psi'(\alpha),

where ψ(α)=Γ(α)/Γ(α)\psi(\alpha) = \Gamma'(\alpha)/\Gamma(\alpha) is the digamma function and ψ(α)\psi'(\alpha) is the trigamma function. The observed information is J(α)=nψ(α)>0J(\alpha) = n\psi'(\alpha) > 0, so Newton-Raphson is

α(t+1)  =  α(t)  +  nlogβnψ(α(t))+ilogXinψ(α(t)).\alpha^{(t+1)} \;=\; \alpha^{(t)} \;+\; \frac{n \log \beta - n\psi(\alpha^{(t)}) + \sum_i \log X_i}{n\psi'(\alpha^{(t)})}.

A standard initial guess is the method-of-moments estimate α(0)=βXˉn\alpha^{(0)} = \beta \bar{X}_n, since E[X]=α/β\mathbb{E}[X] = \alpha/\beta. From this start, convergence typically takes 4466 iterations to machine precision. No closed-form solution exists because ψ1\psi^{-1} has no closed form — but we do not need one; Newton-Raphson handles it numerically.
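Example 13 translates directly into code. The sketch below simulates Gamma data (the true $\alpha_0 = 3$, $\beta = 2$, $n$, and seed are arbitrary choices) and iterates the update $\alpha^{(t+1)} = \alpha^{(t)} + S_n(\alpha^{(t)})/(n\psi'(\alpha^{(t)}))$ from the method-of-moments start:

```python
import numpy as np
from scipy.special import digamma, polygamma

# Newton-Raphson for the Gamma shape α with known rate β (Example 13).
# Data are simulated here to make the sketch self-contained.
rng = np.random.default_rng(4)
alpha0, beta, n = 3.0, 2.0, 500
x = rng.gamma(shape=alpha0, scale=1.0 / beta, size=n)

sum_log = np.log(x).sum()
alpha = beta * x.mean()               # method-of-moments start: α⁽⁰⁾ = β·X̄
for t in range(25):
    score = n * np.log(beta) - n * digamma(alpha) + sum_log   # Sₙ(α)
    info = n * polygamma(1, alpha)    # observed information J(α) = nψ′(α) > 0
    step = score / info
    alpha += step
    if abs(step) < 1e-12:             # stop once the update is negligible
        break

print(alpha)    # the MLE, close to the true shape α₀ = 3 for this sample
```

With the moment-matching start already near the optimum, the loop terminates in a handful of iterations — the quadratic convergence of Theorem 6 in action.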

Example 14 Logistic regression log-likelihood (preview)

Logistic regression is a binary classification model parameterized by regression coefficients β=(β0,β1)R2\beta = (\beta_0, \beta_1) \in \mathbb{R}^2 on a single predictor xx:

pi  =  σ(β0+β1xi),σ(η)  =  11+eη,Yi    Bernoulli(pi).p_i \;=\; \sigma(\beta_0 + \beta_1 x_i), \qquad \sigma(\eta) \;=\; \frac{1}{1 + e^{-\eta}}, \qquad Y_i \;\sim\; \operatorname{Bernoulli}(p_i).

Write ηi=β0+β1xi\eta_i = \beta_0 + \beta_1 x_i. The Bernoulli log-likelihood, summed over nn iid observations,

(β)  =  i=1n[yiηilog(1+eηi)],\ell(\beta) \;=\; \sum_{i=1}^n \big[y_i \eta_i - \log(1 + e^{\eta_i})\big],

has no closed-form maximizer — the score equations

β0  =  i=1n(yipi),β1  =  i=1nxi(yipi)\frac{\partial \ell}{\partial \beta_0} \;=\; \sum_{i=1}^n (y_i - p_i), \qquad \frac{\partial \ell}{\partial \beta_1} \;=\; \sum_{i=1}^n x_i (y_i - p_i)

are not linear in β\beta, since pi=σ(ηi)p_i = \sigma(\eta_i) involves the sigmoid. Newton-Raphson here is the iteratively reweighted least squares (IRLS) algorithm: at each step, linearize around the current β(t)\beta^{(t)} and solve a weighted linear regression with weights pi(1pi)p_i(1 - p_i). In exponential-family language, IRLS is Fisher scoring; in optimization language, it is Newton-Raphson with the observed Hessian approximated by its expectation. The full GLM theory — link functions, canonical parameterizations, deviance — is developed in Topic 22 (Generalized Linear Models).

Two practical warnings: if the two classes can be separated by a hyperplane (complete separation; the variant with ties on the separating boundary is called quasi-complete separation), the log-likelihood increases toward its supremum without ever attaining it, and Newton-Raphson iterates escape to infinity — the MLE does not exist. Regularization (a ridge penalty, or a prior on $\beta$) restores a finite maximizer — see Topic 23 §23.6 Ex 8 for the worked example on the EXAMPLE_10_GLM_DATA near-separation preset. And for neural-network-scale problems ($d$ in the thousands to millions), computing and inverting the Hessian is infeasible; stochastic-gradient methods replace Newton-Raphson but retain its quadratic-approximation spirit when they use second-order information.
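A minimal IRLS loop for the single-predictor model of Example 14 can be sketched as follows. The data are simulated (true coefficients, sample size, and seed are arbitrary choices), and the loop has none of the separation safeguards the warning above calls for:

```python
import numpy as np

# IRLS (Fisher scoring / Newton) for single-predictor logistic regression.
# Each iteration solves the weighted normal equations with weights pᵢ(1−pᵢ).
rng = np.random.default_rng(5)
n, b0_true, b1_true = 1000, -0.5, 1.2
x = rng.normal(size=n)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(b0_true + b1_true * x))))

X = np.column_stack([np.ones(n), x])    # design matrix [1, x]
beta = np.zeros(2)
for t in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    w = p * (1 - p)                     # IRLS weights
    grad = X.T @ (y - p)                # score equations from Example 14
    H = X.T @ (X * w[:, None])          # information matrix (expected = observed
                                        # here, since the logit link is canonical)
    delta = np.linalg.solve(H, grad)
    beta = beta + delta
    if np.max(np.abs(delta)) < 1e-10:
        break

print(beta)    # close to the true (−0.5, 1.2) for this sample
```

The concavity of the logistic log-likelihood makes the zero start safe; convergence to machine precision takes a handful of iterations, as Theorem 6 predicts.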

The explorer below steps through Newton-Raphson iterates on the log-likelihood. Pick a starting value with the slider, then click Step (one iteration) or Run (until convergence); the green dashed tangent at each iterate shows the slope of the score, and the next iterate is where that tangent meets zero-slope. Toggle Fisher scoring to replace J(θ(t))J(\theta^{(t)}) with nI(θ(t))nI(\theta^{(t)}) — for the Gamma-shape preset, both are nψ(α(t))n\psi'(\alpha^{(t)}) and the paths coincide; for families with non-deterministic observed information they diverge.

Interactive: Newton-Raphson Explorer
Step through Newton-Raphson iterates on the log-likelihood. The green dashed line is the tangent at θ⁽ᵗ⁾; its slope is the score S(θ⁽ᵗ⁾), and the next iterate θ⁽ᵗ⁺¹⁾ is the point where this tangent crosses zero slope (for a concave log-likelihood). The right panel plots |θ⁽ᵗ⁾ − θ̂| on a log scale — quadratic convergence shows up as a roughly straight line of steep negative slope.
Remark 7 Initialization matters: multiple modes and saddle points

Theorem 6 guarantees quadratic convergence once you are close enough to the MLE. Far from the MLE, three things can go wrong:

(1) Multiple local maxima. Mixture-model likelihoods have label-switching symmetries and often have multiple local maxima. Newton converges to whichever one is closest; standard practice is to run EM or Newton from many random starting points and keep the best.

(2) Saddle points. In higher dimensions, the Hessian can have mixed signs away from the MLE. Newton-Raphson at a saddle point takes a “step” that increases in some directions and decreases in others — the iteration can stall or oscillate. Fisher scoring, by using the expected (positive-definite) information, avoids this pathology.

(3) Diverging iterates. If the starting point is in a region where the log-likelihood is flat or inverted in curvature, the Newton step can overshoot wildly. Line search (backtracking until \ell actually increases) is the standard safeguard — guaranteed convergence at the cost of sometimes making small steps.

For the low-dimensional, smooth, exponential-family MLEs that dominate classical statistics, none of this is a concern: start anywhere reasonable, and Newton finds the MLE in 551010 steps. The pathologies become important precisely at the frontier between classical statistics and modern ML — which is where non-convex likelihoods, high dimensions, and exotic parameterizations live.

14.8 Observed and Expected Fisher Information

The asymptotic-normality theorem uses the expected Fisher information I(θ0)I(\theta_0) — a deterministic function of the unknown parameter. In practice, we estimate the asymptotic variance from data, which forces a choice: plug in θ^\hat\theta into the expected information, or use the observed (data-dependent) information? Both are consistent; both appear in textbooks and software; but they are not identical, and the choice matters.

Four panels arranged in a 2×2 grid, each showing a scatter plot of (Î_obs, nI(θ̂)) pairs across 200 MC replications. Upper-left: Normal mean — scatter collapses to a single point at (n/σ², n/σ²), because observed = expected is non-random. Upper-right: Exponential rate — points lie exactly on the identity line (observed = expected at MLE) but spread along it because λ̂ varies sample-to-sample. Lower-left and lower-right: Bernoulli and Poisson, both on-diagonal scatter like Exp. In all four, as n grows, the cloud shrinks toward the theoretical value at θ₀.
Definition 6 Observed Fisher Information

The observed Fisher information at θ\theta is the negative Hessian of the log-likelihood:

J(θ)  =  n(θ)  =  i=1nθ2logf(Xi;θ).J(\theta) \;=\; -\ell_n''(\theta) \;=\; -\sum_{i=1}^n \partial_\theta^2 \log f(X_i; \theta).

It is a random variable (depends on the sample) and is typically evaluated at the MLE: J(θ^)J(\hat\theta).

The distinction from the expected information is purely one of averaging. The expected information $I(\theta) = -\mathbb{E}_\theta[\partial_\theta^2 \log f(X; \theta)]$ is a deterministic function of $\theta$; it is what you would compute if you could integrate against the true density. The observed information $J(\theta)$ is what you get when you sum over the actual data you have — the sample analog. By the LLN, $J(\theta)/n \to I(\theta)$, so they agree in the limit; at any finite $n$, $J(\theta)$ and $nI(\theta)$ differ by $O_P(\sqrt{n})$, the CLT-scale fluctuation of a sum.

Theorem 7 Observed information converges to expected information

Under the conditions of Theorem 4,

J(θ0)n  P  I(θ0),J(θ^n)n  P  I(θ0).\frac{J(\theta_0)}{n} \;\overset{P}{\to}\; I(\theta_0), \qquad \frac{J(\hat\theta_n)}{n} \;\overset{P}{\to}\; I(\theta_0).

The first statement is the LLN applied to the iid sequence {θ2logf(Xi;θ0)}\{-\partial_\theta^2 \log f(X_i; \theta_0)\}. The second extends this by continuous mapping plus the consistency of θ^n\hat\theta_n from Theorem 3. Both information quantities, evaluated at the MLE, are consistent estimators of I(θ0)I(\theta_0) — so either can be used to construct a Wald confidence interval

θ^n±zα/21J(θ^n)orθ^n±zα/21nI(θ^n).\hat\theta_n \pm z_{\alpha/2}\,\frac{1}{\sqrt{J(\hat\theta_n)}} \qquad \text{or} \qquad \hat\theta_n \pm z_{\alpha/2}\,\frac{1}{\sqrt{n I(\hat\theta_n)}}.

Both are asymptotically valid. The practical question is: which one produces better coverage at finite nn?

Example 15 Observed equals expected exactly: Normal mean

For Normal(μ,σ2)(\mu, \sigma^2) data with σ2\sigma^2 known, μ2logf(x;μ)=1/σ2\partial_\mu^2 \log f(x; \mu) = -1/\sigma^2, independent of xx. Therefore

J(μ)  =  nσ2  =  nI(μ).J(\mu) \;=\; \frac{n}{\sigma^2} \;=\; n I(\mu).

The observed information equals nn times the expected information exactly at every μ\mu and for every sample — there is no randomness in J(μ)J(\mu). This is the “no-surprises” case and makes the Normal-mean Wald interval unambiguous: the two constructions agree at every sample.

Example 16 Observed and expected information at the MLE: Exponential rate

For Exponential(λ)(\lambda), λ2logf(x;λ)=1/λ2\partial_\lambda^2 \log f(x; \lambda) = -1/\lambda^2, again independent of xx, so

J(λ)  =  nλ2  =  nI(λ).J(\lambda) \;=\; \frac{n}{\lambda^2} \;=\; n I(\lambda).

Evaluated at the MLE λ^n=1/Xˉn\hat\lambda_n = 1/\bar{X}_n,

J(λ^n)  =  nXˉn2  =  nI(λ^n).J(\hat\lambda_n) \;=\; n\bar{X}_n^2 \;=\; n I(\hat\lambda_n).

So observed and expected information, evaluated at the MLE, coincide identically — but both are now random, varying sample-to-sample through λ^n\hat\lambda_n. The same is true of Bernoulli and Poisson at the MLE: in each exponential-family case, the score equation makes the Hessian lose its data-dependence at the MLE, so observed and expected information become equal there. This is a property of exponential families in the natural parameter, not a universal one; for Cauchy location, say, observed and expected information at the MLE differ in finite samples. The ObservedVsExpectedExplorer\texttt{ObservedVsExpectedExplorer} below visualizes the sample-to-sample variation of J(λ^n)J(\hat\lambda_n) around the deterministic anchor nI(λ0)nI(\lambda_0).
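The Bernoulli case of this identity is worth seeing in code, because there the per-observation Hessian genuinely depends on the data, yet the equality still emerges at the MLE. A sketch with arbitrary $(p_0, n, \text{seed})$:

```python
import numpy as np

# Observed vs expected information for Bernoulli(p), evaluated at the MLE.
# The per-observation Hessian is data-dependent, but summing and plugging
# in p̂ = X̄ collapses J(p̂) onto n·I(p̂) exactly, as Example 16 explains.
rng = np.random.default_rng(6)
p0, n = 0.3, 500
x = rng.binomial(1, p0, size=n)
phat = x.mean()

# J(p) = Σ [xᵢ/p² + (1−xᵢ)/(1−p)²]  (negative Hessian of the log-likelihood)
J_hat = np.sum(x / phat**2 + (1 - x) / (1 - phat) ** 2)
nI_hat = n / (phat * (1 - phat))     # n·I(p̂) with I(p) = 1/(p(1−p))

print(J_hat, nI_hat)                 # equal at the MLE, up to rounding
```

Algebraically, $J(\hat p) = n\hat p/\hat p^2 + n(1-\hat p)/(1-\hat p)^2 = n/(\hat p(1-\hat p)) = nI(\hat p)$ — the score equation $\sum x_i = n\hat p$ is exactly what makes the data-dependence cancel.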

The Monte-Carlo explorer below realizes both J(θ^n)J(\hat\theta_n) and nI(θ^n)nI(\hat\theta_n) across 200 replicates. Notice the qualitative difference between the Normal-mean preset — where the scatter collapses to a single point at (n/σ2,n/σ2)(n/\sigma^2, n/\sigma^2) — and the Exponential / Poisson / Bernoulli presets, where the scatter is on the identity line but spreads along it as θ^n\hat\theta_n varies. As nn grows, the cloud contracts toward the deterministic anchor nI(θ0)nI(\theta_0), visualized as an open black circle.

Interactive: Observed vs Expected Fisher Information
Each of 200 Monte-Carlo replicates produces one pair (J(θ̂), nI(θ̂)). For the natural-parameter exponential families below, the two quantities coincide at the MLE so the scatter sits exactly on the identity line; what varies sample-to-sample is where on that line we land, via θ̂. For Normal with known σ², J(μ̂) = n/σ² is non-random and the scatter collapses to a single point.
Remark 8 Efron & Hinkley (1978): prefer the observed information

In a classic paper, Bradley Efron and David Hinkley argued that the observed information J(θ^n)J(\hat\theta_n) is generally preferable to the expected information nI(θ^n)nI(\hat\theta_n) for constructing confidence intervals — even when the MLE sits at a point where the two agree identically. The reason is that J(θ^n)J(\hat\theta_n) conditions on the sample, capturing the sample-specific curvature of the log-likelihood at the observed data. When the sample happens to be highly informative (curvature steep), JJ reflects that and gives a narrower interval; when the sample is less informative (curvature shallow), JJ gives a wider one — both calibrated to what the data actually tell you.

For exponential families the two quantities agree at the MLE (Example 16), so there is no practical distinction. For non-exponential families, the Efron–Hinkley result says “use $J$.” In practice the distinction is often invisible: for canonical-link GLMs — the default links in R’s glm — observed and expected information coincide at the MLE, and standard errors come from that common Hessian; for non-canonical links and non-exponential families, software differs in which information matrix it inverts, and the Efron–Hinkley argument favors the observed one.

The broader lesson is that all consistency-level statements (Theorem 7) let you swap JJ for nInI asymptotically, but finite-sample calibration can favor one or the other. In Bayesian language, this parallels the difference between posterior variance and prior-predictive variance — both are “variance”, but they answer different questions.

14.9 Connections to Machine Learning

Every ML practitioner uses maximum likelihood daily, often without calling it that. The three most important translations are cross-entropy, MAP/regularization, and the exponential-family MLE recipe — and they are all re-readings of §§14.1–14.8 in ML notation.

Conceptual diagram with three interconnected boxes. Center: 'MLE = maximize Σ log f(xᵢ; θ)'. Left branch: 'Minimize cross-entropy loss' with examples (binary cross-entropy → Bernoulli MLE, categorical cross-entropy → Multinomial MLE, MSE → Normal MLE). Right branch: 'MAP = MLE + log prior' with examples (Gaussian prior → L² / ridge, Laplace prior → L¹ / lasso). Bottom: 'Exponential-family GLM: ∇A(η̂) = T̄' with examples (logistic regression, Poisson regression, linear regression).
Example 17 MLE as cross-entropy minimization for categorical data

Suppose Yi{1,,K}Y_i \in \{1, \ldots, K\} is a categorical outcome with class probabilities pk(xi;θ)p_k(x_i; \theta) depending on a parameter θ\theta (e.g., via a softmax layer in a neural network). The log-likelihood for one observation is logpYi(xi;θ)\log p_{Y_i}(x_i; \theta), and summed over the training set,

(θ)  =  i=1nlogpYi(xi;θ)  =  i=1nk=1K1{Yi=k}[logpk(xi;θ)].\ell(\theta) \;=\; \sum_{i=1}^n \log p_{Y_i}(x_i; \theta) \;=\; -\sum_{i=1}^n \sum_{k=1}^K \mathbf{1}\{Y_i = k\}\, [-\log p_k(x_i; \theta)].

The inner sum over kk is the cross-entropy between the one-hot empirical distribution at observation ii and the model distribution p(xi;θ)p(\cdot \,|\, x_i; \theta). Summing over ii and dividing by nn gives the mean cross-entropy. So “minimize cross-entropy loss” and “maximize Bernoulli/categorical log-likelihood” are literally the same objective up to a sign and a 1/n1/n constant.

When the model is a neural network with a softmax output, the gradient $\nabla_\theta \ell$ is exactly what backpropagation computes, and the score equation $\nabla_\theta \ell = 0$ is the stationarity condition that training seeks. Every gradient-descent step in a classification training loop is a (stochastic, approximate) step toward the MLE of a categorical likelihood. Every well-trained classifier is an approximate MLE.
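The sign-and-constant equivalence of Example 17 is a one-liner to verify. The sketch below uses a fixed random softmax table in place of a trained network (all shapes, seed, and sizes are arbitrary choices):

```python
import numpy as np

# Numerical check that mean categorical cross-entropy equals the negative
# mean log-likelihood.  Probabilities come from a fixed softmax table
# rather than a trained model — illustration only.
rng = np.random.default_rng(7)
n, K = 200, 4
logits = rng.normal(size=(n, K))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
y = rng.integers(0, K, size=n)

log_lik = np.log(probs[np.arange(n), y]).mean()            # (1/n)·Σ log p_{Yᵢ}
onehot = np.eye(K)[y]
cross_ent = -(onehot * np.log(probs)).sum(axis=1).mean()   # mean cross-entropy

print(log_lik, cross_ent)    # cross_ent equals −log_lik exactly
```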

Example 18 Exponential-family MLE as moment matching (cite §7.8)

For any exponential family written in canonical form f(xη)=h(x)exp(ηT(x)A(η))f(x \mid \eta) = h(x) \exp(\eta \cdot T(x) - A(\eta)), Topic 7 §7.8 derived the universal MLE recipe:

A(η^n)  =  Tˉn  =  1ni=1nT(Xi).\nabla A(\hat\eta_n) \;=\; \bar{T}_n \;=\; \frac{1}{n}\sum_{i=1}^n T(X_i).

In words: the MLE of the natural parameter η\eta is the value at which the expected sufficient statistic under the model equals the sample average of the sufficient statistic. This is the most compact statement of MLE for a single parametric family — it covers all nine of the classical exponential-family examples (Normal, Bernoulli, Poisson, Gamma, Beta, and their relatives) in one formula.

For a single-predictor logistic regression, the natural parameter is $\eta = \beta_0 + \beta_1 x$, the sufficient statistic is $T(y) = y$, and $\nabla A(\eta) = \sigma(\eta)$. The MLE equations make the fitted probabilities match the empirical fractions on average — and exactly per $x$-level when the data are grouped by $x$. For a Poisson regression with log-link, $A(\eta) = e^\eta$ and $\nabla A(\eta) = e^\eta$; the MLE equation matches $e^{\hat\eta}$ to $\bar{y}$ in the same averaged sense. The entire “exponential-family GLM” pipeline — logistic, Poisson, linear, gamma, and beyond — is moment matching on the sufficient statistics.
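The recipe $\nabla A(\hat\eta_n) = \bar T_n$ can be run as a literal root-finding problem. For the Poisson family in the natural parameterization ($\eta = \log\lambda$, $T(x) = x$, $A(\eta) = e^\eta$), a scalar Newton solve on $\nabla A(\eta) - \bar T_n = 0$ must recover $\hat\lambda = \bar x$. A sketch with arbitrary $(\lambda_0, n, \text{seed})$:

```python
import numpy as np

# Moment matching ∇A(η̂) = T̄ for Poisson in the natural parameter:
# η = log λ, A(η) = e^η, so the MLE solves e^{η̂} = x̄, i.e. λ̂ = x̄.
rng = np.random.default_rng(8)
lam0, n = 4.0, 300
x = rng.poisson(lam0, size=n)
tbar = x.mean()

eta = 0.0
for t in range(50):
    g = np.exp(eta) - tbar        # ∇A(η) − T̄, the moment-matching residual
    h = np.exp(eta)               # ∇²A(η) = Var_η[T], the curvature of A
    eta -= g / h                  # Newton step on the residual
    if abs(g) < 1e-12:
        break

print(np.exp(eta), tbar)          # λ̂ = e^{η̂} coincides with the sample mean
```

That the Hessian of $A$ is the variance of the sufficient statistic is the exponential-family structure doing the work: the Newton solve on $\nabla A(\eta) = \bar T$ is Fisher scoring in disguise.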

Remark 9 MAP as regularized MLE: the Bayesian bridge

Every Bayesian posterior has the form π(θx)L(θ;x)π(θ)\pi(\theta \mid \mathbf{x}) \propto L(\theta; \mathbf{x}) \cdot \pi(\theta), where π(θ)\pi(\theta) is the prior. Taking logs,

logπ(θx)  =  (θ)+logπ(θ)+const.\log \pi(\theta \mid \mathbf{x}) \;=\; \ell(\theta) + \log \pi(\theta) + \text{const}.

The maximum-a-posteriori (MAP) estimate is the maximizer of this log-posterior — equivalently, the maximizer of (θ)+logπ(θ)\ell(\theta) + \log \pi(\theta). Compared to the MLE, MAP adds the log-prior as a penalty term. Two canonical examples light up immediately:

Gaussian prior: π(θ)=N(0,τ2)\pi(\theta) = \mathcal{N}(0, \tau^2) gives

logπ(θ)  =  12τ2θ2+const,\log \pi(\theta) \;=\; -\frac{1}{2\tau^2}\,\theta^2 + \text{const},

which is the L2L^2 / ridge / weight-decay penalty.

Laplace prior: π(θ)=Laplace(0,b)\pi(\theta) = \operatorname{Laplace}(0, b) gives

logπ(θ)  =  1bθ+const,\log \pi(\theta) \;=\; -\frac{1}{b}|\theta| + \text{const},

which is the L1L^1 / lasso / sparsity penalty.

Ridge regression is MAP estimation with a Gaussian prior on the coefficients; lasso is MAP estimation with a Laplace prior (Topic 23 §23.7 Thm 5 formalizes this correspondence). Weight decay in a neural network is MAP with a Gaussian prior on the weights. The regularization parameter — τ2\tau^2 in the Gaussian case, bb in the Laplace case — is a prior hyperparameter, which can in turn be learned by empirical Bayes or set by cross-validation. The Bernstein–von Mises theorem says that under regularity, the posterior concentrates around the MLE at rate 1/n1/\sqrt{n} in total variation: priors stop mattering once there is enough data. Topic 25 §25.8 develops this with a sketch proof following van der Vaart 1998 §10. The “Bayesian bridge” is exactly this: regularization is a prior, and the prior is informative only until the data overwhelm it.

This is why every modern ML optimizer — Adam, AdamW, L-BFGS with L² regularization — can be read as a MAP procedure. The loss is the negative log-likelihood; the weight-decay term is the negative log-prior; the minimum is the MAP estimate. When someone asks “is neural-network training frequentist or Bayesian?” the honest answer is: it is MAP, which is both at once, and the amount of regularization decides which side dominates.
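The correspondence can be seen numerically in the simplest conjugate case: a Normal mean with known variance and a $\mathcal{N}(0, \tau^2)$ prior (a sketch assuming NumPy; the grid minimizer stands in for a real optimizer). The MAP estimate found by minimizing NLL-plus-ridge-penalty matches the closed-form shrinkage of the sample mean toward the prior mean.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, tau = 1.0, 0.5          # known noise sd; prior sd (hyperparameter)
x = rng.normal(2.0, sigma, size=50)
n, xbar = len(x), x.mean()

# Negative log-posterior = negative log-likelihood + L2 penalty:
# the Gaussian prior contributes theta^2 / (2 tau^2), i.e. ridge.
grid = np.linspace(-1.0, 4.0, 50_001)
nll = ((x[None, :] - grid[:, None]) ** 2).sum(axis=1) / (2 * sigma**2)
penalty = grid**2 / (2 * tau**2)
theta_map = grid[np.argmin(nll + penalty)]

# Closed form: the conjugate posterior mode shrinks the MLE (the sample
# mean) toward the prior mean 0 by a precision-weighted factor.
shrink = (n / sigma**2) / (n / sigma**2 + 1 / tau**2)
print(theta_map, shrink * xbar, xbar)
```

As $n$ grows (or $\tau \to \infty$), the shrinkage factor tends to 1 and the MAP estimate collapses onto the MLE, which is the Bernstein–von Mises story in miniature.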

14.10 Summary

The maximum-likelihood estimator ties together everything in Topics 1–13 — likelihoods come from the densities of Topics 5–6; the score and Fisher information come from Topic 13; consistency comes from Topic 10’s law of large numbers; asymptotic normality comes from Topic 11’s central limit theorem together with Topic 9’s Slutsky theorem; efficiency comes from Topic 13’s Cramér-Rao lower bound. The result is a single procedure that is consistent, asymptotically normal, and asymptotically efficient under regularity — no other estimator in the regular class beats it in the limit.

| Property | Statement | Ingredients |
| --- | --- | --- |
| Existence (interior MLE) | Score equation $S_n(\hat\theta) = 0$ has a solution | $\Theta$ open, $\ell$ differentiable |
| Invariance | $\widehat{g(\theta)} = g(\hat\theta)$ | Change of variables or profile likelihood |
| Consistency | $\hat\theta_n \overset{P}{\to} \theta_0$ | Compact $\Theta$, KL-identifiable, LLN + Gibbs inequality |
| Asymptotic normality | $\sqrt{n}(\hat\theta_n - \theta_0) \overset{d}{\to} \mathcal{N}(0, 1/I(\theta_0))$ | Taylor + CLT + LLN + Slutsky |
| Asymptotic efficiency | $\operatorname{AsymVar}(\hat\theta_n) = 1/(nI(\theta_0)) =$ CRLB | Theorem 4 + Theorem 9 of Topic 13 |
| Computation | Newton-Raphson: $\theta^{(t+1)} = \theta^{(t)} + S_n/J$ | Quadratic convergence near MLE |
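The Newton-Raphson row can be made concrete. The sketch below (assuming NumPy; the data and coefficients are illustrative) fits a one-predictor logistic regression by iterating $\theta^{(t+1)} = \theta^{(t)} + J^{-1} S_n$, where $S_n$ is the score and $J$ the observed information, until the score vanishes.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated one-predictor logistic regression (coefficients illustrative).
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([-0.5, 1.2])
y = rng.uniform(size=n) < 1 / (1 + np.exp(-X @ beta_true))

# Newton-Raphson on the log-likelihood: step by the inverse observed
# information times the score.
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    score = X.T @ (y - p)                        # S_n(beta)
    info = X.T @ (X * (p * (1 - p))[:, None])    # observed information J
    beta = beta + np.linalg.solve(info, score)
print(beta)
```

For this model the observed and expected information coincide, so Newton-Raphson and Fisher scoring are the same iteration; this is the IRLS algorithm that GLM software runs under the hood.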

Cheat sheet for the four workhorse families:

| Family | Parameter | MLE | Fisher info $I(\theta)$ | CRLB | Wald 95% CI |
| --- | --- | --- | --- | --- | --- |
| Normal$(\mu, \sigma^2)$ | $\mu$ ($\sigma^2$ known) | $\bar{X}_n$ | $1/\sigma^2$ | $\sigma^2/n$ | $\bar{X}_n \pm 1.96\,\sigma/\sqrt n$ |
| Bernoulli$(p)$ | $p$ | $\bar{X}_n$ | $1/(p(1-p))$ | $p(1-p)/n$ | $\hat p \pm 1.96\sqrt{\hat p(1-\hat p)/n}$ |
| Exponential$(\lambda)$ | $\lambda$ | $1/\bar{X}_n$ | $1/\lambda^2$ | $\lambda^2/n$ | $\hat\lambda \pm 1.96\,\hat\lambda/\sqrt n$ |
| Poisson$(\lambda)$ | $\lambda$ | $\bar{X}_n$ | $1/\lambda$ | $\lambda/n$ | $\hat\lambda \pm 1.96\sqrt{\hat\lambda/n}$ |
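The Wald column can be sanity-checked by simulation. The sketch below (assuming NumPy) estimates the actual coverage of the Exponential row's interval over repeated samples; it should land near the nominal 95%.

```python
import numpy as np

rng = np.random.default_rng(3)
lam, n, reps = 2.0, 400, 2000

# Exponential(lambda) row of the cheat sheet: MLE = 1 / x.bar and Wald
# 95% CI = mle +- 1.96 * mle / sqrt(n).  Estimate the actual coverage.
x = rng.exponential(scale=1 / lam, size=(reps, n))   # mean is 1/lambda
mle = 1 / x.mean(axis=1)
half = 1.96 * mle / np.sqrt(n)
coverage = np.mean((mle - half <= lam) & (lam <= mle + half))
print(coverage)  # nominal level is 0.95
```

Swapping in the Bernoulli or Poisson rows is a two-line change; the Wald recipe — MLE plus-or-minus 1.96 estimated standard errors — is the same throughout.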

Where this leads. Method of Moments applies the Topic 13 framework to moment-matching estimators, comparing their efficiency to the MLE benchmark developed here — asymptotic relative efficiency ARE(MoM, MLE) is typically $< 1$ for families outside the natural exponential representation. Sufficient Statistics uses Fisher information and the CRLB to prove Rao–Blackwell and Lehmann–Scheffé, showing that the MLE is always a function of the sufficient statistic when one exists. Hypothesis Testing (Topic 17) introduces the likelihood-ratio test, the Wald test, and the score test as a trio of asymptotically $\chi^2$ tests, all corollaries of Thm 14.3 via Slutsky and continuous mapping. The full proof of Wilks’ $-2\log\Lambda \xrightarrow{d} \chi^2_k$ is Topic 18’s territory; Topic 17 states the result and derives the Wald case in a two-line application of asymptotic normality. Confidence Intervals constructs Wald, score, and profile intervals all from the $\ell(\theta)$ landscape. Generalized Linear Models lifts logistic regression to its full glory: MLE for exponential-family outcomes with link functions, solved by iteratively-reweighted least squares — which is Newton-Raphson/Fisher scoring from §14.7. And Bayesian Foundations formalizes the MAP/regularized-MLE correspondence of Remark 9, giving a principled framework for the informative priors that regularization has been quietly deploying. §25.8 Ex 8 closes the arc from Remark 9 above through Topic 23 §23.7 Thm 5.

Across formalML, maximum likelihood organizes almost everything: the training objective of a classifier is a Bernoulli/categorical log-likelihood; the training objective of a regressor is a Gaussian log-likelihood; the training of a variational autoencoder is an ELBO — a lower bound on a likelihood — optimized by gradient ascent. The reader who has understood this topic has understood the statistical backbone of modern machine learning. Every optimizer, every loss function, every regularization penalty is a paragraph of §§14.1–14.9 written in a different notation.


Appendix: The EM Algorithm

Remark 10 Preview: EM for Gaussian mixture models

Some log-likelihoods are intractable not because Θ\Theta is big but because the model has latent variables. The canonical example is a Gaussian mixture: each observation XiX_i comes from one of KK Gaussians, but we do not observe which. Write Zi{1,,K}Z_i \in \{1, \ldots, K\} for the unobserved mixture component and θ=(π1,,πK,μ1,,μK,σ1,,σK)\theta = (\pi_1, \ldots, \pi_K, \mu_1, \ldots, \mu_K, \sigma_1, \ldots, \sigma_K) for the parameters. The marginal density of XiX_i is

f(xi;θ)  =  k=1Kπkφ(xi;μk,σk2),f(x_i; \theta) \;=\; \sum_{k=1}^K \pi_k\, \varphi(x_i; \mu_k, \sigma_k^2),

where φ\varphi is the Normal density. The log-likelihood (θ)=ilogf(Xi;θ)\ell(\theta) = \sum_i \log f(X_i; \theta) has the logarithm of a sum, which does not simplify and has no closed-form maximizer.

The expectation-maximization (EM) algorithm side-steps this by alternating between two simpler problems. At the current parameter estimate θ(t)\theta^{(t)}:

  • E-step. Compute the posterior probabilities γik=Pr(Zi=kXi,θ(t))\gamma_{ik} = \Pr(Z_i = k \mid X_i, \theta^{(t)}) — the “responsibility” of component kk for observation ii.
  • M-step. Maximize the expected complete-data log-likelihood: θ(t+1)=argmaxθikγiklog[πkφ(Xi;μk,σk2)]\theta^{(t+1)} = \arg\max_\theta \sum_i \sum_k \gamma_{ik} \log[\pi_k \varphi(X_i; \mu_k, \sigma_k^2)]. This decouples into a weighted Normal MLE for each component, which is just weighted sample means and weighted sample variances — closed-form.

Each EM iteration increases the log-likelihood: $\ell(\theta^{(t+1)}) \geq \ell(\theta^{(t)})$, with equality only at a stationary point. EM converges to a local maximum of $\ell$; finding the global maximum of a mixture likelihood usually requires multiple random restarts, since the surface is highly multimodal. The full theory — Jensen’s inequality applied to the log, the ELBO lower bound, convergence rates and monotonicity — is developed in Bayesian Foundations (Topic 25) and the formalML Variational Methods topic. For this topic, EM is the right tool when the MLE is defined but hard to compute directly and $f$ factors cleanly once the latent variable is conditioned on.
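The two steps can be sketched in a few lines for a two-component one-dimensional mixture (assuming NumPy; data and initial values are illustrative). The assert inside the loop checks the monotone-ascent property $\ell(\theta^{(t+1)}) \geq \ell(\theta^{(t)})$ at every iteration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Two-component 1-D Gaussian mixture (data and initial values illustrative).
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])
n = len(x)

def mixture_loglik(pi, mu, sd):
    dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * sd**2)) \
           / (sd * np.sqrt(2 * np.pi))
    return np.log(dens.sum(axis=1)).sum()

pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sd = np.array([1.0, 1.0])
ll_old = -np.inf
for _ in range(100):
    # E-step: responsibilities gamma_ik = Pr(Z_i = k | X_i, theta).
    dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * sd**2)) \
           / (sd * np.sqrt(2 * np.pi))
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: weighted Normal MLEs for each component, all closed form.
    nk = gamma.sum(axis=0)
    pi = nk / n
    mu = (gamma * x[:, None]).sum(axis=0) / nk
    sd = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    ll = mixture_loglik(pi, mu, sd)
    assert ll >= ll_old - 1e-6   # monotone ascent of the log-likelihood
    ll_old = ll
print(pi, mu, sd)
```

With well-separated components, even this naive initialization recovers the component means and weights; for harder mixtures, the multimodality warning above applies and restarts are essential.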


References

  1. Casella, G., & Berger, R. L. (2002). Statistical Inference (2nd ed.). Duxbury.
  2. Lehmann, E. L., & Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer.
  3. van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press.
  4. Wasserman, L. (2004). All of Statistics. Springer.
  5. Bickel, P. J., & Doksum, K. A. (2015). Mathematical Statistics: Basic Ideas and Selected Topics (2nd ed.). CRC Press.
  6. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
  7. Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer.