Intermediate · 50 min read · April 17, 2026

Method of Moments & M-Estimation

Moment matching, the sandwich variance, and robust estimation — the framework that unifies MoM, MLE, and modern robust statistics.

15.1 From Matching Moments to Estimating Equations

Topic 14 gave us a recipe — write down the likelihood, find its peak — that produces an estimator with the best possible asymptotic variance under regularity. Before that recipe existed, statisticians had a different one. Karl Pearson, working in the 1890s on biometric data and trying to fit mixtures of two Normal distributions to crab measurements, observed that the parameters of a model are functions of its moments. So estimate the moments from the data, plug those estimates into the inverse function, and you have an estimator. The procedure is called the method of moments, abbreviated MoM, and it is the oldest estimation principle in modern statistics.

Two-panel teaser. Left: a Gamma(3, 2) sample of size n=200 with the MoM and MLE fitted densities overlaid on the histogram — both near-identical to the true density at this n. Right: a small bar chart contrasting MoM and MLE on two attributes, 'closed-form' and 'asymptotically efficient' — MoM ticks the first, MLE ticks the second.

The mechanics are disarmingly simple. If $X_1, \ldots, X_n$ are an iid sample from a distribution with parameter $\theta = (\theta_1, \ldots, \theta_k)$, and the population moments $\mu_j(\theta) = \mathbb{E}[X^j]$ are known functions of $\theta$ for $j = 1, \ldots, k$, then the sample moments $\overline{X^j} = (1/n)\sum_i X_i^j$ are consistent estimators of $\mu_j(\theta)$ by the law of large numbers. Setting them equal and solving the resulting system

$$\overline{X^j} \;=\; \mu_j(\hat{\theta}_{\text{MoM}}), \qquad j = 1, \ldots, k,$$

gives the MoM estimator $\hat{\theta}_{\text{MoM}}$. There are no derivatives, no iterations, no convexity arguments — just the inverse of the moment map.

Remark 1 Karl Pearson and the bimodal crab

The first systematic use of moment matching appears in Pearson’s 1894 paper “Contributions to the Mathematical Theory of Evolution,” where he fit a mixture of two Normals to the carapace-width measurements of 1000 crabs from the Bay of Naples. The mixture has five parameters — two means, two variances, and a mixing weight — so Pearson computed five sample moments and solved the resulting nonlinear system, by hand, to recover the underlying components. He used the method because no other recipe existed. The likelihood-maximization principle was still nearly thirty years away.

Remark 2 Why MoM came first

The reason MoM predates the MLE is operational. Until R. A. Fisher’s 1922 paper “On the Mathematical Foundations of Theoretical Statistics,” likelihood was viewed as one of several plausible criteria for fitting a distribution; there was no general argument for preferring it. What Fisher established — that the MLE is consistent, asymptotically Normal, and asymptotically efficient under regularity — required machinery (Fisher information, the Cramér–Rao bound, asymptotic theory) that did not exist in Pearson’s era. So a calculation that could be done won out over one that would have required theory not yet developed. The lesson is durable: even today, when the MLE is intractable, MoM is the first thing to reach for.

The catch — and it is a real one — is that MoM is almost never as good as the MLE in the asymptotic sense. The next eleven sections develop both the recipe and the diagnostic comparison. We will see that for the natural exponential families (§15.7), MoM coincides exactly with the MLE; outside that boundary, MoM trades efficiency for simplicity, and the trade is sometimes severe (Gamma shape, where MoM extracts only ~68% of the available information at $\alpha = 3$) and sometimes catastrophic (Cauchy, where MoM has nothing to match because the population mean does not exist). The framework that unifies all of this — both MoM and MLE — is M-estimation, the topic of §15.9–§15.11.

15.2 The Method-of-Moments Principle

The construction is straightforward but worth stating with care, because the moment map can be invertible globally, locally, or not at all, and the difference matters.

Definition 1 Sample Moment

For an iid sample $X_1, \ldots, X_n$, the $k$-th sample moment is

$$\overline{X^k} \;=\; \frac{1}{n}\sum_{i=1}^n X_i^k.$$

The first sample moment $\overline{X^1} = \bar{X}_n$ is the sample mean (Definition 13.2). By the strong law of large numbers, $\overline{X^k} \xrightarrow{a.s.} \mu_k = \mathbb{E}[X^k]$ whenever the population moment $\mu_k$ exists.

Definition 2 Method-of-Moments Estimator

Let $X_1, \ldots, X_n$ be iid from a parametric family with $\theta = (\theta_1, \ldots, \theta_k) \in \Theta \subseteq \mathbb{R}^k$, and write $\mu_j(\theta) = \mathbb{E}_\theta[X^j]$ for the $j$-th population moment as a function of $\theta$. Suppose the moment map

$$g\colon \Theta \longrightarrow \mathbb{R}^k, \qquad g(\theta) \;=\; \bigl(\mu_1(\theta), \mu_2(\theta), \ldots, \mu_k(\theta)\bigr)$$

is invertible on its image. The method-of-moments estimator $\hat{\theta}_{\text{MoM}}$ is the (unique) solution to the system of $k$ equations

$$\overline{X^j} \;=\; \mu_j(\hat{\theta}_{\text{MoM}}), \qquad j = 1, \ldots, k.$$

Equivalently, $\hat{\theta}_{\text{MoM}} = g^{-1}\bigl(\overline{X^1}, \ldots, \overline{X^k}\bigr)$.

The moment map is the bridge between parameter space and observable space, and the MoM estimator is what we get when we cross the bridge in the opposite direction from how the model is usually specified. Forward: $\theta \mapsto$ moments. Backward: sample moments $\mapsto \hat{\theta}$.

A schematic. On the left, parameter space Θ depicted as a region in ℝ². Curved arrows leave Θ and arrive at moment space ℝ² on the right, illustrating the map g(θ) = (μ₁(θ), μ₂(θ)). A point representing the sample moments (X̄, X̄²) sits in the moment space; a dashed arrow returns from that point to the corresponding θ̂ ∈ Θ via g⁻¹.
Theorem 1 Existence and uniqueness of MoM

If the moment map $g: \Theta \to \mathbb{R}^k$ is continuous and one-to-one on $\Theta$, and $(\overline{X^1}, \ldots, \overline{X^k})$ lies in the image $g(\Theta)$, then the MoM estimator exists and is unique. If $g$ is continuously differentiable with a non-singular Jacobian $\nabla g(\theta_0)$ at the true parameter $\theta_0$, then the inverse $g^{-1}$ is differentiable in a neighborhood of $g(\theta_0)$, and $\hat{\theta}_{\text{MoM}}$ is well-defined for sufficiently large $n$ with probability tending to one.

The two regularity conditions — invertibility of $g$ and non-singularity of its Jacobian — are exactly what we need for the asymptotic theory in §15.5 to go through. When they fail (mixtures with non-identifiable components; the Cauchy distribution, where $g$ is not even defined), MoM fails. We catalogue those failure modes in §15.6.

Example 1 Normal(μ, σ²): the simplest two-parameter case

For $X \sim \mathcal{N}(\mu, \sigma^2)$, the first two population moments are

$$\mu_1(\mu, \sigma^2) \;=\; \mu, \qquad \mu_2(\mu, \sigma^2) \;=\; \mu^2 + \sigma^2.$$

Setting $\overline{X^1} = \hat{\mu}$ and $\overline{X^2} = \hat{\mu}^2 + \hat{\sigma}^2$ and solving:

$$\hat{\mu}_{\text{MoM}} \;=\; \bar{X}_n,$$

and

$$\hat{\sigma}^2_{\text{MoM}} \;=\; \overline{X^2} - \bar{X}_n^2 \;=\; \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2 \;=\; S^2_n.$$

The MoM variance estimator is the biased $1/n$ form (the same form the MLE uses, see Example 14.2), not the Bessel-corrected $1/(n-1)$ form. This is structural: MoM matches the population variance to the raw second moment minus the squared first moment, which differs from the unbiased version by a factor of $n/(n-1)$. We will see in §15.7 that for the Normal variance, $\text{ARE}(\hat{\sigma}^2_{\text{MoM}}, \hat{\sigma}^2_{\text{MLE}}) = 1$ — they are the same estimator. (Neither is dominated in MSE by the unbiased $S^2_{n-1}$; for Normal data the $1/n$ form in fact has the smaller MSE.)
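The closed form is short enough to verify directly. A minimal numpy sketch (the true parameters and the seed are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.5, scale=2.0, size=500)   # true mu = 1.5, sigma^2 = 4

mu_hat = x.mean()                  # first sample moment
m2_hat = np.mean(x**2)             # second sample moment
sigma2_hat = m2_hat - mu_hat**2    # biased 1/n form, identical to np.var(x, ddof=0)

print(mu_hat, sigma2_hat)          # close to (1.5, 4.0)
```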

Example 2 Exponential(λ): MoM = MLE

For $X \sim \text{Exp}(\lambda)$, $\mathbb{E}[X] = 1/\lambda$. So $\bar{X}_n = 1/\hat{\lambda}_{\text{MoM}}$, giving

$$\hat{\lambda}_{\text{MoM}} \;=\; \frac{1}{\bar{X}_n}.$$

This is identical to the MLE (Example 14.7). The reason — to be made precise in Theorem 5 of §15.7 — is that the Exponential is an exponential family in its natural parameter, and for such families the MLE is a moment-matching estimator. The two recipes coincide.

Example 3 Poisson(λ): MoM = MLE

For $X \sim \text{Poisson}(\lambda)$, $\mathbb{E}[X] = \lambda$. So

$$\hat{\lambda}_{\text{MoM}} \;=\; \bar{X}_n,$$

again identical to the MLE. The Poisson is also a natural exponential family. The pattern will become explicit in §15.7.

Remark 3 Scope: k equations for k parameters

The construction in Definition 2 uses exactly $k$ moment equations to identify $k$ parameters. If we attempted to use more — say, the first three sample moments to estimate two parameters — the system is over-determined and generally has no exact solution. Choosing which two equations to enforce, or how to weight all three, is the move that takes us from MoM to the generalized method of moments (GMM, Hansen 1982). We preview GMM in §15.12. For the rest of this topic, we hold to the exact-identification setting: $k$ moments for $k$ parameters.

15.3 Scalar Examples: Uniform, Bernoulli, Geometric

Three scalar-parameter examples deepen the intuition. The Uniform case is the cautionary one: MoM is much worse than the MLE.

Four-panel grid. Top-left: Uniform(0, θ) with the MoM estimate 2X̄ and the MLE estimate X_(n) marked on a sample histogram, with the true θ shown as a vertical line — MLE visibly closer. Top-right: Exponential — MoM and MLE coincide on the density. Bottom-left: Poisson — same coincidence. Bottom-right: Geometric — MoM closed form, equals MLE.
Example 4 Uniform(0, θ): MoM is dramatically worse than MLE

For $X \sim \text{Uniform}(0, \theta)$, $\mathbb{E}[X] = \theta/2$. So $\bar{X}_n = \hat{\theta}_{\text{MoM}}/2$, giving

$$\hat{\theta}_{\text{MoM}} \;=\; 2\bar{X}_n.$$

This is a perfectly reasonable estimator — consistent, with asymptotic variance $\theta^2/(3n)$ from the delta method. But it has two pathologies that the MLE does not. First, with positive probability $\hat{\theta}_{\text{MoM}} < X_{(n)}$, where $X_{(n)} = \max_i X_i$ is the largest observation. That estimate is infeasible: the parameter $\theta$ must be at least as large as every observed value. Second, the MLE for this problem is

$$\hat{\theta}_{\text{MLE}} \;=\; X_{(n)},$$

with asymptotic variance of order $\theta^2/n^2$ — an entire factor of $n$ better than MoM's $\theta^2/(3n)$. The ratio $\operatorname{ARE}(\hat{\theta}_{\text{MoM}}, \hat{\theta}_{\text{MLE}}) = 3/(n+2) \to 0$ as $n \to \infty$.

We do not call this a failure of MoM — MoM gives an answer, and the answer converges to $\theta$. We call it the most graphic illustration of why efficiency matters: when the MLE achieves a faster convergence rate (here $1/n$ vs. $1/\sqrt{n}$), the asymptotic ratio of variances is not a fixed number but a sequence going to zero.
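A small Monte Carlo makes both points concrete — the infeasible estimates and the rate gap. A sketch, with $\theta$, $n$, and the replication count chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 5.0, 100, 2000

x = rng.uniform(0.0, theta, size=(reps, n))
mom = 2.0 * x.mean(axis=1)    # MoM: 2 * sample mean
mle = x.max(axis=1)           # MLE: sample maximum

# MoM can fall below the largest observation -- an infeasible estimate
print("P(MoM < max obs):", np.mean(mom < mle))
print("Var(MoM):", mom.var(), "  theory theta^2/(3n) =", theta**2 / (3 * n))
print("Var(MLE):", mle.var(), "  theory ~ theta^2/n^2 =", theta**2 / n**2)
```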

Example 5 Bernoulli(p): MoM = MLE

For $X \sim \text{Bernoulli}(p)$, $\mathbb{E}[X] = p$, so

$$\hat{p}_{\text{MoM}} \;=\; \bar{X}_n.$$

The Bernoulli is yet another natural exponential family, so MoM = MLE. Three for three so far in the exponential-family column.

Example 6 Geometric(p): MoM closed form

The Geometric distribution comes in two conventions; we adopt the failures-before-first-success convention with $X \in \{0, 1, 2, \ldots\}$ and $\mathbb{P}(X = k) = (1-p)^k p$. Then $\mathbb{E}[X] = (1-p)/p$, and setting $\bar{X}_n = (1-\hat{p})/\hat{p}$ gives

$$\hat{p}_{\text{MoM}} \;=\; \frac{1}{1 + \bar{X}_n}.$$

If instead we use the trials-until-first-success convention with $X \in \{1, 2, \ldots\}$, then $\mathbb{E}[X] = 1/p$ and $\hat{p}_{\text{MoM}} = 1/\bar{X}_n$. Both conventions give exponential families; both estimators coincide with the MLE.
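A quick numerical check of the failures convention (numpy's geometric sampler returns trials until the first success, so we subtract one; the seed and $p$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 0.3
x = rng.geometric(p, size=10_000) - 1   # shift trials {1,2,...} to failures {0,1,...}

p_mom = 1.0 / (1.0 + x.mean())          # E[X] = (1-p)/p  =>  p = 1/(1 + E[X])
print(p_mom)                            # close to 0.3
```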

Remark 4 When MoM lies outside the parameter space

Example 4 hinted at it; the issue is general. The MoM estimate is the inverse moment map applied to sample moments, which can land anywhere in $\mathbb{R}^k$ — not just in $g(\Theta)$. A few realistic ways this bites:

  • Exploding shape. For a Gamma$(\alpha, \beta)$ sample, $\hat{\alpha}_{\text{MoM}} = \bar{X}^2/S^2$ requires $S^2 > 0$. With probability one this holds for $n \geq 2$ and a non-degenerate sample, but a near-constant sample (small $S^2$) makes the estimate explode.
  • Negative shape parameters. For a Beta$(\alpha, \beta)$, the MoM solves $\hat{\alpha} = \bar{X} \cdot c$ where $c = \bar{X}(1-\bar{X})/S^2 - 1$; if $c \le 0$ the formal MoM gives $\hat{\alpha} \le 0$, outside the parameter space.
  • Mixing weight outside [0, 1]. Pearson’s two-Normal mixture famously yielded mixing weights below 0 and above 1 on certain datasets — Pearson reported this as a feature of the data (“the model is mis-specified”) rather than a bug of MoM.

The standard fix is to project onto the constraint set, but at that point we have abandoned pure MoM. Better diagnostic: when MoM gives an absurd answer, suspect the model.

Interactive: MoM ↔ MLE for Gamma(α, β)
MoM gives a closed-form estimate in two lines of algebra; MLE requires Newton-Raphson on the digamma equation. Both target the same true (α, β) at large n. One snapshot at $n = 100$: $\bar{X}_n = 1.4945$, $S^2_n = 0.8692$ (biased); MoM gives $\hat\alpha = \bar{X}_n^2/S^2_n = 2.5697$ and $\hat\beta = \bar{X}_n/S^2_n = 1.7194$; the MoM-warm-started Newton-Raphson MLE converges in 4 iterations to $\hat\alpha = 2.5906$, $\hat\beta = \hat\alpha/\bar{X}_n = 1.7334$.
Cost: MoM is O(n); MLE is O(n · iter). At α = 3, β = 2 with n = 100, both estimates typically land within ~10% of the truth, but MoM extracts ≈ 68% of MLE's information (ARE ≈ 0.68 — see §15.7).

15.4 Multivariate MoM: The Two-Parameter Gamma

The Gamma distribution $\text{Gamma}(\alpha, \beta)$ with shape $\alpha > 0$ and rate $\beta > 0$ is the canonical example of a continuous family where MoM has a clean closed form and the MLE requires Newton-Raphson. Both estimators recover $(\alpha, \beta)$ consistently, but the MoM gets there in two lines of algebra. This contrast — algebraic simplicity for MoM, iterative computation for MLE — is the heart of why MoM remains useful.

Definition 3 Multivariate moment map

For a model with $k$ parameters $\theta = (\theta_1, \ldots, \theta_k) \in \Theta \subseteq \mathbb{R}^k$, the moment map $g: \Theta \to \mathbb{R}^k$ is the vector of population moments,

$$g(\theta) \;=\; \bigl(\mu_1(\theta),\, \mu_2(\theta),\, \ldots,\, \mu_k(\theta)\bigr).$$

Its Jacobian at $\theta$ is the $k \times k$ matrix

$$\nabla g(\theta) \;=\; \left[\, \frac{\partial \mu_i(\theta)}{\partial \theta_j} \,\right]_{1 \le i, j \le k},$$

with rows indexing moments and columns indexing parameters. The MoM estimator is $\hat{\theta}_{\text{MoM}} = g^{-1}(\overline{X^1}, \ldots, \overline{X^k})$, well-defined when $\nabla g(\theta_0)$ is invertible at the true parameter.

For the Gamma, the moments are well known:

$$\mu_1(\alpha, \beta) \;=\; \frac{\alpha}{\beta}, \qquad \mu_2(\alpha, \beta) \;=\; \frac{\alpha(\alpha+1)}{\beta^2}.$$

It is more convenient to work with the variance $\sigma^2 = \mu_2 - \mu_1^2 = \alpha/\beta^2$ alongside the mean. Setting $\bar{X}_n = \hat{\alpha}/\hat{\beta}$ and $S^2_n = \hat{\alpha}/\hat{\beta}^2$ (with $S^2_n = \overline{X^2} - \bar{X}_n^2$ the biased sample variance from §15.2) and dividing the first by the second,

$$\frac{\bar{X}_n}{S^2_n} \;=\; \hat{\beta},$$

and then

$$\hat{\alpha} \;=\; \hat{\beta} \cdot \bar{X}_n \;=\; \frac{\bar{X}_n^2}{S^2_n}.$$

Closed form, two lines. The MLE for the Gamma shape, in contrast, has no closed form: the score equation $\log\beta - \psi(\alpha) + \overline{\log X} = 0$ involves the digamma function $\psi(\alpha)$, which has no algebraic inverse. Topic 14 §14.7 walked through the Newton-Raphson solution.
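In code the contrast is one formula versus one root-find. A sketch using scipy's digamma and a bracketing root-finder on the profile score (substituting $\hat\beta = \alpha/\bar{X}_n$); the bracket around the MoM estimate is a heuristic that works for non-pathological samples:

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

rng = np.random.default_rng(2)
alpha0, beta0, n = 3.0, 2.0, 200
x = rng.gamma(shape=alpha0, scale=1.0 / beta0, size=n)

# --- MoM: two lines of algebra ---
xbar, s2 = x.mean(), x.var()          # biased 1/n variance
alpha_mom, beta_mom = xbar**2 / s2, xbar / s2

# --- MLE: root of the profile score  log(alpha/xbar) - psi(alpha) + mean(log x) = 0 ---
logx_bar = np.log(x).mean()
score = lambda a: np.log(a / xbar) - digamma(a) + logx_bar
alpha_mle = brentq(score, 0.5 * alpha_mom, 2.0 * alpha_mom)  # bracket around the MoM estimate
beta_mle = alpha_mle / xbar

print("MoM:", alpha_mom, beta_mom)
print("MLE:", alpha_mle, beta_mle)
```

brentq is reliable here because the profile score is strictly decreasing in $\alpha$ (its derivative $1/\alpha - \psi'(\alpha)$ is negative), so the root is unique.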

Three panels. Left: scatter of (X̄, S²) over 1000 simulated Gamma(3, 2) samples at n=50, with the contour of the joint asymptotic Normal overlaid. Middle: the same samples mapped through the inverse moment map g⁻¹ to (α̂_MoM, β̂_MoM), with the limiting Normal contour from the matrix delta method overlaid. Right: a vector field on (m₁, m₂) space showing the Jacobian ∇g of the moment-to-parameter map at the true parameter.
Theorem 2 Multivariate MoM consistency and asymptotic normality (Gamma case)

Let $X_1, \ldots, X_n$ be iid $\text{Gamma}(\alpha_0, \beta_0)$ with $\alpha_0, \beta_0 > 0$, and let $\hat{\alpha}_{\text{MoM}} = \bar{X}_n^2/S^2_n$, $\hat{\beta}_{\text{MoM}} = \bar{X}_n/S^2_n$. Then

$$(\hat{\alpha}_{\text{MoM}}, \hat{\beta}_{\text{MoM}}) \xrightarrow{a.s.} (\alpha_0, \beta_0),$$

and

$$\sqrt{n}\,\bigl((\hat{\alpha}, \hat{\beta}) - (\alpha_0, \beta_0)\bigr) \xrightarrow{d} \mathcal{N}\bigl(\mathbf{0},\, \nabla g(\alpha_0, \beta_0)^{-1}\, \Sigma_{\text{moments}}\, \nabla g(\alpha_0, \beta_0)^{-\top}\bigr),$$

where $\Sigma_{\text{moments}}$ is the limiting covariance of $\sqrt{n}\,(\bar{X}_n - \mu_1, \overline{X^2} - \mu_2)$ given by the multivariate CLT, and $\nabla g$ is the Jacobian of the moment map.

Proof.

We apply the matrix delta method (Theorem 11.9, multivariate form) to the moment-to-parameter map.

Step 1 — joint asymptotic Normality of the sample moments. The pair $(X_i, X_i^2)$ for each $i$ is iid with mean $(\mu_1, \mu_2)$ and covariance matrix $\Sigma$ where

$$\Sigma_{11} \;=\; \operatorname{Var}(X) \;=\; \alpha_0/\beta_0^2,$$
$$\Sigma_{12} \;=\; \operatorname{Cov}(X, X^2) \;=\; \mathbb{E}[X^3] - \mathbb{E}[X]\,\mathbb{E}[X^2] \;=\; \frac{2\alpha_0(\alpha_0 + 1)}{\beta_0^3},$$
$$\Sigma_{22} \;=\; \operatorname{Var}(X^2) \;=\; \mathbb{E}[X^4] - (\mathbb{E}[X^2])^2 \;=\; \frac{2\alpha_0(\alpha_0 + 1)(2\alpha_0 + 3)}{\beta_0^4}.$$

By the multivariate CLT (Topic 11 §11.8) applied to the iid pairs,

$$\sqrt{n}\,\bigl((\bar{X}_n, \overline{X^2}) - (\mu_1, \mu_2)\bigr) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \Sigma).$$

Step 2 — invertibility of the moment map. The map $g(\alpha, \beta) = (\alpha/\beta,\ \alpha(\alpha+1)/\beta^2)$ has Jacobian

$$\nabla g(\alpha, \beta) \;=\; \begin{pmatrix} 1/\beta & -\alpha/\beta^2 \\ (2\alpha + 1)/\beta^2 & -2\alpha(\alpha+1)/\beta^3 \end{pmatrix}.$$

Its determinant evaluates to $\det \nabla g = -\alpha/\beta^4 \neq 0$ for $\alpha, \beta > 0$, so $g$ is locally invertible at every interior point. The inverse $g^{-1}$ is continuously differentiable in a neighborhood of $g(\alpha_0, \beta_0)$.

Step 3 — apply the delta method. Writing $h = g^{-1}$ for the inverse map and $\nabla h(g(\alpha_0, \beta_0)) = \nabla g(\alpha_0, \beta_0)^{-1}$ for its Jacobian (by the inverse function theorem),

$$\sqrt{n}\,\bigl(h(\bar{X}_n, \overline{X^2}) - h(\mu_1, \mu_2)\bigr) \xrightarrow{d} \mathcal{N}\bigl(\mathbf{0},\, \nabla h \cdot \Sigma \cdot \nabla h^\top\bigr).$$

Since $h(\bar{X}_n, \overline{X^2}) = (\hat{\alpha}_{\text{MoM}}, \hat{\beta}_{\text{MoM}})$ and $h(\mu_1, \mu_2) = (\alpha_0, \beta_0)$, this is the asymptotic-normality claim with covariance $\nabla g^{-1} \Sigma \nabla g^{-\top}$.

Step 4 — consistency from the SLLN. The sample moments converge a.s. to the population moments by the strong LLN (Topic 10 §10.5). Continuity of $h$ at $(\mu_1, \mu_2)$ together with the continuous-mapping theorem (Topic 11 §11.4) gives

$$(\hat{\alpha}_{\text{MoM}}, \hat{\beta}_{\text{MoM}}) \;=\; h(\bar{X}_n, \overline{X^2}) \xrightarrow{a.s.} h(\mu_1, \mu_2) \;=\; (\alpha_0, \beta_0).$$

This establishes both consistency and asymptotic normality. ∎ — using the multivariate CLT (§11.8), the delta method (§11.9), and the continuous-mapping theorem.

Example 7 Gamma(α, β) MoM: closed form

Restating from the derivation above: for $X_1, \ldots, X_n$ iid $\text{Gamma}(\alpha, \beta)$,

$$\hat{\alpha}_{\text{MoM}} \;=\; \frac{\bar{X}_n^2}{S^2_n}, \qquad \hat{\beta}_{\text{MoM}} \;=\; \frac{\bar{X}_n}{S^2_n},$$

where $S^2_n = (1/n)\sum_i (X_i - \bar{X}_n)^2$ is the biased ($1/n$) sample variance — the same form used by the MLE for the Normal variance, and the form that makes the MoM and the maximum-likelihood approach align as cleanly as possible. With $n = 200$ samples from $\text{Gamma}(3, 2)$, the asymptotic standard deviation of $\hat{\alpha}_{\text{MoM}}$ is $\sqrt{2\alpha(\alpha+1)/n} \approx 0.35$, so a single replicate typically lands within a few tenths of $\alpha = 3$. The MLE for $\alpha$ is more accurate by a factor of $1/\sqrt{0.68} \approx 1.21$ in standard error (the ARE = 0.68 of §15.7); whether that matters depends on the application.

Example 8 Beta(α, β) MoM: dual-moment closed form

For $X \sim \text{Beta}(\alpha, \beta)$, the mean and variance are

$$\mu_1 \;=\; \frac{\alpha}{\alpha + \beta}, \qquad \sigma^2 \;=\; \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.$$

A short manipulation yields

$$\hat{\alpha}_{\text{MoM}} \;=\; \bar{X}_n \cdot c, \qquad \hat{\beta}_{\text{MoM}} \;=\; (1 - \bar{X}_n) \cdot c, \qquad c \;=\; \frac{\bar{X}_n(1 - \bar{X}_n)}{S^2_n} - 1,$$

with $S^2_n$ the same biased sample variance. The MLE for the Beta has no closed form — the score equations involve digamma functions — so the gap in computational cost is genuine. We will see in §15.8 that fitting a Beta in scipy or R uses the MoM estimate as the warm start for the MLE iteration.
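A minimal sketch wrapping the closed form, including the out-of-range check previewed in Remark 4 (the function name and the error handling are illustrative choices):

```python
import numpy as np

def beta_mom(x):
    """Closed-form Beta(alpha, beta) MoM; the formal solution can leave the parameter space."""
    xbar, s2 = x.mean(), x.var()              # biased 1/n variance
    c = xbar * (1.0 - xbar) / s2 - 1.0
    if c <= 0:                                # formal MoM gives alpha <= 0: suspect the model
        raise ValueError("sample variance too large for any Beta distribution")
    return xbar * c, (1.0 - xbar) * c

rng = np.random.default_rng(4)
x = rng.beta(2.0, 5.0, size=500)
print(beta_mom(x))   # close to (2, 5)
```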

15.5 Consistency and Asymptotic Normality of MoM

Theorem 2 made the multivariate Gamma case rigorous. The general multivariate-MoM theorem is the same proof with the model-specific moments and Jacobian replaced by abstract symbols.

Theorem 3 MoM consistency (general)

Let $X_1, \ldots, X_n$ be iid from a model with parameter $\theta_0 \in \Theta \subseteq \mathbb{R}^k$. Suppose:

  1. The first $k$ population moments $\mu_1(\theta), \ldots, \mu_k(\theta)$ exist and are continuous in $\theta$.
  2. The moment map $g(\theta) = (\mu_1(\theta), \ldots, \mu_k(\theta))$ is one-to-one on $\Theta$, and $g^{-1}$ is continuous on a neighborhood of $g(\theta_0)$.

Then $\hat{\theta}_{\text{MoM}} \xrightarrow{a.s.} \theta_0$ as $n \to \infty$.

The proof is one line: $(\overline{X^1}, \ldots, \overline{X^k}) \xrightarrow{a.s.} (\mu_1(\theta_0), \ldots, \mu_k(\theta_0))$ by the SLLN (componentwise), and $\hat{\theta}_{\text{MoM}} = g^{-1}(\overline{X^1}, \ldots, \overline{X^k})$ converges a.s. to $g^{-1}(g(\theta_0)) = \theta_0$ by continuity of $g^{-1}$ and the continuous-mapping theorem (Topic 11 §11.4).

The asymptotic-normality theorem requires more.

Three histograms of √n(α̂_MoM − α₀) for the Gamma shape parameter, n = 20, 100, 500. The theoretical N(0, AsympVar) density is overlaid on each. As n grows, the histogram tightens and centers on zero, becoming visibly Normal.
Theorem 4 MoM asymptotic normality (general)

Under the conditions of Theorem 3, additionally suppose:

  1. $\mathbb{E}_{\theta_0}[X^{2k}] < \infty$ (finite higher moments — needed for the CLT to apply to sample moments).
  2. The moment map $g$ is continuously differentiable in a neighborhood of $\theta_0$ with non-singular Jacobian $\nabla g(\theta_0)$.

Then

$$\sqrt{n}\,(\hat{\theta}_{\text{MoM}} - \theta_0) \xrightarrow{d} \mathcal{N}\bigl(\mathbf{0},\, \nabla g(\theta_0)^{-1}\, \Sigma(\theta_0)\, \nabla g(\theta_0)^{-\top}\bigr),$$

where $\Sigma(\theta_0)$ is the $k \times k$ covariance matrix of $(X, X^2, \ldots, X^k)$ under $\theta_0$.

Proof.

The proof generalizes Steps 1–3 of Proof 1 (Theorem 2) with the Gamma-specific computations replaced by abstract symbols.

Step 1 — joint asymptotic Normality of sample moments. The vector $(X_i, X_i^2, \ldots, X_i^k)$ is iid across $i$ with mean $(\mu_1(\theta_0), \ldots, \mu_k(\theta_0))$ and covariance matrix $\Sigma(\theta_0)$ whose $(j, \ell)$ entry is $\operatorname{Cov}(X^j, X^\ell) = \mu_{j+\ell} - \mu_j \mu_\ell$. Finite $\mu_{2k}$ ensures $\Sigma$ has finite entries. The multivariate CLT (Topic 11 §11.8) then gives

$$\sqrt{n}\,\bigl((\overline{X^1}, \ldots, \overline{X^k}) - (\mu_1(\theta_0), \ldots, \mu_k(\theta_0))\bigr) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \Sigma(\theta_0)).$$

Step 2 — apply the delta method. Let $h = g^{-1}$. By the inverse function theorem, $h$ is continuously differentiable in a neighborhood of $g(\theta_0)$ with Jacobian $\nabla h(g(\theta_0)) = \nabla g(\theta_0)^{-1}$. The multivariate delta method (Topic 11 §11.9) then transforms the asymptotic distribution of the sample moments through $h$:

$$\sqrt{n}\,\bigl(h(\overline{X^1}, \ldots, \overline{X^k}) - h(\mu_1(\theta_0), \ldots, \mu_k(\theta_0))\bigr) \xrightarrow{d} \mathcal{N}\bigl(\mathbf{0},\, \nabla g^{-1}\, \Sigma\, \nabla g^{-\top}\bigr).$$

Since $h$ applied to the sample moments is $\hat{\theta}_{\text{MoM}}$, and $h$ applied to the population moments is $\theta_0$, this is the claimed limit. ∎ — using the multivariate CLT (§11.8), the delta method (§11.9), and the continuous-mapping theorem.

Remark 5 Regularity conditions

The four assumptions — Theorem 3's conditions (1)–(2) plus Theorem 4's (3)–(4) — are exactly the ingredients the proof needs. (1)–(2) are the identification assumptions: the moments exist and the moment map distinguishes parameters. (3) is the Lyapunov-style higher-moment assumption that lets the CLT apply componentwise to the iid powers $X_i^j$. (4) is the smoothness assumption that lets the delta method propagate the limit through $g^{-1}$. When any of these fails — most notably (1) for Cauchy and (3) for heavy-tailed Pareto — MoM does not just have a worse asymptotic variance; it has no asymptotic variance, because the underlying CLT does not apply. We see those failure modes in §15.6.

Example 9 Asymptotic variance of Exponential MoM = CRLB

The Exponential$(\lambda)$ MoM is $\hat{\lambda}_{\text{MoM}} = 1/\bar{X}_n$. By Theorem 4 with $k = 1$ and the delta method on $h(m) = 1/m$,

$$\sqrt{n}\,(\hat{\lambda} - \lambda_0) \xrightarrow{d} \mathcal{N}\bigl(0,\, h'(1/\lambda_0)^2 \cdot \operatorname{Var}(X)\bigr).$$

Computing: $h'(m) = -1/m^2$, so $h'(1/\lambda_0) = -\lambda_0^2$; and $\operatorname{Var}(X) = 1/\lambda_0^2$. Hence

$$\sqrt{n}\,(\hat{\lambda} - \lambda_0) \xrightarrow{d} \mathcal{N}(0,\, \lambda_0^4 \cdot 1/\lambda_0^2) \;=\; \mathcal{N}(0,\, \lambda_0^2).$$

The Cramér–Rao lower bound for the Exponential rate (Example 13.10) is $\lambda_0^2/n$ — exactly what we just computed. So the Exponential MoM is asymptotically efficient, achieving the CRLB. This is the §15.7 statement that the MoM matches the MLE for natural exponential families, made concrete.
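A Monte Carlo check of this limit: the scaled variance $n \cdot \operatorname{Var}(\hat\lambda)$ should settle near $\lambda_0^2$ (all constants are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(5)
lam0, n, reps = 2.0, 400, 5000

x = rng.exponential(scale=1.0 / lam0, size=(reps, n))
lam_hat = 1.0 / x.mean(axis=1)      # MoM = MLE for the Exponential rate

print(n * lam_hat.var())            # ~ lambda_0^2 = 4.0, the CRLB scaled by n
print(lam0**2)
```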

Interactive: ARE(MoM, MLE) — empirical vs theoretical
Pick a family. The left panel overlays the sampling distributions of MoM (blue) and MLE (orange) over many Monte-Carlo replications; the right panel plots the theoretical ARE curve — for the Gamma shape, $1/[2(\alpha+1)(\alpha\,\psi'(\alpha) - 1)]$ — with the empirical point marked in red.

15.6 When the Method of Moments Fails

The conditions of Theorem 4 can fail in three structurally different ways. Each illuminates the moment-matching principle from the negative side.

Definition 4 Moment non-existence

A distribution has finite $k$-th moment if $\mathbb{E}[|X|^k] < \infty$. When $\mathbb{E}[|X|^k] = \infty$, the $k$-th moment does not exist, the SLLN does not apply to $\overline{X^k}$, and the MoM equation $\overline{X^k} = \mu_k(\hat{\theta})$ has no meaningful right-hand side. The MoM estimator is then undefined, not merely badly behaved.

Three panels. Left: running sample mean of a Cauchy sample on log-x axis, showing wild excursions to ±50 even at n = 10,000. Middle: running sample median of the same Cauchy sample, converging smoothly to 0. Right: histogram of the sample mean across 500 replications of Pareto(α=0.8) at n=1000, displaying severe right skew with the bulk near 5 and a long tail extending past 50.
Example 10 Cauchy: no mean, MoM undefined; MLE consistent

The Cauchy distribution with location $\theta$ and scale 1 has density $f(x; \theta) = 1/[\pi(1 + (x - \theta)^2)]$. Its tails decay like $1/x^2$, so

$$\int_{-\infty}^\infty |x| \cdot \frac{1}{\pi(1 + x^2)}\, dx \;=\; \infty,$$

i.e., $\mathbb{E}[|X|] = \infty$ and the population mean does not exist. The MoM equation $\hat{\theta} = \bar{X}_n$ sets $\hat{\theta}$ equal to a quantity that does not converge as $n \to \infty$: the sample mean of Cauchy data exhibits wild excursions at every scale — large jumps appear with positive probability at every $n$, so the sequence $\bar{X}_n$ almost surely has no limit.

The MLE, by contrast, is consistent. The log-likelihood

$$\ell(\theta) \;=\; -\sum_i \log\bigl(1 + (X_i - \theta)^2\bigr) - n\log\pi$$

is twice continuously differentiable, and the score equation expands to a polynomial in $\theta$ of degree $2n - 1$, so the likelihood can have multiple local maxima at finite $n$; the global maximizer is nonetheless consistent, and the Fisher information $I(\theta) = 1/2$ is finite. The MLE is asymptotically Normal with variance $2/n$. Cauchy is the cleanest example of MoM failing while MLE succeeds: the population mean does not exist, but the population is fully identified by the location of its peak — and the likelihood sees the peak.

Example 11 Pareto with α ≤ 1: infinite mean

The Pareto distribution with shape $\alpha$ and scale $x_m$ has density $f(x; \alpha, x_m) = \alpha x_m^\alpha / x^{\alpha+1}$ for $x \geq x_m$. Its first moment is

$$\mathbb{E}[X] \;=\; \int_{x_m}^\infty x \cdot \frac{\alpha x_m^\alpha}{x^{\alpha+1}}\, dx \;=\; \alpha x_m^\alpha \int_{x_m}^\infty x^{-\alpha}\, dx,$$

which converges only when $\alpha > 1$, in which case $\mathbb{E}[X] = \alpha x_m / (\alpha - 1)$. For $\alpha \leq 1$, $\mathbb{E}[X] = \infty$, and the MoM equation $\bar{X}_n = \alpha x_m / (\alpha - 1)$ has no population side to match. Empirically, $\bar{X}_n$ for Pareto$(0.8)$ data does not converge but instead grows roughly like $n^{1/\alpha - 1}$ — heavy-tailed extremes pull it up at every sample size.

The MLE for the Pareto $\alpha$ is closed-form: $\hat{\alpha}_{\text{MLE}} = n / \sum_i \log(X_i / x_m)$, and is consistent with finite asymptotic variance, regardless of whether $\alpha > 1$. So once again — even more cleanly than Cauchy — the MoM fails because $\mathbb{E}[X]$ does not exist, while the MLE works because the log-transformed data has finite moments.

Example 12 Non-identifiable mixture: equal means

Consider the mixture model $X \sim \tfrac{1}{2}\mathcal{N}(0, 1) + \tfrac{1}{2}\mathcal{N}(0, \sigma^2)$ — equal mixing weights, common mean zero, two variance components. The first three population moments are

$$\mu_1 \;=\; 0, \qquad \mu_2 \;=\; \tfrac{1}{2}(1 + \sigma^2), \qquad \mu_3 \;=\; 0.$$

Only the second moment depends on $\sigma^2$. The first moment is identically zero (by symmetry), and the third moment is identically zero (by symmetry of each Normal component). So the moment map $g(\sigma^2) = (\mu_1, \mu_2, \mu_3) = (0, (1 + \sigma^2)/2, 0)$ carries information only in its second coordinate — its image is a one-dimensional curve embedded in three-dimensional moment space. With three moments and one parameter, MoM is over-determined; with one moment and one parameter, MoM works trivially.

The deeper issue surfaces when we try to estimate more parameters in the mixture — say, both variances and the mixing weight. With two distinct components having equal means, the odd moments vanish identically and carry no information, so the moment system degenerates: the same low-order population moments are consistent with multiple parameter combinations. The moment map is not invertible. MoM fails because the model is not identifiable from those moments alone. The MLE faces the same symmetry — but it has access to the full likelihood, which uses the shape of each observation's contribution and can separate the components even when their low-order moments coincide.

Remark 6 Absurd estimates in practice

The pattern in Examples 10–12 is general. When MoM fails, it fails visibly: an undefined limit, an estimate outside the parameter space, or a system that simply has no unique solution. This is, perversely, a strength — MoM tells you when it has failed, often more loudly than the MLE does. If $\hat{\sigma}^2_{\text{MoM}} < 0$ or $\hat{p}_{\text{MoM}} > 1$ on real data, the appropriate response is not to project the estimate onto the constraint set but to ask whether the model is right.

Interactive: Running mean vs running median
For Cauchy and heavy-tailed Pareto, the running mean never converges (the LLN fails because $\mathbb{E}[|X|] = \infty$), but the running median converges smoothly to the population median (here 0). Watch the contrast: the MoM equation $\bar{X}_n = \mu_1(\hat{\theta})$ has nothing to match — there is no population moment to converge to.
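The contrast is easy to reproduce offline. A sketch of the running-mean diagnostic for Cauchy data (checkpoints chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.standard_cauchy(size=100_000)

n = np.arange(1, x.size + 1)
running_mean = np.cumsum(x) / n     # no population mean: never settles
print("running mean at checkpoints:",
      [float(running_mean[k - 1]) for k in (100, 1_000, 10_000, 100_000)])

# the median has a population target (0) and converges to it
print("sample median:", float(np.median(x)))
```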

15.7 Asymptotic Relative Efficiency: MoM vs MLE

The reason MoM coincides with the MLE for the Exponential (Example 2) and Poisson (Example 3) is structural — those families admit a representation in which the MLE is a moment-matching estimator. The reason MoM diverges from the MLE for the Gamma shape is also structural — the Gamma is an exponential family, but only the rate is in the natural parameter; the shape requires a different sufficient statistic. We make this rigorous now.

Definition 5 ARE — recalled from Topic 13

For two consistent, asymptotically Normal estimators $\hat{\theta}_1$ and $\hat{\theta}_2$ of the same parameter $\theta$, with asymptotic variances $\sigma^2_1(\theta)/n$ and $\sigma^2_2(\theta)/n$, the asymptotic relative efficiency of $\hat{\theta}_1$ relative to $\hat{\theta}_2$ is

$$\operatorname{ARE}(\hat{\theta}_1, \hat{\theta}_2) \;=\; \frac{\sigma^2_2(\theta)}{\sigma^2_1(\theta)}.$$

When this ratio is less than 1, $\hat{\theta}_1$ is asymptotically less efficient than $\hat{\theta}_2$. ARE = 1 means the two estimators are asymptotically equivalent. (Definition 13.13 of Topic 13.)

Theorem 5 Exponential-family coincidence: MoM ≡ MLE in the natural parameter

Let $\{f(x; \eta) : \eta \in \mathcal{E}\}$ be a $k$-parameter exponential family in natural parameter form,

$$f(x; \eta) \;=\; h(x) \exp\bigl(\eta^\top T(x) - A(\eta)\bigr),$$

with sufficient statistic $T(x) = (T_1(x), \ldots, T_k(x))$ and log-partition function $A(\eta)$. The MLE $\hat{\eta}_{\text{MLE}}$ satisfies

$$\nabla A(\hat{\eta}_{\text{MLE}}) \;=\; \overline{T(X)} \;=\; \frac{1}{n}\sum_{i=1}^n T(X_i).$$

This is exactly the moment-matching equation: $\nabla A(\eta) = \mathbb{E}_\eta[T(X)]$ is the population mean of the sufficient statistic, so the MLE equates the model's expected sufficient statistic to the sample average — which is the MoM equation when the sufficient statistics are the moments. In particular, for distributions whose sufficient statistic is $T(x) = x$ (Exponential, Bernoulli, Poisson), the MLE is the moment-matching estimator $\hat{\theta}_{\text{MoM}} = \bar{X}_n$, so $\operatorname{ARE}(\text{MoM}, \text{MLE}) = 1$.

The proof is the one-line argument from Topic 7 §7.8: differentiating $\ell(\eta) = \eta^\top \overline{T} - A(\eta) + \text{const}$ with respect to $\eta$ and setting the result to zero gives $\overline{T} = \nabla A(\eta)$. By the identity $\mathbb{E}_\eta[T(X)] = \nabla A(\eta)$ (a moment-generating-function calculation, also Topic 7 §7.8), this is moment matching.

Two panels. Left: ARE(MoM, MLE) for the Gamma shape parameter α as a function of α — a curve that starts near 0.4 at α = 1, rises to about 0.68 at α = 3, and approaches 1 as α grows large. Right: empirical verification at n = 200 — sample variance ratio Var(α̂_MoM)/Var(α̂_MLE) over 1000 Monte Carlo replications, as scattered points lying close to the theoretical curve.
Theorem 6 ARE(MoM, MLE) for the Gamma shape parameter

For $X_1, \ldots, X_n$ iid $\text{Gamma}(\alpha, \beta)$, with the MoM estimator $\hat{\alpha}_{\text{MoM}} = \bar{X}_n^2/S^2_n$ and the MLE $\hat{\alpha}_{\text{MLE}}$ from the score equation, the asymptotic variances are

$$\operatorname{AsympVar}(\sqrt{n}\,\hat{\alpha}_{\text{MoM}}) \;=\; 2\alpha(\alpha + 1),$$

and (from the inverse Fisher-information matrix at $(\alpha, \beta)$)

$$\operatorname{AsympVar}(\sqrt{n}\,\hat{\alpha}_{\text{MLE}}) \;=\; \frac{\alpha}{\alpha\,\psi'(\alpha) - 1},$$

where $\psi'(\alpha)$ is the trigamma function. Therefore

$$\operatorname{ARE}(\hat{\alpha}_{\text{MoM}}, \hat{\alpha}_{\text{MLE}}) \;=\; \frac{1}{2(\alpha + 1)\bigl(\alpha\,\psi'(\alpha) - 1\bigr)}.$$

The MoM asymptotic variance comes from Theorem 4 with the Gamma-specific $\nabla g$ and $\Sigma$ from the proof of Theorem 2. The MLE asymptotic variance comes from inverting the $2 \times 2$ Fisher information matrix at $(\alpha, \beta)$ and reading off the $(1, 1)$ entry. The ratio is what we plot.

Example 13 Gamma shape ARE ≈ 0.68 at α = 3

At $\alpha = 3$: the trigamma function evaluates to $\psi'(3) = \pi^2/6 - 1 - 1/4 \approx 0.394934$, so $\alpha\,\psi'(\alpha) - 1 \approx 0.184802$. Then

$$\operatorname{ARE} \;=\; \frac{1}{2 \cdot 4 \cdot 0.184802} \;\approx\; 0.676.$$

The Monte Carlo verification at $n = 200$ over 1000 replications gives an empirical $\operatorname{Var}(\hat{\alpha}_{\text{MLE}}) / \operatorname{Var}(\hat{\alpha}_{\text{MoM}}) \approx 0.69$ — consistent with the theoretical 0.676 to within Monte-Carlo noise. So the MLE extracts about $1/0.68 \approx 1.47$ times as much information from each observation as MoM does — equivalently, MoM needs 47% more data to match MLE precision.
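Both the theoretical curve and the Monte-Carlo point take only a few lines with scipy's polygamma (trigamma is polygamma of order 1). A sketch reusing the profile-score MLE of §15.4; bracket widths are heuristic:

```python
import numpy as np
from scipy.special import digamma, polygamma
from scipy.optimize import brentq

def are_theory(alpha):
    # Theorem 6: 1 / [2 (alpha + 1) (alpha * psi'(alpha) - 1)]
    return 1.0 / (2.0 * (alpha + 1.0) * (alpha * polygamma(1, alpha) - 1.0))

def alpha_mle(x):
    xbar, lbar = x.mean(), np.log(x).mean()
    a0 = xbar**2 / x.var()                     # MoM warm start
    return brentq(lambda a: np.log(a / xbar) - digamma(a) + lbar, 0.2 * a0, 5.0 * a0)

rng = np.random.default_rng(7)
alpha0, beta0, n, reps = 3.0, 2.0, 200, 1000
mom, mle = np.empty(reps), np.empty(reps)
for r in range(reps):
    x = rng.gamma(alpha0, 1.0 / beta0, size=n)
    mom[r] = x.mean()**2 / x.var()
    mle[r] = alpha_mle(x)

print("theoretical ARE:", are_theory(alpha0))                 # ~ 0.676
print("empirical Var(MLE)/Var(MoM):", mle.var() / mom.var())  # ~ 0.68
```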

Example 14 Exponential rate: ARE = 1 exactly

The Exponential family has $T(x) = x$, so by Theorem 5 the MLE $\hat{\lambda}_{\text{MLE}} = 1/\bar{X}_n$ coincides with the MoM. The asymptotic variance for both is $\lambda^2$ (per Example 9), and $\operatorname{ARE}(\hat{\lambda}_{\text{MoM}}, \hat{\lambda}_{\text{MLE}}) = 1$ exactly. This is not approximate equality — the two estimators are the same function of the data. They cannot have different asymptotic variances.

Remark 7 Why MoM is still taught and used

Three reasons, in increasing order of importance.

(1) Closed form. When the moment map inverts in elementary functions — Gamma, Beta, Weibull (numerically), Pareto with $\alpha > 1$ — MoM gives an estimate in two lines of arithmetic. No iteration, no convergence diagnostics, no risk of missing a local maximum.

(2) Warm-start for MLE. The next section makes this precise. When MLE requires Newton-Raphson, the MoM estimate is almost always a good starting point: it is $\sqrt{n}$-consistent (so it converges to $\theta_0$ at the same rate as the MLE), it is computable, and it lies inside the parameter space whenever the data are not pathological. Newton from MoM converges in 3–5 iterations at typical $n$; Newton from an arbitrary start can take 8–12 or fail to converge at all.

(3) Diagnostic. When MoM and MLE disagree wildly, the model is suspect. The two recipes use different summaries of the data — sample moments for MoM, the full likelihood for MLE — and a large discrepancy means the model is not capturing the moment structure consistently with the likelihood structure. This is a free goodness-of-fit signal, available without additional computation.

15.8 MoM as a Warm-Start for MLE

Topic 14 §14.7 introduced Newton-Raphson for the Gamma shape MLE. It mentioned, briefly, that “a standard initial guess is the method-of-moments estimate” — that’s the warm-start pattern. Here we make it first-class.

Remark 8 Conceptual: cheap first pass, expensive refinement

The MLE is asymptotically efficient. The MoM is $\sqrt{n}$-consistent. The two estimates converge to the same true $\theta_0$ at the same $\sqrt{n}$ rate, and the difference between them is $O_p(1/\sqrt{n})$. So the MoM estimate places us, with high probability, within an $O(1/\sqrt{n})$ neighborhood of the MLE — close enough that a handful of Newton steps land at the MLE to machine precision. This is the algorithmic reason MoM matters even when we have decided to fit by maximum likelihood.

The conceptual analogue in optimization is multi-resolution: solve a cheap approximate problem to get a good initial point, then refine with the expensive exact procedure. MoM-warm-started MLE is exactly this pattern at the level of statistical estimation.

Two panels. Left: the Gamma shape log-likelihood ℓ(α) on a horizontal axis with three Newton-Raphson trajectories marked. The MoM init starts inside the visible neighborhood of the maximum and arrives in 4 steps. The α₀ = 0.5 init starts in the left tail and takes 9 steps. The α₀ = 2 × MoM init starts in the right tail and takes 10 steps. Right: convergence plot |α_t − α̂_MLE| vs iteration t on a log-y scale, with the three trajectories overlaid showing the MoM path is always lowest.
Example 15 Gamma shape MLE: MoM init takes 4 Newton steps

For a $\text{Gamma}(\alpha = 3, \beta = 2)$ sample of size $n = 100$, the MoM estimate of the shape is typically $\hat{\alpha}_{\text{MoM}} \approx 3.0 \pm 0.3$ on a single replicate. Starting Newton-Raphson on the score equation $\log\beta - \psi(\alpha) + \overline{\log X} = 0$ from this initial value, convergence to machine precision (step size $< 10^{-8}$) requires 4 iterations in the typical case. Starting from $\alpha_0 = 0.5$ takes 9 iterations — Newton has to traverse the steep left tail of the score function. Starting from $\alpha_0 = 2 \times \hat{\alpha}_{\text{MoM}} \approx 6$ takes 10 iterations, with the first step often overshooting. The pattern holds across sample sizes from $n = 50$ to $n = 1000$: MoM-warm-started Newton converges in 3–5 steps; arbitrary-start Newton needs 8–12.
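A sketch of the experiment, with a damping guard in case an aggressive start overshoots past zero (the guard and the specific inits are illustrative choices):

```python
import numpy as np
from scipy.special import digamma, polygamma

def newton_steps(alpha_init, xbar, logx_bar, tol=1e-8, max_iter=50):
    """Newton-Raphson on the Gamma profile score; returns (alpha_hat, iterations)."""
    a = alpha_init
    for it in range(1, max_iter + 1):
        f = np.log(a / xbar) - digamma(a) + logx_bar     # profile score
        fprime = 1.0 / a - polygamma(1, a)               # strictly negative: unique root
        a_new = a - f / fprime
        if a_new <= 0:                                   # aggressive start overshot: damp
            a_new = a / 2.0
        if abs(a_new - a) < tol:
            return a_new, it
        a = a_new
    return a, max_iter

rng = np.random.default_rng(8)
x = rng.gamma(3.0, 1.0 / 2.0, size=100)
xbar, logx_bar = x.mean(), np.log(x).mean()
a_mom = xbar**2 / x.var()                                # MoM warm start

for label, a0 in [("MoM init", a_mom), ("small init", 0.5), ("large init", 2 * a_mom)]:
    print(label, newton_steps(a0, xbar, logx_bar))
```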

Remark 9 Pattern in modern software

This is not a textbook idea waiting to be implemented. It is the dominant pattern in the actual codebases that fit distributions to data.

  • scipy.stats.fit: when a parametric family is specified, scipy first computes a MoM estimate (it calls this “method of moments” internally), then refines via L-BFGS or a Nelder–Mead simplex on the negative log-likelihood. MoM is the explicit default initial point.
  • R’s MASS::fitdistr: the same pattern. MoM init, then optim() for the MLE.
  • Statsmodels (Python): iteratively reweighted least squares for GLMs, with the OLS estimate (a MoM-like start) as the initial guess.

The dominant alternative — searching for a starting point by random restart or by grid search — is much slower and rarely necessary when MoM is available. The exceptions are models where MoM is undefined (mixtures with non-identifiable components, latent-variable models) or pathological (Cauchy). For everything else, MoM is the warm-start.

Interactive: MoM warm-start vs arbitrary inits
Newton-Raphson on the Gamma shape MLE from three starting points. The MoM init lands close to the maximum and converges in 3–5 steps; arbitrary starts wander before settling.

15.9 M-Estimation: The Unifying Framework

Both MoM and MLE produce an estimator by setting an empirical sum to zero. For MLE, the sum is the score: $\sum_i s(\theta; X_i) = 0$, where $s(\theta; x) = \partial_\theta \log f(x; \theta)$. For MoM, the sum is a moment residual: $\sum_i \bigl(X_i^k - \mu_k(\theta)\bigr) = 0$. The shared structure is

$$\Psi_n(\theta) \;=\; \sum_{i=1}^n \psi(X_i; \theta) \;=\; 0$$

for some choice of estimating function $\psi$. The framework that takes this structure as primary is called M-estimation, and it gives a single asymptotic theory for all such estimators — including MoM, MLE, OLS, the Huber location estimator, the median, and quantile regression.

Definition 6 M-estimator

Let $X_1, \ldots, X_n$ be iid from a distribution $P$ with parameter $\theta \in \Theta$. Given an estimating function $\psi: \mathcal{X} \times \Theta \to \mathbb{R}^k$ satisfying the moment condition $\mathbb{E}_{\theta_0}[\psi(X; \theta_0)] = 0$ at the true parameter, the M-estimator $\hat{\theta}_n$ is any solution of the estimating equation

$$\Psi_n(\hat{\theta}_n) \;=\; \sum_{i=1}^n \psi(X_i; \hat{\theta}_n) \;=\; 0.$$

Existence and uniqueness require regularity on $\psi$ (e.g., monotonicity in $\theta$, or strict convexity of an associated loss); see van der Vaart, Chapter 5, for the formal conditions.

Definition 7 Z-estimator and the loss-function form

When $\psi(x; \theta) = \nabla_\theta \rho(x; \theta)$ for some scalar loss function $\rho: \mathcal{X} \times \Theta \to \mathbb{R}$, the M-estimator is equivalently the minimizer of the empirical risk:

$$\hat{\theta}_n \;=\; \arg\min_{\theta \in \Theta} \sum_{i=1}^n \rho(X_i; \theta).$$

In this gradient form, the estimating equation $\nabla_\theta \sum_i \rho = 0$ is the first-order condition for the minimization. The label Z-estimator (for "zero" of the estimating equation) is sometimes reserved for the general form even when no underlying loss exists; M-estimator (for "minimum" of an empirical risk) emphasizes the loss-function origin.

A unified diagram organized as a table. Rows enumerate canonical ψ-functions: (1) ψ(x; θ) = ∂ log f / ∂θ → MLE, (2) ψ_k(x; θ) = x^k − μ_k(θ) → MoM, (3) ψ_H(x; θ) = min(max(x − θ, −k), k) → Huber, (4) ψ(x, y; β) = x(y − xᵀβ) → OLS. The right column displays the shape of each ψ as a small graph.
Theorem 7 MLE is an M-estimator with ψ = score

For an iid sample from a parametric family $\{f(x; \theta) : \theta \in \Theta\}$, the maximum-likelihood estimator $\hat{\theta}_{\text{MLE}}$ is the M-estimator with estimating function

$$\psi(x; \theta) \;=\; s(\theta; x) \;=\; \frac{\partial}{\partial \theta} \log f(x; \theta).$$

The moment condition $\mathbb{E}_{\theta_0}[\psi(X; \theta_0)] = 0$ holds because the population score has mean zero (Theorem 13.8). The associated loss is the negative log-density: $\rho(x; \theta) = -\log f(x; \theta)$, with $\hat{\theta}_{\text{MLE}}$ the minimizer of the negative log-likelihood.

Theorem 8 MoM is an M-estimator with ψ = moment residual

For a $k$-parameter model with first $k$ population moments $\mu_j(\theta)$, the method-of-moments estimator $\hat{\theta}_{\text{MoM}}$ is the M-estimator with vector-valued estimating function

$$\psi(x; \theta) \;=\; \bigl(x - \mu_1(\theta),\, x^2 - \mu_2(\theta),\, \ldots,\, x^k - \mu_k(\theta)\bigr).$$

The moment condition $\mathbb{E}_{\theta_0}[\psi(X; \theta_0)] = (\mu_1 - \mu_1, \ldots, \mu_k - \mu_k) = \mathbf{0}$ holds at the true parameter by definition. There is no associated loss function in general — MoM is a Z-estimator, not an M-estimator in the strict loss-minimization sense.

Remark 10 OLS is an M-estimator

For a regression model $Y_i = X_i^\top \beta + \varepsilon_i$ with $\mathbb{E}[\varepsilon \mid X] = 0$, the ordinary least-squares estimator $\hat{\beta}_{\text{OLS}}$ is the M-estimator with

$$\psi(x, y; \beta) \;=\; x \cdot (y - x^\top \beta).$$

The moment condition $\mathbb{E}[X(Y - X^\top \beta_0)] = \mathbb{E}[X \varepsilon] = 0$ holds by the regression assumption. The loss is $\rho(x, y; \beta) = \tfrac{1}{2}(y - x^\top \beta)^2$. We develop OLS in detail in Linear Regression; here it suffices to note that OLS — the MLE for a Normal-error linear model — falls into the M-estimation framework.

The list extends. Quantile regression: $\psi(x, y; \theta) = \bigl(\tau - \mathbb{1}\{y \le x^\top\theta\}\bigr) \cdot x$. Generalized linear models: $\psi$ is the score of the GLM likelihood. Robust regression: $\psi$ is a Huber or Tukey choice. The unifying observation is that the estimator is determined by the choice of $\psi$, and the asymptotic theory is the same in every case — the sandwich variance of §15.10. A minimal worked instance follows.
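As a concrete instance, here is a minimal sketch of the Huber location M-estimator: $\psi_H$ is monotone decreasing in the location, so the estimating equation can be solved by bracketing ($k = 1.345$ is the customary tuning constant; the contaminated sample is an arbitrary illustration):

```python
import numpy as np
from scipy.optimize import brentq

def huber_location(x, k=1.345):
    """M-estimator of location with Huber psi: psi(x; t) = clip(x - t, -k, k)."""
    psi_sum = lambda t: np.clip(x - t, -k, k).sum()
    return brentq(psi_sum, x.min(), x.max())   # psi_sum is decreasing in t

rng = np.random.default_rng(9)
x = np.concatenate([rng.normal(0.0, 1.0, 95), rng.normal(25.0, 1.0, 5)])  # 5% outliers

print("mean:  ", x.mean())           # dragged toward the outliers
print("huber: ", huber_location(x))  # stays near 0
print("median:", np.median(x))
```

The design choice is the monotone $\psi$: it guarantees a unique root, so a bracketing solve is enough and no loss function ever needs to be written down.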

15.10 Asymptotic Theory of M-Estimators: The Sandwich Variance

The single theorem in this section is, structurally, the most consequential in the topic. It generalizes the asymptotic-normality result for the MLE (Theorem 14.4) to every M-estimator, and it produces an asymptotic variance — the sandwich form $A^{-1} B A^{-\top}$ — that reduces to the Fisher-information inverse $I(\theta)^{-1}$ in the MLE case but differs from it whenever the model is misspecified.

Definition 8 Sensitivity matrix A and variability matrix B

For an M-estimator with smooth estimating function $\psi(x; \theta)$, define

$$A(\theta) \;=\; -\mathbb{E}_\theta\bigl[\nabla_\theta \psi(X; \theta)\bigr],$$

the sensitivity matrix — how strongly the population estimating function responds to changes in $\theta$. And define

$$B(\theta) \;=\; \mathbb{E}_\theta\bigl[\psi(X; \theta)\, \psi(X; \theta)^\top\bigr],$$

the variability matrix — the covariance of $\psi$ at $\theta_0$ (using the moment condition $\mathbb{E}[\psi] = 0$ to drop the mean). When $\theta$ is scalar, both $A$ and $B$ are scalars; for vector $\theta$, they are $k \times k$ matrices.

Sign conventions vary across references. We adopt the convention that $A$ is positive definite in the regular case (so the negative sign appears explicitly). The alternative convention $A = \mathbb{E}[\nabla \psi]$ flips the sign and treats $A$ as negative definite.

Theorem 9 M-estimator consistency

Under suitable identifiability and regularity conditions on $\psi$ (e.g., $\theta \mapsto \mathbb{E}_{\theta_0}[\psi(X; \theta)]$ is zero uniquely at $\theta = \theta_0$ and continuous in a neighborhood, plus uniform convergence of $\Psi_n/n$ to its expectation), the M-estimator satisfies

$$\hat{\theta}_n \xrightarrow{P} \theta_0 \qquad \text{as } n \to \infty.$$

The argument parallels the MLE consistency proof (Theorem 14.3): uniform LLN on $\Psi_n/n$, plus the unique-zero hypothesis, plus a continuity argument.

Two panels. Left: a decomposition diagram of the matrix product A⁻¹ B A⁻¹ᵀ shown as three boxed factors with annotations: 'A measures sensitivity of ψ to θ', 'B measures variability of ψ at θ₀', 'Sandwich = robust variance.' Right: the MLE special case shown as A = B = I(θ), with the sandwich collapsing to I⁻¹ — the ordinary Fisher-info variance recovered.
Theorem 10 M-estimator asymptotic normality — the sandwich variance

Under the conditions of Theorem 9 plus continuous differentiability of $\psi$ in $\theta$ near $\theta_0$, finite second moments of $\psi$, and invertibility of $A(\theta_0)$,

$$\sqrt{n}\,(\hat{\theta}_n - \theta_0) \xrightarrow{d} \mathcal{N}\bigl(\mathbf{0},\, V(\theta_0)\bigr),$$

where the asymptotic variance is the sandwich form

$$V(\theta_0) \;=\; A(\theta_0)^{-1}\, B(\theta_0)\, A(\theta_0)^{-\top}.$$

For scalar $\theta$ this reduces to $V = B/A^2$.

When $\psi$ is the score function of a correctly specified parametric model, $A = B = I(\theta_0)$ (information equality, Theorem 13.8), and the sandwich collapses to $V = I(\theta_0)^{-1}$, recovering the MLE asymptotic variance of Theorem 14.4 as a corollary.

Proof.

This is the core asymptotic theorem of M-estimation. The argument is a Taylor expansion of the estimating equation around θ0\theta_0, followed by a one-step rearrangement and an application of the LLN, the CLT, and Slutsky’s theorem.

Step 1 — Taylor expand the estimating equation. By assumption $\hat{\theta}_n \xrightarrow{P} \theta_0$ (Theorem 9), so for large $n$, $\hat{\theta}_n$ lies in a neighborhood of $\theta_0$ where $\psi$ is differentiable. Taylor-expanding $\Psi_n$ around $\theta_0$:

$$\mathbf{0} \;=\; \Psi_n(\hat{\theta}_n) \;=\; \Psi_n(\theta_0) + \nabla\Psi_n(\tilde{\theta}_n) \cdot (\hat{\theta}_n - \theta_0),$$

for some $\tilde{\theta}_n$ on the line segment between $\hat{\theta}_n$ and $\theta_0$ (a multivariate mean-value form).

Step 2 — solve for the deviation. Assuming $\nabla\Psi_n(\tilde{\theta}_n)$ is invertible for large $n$ (which it is, with probability tending to one, by the invertibility assumption on $A$ and the consistency $\tilde{\theta}_n \xrightarrow{P} \theta_0$),

$$\hat{\theta}_n - \theta_0 \;=\; -\bigl[\nabla\Psi_n(\tilde{\theta}_n)\bigr]^{-1} \Psi_n(\theta_0).$$

Multiplying both sides by $\sqrt{n}$:

$$\sqrt{n}\,(\hat{\theta}_n - \theta_0) \;=\; -\bigl[\tfrac{1}{n}\nabla\Psi_n(\tilde{\theta}_n)\bigr]^{-1} \cdot \tfrac{1}{\sqrt{n}}\Psi_n(\theta_0).$$

We will compute the limit of each factor separately.

Step 3 — the denominator. By the LLN applied to the iid $\nabla\psi(X_i; \theta_0)$, plus uniform convergence in a neighborhood and the consistency $\tilde{\theta}_n \xrightarrow{P} \theta_0$,

$$\tfrac{1}{n} \nabla\Psi_n(\tilde{\theta}_n) \;=\; \tfrac{1}{n} \sum_{i=1}^n \nabla_\theta \psi(X_i; \tilde{\theta}_n) \xrightarrow{P} \mathbb{E}_{\theta_0}[\nabla_\theta \psi(X; \theta_0)] \;=\; -A(\theta_0).$$

Step 4 — the numerator. The iid random vectors $\psi(X_i; \theta_0)$ have mean zero (the moment condition) and finite covariance $B(\theta_0)$ (by the second-moment assumption). By the multivariate CLT (Topic 11 §11.8),

$$\tfrac{1}{\sqrt{n}}\Psi_n(\theta_0) \;=\; \tfrac{1}{\sqrt{n}} \sum_{i=1}^n \psi(X_i; \theta_0) \xrightarrow{d} \mathcal{N}(\mathbf{0}, B(\theta_0)).$$

Step 5 — combine via Slutsky. Slutsky's theorem (Topic 11 §11.4) lets us multiply the convergent-in-probability denominator by the convergent-in-distribution numerator:

$$\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} -\bigl(-A(\theta_0)\bigr)^{-1} \cdot \mathcal{N}(\mathbf{0}, B(\theta_0)) \;=\; A(\theta_0)^{-1} \cdot \mathcal{N}(\mathbf{0}, B(\theta_0)).$$

A linear transformation of a Normal vector is Normal with covariance $A^{-1} B A^{-\top}$:

$$\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} \mathcal{N}\bigl(\mathbf{0},\, A(\theta_0)^{-1} B(\theta_0) A(\theta_0)^{-\top}\bigr).$$

Step 6 — MLE collapse. When ψ(x;θ)=s(θ;x)\psi(x; \theta) = s(\theta; x) is the score of a correctly specified model, the information equality (Theorem 13.8) states

Eθ0[θs(θ0;X)]  =  Eθ0[s(θ0;X)s(θ0;X)]  =  I(θ0),-\mathbb{E}_{\theta_0}[\nabla_\theta s(\theta_0; X)] \;=\; \mathbb{E}_{\theta_0}[s(\theta_0; X) s(\theta_0; X)^\top] \;=\; I(\theta_0),

so A(θ0)=B(θ0)=I(θ0)A(\theta_0) = B(\theta_0) = I(\theta_0), and the sandwich variance collapses:

V(θ0)  =  I(θ0)1I(θ0)I(θ0)1  =  I(θ0)1.V(\theta_0) \;=\; I(\theta_0)^{-1} I(\theta_0) I(\theta_0)^{-1} \;=\; I(\theta_0)^{-1}.

This is exactly the MLE asymptotic variance from Theorem 14.4: the MLE asymptotic-normality theorem is the special case $A = B$ of the M-estimator asymptotic-normality theorem. The ingredients were the multivariate CLT (§11.8), the LLN (Topic 10), Slutsky (§11.4), and the information equality (Theorem 13.8). ∎

Example 16 MLE recovery: A = B = I, sandwich = I⁻¹

For any correctly specified parametric family with score s(θ;x)s(\theta; x), the M-estimator with ψ=s\psi = s has A(θ0)=B(θ0)=I(θ0)A(\theta_0) = B(\theta_0) = I(\theta_0) by the information equality. The sandwich variance is V=I1II1=I1V = I^{-1} I I^{-1} = I^{-1}. Theorem 14.4 is recovered as a corollary of Theorem 10. The Fisher-information inverse is the correct-model asymptotic variance; the sandwich is the general one.

Example 17 Exponential MoM as M-estimator: A, B, and the sandwich agree with Theorem 4

For Exponential(λ)(\lambda) data, MoM uses ψ(x;λ)=x1/λ\psi(x; \lambda) = x - 1/\lambda. Compute:

A(λ)  =  E[λψ(X;λ)]  =  E[1/λ2]  =  1/λ2.A(\lambda) \;=\; -\mathbb{E}\bigl[\partial_\lambda \psi(X; \lambda)\bigr] \;=\; -\mathbb{E}[1/\lambda^2] \;=\; -1/\lambda^2.

(With the sign convention from Definition 8: $A < 0$ here is the sign flip flagged in the convention warning, and it is harmless, since only $A^2$ enters the scalar sandwich.) And

B(λ)  =  E[ψ(X;λ)2]  =  Var(X)  =  1/λ2.B(\lambda) \;=\; \mathbb{E}[\psi(X; \lambda)^2] \;=\; \operatorname{Var}(X) \;=\; 1/\lambda^2.

Sandwich: V=B/A2=(1/λ2)/(1/λ4)=λ2V = B/A^2 = (1/\lambda^2) / (1/\lambda^4) = \lambda^2. This matches the asymptotic variance derived in Example 9 via the delta method. The two routes — applying Theorem 4 (delta method) and applying Theorem 10 (sandwich) — agree, as they must.

A demonstration of the misspecified case is more striking. Suppose we generate data from Gamma(3,2)\text{Gamma}(3, 2) but fit it as an Exponential, treating ψ(x;λ)=x1/λ\psi(x; \lambda) = x - 1/\lambda as the estimating equation. The MoM estimate λ^=1/Xˉn\hat{\lambda} = 1/\bar{X}_n converges to λ=1/EGamma[X]=2/3\lambda^* = 1/\mathbb{E}_{\text{Gamma}}[X] = 2/3, not to any “true” Exponential rate (there is none). The model-based asymptotic variance would be λ2/n=(4/9)/n0.00222\lambda^{*2}/n = (4/9)/n \approx 0.00222 at n=200n = 200. But the empirical variance over Monte Carlo replicates is 0.00073\approx 0.00073 — three times smaller than the model-based guess, because the Gamma distribution has lower variance than the Exponential at the same mean. The sandwich estimate, which uses the empirical AA and BB computed from the actual Gamma data, is 0.00074\approx 0.00074, matching the empirical to three significant figures. This is the practical significance of ABA \neq B: it is the signature of model misspecification, and the sandwich formula corrects for it.
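A minimal numpy sketch of this experiment; the seed, variable names, and replication count are our choices, not the text's. Note that numpy parametrizes the Gamma by shape and scale, so rate 2 means scale 1/2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 5000

# Empirical variance of lam_hat = 1/mean(X) over Monte Carlo replicates,
# with data from Gamma(shape=3, rate=2), i.e. scale = 1/2 in numpy terms.
est = np.array([1.0 / rng.gamma(3.0, scale=0.5, size=n).mean()
                for _ in range(reps)])
print(f"empirical    {est.var():.5f}")          # ~0.00073

# Sandwich from a single sample: A = -mean(d psi/d lam), B = mean(psi^2).
x = rng.gamma(3.0, scale=0.5, size=n)
lam_hat = 1.0 / x.mean()
A = -1.0 / lam_hat**2                           # d/dlam (x - 1/lam) = 1/lam^2
B = np.mean((x - 1.0 / lam_hat) ** 2)           # Gamma variance, not 1/lam^2
print(f"sandwich     {B / (A**2 * n):.5f}")     # ~0.00074
print(f"model-based  {lam_hat**2 / n:.5f}")     # ~0.00222, three times too large
```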

Remark 11 Misspecification signature: A ≠ B

The information equality $A = B = I$ is a theorem under correct specification — and only under correct specification. When the model is wrong, $A$ and $B$ generally differ, and the sandwich variance gives the asymptotic variance of the M-estimator under the true data-generating distribution rather than under the postulated (misspecified) model.

This has a name in econometrics: robust standard errors, or White’s heteroskedasticity-consistent standard errors for OLS — exactly the sandwich formula applied to OLS under heteroskedasticity. The model-based standard errors assume homoskedasticity (and hence A=BA = B); the sandwich corrects for the violation. Linear Regression develops this fully. The M-estimation framework makes the correction transparent: it is not a separate “robust” trick; it is what happens when you take the asymptotic theory seriously.
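For concreteness, here is a sketch of the sandwich applied to OLS on synthetic heteroskedastic data (the HC0 flavor). The data-generating process and all names are ours; this is an illustration, not the Linear Regression topic's development.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(0, 2, size=n)
X = np.column_stack([np.ones(n), x])                  # intercept + slope design
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 * (1 + x))   # noise variance grows with x

beta = np.linalg.solve(X.T @ X, X.T @ y)              # OLS: psi = x (y - x'beta)
e = y - X @ beta

A = X.T @ X / n                                       # -E[grad psi] = E[x x']
B = (X * e[:, None]).T @ (X * e[:, None]) / n         # E[psi psi'] = E[e^2 x x']
V_robust = np.linalg.inv(A) @ B @ np.linalg.inv(A) / n   # sandwich / n  (= HC0)
V_model = (e @ e / (n - 2)) * np.linalg.inv(X.T @ X)  # assumes homoskedasticity

print("robust SE:", np.sqrt(np.diag(V_robust)))
print("model  SE:", np.sqrt(np.diag(V_model)))        # typically too small for the slope
```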

15.11 Robust M-Estimation: Huber and Tukey

The estimating function $\psi$ in M-estimation is a design choice: pick it to balance efficiency under the postulated model against robustness to deviations from that model. The MLE choice ($\psi =$ score) is one extreme: maximally efficient when the model is correct, but fragile under contamination. Two alternatives, both due to mid-20th-century work in robust statistics, give controlled trade-offs.

Definition 9 Huber's ψ-function

Huber’s ψ-function with tuning constant k>0k > 0 is

ψH(u;k)  =  {uif uk,ksign(u)if u>k.\psi_H(u; k) \;=\; \begin{cases} u & \text{if } |u| \le k, \\ k \cdot \operatorname{sign}(u) & \text{if } |u| > k. \end{cases}

The associated loss function is $\rho_H(u; k) = \tfrac{1}{2}u^2$ for $|u| \le k$, and $\rho_H(u; k) = k|u| - \tfrac{1}{2}k^2$ for $|u| > k$ — quadratic in the central region, linear in the tails. The Huber location estimator solves $\sum_i \psi_H((X_i - \hat{\theta})/\hat{\sigma}) = 0$ in $\hat{\theta}$, with $\hat{\sigma}$ a robust scale estimate (typically the MAD, $\hat{\sigma} = \operatorname{med}(|X_i - \operatorname{med}(X)|) / 0.6745$).

The default choice k=1.345k = 1.345 gives 95% asymptotic efficiency under Normal data while bounding the influence of any single observation.
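A minimal IRLS sketch of the Huber location with MAD scale. Iteratively reweighted least squares is one standard computational route; the function name, tolerance, and seed below are our choices.

```python
import numpy as np

def huber_location(x, k=1.345, tol=1e-8, max_iter=100):
    """Huber location via iteratively reweighted least squares (IRLS)."""
    sigma = np.median(np.abs(x - np.median(x))) / 0.6745   # MAD scale
    mu = np.median(x)                                      # robust starting point
    for _ in range(max_iter):
        u = np.abs(x - mu) / sigma
        w = np.where(u <= k, 1.0, k / np.maximum(u, 1e-12))  # w(u) = psi_H(u)/u
        mu_new = np.sum(w * x) / np.sum(w)
        if abs(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu

# On a contaminated sample in the spirit of Example 18 below:
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(5, 1, 180), rng.normal(25, 1, 20)])
print(x.mean(), np.median(x), huber_location(x))  # mean ~7; median and Huber near 5
```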

Definition 10 Tukey's biweight ψ-function

Tukey’s biweight ψ-function (also called the bisquare) with tuning constant c>0c > 0 is

ψT(u;c)  =  {u(1(u/c)2)2if uc,0if u>c.\psi_T(u; c) \;=\; \begin{cases} u\,\bigl(1 - (u/c)^2\bigr)^2 & \text{if } |u| \le c, \\ 0 & \text{if } |u| > c. \end{cases}

The associated loss is bounded:

ρT(u;c)  =  {c26(1(1(u/c)2)3)if uc,c2/6if u>c.\rho_T(u; c) \;=\; \begin{cases} \tfrac{c^2}{6}\bigl(1 - (1 - (u/c)^2)^3\bigr) & \text{if } |u| \le c, \\ c^2/6 & \text{if } |u| > c. \end{cases}

Tukey’s ψ\psi is redescending — it rises in the central region, peaks, and returns to zero outside u>c|u| > c. Observations beyond cc standardized residuals get weight exactly zero and are entirely ignored. The default c=4.685c = 4.685 gives 95% asymptotic efficiency under Normal data, with full rejection of outliers beyond 4.685σ4.685\sigma.

Three panels. Left: ψ_H(u; k = 1.345) — linear with slope 1 in [−1.345, 1.345], then flat at ±1.345 in the tails. Center: ψ_T(u; c = 4.685) — bell-shaped on [−4.685, 4.685], rising from 0 at u = 0, peaking near |u| = c/√5, then descending to 0 at u = ±c, and identically zero beyond. Right: estimator variance versus contamination fraction for the sample mean (rises sharply), the median (flat), Huber (flat through ~10%, then rising), and Tukey (flat well past 10%).
Theorem 11 Minimax property of Huber (Huber 1964)

Among all M-estimators of location with bounded influence function, the Huber estimator with tuning constant kk chosen to match a specified Normal-efficiency target is minimax: it minimizes the maximum asymptotic variance over the class of ε\varepsilon-contaminated Normal distributions

Fε  =  {(1ε)N(0,1)+εG:G symmetric around 0}\mathcal{F}_\varepsilon \;=\; \bigl\{(1 - \varepsilon)\,\mathcal{N}(0, 1) + \varepsilon \cdot G : G \text{ symmetric around 0}\bigr\}

for a corresponding ε\varepsilon. The choice k=1.345k = 1.345 corresponds to roughly ε=0.05\varepsilon = 0.05 contamination tolerance and 95% efficiency under pure Normal.

The proof, due to Huber (1964), is a saddle-point argument: the least favorable distribution in $\mathcal{F}_\varepsilon$ is Normal in the center with exponential (Laplace) tails, and the optimal estimator against that least favorable case is the Huber location with tuning $k$ depending on $\varepsilon$. This was the founding result of robust statistics — it gave a principled recipe for outlier-resistant estimators, replacing earlier ad-hoc procedures like trimmed means and Winsorization.

Example 18 Huber location for contaminated Normal

Generate $n = 200$ samples: 180 from $\mathcal{N}(5, 1)$ and 20 from $\mathcal{N}(25, 1)$ (that is, 10% contamination at $+20\sigma$). The sample mean is $\hat{\mu}_{\text{mean}} \approx 7.0$ — pulled away from the true location 5 by the outliers. The median is $\hat{\mu}_{\text{median}} \approx 5.0$ — unmoved. The Huber location with $k = 1.345$ and MAD scale gives $\hat{\mu}_{\text{Huber}} \approx 5.0$ as well — also essentially unmoved, yet with smaller variance than the median on clean data: when contamination is zero, almost no observations are clipped and the Huber estimate nearly matches the mean.

Example 19 Tukey biweight: extreme outliers get weight zero

Tukey’s redescending $\psi$ goes one step further. Beyond $|u| > c = 4.685$, $\psi_T = 0$, so the corresponding observations make no contribution to the estimating equation — they are effectively discarded. On the same contaminated sample as Example 18, Tukey gives $\hat{\mu}_{\text{Tukey}} \approx 5.0$, matching the median’s resistance while keeping somewhat better efficiency at the clean center, because the central kernel is smooth where the median’s is kinked.

The cost of the redescending shape is non-convexity: $\Psi_n(\theta) = \sum_i \psi_T((X_i - \theta)/\hat\sigma)$ is not monotone in $\theta$, so the iteratively reweighted fixed-point iteration can converge to a local optimum. The standard fix is to start from the median (which is consistent and globally well-defined), then refine to the Tukey estimate, as in the sketch below.
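A sketch of that median-start recipe for the biweight (names and tolerances ours); the IRLS weight is $\psi_T(u)/u = (1 - (u/c)^2)^2$, which vanishes past $c$.

```python
import numpy as np

def tukey_location(x, c=4.685, tol=1e-8, max_iter=200):
    """Tukey biweight location: IRLS from the median; weights are zero past c."""
    sigma = np.median(np.abs(x - np.median(x))) / 0.6745   # MAD scale
    mu = np.median(x)                                      # consistent warm start
    for _ in range(max_iter):
        u = (x - mu) / sigma
        w = np.where(np.abs(u) <= c, (1 - (u / c) ** 2) ** 2, 0.0)
        mu_new = np.sum(w * x) / np.sum(w)
        if abs(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(5, 1, 180), rng.normal(25, 1, 20)])
print(tukey_location(x))   # ~5.0: the 20 outliers receive weight exactly zero
```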

Definition 11 Breakdown point

The (asymptotic) breakdown point of an estimator is the largest fraction ε\varepsilon^* of the sample that can be replaced by arbitrary outliers without making the estimate go to infinity. Formally,

ε(θ^,P)  =  sup{ε:supQFε(P)θ^(Q)<},\varepsilon^*(\hat{\theta}, P) \;=\; \sup\Bigl\{\varepsilon : \sup_{Q \in \mathcal{F}_\varepsilon(P)} |\hat{\theta}(Q)| < \infty\Bigr\},

where Fε(P)\mathcal{F}_\varepsilon(P) is the set of distributions within ε\varepsilon-contamination of PP. The maximum possible breakdown point is 1/21/2 — any larger ε\varepsilon allows the contaminated distribution to be the majority, in which case identifying the “true” parameter is impossible.

Horizontal bar chart of breakdown points: sample mean (0%), Huber (~5%), 10% trimmed mean (10%), 25% trimmed mean (25%), Tukey biweight (~45%), median (50%). Each bar carries a short caption; a dotted vertical line marks the theoretical maximum at 0.5.

The figure above uses the older convention of reporting Huber and Tukey’s minimax-contamination tolerances rather than their asymptotic breakdown points. With a robust scale like the MAD, both Huber and Tukey achieve the maximum breakdown point of 1/2 (Rousseeuw–Leroy 1987) — the same as the median. The 5% bar for Huber should be read as: under the minimax-optimal tuning, the estimator is designed to be near-efficient against an ε=0.05\varepsilon = 0.05 contamination class. Empirical drift past ~10–20% in finite samples is a separate, gentler phenomenon driven by corruption of the MAD itself.

Example 20 Breakdown points: mean (0), median (1/2), Huber (1/2 with MAD), Tukey (1/2)

The breakdown points for common location estimators are:

  • Sample mean. ε=0\varepsilon^* = 0. A single outlier at ++\infty sends the mean to ++\infty.
  • Sample median. $\varepsilon^* = 1/2$. Just under half the sample can be arbitrarily corrupted before the median can be dragged arbitrarily far; the median attains the maximum breakdown point.
  • Huber location at k=1.345k = 1.345, paired with a robust scale (MAD). ε=1/2\varepsilon^* = 1/2. Asymptotically, the Huber-with-MAD estimator achieves the maximum breakdown point — the same as the median (Rousseeuw–Leroy 1987, §1). The often-quoted figure ε0.05\varepsilon \approx 0.05 is a different quantity: the contamination level for which the default k=1.345k = 1.345 is minimax-optimal under Huber’s gross-error model. In practice the estimator drifts gradually as ε\varepsilon grows, well before the asymptotic 1/2 is reached, because at high contamination the MAD itself becomes corrupted.
  • Tukey biweight at c=4.685c = 4.685. ε=1/2\varepsilon^* = 1/2 (asymptotic). Redescending — once outliers are far enough out, they have zero weight, so increasing the fraction further does not move the estimate (until the redescending region itself becomes the majority of the data, at which point the estimator can have multiple local optima).
  • $\alpha$-trimmed mean. $\varepsilon^* = \alpha$. Trim the top and bottom $\alpha n$ observations and average the rest; the breakdown is exactly the trimming proportion. A short simulation contrasting these behaviors follows this list.
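The simulation reuses the `huber_location` and `tukey_location` sketches from §15.11; the outlier value $10^6$ and the contamination fractions are arbitrary choices of ours.

```python
import numpy as np
# Assumes huber_location and tukey_location from the sketches in §15.11.

rng = np.random.default_rng(4)
n = 200
clean = rng.normal(5, 1, n)

for frac in [0.00, 0.05, 0.10, 0.25, 0.40]:
    x = clean.copy()
    x[: int(frac * n)] = 1e6                 # replace a fraction with gross outliers
    print(f"eps={frac:.2f}  mean={x.mean():12.4g}  median={np.median(x):6.3f}  "
          f"huber={huber_location(x):6.3f}  tukey={tukey_location(x):6.3f}")
# The mean explodes immediately; the median and Tukey hold, and Huber drifts
# gradually as eps grows because the MAD itself degrades.
```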
Remark 12 Huber at k = 1.345: 95% efficiency, bounded influence

The default k=1.345k = 1.345 is not arbitrary. It is the value that makes the asymptotic variance of the Huber location, under pure N(0,1)\mathcal{N}(0, 1) data, exactly 1/0.951.05261/0.95 \approx 1.0526 times the asymptotic variance of the sample mean (which is 1 under unit-variance data) — that is, 95% efficiency under the Normal model. Bounded influence: the maximum value ψHk=1.345|\psi_H| \leq k = 1.345 caps the impact of any single observation. The combination — 95% efficiency on a clean sample plus bounded influence on contaminated samples — is the dominant default in modern robust regression. Tukey’s c=4.685c = 4.685 is the analogous value for the biweight, giving the same 95% efficiency under Normal.

Why 95% and not, say, 99%? The reasoning is empirical: pushing efficiency above 95% (larger $k$) makes the estimator behave nearly identically to the mean in the central region, surrendering most of the robustness. Pushing efficiency below 95% (smaller $k$) buys additional outlier resistance at a noticeable cost on clean data. The 95% point is roughly the elbow of this trade-off curve, as the numeric check below shows.
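Under $\mathcal{N}(0,1)$ the asymptotic variance of a location M-estimator is $\mathbb{E}[\psi^2]/(\mathbb{E}[\psi'])^2$, and for Huber’s $\psi$ both expectations have closed forms; a sketch (ours) using scipy:

```python
from scipy.stats import norm

def huber_efficiency(k):
    """Normal efficiency of the Huber location: 1 / V(k), V = E[psi^2] / E[psi']^2."""
    Phi, phi = norm.cdf(k), norm.pdf(k)
    E_psi_prime = 2 * Phi - 1                       # P(|Z| <= k)
    E_psi_sq = (2 * Phi - 1) - 2 * k * phi + 2 * k**2 * (1 - Phi)
    return E_psi_prime**2 / E_psi_sq                # the mean has V = 1 under N(0,1)

for k in [0.7, 1.0, 1.345, 2.0]:
    print(f"k = {k:<5}  efficiency = {huber_efficiency(k):.4f}")
# k = 1.345 prints 0.9500; smaller k trades Normal efficiency for robustness.
```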

Interactive: M-estimator gallery — ψ-functions, contamination, sandwich CIs. The shape of ψ determines the estimator: contaminate the data and watch each respond. The mean drifts, the median holds, Huber holds and then drifts as the MAD degrades, and Tukey holds further still.

15.12 Summary & Forward Look: GMM and Beyond

The method of moments matches sample moments to population moments and solves; the M-estimation framework generalizes by replacing the moment-matching equation with a more general estimating equation $\sum_i \psi(X_i; \theta) = \mathbf{0}$. The asymptotic theory is uniform across the framework: consistency from a uniform LLN, asymptotic normality from a Taylor expansion plus the CLT, and asymptotic variance in the sandwich form $A^{-1} B A^{-\top}$ that collapses to the Fisher-information inverse when $\psi$ is the score under correct specification. Choosing $\psi$ differently — Huber, Tukey, the score, the moment residual — produces a universe of estimators with controllable trade-offs between efficiency under the postulated model and robustness against deviations from it.

| Property | Method of Moments | Maximum Likelihood | General M-Estimator |
| --- | --- | --- | --- |
| Estimating equation | $\sum_i (X_i^k - \mu_k(\theta)) = 0$ | $\sum_i \nabla_\theta \log f(X_i; \theta) = 0$ | $\sum_i \psi(X_i; \theta) = 0$ |
| Closed form | Often (Gamma, Beta, Pareto with $\alpha > 1$) | Rare (only for exponential families with a closed-form inverse of $\nabla A$) | Depends on $\psi$ |
| Consistency | Yes, if $g$ is invertible and the $\mu_k$ exist | Yes, under standard regularity | Yes, under the Theorem 9 conditions |
| Asymptotic variance | $\nabla g^{-1}\, \Sigma\, \nabla g^{-\top}$ | $I(\theta_0)^{-1}$ | $A^{-1} B A^{-\top}$ (sandwich) |
| Asymptotic efficiency | $=$ MLE for natural exponential families; $<$ MLE in general | Yes (achieves the CRLB) | Only under correct specification with $\psi =$ score |
| Robustness to outliers | Poor (sample moments are non-robust) | Poor (the likelihood is non-robust) | Tunable via $\psi$ (Huber, Tukey) |
| Computational cost | Closed form: $O(n)$; iterative: $O(n \cdot \text{iter})$ | Newton–Raphson: $O(n \cdot \text{iter})$, typically 4–10 iterations | Depends on $\psi$ |
Remark 13 GMM: the generalized method of moments (Hansen 1982)

The method of moments uses exactly kk moment equations to identify kk parameters. Generalized method of moments (GMM), introduced by Lars Peter Hansen in his 1982 Econometrica paper, lifts this restriction: it allows mkm \geq k moment conditions on kk parameters, giving an over-identified system that is then solved by minimizing a weighted distance.

Specifically, given moment conditions g(X;θ)Rmg(X; \theta) \in \mathbb{R}^m with the population property Eθ0[g(X;θ0)]=0\mathbb{E}_{\theta_0}[g(X; \theta_0)] = 0, define the empirical moment vector

g^n(θ)  =  1ni=1ng(Xi;θ).\hat{g}_n(\theta) \;=\; \frac{1}{n}\sum_{i=1}^n g(X_i; \theta).

The GMM estimator with weighting matrix WnW_n (an m×mm \times m positive-definite matrix that may depend on the data) is

θ^GMM  =  argminθg^n(θ)Wng^n(θ).\hat{\theta}_{\text{GMM}} \;=\; \arg\min_\theta \hat{g}_n(\theta)^\top W_n \hat{g}_n(\theta).

When m=km = k and Wn=ImW_n = I_m, GMM reduces to ordinary MoM. When m>km > k, the estimator is over-identified: the mm moment conditions cannot all be satisfied exactly, so GMM finds the θ\theta that comes closest in the WnW_n-weighted norm. The optimal weighting matrix — the one that minimizes the asymptotic variance of θ^GMM\hat{\theta}_{\text{GMM}} — is W=Cov(g(X;θ0))1W^* = \operatorname{Cov}(g(X; \theta_0))^{-1}, the inverse of the covariance matrix of the moment conditions. This is efficient GMM.

GMM is the workhorse of empirical economics: it covers instrumental variables (IV) estimation, Euler equation estimation in dynamic consumption models, asset-pricing moment conditions, and the Arellano-Bond estimator for dynamic panels. The asymptotic theory parallels M-estimation but with the additional weighting-matrix layer. We do not develop GMM further in formalStatistics; the natural home is a future econometrics topic. For the canonical reference, see Hansen (1982).
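To make the two-step recipe concrete, here is a toy sketch (entirely ours, not Hansen’s application): estimating an Exponential rate from the over-identified pair $\mathbb{E}[X] = 1/\lambda$ and $\mathbb{E}[X^2] = 2/\lambda^2$, first with identity weighting, then with the estimated optimal weight.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
x = rng.exponential(scale=1 / 1.5, size=500)      # true rate lambda = 1.5

def g_bar(lam):
    """Empirical moment vector: mean(X) - 1/lam and mean(X^2) - 2/lam^2."""
    return np.array([x.mean() - 1 / lam, (x**2).mean() - 2 / lam**2])

def objective(lam, W):
    g = g_bar(lam)
    return g @ W @ g                              # the quadratic form g' W g

# Step 1: identity weighting -- consistent but not efficient.
lam1 = minimize_scalar(objective, bounds=(0.1, 10), args=(np.eye(2),),
                       method="bounded").x

# Step 2: re-weight by the inverse covariance of the moment conditions,
# estimated at the first-stage value (two-step efficient GMM).
G = np.column_stack([x - 1 / lam1, x**2 - 2 / lam1**2])
W_opt = np.linalg.inv(G.T @ G / len(x))
lam2 = minimize_scalar(objective, bounds=(0.1, 10), args=(W_opt,),
                       method="bounded").x

print(f"first stage {lam1:.3f}   efficient GMM {lam2:.3f}")   # both near 1.5
```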

Where this leads. Sufficient Statistics extends Topic 13’s evaluation framework with the Rao-Blackwell theorem and Lehmann–Scheffé, comparing UMVUE to both MLE and MoM; when MLE and MoM coincide (exponential families), they are both UMVUE under completeness. Hypothesis Testing builds the score test and Wald test as asymptotic χ2\chi^2 tests (Topic 17 §17.9); the sandwich-variance estimator developed in §15.10 reappears as the basis of robust hypothesis tests under model misspecification in Track 6 (Linear Regression; Generalized Linear Models). Confidence Intervals inverts the M-estimator asymptotic-normality result: a Wald-type interval is θ^n±zα/2V^/n\hat{\theta}_n \pm z_{\alpha/2} \cdot \sqrt{\hat{V}/n}, with V^\hat{V} the sandwich-variance estimate. Linear Regression treats OLS as the M-estimator with ψ(x,y;β)=x(yxβ)\psi(x, y; \beta) = x(y - x^\top\beta); White’s heteroskedasticity-consistent standard errors are exactly the sandwich formula applied to OLS. Generalized Linear Models develops the Huber and Tukey extensions of OLS — robust regression — built on §15.10–§15.11.

Across formalML, M-estimation organizes much of supervised learning. formalML: Empirical Risk Minimization is M-estimation — the empirical risk is the average loss, and minimizing it solves a Z-estimating equation. formalML: Robust Statistics develops the Huber and Tukey M-estimators in regression, with the influence-function machinery and breakdown-point analysis built atop §15.11. formalML: Quantile Regression adopts the check-loss ψ-function, a non-smooth instance of the framework. And formalML: Generalized Method of Moments extends MoM to the over-identified setting Hansen (1982) introduced.

The reader who has worked through Topics 13–15 has the complete classical estimation toolkit: an evaluation framework (Topic 13), a maximally efficient default estimator (Topic 14), a fast warm-start with closed-form structure for many families (Topic 15 first half), and a unifying asymptotic theory that subsumes everything and accommodates robust alternatives (Topic 15 second half). Track 4 closes with Sufficient Statistics (Topic 16), which gives the data-reduction theory completing the foundations.


References

  1. Casella, G., & Berger, R. L. (2002). Statistical Inference (2nd ed.). Duxbury.
  2. van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press.
  3. Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley.
  4. Rousseeuw, P. J., & Leroy, A. M. (1987). Robust Regression and Outlier Detection. Wiley.
  5. Lehmann, E. L., & Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer.
  6. Wasserman, L. (2004). All of Statistics. Springer.
  7. Hansen, L. P. (1982). Large Sample Properties of Generalized Method of Moments Estimators. Econometrica, 50(4), 1029–1054.
  8. Huber, P. J. (1964). Robust Estimation of a Location Parameter. Annals of Mathematical Statistics, 35(1), 73–101.