intermediate 50 min read · April 13, 2026

Multivariate Distributions

Joint, marginal, conditional densities in p dimensions, the multivariate Normal, Multinomial, Dirichlet, copulas — dependence beyond correlation.

8.1 From Bivariate to p Dimensions

Consider a neural network with 1,000 weights. Each weight is a random variable — uncertain before training, updated by data during training, distributed across a posterior after training. To describe the joint uncertainty of all 1,000 weights simultaneously, we need a distribution on R1000\mathbb{R}^{1000}. A single-variable Normal will not do. Neither will 1,000 independent single-variable Normals — because the weights are correlated (constraining one weight constrains others through the shared loss landscape). We need the full joint distribution: a function that assigns probability to every region of R1000\mathbb{R}^{1000} at once, capturing all the dependencies.

This is not an exotic requirement. Nearly every ML model operates on vectors of random variables:

  • Feature vectors xRp\mathbf{x} \in \mathbb{R}^p — the input to every supervised learning model. The joint distribution of features determines which classifiers work and which fail.
  • Weight vectors wRp\mathbf{w} \in \mathbb{R}^p — the parameters of a linear model, neural network, or Gaussian process. Bayesian inference places a prior distribution on w\mathbf{w} and computes a posterior, both of which are multivariate distributions.
  • Latent representations zRd\mathbf{z} \in \mathbb{R}^d — the hidden variables in a variational autoencoder, the topic proportions in LDA, the factors in a factor model. The generative process starts with a draw from a multivariate distribution over z\mathbf{z}.

Topic 3 — Random Variables introduced joint distributions for the bivariate case: two random variables XX and YY with a joint PDF fX,Y(x,y)f_{X,Y}(x, y), marginals obtained by integrating out one variable, and conditional distributions obtained by dividing (Definitions 8-14). That bivariate machinery is the right starting point, but it covers only p=2p = 2. The real world hands us p=10p = 10, p=100p = 100, p=10,000p = 10{,}000.

This topic generalizes everything from the bivariate case to pp dimensions. The ideas are the same — joint, marginal, conditional, independence — but the notation shifts from integrals over R\mathbb{R} to integrals over Rpq\mathbb{R}^{p-q}, from scalar variances to covariance matrices, and from scalar conditioning formulas to block matrix partitions. The payoff is enormous: the conditional multivariate Normal formula that we derive in Section 8.4 is the exact formula that powers Gaussian process prediction, Kalman filtering, and Bayesian linear regression. The Dirichlet distribution that we define in Section 8.6 is the prior that powers LDA topic models. The covariance matrix geometry of Section 8.7 is the mathematical foundation of PCA.

Overview of multivariate distributions: MVN contours at different correlations, Multinomial scatter with expected value, and Dirichlet samples on the simplex

8.2 The General p-Dimensional Framework

We begin with the general definitions. These are direct extensions of the bivariate definitions from Topic 3, but stated for an arbitrary dimension pp. The key prerequisite is comfort with iterated integrals over Rp\mathbb{R}^p — see the calculus prerequisites for Fubini’s theorem and the mechanics of integrating over subsets of Rp\mathbb{R}^p.

Definition 1 Joint PDF in p Dimensions

A continuous random vector X=(X1,X2,,Xp)T\mathbf{X} = (X_1, X_2, \ldots, X_p)^T has a joint probability density function fX:Rp[0,)f_{\mathbf{X}} : \mathbb{R}^p \to [0, \infty) if, for every (measurable) region ARpA \subseteq \mathbb{R}^p,

P(XA)=AfX(x)dx=AfX(x1,x2,,xp)dx1dx2dxpP(\mathbf{X} \in A) = \int_A f_{\mathbf{X}}(\mathbf{x}) \, d\mathbf{x} = \int_A f_{\mathbf{X}}(x_1, x_2, \ldots, x_p) \, dx_1 \, dx_2 \cdots dx_p

The density must satisfy fX(x)0f_{\mathbf{X}}(\mathbf{x}) \geq 0 for all xRp\mathbf{x} \in \mathbb{R}^p and

RpfX(x)dx=1\int_{\mathbb{R}^p} f_{\mathbf{X}}(\mathbf{x}) \, d\mathbf{x} = 1

When p=2p = 2, this recovers the bivariate joint PDF fX,Y(x,y)f_{X,Y}(x, y) from Topic 3 (Definition 8). The extension to p>2p > 2 is notational: we replace double integrals with pp-fold integrals.

Definition 2 Marginal Density via Integration

Given a random vector X=(X1,,Xp)T\mathbf{X} = (X_1, \ldots, X_p)^T with joint PDF fXf_{\mathbf{X}}, the marginal density of a subvector X1=(X1,,Xq)T\mathbf{X}_1 = (X_1, \ldots, X_q)^T (where 1q<p1 \leq q < p) is obtained by integrating the joint density over all components not in X1\mathbf{X}_1:

fX1(x1,,xq)=RpqfX(x1,,xq,xq+1,,xp)dxq+1dxpf_{\mathbf{X}_1}(x_1, \ldots, x_q) = \int_{\mathbb{R}^{p-q}} f_{\mathbf{X}}(x_1, \ldots, x_q, x_{q+1}, \ldots, x_p) \, dx_{q+1} \cdots dx_p

In words: to find the density of a subset of the variables, integrate out the rest. This is the pp-dimensional generalization of Topic 3’s marginal formula (Definition 9), where we integrated out one variable over R\mathbb{R}.

The order of integration does not matter (by Fubini’s theorem), so we can marginalize in any order. To get the marginal of X3X_3 alone from a joint density on (X1,X2,X3,X4)(X_1, X_2, X_3, X_4), we integrate out X1X_1, X2X_2, and X4X_4 in any convenient order.
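Fubini’s claim is easy to check numerically. The sketch below uses a hypothetical three-dimensional Gaussian-type density on a grid (the quadratic form and grid bounds are arbitrary choices), marginalizes out x1x_1 and x2x_2 in both orders, and confirms the resulting marginal of x3x_3 is the same either way.

```python
import numpy as np

# A toy joint density on (x1, x2, x3): an unnormalized Gaussian-type density
# evaluated on a grid. We recover the marginal of x3 by integrating out x1 and
# x2 in both possible orders and confirm the results agree (Fubini's theorem).
grid = np.linspace(-5, 5, 101)
dx = grid[1] - grid[0]
X1, X2, X3 = np.meshgrid(grid, grid, grid, indexing="ij")
Q = X1**2 + X2**2 + X3**2 + X1 * X2 + 0.5 * X2 * X3   # positive definite quadratic
f = np.exp(-0.5 * Q)
f /= f.sum() * dx**3                                   # normalize numerically

# Integrate out x1 first, then x2 -- and then in the opposite order.
m_12 = (f.sum(axis=0) * dx).sum(axis=0) * dx           # marginal of x3
m_21 = (f.sum(axis=1) * dx).sum(axis=0) * dx
print(np.allclose(m_12, m_21))                         # True
print(np.isclose((m_12 * dx).sum(), 1.0))              # marginal integrates to 1
```

The marginal itself is a proper density: it is nonnegative and integrates to 1, as the second check confirms.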

Definition 3 Conditional Density of Subvectors

Partition the random vector as X=(X1T,X2T)T\mathbf{X} = (\mathbf{X}_1^T, \mathbf{X}_2^T)^T, where X1\mathbf{X}_1 is qq-dimensional and X2\mathbf{X}_2 is (pq)(p - q)-dimensional. The conditional density of X1\mathbf{X}_1 given X2=x2\mathbf{X}_2 = \mathbf{x}_2 is

fX1X2(x1x2)=fX(x1,x2)fX2(x2)f_{\mathbf{X}_1 | \mathbf{X}_2}(\mathbf{x}_1 \mid \mathbf{x}_2) = \frac{f_{\mathbf{X}}(\mathbf{x}_1, \mathbf{x}_2)}{f_{\mathbf{X}_2}(\mathbf{x}_2)}

provided fX2(x2)>0f_{\mathbf{X}_2}(\mathbf{x}_2) > 0.

This is the ratio of the joint density to the marginal density of the conditioning variables — the same ratio as in the bivariate case (Topic 3, Definition 10), now applied to subvectors of arbitrary dimension. The conditional density fX1X2(x2)f_{\mathbf{X}_1 | \mathbf{X}_2}(\cdot \mid \mathbf{x}_2) is a proper density in x1\mathbf{x}_1 for each fixed x2\mathbf{x}_2: it integrates to 1 over Rq\mathbb{R}^q.

Definition 4 Mutual Independence

The random variables X1,X2,,XpX_1, X_2, \ldots, X_p are mutually independent if and only if the joint density factors as the product of all marginal densities:

fX(x1,x2,,xp)=fX1(x1)fX2(x2)fXp(xp)=j=1pfXj(xj)f_{\mathbf{X}}(x_1, x_2, \ldots, x_p) = f_{X_1}(x_1) \cdot f_{X_2}(x_2) \cdots f_{X_p}(x_p) = \prod_{j=1}^p f_{X_j}(x_j)

for all (x1,,xp)Rp(x_1, \ldots, x_p) \in \mathbb{R}^p.

Mutual independence is strictly stronger than pairwise independence (every pair (Xi,Xj)(X_i, X_j) is independent). Topic 2 (Conditional Probability) discussed this distinction for events; the same distinction holds for random variables. When p>2p > 2, verifying that all (p2)\binom{p}{2} pairs are independent does not guarantee mutual independence — the joint must factor as a product of all pp marginals.

The chain rule for densities tells us how to decompose any joint density into a product of conditionals. This is the density analog of the chain rule for probabilities from Topic 2.

Theorem 1 General Chain Rule for Densities

For any random vector X=(X1,X2,,Xp)T\mathbf{X} = (X_1, X_2, \ldots, X_p)^T with a joint density, the joint PDF factors as

fX(x1,x2,,xp)=fX1(x1)fX2X1(x2x1)fX3X1,X2(x3x1,x2)fXpX1,,Xp1(xpx1,,xp1)f_{\mathbf{X}}(x_1, x_2, \ldots, x_p) = f_{X_1}(x_1) \cdot f_{X_2 | X_1}(x_2 \mid x_1) \cdot f_{X_3 | X_1, X_2}(x_3 \mid x_1, x_2) \cdots f_{X_p | X_1, \ldots, X_{p-1}}(x_p \mid x_1, \ldots, x_{p-1})=j=1pfXjX1,,Xj1(xjx1,,xj1)= \prod_{j=1}^p f_{X_j | X_1, \ldots, X_{j-1}}(x_j \mid x_1, \ldots, x_{j-1})

where fX1fX1f_{X_1 | \emptyset} \equiv f_{X_1} (the first factor is the marginal).

Proof [show]

We proceed by induction on pp, using the definition of conditional density (Definition 3) at each step.

Base case (p=2p = 2). By Definition 3,

fX1,X2(x1,x2)=fX2X1(x2x1)fX1(x1)f_{X_1, X_2}(x_1, x_2) = f_{X_2 | X_1}(x_2 \mid x_1) \cdot f_{X_1}(x_1)

This is the definition of the conditional density rearranged.

Inductive step. Assume the result holds for p1p - 1 variables:

fX1,,Xp1(x1,,xp1)=j=1p1fXjX1,,Xj1(xjx1,,xj1)f_{X_1, \ldots, X_{p-1}}(x_1, \ldots, x_{p-1}) = \prod_{j=1}^{p-1} f_{X_j | X_1, \ldots, X_{j-1}}(x_j \mid x_1, \ldots, x_{j-1})

Now partition X\mathbf{X} as (X1T,Xp)T(\mathbf{X}_1^T, X_p)^T where X1=(X1,,Xp1)T\mathbf{X}_1 = (X_1, \ldots, X_{p-1})^T. By Definition 3 applied to this partition:

fX(x1,,xp)=fXpX1,,Xp1(xpx1,,xp1)fX1,,Xp1(x1,,xp1)f_{\mathbf{X}}(x_1, \ldots, x_p) = f_{X_p | X_1, \ldots, X_{p-1}}(x_p \mid x_1, \ldots, x_{p-1}) \cdot f_{X_1, \ldots, X_{p-1}}(x_1, \ldots, x_{p-1})

Substituting the inductive hypothesis for fX1,,Xp1f_{X_1, \ldots, X_{p-1}}:

fX(x1,,xp)=fXpX1,,Xp1(xpx1,,xp1)j=1p1fXjX1,,Xj1(xjx1,,xj1)f_{\mathbf{X}}(x_1, \ldots, x_p) = f_{X_p | X_1, \ldots, X_{p-1}}(x_p \mid x_1, \ldots, x_{p-1}) \cdot \prod_{j=1}^{p-1} f_{X_j | X_1, \ldots, X_{j-1}}(x_j \mid x_1, \ldots, x_{j-1})=j=1pfXjX1,,Xj1(xjx1,,xj1)= \prod_{j=1}^{p} f_{X_j | X_1, \ldots, X_{j-1}}(x_j \mid x_1, \ldots, x_{j-1})

The result follows by induction.

The chain rule is not merely a formula — it is the theoretical foundation of autoregressive models. When a language model generates text token by token, computing P(tokenttoken1,,tokent1)P(\text{token}_t \mid \text{token}_1, \ldots, \text{token}_{t-1}) at each step, it is implementing the chain rule for the joint distribution of the token sequence. The factorization order matters in practice (left-to-right vs. right-to-left vs. arbitrary), even though mathematically any ordering gives the same joint.
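The chain rule also gives a sampling recipe: draw X1X_1 from its marginal, then X2X_2 from its conditional given X1X_1, and so on. This is ancestral sampling. A minimal sketch with made-up Gaussian conditionals (the coefficients 0.8, 0.5, −0.3 are arbitrary illustrative choices):

```python
import numpy as np

# Ancestral sampling: draw X1, then X2 | X1, then X3 | X1, X2 -- exactly the
# chain-rule factorization. Each conditional here is a Gaussian whose mean
# depends linearly on the earlier coordinates.
rng = np.random.default_rng(0)
n = 200_000
x1 = rng.normal(0.0, 1.0, n)                 # f(x1)
x2 = rng.normal(0.8 * x1, 1.0)               # f(x2 | x1)
x3 = rng.normal(0.5 * x1 - 0.3 * x2, 1.0)    # f(x3 | x1, x2)

# The product of the three conditionals defines a single joint distribution;
# e.g. Cov(X1, X2) = 0.8 follows directly from the factorization.
print(abs(np.cov(x1, x2)[0, 1] - 0.8) < 0.02)
```

Swapping the factorization order (say, sampling X3X_3 first) would require different conditionals but would define the same joint, which is the mathematical content of Theorem 1.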

Three-panel diagram: joint contours with marginal PDFs, conditional density slices, and chain rule factorization tree

8.3 The Multivariate Normal Distribution

The univariate Normal N(μ,σ2)\mathcal{N}(\mu, \sigma^2) was the star of Topic 6 — Continuous Distributions. Its multivariate generalization Np(μ,Σ)\mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) is the single most important distribution in all of statistics and machine learning. It appears as the prior on Bayesian linear regression weights, the variational family in variational inference, the base distribution in normalizing flows, the finite-dimensional projection of a Gaussian process, and the asymptotic distribution of virtually every well-behaved estimator. If you internalize one distribution from this topic, make it this one.

Before the definition, let us build intuition. In one dimension, a Normal distribution is specified by its center μ\mu and its spread σ2\sigma^2. In two dimensions, we still need a center — now a vector μ=(μ1,μ2)T\boldsymbol{\mu} = (\mu_1, \mu_2)^T — but the “spread” becomes richer. We need not just the variance of each coordinate (σ12\sigma_1^2 and σ22\sigma_2^2), but also the covariance between them (σ12\sigma_{12}). Positive covariance stretches the distribution along the diagonal; negative covariance stretches it along the anti-diagonal. The contours of equal density, which are circles for independent equal-variance Normals, become ellipses whose orientation and eccentricity are controlled by the covariance. In pp dimensions, the contours are ellipsoids in Rp\mathbb{R}^p, and the full covariance structure is captured by a p×pp \times p matrix Σ\boldsymbol{\Sigma}.

Definition 5 Multivariate Normal Distribution

A random vector X=(X1,X2,,Xp)T\mathbf{X} = (X_1, X_2, \ldots, X_p)^T has the multivariate Normal distribution with mean vector μRp\boldsymbol{\mu} \in \mathbb{R}^p and p×pp \times p positive definite covariance matrix Σ\boldsymbol{\Sigma}, written XNp(μ,Σ)\mathbf{X} \sim \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}), if its joint PDF is

fX(x)=(2π)p/2Σ1/2exp ⁣(12(xμ)TΣ1(xμ))f_{\mathbf{X}}(\mathbf{x}) = (2\pi)^{-p/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\!\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)

where Σ|\boldsymbol{\Sigma}| denotes the determinant of Σ\boldsymbol{\Sigma}.

The term (xμ)TΣ1(xμ)(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) in the exponent is a quadratic form — a scalar that measures how far x\mathbf{x} is from μ\boldsymbol{\mu}, weighted by Σ1\boldsymbol{\Sigma}^{-1}. This is the squared Mahalanobis distance (Definition 9). The contours of constant density are the sets where this quadratic form is constant, which are ellipsoids centered at μ\boldsymbol{\mu}.

When p=1p = 1, we recover the univariate Normal: μ=μ\boldsymbol{\mu} = \mu, Σ=σ2\boldsymbol{\Sigma} = \sigma^2, and the exponent becomes (xμ)2/(2σ2)-(x - \mu)^2 / (2\sigma^2).
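As a quick sanity check on Definition 5, the log-density can be coded directly from μ\boldsymbol{\mu} and Σ\boldsymbol{\Sigma}. With a diagonal Σ\boldsymbol{\Sigma} it must agree with a product of univariate Normal densities; the particular numbers below are arbitrary.

```python
import numpy as np

# Evaluate the MVN log-density of Definition 5 directly from mu and Sigma.
def mvn_logpdf(x, mu, Sigma):
    p = len(mu)
    d = x - mu
    maha = d @ np.linalg.solve(Sigma, d)      # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(Sigma)      # stable log-determinant
    return -0.5 * (p * np.log(2 * np.pi) + logdet + maha)

# Sanity check: with a diagonal Sigma the joint log-density must equal the
# sum of univariate Normal log-densities (independence of the components).
mu = np.array([1.0, -2.0, 0.5])
var = np.array([2.0, 0.5, 1.5])
x = np.array([0.3, -1.0, 2.0])
lhs = mvn_logpdf(x, mu, np.diag(var))
rhs = np.sum(-0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var))
print(np.isclose(lhs, rhs))   # True
```

Using `np.linalg.solve` rather than forming Σ1\boldsymbol{\Sigma}^{-1} explicitly, and `slogdet` rather than `det`, is the standard numerically stable way to evaluate this density in high dimensions.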

The MVN has a collection of remarkable properties that make it uniquely tractable. No other distribution enjoys all six of the properties below simultaneously.

Theorem 2 Properties of the Multivariate Normal

Let XNp(μ,Σ)\mathbf{X} \sim \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}). Then:

  1. Marginals are Normal. Every subvector of X\mathbf{X} is also multivariate Normal. In particular, each component XjN(μj,Σjj)X_j \sim \mathcal{N}(\mu_j, \Sigma_{jj}).

  2. Linear transformations preserve normality. If A\mathbf{A} is a q×pq \times p matrix and bRq\mathbf{b} \in \mathbb{R}^q, then Y=AX+bNq(Aμ+b,AΣAT)\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b} \sim \mathcal{N}_q(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^T).

  3. Uncorrelated implies independent. If Cov(Xi,Xj)=0\text{Cov}(X_i, X_j) = 0 for all iji \neq j (equivalently, Σ\boldsymbol{\Sigma} is diagonal), then X1,,XpX_1, \ldots, X_p are mutually independent. This equivalence between uncorrelatedness and independence is a special feature of the joint Normal: for general joint distributions, zero correlation does not imply independence, and even Normal marginals without a jointly Normal vector are not enough.

  4. Moment-generating function. MX(t)=E[etTX]=exp ⁣(tTμ+12tTΣt)M_{\mathbf{X}}(\mathbf{t}) = E[e^{\mathbf{t}^T \mathbf{X}}] = \exp\!\left(\mathbf{t}^T \boldsymbol{\mu} + \frac{1}{2} \mathbf{t}^T \boldsymbol{\Sigma} \mathbf{t}\right) for all tRp\mathbf{t} \in \mathbb{R}^p.

  5. Sum of independent Normals. If XNp(μX,ΣX)\mathbf{X} \sim \mathcal{N}_p(\boldsymbol{\mu}_X, \boldsymbol{\Sigma}_X) and YNp(μY,ΣY)\mathbf{Y} \sim \mathcal{N}_p(\boldsymbol{\mu}_Y, \boldsymbol{\Sigma}_Y) are independent, then X+YNp(μX+μY,ΣX+ΣY)\mathbf{X} + \mathbf{Y} \sim \mathcal{N}_p(\boldsymbol{\mu}_X + \boldsymbol{\mu}_Y, \boldsymbol{\Sigma}_X + \boldsymbol{\Sigma}_Y).

  6. Characteristic function. φX(t)=E[eitTX]=exp ⁣(itTμ12tTΣt)\varphi_{\mathbf{X}}(\mathbf{t}) = E[e^{i\mathbf{t}^T \mathbf{X}}] = \exp\!\left(i\mathbf{t}^T \boldsymbol{\mu} - \frac{1}{2} \mathbf{t}^T \boldsymbol{\Sigma} \mathbf{t}\right), obtained from the MGF by replacing t\mathbf{t} with iti\mathbf{t}; it is defined for all tRp\mathbf{t} \in \mathbb{R}^p.

Proof [show]

Proof of Property 1 (Marginals are Normal).

We prove that the marginal distribution of X1=(X1,,Xq)T\mathbf{X}_1 = (X_1, \ldots, X_q)^T is Nq(μ1,Σ11)\mathcal{N}_q(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11}), where μ1=(μ1,,μq)T\boldsymbol{\mu}_1 = (\mu_1, \ldots, \mu_q)^T and Σ11\boldsymbol{\Sigma}_{11} is the upper-left q×qq \times q block of Σ\boldsymbol{\Sigma}.

Partition X=(X1T,X2T)T\mathbf{X} = (\mathbf{X}_1^T, \mathbf{X}_2^T)^T, μ=(μ1T,μ2T)T\boldsymbol{\mu} = (\boldsymbol{\mu}_1^T, \boldsymbol{\mu}_2^T)^T, and

Σ=(Σ11Σ12Σ21Σ22)\boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix}

The marginal density of X1\mathbf{X}_1 is obtained by integrating out X2\mathbf{X}_2:

fX1(x1)=RpqfX(x1,x2)dx2f_{\mathbf{X}_1}(\mathbf{x}_1) = \int_{\mathbb{R}^{p-q}} f_{\mathbf{X}}(\mathbf{x}_1, \mathbf{x}_2) \, d\mathbf{x}_2

The exponent in the joint density is the quadratic form Q=(xμ)TΣ1(xμ)Q = (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}). Using the block matrix inverse formula, we can write

Σ1=(Σ11Σ12Σ21Σ22)\boldsymbol{\Sigma}^{-1} = \begin{pmatrix} \boldsymbol{\Sigma}^{11} & \boldsymbol{\Sigma}^{12} \\ \boldsymbol{\Sigma}^{21} & \boldsymbol{\Sigma}^{22} \end{pmatrix}

where Σ22=(Σ22Σ21Σ111Σ12)1\boldsymbol{\Sigma}^{22} = (\boldsymbol{\Sigma}_{22} - \boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12})^{-1} and Σ11=(Σ11Σ12Σ221Σ21)1\boldsymbol{\Sigma}^{11} = (\boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21})^{-1}.

Expanding the quadratic form QQ by completing the square in x2\mathbf{x}_2, we can separate the terms involving x2\mathbf{x}_2 from those involving only x1\mathbf{x}_1. Define a2=μ2+Σ21Σ111(x1μ1)\mathbf{a}_2 = \boldsymbol{\mu}_2 + \boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}(\mathbf{x}_1 - \boldsymbol{\mu}_1) and S21=Σ22Σ21Σ111Σ12\mathbf{S}_{2|1} = \boldsymbol{\Sigma}_{22} - \boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}. Then

Q=(x1μ1)TΣ111(x1μ1)+(x2a2)TS211(x2a2)Q = (\mathbf{x}_1 - \boldsymbol{\mu}_1)^T \boldsymbol{\Sigma}_{11}^{-1} (\mathbf{x}_1 - \boldsymbol{\mu}_1) + (\mathbf{x}_2 - \mathbf{a}_2)^T \mathbf{S}_{2|1}^{-1} (\mathbf{x}_2 - \mathbf{a}_2)

The first term depends only on x1\mathbf{x}_1. The second term is a quadratic form in x2\mathbf{x}_2 centered at a2\mathbf{a}_2. When we integrate over x2\mathbf{x}_2:

Rpqexp ⁣(12(x2a2)TS211(x2a2))dx2=(2π)(pq)/2S211/2\int_{\mathbb{R}^{p-q}} \exp\!\left(-\frac{1}{2}(\mathbf{x}_2 - \mathbf{a}_2)^T \mathbf{S}_{2|1}^{-1} (\mathbf{x}_2 - \mathbf{a}_2)\right) d\mathbf{x}_2 = (2\pi)^{(p-q)/2} |\mathbf{S}_{2|1}|^{1/2}

because the integrand is the kernel of a Npq(a2,S21)\mathcal{N}_{p-q}(\mathbf{a}_2, \mathbf{S}_{2|1}) density and integrates to the reciprocal of the normalizing constant.

Combining the normalizing constants:

fX1(x1)=(2π)p/2Σ1/2(2π)(pq)/2S211/2exp ⁣(12(x1μ1)TΣ111(x1μ1))f_{\mathbf{X}_1}(\mathbf{x}_1) = (2\pi)^{-p/2} |\boldsymbol{\Sigma}|^{-1/2} \cdot (2\pi)^{(p-q)/2} |\mathbf{S}_{2|1}|^{1/2} \cdot \exp\!\left(-\frac{1}{2}(\mathbf{x}_1 - \boldsymbol{\mu}_1)^T \boldsymbol{\Sigma}_{11}^{-1} (\mathbf{x}_1 - \boldsymbol{\mu}_1)\right)

The determinant identity Σ=Σ11S21|\boldsymbol{\Sigma}| = |\boldsymbol{\Sigma}_{11}| \cdot |\mathbf{S}_{2|1}| (which follows from the block determinant formula) gives

Σ1/2S211/2=Σ111/2|\boldsymbol{\Sigma}|^{-1/2} \cdot |\mathbf{S}_{2|1}|^{1/2} = |\boldsymbol{\Sigma}_{11}|^{-1/2}

and (2π)p/2(2π)(pq)/2=(2π)q/2(2\pi)^{-p/2} \cdot (2\pi)^{(p-q)/2} = (2\pi)^{-q/2}. Therefore

fX1(x1)=(2π)q/2Σ111/2exp ⁣(12(x1μ1)TΣ111(x1μ1))f_{\mathbf{X}_1}(\mathbf{x}_1) = (2\pi)^{-q/2} |\boldsymbol{\Sigma}_{11}|^{-1/2} \exp\!\left(-\frac{1}{2}(\mathbf{x}_1 - \boldsymbol{\mu}_1)^T \boldsymbol{\Sigma}_{11}^{-1} (\mathbf{x}_1 - \boldsymbol{\mu}_1)\right)

This is the PDF of Nq(μ1,Σ11)\mathcal{N}_q(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11}). The marginal of any subvector is Normal with mean and covariance given by the corresponding subvector and submatrix.
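The determinant identity Σ=Σ11S21|\boldsymbol{\Sigma}| = |\boldsymbol{\Sigma}_{11}| \cdot |\mathbf{S}_{2|1}| invoked in the proof is easy to verify numerically on a random positive definite matrix; the dimensions and partition below are arbitrary choices.

```python
import numpy as np

# Numeric check of the block determinant identity used in the proof:
# |Sigma| = |Sigma_11| * |S_{2|1}|, where S_{2|1} = Sigma_22 - Sigma_21 Sigma_11^{-1} Sigma_12.
rng = np.random.default_rng(3)
M = rng.normal(size=(5, 5))
Sigma = M @ M.T + 5 * np.eye(5)        # random positive definite 5x5 matrix
q = 2                                  # partition: first q vs last p - q coordinates
S11, S12 = Sigma[:q, :q], Sigma[:q, q:]
S21, S22 = Sigma[q:, :q], Sigma[q:, q:]
schur = S22 - S21 @ np.linalg.solve(S11, S12)   # Schur complement S_{2|1}
print(np.isclose(np.linalg.det(Sigma),
                 np.linalg.det(S11) * np.linalg.det(schur)))  # True
```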

Proof [show]

Proof of Property 2 (Linear transformations preserve normality).

Let Y=AX+b\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b} where A\mathbf{A} is q×pq \times p and bRq\mathbf{b} \in \mathbb{R}^q. We use the moment-generating function. The MGF of Y\mathbf{Y} at tRq\mathbf{t} \in \mathbb{R}^q is

MY(t)=E ⁣[etTY]=E ⁣[etT(AX+b)]=etTbE ⁣[e(ATt)TX]M_{\mathbf{Y}}(\mathbf{t}) = E\!\left[e^{\mathbf{t}^T \mathbf{Y}}\right] = E\!\left[e^{\mathbf{t}^T(\mathbf{A}\mathbf{X} + \mathbf{b})}\right] = e^{\mathbf{t}^T \mathbf{b}} \cdot E\!\left[e^{(\mathbf{A}^T \mathbf{t})^T \mathbf{X}}\right]

The expectation on the right is MX(ATt)M_{\mathbf{X}}(\mathbf{A}^T \mathbf{t}), which by Property 4 of the MVN equals

MX(ATt)=exp ⁣((ATt)Tμ+12(ATt)TΣ(ATt))M_{\mathbf{X}}(\mathbf{A}^T \mathbf{t}) = \exp\!\left((\mathbf{A}^T \mathbf{t})^T \boldsymbol{\mu} + \frac{1}{2}(\mathbf{A}^T \mathbf{t})^T \boldsymbol{\Sigma} (\mathbf{A}^T \mathbf{t})\right)=exp ⁣(tTAμ+12tTAΣATt)= \exp\!\left(\mathbf{t}^T \mathbf{A}\boldsymbol{\mu} + \frac{1}{2}\mathbf{t}^T \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^T \mathbf{t}\right)

Therefore

MY(t)=exp ⁣(tT(Aμ+b)+12tT(AΣAT)t)M_{\mathbf{Y}}(\mathbf{t}) = \exp\!\left(\mathbf{t}^T(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}) + \frac{1}{2}\mathbf{t}^T (\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^T) \mathbf{t}\right)

This is the MGF of Nq(Aμ+b,AΣAT)\mathcal{N}_q(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^T). Since the MGF uniquely determines the distribution (when it exists in a neighborhood of the origin, which it does for the Normal), we conclude YNq(Aμ+b,AΣAT)\mathbf{Y} \sim \mathcal{N}_q(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^T).

Note that Property 1 (marginals are Normal) is a special case: extracting a subvector is a linear transformation with A\mathbf{A} as a selection matrix (rows of the identity matrix) and b=0\mathbf{b} = \mathbf{0}.
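Property 2 also lends itself to a simulation check. Under an arbitrary choice of μ\boldsymbol{\mu}, Σ\boldsymbol{\Sigma}, A\mathbf{A}, and b\mathbf{b} (all hypothetical), the sample mean and covariance of Y=AX+b\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b} should approach Aμ+b\mathbf{A}\boldsymbol{\mu} + \mathbf{b} and AΣAT\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^T:

```python
import numpy as np

# Property 2 checked by simulation: if X ~ N_3(mu, Sigma) and Y = A X + b,
# the sample mean and covariance of Y should approach A mu + b and A Sigma A^T.
rng = np.random.default_rng(1)
mu = np.array([1.0, 0.0, -1.0])
L = np.array([[1.0, 0.0, 0.0], [0.5, 1.0, 0.0], [-0.2, 0.3, 1.0]])
Sigma = L @ L.T                                    # positive definite by construction
X = rng.multivariate_normal(mu, Sigma, size=500_000)

A = np.array([[2.0, -1.0, 0.0], [0.0, 1.0, 3.0]])  # q = 2, p = 3
b = np.array([1.0, -2.0])
Y = X @ A.T + b                                    # each row y_i = A x_i + b

print(np.allclose(Y.mean(axis=0), A @ mu + b, atol=0.02))
print(np.allclose(np.cov(Y.T), A @ Sigma @ A.T, atol=0.1))
```

Note that Y\mathbf{Y} lives in R2\mathbb{R}^2 even though X\mathbf{X} lives in R3\mathbb{R}^3: linear maps can change the dimension, and normality survives.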

Proof [show]

Proof: MVN in exponential family form (see Remark 1 below).

We show that the MVN PDF can be written in the canonical exponential family form f(xθ)=h(x)exp(η(θ)T(x)A(θ))f(\mathbf{x} \mid \boldsymbol{\theta}) = h(\mathbf{x}) \exp(\boldsymbol{\eta}(\boldsymbol{\theta}) \cdot \mathbf{T}(\mathbf{x}) - A(\boldsymbol{\theta})) from Topic 7 — Exponential Families.

Start from the MVN PDF and take the logarithm:

logfX(x)=p2log(2π)12logΣ12(xμ)TΣ1(xμ)\log f_{\mathbf{X}}(\mathbf{x}) = -\frac{p}{2}\log(2\pi) - \frac{1}{2}\log|\boldsymbol{\Sigma}| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})

Expand the quadratic form. Let Λ=Σ1\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1} (the precision matrix). Then

(xμ)TΛ(xμ)=xTΛx2μTΛx+μTΛμ(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Lambda}(\mathbf{x} - \boldsymbol{\mu}) = \mathbf{x}^T \boldsymbol{\Lambda} \mathbf{x} - 2\boldsymbol{\mu}^T \boldsymbol{\Lambda} \mathbf{x} + \boldsymbol{\mu}^T \boldsymbol{\Lambda} \boldsymbol{\mu}

Substituting:

logfX(x)=p2log(2π)+μTΛx12xTΛx12μTΛμ12logΣ\log f_{\mathbf{X}}(\mathbf{x}) = -\frac{p}{2}\log(2\pi) + \boldsymbol{\mu}^T \boldsymbol{\Lambda} \mathbf{x} - \frac{1}{2}\mathbf{x}^T \boldsymbol{\Lambda} \mathbf{x} - \frac{1}{2}\boldsymbol{\mu}^T \boldsymbol{\Lambda} \boldsymbol{\mu} - \frac{1}{2}\log|\boldsymbol{\Sigma}|

Now we identify the exponential family components. The natural parameters are η1=Λμ=Σ1μ\boldsymbol{\eta}_1 = \boldsymbol{\Lambda}\boldsymbol{\mu} = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} (a pp-vector) and η2=12Λ=12Σ1\boldsymbol{\eta}_2 = -\frac{1}{2}\boldsymbol{\Lambda} = -\frac{1}{2}\boldsymbol{\Sigma}^{-1} (a p×pp \times p symmetric matrix, contributing p(p+1)/2p(p+1)/2 unique parameters). The corresponding sufficient statistics are T1(x)=x\mathbf{T}_1(\mathbf{x}) = \mathbf{x} and T2(x)=xxT\mathbf{T}_2(\mathbf{x}) = \mathbf{x}\mathbf{x}^T.

The components of the canonical form are:

  • h(x)=(2π)p/2h(\mathbf{x}) = (2\pi)^{-p/2}
  • η=(Σ1μ,  12Σ1)\boldsymbol{\eta} = (\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}, \; -\frac{1}{2}\boldsymbol{\Sigma}^{-1})
  • T(x)=(x,  xxT)\mathbf{T}(\mathbf{x}) = (\mathbf{x}, \; \mathbf{x}\mathbf{x}^T)
  • A(η)=12μTΣ1μ+12logΣA(\boldsymbol{\eta}) = \frac{1}{2}\boldsymbol{\mu}^T \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} + \frac{1}{2}\log|\boldsymbol{\Sigma}|

The inner product is ηT(x)=(Σ1μ)Tx+tr(12Σ1xxT)=μTΣ1x12xTΣ1x\boldsymbol{\eta} \cdot \mathbf{T}(\mathbf{x}) = (\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu})^T \mathbf{x} + \text{tr}(-\frac{1}{2}\boldsymbol{\Sigma}^{-1} \mathbf{x}\mathbf{x}^T) = \boldsymbol{\mu}^T\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}\mathbf{x}^T\boldsymbol{\Sigma}^{-1}\mathbf{x}, matching the expansion above.

The MVN is a kk-parameter exponential family with k=p+p(p+1)/2k = p + p(p+1)/2 natural parameters (the entries of Σ1μ\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} and the unique entries of 12Σ1-\frac{1}{2}\boldsymbol{\Sigma}^{-1}). When Σ\boldsymbol{\Sigma} is known, it reduces to a pp-parameter family with natural parameters η=Σ1μ\boldsymbol{\eta} = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} and sufficient statistics T(x)=x\mathbf{T}(\mathbf{x}) = \mathbf{x}.

Remark 1 MVN as Exponential Family Member

The exponential family form of the MVN connects directly to Topic 7. The sufficient statistics T(x)=(x,xxT)\mathbf{T}(\mathbf{x}) = (\mathbf{x}, \mathbf{x}\mathbf{x}^T) tell us that for an i.i.d. sample x1,,xn\mathbf{x}_1, \ldots, \mathbf{x}_n from a MVN, the sample mean xˉ=1nixi\bar{\mathbf{x}} = \frac{1}{n}\sum_i \mathbf{x}_i and the sample second moment matrix 1nixixiT\frac{1}{n}\sum_i \mathbf{x}_i \mathbf{x}_i^T are jointly sufficient for (μ,Σ)(\boldsymbol{\mu}, \boldsymbol{\Sigma}). No matter how large the sample, these two summary quantities capture everything the data has to say about the parameters.

The log-partition function A(η)=12μTΣ1μ+12logΣA(\boldsymbol{\eta}) = \frac{1}{2}\boldsymbol{\mu}^T \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} + \frac{1}{2}\log|\boldsymbol{\Sigma}| generates the moments via differentiation, just as in the univariate case (Topic 7, Theorem 7.1). And the convexity of AA in the natural parameters guarantees that the MLE for (μ,Σ)(\boldsymbol{\mu}, \boldsymbol{\Sigma}) — namely μ^=xˉ\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}} and Σ^=1ni(xixˉ)(xixˉ)T\hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_i (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T — is unique.
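The sufficiency claim can be made concrete: the MLE is computable from the two sufficient statistics alone, without revisiting the raw data. A sketch with made-up parameters:

```python
import numpy as np

# The exponential-family view in action: for MVN data, the sample mean and the
# sample second-moment matrix are jointly sufficient, and the MLE is
# mu_hat = xbar, Sigma_hat = (1/n) * sum_i (x_i - xbar)(x_i - xbar)^T.
rng = np.random.default_rng(2)
mu_true = np.array([2.0, -1.0])
Sigma_true = np.array([[1.0, 0.6], [0.6, 2.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=100_000)

n = len(X)
xbar = X.mean(axis=0)                    # sufficient statistic 1: sample mean
S2 = (X.T @ X) / n                       # sufficient statistic 2: second moments
mu_hat = xbar
Sigma_hat = S2 - np.outer(xbar, xbar)    # MLE, built from the two statistics alone

print(np.allclose(mu_hat, mu_true, atol=0.05))
print(np.allclose(Sigma_hat, Sigma_true, atol=0.05))
```

The identity Σ^=1nixixiTxˉxˉT\hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_i \mathbf{x}_i\mathbf{x}_i^T - \bar{\mathbf{x}}\bar{\mathbf{x}}^T used in the last line is exactly why the pair (xˉ,1nixixiT)(\bar{\mathbf{x}}, \frac{1}{n}\sum_i \mathbf{x}_i\mathbf{x}_i^T) suffices.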

Three-panel: 3D surface plot of bivariate Normal, effect of covariance on ellipses, samples with marginal curves
Interactive: Multivariate Normal Explorer. Adjust the entries of the covariance matrix Σ and observe the contour ellipse, the eigendecomposition (λ₁, λ₂, v₁, v₂), the determinant |Σ|, and the condition number κ(Σ). When Σ = I, Σ is spherical: the contours are circular and the principal axes align with the coordinate axes.

8.4 Conditional Multivariate Normal

This section contains the single most consequential formula in this topic. The conditional MVN result is the exact same formula used to make predictions in Gaussian processes, to update states in Kalman filters, and to perform Bayesian linear regression. Learn this formula once, and you have the engine behind three of the most important tools in applied ML.

Theorem 3 Conditional Multivariate Normal

Let XNp(μ,Σ)\mathbf{X} \sim \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}). Partition X\mathbf{X}, μ\boldsymbol{\mu}, and Σ\boldsymbol{\Sigma} as

X=(X1X2),μ=(μ1μ2),Σ=(Σ11Σ12Σ21Σ22)\mathbf{X} = \begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix}, \quad \boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \quad \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix}

where X1\mathbf{X}_1 is qq-dimensional and X2\mathbf{X}_2 is (pq)(p-q)-dimensional. Then the conditional distribution of X1\mathbf{X}_1 given X2=x2\mathbf{X}_2 = \mathbf{x}_2 is

X1X2=x2Nq ⁣(μ12,  Σ12)\mathbf{X}_1 \mid \mathbf{X}_2 = \mathbf{x}_2 \sim \mathcal{N}_q\!\left(\boldsymbol{\mu}_{1|2}, \; \boldsymbol{\Sigma}_{1|2}\right)

where the conditional mean and conditional covariance are

μ12=μ1+Σ12Σ221(x2μ2)\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)Σ12=Σ11Σ12Σ221Σ21\boldsymbol{\Sigma}_{1|2} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}

Two properties are striking:

  • The conditional mean is a linear function of x2\mathbf{x}_2. It equals the prior mean μ1\boldsymbol{\mu}_1 plus a correction term Σ12Σ221(x2μ2)\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2) that adjusts for how much x2\mathbf{x}_2 deviates from its mean, weighted by the cross-covariance Σ12\boldsymbol{\Sigma}_{12} relative to the variance of X2\mathbf{X}_2.
  • The conditional covariance does not depend on x2\mathbf{x}_2. No matter what value of x2\mathbf{x}_2 we observe, the uncertainty about X1\mathbf{X}_1 is always Σ12=Σ11Σ12Σ221Σ21\boldsymbol{\Sigma}_{1|2} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}, which is the Schur complement of Σ22\boldsymbol{\Sigma}_{22} in Σ\boldsymbol{\Sigma}.
Proof [show]

We derive the conditional density by starting from the definition fX1X2(x1x2)=fX(x1,x2)/fX2(x2)f_{\mathbf{X}_1 | \mathbf{X}_2}(\mathbf{x}_1 \mid \mathbf{x}_2) = f_{\mathbf{X}}(\mathbf{x}_1, \mathbf{x}_2) / f_{\mathbf{X}_2}(\mathbf{x}_2) and working with the exponent of the joint density.

Step 1: Set up the quadratic form. The exponent of the joint MVN density is

Q=12(xμ)TΣ1(xμ)Q = -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})

We need the block inverse of Σ\boldsymbol{\Sigma}. Using the formula for the inverse of a 2×22 \times 2 block matrix:

Σ1=(Σ111+Σ111Σ12S1Σ21Σ111Σ111Σ12S1S1Σ21Σ111S1)\boldsymbol{\Sigma}^{-1} = \begin{pmatrix} \boldsymbol{\Sigma}_{11}^{-1} + \boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}\mathbf{S}^{-1}\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1} & -\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}\mathbf{S}^{-1} \\ -\mathbf{S}^{-1}\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1} & \mathbf{S}^{-1} \end{pmatrix}

where S=Σ22Σ21Σ111Σ12\mathbf{S} = \boldsymbol{\Sigma}_{22} - \boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12} is the Schur complement of Σ11\boldsymbol{\Sigma}_{11} in Σ\boldsymbol{\Sigma}.

Step 2: Expand the quadratic form. Let d1=x1μ1\mathbf{d}_1 = \mathbf{x}_1 - \boldsymbol{\mu}_1 and d2=x2μ2\mathbf{d}_2 = \mathbf{x}_2 - \boldsymbol{\mu}_2. Expanding Q=12(d1T,d2T)Σ1(d1T,d2T)TQ = -\frac{1}{2}(\mathbf{d}_1^T, \mathbf{d}_2^T)\boldsymbol{\Sigma}^{-1}(\mathbf{d}_1^T, \mathbf{d}_2^T)^T using the block inverse:

2Q=d1T(Σ111+Σ111Σ12S1Σ21Σ111)d12d1TΣ111Σ12S1d2+d2TS1d2-2Q = \mathbf{d}_1^T(\boldsymbol{\Sigma}_{11}^{-1} + \boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}\mathbf{S}^{-1}\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1})\mathbf{d}_1 - 2\mathbf{d}_1^T\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}\mathbf{S}^{-1}\mathbf{d}_2 + \mathbf{d}_2^T\mathbf{S}^{-1}\mathbf{d}_2

Step 3: Complete the square in x1\mathbf{x}_1. We want to rewrite 2Q-2Q as the sum of a perfect square in d1\mathbf{d}_1 (which will give the conditional density) and a term in d2\mathbf{d}_2 alone (which belongs to the marginal of X2\mathbf{X}_2).

Define Σ12=Σ11Σ12Σ221Σ21\boldsymbol{\Sigma}_{1|2} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21} and m=Σ12Σ221d2\mathbf{m} = \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\mathbf{d}_2, so that μ12=μ1+m\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \mathbf{m}.

After completing the square (grouping terms involving d1\mathbf{d}_1 and absorbing cross-terms), the quadratic form separates as:

2Q=(d1m)TΣ121(d1m)+d2TΣ221d2-2Q = (\mathbf{d}_1 - \mathbf{m})^T \boldsymbol{\Sigma}_{1|2}^{-1} (\mathbf{d}_1 - \mathbf{m}) + \mathbf{d}_2^T \boldsymbol{\Sigma}_{22}^{-1} \mathbf{d}_2

We verify this by expanding the right side. The first term expands to:

(d1m)TΣ121(d1m)=d1TΣ121d12d1TΣ121m+mTΣ121m(\mathbf{d}_1 - \mathbf{m})^T \boldsymbol{\Sigma}_{1|2}^{-1} (\mathbf{d}_1 - \mathbf{m}) = \mathbf{d}_1^T \boldsymbol{\Sigma}_{1|2}^{-1} \mathbf{d}_1 - 2\mathbf{d}_1^T \boldsymbol{\Sigma}_{1|2}^{-1}\mathbf{m} + \mathbf{m}^T \boldsymbol{\Sigma}_{1|2}^{-1}\mathbf{m}

Using the Woodbury identity, Σ121=(Σ11Σ12Σ221Σ21)1=Σ111+Σ111Σ12S1Σ21Σ111\boldsymbol{\Sigma}_{1|2}^{-1} = (\boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21})^{-1} = \boldsymbol{\Sigma}_{11}^{-1} + \boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}\mathbf{S}^{-1}\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}, and substituting m=Σ12Σ221d2\mathbf{m} = \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\mathbf{d}_2, each term matches the expansion in Step 2. The cross-terms in d1T()d2\mathbf{d}_1^T(\cdot)\mathbf{d}_2 combine correctly, and the terms purely in d2\mathbf{d}_2 collect to d2TΣ221d2\mathbf{d}_2^T\boldsymbol{\Sigma}_{22}^{-1}\mathbf{d}_2.

Step 4: Read off the conditional. With the quadratic form separated, the joint density factors as

fX(x1,x2)exp ⁣(12(x1μ12)TΣ121(x1μ12))exp ⁣(12d2TΣ221d2)f_{\mathbf{X}}(\mathbf{x}_1, \mathbf{x}_2) \propto \exp\!\left(-\frac{1}{2}(\mathbf{x}_1 - \boldsymbol{\mu}_{1|2})^T \boldsymbol{\Sigma}_{1|2}^{-1} (\mathbf{x}_1 - \boldsymbol{\mu}_{1|2})\right) \cdot \exp\!\left(-\frac{1}{2}\mathbf{d}_2^T \boldsymbol{\Sigma}_{22}^{-1} \mathbf{d}_2\right)

The first factor, viewed as a function of x1\mathbf{x}_1, is the kernel of Nq(μ12,Σ12)\mathcal{N}_q(\boldsymbol{\mu}_{1|2}, \boldsymbol{\Sigma}_{1|2}). The second factor depends only on x2\mathbf{x}_2 and is the kernel of fX2(x2)f_{\mathbf{X}_2}(\mathbf{x}_2).

By Definition 3, the conditional density is

fX1X2(x1x2)=fX(x1,x2)fX2(x2)exp ⁣(12(x1μ12)TΣ121(x1μ12))f_{\mathbf{X}_1 | \mathbf{X}_2}(\mathbf{x}_1 \mid \mathbf{x}_2) = \frac{f_{\mathbf{X}}(\mathbf{x}_1, \mathbf{x}_2)}{f_{\mathbf{X}_2}(\mathbf{x}_2)} \propto \exp\!\left(-\frac{1}{2}(\mathbf{x}_1 - \boldsymbol{\mu}_{1|2})^T \boldsymbol{\Sigma}_{1|2}^{-1} (\mathbf{x}_1 - \boldsymbol{\mu}_{1|2})\right)

This is the kernel of Nq(μ12,Σ12)\mathcal{N}_q(\boldsymbol{\mu}_{1|2}, \boldsymbol{\Sigma}_{1|2}), where

μ12=μ1+Σ12Σ221(x2μ2)\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)Σ12=Σ11Σ12Σ221Σ21\boldsymbol{\Sigma}_{1|2} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}

The conditional mean is linear in x2\mathbf{x}_2 (a regression of X1\mathbf{X}_1 on X2\mathbf{X}_2), and the conditional covariance is constant in x2\mathbf{x}_2 (it depends only on the covariance structure, not on the observed value).
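As a concrete check, the block formulas can be evaluated numerically. The sketch below (illustrative numbers, NumPy assumed) partitions a 3-dimensional MVN with q = 1 and computes μ₁|₂ and Σ₁|₂; the conditional variance comes out strictly smaller than the unconditional Σ₁₁ because Σ₁₂ ≠ 0.

```python
import numpy as np

# Illustrative 3D MVN: X1 = (X_1,) and X2 = (X_2, X_3).
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

q = 1  # dimension of the block being conditioned
S11, S12 = Sigma[:q, :q], Sigma[:q, q:]
S21, S22 = Sigma[q:, :q], Sigma[q:, q:]

x2 = np.array([1.5, -0.5])  # observed value of X2

# Conditional mean: mu_1 + Sigma_12 Sigma_22^{-1} (x2 - mu_2)
mu_cond = mu[:q] + S12 @ np.linalg.solve(S22, x2 - mu[q:])
# Conditional covariance: Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21
Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S21)
```

Note that `np.linalg.solve` is used instead of forming `Σ₂₂⁻¹` explicitly, which is both cheaper and numerically safer.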

Example 1 Bivariate Case: Recovering the Scalar Formula

Let p=2p = 2 with X=(X1,X2)TN2(μ,Σ)\mathbf{X} = (X_1, X_2)^T \sim \mathcal{N}_2(\boldsymbol{\mu}, \boldsymbol{\Sigma}) where

μ=(μ1μ2),Σ=(σ12ρσ1σ2ρσ1σ2σ22)\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}

Here q=1q = 1, Σ11=σ12\boldsymbol{\Sigma}_{11} = \sigma_1^2, Σ12=ρσ1σ2\boldsymbol{\Sigma}_{12} = \rho\sigma_1\sigma_2, Σ21=ρσ1σ2\boldsymbol{\Sigma}_{21} = \rho\sigma_1\sigma_2, Σ22=σ22\boldsymbol{\Sigma}_{22} = \sigma_2^2. Applying the conditional MVN formulas:

μ12=μ1+ρσ1σ21σ22(x2μ2)=μ1+ρσ1σ2(x2μ2)\mu_{1|2} = \mu_1 + \rho\sigma_1\sigma_2 \cdot \frac{1}{\sigma_2^2} \cdot (x_2 - \mu_2) = \mu_1 + \rho\frac{\sigma_1}{\sigma_2}(x_2 - \mu_2)Σ12=σ12ρσ1σ21σ22ρσ1σ2=σ12(1ρ2)\Sigma_{1|2} = \sigma_1^2 - \rho\sigma_1\sigma_2 \cdot \frac{1}{\sigma_2^2} \cdot \rho\sigma_1\sigma_2 = \sigma_1^2(1 - \rho^2)

So X1X2=x2N ⁣(μ1+ρσ1σ2(x2μ2),  σ12(1ρ2))X_1 \mid X_2 = x_2 \sim \mathcal{N}\!\left(\mu_1 + \rho\frac{\sigma_1}{\sigma_2}(x_2 - \mu_2), \; \sigma_1^2(1 - \rho^2)\right).

This recovers the bivariate conditional Normal formula from Topic 3 (Example 9). The conditional mean traces out the regression line μ1+ρ(σ1/σ2)(x2μ2)\mu_1 + \rho(\sigma_1/\sigma_2)(x_2 - \mu_2), and the conditional variance σ12(1ρ2)\sigma_1^2(1-\rho^2) shrinks as ρ1|\rho| \to 1 — stronger correlation means less residual uncertainty after conditioning.

When ρ=0\rho = 0, the conditional mean is just μ1\mu_1 (knowing X2X_2 tells us nothing about X1X_1) and the conditional variance is σ12\sigma_1^2 (no uncertainty reduction). When ρ=1|\rho| = 1, the conditional variance is 0 — knowing X2X_2 determines X1X_1 exactly. This is the perfect linear relationship X1=μ1+ρ(σ1/σ2)(X2μ2)X_1 = \mu_1 + \rho(\sigma_1/\sigma_2)(X_2 - \mu_2).

Three-panel: conditional slices with regression line, block matrix partition diagram, GP prediction as conditional MVN

The conditional MVN formula is the workhorse behind Gaussian process prediction. In a GP, we observe function values fobs\mathbf{f}_{\text{obs}} at training inputs and want to predict function values f\mathbf{f}_* at new inputs. The GP prior places a joint MVN on (f,fobs)T(\mathbf{f}_*, \mathbf{f}_{\text{obs}})^T, and the posterior predictive ffobs\mathbf{f}_* \mid \mathbf{f}_{\text{obs}} is given exactly by the conditional MVN formula: μ12\boldsymbol{\mu}_{1|2} is the posterior mean and Σ12\boldsymbol{\Sigma}_{1|2} is the posterior covariance. See formalML: for the full development.
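A minimal sketch of GP prediction via the conditional MVN formula. The RBF kernel, lengthscale, and training data below are illustrative assumptions, not from the text: the posterior mean and covariance are exactly μ₁|₂ and Σ₁|₂ with the covariance blocks filled in by kernel evaluations.

```python
import numpy as np

def rbf(a, b, ell=0.5):
    # Squared-exponential kernel; the lengthscale ell is an illustrative choice.
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

x_obs = np.array([-1.0, 0.0, 1.0])
f_obs = np.sin(x_obs)            # noiseless observations of sin(x)
x_star = np.array([0.5])         # test input

K = rbf(x_obs, x_obs) + 1e-10 * np.eye(3)  # Sigma_22 block (+ jitter for stability)
K_s = rbf(x_star, x_obs)                   # Sigma_12 block (cross-covariance)
K_ss = rbf(x_star, x_star)                 # Sigma_11 block

# Conditional MVN: posterior of f_* | f_obs (prior mean taken to be zero)
mean_star = K_s @ np.linalg.solve(K, f_obs)
cov_star = K_ss - K_s @ np.linalg.solve(K, K_s.T)
```

With noiseless observations, the posterior variance collapses to zero at the training inputs; at x = 0.5 it sits strictly between zero and the prior variance, and the posterior mean lands near sin(0.5).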

Interactive: Conditional Multivariate Normal Explorer
Drag the horizontal line on the joint density f(x₁, x₂) to condition on x₂. With μ = (0, 0)ᵀ and Σ = [[1.0, 0.60], [0.60, 1.0]], conditioning at x₂ = 0 gives μ₁|₂ = μ₁ + (Σ₁₂/Σ₂₂)(x₂ − μ₂) = 0 and σ²₁|₂ = Σ₁₁ − Σ₁₂²/Σ₂₂ = 1.0 − 0.60²/1.0 = 0.64. The conditional mean is linear in x₂ (the regression line); the conditional variance is constant — it never changes as you drag.

8.5 The Multinomial Distribution

The Binomial counts successes in nn independent trials with two outcomes. What if there are k>2k > 2 possible outcomes? This is the setting of the Multinomial distribution — the natural generalization of the Binomial to multiple categories.

Consider classifying n=100n = 100 emails as spam, promotions, or legitimate. Each email falls into exactly one category, and we model the classification probabilities as (p1,p2,p3)=(0.3,0.2,0.5)(p_1, p_2, p_3) = (0.3, 0.2, 0.5). The counts (X1,X2,X3)(X_1, X_2, X_3) — how many emails fall into each category — follow a Multinomial distribution. The constraint X1+X2+X3=100X_1 + X_2 + X_3 = 100 is baked in: every email goes somewhere. This fixed-total constraint has deep consequences for the covariance structure.

Definition 6 Multinomial Distribution

Let n{1,2,3,}n \in \{1, 2, 3, \ldots\} be the number of trials and p=(p1,p2,,pk)T\mathbf{p} = (p_1, p_2, \ldots, p_k)^T a probability vector with pj>0p_j > 0 and j=1kpj=1\sum_{j=1}^k p_j = 1. The random vector X=(X1,,Xk)T\mathbf{X} = (X_1, \ldots, X_k)^T has the Multinomial distribution Mult(n,p)\text{Mult}(n, \mathbf{p}) if its PMF is

P(X=x)=n!x1!x2!xk!  p1x1p2x2pkxkP(\mathbf{X} = \mathbf{x}) = \frac{n!}{x_1! \, x_2! \cdots x_k!} \; p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}

for non-negative integers x1,,xkx_1, \ldots, x_k with j=1kxj=n\sum_{j=1}^k x_j = n.

The coefficient n!x1!xk!=(nx1,,xk)\frac{n!}{x_1! \cdots x_k!} = \binom{n}{x_1, \ldots, x_k} is the multinomial coefficient — the number of ways to assign nn items into kk groups of sizes x1,,xkx_1, \ldots, x_k.

When k=2k = 2, we have X=(X1,nX1)T\mathbf{X} = (X_1, n - X_1)^T and the PMF reduces to (nx1)p1x1(1p1)nx1\binom{n}{x_1} p_1^{x_1} (1-p_1)^{n-x_1}, which is the Binomial(n,p1)(n, p_1) PMF. The Multinomial is genuinely the kk-category Binomial.
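The k = 2 reduction can be verified directly. A small stdlib-only sketch, with a hypothetical helper `multinomial_pmf` (not a library function):

```python
from math import comb, factorial

def multinomial_pmf(x, p):
    """PMF of Mult(n, p) at the count vector x, with n = sum(x)."""
    n = sum(x)
    coef = factorial(n)
    for xi in x:
        coef //= factorial(xi)     # multinomial coefficient n! / (x1! ... xk!)
    prob = float(coef)
    for xi, pi in zip(x, p):
        prob *= pi ** xi
    return prob

# k = 2: Mult(n, (p1, 1 - p1)) at (x1, n - x1) equals Binomial(n, p1) at x1
n, p1 = 10, 0.3
for x1 in range(n + 1):
    pm = multinomial_pmf([x1, n - x1], [p1, 1 - p1])
    pb = comb(n, x1) * p1**x1 * (1 - p1)**(n - x1)
    assert abs(pm - pb) < 1e-15
```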

Theorem 4 Multinomial Moments

Let XMult(n,p)\mathbf{X} \sim \text{Mult}(n, \mathbf{p}). Then:

  1. E[Xj]=npjE[X_j] = np_j for each j=1,,kj = 1, \ldots, k
  2. Var(Xj)=npj(1pj)\text{Var}(X_j) = np_j(1 - p_j) for each jj — each marginal XjBin(n,pj)X_j \sim \text{Bin}(n, p_j)
  3. Cov(Xi,Xj)=npipj\text{Cov}(X_i, X_j) = -np_ip_j for iji \neq j — the covariance is always negative
Proof [show]

Proof of Part 1 (Mean via indicator variables).

Write each trial outcome as an indicator. For the \ell-th trial (=1,,n\ell = 1, \ldots, n), define the indicator

Zj={1if trial  results in category j0otherwiseZ_{\ell j} = \begin{cases} 1 & \text{if trial } \ell \text{ results in category } j \\ 0 & \text{otherwise} \end{cases}

Then X_j = \sum_{\ell=1}^n Z_{\ell j}: the count for category j is the sum of n independent Bernoulli(p_j) indicators. By linearity of expectation:

E[Xj]=E ⁣[=1nZj]==1nE[Zj]==1npj=npjE[X_j] = E\!\left[\sum_{\ell=1}^n Z_{\ell j}\right] = \sum_{\ell=1}^n E[Z_{\ell j}] = \sum_{\ell=1}^n p_j = np_j

Since each ZjZ_{\ell j} is Bernoulli(pj)(p_j), the marginal Xj=ZjX_j = \sum_\ell Z_{\ell j} is Bin(n,pj)\text{Bin}(n, p_j). From Topic 5, Var(Xj)=npj(1pj)\text{Var}(X_j) = np_j(1 - p_j).

Proof [show]

Proof of Part 3 (Negative covariance via the variance constraint).

This is the most revealing proof, because it shows why the covariance must be negative.

The total count is fixed: j=1kXj=n\sum_{j=1}^k X_j = n with probability 1. A constant has zero variance:

Var ⁣(j=1kXj)=0\text{Var}\!\left(\sum_{j=1}^k X_j\right) = 0

Expanding the variance of the sum using the general formula Var(Xj)=Var(Xj)+2i<jCov(Xi,Xj)\text{Var}(\sum X_j) = \sum \text{Var}(X_j) + 2\sum_{i < j} \text{Cov}(X_i, X_j):

j=1kVar(Xj)+2i<jCov(Xi,Xj)=0\sum_{j=1}^k \text{Var}(X_j) + 2\sum_{i < j} \text{Cov}(X_i, X_j) = 0

Substituting Var(Xj)=npj(1pj)\text{Var}(X_j) = np_j(1-p_j):

j=1knpj(1pj)+2i<jCov(Xi,Xj)=0\sum_{j=1}^k np_j(1-p_j) + 2\sum_{i < j} \text{Cov}(X_i, X_j) = 0

The first sum is njpjnjpj2=nnjpj2n\sum_j p_j - n\sum_j p_j^2 = n - n\sum_j p_j^2. Therefore

2i<jCov(Xi,Xj)=(nnjpj2)=n+njpj22\sum_{i < j} \text{Cov}(X_i, X_j) = -(n - n\sum_j p_j^2) = -n + n\sum_j p_j^2

Now we claim that Cov(Xi,Xj)=npipj\text{Cov}(X_i, X_j) = -np_ip_j for iji \neq j. To verify, we can compute it directly from the indicator representation. For distinct trials m\ell \neq m:

E[ZiZmj]=E[Zi]E[Zmj]=pipjE[Z_{\ell i} Z_{m j}] = E[Z_{\ell i}]E[Z_{m j}] = p_i p_j

since trials are independent. For the same trial \ell:

E[ZiZj]=P(trial  in category i AND category j)=0E[Z_{\ell i} Z_{\ell j}] = P(\text{trial } \ell \text{ in category } i \text{ AND category } j) = 0

since a single trial cannot be in two categories simultaneously. Therefore

E[X_i X_j] = E\!\left[\sum_{\ell=1}^n Z_{\ell i} \sum_{m=1}^n Z_{m j}\right] = \sum_{\ell=1}^n \sum_{m=1}^n E[Z_{\ell i} Z_{m j}]

= \sum_{\ell=1}^n E[Z_{\ell i} Z_{\ell j}] + \sum_{\ell \neq m} E[Z_{\ell i} Z_{m j}] = n \cdot 0 + n(n-1) \cdot p_i p_j = n(n-1)p_i p_j

So

Cov(Xi,Xj)=E[XiXj]E[Xi]E[Xj]=n(n1)pipjn2pipj=npipj\text{Cov}(X_i, X_j) = E[X_i X_j] - E[X_i]E[X_j] = n(n-1)p_ip_j - n^2p_ip_j = -np_ip_j

The covariance is always negative. This makes geometric sense: X1+X2++Xk=nX_1 + X_2 + \cdots + X_k = n is a hard constraint, so if one count goes up, the others must collectively go down. More emails classified as spam means fewer classified as legitimate — the negative covariance captures this competition for a fixed total.

We can verify consistency. The sum 2i<j(npipj)=n2i<jpipj=n[(jpj)2jpj2]=n(1jpj2)2\sum_{i<j}(-np_ip_j) = -n \cdot 2\sum_{i<j}p_ip_j = -n\left[(\sum_j p_j)^2 - \sum_j p_j^2\right] = -n(1 - \sum_j p_j^2), which matches the constraint equation above.
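The moment formulas admit a quick Monte Carlo check (illustrative n, p, and sample size): the sample covariance matrix of Multinomial draws should approach n(diag(p) − pp^T), with every off-diagonal entry negative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, np.array([0.3, 0.2, 0.5])    # illustrative parameters

X = rng.multinomial(n, p, size=200_000)  # each row is one count vector

emp_cov = np.cov(X, rowvar=False)
# Theorem 4 in matrix form: Cov(X) = n (diag(p) - p p^T)
theory = n * (np.diag(p) - np.outer(p, p))
```

For these parameters the theory matrix has diagonal (21, 16, 25) and off-diagonals (−6, −15, −10), which the sample covariance reproduces to within Monte Carlo error.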

Remark 2 Multinomial as Exponential Family Member

The Multinomial is an exponential family member, but the parameterization requires care because of the constraint pj=1\sum p_j = 1. Working with the first k1k - 1 components (since pk=1j=1k1pjp_k = 1 - \sum_{j=1}^{k-1} p_j):

\log P(\mathbf{X} = \mathbf{x}) = \log \binom{n}{x_1, \ldots, x_k} + \sum_{j=1}^{k-1} x_j \log p_j + x_k \log p_k

= \log \binom{n}{x_1, \ldots, x_k} + \sum_{j=1}^{k-1} x_j \log p_j + \left(n - \sum_{j=1}^{k-1} x_j\right) \log p_k

= \log \binom{n}{x_1, \ldots, x_k} + \sum_{j=1}^{k-1} x_j \log\frac{p_j}{p_k} + n\log p_k

The natural parameters are the log-odds ratios ηj=log(pj/pk)\eta_j = \log(p_j / p_k) for j=1,,k1j = 1, \ldots, k-1, and the sufficient statistics are Tj(x)=xjT_j(\mathbf{x}) = x_j. The inverse map gives pj=eηj/(1+m=1k1eηm)p_j = e^{\eta_j} / (1 + \sum_{m=1}^{k-1} e^{\eta_m}) — this is the softmax function. The softmax output layer in a neural network classifier is performing Multinomial exponential family parameterization, just as the sigmoid is performing Bernoulli parameterization.
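The log-odds / softmax correspondence is a round trip that can be sketched in a few lines (the helper names are hypothetical, chosen for this illustration):

```python
import numpy as np

def softmax_from_logodds(eta):
    # p_j = e^{eta_j} / (1 + sum_m e^{eta_m}) for j = 1..k-1, and p_k = 1/(1 + sum)
    e = np.exp(eta)
    denom = 1.0 + e.sum()
    return np.append(e / denom, 1.0 / denom)

def logodds_from_p(p):
    # Natural parameters: eta_j = log(p_j / p_k), log-odds against the last category
    return np.log(p[:-1] / p[-1])

p = np.array([0.3, 0.2, 0.5])
eta = logodds_from_p(p)             # (k-1)-vector of natural parameters
p_back = softmax_from_logodds(eta)  # recovers the original probability vector
```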

Three-panel: Multinomial samples vs expected counts, marginal Binomial overlay, negative covariance scatter
Interactive: Multinomial Distribution Explorer
With n = 20 and p = (1/3, 1/3, 1/3): E[Xⱼ] = npⱼ ≈ 6.67, Var(Xⱼ) = npⱼ(1 − pⱼ) ≈ 4.44, and Cov(Xᵢ, Xⱼ) = −npᵢpⱼ ≈ −2.22 for i ≠ j. Dashed bars mark E[Xⱼ] against individual draws; the scatter of (X₁, X₂) shows the negative sample correlation (≈ −0.49) forced by the fixed-total constraint X₁ + X₂ + X₃ = 20.

8.6 The Dirichlet Distribution

The Multinomial models counts given fixed probabilities. But what if the probabilities themselves are unknown? We need a distribution over probability vectors — a distribution on the simplex Δk1={(p1,,pk):pj0,pj=1}\Delta_{k-1} = \{(p_1, \ldots, p_k) : p_j \geq 0, \sum p_j = 1\}. This is the Dirichlet distribution.

Consider a language model trained on a corpus. The topic proportions of a document — say, 40% sports, 35% politics, 25% science — are a probability vector on the simplex. Different documents have different topic proportions. The Dirichlet distribution models this variability: each document’s topic proportions are a draw from a Dirichlet, and the concentration parameters α\boldsymbol{\alpha} control how spread out or concentrated the draws are.

Definition 7 Dirichlet Distribution

Let α=(α1,α2,,αk)T\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_k)^T with αj>0\alpha_j > 0 for all jj. The random vector P=(P1,P2,,Pk)T\mathbf{P} = (P_1, P_2, \ldots, P_k)^T has the Dirichlet distribution Dir(α)\text{Dir}(\boldsymbol{\alpha}) if its PDF on the simplex Δk1\Delta_{k-1} is

f(pα)=Γ(α0)j=1kΓ(αj)j=1kpjαj1f(\mathbf{p} \mid \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\prod_{j=1}^k \Gamma(\alpha_j)} \prod_{j=1}^k p_j^{\alpha_j - 1}

where α0=j=1kαj\alpha_0 = \sum_{j=1}^k \alpha_j is the concentration sum (also called the total concentration or precision).

The normalizing constant Γ(α0)Γ(αj)\frac{\Gamma(\alpha_0)}{\prod \Gamma(\alpha_j)} is the reciprocal of the multivariate Beta function B(α)=Γ(αj)Γ(α0)B(\boldsymbol{\alpha}) = \frac{\prod \Gamma(\alpha_j)}{\Gamma(\alpha_0)}.

The concentration parameters control the shape:

  • αj=1\alpha_j = 1 for all jj: Dir(1)\text{Dir}(\mathbf{1}) is the uniform distribution on the simplex — all probability vectors are equally likely
  • αj>1\alpha_j > 1 for all jj: probability mass concentrates toward the center of the simplex (probability vectors with similar components)
  • αj<1\alpha_j < 1 for all jj: probability mass concentrates toward the corners and edges (sparse probability vectors with most mass on a few categories)
  • Large α0\alpha_0: draws are tightly concentrated near the mean vector E[P]E[\mathbf{P}]
  • Small α0\alpha_0: draws are highly variable

When k=2k = 2, the Dirichlet reduces to the Beta distribution: Dir(α1,α2)=Beta(α1,α2)\text{Dir}(\alpha_1, \alpha_2) = \text{Beta}(\alpha_1, \alpha_2), since P2=1P1P_2 = 1 - P_1 and f(p1)p1α11(1p1)α21f(p_1) \propto p_1^{\alpha_1-1}(1-p_1)^{\alpha_2-1}.

Theorem 5 Dirichlet Moments

Let PDir(α)\mathbf{P} \sim \text{Dir}(\boldsymbol{\alpha}) with α0=j=1kαj\alpha_0 = \sum_{j=1}^k \alpha_j. Then:

  1. E[Pj]=αjα0E[P_j] = \frac{\alpha_j}{\alpha_0}

  2. Var(Pj)=αj(α0αj)α02(α0+1)\text{Var}(P_j) = \frac{\alpha_j(\alpha_0 - \alpha_j)}{\alpha_0^2(\alpha_0 + 1)}

  3. Cov(Pi,Pj)=αiαjα02(α0+1)\text{Cov}(P_i, P_j) = \frac{-\alpha_i \alpha_j}{\alpha_0^2(\alpha_0 + 1)} for iji \neq j

The mean E[Pj]=αj/α0E[P_j] = \alpha_j / \alpha_0 is the proportion of the total concentration allocated to category jj. The variance decreases as α0\alpha_0 increases — larger total concentration means more precise draws. The covariance is always negative, reflecting the simplex constraint Pj=1\sum P_j = 1 (same structural reason as the Multinomial’s negative covariance).
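A quick empirical check of Theorem 5, with illustrative concentrations and sample size:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([2.0, 3.0, 5.0])   # illustrative concentrations
a0 = alpha.sum()                    # total concentration, a0 = 10

P = rng.dirichlet(alpha, size=200_000)  # each row sums to 1

# Theorem 5 formulas
mean_theory = alpha / a0
var_theory = alpha * (a0 - alpha) / (a0**2 * (a0 + 1))
cov12_theory = -alpha[0] * alpha[1] / (a0**2 * (a0 + 1))

emp_mean = P.mean(axis=0)
emp_var = P.var(axis=0)
emp_cov12 = np.cov(P[:, 0], P[:, 1])[0, 1]
```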

The Dirichlet’s greatest significance in applied statistics and machine learning is its role as the conjugate prior for the Multinomial.

Theorem 6 Dirichlet-Multinomial Conjugacy

If the prior on p\mathbf{p} is Dir(α)\text{Dir}(\boldsymbol{\alpha}) and the likelihood is Mult(n,p)\text{Mult}(n, \mathbf{p}) with observed counts x=(x1,,xk)T\mathbf{x} = (x_1, \ldots, x_k)^T, then the posterior is

pxDir(α+x)\mathbf{p} \mid \mathbf{x} \sim \text{Dir}(\boldsymbol{\alpha} + \mathbf{x})

That is, pxDir(α1+x1,  α2+x2,  ,  αk+xk)\mathbf{p} \mid \mathbf{x} \sim \text{Dir}(\alpha_1 + x_1, \; \alpha_2 + x_2, \; \ldots, \; \alpha_k + x_k).

Proof [show]

By Bayes’ theorem, the posterior density is proportional to the likelihood times the prior:

f(px)P(X=xp)f(p)f(\mathbf{p} \mid \mathbf{x}) \propto P(\mathbf{X} = \mathbf{x} \mid \mathbf{p}) \cdot f(\mathbf{p})

The Multinomial likelihood (keeping only the terms that depend on p\mathbf{p}) is

P(X=xp)j=1kpjxjP(\mathbf{X} = \mathbf{x} \mid \mathbf{p}) \propto \prod_{j=1}^k p_j^{x_j}

The Dirichlet prior density (again, keeping only p\mathbf{p}-dependent terms) is

f(p)j=1kpjαj1f(\mathbf{p}) \propto \prod_{j=1}^k p_j^{\alpha_j - 1}

Multiplying the likelihood and the prior:

f(px)j=1kpjxjj=1kpjαj1=j=1kpj(αj+xj)1f(\mathbf{p} \mid \mathbf{x}) \propto \prod_{j=1}^k p_j^{x_j} \cdot \prod_{j=1}^k p_j^{\alpha_j - 1} = \prod_{j=1}^k p_j^{(\alpha_j + x_j) - 1}

This is the kernel of a Dir(α+x)\text{Dir}(\boldsymbol{\alpha} + \mathbf{x}) density. Since f(px)f(\mathbf{p} \mid \mathbf{x}) must integrate to 1 over the simplex, the normalizing constant must be B(α+x)1=Γ(α0+n)j=1kΓ(αj+xj)B(\boldsymbol{\alpha} + \mathbf{x})^{-1} = \frac{\Gamma(\alpha_0 + n)}{\prod_{j=1}^k \Gamma(\alpha_j + x_j)}, confirming the posterior is Dir(α+x)\text{Dir}(\boldsymbol{\alpha} + \mathbf{x}).

The update rule is additive: each prior concentration αj\alpha_j is incremented by the observed count xjx_j. The prior concentrations αj\alpha_j act as pseudo-counts — they represent the “data” implied by the prior. The total pseudo-sample size is α0\alpha_0, and after observing nn real data points, the effective sample size is α0+n\alpha_0 + n.

The posterior mean for category jj is

E[Pjx]=αj+xjα0+nE[P_j \mid \mathbf{x}] = \frac{\alpha_j + x_j}{\alpha_0 + n}

This is a weighted average of the prior mean αj/α0\alpha_j / \alpha_0 and the MLE xj/nx_j / n:

E[Pjx]=α0α0+nαjα0+nα0+nxjnE[P_j \mid \mathbf{x}] = \frac{\alpha_0}{\alpha_0 + n} \cdot \frac{\alpha_j}{\alpha_0} + \frac{n}{\alpha_0 + n} \cdot \frac{x_j}{n}

As nn \to \infty, the posterior mean converges to the MLE — the data overwhelms the prior, regardless of the choice of α\boldsymbol{\alpha}.

Example 2 Conjugate Updating with Observed Data

Suppose we are estimating the topic proportions for a corpus with k=3k = 3 topics: sports, politics, and science. We start with a weakly informative prior Dir(2,2,2)\text{Dir}(2, 2, 2) — slight preference for uniform proportions, with pseudo-sample size α0=6\alpha_0 = 6.

We classify n=30n = 30 documents and observe counts x=(12,10,8)T\mathbf{x} = (12, 10, 8)^T.

Prior: Dir(2,2,2)\text{Dir}(2, 2, 2) with prior means (1/3,1/3,1/3)(1/3, 1/3, 1/3).

Posterior: Dir(2+12,2+10,2+8)=Dir(14,12,10)\text{Dir}(2 + 12, 2 + 10, 2 + 8) = \text{Dir}(14, 12, 10) with posterior means:

E[P1x]=14360.389,E[P2x]=12360.333,E[P3x]=10360.278E[P_1 \mid \mathbf{x}] = \frac{14}{36} \approx 0.389, \quad E[P_2 \mid \mathbf{x}] = \frac{12}{36} \approx 0.333, \quad E[P_3 \mid \mathbf{x}] = \frac{10}{36} \approx 0.278

MLE: p^j=xj/n\hat{p}_j = x_j / n, so (p^1,p^2,p^3)=(0.400,0.333,0.267)(\hat{p}_1, \hat{p}_2, \hat{p}_3) = (0.400, 0.333, 0.267).

The posterior means are pulled slightly toward the prior mean of 1/31/3 relative to the MLE — the Bayesian shrinkage effect. The shrinkage is mild because n=30n = 30 is much larger than α0=6\alpha_0 = 6. If we had used α0=300\alpha_0 = 300 (a very strong prior), the posterior means would be much closer to 1/31/3.

Now observe 20 more documents with counts (4,8,8)(4, 8, 8). The posterior updates again:

Dir(14+4,12+8,10+8)=Dir(18,20,18)\text{Dir}(14 + 4, 12 + 8, 10 + 8) = \text{Dir}(18, 20, 18)

with posterior means (18/56,20/56,18/56)(0.321,0.357,0.321)(18/56, 20/56, 18/56) \approx (0.321, 0.357, 0.321). The second batch shifted the estimate toward politics. Sequential Bayesian updating works by using yesterday’s posterior as today’s prior — the algebra is the same at each step.
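The updating in Example 2 is a one-line computation at each step; a sketch reproducing its numbers:

```python
import numpy as np

alpha = np.array([2.0, 2.0, 2.0])   # prior Dir(2, 2, 2), a0 = 6

x1 = np.array([12, 10, 8])          # first batch: 30 documents
alpha = alpha + x1                  # posterior Dir(14, 12, 10)
post_mean_1 = alpha / alpha.sum()   # (14/36, 12/36, 10/36)

x2 = np.array([4, 8, 8])            # second batch: 20 documents
alpha = alpha + x2                  # posterior Dir(18, 20, 18)
post_mean_2 = alpha / alpha.sum()   # (18/56, 20/56, 18/56)
```

Sequential updating is literally repeated vector addition — yesterday's posterior concentrations become today's prior concentrations.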

Three-panel: symmetric Dirichlet samples, asymmetric Dirichlet, conjugate updating with prior and posterior

The Dirichlet-Multinomial conjugacy is the backbone of Latent Dirichlet Allocation (LDA), the foundational topic model in NLP. In LDA, each document’s topic proportions are drawn from a \text{Dir}(\boldsymbol{\alpha}), and the word counts for each topic follow a Multinomial. The conjugacy makes the posterior over topic proportions tractable, enabling variational inference and collapsed Gibbs sampling. See Topic 26 §26.3 Remark 10 for collapsed Gibbs and formalML: for variational inference.

Interactive: Dirichlet Simplex Explorer
Samples from Dir(α) plotted on the simplex with the mean E[P] marked. With α = (1, 1, 1), so α₀ = 3: E[Pⱼ] = 1/3 and Var(Pⱼ) ≈ 0.0556 for each j — moderate concentration, broad coverage of the simplex.

8.7 Covariance Matrices: Geometry and Structure

The covariance matrix Σ\boldsymbol{\Sigma} encodes the geometry of a multivariate distribution. Its diagonal entries are variances; its off-diagonal entries are covariances. But Σ\boldsymbol{\Sigma} is more than a table of numbers — it defines an inner product, a notion of distance, and a set of principal axes. Understanding this geometry is essential for PCA, Gaussian discriminant analysis, and the interpretation of MVN contour plots.

Definition 8 Covariance Matrix

The covariance matrix of a random vector X=(X1,,Xp)T\mathbf{X} = (X_1, \ldots, X_p)^T with mean μ=E[X]\boldsymbol{\mu} = E[\mathbf{X}] is the p×pp \times p matrix

Σ=E ⁣[(Xμ)(Xμ)T]\boldsymbol{\Sigma} = E\!\left[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T\right]

The (i,j)(i, j) entry of Σ\boldsymbol{\Sigma} is Σij=Cov(Xi,Xj)=E[(Xiμi)(Xjμj)]\Sigma_{ij} = \text{Cov}(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)]. The diagonal entries Σjj=Var(Xj)\Sigma_{jj} = \text{Var}(X_j) are the variances of the individual components.

Equivalently, Σ=E[XXT]μμT\boldsymbol{\Sigma} = E[\mathbf{X}\mathbf{X}^T] - \boldsymbol{\mu}\boldsymbol{\mu}^T, which is the matrix analog of Var(X)=E[X2](E[X])2\text{Var}(X) = E[X^2] - (E[X])^2.

Definition 9 Mahalanobis Distance

The Mahalanobis distance from a point x\mathbf{x} to a distribution with mean μ\boldsymbol{\mu} and covariance Σ\boldsymbol{\Sigma} is

dM(x,μ)=(xμ)TΣ1(xμ)d_M(\mathbf{x}, \boldsymbol{\mu}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}

and the squared Mahalanobis distance is

dM2(x,μ)=(xμ)TΣ1(xμ)d_M^2(\mathbf{x}, \boldsymbol{\mu}) = (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})

Unlike Euclidean distance, Mahalanobis distance accounts for the scale and correlation of the variables. Two points that are equidistant from μ\boldsymbol{\mu} in Euclidean terms may be very different in Mahalanobis terms if they lie in different directions relative to the covariance structure.

For XNp(μ,Σ)\mathbf{X} \sim \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}), the squared Mahalanobis distance dM2(X,μ)χp2d_M^2(\mathbf{X}, \boldsymbol{\mu}) \sim \chi^2_p. This is why MVN contours (sets of constant dM2d_M^2) are used for hypothesis testing and outlier detection: a point with dM2>χp,0.952d_M^2 > \chi^2_{p, 0.95} is an outlier at the 5% level.
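A minimal outlier check using the χ²ₚ calibration (SciPy assumed; the covariance and test point are illustrative). The point below is only about 2.1 units from the mean in Euclidean terms, yet far in Mahalanobis terms, because it lies against the correlation structure:

```python
import numpy as np
from scipy.stats import chi2

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])     # strong positive correlation

# A point along the minor axis: x1 and x2 disagree in sign,
# which the correlation structure makes very unlikely.
x = np.array([1.5, -1.5])
d2 = (x - mu) @ np.linalg.solve(Sigma, x - mu)  # squared Mahalanobis distance

threshold = chi2.ppf(0.95, df=2)  # 95th percentile of chi-squared with p = 2
is_outlier = d2 > threshold
```

Here d² = 22.5 while the threshold is about 5.99, so the point is flagged even though its Euclidean distance is modest.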

Theorem 7 Properties of Covariance Matrices

The covariance matrix Σ\boldsymbol{\Sigma} of any random vector X\mathbf{X} has the following properties:

  1. Symmetry. Σ=ΣT\boldsymbol{\Sigma} = \boldsymbol{\Sigma}^T, since Cov(Xi,Xj)=Cov(Xj,Xi)\text{Cov}(X_i, X_j) = \text{Cov}(X_j, X_i).

  2. Positive semi-definiteness. For any vector aRp\mathbf{a} \in \mathbb{R}^p, aTΣa0\mathbf{a}^T \boldsymbol{\Sigma} \mathbf{a} \geq 0.

  3. Spectral decomposition. Since Σ\boldsymbol{\Sigma} is symmetric and PSD, it has a spectral decomposition Σ=QΛQT\boldsymbol{\Sigma} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T, where Q\mathbf{Q} is an orthogonal matrix of eigenvectors and Λ=diag(λ1,,λp)\boldsymbol{\Lambda} = \text{diag}(\lambda_1, \ldots, \lambda_p) is a diagonal matrix of non-negative eigenvalues.

The eigenvectors q1,,qp\mathbf{q}_1, \ldots, \mathbf{q}_p are the principal axes of the distribution — the directions of maximum and minimum variance. The eigenvalues λ1λ2λp0\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0 are the variances along each principal axis. For the MVN, the ellipsoidal contours have semi-axes along qj\mathbf{q}_j with lengths proportional to λj\sqrt{\lambda_j}.
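The spectral decomposition can be computed with a symmetric eigensolver; a short sketch with an illustrative 2×2 covariance:

```python
import numpy as np

Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])   # illustrative PSD covariance

# eigh is the symmetric eigensolver: eigenvalues ascending,
# orthonormal eigenvectors in the columns of Q
lam, Q = np.linalg.eigh(Sigma)

# Reconstruction: Sigma = Q diag(lam) Q^T
recon = Q @ np.diag(lam) @ Q.T
# Semi-axis lengths of the 1-sigma MVN contour, along the columns of Q
axes = np.sqrt(lam)
```

For this Σ the eigenvalues are exactly 0.2 and 2.8, so the contour ellipse is elongated by a factor of √(2.8/0.2) ≈ 3.74 along the leading eigenvector.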

Proof [show]

Proof of positive semi-definiteness.

For any fixed vector aRp\mathbf{a} \in \mathbb{R}^p:

aTΣa=aTE ⁣[(Xμ)(Xμ)T]a\mathbf{a}^T \boldsymbol{\Sigma} \mathbf{a} = \mathbf{a}^T E\!\left[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T\right] \mathbf{a}

Since a\mathbf{a} is a constant vector, we can move it inside the expectation:

=E ⁣[aT(Xμ)(Xμ)Ta]= E\!\left[\mathbf{a}^T (\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T \mathbf{a}\right]

Let Y=aT(Xμ)Y = \mathbf{a}^T(\mathbf{X} - \boldsymbol{\mu}). This is a scalar random variable (a linear combination of the components of Xμ\mathbf{X} - \boldsymbol{\mu}). The expression becomes:

=E[Y2]= E[Y^2]

Since Y20Y^2 \geq 0 for every outcome, E[Y2]0E[Y^2] \geq 0. Therefore aTΣa0\mathbf{a}^T \boldsymbol{\Sigma} \mathbf{a} \geq 0 for all aRp\mathbf{a} \in \mathbb{R}^p.

Moreover, aTΣa=0\mathbf{a}^T \boldsymbol{\Sigma} \mathbf{a} = 0 if and only if E[Y2]=0E[Y^2] = 0, which happens if and only if Y=aT(Xμ)=0Y = \mathbf{a}^T(\mathbf{X} - \boldsymbol{\mu}) = 0 with probability 1. This means Σ\boldsymbol{\Sigma} is positive definite (not just semi-definite) when no non-trivial linear combination of the components is degenerate — which is the generic case, and the case assumed for the MVN with invertible Σ\boldsymbol{\Sigma}.

Remark 3 Wishart and Inverse-Wishart Distributions

When the covariance matrix Σ\boldsymbol{\Sigma} itself is unknown and we want to perform Bayesian inference, we need a prior distribution on positive definite matrices. The Wishart distribution Wp(V,n)\mathcal{W}_p(\mathbf{V}, n) is the multivariate generalization of the Chi-squared distribution — if X1,,Xn\mathbf{X}_1, \ldots, \mathbf{X}_n are i.i.d. Np(0,Σ)\mathcal{N}_p(\mathbf{0}, \boldsymbol{\Sigma}), then i=1nXiXiTWp(Σ,n)\sum_{i=1}^n \mathbf{X}_i \mathbf{X}_i^T \sim \mathcal{W}_p(\boldsymbol{\Sigma}, n).

The Inverse-Wishart distribution Wp1(Ψ,ν)\mathcal{W}_p^{-1}(\boldsymbol{\Psi}, \nu) is the distribution of the inverse of a Wishart-distributed matrix, and it serves as the conjugate prior for the covariance matrix of a multivariate Normal. If we observe data from Np(μ,Σ)\mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) with an Inverse-Wishart prior on Σ\boldsymbol{\Sigma}, the posterior on Σ\boldsymbol{\Sigma} is also Inverse-Wishart with updated parameters.

The full development of Wishart-based inference, including the Normal-Inverse-Wishart conjugate family for joint inference on (μ,Σ)(\boldsymbol{\mu}, \boldsymbol{\Sigma}), belongs to Bayesian Computation.

Three-panel: eigenvectors as principal axes, Mahalanobis vs Euclidean distance ellipses, chi-squared calibration

The spectral decomposition Σ=QΛQT\boldsymbol{\Sigma} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T is the mathematical core of Principal Component Analysis. The eigenvectors qj\mathbf{q}_j are the principal directions, and the eigenvalues λj\lambda_j are the explained variances. PCA projects data onto the top eigenvectors — the directions of maximum variance — to reduce dimensionality while preserving as much information as possible. See formalML: for the full treatment, including the connection between the population covariance eigendecomposition (developed here) and the sample covariance eigendecomposition (used in practice).


8.8 Multivariate Transformations

In one dimension, if Y=g(X)Y = g(X) and gg is invertible with g1g^{-1} differentiable, then fY(y)=fX(g1(y))(g1)(y)f_Y(y) = f_X(g^{-1}(y)) \cdot |{(g^{-1})}'(y)|. The multivariate generalization replaces the absolute derivative with the absolute value of the Jacobian determinant — the natural measure of how a transformation stretches or compresses volume in Rp\mathbb{R}^p. This requires the multivariable change of variables formula from formalCalculus: .

Theorem 8 Multivariate Change of Variables

Let X\mathbf{X} be a continuous random vector in Rp\mathbb{R}^p with PDF fXf_{\mathbf{X}}. Let g:RpRp\mathbf{g}: \mathbb{R}^p \to \mathbb{R}^p be an invertible, continuously differentiable transformation with inverse g1\mathbf{g}^{-1}. If Y=g(X)\mathbf{Y} = \mathbf{g}(\mathbf{X}), then the PDF of Y\mathbf{Y} is

fY(y)=fX(g1(y))detJg1(y)f_{\mathbf{Y}}(\mathbf{y}) = f_{\mathbf{X}}(\mathbf{g}^{-1}(\mathbf{y})) \cdot \left|\det \mathbf{J}_{\mathbf{g}^{-1}}(\mathbf{y})\right|

where Jg1(y)\mathbf{J}_{\mathbf{g}^{-1}}(\mathbf{y}) is the p×pp \times p Jacobian matrix of g1\mathbf{g}^{-1} evaluated at y\mathbf{y}, with entries [Jg1]ij=[g1]iyj[\mathbf{J}_{\mathbf{g}^{-1}}]_{ij} = \frac{\partial [\mathbf{g}^{-1}]_i}{\partial y_j}.

The Jacobian determinant detJ|\det \mathbf{J}| measures the local volume distortion: if the transformation locally stretches volumes by a factor of cc, then densities are locally compressed by a factor of 1/c1/c to preserve total probability.

The most important application of the change of variables formula for ML is Cholesky sampling — the method used to generate samples from an arbitrary MVN using only standard Normal samples.

Proof [show]

Cholesky sampling produces the correct distribution.

Let ZNp(0,I)\mathbf{Z} \sim \mathcal{N}_p(\mathbf{0}, \mathbf{I}) (a standard multivariate Normal with independent components). Let Σ=LLT\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^T be the Cholesky decomposition of Σ\boldsymbol{\Sigma}, where L\mathbf{L} is a lower triangular matrix with positive diagonal entries (which exists because Σ\boldsymbol{\Sigma} is positive definite). Define the affine transformation

X=μ+LZ\mathbf{X} = \boldsymbol{\mu} + \mathbf{L}\mathbf{Z}

We claim that XNp(μ,Σ)\mathbf{X} \sim \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}).

Mean. E[X]=μ+LE[Z]=μ+L0=μE[\mathbf{X}] = \boldsymbol{\mu} + \mathbf{L} E[\mathbf{Z}] = \boldsymbol{\mu} + \mathbf{L} \cdot \mathbf{0} = \boldsymbol{\mu}.

Covariance. Cov(X)=Cov(LZ)=LCov(Z)LT=LILT=LLT=Σ\text{Cov}(\mathbf{X}) = \text{Cov}(\mathbf{L}\mathbf{Z}) = \mathbf{L} \, \text{Cov}(\mathbf{Z}) \, \mathbf{L}^T = \mathbf{L}\mathbf{I}\mathbf{L}^T = \mathbf{L}\mathbf{L}^T = \boldsymbol{\Sigma}.

Normality. By Property 2 of Theorem 2, any affine transformation of a multivariate Normal is multivariate Normal. Since ZNp(0,I)\mathbf{Z} \sim \mathcal{N}_p(\mathbf{0}, \mathbf{I}) and X=LZ+μ\mathbf{X} = \mathbf{L}\mathbf{Z} + \boldsymbol{\mu}, we have XNp(μ,Σ)\mathbf{X} \sim \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}).

Alternatively, via the change of variables formula. The transformation is g(z)=μ+Lz\mathbf{g}(\mathbf{z}) = \boldsymbol{\mu} + \mathbf{L}\mathbf{z}, with inverse g1(x)=L1(xμ)\mathbf{g}^{-1}(\mathbf{x}) = \mathbf{L}^{-1}(\mathbf{x} - \boldsymbol{\mu}). The Jacobian of the inverse is Jg1=L1\mathbf{J}_{\mathbf{g}^{-1}} = \mathbf{L}^{-1}, so

detJg1=detL1=detL1=Σ1/2|\det \mathbf{J}_{\mathbf{g}^{-1}}| = |\det \mathbf{L}^{-1}| = |\det \mathbf{L}|^{-1} = |\boldsymbol{\Sigma}|^{-1/2}

since detL2=det(LLT)=Σ|\det \mathbf{L}|^2 = \det(\mathbf{L}\mathbf{L}^T) = |\boldsymbol{\Sigma}|. Substituting into the change of variables formula:

f_{\mathbf{X}}(\mathbf{x}) = f_{\mathbf{Z}}(\mathbf{L}^{-1}(\mathbf{x} - \boldsymbol{\mu})) \cdot |\boldsymbol{\Sigma}|^{-1/2}

= (2\pi)^{-p/2} \exp\!\left(-\frac{1}{2}(\mathbf{L}^{-1}(\mathbf{x} - \boldsymbol{\mu}))^T (\mathbf{L}^{-1}(\mathbf{x} - \boldsymbol{\mu}))\right) \cdot |\boldsymbol{\Sigma}|^{-1/2}

= (2\pi)^{-p/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\!\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T (\mathbf{L}^{-1})^T \mathbf{L}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)

= (2\pi)^{-p/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\!\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T (\mathbf{L}\mathbf{L}^T)^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)

= (2\pi)^{-p/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\!\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)

This is the PDF of Np(μ,Σ)\mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}), confirming that the Cholesky sampling procedure produces the correct distribution.
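The derivation above is easy to check numerically. A minimal NumPy sketch (the 3-dimensional values of μ and Σ below are illustrative, not from the text): draw standard Normal vectors, apply x = μ + Lz, and confirm that the sample mean and covariance approach μ and Σ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target parameters: a 3-dimensional MVN (illustrative values).
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 0.5]])

L = np.linalg.cholesky(Sigma)          # Sigma = L @ L.T, L lower-triangular

n = 200_000
Z = rng.standard_normal((n, 3))        # rows are iid N(0, I) draws
X = mu + Z @ L.T                       # x = mu + L z, vectorized over rows

# The sample mean and covariance should approach mu and Sigma.
print(X.mean(axis=0))
print(np.cov(X, rowvar=False))
```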

The Cholesky sampling formula x=μ+Lz\mathbf{x} = \boldsymbol{\mu} + \mathbf{L}\mathbf{z} is not just a computational convenience — it is the mathematical foundation of the reparameterization trick used in variational autoencoders (VAEs). In a VAE, the encoder outputs μ\boldsymbol{\mu} and L\mathbf{L} (or a diagonal approximation), and a latent sample from N(μ,Σ)\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) must be drawn to feed the decoder. Sampling directly would block gradient flow, but writing x=μ+Lε\mathbf{x} = \boldsymbol{\mu} + \mathbf{L}\boldsymbol{\varepsilon} with εN(0,I)\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) makes the sampling deterministic given ε\boldsymbol{\varepsilon}, allowing backpropagation through the sampling step. See formalML: for the full story.
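The gradient-flow claim can be illustrated without an autodiff framework. A toy 1-D sketch (all numbers illustrative): for f(x) = x² with x ~ N(μ, σ²) we know E[f] = μ² + σ², so ∂E[f]/∂μ = 2μ, and the pathwise estimator, which averages f′(μ + σε) over draws of ε, recovers exactly this value.

```python
import numpy as np

rng = np.random.default_rng(6)

# Pathwise (reparameterization) gradient for a 1-D toy problem:
# f(x) = x^2 with x ~ N(mu, sigma^2), so E[f] = mu^2 + sigma^2
# and the exact gradient is dE[f]/dmu = 2*mu.
mu, sigma, n = 1.5, 0.8, 200_000

eps = rng.standard_normal(n)     # noise sampled once, independent of mu
x = mu + sigma * eps             # deterministic in mu given eps
grad_est = np.mean(2 * x)        # pathwise estimator: average of f'(x) * dx/dmu

print(grad_est)                  # close to 2 * mu = 3.0
```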

Remark 4 Normalizing Flows and the Jacobian Determinant

The change of variables formula from Theorem 8 is the core mathematical tool behind normalizing flows. A normalizing flow models a complex target distribution fYf_{\mathbf{Y}} by learning an invertible transformation g\mathbf{g} from a simple base distribution (typically ZNp(0,I)\mathbf{Z} \sim \mathcal{N}_p(\mathbf{0}, \mathbf{I})) to the target:

fY(y)=fZ(g1(y))detJg1(y)f_{\mathbf{Y}}(\mathbf{y}) = f_{\mathbf{Z}}(\mathbf{g}^{-1}(\mathbf{y})) \cdot |\det \mathbf{J}_{\mathbf{g}^{-1}}(\mathbf{y})|

The entire architecture of a normalizing flow — RealNVP, Glow, Neural Spline Flows — is designed to ensure that (1) g\mathbf{g} is invertible, (2) the Jacobian determinant is efficiently computable, and (3) the transformation is flexible enough to model complex densities. The constraint of efficient Jacobian computation drives the use of triangular Jacobians (for which detJ=iJii\det \mathbf{J} = \prod_i J_{ii}, computed in O(p)O(p) rather than O(p3)O(p^3)). See formalML: for the full treatment.
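A minimal sketch of this Jacobian bookkeeping, using a single affine layer with a lower-triangular matrix A (an assumed toy flow, not RealNVP or Glow): the log-determinant term reduces to a sum over the diagonal of A, and the resulting density can be cross-checked against the closed-form N(b, AAᵀ) that this affine map induces.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
p = 4

# An invertible affine "flow" y = g(z) = A z + b with lower-triangular A,
# so log|det J| is just a sum of log-diagonals: O(p), not O(p^3).
A = np.tril(rng.standard_normal((p, p)))
np.fill_diagonal(A, np.abs(np.diag(A)) + 0.5)   # keep the diagonal away from 0
b = rng.standard_normal(p)

def log_density(y):
    """Change of variables: log f_Y(y) = log f_Z(g^{-1}(y)) + log|det J_{g^{-1}}|."""
    z = np.linalg.solve(A, y - b)                        # g^{-1}(y)
    log_base = multivariate_normal(np.zeros(p), np.eye(p)).logpdf(z)
    log_det_J_inv = -np.sum(np.log(np.abs(np.diag(A))))  # |det A^{-1}| = 1/|det A|
    return log_base + log_det_J_inv

# Cross-check: y = A z + b with z ~ N(0, I) is exactly N(b, A A^T).
y = rng.standard_normal(p)
exact = multivariate_normal(b, A @ A.T).logpdf(y)
print(log_density(y), exact)     # the two values agree
```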

Three-panel: Cholesky sampling from standard Normal to MVN, whitening transform, normalizing flows Jacobian diagram

8.9 Copulas: Separating Marginals from Dependence

Here is a question that sounds simple but has deep consequences: can two random vectors have the same marginal distributions but different joint distributions?

Yes, absolutely. Topic 3 flagged this point: the marginals alone do not determine the joint. Two variables XX and YY can both be Exponential(1), but their joint distribution could show positive dependence, negative dependence, tail dependence, or no dependence at all — all while preserving the same marginals. The structure that separates marginals from dependence is called a copula.

As Topic 3 promised: “Specifying the dependence structure is the job of copulas.” We now deliver on that promise.

Definition 10 Copula

A copula is a joint CDF C:[0,1]p[0,1]C: [0, 1]^p \to [0, 1] whose marginal distributions are all Uniform(0,1)\text{Uniform}(0, 1).

That is, C(u1,,up)C(u_1, \ldots, u_p) is a valid joint CDF on [0,1]p[0,1]^p satisfying:

  1. C(u1,,uj1,0,uj+1,,up)=0C(u_1, \ldots, u_{j-1}, 0, u_{j+1}, \ldots, u_p) = 0 for any jj (if any argument is 0, the probability is 0)
  2. C(1,,1,uj,1,,1)=ujC(1, \ldots, 1, u_j, 1, \ldots, 1) = u_j for each jj (each marginal is Uniform(0,1))
  3. CC is non-decreasing in each argument and satisfies the rectangle inequality (ensuring probabilities of rectangles are non-negative)

The power of copulas comes from the following fundamental theorem, which says that any joint distribution can be decomposed into marginals plus a copula.

Theorem 9 Sklar's Theorem

Let FF be a joint CDF of a random vector X=(X1,,Xp)T\mathbf{X} = (X_1, \ldots, X_p)^T with marginal CDFs F1,,FpF_1, \ldots, F_p. Then there exists a copula C:[0,1]p[0,1]C: [0, 1]^p \to [0, 1] such that for all (x1,,xp)Rp(x_1, \ldots, x_p) \in \mathbb{R}^p,

F(x1,,xp)=C(F1(x1),,Fp(xp))F(x_1, \ldots, x_p) = C(F_1(x_1), \ldots, F_p(x_p))

If F1,,FpF_1, \ldots, F_p are all continuous, then CC is unique. Conversely, for any copula CC and any marginal CDFs F1,,FpF_1, \ldots, F_p, the function F(x1,,xp)=C(F1(x1),,Fp(xp))F(x_1, \ldots, x_p) = C(F_1(x_1), \ldots, F_p(x_p)) is a valid joint CDF with marginals F1,,FpF_1, \ldots, F_p.

The proof of Sklar’s theorem requires measure-theoretic machinery that is beyond the scope of this topic. We state it without proof and focus on its consequences.

Sklar’s theorem is a decomposition result: the joint CDF FF decomposes into marginals (F1,,FpF_1, \ldots, F_p, which determine the individual behavior of each variable) and a copula (CC, which determines the dependence structure). This decomposition is modular: we can change the marginals while keeping the copula fixed, or change the copula while keeping the marginals fixed.

Three important copula families:

Independence copula. C(u1,,up)=u1u2up=j=1pujC_{\perp}(u_1, \ldots, u_p) = u_1 \cdot u_2 \cdots u_p = \prod_{j=1}^p u_j. This copula encodes independence: the joint CDF is the product of the marginal CDFs. No dependence structure at all.

Gaussian copula. Let R\mathbf{R} be a p×pp \times p correlation matrix and Φ\Phi the standard Normal CDF. The Gaussian copula is

CRGauss(u1,,up)=ΦR ⁣(Φ1(u1),,Φ1(up))C_R^{\text{Gauss}}(u_1, \ldots, u_p) = \Phi_{\mathbf{R}}\!\left(\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_p)\right)

where ΦR\Phi_{\mathbf{R}} is the joint CDF of Np(0,R)\mathcal{N}_p(\mathbf{0}, \mathbf{R}). The Gaussian copula applies the Normal quantile transform to each margin, computes the multivariate Normal joint CDF, and maps back. It inherits the correlation structure of the MVN but can be paired with any marginals — not just Normal ones.
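Sampling from a Gaussian copula takes three steps: draw from N(0, R), push each coordinate through Φ to get Uniform(0, 1) margins, then apply any inverse marginal CDFs. A SciPy sketch with assumed illustrative choices (ρ = 0.7; Exponential(1) and Beta(2, 5) marginals):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Correlation matrix of the Gaussian copula (illustrative value).
rho = 0.7
R = np.array([[1.0, rho], [rho, 1.0]])

n = 100_000
# Step 1-2: sample N(0, R), push through the Normal CDF -> Uniform(0,1) margins.
Z = rng.multivariate_normal(np.zeros(2), R, size=n)
U = stats.norm.cdf(Z)            # Gaussian dependence, uniform margins

# Step 3: apply any inverse marginal CDFs.
X = stats.expon.ppf(U[:, 0])     # Exponential(1) marginal
Y = stats.beta.ppf(U[:, 1], a=2, b=5)   # Beta(2, 5) marginal

# Marginals are exactly Exponential(1) and Beta(2, 5);
# the dependence between X and Y comes entirely from the copula.
print(X.mean(), Y.mean())        # near 1.0 and 2/7
```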

Student-tt copula. Same construction as the Gaussian copula, but using the multivariate Student-tt distribution with ν\nu degrees of freedom instead of the Normal. The critical difference: the Student-tt copula exhibits tail dependence — extreme values in one variable are associated with extreme values in another.

Remark 5 Tail Dependence: Gaussian vs. Student-t Copulas

Two variables with a Gaussian copula can have strong overall dependence (high ρ\rho), but their tail dependence is exactly zero. This means that extreme events — both variables being in their top 1%, say — are no more likely than independent extreme events, conditional on at least one being extreme. This property caused serious problems in financial risk modeling during the 2008 financial crisis: Gaussian copula models drastically underestimated the probability of joint extreme losses.

The Student-tt copula with finite ν\nu has positive tail dependence that increases as ν\nu decreases. With ν=3\nu = 3, extreme co-movement is far more likely than the Gaussian copula predicts. When modeling correlated risks (portfolio losses, simultaneous system failures, climate extremes), the choice between Gaussian and Student-tt copulas has real consequences for tail risk estimation.

The coefficient of (upper) tail dependence is defined as λU=limu1P(U1>uU2>u)\lambda_U = \lim_{u \to 1^-} P(U_1 > u \mid U_2 > u), where (U1,U2)(U_1, U_2) follows the copula. For the Gaussian copula with ρ<1|\rho| < 1, λU=0\lambda_U = 0. For the Student-tt copula with correlation ρ\rho and ν\nu degrees of freedom, λU>0\lambda_U > 0 whenever ρ>1\rho > -1.
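A Monte Carlo check of this contrast, estimating P(U₁ > u | U₂ > u) at u = 0.99 for both copulas (ρ = 0.7 and ν = 3 are illustrative choices; uses `scipy.stats.multivariate_t`, available in SciPy 1.6+):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
rho, nu, n, u = 0.7, 3, 400_000, 0.99
R = np.array([[1.0, rho], [rho, 1.0]])

def joint_tail_prob(U, u):
    """Monte Carlo estimate of P(U1 > u | U2 > u)."""
    both = np.mean((U[:, 0] > u) & (U[:, 1] > u))
    return both / np.mean(U[:, 1] > u)

# Gaussian copula uniforms.
Zg = rng.multivariate_normal(np.zeros(2), R, size=n)
Ug = stats.norm.cdf(Zg)

# Student-t copula uniforms (heavier joint tails).
Zt = stats.multivariate_t(loc=np.zeros(2), shape=R, df=nu).rvs(size=n, random_state=rng)
Ut = stats.t.cdf(Zt, df=nu)

print(joint_tail_prob(Ug, u))   # small: Gaussian copula has lambda_U = 0
print(joint_tail_prob(Ut, u))   # noticeably larger: t copula has lambda_U > 0
```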

Three-panel: same marginals with independence copula, Gaussian copula, and Student-t copula showing tail dependence
Interactive: Copula Explorer — Sklar’s Theorem
Exponential(1) and Beta(2, 5) marginal PDFs joined by a Gaussian copula (500 samples; Pearson ρ ≈ 0.505, Spearman ρs ≈ 0.515).

8.10 Connections to ML

The multivariate distributions developed in this topic are the computational backbone of modern machine learning. This section traces five concrete connections, starting with the most important.

Example 3 GP Prediction as Conditional MVN

Consider a Gaussian process fGP(m(),k(,))f \sim \mathcal{GP}(m(\cdot), k(\cdot, \cdot)) with mean function mm and kernel kk. We observe function values fobs=(f(x1),,f(xn))T\mathbf{f}_{\text{obs}} = (f(x_1), \ldots, f(x_n))^T at training inputs x1,,xnx_1, \ldots, x_n and want to predict function values f=(f(x(1)),,f(x(m)))T\mathbf{f}_* = (f(x_*^{(1)}), \ldots, f(x_*^{(m)}))^T at new inputs.

By the definition of a GP, any finite collection of function values is jointly multivariate Normal. Stacking observed and predicted values:

(ffobs)N ⁣((mmobs),(KKobsKobsKobs,obs))\begin{pmatrix} \mathbf{f}_* \\ \mathbf{f}_{\text{obs}} \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mathbf{m}_* \\ \mathbf{m}_{\text{obs}} \end{pmatrix}, \begin{pmatrix} \mathbf{K}_{**} & \mathbf{K}_{*\text{obs}} \\ \mathbf{K}_{\text{obs}*} & \mathbf{K}_{\text{obs,obs}} \end{pmatrix} \right)

where [K]ij=k(x(i),x(j))[\mathbf{K}_{**}]_{ij} = k(x_*^{(i)}, x_*^{(j)}), [Kobs]ij=k(x(i),xj)[\mathbf{K}_{*\text{obs}}]_{ij} = k(x_*^{(i)}, x_j), and [Kobs,obs]ij=k(xi,xj)[\mathbf{K}_{\text{obs,obs}}]_{ij} = k(x_i, x_j).

Applying the conditional MVN formula from Theorem 3 with X1=f\mathbf{X}_1 = \mathbf{f}_*, X2=fobs\mathbf{X}_2 = \mathbf{f}_{\text{obs}}:

ffobsN(μobs,Σobs)\mathbf{f}_* \mid \mathbf{f}_{\text{obs}} \sim \mathcal{N}(\boldsymbol{\mu}_{*|\text{obs}}, \boldsymbol{\Sigma}_{*|\text{obs}})

where

\boldsymbol{\mu}_{*|\text{obs}} = \mathbf{m}_* + \mathbf{K}_{*\text{obs}} \mathbf{K}_{\text{obs,obs}}^{-1}(\mathbf{f}_{\text{obs}} - \mathbf{m}_{\text{obs}})

\boldsymbol{\Sigma}_{*|\text{obs}} = \mathbf{K}_{**} - \mathbf{K}_{*\text{obs}} \mathbf{K}_{\text{obs,obs}}^{-1} \mathbf{K}_{\text{obs}*}

These are exactly the standard GP prediction equations. The posterior mean μobs\boldsymbol{\mu}_{*|\text{obs}} is the prior mean plus a correction based on the observed residuals fobsmobs\mathbf{f}_{\text{obs}} - \mathbf{m}_{\text{obs}}. The posterior covariance Σobs\boldsymbol{\Sigma}_{*|\text{obs}} is the prior covariance minus the reduction due to the observations — the Schur complement. The conditional covariance does not depend on fobs\mathbf{f}_{\text{obs}} (only on the input locations), which means the GP’s uncertainty is determined by where we observe, not what we observe.

The GP prediction formula is not an approximation — it is the exact conditional MVN formula, applied to the specific joint MVN defined by the kernel. See formalML: for the full development.
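The two conditional formulas translate directly into code. A toy 1-D sketch with an assumed RBF kernel, zero prior mean, and noise-free observations of sin(x) (the jitter term is a standard numerical safeguard, not part of the formulas):

```python
import numpy as np

def rbf(a, b, ell=1.0, sf=1.0):
    """Squared-exponential kernel k(x, x') = sf^2 exp(-(x - x')^2 / (2 ell^2))."""
    d = a[:, None] - b[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell)**2)

# Toy 1-D data, zero prior mean.
x_obs = np.array([-2.0, -0.5, 1.0, 2.5])
f_obs = np.sin(x_obs)
x_star = np.linspace(-3, 3, 7)

K_oo = rbf(x_obs, x_obs) + 1e-10 * np.eye(len(x_obs))   # jitter for stability
K_so = rbf(x_star, x_obs)
K_ss = rbf(x_star, x_star)

# Conditional MVN formulas: posterior mean and Schur-complement covariance.
mu_post = K_so @ np.linalg.solve(K_oo, f_obs)
Sigma_post = K_ss - K_so @ np.linalg.solve(K_oo, K_so.T)

print(mu_post)                                          # posterior mean at x_star
print(np.sqrt(np.clip(np.diag(Sigma_post), 0, None)))   # predictive std devs
```

At the training inputs the posterior mean interpolates the observations and the predictive variance collapses to (numerically) zero, exactly as the Schur complement predicts for noise-free conditioning.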

PCA as eigendecomposition of the sample covariance matrix. Given data x1,,xnRp\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^p, the sample covariance matrix is Σ^=1n1i=1n(xixˉ)(xixˉ)T\hat{\boldsymbol{\Sigma}} = \frac{1}{n-1}\sum_{i=1}^n (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T. PCA computes the spectral decomposition Σ^=QΛQT\hat{\boldsymbol{\Sigma}} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T and projects onto the top eigenvectors. The eigenvalues λ1λp\lambda_1 \geq \cdots \geq \lambda_p are the variances explained by each principal component, and the cumulative proportion j=1kλj/j=1pλj\sum_{j=1}^k \lambda_j / \sum_{j=1}^p \lambda_j determines how many components to retain. These eigenvectors are the directions that maximize the variance of the projections (unique when the eigenvalues are distinct); when the data are MVN, they are also the principal axes of the elliptical density contours — the connection between Theorem 7 and the PCA algorithm. See formalML: .
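A NumPy sketch of this pipeline on assumed synthetic data: form the sample covariance, eigendecompose it, and project onto the top components.

```python
import numpy as np

rng = np.random.default_rng(4)

# Correlated 3-D data with one dominant direction (illustrative covariance).
n = 5000
X = rng.multivariate_normal([0, 0, 0],
                            [[4.0, 1.5, 0.5],
                             [1.5, 2.0, 0.3],
                             [0.5, 0.3, 0.5]], size=n)

# Sample covariance and its spectral decomposition Sigma_hat = Q Lambda Q^T.
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / (n - 1)
evals, Q = np.linalg.eigh(S)            # eigh returns ascending eigenvalues
evals, Q = evals[::-1], Q[:, ::-1]      # reorder to descending

explained = np.cumsum(evals) / evals.sum()
print(evals)          # variances along the principal axes
print(explained)      # cumulative proportion of variance explained

# Project onto the top-2 principal components; the scores are uncorrelated.
scores = Xc @ Q[:, :2]
```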

Variational inference with MVN variational families. In variational inference, we approximate a complex posterior p(zx)p(\mathbf{z} \mid \mathbf{x}) with a tractable distribution q(z)q(\mathbf{z}), chosen to minimize the KL divergence KL(qp)\text{KL}(q \| p). The most common choice for qq is the multivariate Normal. A full-covariance variational family q(z)=Np(μ,Σ)q(\mathbf{z}) = \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) has p+p(p+1)/2p + p(p+1)/2 parameters (the mean vector and the full covariance matrix). A diagonal (mean-field) variational family q(z)=Np(μ,diag(σ2))q(\mathbf{z}) = \mathcal{N}_p(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2)) has only 2p2p parameters but assumes independence — it misses all correlations. The reparameterization trick from Section 8.8 (z=μ+Lε\mathbf{z} = \boldsymbol{\mu} + \mathbf{L}\boldsymbol{\varepsilon}) enables gradient-based optimization of the ELBO with respect to μ\boldsymbol{\mu} and L\mathbf{L}.

LDA as Dirichlet prior with Multinomial likelihood. In Latent Dirichlet Allocation, each document dd has topic proportions θdDir(α)\boldsymbol{\theta}_d \sim \text{Dir}(\boldsymbol{\alpha}), each topic kk has a word distribution ϕkDir(β)\boldsymbol{\phi}_k \sim \text{Dir}(\boldsymbol{\beta}), and the words in document dd are drawn from a Multinomial governed by the topic mixture. The entire generative model is: draw topic proportions from a Dirichlet, draw a topic for each word from a Multinomial over topics, then draw the word from a Multinomial over the vocabulary. The Dirichlet-Multinomial conjugacy (Theorem 6) makes collapsed Gibbs sampling tractable: we can integrate out θd\boldsymbol{\theta}_d and ϕk\boldsymbol{\phi}_k analytically, sampling only the topic assignments.
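The conjugate update at the heart of this machinery is one line of arithmetic. A sketch with assumed illustrative values (K = 3 categories, n = 1000 observations):

```python
import numpy as np

rng = np.random.default_rng(5)

# Prior over K = 3 category probabilities, and a true theta to generate counts.
alpha = np.array([2.0, 3.0, 1.0])
theta_true = np.array([0.5, 0.3, 0.2])
x = rng.multinomial(1000, theta_true)       # observed counts, n = 1000

# Conjugate update: Dir(alpha) prior + Multinomial counts x -> Dir(alpha + x).
alpha_post = alpha + x
posterior_mean = alpha_post / alpha_post.sum()
print(posterior_mean)                       # close to theta_true for large n

# Posterior draws for uncertainty quantification.
draws = rng.dirichlet(alpha_post, size=10_000)
print(draws.mean(axis=0))
```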

BNN weight priors as multivariate Normal. In Bayesian neural networks, a common prior on the weight vector wRp\mathbf{w} \in \mathbb{R}^p is wNp(0,σ2I)\mathbf{w} \sim \mathcal{N}_p(\mathbf{0}, \sigma^2 \mathbf{I}) — a spherical MVN centered at the origin. This prior regularizes the weights (equivalent to L2L^2 regularization in the MAP limit) and defines a prior over functions. The posterior p(wD)p(\mathbf{w} \mid \mathcal{D}) is approximately MVN for well-specified models (by the Bernstein-von Mises theorem), with the Hessian of the negative log-posterior at the MAP estimate serving as the inverse covariance (the Laplace approximation). Topic 25 §25.8 proves the Bernstein-von Mises theorem in the scalar case, and Remark 16 handles the multivariate Laplace extension. See formalML: .

Three-panel: PCA eigendecomposition, VI full vs diagonal covariance, LDA topic mixtures on the simplex

8.11 Summary

This topic completes Track 2: Core Distributions and Families. With Topics 5-8 now in place, we have a complete catalog of the distributions that appear throughout statistics and machine learning, along with the structural frameworks (exponential families, multivariate distributions) that connect them.

| Topic | Title | Status |
| --- | --- | --- |
| Topic 5 | Discrete Distributions | Published |
| Topic 6 | Continuous Distributions | Published |
| Topic 7 | Exponential Families | Published |
| Topic 8 | Multivariate Distributions (this topic) | Published |

Key results from this topic:

The conditional MVN formula (Theorem 3) is the single most consequential result. The conditional mean μ12=μ1+Σ12Σ221(x2μ2)\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2) is linear in the observed value, and the conditional covariance Σ12=Σ11Σ12Σ221Σ21\boldsymbol{\Sigma}_{1|2} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21} (the Schur complement) is constant. This formula is the GP prediction formula, the Kalman filter update, and the engine behind Bayesian linear regression.

The Dirichlet-Multinomial conjugacy (Theorem 6) gives the cleanest example of Bayesian updating: prior Dir(α)\text{Dir}(\boldsymbol{\alpha}) plus observed counts x\mathbf{x} yields posterior Dir(α+x)\text{Dir}(\boldsymbol{\alpha} + \mathbf{x}). This powers LDA topic models and categorical Bayesian inference.

The spectral decomposition of the covariance matrix (Theorem 7) Σ=QΛQT\boldsymbol{\Sigma} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T gives the principal axes and variances of the distribution. This is the mathematical foundation of PCA.

The change of variables formula (Theorem 8) with the Jacobian determinant is the mathematical tool behind normalizing flows and the reparameterization trick.

Sklar’s theorem (Theorem 9) decomposes any joint distribution into marginals plus a copula, enabling the separation of individual behavior from dependence structure.

What comes next. Track 3 (Convergence and Limit Theorems) develops the asymptotic theory that makes statistics work in practice. The Law of Large Numbers explains why sample means converge to population means. The Central Limit Theorem explains why so many estimators are asymptotically Normal — including the multivariate Normal. The Delta Method, Slutsky’s theorem, and the Continuous Mapping Theorem provide the tools for deriving the limiting distributions of complex statistics. The multivariate Normal developed here is the target of most convergence results — virtually every well-behaved estimator converges to a multivariate Normal distribution as sample size grows.


References

  1. DeGroot, M. H. & Schervish, M. J. (2012). Probability and Statistics (4th ed.). Pearson.
  2. Casella, G. & Berger, R. L. (2002). Statistical Inference (2nd ed.). Duxbury.
  3. Nelsen, R. B. (2006). An Introduction to Copulas (2nd ed.). Springer.
  4. Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A. & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). Chapman & Hall/CRC.
  5. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
  6. Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003). Latent Dirichlet Allocation. JMLR, 3, 993–1022.