Consider a neural network with 1,000 weights. Each weight is a random variable — uncertain before training, updated by data during training, described by a posterior distribution after training. To describe the joint uncertainty of all 1,000 weights simultaneously, we need a distribution on R^1000. A single-variable Normal will not do. Neither will 1,000 independent single-variable Normals — because the weights are correlated (constraining one weight constrains others through the shared loss landscape). We need the full joint distribution: a function that assigns probability to every region of R^1000 at once, capturing all the dependencies.
This is not an exotic requirement. Nearly every ML model operates on vectors of random variables:
Feature vectors x ∈ R^p — the input to every supervised learning model. The joint distribution of features determines which classifiers work and which fail.
Weight vectors w ∈ R^p — the parameters of a linear model, neural network, or Gaussian process. Bayesian inference places a prior distribution on w and computes a posterior, both of which are multivariate distributions.
Latent representations z ∈ R^d — the hidden variables in a variational autoencoder, the topic proportions in LDA, the factors in a factor model. The generative process starts with a draw from a multivariate distribution over z.
Topic 3 — Random Variables introduced joint distributions for the bivariate case: two random variables X and Y with a joint PDF fX,Y(x,y), marginals obtained by integrating out one variable, and conditional distributions obtained by dividing (Definitions 8-14). That bivariate machinery is the right starting point, but it covers only p=2. The real world hands us p=10, p=100, p=10,000.
This topic generalizes everything from the bivariate case to p dimensions. The ideas are the same — joint, marginal, conditional, independence — but the notation shifts from integrals over R to integrals over Rp−q, from scalar variances to covariance matrices, and from scalar conditioning formulas to block matrix partitions. The payoff is enormous: the conditional multivariate Normal formula that we derive in Section 8.4 is the exact formula that powers Gaussian process prediction, Kalman filtering, and Bayesian linear regression. The Dirichlet distribution that we define in Section 8.6 is the prior that powers LDA topic models. The covariance matrix geometry of Section 8.7 is the mathematical foundation of PCA.
8.2 The General p-Dimensional Framework
We begin with the general definitions. These are direct extensions of the bivariate definitions from Topic 3, but stated for an arbitrary dimension p. The key prerequisite is comfort with iterated integrals over Rp — see formalCalculus: for Fubini’s theorem and the mechanics of integrating over subsets of Rp.
Definition 1 Joint PDF in p Dimensions
A continuous random vector X = (X_1, X_2, …, X_p)^T has a joint probability density function f_X : R^p → [0, ∞) if, for every (measurable) region A ⊆ R^p,
P(X ∈ A) = \int_A f_X(x)\, dx
The density must satisfy f_X(x) ≥ 0 for all x ∈ R^p and
\int_{R^p} f_X(x)\, dx = 1
When p=2, this recovers the bivariate joint PDF fX,Y(x,y) from Topic 3 (Definition 8). The extension to p>2 is notational: we replace double integrals with p-fold integrals.
Definition 2 Marginal Density via Integration
Given a random vector X = (X_1, …, X_p)^T with joint PDF f_X, the marginal density of a subvector X_1 = (X_1, …, X_q)^T (where 1 ≤ q < p) is obtained by integrating the joint density over all components not in X_1:
f_{X_1}(x_1) = \int_{R^{p−q}} f_X(x_1, x_2)\, dx_2
where x_2 collects the remaining p − q components.
In words: to find the density of a subset of the variables, integrate out the rest. This is the p-dimensional generalization of Topic 3’s marginal formula (Definition 9), where we integrated out one variable over R.
The order of integration does not matter (by Fubini’s theorem), so we can marginalize in any order. To get the marginal of X3 alone from a joint density on (X1,X2,X3,X4), we integrate out X1, X2, and X4 in any convenient order.
Definition 3 Conditional Density of Subvectors
Partition the random vector as X=(X1T,X2T)T, where X1 is q-dimensional and X2 is (p−q)-dimensional. The conditional density of X1 given X2=x2 is
f_{X_1 ∣ X_2}(x_1 ∣ x_2) = \frac{f_X(x_1, x_2)}{f_{X_2}(x_2)}
provided fX2(x2)>0.
This is the ratio of the joint density to the marginal density of the conditioning variables — the same ratio as in the bivariate case (Topic 3, Definition 10), now applied to subvectors of arbitrary dimension. The conditional density fX1∣X2(⋅∣x2) is a proper density in x1 for each fixed x2: it integrates to 1 over Rq.
Definition 4 Mutual Independence
The random variables X_1, X_2, …, X_p are mutually independent if and only if the joint density factors as the product of all marginal densities:
f_X(x_1, x_2, …, x_p) = \prod_{j=1}^{p} f_{X_j}(x_j) \quad \text{for all } (x_1, …, x_p) ∈ R^p
Mutual independence is strictly stronger than pairwise independence (every pair (X_i, X_j) is independent). Topic 2 (Conditional Probability) discussed this distinction for events; the same distinction holds for random variables. When p > 2, verifying that all \binom{p}{2} pairs are independent does not guarantee mutual independence — the joint must factor as a product of all p marginals.
The chain rule for densities tells us how to decompose any joint density into a product of conditionals. This is the density analog of the chain rule for probabilities from Topic 2.
Theorem 1 General Chain Rule for Densities
For any random vector X = (X_1, X_2, …, X_p)^T with a joint density, the joint PDF factors as
f_X(x_1, …, x_p) = f_{X_1}(x_1)\, f_{X_2 ∣ X_1}(x_2 ∣ x_1)\, f_{X_3 ∣ X_1, X_2}(x_3 ∣ x_1, x_2) \cdots f_{X_p ∣ X_1, …, X_{p−1}}(x_p ∣ x_1, …, x_{p−1})
and the analogous factorization holds for any ordering of the variables.
The chain rule is not merely a formula — it is the theoretical foundation of autoregressive models. When a language model generates text token by token, computing P(token_t ∣ token_1, …, token_{t−1}) at each step, it is implementing the chain rule for the joint distribution of the token sequence. The factorization order matters in practice (left-to-right vs. right-to-left vs. arbitrary), even though mathematically any ordering gives the same joint.
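A minimal numerical sketch of the chain rule, in Python/NumPy with a toy joint PMF over three binary variables (the array and variable names are illustrative, not from the text): multiplying the marginal of X₁ by the successive conditionals recovers the joint exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy joint PMF over three binary variables, stored as a 2x2x2 array.
joint = rng.random((2, 2, 2))
joint /= joint.sum()                     # normalize so the entries sum to 1

# Marginals and conditionals obtained by summing and dividing (Definitions 2-3).
p1 = joint.sum(axis=(1, 2))              # p(x1)
p12 = joint.sum(axis=2)                  # p(x1, x2)
p2_given_1 = p12 / p1[:, None]           # p(x2 | x1)
p3_given_12 = joint / p12[:, :, None]    # p(x3 | x1, x2)

# Chain rule: the product of the factors reconstructs the joint exactly.
reconstructed = p1[:, None, None] * p2_given_1[:, :, None] * p3_given_12
assert np.allclose(reconstructed, joint)
```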
8.3 The Multivariate Normal Distribution
The univariate Normal N(μ,σ2) was the star of Topic 6 — Continuous Distributions. Its multivariate generalization Np(μ,Σ) is the single most important distribution in all of statistics and machine learning. It appears as the prior on Bayesian linear regression weights, the variational family in variational inference, the base distribution in normalizing flows, the finite-dimensional projection of a Gaussian process, and the asymptotic distribution of virtually every well-behaved estimator. If you internalize one distribution from this topic, make it this one.
Before the definition, let us build intuition. In one dimension, a Normal distribution is specified by its center μ and its spread σ². In two dimensions, we still need a center — now a vector μ = (μ_1, μ_2)^T — but the “spread” becomes richer. We need not just the variance of each coordinate (σ_1^2 and σ_2^2), but also the covariance between them (σ_{12}). Positive covariance stretches the distribution along the diagonal; negative covariance stretches it along the anti-diagonal. The contours of equal density, which are circles for independent equal-variance Normals, become ellipses whose orientation and eccentricity are controlled by the covariance. In p dimensions, the contours are ellipsoids in R^p, and the full covariance structure is captured by a p×p matrix Σ.
Definition 5 Multivariate Normal Distribution
A random vector X=(X1,X2,…,Xp)T has the multivariate Normal distribution with mean vector μ∈Rp and p×p positive definite covariance matrix Σ, written X∼Np(μ,Σ), if its joint PDF is
f_X(x) = (2π)^{−p/2}\, |Σ|^{−1/2} \exp\!\left(−\tfrac{1}{2}(x−μ)^T Σ^{−1} (x−μ)\right)
where ∣Σ∣ denotes the determinant of Σ.
The term (x−μ)TΣ−1(x−μ) in the exponent is a quadratic form — a scalar that measures how far x is from μ, weighted by Σ−1. This is the squared Mahalanobis distance (Definition 9). The contours of constant density are the sets where this quadratic form is constant, which are ellipsoids centered at μ.
When p = 1, we recover the univariate Normal: the mean vector reduces to the scalar μ, Σ reduces to the scalar variance σ^2, and the exponent becomes −(x−μ)^2/(2σ^2).
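As a sanity check, the density in Definition 5 can be evaluated directly and compared against a library implementation. The sketch below uses Python with NumPy/SciPy; the specific μ, Σ, and x values are arbitrary illustrations.

```python
import numpy as np
from scipy.stats import multivariate_normal

# A 3-dimensional example: evaluate the MVN density via the formula in
# Definition 5 and via SciPy, and confirm the two agree.
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.4],
                  [0.0, 0.4, 1.5]])
x = np.array([0.8, -1.5, 1.0])

p = len(mu)
diff = x - mu
quad = diff @ np.linalg.solve(Sigma, diff)    # (x-mu)^T Sigma^{-1} (x-mu)
manual = (2 * np.pi) ** (-p / 2) / np.sqrt(np.linalg.det(Sigma)) * np.exp(-0.5 * quad)

assert np.isclose(manual, multivariate_normal(mu, Sigma).pdf(x))
```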
The MVN has a collection of remarkable properties that make it uniquely tractable. No other distribution enjoys all six of the properties below simultaneously.
Theorem 2 Properties of the Multivariate Normal
Let X∼Np(μ,Σ). Then:
Marginals are Normal. Every subvector of X is also multivariate Normal. In particular, each component Xj∼N(μj,Σjj).
Linear transformations preserve normality. If A is a q×p matrix and b∈Rq, then Y=AX+b∼Nq(Aμ+b,AΣAT).
Uncorrelated implies independent. If Cov(X_i, X_j) = 0 for all i ≠ j (equivalently, Σ is diagonal), then X_1, …, X_p are mutually independent. This equivalence between uncorrelatedness and independence is a special property of the multivariate Normal — for general joint distributions, zero correlation does not imply independence.
Moment-generating function. M_X(t) = E[e^{t^T X}] = \exp(t^T μ + \tfrac{1}{2} t^T Σ t) for all t ∈ R^p.
Sum of independent Normals. If X∼Np(μX,ΣX) and Y∼Np(μY,ΣY) are independent, then X+Y∼Np(μX+μY,ΣX+ΣY).
Characteristic function. φ_X(t) = E[e^{i t^T X}] = \exp(i t^T μ − \tfrac{1}{2} t^T Σ t), which is the MGF with t replaced by it and exists for all t ∈ R^p.
Proof
Proof of Property 1 (Marginals are Normal).
We prove that the marginal distribution of X1=(X1,…,Xq)T is Nq(μ1,Σ11), where μ1=(μ1,…,μq)T and Σ11 is the upper-left q×q block of Σ.
Partition X=(X1T,X2T)T, μ=(μ1T,μ2T)T, and
Σ = \begin{pmatrix} Σ_{11} & Σ_{12} \\ Σ_{21} & Σ_{22} \end{pmatrix}
The marginal density of X1 is obtained by integrating out X2:
f_{X_1}(x_1) = \int_{R^{p−q}} f_X(x_1, x_2)\, dx_2
The exponent in the joint density is the quadratic form Q=(x−μ)TΣ−1(x−μ). Using the block matrix inverse formula, we can write
Σ^{−1} = \begin{pmatrix} Σ^{11} & Σ^{12} \\ Σ^{21} & Σ^{22} \end{pmatrix}
where Σ^{22} = (Σ_{22} − Σ_{21} Σ_{11}^{−1} Σ_{12})^{−1} and Σ^{11} = (Σ_{11} − Σ_{12} Σ_{22}^{−1} Σ_{21})^{−1}.
Expanding the quadratic form Q by completing the square in x_2, we can separate the terms involving x_2 from those involving only x_1. Define a_2 = μ_2 + Σ_{21} Σ_{11}^{−1}(x_1 − μ_1) and S_{2∣1} = Σ_{22} − Σ_{21} Σ_{11}^{−1} Σ_{12}. Then
Q = (x_1 − μ_1)^T Σ_{11}^{−1} (x_1 − μ_1) + (x_2 − a_2)^T S_{2∣1}^{−1} (x_2 − a_2)
The second term is the only place x_2 appears, and integrating \exp\!\left(−\tfrac{1}{2}(x_2 − a_2)^T S_{2∣1}^{−1}(x_2 − a_2)\right) over R^{p−q} gives the Gaussian integral (2π)^{(p−q)/2} |S_{2∣1}|^{1/2}. Using |Σ| = |Σ_{11}|\,|S_{2∣1}|, the constants combine to leave
f_{X_1}(x_1) = (2π)^{−q/2} |Σ_{11}|^{−1/2} \exp\!\left(−\tfrac{1}{2}(x_1 − μ_1)^T Σ_{11}^{−1}(x_1 − μ_1)\right)
which is the N_q(μ_1, Σ_{11}) density.
Proof of Property 2 (Linear transformations preserve normality). For Y = AX + b, compute the MGF directly:
M_Y(t) = E[e^{t^T(AX+b)}] = e^{t^T b}\, M_X(A^T t) = \exp\!\left(t^T(Aμ + b) + \tfrac{1}{2} t^T A Σ A^T t\right)
This is the MGF of N_q(Aμ + b, AΣA^T). Since the MGF uniquely determines the distribution (when it exists in a neighborhood of the origin, which it does for the Normal), we conclude Y ∼ N_q(Aμ + b, AΣA^T).
Note that Property 1 (marginals are Normal) is a special case: extracting a subvector is a linear transformation with A as a selection matrix (rows of the identity matrix) and b=0.
◼
Proof
Proof: MVN in exponential family form (see Remark 1 below).
We show that the MVN PDF can be written in the canonical exponential family form f(x∣θ)=h(x)exp(η(θ)⋅T(x)−A(θ)) from Topic 7 — Exponential Families.
Start from the MVN PDF and take the logarithm:
\log f_X(x) = −\tfrac{p}{2}\log(2π) − \tfrac{1}{2}\log|Σ| − \tfrac{1}{2}(x−μ)^T Σ^{−1}(x−μ)
Expand the quadratic form. Let Λ = Σ^{−1} (the precision matrix). Then
(x−μ)^T Λ (x−μ) = x^T Λ x − 2 μ^T Λ x + μ^T Λ μ
so the x-dependent part of the exponent is μ^T Λ x − \tfrac{1}{2} x^T Λ x.
Now we identify the exponential family components. The natural parameters are η_1 = Λμ = Σ^{−1}μ (a p-vector) and η_2 = −\tfrac{1}{2}Λ = −\tfrac{1}{2}Σ^{−1} (a p×p symmetric matrix, contributing p(p+1)/2 unique parameters). The corresponding sufficient statistics are T_1(x) = x and T_2(x) = x x^T.
The components of the canonical form are:
h(x) = (2π)^{−p/2}
η = (Σ^{−1}μ,\; −\tfrac{1}{2}Σ^{−1})
T(x) = (x,\; x x^T)
A(η) = \tfrac{1}{2} μ^T Σ^{−1} μ + \tfrac{1}{2}\log|Σ|
The inner product is η · T(x) = (Σ^{−1}μ)^T x + \operatorname{tr}\!\left(−\tfrac{1}{2}Σ^{−1} x x^T\right) = μ^T Σ^{−1} x − \tfrac{1}{2} x^T Σ^{−1} x, matching the expansion above.
The MVN is a k-parameter exponential family with k = p + p(p+1)/2 natural parameters (the entries of Σ^{−1}μ and the unique entries of −\tfrac{1}{2}Σ^{−1}). When Σ is known, it reduces to a p-parameter family with natural parameters η = Σ^{−1}μ and sufficient statistic T(x) = x.
◼
Remark 1 MVN as Exponential Family Member
The exponential family form of the MVN connects directly to Topic 7. The sufficient statistics T(x) = (x, x x^T) tell us that for an i.i.d. sample x_1, …, x_n from a MVN, the sample mean \bar{x} = \tfrac{1}{n}\sum_i x_i and the sample second moment matrix \tfrac{1}{n}\sum_i x_i x_i^T are jointly sufficient for (μ, Σ). No matter how large the sample, these two summary quantities capture everything the data has to say about the parameters.
The log-partition function A(η) = \tfrac{1}{2} μ^T Σ^{−1} μ + \tfrac{1}{2}\log|Σ| generates the moments via differentiation, just as in the univariate case (Topic 7, Theorem 7.1). And the convexity of A in the natural parameters guarantees that the MLE for (μ, Σ) — namely \hat{μ} = \bar{x} and \hat{Σ} = \tfrac{1}{n}\sum_i (x_i − \bar{x})(x_i − \bar{x})^T — is unique.
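A short simulation sketch (NumPy; the parameter values and sample size are illustrative) showing the sufficiency claim in practice: the MLE of (μ, Σ) is computed from the sample mean and second-moment matrix alone.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.0, 2.0])
Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])
X = rng.multivariate_normal(mu, Sigma, size=50_000)     # i.i.d. MVN sample

# The two sufficient statistics from Remark 1: sample mean and second-moment matrix.
xbar = X.mean(axis=0)
second_moment = X.T @ X / len(X)

# The MLE depends on the data only through these two summaries.
mu_hat = xbar
Sigma_hat = second_moment - np.outer(xbar, xbar)         # equals (1/n) sum (x_i - xbar)(x_i - xbar)^T

print(mu_hat)        # close to mu
print(Sigma_hat)     # close to Sigma
```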
Interactive: Multivariate Normal Explorer — adjust the 2×2 covariance matrix Σ and inspect its eigendecomposition (eigenvalues λ₁, λ₂ and eigenvectors v₁, v₂), its determinant |Σ| and condition number κ(Σ), and the shape of the density contours (spherical when Σ is close to the identity, elliptical otherwise).
8.4 Conditional Multivariate Normal
This section contains the single most consequential formula in this topic. The conditional MVN result is the exact same formula used to make predictions in Gaussian processes, to update states in Kalman filters, and to perform Bayesian linear regression. Learn this formula once, and you have the engine behind three of the most important tools in applied ML.
Theorem 3 Conditional Multivariate Normal
Let X∼Np(μ,Σ). Partition X, μ, and Σ as
X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}, \quad μ = \begin{pmatrix} μ_1 \\ μ_2 \end{pmatrix}, \quad Σ = \begin{pmatrix} Σ_{11} & Σ_{12} \\ Σ_{21} & Σ_{22} \end{pmatrix}
where X1 is q-dimensional and X2 is (p−q)-dimensional. Then the conditional distribution of X1 given X2=x2 is
X1∣X2=x2∼Nq(μ1∣2,Σ1∣2)
where the conditional mean and conditional covariance are
μ_{1∣2} = μ_1 + Σ_{12} Σ_{22}^{−1} (x_2 − μ_2), \qquad Σ_{1∣2} = Σ_{11} − Σ_{12} Σ_{22}^{−1} Σ_{21}
The conditional mean is a linear function of x2. It equals the prior mean μ1 plus a correction term Σ12Σ22−1(x2−μ2) that adjusts for how much x2 deviates from its mean, weighted by the cross-covariance Σ12 relative to the variance of X2.
The conditional covariance does not depend on x2. No matter what value of x2 we observe, the uncertainty about X1 is always Σ1∣2=Σ11−Σ12Σ22−1Σ21, which is the Schur complement of Σ22 in Σ.
Proof
We derive the conditional density by starting from the definition fX1∣X2(x1∣x2)=fX(x1,x2)/fX2(x2) and working with the exponent of the joint density.
Step 1: Set up the quadratic form. The exponent of the joint MVN density is
Q = −\tfrac{1}{2}(x−μ)^T Σ^{−1} (x−μ)
We need the block inverse of Σ. Using the formula for the inverse of a 2×2 block matrix, with S = Σ_{22} − Σ_{21} Σ_{11}^{−1} Σ_{12},
Σ^{−1} = \begin{pmatrix} Σ_{11}^{−1} + Σ_{11}^{−1} Σ_{12} S^{−1} Σ_{21} Σ_{11}^{−1} & −Σ_{11}^{−1} Σ_{12} S^{−1} \\ −S^{−1} Σ_{21} Σ_{11}^{−1} & S^{−1} \end{pmatrix}
Step 2: Expand in the deviations. Write d_1 = x_1 − μ_1 and d_2 = x_2 − μ_2, so that
−2Q = d_1^T \left(Σ_{11}^{−1} + Σ_{11}^{−1} Σ_{12} S^{−1} Σ_{21} Σ_{11}^{−1}\right) d_1 − 2\, d_1^T Σ_{11}^{−1} Σ_{12} S^{−1} d_2 + d_2^T S^{−1} d_2
Step 3: Complete the square in x1. We want to rewrite −2Q as the sum of a perfect square in d1 (which will give the conditional density) and a term in d2 alone (which belongs to the marginal of X2).
Define Σ1∣2=Σ11−Σ12Σ22−1Σ21 and m=Σ12Σ22−1d2, so that μ1∣2=μ1+m.
After completing the square (grouping terms involving d1 and absorbing cross-terms), the quadratic form separates as:
−2Q = (d_1 − m)^T Σ_{1∣2}^{−1} (d_1 − m) + d_2^T Σ_{22}^{−1} d_2
We verify this by expanding the right side. The first term expands to:
(d_1 − m)^T Σ_{1∣2}^{−1} (d_1 − m) = d_1^T Σ_{1∣2}^{−1} d_1 − 2\, d_1^T Σ_{1∣2}^{−1} Σ_{12} Σ_{22}^{−1} d_2 + d_2^T Σ_{22}^{−1} Σ_{21} Σ_{1∣2}^{−1} Σ_{12} Σ_{22}^{−1} d_2
Using the Woodbury identity, Σ_{1∣2}^{−1} = (Σ_{11} − Σ_{12} Σ_{22}^{−1} Σ_{21})^{−1} = Σ_{11}^{−1} + Σ_{11}^{−1} Σ_{12} S^{−1} Σ_{21} Σ_{11}^{−1} with S = Σ_{22} − Σ_{21} Σ_{11}^{−1} Σ_{12}, and substituting m = Σ_{12} Σ_{22}^{−1} d_2, each term matches the expansion in Step 2. The cross-terms in d_1^T(\cdot)\, d_2 combine correctly, and the terms purely in d_2 collect to d_2^T Σ_{22}^{−1} d_2.
Step 4: Read off the conditional. With the quadratic form separated, the joint density factors as
f_X(x_1, x_2) ∝ \exp\!\left(−\tfrac{1}{2}(x_1 − μ_{1∣2})^T Σ_{1∣2}^{−1} (x_1 − μ_{1∣2})\right) \cdot \exp\!\left(−\tfrac{1}{2}(x_2 − μ_2)^T Σ_{22}^{−1} (x_2 − μ_2)\right)
The first factor, viewed as a function of x_1, is the kernel of N_q(μ_{1∣2}, Σ_{1∣2}). The second factor depends only on x_2 and is the kernel of f_{X_2}(x_2).
The conditional mean is linear in x2 (a regression of X1 on X2), and the conditional covariance is constant in x2 (it depends only on the covariance structure, not on the observed value).
◼
Example 1 Bivariate Case: Recovering the Scalar Formula
Let p=2 with X=(X1,X2)T∼N2(μ,Σ) where
μ = \begin{pmatrix} μ_1 \\ μ_2 \end{pmatrix}, \quad Σ = \begin{pmatrix} σ_1^2 & ρσ_1σ_2 \\ ρσ_1σ_2 & σ_2^2 \end{pmatrix}
Here q = 1, Σ_{11} = σ_1^2, Σ_{12} = ρσ_1σ_2, Σ_{21} = ρσ_1σ_2, Σ_{22} = σ_2^2. Applying the conditional MVN formulas:
μ_{1∣2} = μ_1 + \frac{ρσ_1σ_2}{σ_2^2}(x_2 − μ_2) = μ_1 + ρ\frac{σ_1}{σ_2}(x_2 − μ_2), \qquad Σ_{1∣2} = σ_1^2 − \frac{(ρσ_1σ_2)^2}{σ_2^2} = σ_1^2(1 − ρ^2)
So X_1 ∣ X_2 = x_2 ∼ N\!\left(μ_1 + ρ\tfrac{σ_1}{σ_2}(x_2 − μ_2),\; σ_1^2(1 − ρ^2)\right).
This recovers the bivariate conditional Normal formula from Topic 3 (Example 9). The conditional mean traces out the regression line μ_1 + ρ(σ_1/σ_2)(x_2 − μ_2), and the conditional variance σ_1^2(1 − ρ^2) shrinks as |ρ| → 1 — stronger correlation means less residual uncertainty after conditioning.
When ρ = 0, the conditional mean is just μ_1 (knowing X_2 tells us nothing about X_1) and the conditional variance is σ_1^2 (no uncertainty reduction). When |ρ| = 1, the conditional variance is 0 — knowing X_2 determines X_1 exactly. This is the perfect linear relationship X_1 = μ_1 + ρ(σ_1/σ_2)(X_2 − μ_2).
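For higher-dimensional partitions, the same computation is a few lines of linear algebra. A minimal sketch in Python/NumPy (the helper name conditional_mvn and the numeric values are illustrative, not from the text):

```python
import numpy as np

def conditional_mvn(mu, Sigma, q, x2):
    """Parameters of X1 | X2 = x2 for X ~ N_p(mu, Sigma), with X1 the first q coordinates."""
    mu1, mu2 = mu[:q], mu[q:]
    S11, S12 = Sigma[:q, :q], Sigma[:q, q:]
    S21, S22 = Sigma[q:, :q], Sigma[q:, q:]
    mu_cond = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)       # mu_1 + S12 S22^{-1} (x2 - mu2)
    Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S21)         # Schur complement
    return mu_cond, Sigma_cond

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.8, 0.3],
                  [0.8, 1.5, 0.5],
                  [0.3, 0.5, 1.0]])
m, S = conditional_mvn(mu, Sigma, q=1, x2=np.array([2.0, 0.0]))
print(m, S)   # the conditional mean depends on x2; the conditional covariance does not
```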
The conditional MVN formula is the workhorse behind Gaussian process prediction. In a GP, we observe function values fobs at training inputs and want to predict function values f∗ at new inputs. The GP prior places a joint MVN on (f∗,fobs)T, and the posterior predictive f∗∣fobs is given exactly by the conditional MVN formula: μ1∣2 is the posterior mean and Σ1∣2 is the posterior covariance. See formalML: for the full development.
Interactive: Conditional Multivariate Normal Explorer — drag a horizontal line on the joint density to condition on x₂ and watch the conditional parameters update: the conditional mean μ_{1|2} = μ_1 + (Σ_{12}/Σ_{22})(x_2 − μ_2) varies with x₂ (the regression line), while the conditional variance σ²_{1|2} = Σ_{11} − Σ_{12}²/Σ_{22} stays constant no matter where the line is dragged.
8.5 The Multinomial Distribution
The Binomial counts successes in n independent trials with two outcomes. What if there are k>2 possible outcomes? This is the setting of the Multinomial distribution — the natural generalization of the Binomial to multiple categories.
Consider classifying n=100 emails as spam, promotions, or legitimate. Each email falls into exactly one category, and we model the classification probabilities as (p1,p2,p3)=(0.3,0.2,0.5). The counts (X1,X2,X3) — how many emails fall into each category — follow a Multinomial distribution. The constraint X1+X2+X3=100 is baked in: every email goes somewhere. This fixed-total constraint has deep consequences for the covariance structure.
Definition 6 Multinomial Distribution
Let n ∈ {1, 2, 3, …} be the number of trials and p = (p_1, p_2, …, p_k)^T a probability vector with p_j > 0 and \sum_{j=1}^{k} p_j = 1. The random vector X = (X_1, …, X_k)^T has the Multinomial distribution Mult(n, p) if its PMF is
P(X = x) = \frac{n!}{x_1!\, x_2! \cdots x_k!}\, p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}
for non-negative integers x_1, …, x_k with \sum_{j=1}^{k} x_j = n.
The coefficient \frac{n!}{x_1! \cdots x_k!} = \binom{n}{x_1, …, x_k} is the multinomial coefficient — the number of ways to assign n items into k groups of sizes x_1, …, x_k.
When k = 2, we have X = (X_1, n − X_1)^T and the PMF reduces to \binom{n}{x_1} p_1^{x_1} (1 − p_1)^{n − x_1}, which is the Binomial(n, p_1) PMF. The Multinomial is genuinely the k-category Binomial.
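A quick numerical check of the k = 2 reduction, assuming SciPy's multinomial and binom distributions are available (the specific n, p₁, and x₁ values are arbitrary illustrations):

```python
from scipy.stats import multinomial, binom

# For k = 2 the Multinomial PMF collapses to the Binomial PMF, as noted above.
n, p1 = 100, 0.3
x1 = 27
mult_pmf = multinomial(n, [p1, 1 - p1]).pmf([x1, n - x1])
bin_pmf = binom(n, p1).pmf(x1)
assert abs(mult_pmf - bin_pmf) < 1e-12
```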
Theorem 4 Multinomial Moments
Let X∼Mult(n,p). Then:
E[X_j] = n p_j for each j = 1, …, k
Var(X_j) = n p_j (1 − p_j) for each j — each marginal X_j ∼ Bin(n, p_j)
Cov(X_i, X_j) = −n p_i p_j for i ≠ j — the covariance is always negative
Proof
Proof of Part 1 (Mean via indicator variables).
Write each trial outcome as an indicator. For the ℓ-th trial (ℓ=1,…,n), define the indicator
Z_{ℓ j} = \begin{cases} 1 & \text{if trial } ℓ \text{ results in category } j \\ 0 & \text{otherwise} \end{cases}
Then X_j = \sum_{ℓ=1}^{n} Z_{ℓ j}: the count for category j is the sum of n independent Bernoulli(p_j) indicators. By linearity of expectation:
E[X_j] = E\!\left[\sum_{ℓ=1}^{n} Z_{ℓ j}\right] = \sum_{ℓ=1}^{n} E[Z_{ℓ j}] = \sum_{ℓ=1}^{n} p_j = n p_j
Since each Zℓj is Bernoulli(pj), the marginal Xj=∑ℓZℓj is Bin(n,pj). From Topic 5, Var(Xj)=npj(1−pj).
◼
Proof
Proof of Part 3 (Negative covariance via the variance constraint).
This is the most revealing proof, because it shows why the covariance must be negative.
The total count is fixed: \sum_{j=1}^{k} X_j = n with probability 1. A constant has zero variance:
Var\!\left(\sum_{j=1}^{k} X_j\right) = 0
Expanding the variance of the sum using the general formula Var(\sum_j X_j) = \sum_j Var(X_j) + 2\sum_{i<j} Cov(X_i, X_j):
\sum_{j=1}^{k} Var(X_j) + 2\sum_{i<j} Cov(X_i, X_j) = 0
Substituting Var(X_j) = n p_j (1 − p_j):
\sum_{j=1}^{k} n p_j (1 − p_j) + 2\sum_{i<j} Cov(X_i, X_j) = 0
The first sum is n\sum_j p_j − n\sum_j p_j^2 = n − n\sum_j p_j^2. Therefore
2\sum_{i<j} Cov(X_i, X_j) = −\left(n − n\sum_j p_j^2\right) = −n + n\sum_j p_j^2
Now we claim that Cov(X_i, X_j) = −n p_i p_j for i ≠ j. To verify, we can compute it directly from the indicator representation. For distinct trials ℓ ≠ m:
E[ZℓiZmj]=E[Zℓi]E[Zmj]=pipj
since trials are independent. For the same trial ℓ:
E[ZℓiZℓj]=P(trial ℓ in category i AND category j)=0
since a single trial cannot be in two categories simultaneously. Therefore
E[X_i X_j] = \sum_{ℓ=1}^{n}\sum_{m=1}^{n} E[Z_{ℓ i} Z_{m j}] = n(n−1)\, p_i p_j
and
Cov(X_i, X_j) = E[X_i X_j] − E[X_i]E[X_j] = n(n−1)\, p_i p_j − (n p_i)(n p_j) = −n p_i p_j
The covariance is always negative. This makes geometric sense: X1+X2+⋯+Xk=n is a hard constraint, so if one count goes up, the others must collectively go down. More emails classified as spam means fewer classified as legitimate — the negative covariance captures this competition for a fixed total.
We can verify consistency: 2\sum_{i<j}(−n p_i p_j) = −n \cdot 2\sum_{i<j} p_i p_j = −n\left[\left(\sum_j p_j\right)^2 − \sum_j p_j^2\right] = −n\left(1 − \sum_j p_j^2\right), which matches the constraint equation above.
◼
Remark 2 Multinomial as Exponential Family Member
The Multinomial is an exponential family member, but the parameterization requires care because of the constraint \sum_j p_j = 1. Working with the first k−1 components (since p_k = 1 − \sum_{j=1}^{k−1} p_j), the PMF can be written
P(X = x ∣ p) = \frac{n!}{x_1! \cdots x_k!} \exp\!\left(\sum_{j=1}^{k−1} x_j \log\frac{p_j}{p_k} + n \log p_k\right)
The natural parameters are the log-odds ratios η_j = \log(p_j / p_k) for j = 1, …, k−1, and the sufficient statistics are T_j(x) = x_j. The inverse map gives p_j = e^{η_j} / (1 + \sum_{m=1}^{k−1} e^{η_m}) — this is the softmax function. The softmax output layer in a neural network classifier is performing Multinomial exponential family parameterization, just as the sigmoid is performing Bernoulli parameterization.
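A small sketch of the inverse map (Python/NumPy; the helper name softmax_with_reference is illustrative): start from a probability vector, compute the natural parameters η_j = log(p_j / p_k), and verify that the softmax-style inverse map recovers p.

```python
import numpy as np

def softmax_with_reference(eta):
    """Map natural parameters eta_j = log(p_j / p_k), j = 1..k-1, back to a probability vector."""
    eta_full = np.append(eta, 0.0)                 # the reference category k has eta_k = 0
    exp_eta = np.exp(eta_full - eta_full.max())    # subtract the max for numerical stability
    return exp_eta / exp_eta.sum()

p = np.array([0.3, 0.2, 0.5])
eta = np.log(p[:-1] / p[-1])                       # natural parameters (log-odds vs. category k)
assert np.allclose(softmax_with_reference(eta), p)
```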
Off-diagonal covariances are negative: Cov(Xi, Xj) = -npipj (the fixed-total constraint).
8.6 The Dirichlet Distribution
The Multinomial models counts given fixed probabilities. But what if the probabilities themselves are unknown? We need a distribution over probability vectors — a distribution on the simplex Δk−1={(p1,…,pk):pj≥0,∑pj=1}. This is the Dirichlet distribution.
Consider a language model trained on a corpus. The topic proportions of a document — say, 40% sports, 35% politics, 25% science — are a probability vector on the simplex. Different documents have different topic proportions. The Dirichlet distribution models this variability: each document’s topic proportions are a draw from a Dirichlet, and the concentration parameters α control how spread out or concentrated the draws are.
Definition 7 Dirichlet Distribution
Let α=(α1,α2,…,αk)T with αj>0 for all j. The random vector P=(P1,P2,…,Pk)T has the Dirichlet distributionDir(α) if its PDF on the simplex Δk−1 is
f(p ∣ α) = \frac{Γ(α_0)}{\prod_{j=1}^{k} Γ(α_j)} \prod_{j=1}^{k} p_j^{α_j − 1}
where α_0 = \sum_{j=1}^{k} α_j is the concentration sum (also called the total concentration or precision).
The normalizing constant \frac{Γ(α_0)}{\prod_j Γ(α_j)} is the reciprocal of the multivariate Beta function B(α) = \frac{\prod_j Γ(α_j)}{Γ(α_0)}.
The concentration parameters control the shape:
αj=1 for all j: Dir(1) is the uniform distribution on the simplex — all probability vectors are equally likely
αj>1 for all j: probability mass concentrates toward the center of the simplex (probability vectors with similar components)
αj<1 for all j: probability mass concentrates toward the corners and edges (sparse probability vectors with most mass on a few categories)
Large α0: draws are tightly concentrated near the mean vector E[P]
Small α0: draws are highly variable
When k = 2, the Dirichlet reduces to the Beta distribution: Dir(α_1, α_2) = Beta(α_1, α_2), since P_2 = 1 − P_1 and f(p_1) ∝ p_1^{α_1 − 1}(1 − p_1)^{α_2 − 1}.
Theorem 5 Dirichlet Moments
Let P∼Dir(α) with α0=∑j=1kαj. Then:
E[P_j] = \frac{α_j}{α_0}
Var(P_j) = \frac{α_j (α_0 − α_j)}{α_0^2 (α_0 + 1)}
Cov(P_i, P_j) = \frac{−α_i α_j}{α_0^2 (α_0 + 1)} for i ≠ j
The mean E[Pj]=αj/α0 is the proportion of the total concentration allocated to category j. The variance decreases as α0 increases — larger total concentration means more precise draws. The covariance is always negative, reflecting the simplex constraint ∑Pj=1 (same structural reason as the Multinomial’s negative covariance).
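The moment formulas can be checked by simulation. A sketch using NumPy's Dirichlet sampler (the α values and sample size are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = np.array([2.0, 3.0, 5.0])
a0 = alpha.sum()
samples = rng.dirichlet(alpha, size=200_000)

# Compare Monte Carlo estimates with the formulas in Theorem 5.
print(samples.mean(axis=0), alpha / a0)                                  # means
print(samples.var(axis=0), alpha * (a0 - alpha) / (a0**2 * (a0 + 1)))    # variances
cov_12 = np.cov(samples[:, 0], samples[:, 1])[0, 1]
print(cov_12, -alpha[0] * alpha[1] / (a0**2 * (a0 + 1)))                 # negative covariance
```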
The Dirichlet’s greatest significance in applied statistics and machine learning is its role as the conjugate prior for the Multinomial.
Theorem 6 Dirichlet-Multinomial Conjugacy
If the prior on p is Dir(α) and the likelihood is Mult(n,p) with observed counts x=(x1,…,xk)T, then the posterior is
p∣x∼Dir(α+x)
That is, p∣x∼Dir(α1+x1,α2+x2,…,αk+xk).
Proof
By Bayes’ theorem, the posterior density is proportional to the likelihood times the prior:
f(p∣x)∝P(X=x∣p)⋅f(p)
The Multinomial likelihood (keeping only the terms that depend on p) is
P(X = x ∣ p) ∝ \prod_{j=1}^{k} p_j^{x_j}
The Dirichlet prior density (again, keeping only p-dependent terms) is
f(p) ∝ \prod_{j=1}^{k} p_j^{α_j − 1}
Multiplying the two gives
f(p ∣ x) ∝ \prod_{j=1}^{k} p_j^{x_j + α_j − 1}
This is the kernel of a Dir(α + x) density. Since f(p ∣ x) must integrate to 1 over the simplex, the normalizing constant must be B(α + x)^{−1} = \frac{Γ(α_0 + n)}{\prod_{j=1}^{k} Γ(α_j + x_j)}, confirming the posterior is Dir(α + x).
The update rule is additive: each prior concentration αj is incremented by the observed count xj. The prior concentrations αj act as pseudo-counts — they represent the “data” implied by the prior. The total pseudo-sample size is α0, and after observing n real data points, the effective sample size is α0+n.
The posterior mean for category j is
E[P_j ∣ x] = \frac{α_j + x_j}{α_0 + n}
This is a weighted average of the prior mean αj/α0 and the MLE xj/n:
E[P_j ∣ x] = \frac{α_0}{α_0 + n} \cdot \frac{α_j}{α_0} + \frac{n}{α_0 + n} \cdot \frac{x_j}{n}
As n→∞, the posterior mean converges to the MLE — the data overwhelms the prior, regardless of the choice of α.
◼
Example 2 Conjugate Updating with Observed Data
Suppose we are estimating the topic proportions for a corpus with k=3 topics: sports, politics, and science. We start with a weakly informative prior Dir(2,2,2) — slight preference for uniform proportions, with pseudo-sample size α0=6.
We classify n=30 documents and observe counts x=(12,10,8)T.
Prior: Dir(2, 2, 2) with prior means (1/3, 1/3, 1/3).
Posterior: Dir(2+12, 2+10, 2+8) = Dir(14, 12, 10) with posterior means:
\left(\tfrac{14}{36}, \tfrac{12}{36}, \tfrac{10}{36}\right) ≈ (0.389, 0.333, 0.278)
MLE: \hat{p}_j = x_j / n, so (\hat{p}_1, \hat{p}_2, \hat{p}_3) = (0.400, 0.333, 0.267).
The posterior means are pulled slightly toward the prior mean of 1/3 relative to the MLE — the Bayesian shrinkage effect. The shrinkage is mild because n=30 is much larger than α0=6. If we had used α0=300 (a very strong prior), the posterior means would be much closer to 1/3.
Now observe 20 more documents with counts (4,8,8). The posterior updates again:
Dir(14+4,12+8,10+8)=Dir(18,20,18)
with posterior means (18/56,20/56,18/56)≈(0.321,0.357,0.321). The second batch shifted the estimate toward politics. Sequential Bayesian updating works by using yesterday’s posterior as today’s prior — the algebra is the same at each step.
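The additive update is one line of code. A sketch (NumPy; the counts are those of Example 2) reproducing the two-batch update above:

```python
import numpy as np

alpha = np.array([2.0, 2.0, 2.0])           # prior Dir(2, 2, 2), pseudo-sample size 6

x1 = np.array([12, 10, 8])                  # first batch of 30 documents
alpha = alpha + x1                          # posterior Dir(14, 12, 10)
print(alpha, alpha / alpha.sum())           # posterior means approx (0.389, 0.333, 0.278)

x2 = np.array([4, 8, 8])                    # second batch of 20 documents
alpha = alpha + x2                          # posterior Dir(18, 20, 18)
print(alpha, alpha / alpha.sum())           # posterior means approx (0.321, 0.357, 0.321)
```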
The Dirichlet-Multinomial conjugacy is the backbone of Latent Dirichlet Allocation (LDA), the foundational topic model in NLP. In LDA, each document’s topic proportions are drawn from a Dir(α), and the word counts for each topic follow a Multinomial. The conjugacy makes the posterior over topic proportions tractable, enabling variational inference and collapsed Gibbs sampling. See Topic 26 §26.3 Remark 10 for collapsed Gibbs sampling and formalML for variational inference.
Interactive: Dirichlet Simplex Explorer — adjust the concentration parameters α = (α₁, α₂, α₃) and watch samples from Dir(α) scatter over the simplex, together with the total concentration α₀, the marginal means and variances of each P_j, and the mean vector E[P].
8.7 Covariance Matrices: Geometry and Structure
The covariance matrix Σ encodes the geometry of a multivariate distribution. Its diagonal entries are variances; its off-diagonal entries are covariances. But Σ is more than a table of numbers — it defines an inner product, a notion of distance, and a set of principal axes. Understanding this geometry is essential for PCA, Gaussian discriminant analysis, and the interpretation of MVN contour plots.
Definition 8 Covariance Matrix
The covariance matrix of a random vector X=(X1,…,Xp)T with mean μ=E[X] is the p×p matrix
Σ=E[(X−μ)(X−μ)T]
The (i,j) entry of Σ is Σij=Cov(Xi,Xj)=E[(Xi−μi)(Xj−μj)]. The diagonal entries Σjj=Var(Xj) are the variances of the individual components.
Equivalently, Σ=E[XXT]−μμT, which is the matrix analog of Var(X)=E[X2]−(E[X])2.
Definition 9 Mahalanobis Distance
The Mahalanobis distance from a point x to a distribution with mean μ and covariance Σ is
d_M(x, μ) = \sqrt{(x − μ)^T Σ^{−1} (x − μ)}
and the squared Mahalanobis distance is
d_M^2(x, μ) = (x − μ)^T Σ^{−1} (x − μ)
Unlike Euclidean distance, Mahalanobis distance accounts for the scale and correlation of the variables. Two points that are equidistant from μ in Euclidean terms may be very different in Mahalanobis terms if they lie in different directions relative to the covariance structure.
For X ∼ N_p(μ, Σ), the squared Mahalanobis distance d_M^2(X, μ) ∼ χ^2_p. This is why MVN contours (sets of constant d_M^2) are used for hypothesis testing and outlier detection: a point with d_M^2 > χ^2_{p, 0.95} is an outlier at the 5% level.
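A short outlier-detection sketch (NumPy/SciPy; the parameters and sample size are illustrative): for MVN data, flagging points whose squared Mahalanobis distance exceeds the χ²_{p,0.95} quantile flags roughly 5% of them, as the chi-square result predicts.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=10_000)

diff = X - mu
d2 = np.einsum("ij,ij->i", diff @ np.linalg.inv(Sigma), diff)   # squared Mahalanobis distances

threshold = chi2(df=2).ppf(0.95)        # chi-square quantile with p = 2 degrees of freedom
outlier_rate = (d2 > threshold).mean()
print(outlier_rate)                     # approx 0.05
```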
Theorem 7 Properties of Covariance Matrices
The covariance matrix Σ of any random vector X has the following properties:
Symmetry.Σ=ΣT, since Cov(Xi,Xj)=Cov(Xj,Xi).
Positive semi-definiteness. For any vector a∈Rp, aTΣa≥0.
Spectral decomposition. Since Σ is symmetric and PSD, it has a spectral decomposition Σ=QΛQT, where Q is an orthogonal matrix of eigenvectors and Λ=diag(λ1,…,λp) is a diagonal matrix of non-negative eigenvalues.
The eigenvectors q_1, …, q_p are the principal axes of the distribution — the directions of maximum and minimum variance. The eigenvalues λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_p ≥ 0 are the variances along each principal axis. For the MVN, the ellipsoidal contours have semi-axes along q_j with lengths proportional to \sqrt{λ_j}.
Proof
Proof of positive semi-definiteness.
For any fixed vector a∈Rp:
aTΣa=aTE[(X−μ)(X−μ)T]a
Since a is a constant vector, we can move it inside the expectation:
=E[aT(X−μ)(X−μ)Ta]
Let Y=aT(X−μ). This is a scalar random variable (a linear combination of the components of X−μ). The expression becomes:
=E[Y2]
Since Y2≥0 for every outcome, E[Y2]≥0. Therefore aTΣa≥0 for all a∈Rp.
Moreover, aTΣa=0 if and only if E[Y2]=0, which happens if and only if Y=aT(X−μ)=0 with probability 1. This means Σ is positive definite (not just semi-definite) when no non-trivial linear combination of the components is degenerate — which is the generic case, and the case assumed for the MVN with invertible Σ.
◼
Remark 3 Wishart and Inverse-Wishart Distributions
When the covariance matrix Σ itself is unknown and we want to perform Bayesian inference, we need a prior distribution on positive definite matrices. The Wishart distribution W_p(V, n) is the multivariate generalization of the Chi-squared distribution — if X_1, …, X_n are i.i.d. N_p(0, Σ), then \sum_{i=1}^{n} X_i X_i^T ∼ W_p(Σ, n).
The Inverse-Wishart distribution W_p^{−1}(Ψ, ν) is the distribution of the inverse of a Wishart-distributed matrix, and it serves as the conjugate prior for the covariance matrix of a multivariate Normal. If we observe data from N_p(μ, Σ) with an Inverse-Wishart prior on Σ, the posterior on Σ is also Inverse-Wishart with updated parameters.
The full development of Wishart-based inference, including the Normal-Inverse-Wishart conjugate family for joint inference on (μ,Σ), belongs to Bayesian Computation.
The spectral decomposition Σ=QΛQT is the mathematical core of Principal Component Analysis. The eigenvectors qj are the principal directions, and the eigenvalues λj are the explained variances. PCA projects data onto the top eigenvectors — the directions of maximum variance — to reduce dimensionality while preserving as much information as possible. See formalML: for the full treatment, including the connection between the population covariance eigendecomposition (developed here) and the sample covariance eigendecomposition (used in practice).
8.8 Multivariate Transformations
In one dimension, if Y=g(X) and g is invertible with g−1 differentiable, then fY(y)=fX(g−1(y))⋅∣(g−1)′(y)∣. The multivariate generalization replaces the absolute derivative with the absolute value of the Jacobian determinant — the natural measure of how a transformation stretches or compresses volume in Rp. This requires the multivariable change of variables formula from formalCalculus: .
Theorem 8 Multivariate Change of Variables
Let X be a continuous random vector in Rp with PDF fX. Let g:Rp→Rp be an invertible, continuously differentiable transformation with inverse g−1. If Y=g(X), then the PDF of Y is
f_Y(y) = f_X(g^{−1}(y)) \cdot \left|\det J_{g^{−1}}(y)\right|
where J_{g^{−1}}(y) is the p×p Jacobian matrix of g^{−1} evaluated at y, with entries [J_{g^{−1}}]_{ij} = \frac{∂[g^{−1}]_i}{∂y_j}.
The Jacobian determinant ∣detJ∣ measures the local volume distortion: if the transformation locally stretches volumes by a factor of c, then densities are locally compressed by a factor of 1/c to preserve total probability.
The most important application of the change of variables formula for ML is Cholesky sampling — the method used to generate samples from an arbitrary MVN using only standard Normal samples.
Proof
Cholesky sampling produces the correct distribution.
Let Z∼Np(0,I) (a standard multivariate Normal with independent components). Let Σ=LLT be the Cholesky decomposition of Σ, where L is a lower triangular matrix with positive diagonal entries (which exists because Σ is positive definite). Define the affine transformation
X=μ+LZ
We claim that X∼Np(μ,Σ).
Mean.E[X]=μ+LE[Z]=μ+L⋅0=μ.
Covariance.Cov(X)=Cov(LZ)=LCov(Z)LT=LILT=LLT=Σ.
Normality. By Property 2 of Theorem 2, any affine transformation of a multivariate Normal is multivariate Normal. Since Z∼Np(0,I) and X=LZ+μ, we have X∼Np(μ,Σ).
Alternatively, via the change of variables formula. The transformation is g(z)=μ+Lz, with inverse g−1(x)=L−1(x−μ). The Jacobian of the inverse is Jg−1=L−1, so
∣detJg−1∣=∣detL−1∣=∣detL∣−1=∣Σ∣−1/2
since |\det L|^2 = \det(L L^T) = |Σ|. Substituting into the change of variables formula:
f_X(x) = f_Z\!\left(L^{−1}(x − μ)\right) \cdot |Σ|^{−1/2} = (2π)^{−p/2} |Σ|^{−1/2} \exp\!\left(−\tfrac{1}{2}(x − μ)^T (L L^T)^{−1} (x − μ)\right) = (2π)^{−p/2} |Σ|^{−1/2} \exp\!\left(−\tfrac{1}{2}(x − μ)^T Σ^{−1} (x − μ)\right)
This is the PDF of N_p(μ, Σ), confirming that the Cholesky sampling procedure produces the correct distribution.
◼
The Cholesky sampling formula x=μ+Lz is not just a computational convenience — it is the mathematical foundation of the reparameterization trick used in variational autoencoders (VAEs). In a VAE, the encoder outputs μ and L (or a diagonal approximation), and the decoder needs to sample from N(μ,Σ). Direct sampling would block gradient flow, but writing x=μ+Lε with ε∼N(0,I) makes the sampling deterministic given ε, allowing backpropagation through the sampling step. See formalML: for the full story.
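A minimal Cholesky-sampling sketch (NumPy; the μ and Σ values are illustrative) implementing x = μ + Lz and checking the sample mean and covariance against their targets:

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, -1.0, 0.0])
Sigma = np.array([[2.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.5]])

L = np.linalg.cholesky(Sigma)            # Sigma = L L^T, L lower triangular
Z = rng.standard_normal((100_000, 3))    # Z ~ N(0, I)
X = mu + Z @ L.T                         # each row is mu + L z, so X ~ N(mu, Sigma)

print(X.mean(axis=0))                    # approx mu
print(np.cov(X, rowvar=False))           # approx Sigma
```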
Remark 4 Normalizing Flows and the Jacobian Determinant
The change of variables formula from Theorem 8 is the core mathematical tool behind normalizing flows. A normalizing flow models a complex target distribution fY by learning an invertible transformation g from a simple base distribution (typically Z∼Np(0,I)) to the target:
fY(y)=fZ(g−1(y))⋅∣detJg−1(y)∣
The entire architecture of a normalizing flow — RealNVP, Glow, Neural Spline Flows — is designed to ensure that (1) g is invertible, (2) the Jacobian determinant is efficiently computable, and (3) the transformation is flexible enough to model complex densities. The constraint of efficient Jacobian computation drives the use of triangular Jacobians (for which detJ=∏iJii, computed in O(p) rather than O(p3)). See formalML: for the full treatment.
8.9 Copulas: Separating Marginals from Dependence
Here is a question that sounds simple but has deep consequences: can two random vectors have the same marginal distributions but different joint distributions?
Yes, absolutely. Topic 3 flagged this point: the marginals alone do not determine the joint. Two variables X and Y can both be Exponential(1), but their joint distribution could show positive dependence, negative dependence, tail dependence, or no dependence at all — all while preserving the same marginals. The structure that separates marginals from dependence is called a copula.
As Topic 3 promised: “Specifying the dependence structure is the job of copulas.” We now deliver on that promise.
Definition 10 Copula
A copula is a joint CDF C:[0,1]p→[0,1] whose marginal distributions are all Uniform(0,1).
That is, C(u1,…,up) is a valid joint CDF on [0,1]p satisfying:
C(u1,…,uj−1,0,uj+1,…,up)=0 for any j (if any argument is 0, the probability is 0)
C(1,…,1,uj,1,…,1)=uj for each j (each marginal is Uniform(0,1))
C is non-decreasing in each argument and satisfies the rectangle inequality (ensuring probabilities of rectangles are non-negative)
The power of copulas comes from the following fundamental theorem, which says that any joint distribution can be decomposed into marginals plus a copula.
Theorem 9 Sklar's Theorem
Let F be a joint CDF of a random vector X=(X1,…,Xp)T with marginal CDFs F1,…,Fp. Then there exists a copula C:[0,1]p→[0,1] such that for all (x1,…,xp)∈Rp,
F(x1,…,xp)=C(F1(x1),…,Fp(xp))
If F1,…,Fp are all continuous, then C is unique. Conversely, for any copula C and any marginal CDFs F1,…,Fp, the function F(x1,…,xp)=C(F1(x1),…,Fp(xp)) is a valid joint CDF with marginals F1,…,Fp.
The proof of Sklar’s theorem requires measure-theoretic machinery that is beyond the scope of this topic. We state it without proof and focus on its consequences.
Sklar’s theorem is a decomposition result: the joint CDF F decomposes into marginals (F1,…,Fp, which determine the individual behavior of each variable) and a copula (C, which determines the dependence structure). This decomposition is modular: we can change the marginals while keeping the copula fixed, or change the copula while keeping the marginals fixed.
Three important copula families:
Independence copula.C⊥(u1,…,up)=u1⋅u2⋯up=∏j=1puj. This copula encodes independence: the joint CDF is the product of the marginal CDFs. No dependence structure at all.
Gaussian copula. Let R be a p×p correlation matrix and Φ the standard Normal CDF. The Gaussian copula is
C_R^{Gauss}(u_1, …, u_p) = Φ_R(Φ^{−1}(u_1), …, Φ^{−1}(u_p))
where ΦR is the joint CDF of Np(0,R). The Gaussian copula applies the Normal quantile transform to each margin, computes the multivariate Normal joint CDF, and maps back. It inherits the correlation structure of the MVN but can be paired with any marginals — not just Normal ones.
Student-t copula. Same construction as the Gaussian copula, but using the multivariate Student-t distribution with ν degrees of freedom instead of the Normal. The critical difference: the Student-t copula exhibits tail dependence — extreme values in one variable are associated with extreme values in another.
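A sketch of sampling from a Gaussian copula with non-Normal marginals (NumPy/SciPy; the correlation value and the Exponential(1) marginals are illustrative choices): correlated Normals are pushed through Φ to get dependent Uniforms, then through inverse CDFs to attach the desired marginals — Sklar's theorem run in reverse.

```python
import numpy as np
from scipy.stats import norm, expon

rng = np.random.default_rng(5)
rho = 0.7
R = np.array([[1.0, rho], [rho, 1.0]])      # copula correlation matrix

# Draw from the Gaussian copula: correlated Normals pushed through Phi.
Z = rng.multivariate_normal(np.zeros(2), R, size=100_000)
U = norm.cdf(Z)                             # each column is Uniform(0, 1); dependence preserved

# Attach arbitrary marginals via the inverse-CDF transform.
X1 = expon(scale=1.0).ppf(U[:, 0])          # Exponential(1) marginal
X2 = expon(scale=1.0).ppf(U[:, 1])          # Exponential(1) marginal

print(np.corrcoef(X1, X2)[0, 1])            # dependent, though both marginals are Exponential(1)
```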
Remark 5 Tail Dependence: Gaussian vs. Student-t Copulas
Two variables with a Gaussian copula can have strong overall dependence (high ρ), but their tail dependence is exactly zero. This means that extreme events — both variables being in their top 1%, say — are no more likely than independent extreme events, conditional on at least one being extreme. This property caused serious problems in financial risk modeling during the 2008 financial crisis: Gaussian copula models drastically underestimated the probability of joint extreme losses.
The Student-t copula with finite ν has positive tail dependence that increases as ν decreases. With ν=3, extreme co-movement is far more likely than the Gaussian copula predicts. When modeling correlated risks (portfolio losses, simultaneous system failures, climate extremes), the choice between Gaussian and Student-t copulas has real consequences for tail risk estimation.
The coefficient of (upper) tail dependence is defined as λU=limu→1−P(U1>u∣U2>u), where (U1,U2) follows the copula. For the Gaussian copula with ∣ρ∣<1, λU=0. For the Student-t copula with correlation ρ and ν degrees of freedom, λU>0 whenever ρ>−1.
8.10 Connections to Machine Learning
The multivariate distributions developed in this topic are the computational backbone of modern machine learning. This section traces five concrete connections, starting with the most important.
Example 3 GP Prediction as Conditional MVN
Consider a Gaussian process f∼GP(m(⋅),k(⋅,⋅)) with mean function m and kernel k. We observe function values fobs=(f(x1),…,f(xn))T at training inputs x1,…,xn and want to predict function values f∗=(f(x∗(1)),…,f(x∗(m)))T at new inputs.
By the definition of a GP, any finite collection of function values is jointly multivariate Normal. Stacking observed and predicted values:
\begin{pmatrix} f_* \\ f_{obs} \end{pmatrix} ∼ N\!\left( \begin{pmatrix} m_* \\ m_{obs} \end{pmatrix}, \begin{pmatrix} K_{**} & K_{*,obs} \\ K_{obs,*} & K_{obs,obs} \end{pmatrix} \right)
where the blocks of the covariance matrix are the kernel k evaluated at the corresponding inputs. Applying the conditional MVN formula (Theorem 3):
μ_{*∣obs} = m_* + K_{*,obs} K_{obs,obs}^{−1} (f_{obs} − m_{obs}), \qquad Σ_{*∣obs} = K_{**} − K_{*,obs} K_{obs,obs}^{−1} K_{obs,*}
These are exactly the standard GP prediction equations. The posterior mean μ_{*∣obs} is the prior mean plus a correction based on the observed residuals f_{obs} − m_{obs}. The posterior covariance Σ_{*∣obs} is the prior covariance minus the reduction due to the observations — the Schur complement. The conditional covariance does not depend on f_{obs} (only on the input locations), which means the GP’s uncertainty is determined by where we observe, not what we observe.
The GP prediction formula is not an approximation — it is the exact conditional MVN formula, applied to the specific joint MVN defined by the kernel. See formalML: for the full development.
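A compact GP-prediction sketch (NumPy; the RBF kernel, the inputs, and the zero mean function are illustrative assumptions, and a small jitter is added for numerical stability) that applies the conditional MVN formula directly:

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix between 1-D input vectors a and b (illustrative choice)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

x_train = np.array([-2.0, -0.5, 1.0, 2.5])
f_obs = np.sin(x_train)                       # observed (noise-free) function values
x_star = np.linspace(-3, 3, 7)                # test inputs

K_oo = rbf_kernel(x_train, x_train) + 1e-8 * np.eye(len(x_train))   # jitter for stability
K_so = rbf_kernel(x_star, x_train)
K_ss = rbf_kernel(x_star, x_star)

# Conditional MVN formula (Theorem 3) with zero mean function:
mu_post = K_so @ np.linalg.solve(K_oo, f_obs)
Sigma_post = K_ss - K_so @ np.linalg.solve(K_oo, K_so.T)

print(mu_post)                 # posterior mean at the test inputs
print(np.diag(Sigma_post))     # posterior variances, independent of the observed values
```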
PCA as eigendecomposition of the sample covariance matrix. Given data x_1, …, x_n ∈ R^p, the sample covariance matrix is \hat{Σ} = \frac{1}{n−1} \sum_{i=1}^{n} (x_i − \bar{x})(x_i − \bar{x})^T. PCA computes the spectral decomposition \hat{Σ} = Q Λ Q^T and projects onto the top eigenvectors. The eigenvalues λ_1 ≥ ⋯ ≥ λ_p are the variances explained by each principal component, and the cumulative proportion \sum_{j=1}^{k} λ_j / \sum_{j=1}^{p} λ_j determines how many components to retain. When the data are MVN, PCA gives the unique directions that maximize the variance of the projections — the connection between Theorem 7 and the PCA algorithm. See formalML.
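A minimal PCA sketch via the eigendecomposition of the sample covariance (NumPy; the simulated data and the choice to retain two components are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.multivariate_normal([0, 0, 0],
                            [[3.0, 1.2, 0.4],
                             [1.2, 2.0, 0.3],
                             [0.4, 0.3, 0.5]], size=5_000)

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / (len(X) - 1)                 # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)         # spectral decomposition (ascending eigenvalues)
order = np.argsort(eigvals)[::-1]            # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = np.cumsum(eigvals) / eigvals.sum()
scores = Xc @ eigvecs[:, :2]                 # project onto the top two principal components
print(explained)                             # cumulative proportion of variance explained
```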
Variational inference with MVN variational families. In variational inference, we approximate a complex posterior p(z∣x) with a tractable distribution q(z), chosen to minimize the KL divergence KL(q∥p). The most common choice for q is the multivariate Normal. A full-covariance variational family q(z)=Np(μ,Σ) has p+p(p+1)/2 parameters (the mean vector and the full covariance matrix). A diagonal (mean-field) variational family q(z)=Np(μ,diag(σ2)) has only 2p parameters but assumes independence — it misses all correlations. The reparameterization trick from Section 8.8 (z=μ+Lε) enables gradient-based optimization of the ELBO with respect to μ and L.
LDA as Dirichlet prior with Multinomial likelihood. In Latent Dirichlet Allocation, each document d has topic proportions θd∼Dir(α), each topic k has a word distribution ϕk∼Dir(β), and the words in document d are drawn from a Multinomial governed by the topic mixture. The entire generative model is: draw topic proportions from a Dirichlet, draw a topic for each word from a Multinomial over topics, then draw the word from a Multinomial over the vocabulary. The Dirichlet-Multinomial conjugacy (Theorem 6) makes collapsed Gibbs sampling tractable: we can integrate out θd and ϕk analytically, sampling only the topic assignments.
BNN weight priors as multivariate Normal. In Bayesian neural networks, a common prior on the weight vector w∈Rp is w∼Np(0,σ2I) — a spherical MVN centered at the origin. This prior regularizes the weights (equivalent to L2 regularization in the MAP limit) and defines a prior over functions. The posterior p(w∣D) is approximately MVN for well-specified models (by the Bernstein-von Mises theorem), with the Hessian of the negative log-posterior at the MAP estimate serving as the inverse covariance (the Laplace approximation). Topic 25 §25.8 proves BvM in the scalar case and Rem 16 handles the multivariate Laplace extension. See formalML: .
8.11 Summary
This topic completes Track 2: Core Distributions and Families. With Topics 5-8 now in place, we have a complete catalog of the distributions that appear throughout statistics and machine learning, along with the structural frameworks (exponential families, multivariate distributions) that connect them.
The conditional MVN formula (Theorem 3) is the single most consequential result. The conditional mean μ1∣2=μ1+Σ12Σ22−1(x2−μ2) is linear in the observed value, and the conditional covariance Σ1∣2=Σ11−Σ12Σ22−1Σ21 (the Schur complement) is constant. This formula is the GP prediction formula, the Kalman filter update, and the engine behind Bayesian linear regression.
The Dirichlet-Multinomial conjugacy (Theorem 6) gives the cleanest example of Bayesian updating: prior Dir(α) plus observed counts x yields posterior Dir(α+x). This powers LDA topic models and categorical Bayesian inference.
The spectral decomposition of the covariance matrix (Theorem 7) Σ=QΛQT gives the principal axes and variances of the distribution. This is the mathematical foundation of PCA.
The change of variables formula (Theorem 8) with the Jacobian determinant is the mathematical tool behind normalizing flows and the reparameterization trick.
Sklar’s theorem (Theorem 9) decomposes any joint distribution into marginals plus a copula, enabling the separation of individual behavior from dependence structure.
What comes next. Track 3 (Convergence and Limit Theorems) develops the asymptotic theory that makes statistics work in practice. The Law of Large Numbers explains why sample means converge to population means. The Central Limit Theorem explains why so many estimators are asymptotically Normal — including the multivariate Normal. The Delta Method, Slutsky’s theorem, and the Continuous Mapping Theorem provide the tools for deriving the limiting distributions of complex statistics. The multivariate Normal developed here is the target of most convergence results — virtually every well-behaved estimator converges to a multivariate Normal distribution as sample size grows.