foundational 55 min read · April 11, 2026

Expectation, Variance & Moments

The center of mass, the spread, and the shape — the numerical summaries that reduce distributions to the quantities that drive all of statistical inference and machine learning.

1. Expectation: The Center of Mass

In Topic 3, we built the machinery of random variables, PMFs, PDFs, and CDFs — the full description of how probability is distributed over numbers. But a full distribution is a lot of information. Often we need a single number that summarizes the “location” of a distribution: where is the probability concentrated? What value do we “expect” to see?

The expectation (or expected value, or mean) of a random variable answers this question. It is the center of mass of the distribution — the balance point. If you placed the PMF bars (or PDF curve) on a number line and balanced it on a fulcrum, the balance point would be $E[X]$.

Definition 1 Expectation (Discrete and Continuous)

Let $X$ be a random variable.

Discrete case. If $X$ takes values in a countable set $\{x_1, x_2, \ldots\}$ with PMF $p_X$, the expectation of $X$ is

$$E[X] = \sum_{i} x_i \, p_X(x_i)$$

provided $\sum_i |x_i| \, p_X(x_i) < \infty$ (absolute convergence).

Continuous case. If $X$ has PDF $f_X$, the expectation of $X$ is

$$E[X] = \int_{-\infty}^{\infty} x \, f_X(x) \, dx$$

provided $\int_{-\infty}^{\infty} |x| \, f_X(x) \, dx < \infty$ (absolute integrability).

The expectation is also written $\mu$, $\mu_X$, or $\mathbb{E}[X]$.

Remark Absolute convergence is essential

The absolute convergence condition $E[|X|] < \infty$ is not a technicality — without it, the expectation can depend on the order of summation or the way we partition the integral. The Cauchy distribution with PDF $f(x) = \frac{1}{\pi(1 + x^2)}$ is the standard example: $\int_{-\infty}^{\infty} |x| f(x) \, dx = \infty$, so $E[X]$ does not exist. If you compute the Cauchy principal value $\lim_{R \to \infty} \int_{-R}^{R} x f(x) \, dx$, you get 0 — but that's a cancellation artifact, not a genuine expectation. The absolute convergence condition from Calculus: Sequences & Limits ensures the expectation is well-defined regardless of ordering.

Three-panel figure showing expectation as center of mass: fair die with E[X]=3.5, loaded die with E[X]=4.85, and Exponential(1.5) with E[X]=0.667, each with a balance-point triangle

The expectation is a weighted average of the values, weighted by their probabilities. For the fair die, $E[X] = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = 3.5$. For the loaded die that favors high rolls, the balance point shifts rightward. For a continuous distribution, the sum becomes an integral but the idea is identical: multiply each value by its probability density and integrate.

One of the most useful tools for computing expectations is LOTUS — the Law of the Unconscious Statistician. It lets us compute $E[g(X)]$ directly from the distribution of $X$, without first finding the distribution of $g(X)$.

Theorem 1 LOTUS (Law of the Unconscious Statistician)

Let $X$ be a random variable and $g : \mathbb{R} \to \mathbb{R}$ a function.

Discrete case: $E[g(X)] = \sum_{x} g(x) \, p_X(x)$

Continuous case: $E[g(X)] = \int_{-\infty}^{\infty} g(x) \, f_X(x) \, dx$

provided the sum/integral converges absolutely.

Proof.

Discrete case. Let $Y = g(X)$. We need to show that computing $E[Y]$ via the PMF of $Y$ gives the same result as summing $g(x) \, p_X(x)$ over the support of $X$.

The PMF of $Y$ is $p_Y(y) = P(g(X) = y) = \sum_{x : g(x) = y} p_X(x)$. Therefore:

$$E[Y] = \sum_{y} y \, p_Y(y) = \sum_{y} y \sum_{x : g(x) = y} p_X(x)$$

Swapping the order of summation — every $x$ appears in exactly one group (the group indexed by $y = g(x)$):

$$= \sum_{x} g(x) \, p_X(x)$$

The continuous case follows by the same argument with integrals and the change of variables formula (see Calculus: Change of Variables). $\square$

LOTUS is called the “law of the unconscious statistician” because students often apply it without thinking — and it works. The name is mildly pejorative, but the theorem is anything but trivial: it saves you from having to derive the distribution of g(X)g(X) before computing the expectation.

Example 1 Die roll expectation

Roll a fair die. $X \in \{1,2,3,4,5,6\}$ with $p_X(k) = 1/6$ for each $k$.

$$E[X] = \sum_{k=1}^{6} k \cdot \frac{1}{6} = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = \frac{21}{6} = 3.5$$

Notice that $E[X] = 3.5$ is not a value $X$ can actually take — this is normal. The expectation is the center of mass, not a mode or a median.

Using LOTUS, $E[X^2] = \sum_{k=1}^{6} k^2 \cdot \frac{1}{6} = \frac{1}{6}(1 + 4 + 9 + 16 + 25 + 36) = \frac{91}{6} \approx 15.17$.
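These two sums are easy to verify mechanically. A minimal sketch in exact rational arithmetic (the PMF dictionary is just an illustration of the definition):

```python
from fractions import Fraction

# PMF of a fair die: each face 1..6 has probability 1/6
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

# E[X]: weighted average of the values
mean = sum(x * p for x, p in pmf.items())

# LOTUS: E[X^2] without ever finding the distribution of X^2
second_moment = sum(x**2 * p for x, p in pmf.items())

print(mean)           # 7/2
print(second_moment)  # 91/6
```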

Example 2 Exponential expectation

Let $X \sim \text{Exp}(\lambda)$ with PDF $f_X(x) = \lambda e^{-\lambda x}$ for $x \geq 0$.

$$E[X] = \int_0^{\infty} x \cdot \lambda e^{-\lambda x} \, dx$$

Using integration by parts (with $u = x$, $dv = \lambda e^{-\lambda x} dx$) from Calculus: Integration by Parts:

$$= \left[-x e^{-\lambda x}\right]_0^{\infty} + \int_0^{\infty} e^{-\lambda x} \, dx = 0 + \frac{1}{\lambda} = \frac{1}{\lambda}$$

An Exponential($\lambda$) random variable has mean $1/\lambda$. If a server processes requests at rate $\lambda = 5$ per second, the mean inter-arrival time is $1/5 = 0.2$ seconds.
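The $1/\lambda$ result can be sanity-checked by simulation; a quick sketch using the standard library (the sample size and seed are arbitrary):

```python
import random

random.seed(0)
lam = 5.0  # rate: 5 requests per second

# random.expovariate takes the rate parameter lambda, not the mean
samples = [random.expovariate(lam) for _ in range(200_000)]
estimate = sum(samples) / len(samples)

# Should land very close to the theoretical mean 1/lambda = 0.2
assert abs(estimate - 1 / lam) < 0.01
```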

Use the explorer below to visualize expectation as the balance point. Toggle between discrete and continuous distributions, or enter your own probability values:

[Interactive: Expectation Balance Explorer — for the fair die, $E[X] = 3.5$, $E[X^2] \approx 15.1667$, $\text{Var}(X) \approx 2.9167$]

2. Properties of Expectation

Expectation is a linear operation — and this is its single most powerful property. Linearity holds without any independence assumption.

Theorem 2 Linearity of Expectation

For any random variables $X$ and $Y$ (with finite expectations) and constants $a, b \in \mathbb{R}$:

$$E[aX + bY] = a \, E[X] + b \, E[Y]$$

Proof.

We prove the discrete case; the continuous case is analogous with integrals replacing sums.

Let $(X, Y)$ have joint PMF $p_{X,Y}(x, y)$. Then:

$$E[aX + bY] = \sum_x \sum_y (ax + by) \, p_{X,Y}(x, y)$$

Expanding the sum:

$$= a \sum_x \sum_y x \, p_{X,Y}(x, y) + b \sum_x \sum_y y \, p_{X,Y}(x, y)$$

The inner sum in the first term: $\sum_y p_{X,Y}(x, y) = p_X(x)$ (the marginal PMF of $X$, from Topic 3). So:

$$= a \sum_x x \, p_X(x) + b \sum_y y \, p_Y(y) = a \, E[X] + b \, E[Y]$$

No independence was used — only the existence of marginals from the joint. $\square$

Remark Linearity requires no independence

This is worth emphasizing: $E[X + Y] = E[X] + E[Y]$ always, even when $X$ and $Y$ are dependent. The proof uses only marginalization, not factorization of the joint. This makes linearity enormously useful — we can compute the expectation of a sum as a sum of expectations even when the summands are tangled together in complex ways. The classic application: expected number of fixed points in a random permutation (Example 3 below).

Three-panel figure showing linearity of expectation: Bin(5,0.4) PMF, Bin(5,0.6) PMF, and convolution PMF of X+Y with E[X+Y] = E[X]+E[Y]
Theorem 3 Monotonicity

If $X \leq Y$ almost surely (i.e., $P(X \leq Y) = 1$), then $E[X] \leq E[Y]$.

Proof.

Define $Z = Y - X$. Since $X \leq Y$ a.s., we have $Z \geq 0$ a.s. For a nonnegative random variable, $E[Z] = \sum_z z \, p_Z(z) \geq 0$ (every term is nonnegative). So $E[Y] - E[X] = E[Y - X] = E[Z] \geq 0$. $\square$

Theorem 4 Expectation of Constants

For any constant $c \in \mathbb{R}$: $E[c] = c$.

The proof is immediate: a constant random variable has PMF concentrated at a single point, so $E[c] = c \cdot 1 = c$.

Theorem 5 Expectation of Independent Products

If $X$ and $Y$ are independent random variables with finite expectations, then

$$E[XY] = E[X] \cdot E[Y]$$

Proof.

Since $X \perp Y$, the joint PMF factors: $p_{X,Y}(x, y) = p_X(x) \cdot p_Y(y)$ (from Topic 2 and Topic 3). Then:

$$E[XY] = \sum_x \sum_y xy \, p_{X,Y}(x, y) = \sum_x \sum_y xy \, p_X(x) \, p_Y(y)$$

Factoring:

$$= \left(\sum_x x \, p_X(x)\right) \left(\sum_y y \, p_Y(y)\right) = E[X] \cdot E[Y]$$

\square

Remark E[XY] = E[X]E[Y] does not imply independence

The converse is false. If $X \sim \text{Uniform}\{-1, 0, 1\}$ and $Y = X^2$, then $E[XY] = E[X^3] = 0 = E[X] \cdot E[Y]$, but $X$ and $Y$ are clearly dependent ($Y$ is a deterministic function of $X$). The condition $E[XY] = E[X]E[Y]$ is called uncorrelatedness — it is strictly weaker than independence.

Example 3 Expected number of matches (linearity without independence)

Randomly shuffle $n$ cards labeled $1, \ldots, n$. A match (or fixed point) occurs at position $i$ if card $i$ lands in position $i$. Let $M = \sum_{i=1}^{n} X_i$ where $X_i = \mathbf{1}\{\text{card } i \text{ is in position } i\}$.

The $X_i$'s are dependent (if card 1 is in position 1, the remaining cards are shuffled among $n - 1$ positions, changing the probabilities for $X_2, \ldots, X_n$). But linearity doesn't care:

$$E[M] = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n} P(\text{card } i \text{ in position } i) = \sum_{i=1}^{n} \frac{1}{n} = 1$$

The expected number of matches is exactly 1, regardless of $n$. This surprising result — the same whether you shuffle 10 cards or 10 million — follows effortlessly from linearity.
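A simulation makes the result vivid — the empirical average of matches hovers near 1 for any deck size (a sketch; trial counts and seed are arbitrary):

```python
import random

random.seed(1)

def num_fixed_points(n: int) -> int:
    """Shuffle 0..n-1 and count positions i where perm[i] == i."""
    perm = list(range(n))
    random.shuffle(perm)
    return sum(1 for i, v in enumerate(perm) if i == v)

for n in (10, 1000):
    trials = 20_000
    avg = sum(num_fixed_points(n) for _ in range(trials)) / trials
    # E[M] = 1 regardless of n, so the average lands within a few
    # standard errors of 1 for both deck sizes
    assert abs(avg - 1.0) < 0.05
```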


3. Variance: Measuring Spread

The expectation tells us where a distribution is centered. But two distributions can have the same center and look completely different — one tightly concentrated, the other spread wide. We need a measure of spread.

Definition 2 Variance and Standard Deviation

The variance of a random variable $X$ with mean $\mu = E[X]$ is

$$\text{Var}(X) = E[(X - \mu)^2]$$

The standard deviation is $\sigma_X = \sqrt{\text{Var}(X)}$.

Variance is also written $\sigma^2$, $\sigma_X^2$, or $\text{Var}(X)$.

Variance is the average squared distance from the mean. It measures how far a random variable typically falls from its expected value. The squaring ensures that deviations above and below the mean both contribute positively. The standard deviation $\sigma$ returns us to the original units (if $X$ is in meters, $\text{Var}(X)$ is in meters$^2$ but $\sigma_X$ is in meters).

Theorem 6 Variance Decomposition (Computational Formula)

$$\text{Var}(X) = E[X^2] - (E[X])^2$$

Proof.

Expand the definition using linearity:

$$\text{Var}(X) = E[(X - \mu)^2] = E[X^2 - 2\mu X + \mu^2]$$

By linearity of expectation (Theorem 2):

$$= E[X^2] - 2\mu \, E[X] + \mu^2$$

Since $\mu = E[X]$:

$$= E[X^2] - 2(E[X])^2 + (E[X])^2 = E[X^2] - (E[X])^2$$

\square

This computational formula — “the mean of the square minus the square of the mean” — is almost always easier to use than the definition.
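Both forms can be checked against each other in exact arithmetic — a small sketch for the fair die:

```python
from fractions import Fraction

pmf = {k: Fraction(1, 6) for k in range(1, 7)}
mu = sum(x * p for x, p in pmf.items())

# Definition: expected squared deviation from the mean
var_def = sum((x - mu) ** 2 * p for x, p in pmf.items())

# Computational formula: E[X^2] - (E[X])^2
var_comp = sum(x**2 * p for x, p in pmf.items()) - mu**2

assert var_def == var_comp == Fraction(35, 12)
```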

Theorem 7 Properties of Variance
  1. $\text{Var}(X) \geq 0$, with equality iff $X$ is constant a.s.
  2. $\text{Var}(aX + b) = a^2 \, \text{Var}(X)$ for constants $a, b$
  3. If $X \perp Y$, then $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$
Proof.

Property 1. $\text{Var}(X) = E[(X - \mu)^2] \geq 0$ because $(X - \mu)^2 \geq 0$ a.s. If $\text{Var}(X) = 0$, then $E[(X - \mu)^2] = 0$, so $(X - \mu)^2 = 0$ a.s., meaning $X = \mu$ a.s.

Property 2. Let $Y = aX + b$. Then $E[Y] = aE[X] + b$, so:

$$\text{Var}(Y) = E[(Y - E[Y])^2] = E[(aX + b - aE[X] - b)^2] = E[a^2(X - E[X])^2] = a^2 \, \text{Var}(X)$$

Adding a constant shifts the distribution but doesn't change the spread. Scaling by $a$ scales the variance by $a^2$.

Property 3. By the computational formula, $\text{Var}(X + Y) = E[(X+Y)^2] - (E[X+Y])^2$.

Expanding:

$$E[(X+Y)^2] = E[X^2 + 2XY + Y^2] = E[X^2] + 2E[XY] + E[Y^2]$$

$$(E[X+Y])^2 = (E[X] + E[Y])^2 = (E[X])^2 + 2E[X]E[Y] + (E[Y])^2$$

Subtracting:

$$\text{Var}(X+Y) = \bigl(E[X^2] - (E[X])^2\bigr) + \bigl(E[Y^2] - (E[Y])^2\bigr) + 2\bigl(E[XY] - E[X]E[Y]\bigr)$$

The first two terms are $\text{Var}(X)$ and $\text{Var}(Y)$. The third term is $2\,\text{Cov}(X,Y)$ (Definition 3 below). When $X \perp Y$, $E[XY] = E[X]E[Y]$ (Theorem 5), so the covariance term vanishes. $\square$

Remark Variance does NOT split for dependent variables

Property 3 requires independence. In general, $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X,Y)$. If $X$ and $Y$ are positively correlated ($\text{Cov}(X,Y) > 0$), the variance of their sum is larger than the sum of variances. If negatively correlated, it's smaller. This is the mathematical foundation of portfolio diversification: combining negatively correlated assets reduces total variance.

Three-panel figure showing variance as spread: Bin(20,0.5) with ±σ band, Bin(20,0.1) vs Bin(20,0.9), and three Normals with different σ
Example 4 Die roll variance

For a fair die with $E[X] = 3.5$ and $E[X^2] = 91/6 \approx 15.17$:

$$\text{Var}(X) = E[X^2] - (E[X])^2 = \frac{91}{6} - \left(\frac{7}{2}\right)^2 = \frac{91}{6} - \frac{49}{4} = \frac{182 - 147}{12} = \frac{35}{12} \approx 2.92$$

Standard deviation: $\sigma = \sqrt{35/12} \approx 1.71$.

Example 5 A/B test: same mean, different variance

Variant A: win 5 dollars with probability 0.4, else 0. $E[A] = 2$, $\text{Var}(A) = 0.4 \cdot 25 - 4 = 6$.

Variant B: win 20 dollars with probability 0.1, else 0. $E[B] = 2$, $\text{Var}(B) = 0.1 \cdot 400 - 4 = 36$.

Both have the same mean payout (2 dollars), but Variant B is 6x more variable. In an A/B test, you’d need far more samples to detect a treatment effect in B than in A — because the noise-to-signal ratio is much higher. This is why variance matters for experimental design.
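For a two-outcome payout this is a one-liner to verify; a sketch with a hypothetical helper `mean_var` (the name is ours, not a library function):

```python
import math

def mean_var(outcomes):
    """Mean and variance of a discrete distribution given (value, prob) pairs."""
    mu = sum(v * p for v, p in outcomes)
    second = sum(v * v * p for v, p in outcomes)
    return mu, second - mu * mu

mu_a, var_a = mean_var([(5, 0.4), (0, 0.6)])   # Variant A
mu_b, var_b = mean_var([(20, 0.1), (0, 0.9)])  # Variant B

assert math.isclose(mu_a, 2) and math.isclose(mu_b, 2)  # same mean
assert math.isclose(var_b / var_a, 6)                   # B is 6x more variable
```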

[Interactive: Variance Explorer — fair die PMF with $E[X] = 3.5$; the definition $E[(X - \mu)^2]$ and the computational formula $E[X^2] - (E[X])^2$ both give $\text{Var}(X) \approx 2.9167$, $\sigma \approx 1.7078$]

4. Covariance and Correlation

When we have two random variables, we want to quantify their linear association. Do they tend to be large together (positive association) or does one tend to be large when the other is small (negative association)?

Definition 3 Covariance

The covariance of random variables $X$ and $Y$ is

$$\text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$$

where $\mu_X = E[X]$ and $\mu_Y = E[Y]$.

Theorem 8 Computational Formula for Covariance

$$\text{Cov}(X, Y) = E[XY] - E[X] \cdot E[Y]$$

Proof.

Expand the definition:

$$\text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY - \mu_X Y - X \mu_Y + \mu_X \mu_Y]$$

By linearity:

$$= E[XY] - \mu_X E[Y] - E[X] \mu_Y + \mu_X \mu_Y = E[XY] - \mu_X \mu_Y - \mu_X \mu_Y + \mu_X \mu_Y$$

$$= E[XY] - E[X] \cdot E[Y]$$

$\square$

Note that $\text{Cov}(X, X) = E[X^2] - (E[X])^2 = \text{Var}(X)$.

Theorem 9 Properties of Covariance
  1. $\text{Cov}(X, Y) = \text{Cov}(Y, X)$ (symmetry)
  2. $\text{Cov}(aX + b, cY + d) = ac \, \text{Cov}(X, Y)$ (bilinearity with constants)
  3. $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y)$ (general variance of a sum)
  4. $\text{Cov}\bigl(\sum_i X_i, \sum_j Y_j\bigr) = \sum_i \sum_j \text{Cov}(X_i, Y_j)$ (multilinearity)

The proof of Property 3 was given in Theorem 7. Properties 1, 2, and 4 follow from the definition and linearity of expectation.

Definition 4 Correlation Coefficient

The Pearson correlation coefficient of $X$ and $Y$ is

$$\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \, \sigma_Y} = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X) \, \text{Var}(Y)}}$$

provided both variances are positive.

Theorem 10 Correlation Bounds

$$-1 \leq \rho(X, Y) \leq 1$$

with $|\rho(X, Y)| = 1$ if and only if $Y = aX + b$ for some constants $a, b$ (i.e., $X$ and $Y$ are related by a perfect linear function).

Proof.

Consider the random variable $Z = X/\sigma_X - t \cdot Y/\sigma_Y$ for some real number $t$. Since $\text{Var}(Z) \geq 0$:

$$0 \leq \text{Var}(Z) = \text{Var}(X/\sigma_X) - 2t \, \text{Cov}(X/\sigma_X, Y/\sigma_Y) + t^2 \, \text{Var}(Y/\sigma_Y)$$

$$= 1 - 2t\rho + t^2$$

This quadratic in $t$ is nonnegative for all $t$, so its discriminant must be $\leq 0$:

$$4\rho^2 - 4 \leq 0 \implies \rho^2 \leq 1 \implies -1 \leq \rho \leq 1$$

Equality holds when $\text{Var}(Z) = 0$ for some $t$, meaning $Z$ is constant a.s., i.e., $X/\sigma_X = tY/\sigma_Y + c$. $\square$

Remark Zero covariance from independence; converse false

Independence $\implies$ $\text{Cov}(X,Y) = 0$ $\implies$ $\rho(X,Y) = 0$ (Theorem 5). But the converse fails: uncorrelatedness ($\rho = 0$) does not imply independence. The example from Remark 3 ($X$ uniform on $\{-1, 0, 1\}$, $Y = X^2$) has $\rho = 0$ but complete functional dependence. Correlation measures linear association only — it can miss nonlinear dependencies entirely. This distinction matters in ML: two features can be uncorrelated yet carry highly redundant information through nonlinear relationships.
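The counterexample is small enough to enumerate exactly — a sketch in rational arithmetic:

```python
from fractions import Fraction

# X uniform on {-1, 0, 1}; Y = X^2 is a deterministic function of X
p = Fraction(1, 3)
support = [-1, 0, 1]

ex = sum(x * p for x in support)          # E[X]  = 0
ey = sum(x**2 * p for x in support)       # E[Y]  = 2/3
exy = sum(x * x**2 * p for x in support)  # E[XY] = E[X^3] = 0

assert exy - ex * ey == 0  # Cov(X, Y) = 0: uncorrelated

# ...yet dependent: P(X = 1, Y = 0) = 0, but P(X = 1) * P(Y = 0) = 1/9
assert Fraction(0) != p * p
```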

Three-panel scatter plot showing correlation: ρ=0.85 (positive), ρ=0 (zero), and ρ=−0.75 (negative)

5. Standard Inequalities

Probability bounds are the bread and butter of theoretical statistics and machine learning. When we can’t compute exact probabilities, we use inequalities to bound them from above. The three workhorses are Markov, Chebyshev, and Jensen.

Theorem 11 Markov's Inequality

If $X \geq 0$ a.s. and $a > 0$, then

$$P(X \geq a) \leq \frac{E[X]}{a}$$

Proof.

Since $X \geq 0$ (taking the continuous case; the discrete case is identical with sums):

$$E[X] = \int_0^{\infty} x \, f_X(x) \, dx \geq \int_a^{\infty} x \, f_X(x) \, dx \geq a \int_a^{\infty} f_X(x) \, dx = a \, P(X \geq a)$$

Dividing both sides by $a$: $P(X \geq a) \leq E[X]/a$. $\square$

Markov's inequality is very weak — but it uses almost no information (only $E[X]$ and $X \geq 0$). The bound is tight: for the scaled Bernoulli variable with $P(X = n) = 1/n$ and $P(X = 0) = 1 - 1/n$, we get $P(X \geq n) = 1/n$ and $E[X]/n = 1/n$.

Theorem 12 Chebyshev's Inequality

For any random variable $X$ with $E[X] = \mu$ and $\text{Var}(X) = \sigma^2 < \infty$:

$$P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}$$

for any $k > 0$. Equivalently, $P(|X - \mu| \geq \varepsilon) \leq \text{Var}(X)/\varepsilon^2$.

Proof.

Apply Markov's inequality to the nonnegative random variable $(X - \mu)^2$ with threshold $\varepsilon^2$:

$$P(|X - \mu| \geq \varepsilon) = P((X - \mu)^2 \geq \varepsilon^2) \leq \frac{E[(X - \mu)^2]}{\varepsilon^2} = \frac{\text{Var}(X)}{\varepsilon^2}$$

Setting $\varepsilon = k\sigma$ gives $P(|X - \mu| \geq k\sigma) \leq 1/k^2$. $\square$

Chebyshev uses both the mean and the variance, so it's tighter than Markov. At $k = 2$ standard deviations: Chebyshev gives $\leq 25\%$, while for the normal distribution the true probability is $\approx 4.6\%$. At $k = 3$: Chebyshev gives $\leq 11.1\%$; the normal gives $\approx 0.3\%$. Chebyshev applies to any distribution — that's why it's loose for the well-behaved normal.

Example 6 Chebyshev in practice

A quality control process produces items with mean weight $\mu = 100$g and standard deviation $\sigma = 2$g. What fraction of items can weigh more than 106g?

Using Chebyshev with $k = 3$ (since $|106 - 100| = 6 = 3\sigma$):

$$P(|X - 100| \geq 6) \leq \frac{1}{3^2} = \frac{1}{9} \approx 11.1\%$$

The one-sided event $\{X > 106\}$ is contained in the two-sided event, so the same bound applies. If we know the weights are normally distributed, the true two-sided probability is $P(|Z| \geq 3) \approx 0.27\%$ — 40x smaller. Chebyshev's power is that it works regardless of the distribution shape.
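The gap between the distribution-free bound and the normal tail is easy to tabulate — a sketch using the error function from the standard library:

```python
import math

def chebyshev_bound(k):
    """Distribution-free: P(|X - mu| >= k*sigma) <= 1/k^2."""
    return 1 / k**2

def normal_two_sided(k):
    """Exact normal tail: P(|Z| >= k) = 2 * (1 - Phi(k))."""
    return 2 * (1 - 0.5 * (1 + math.erf(k / math.sqrt(2))))

for k in (2, 3):
    print(k, chebyshev_bound(k), normal_two_sided(k))
# k = 2: bound 0.25,    true tail ~0.0455
# k = 3: bound ~0.1111, true tail ~0.0027
```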

Theorem 13 Jensen's Inequality

If $g$ is a convex function and $E[X]$ exists, then

$$g(E[X]) \leq E[g(X)]$$

If $g$ is concave, the inequality reverses: $g(E[X]) \geq E[g(X)]$.

Proof.

Since $g$ is convex, it lies above every tangent line. At the point $\mu = E[X]$, there exists a slope $m$ (a subgradient) such that for all $x$:

$$g(x) \geq g(\mu) + m(x - \mu)$$

Taking expectations of both sides (which preserves the inequality by monotonicity, Theorem 3):

$$E[g(X)] \geq g(\mu) + m \, E[X - \mu] = g(\mu) + m \cdot 0 = g(\mu) = g(E[X])$$

$\square$

Example 7 Jensen and the AM-GM inequality

Let $g(x) = -\log(x)$ (convex on $(0, \infty)$). Jensen gives:

$$-\log(E[X]) \leq E[-\log(X)] = -E[\log(X)]$$

So $\log(E[X]) \geq E[\log(X)]$, or equivalently $E[X] \geq e^{E[\log X]}$. For $n$ equal-probability values $x_1, \ldots, x_n$:

$$\frac{x_1 + \cdots + x_n}{n} \geq (x_1 \cdots x_n)^{1/n}$$

This is the arithmetic-mean ≥ geometric-mean inequality — a pure consequence of Jensen.

ML application: Jensen's inequality with $g(x) = -\log(x)$ is exactly what gives us the evidence lower bound (ELBO) in variational inference: $\log p(\mathbf{x}) \geq E_q[\log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z})]$. See ML: Information Theory for the full derivation.
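The AM ≥ GM consequence can be stress-tested numerically — a sketch over random positive inputs (sample counts and ranges are arbitrary):

```python
import math
import random

random.seed(2)

for _ in range(1000):
    xs = [random.uniform(0.1, 10.0) for _ in range(5)]
    am = sum(xs) / len(xs)
    # Geometric mean via exp(mean of logs), which is numerically stable
    gm = math.exp(sum(math.log(x) for x in xs) / len(xs))
    assert am >= gm  # Jensen with g(x) = -log(x)
```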

Three-panel figure showing Markov on Exponential, Chebyshev on Normal, and Jensen with convex function x² and tangent line
[Interactive: Markov bound explorer — for an Exponential sample with $E[X] \approx 1.98$ and threshold $a = 4$, Markov gives $P(X \geq 4) \leq E[X]/a \approx 0.494$, while the true probability is $\approx 0.135$: the bound is about 3.7x loose]

6. Conditional Expectation

In Topic 2, we developed conditional probability $P(A \mid B)$ — the probability of an event given partial information. Now we extend this idea from events to random variables. The conditional expectation $E[X \mid Y]$ is our best guess of $X$ given what $Y$ tells us.

Definition 5 Conditional Expectation Given an Event

If $B$ is an event with $P(B) > 0$, the conditional expectation of $X$ given $B$ is

$$E[X \mid B] = \sum_x x \, P(X = x \mid B) \quad \text{(discrete)}$$

$$E[X \mid B] = \int_{-\infty}^{\infty} x \, f_{X|B}(x) \, dx \quad \text{(continuous)}$$

This is just the ordinary expectation computed using the conditional distribution.

Definition 6 Conditional Expectation as a Function

For discrete $X$ and $Y$ with joint PMF $p_{X,Y}$:

$$E[X \mid Y = y] = \sum_x x \, p_{X|Y}(x \mid y) = \sum_x x \, \frac{p_{X,Y}(x, y)}{p_Y(y)}$$

For continuous $X$ and $Y$ with joint PDF $f_{X,Y}$:

$$E[X \mid Y = y] = \int_{-\infty}^{\infty} x \, f_{X|Y}(x \mid y) \, dx = \int_{-\infty}^{\infty} x \, \frac{f_{X,Y}(x, y)}{f_Y(y)} \, dx$$

Here $E[X \mid Y = y]$ is a function of $y$ — we write it as $h(y) = E[X \mid Y = y]$.

Definition 7 Conditional Expectation as a Random Variable

The conditional expectation $E[X \mid Y]$ is the random variable obtained by evaluating the function $h(y) = E[X \mid Y = y]$ at $Y$:

$$E[X \mid Y] = h(Y)$$

This is a random variable — it inherits its randomness from $Y$. Different realizations of $Y$ produce different "best guesses" of $X$.

The progression from Definition 5 to Definition 7 is crucial: we start with a number ($E[X \mid B]$), then a function of $y$ ($E[X \mid Y = y]$), then a random variable ($E[X \mid Y]$). The random variable interpretation is what makes the tower property (Theorem 14) meaningful — we can take expectations of conditional expectations.

Three-panel figure showing conditional expectation: joint scatter with E[X|Y=y] line, conditional PDF slices at y=2,3,4, and histogram of E[X|Y] as random variable
Example 8 Bivariate normal conditional expectation

Let $(X, Y)$ be bivariate normal with means $\mu_X, \mu_Y$, standard deviations $\sigma_X, \sigma_Y$, and correlation $\rho$. From Topic 3, §8:

$$E[X \mid Y = y] = \mu_X + \rho \frac{\sigma_X}{\sigma_Y}(y - \mu_Y)$$

This is a linear function of $y$ — it's the regression line. The slope is $\rho \sigma_X / \sigma_Y$, and when $\rho = 0$, the conditional mean equals the unconditional mean $\mu_X$ (knowing $Y$ provides no information about $X$).

The conditional variance is $\text{Var}(X \mid Y = y) = \sigma_X^2(1 - \rho^2)$, which does not depend on $y$. This homoscedasticity is special to the bivariate normal.


7. The Law of Total Expectation and Eve’s Law

The law of total expectation (also called the tower property or Adam's law) is one of the most powerful tools in probability. It says: to compute $E[X]$, first compute $E[X \mid Y]$ for each value of $Y$, then average over $Y$.

Theorem 14 Law of Total Expectation (Tower Property)

$$E[X] = E[E[X \mid Y]]$$

More precisely, if $Y$ is discrete with values $\{y_1, y_2, \ldots\}$:

$$E[X] = \sum_j E[X \mid Y = y_j] \, P(Y = y_j)$$

Proof.

We prove the discrete case. Start with the right side:

$$\sum_j E[X \mid Y = y_j] \, P(Y = y_j) = \sum_j \left(\sum_i x_i \, P(X = x_i \mid Y = y_j)\right) P(Y = y_j)$$

By the definition of conditional probability, $P(X = x_i \mid Y = y_j) \cdot P(Y = y_j) = P(X = x_i, Y = y_j)$:

$$= \sum_j \sum_i x_i \, P(X = x_i, Y = y_j)$$

Swapping the order of summation:

$$= \sum_i x_i \sum_j P(X = x_i, Y = y_j) = \sum_i x_i \, P(X = x_i) = E[X]$$

The last step uses the law of total probability: $\sum_j P(X = x_i, Y = y_j) = P(X = x_i)$ (marginalizing out $Y$). $\square$

Notice the parallel with the law of total probability from Topic 2: $P(A) = \sum_j P(A \mid B_j) P(B_j)$. The tower property is the same idea applied to expectations.

Theorem 15 Law of Total Variance (Eve's Law)

$$\text{Var}(X) = E[\text{Var}(X \mid Y)] + \text{Var}(E[X \mid Y])$$

In words: total variance = expected within-group variance + between-group variance.

Proof.

Use the computational formula $\text{Var}(X) = E[X^2] - (E[X])^2$ and apply the tower property to both terms.

By the tower property: $E[X^2] = E[E[X^2 \mid Y]]$ and $E[X] = E[E[X \mid Y]]$.

Now note that $\text{Var}(X \mid Y) = E[X^2 \mid Y] - (E[X \mid Y])^2$ (the computational formula applied conditionally). So:

$$E[X^2 \mid Y] = \text{Var}(X \mid Y) + (E[X \mid Y])^2$$

Taking expectations: $E[X^2] = E[\text{Var}(X \mid Y)] + E[(E[X \mid Y])^2]$.

Therefore:

$$\text{Var}(X) = E[X^2] - (E[X])^2$$

$$= E[\text{Var}(X \mid Y)] + E[(E[X \mid Y])^2] - (E[E[X \mid Y]])^2$$

The last two terms are $E[Z^2] - (E[Z])^2$ where $Z = E[X \mid Y]$, which is $\text{Var}(Z) = \text{Var}(E[X \mid Y])$.

$$= E[\text{Var}(X \mid Y)] + \text{Var}(E[X \mid Y])$$

\square

Eve’s law is the mathematical foundation of ANOVA (analysis of variance): total variation decomposes into within-group and between-group components. In ML, it underlies the bias-variance decomposition (§9).

Three-panel figure showing tower property on mixture histogram, Eve's law stacked bar chart with within/between/total decomposition
Example 9 Mixture model (tower property)

A company has two customer segments: Casual (60%) with mean spending of 50, and Power Users (40%) with mean spending of 120. Let $Y$ indicate the segment.

$$E[\text{Spending}] = E[\text{Spending} \mid \text{Casual}] \cdot P(\text{Casual}) + E[\text{Spending} \mid \text{Power}] \cdot P(\text{Power})$$

$$= 50 \cdot 0.6 + 120 \cdot 0.4 = 30 + 48 = 78$$

The unconditional mean is a weighted average of the conditional means.

Example 10 Mixture model (Eve's law decomposition)

Continuing Example 9, suppose $\text{Var}(\text{Spending} \mid \text{Casual}) = 400$ (standard deviation 20) and $\text{Var}(\text{Spending} \mid \text{Power}) = 2500$ (standard deviation 50).

Within-group variance (expected conditional variance):

$$E[\text{Var}(\text{Spending} \mid Y)] = 400 \cdot 0.6 + 2500 \cdot 0.4 = 240 + 1000 = 1240$$

Between-group variance (variance of conditional means): The conditional means are 50 and 120, with weights 0.6 and 0.4. Their mean is 78 (from Example 9).

$$\text{Var}(E[\text{Spending} \mid Y]) = (50 - 78)^2 \cdot 0.6 + (120 - 78)^2 \cdot 0.4 = 784 \cdot 0.6 + 1764 \cdot 0.4 = 470.4 + 705.6 = 1176$$

Total variance: $1240 + 1176 = 2416$.

About half the variance comes from within segments (customers vary within their segment) and half from between segments (the segments have different means).
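The whole decomposition fits in a few lines of exact arithmetic — a sketch of Examples 9 and 10:

```python
from fractions import Fraction as F

# (weight, conditional mean, conditional variance) for each segment
segments = [
    (F(3, 5), 50, 400),    # Casual
    (F(2, 5), 120, 2500),  # Power user
]

mean = sum(w * m for w, m, _ in segments)                   # tower property
within = sum(w * v for w, _, v in segments)                 # E[Var(X|Y)]
between = sum(w * (m - mean) ** 2 for w, m, _ in segments)  # Var(E[X|Y])

assert (mean, within, between, within + between) == (78, 1240, 1176, 2416)
```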

[Interactive: Law of Total Expectation Explorer — two customer segments with $E[X \mid \text{Casual}] = 50$, $E[X \mid \text{Power}] = 120$, $E[X] = 78$; Eve's law: within 1240 + between 1176 = total 2416]

8. Moment-Generating Functions

A moment-generating function (MGF) packages all the moments of a distribution — $E[X]$, $E[X^2]$, $E[X^3]$, and so on — into a single function. It's the probabilist's version of the Laplace transform.

Definition 8 Moment-Generating Function

The moment-generating function (MGF) of a random variable $X$ is

$$M_X(t) = E[e^{tX}]$$

defined for all $t \in \mathbb{R}$ where the expectation exists. Explicitly:

$$M_X(t) = \sum_x e^{tx} p_X(x) \quad \text{(discrete)}$$

$$M_X(t) = \int_{-\infty}^{\infty} e^{tx} f_X(x) \, dx \quad \text{(continuous)}$$

The name "moment-generating function" is literal: the $n$th derivative of $M_X(t)$ evaluated at $t = 0$ gives the $n$th moment $E[X^n]$.

Theorem 16 Moments from the MGF

If $M_X(t)$ exists in an open interval around $t = 0$, then

$$M_X^{(n)}(0) = E[X^n]$$

where $M_X^{(n)}$ denotes the $n$th derivative.

Proof

Expand etXe^{tX} in its Taylor series (from formalCalculus: Taylor Series ):

MX(t)=E[etX]=E[n=0(tX)nn!]=n=0E[Xn]n!tnM_X(t) = E[e^{tX}] = E\left[\sum_{n=0}^{\infty} \frac{(tX)^n}{n!}\right] = \sum_{n=0}^{\infty} \frac{E[X^n]}{n!} \, t^n

(The interchange of expectation and sum is justified by the assumption that MXM_X exists in an interval around 0, which provides the absolute convergence needed.)

This is a power series in tt with coefficients E[Xn]/n!E[X^n]/n!. By the Taylor coefficient formula:

MX(n)(0)n!=E[Xn]n!\frac{M_X^{(n)}(0)}{n!} = \frac{E[X^n]}{n!}

So MX(n)(0)=E[Xn]M_X^{(n)}(0) = E[X^n]. In particular:

  • MX(0)=1M_X(0) = 1 (always)
  • MX(0)=E[X]M_X'(0) = E[X] (the mean)
  • MX(0)=E[X2]M_X''(0) = E[X^2], so Var(X)=MX(0)(MX(0))2\text{Var}(X) = M_X''(0) - (M_X'(0))^2

\square
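The theorem can be sanity-checked numerically: differentiate an MGF with central finite differences at t = 0 and compare against the known moments. A sketch for Bernoulli(p), whose MGF (1 − p) + pe^t is derived in Example 11 below; the values p = 0.5 and the step size h are arbitrary choices for illustration:

```python
import math

p = 0.5

def M(t):
    """MGF of Bernoulli(p): (1 - p) + p * e^t."""
    return (1 - p) + p * math.exp(t)

h = 1e-4  # finite-difference step
m1 = (M(h) - M(-h)) / (2 * h)            # central difference for M'(0) = E[X]
m2 = (M(h) - 2 * M(0.0) + M(-h)) / h**2  # central difference for M''(0) = E[X^2]
var = m2 - m1 ** 2                       # Var(X) = M''(0) - M'(0)^2

print(round(m1, 6), round(m2, 6), round(var, 6))  # 0.5 0.5 0.25
```

The estimates match E[X] = p, E[X²] = p, and Var(X) = p(1 − p) to within finite-difference error.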

Theorem 17 Uniqueness of the MGF

If MX(t)=MY(t)M_X(t) = M_Y(t) for all tt in some open interval (δ,δ)(-\delta, \delta) around 0, then XX and YY have the same distribution.

This uniqueness theorem is what makes MGFs a powerful proof tool: if you can show two random variables have the same MGF, you’ve shown they have the same distribution. We’ll use this in the proof of the Central Limit Theorem — MGF uniqueness is the final step that identifies the limiting distribution as N(0,1)N(0, 1).

Theorem 18 MGF of Independent Sums

If XX and YY are independent, then

MX+Y(t)=MX(t)MY(t)M_{X+Y}(t) = M_X(t) \cdot M_Y(t)

Proof

MX+Y(t)=E[et(X+Y)]=E[etXetY]M_{X+Y}(t) = E[e^{t(X+Y)}] = E[e^{tX} \cdot e^{tY}]

Since XYX \perp Y, the functions etXe^{tX} and etYe^{tY} are independent (functions of independent variables are independent). By Theorem 5:

=E[etX]E[etY]=MX(t)MY(t)= E[e^{tX}] \cdot E[e^{tY}] = M_X(t) \cdot M_Y(t)

\square

Example 11 Bernoulli MGF

XBernoulli(p)X \sim \text{Bernoulli}(p):

MX(t)=E[etX]=et0(1p)+et1p=(1p)+petM_X(t) = E[e^{tX}] = e^{t \cdot 0}(1-p) + e^{t \cdot 1} p = (1 - p) + pe^t

Check: MX(0)=pe0=p=E[X]M_X'(0) = pe^0 = p = E[X]. \checkmark

Example 12 Normal MGF

XN(μ,σ2)X \sim \mathcal{N}(\mu, \sigma^2). By completing the square in the exponent of the integral (a standard technique):

MX(t)=exp(μt+σ2t22)M_X(t) = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right)

Check: MX(t)=(μ+σ2t)MX(t)M_X'(t) = (\mu + \sigma^2 t) M_X(t), so MX(0)=μM_X'(0) = \mu. MX(0)=σ2+μ2M_X''(0) = \sigma^2 + \mu^2, so Var(X)=σ2+μ2μ2=σ2\text{Var}(X) = \sigma^2 + \mu^2 - \mu^2 = \sigma^2. \checkmark

Example 13 Exponential MGF

XExp(λ)X \sim \text{Exp}(\lambda):

MX(t)=0etxλeλxdx=λ0e(λt)xdx=λλtM_X(t) = \int_0^{\infty} e^{tx} \lambda e^{-\lambda x} \, dx = \lambda \int_0^{\infty} e^{-({\lambda - t})x} \, dx = \frac{\lambda}{\lambda - t}

for t<λt < \lambda (the integral diverges for tλt \geq \lambda).

Check: MX(t)=λ/(λt)2M_X'(t) = \lambda/(\lambda - t)^2, so MX(0)=1/λ=E[X]M_X'(0) = 1/\lambda = E[X]. \checkmark
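The closed form can also be sanity-checked against a Monte Carlo estimate of E[e^{tX}]. A sketch; λ = 2, t = 0.5, the seed, and the sample size are arbitrary choices, with t < λ as required:

```python
import math
import random

lam, t = 2.0, 0.5  # rate and evaluation point; need t < lam
random.seed(0)

# Monte Carlo estimate of the MGF E[e^{tX}] for X ~ Exp(lam).
n = 200_000
est = sum(math.exp(t * random.expovariate(lam)) for _ in range(n)) / n

exact = lam / (lam - t)  # closed form: 2 / 1.5 = 4/3
print(round(est, 2), round(exact, 2))  # both ≈ 1.33
```

For t ≥ λ the estimate would not converge as n grows, mirroring the divergence of the integral.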

Example 14 Sum of independent normals via MGF

If XN(μ1,σ12)X \sim \mathcal{N}(\mu_1, \sigma_1^2) and YN(μ2,σ22)Y \sim \mathcal{N}(\mu_2, \sigma_2^2) are independent, then by Theorem 18:

MX+Y(t)=MX(t)MY(t)=eμ1t+σ12t2/2eμ2t+σ22t2/2=e(μ1+μ2)t+(σ12+σ22)t2/2M_{X+Y}(t) = M_X(t) \cdot M_Y(t) = e^{\mu_1 t + \sigma_1^2 t^2/2} \cdot e^{\mu_2 t + \sigma_2^2 t^2/2} = e^{(\mu_1 + \mu_2)t + (\sigma_1^2 + \sigma_2^2)t^2/2}

By the uniqueness theorem (Theorem 17), this is the MGF of N(μ1+μ2,σ12+σ22)\mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2). Therefore:

X+YN(μ1+μ2,σ12+σ22)X + Y \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)

Independent normals sum to a normal — the means add and the variances add. Few families are closed under convolution in this way (Poisson, and Gamma with a common rate, are other examples), and this stability of the normal underlies much of classical statistics.
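A simulation sketch makes the conclusion concrete (the parameters and seed are arbitrary choices). Note the simulation only checks that the mean and variance add; the MGF argument is stronger, pinning down the full distribution as normal:

```python
import random
import statistics

random.seed(1)
mu1, s1 = 1.0, 2.0   # X ~ N(1, 4)
mu2, s2 = -3.0, 1.5  # Y ~ N(-3, 2.25)

# Draw independent X and Y and form the sum.
n = 100_000
sums = [random.gauss(mu1, s1) + random.gauss(mu2, s2) for _ in range(n)]

m = statistics.fmean(sums)     # should be near mu1 + mu2 = -2
v = statistics.variance(sums)  # should be near s1**2 + s2**2 = 6.25
print(round(m, 2), round(v, 2))
```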

Three-panel figure showing MGFs of Bernoulli, Normal, and Exponential; derivatives at zero extracting moments; product of MGFs for sum of independent normals
The first panel plots the Bernoulli(0.5) MGF with its tangent at t = 0 (slope M′(0) = E[X] = 0.5) and a marker at M(0) = 1, which holds for every distribution since E[e^{0·X}] = 1. An accompanying table compares central-finite-difference estimates of M′(0), M″(0), and Var(X) = M″(0) − (M′(0))² against the exact values 0.5, 0.5, and 0.25.

9. Connections to ML

Every concept in this topic has a direct counterpart in machine learning. Let us highlight the central connection: the bias-variance decomposition.

Theorem 19 Conditional Expectation Minimizes MSE

Among all functions g(Y)g(Y) of YY, the conditional expectation E[XY]E[X \mid Y] minimizes the mean squared error:

E[XY]=argmingE[(Xg(Y))2]E[X \mid Y] = \arg\min_{g} E[(X - g(Y))^2]

This is why supervised learning works: with the roles of the variables swapped, the optimal prediction of a label YY given features XX (under squared loss) is E[YX]E[Y \mid X]. Every regression model is an approximation to this conditional expectation.
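For the two-segment spending example, the theorem says the segment-wise conditional means (50 and 120) form the best predictor, and any perturbation increases MSE. A sketch using the identity E[(X − c)² | Y] = Var(X | Y) + (E[X | Y] − c)²; the perturbed values 60 and 110 are arbitrary choices:

```python
weights = [0.6, 0.4]         # P(Casual), P(Power User)
means = [50.0, 120.0]        # E[X | Y = segment]
variances = [400.0, 2500.0]  # Var(X | Y = segment)

def mse(c):
    """MSE of the predictor that outputs c[i] in segment i:
    sum_i w_i * (Var(X | Y=i) + (E[X | Y=i] - c[i])^2)."""
    return sum(w * (v + (m - ci) ** 2)
               for w, v, m, ci in zip(weights, variances, means, c))

best = mse(means)                 # predictor g(Y) = E[X | Y]
print(round(best, 1))             # 1240.0 = E[Var(X | Y)], the within-group term
print(mse([60.0, 120.0]) > best)  # True: shifting a conditional mean raises MSE
print(mse([50.0, 110.0]) > best)  # True
```

The minimum MSE equals E[Var(X | Y)], exactly the irreducible within-group term from Eve's law.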

The bias-variance decomposition connects Eve’s law (Theorem 15) to prediction error. For an estimator f^(x)\hat{f}(x) of a target f(x)=E[YX=x]f(x) = E[Y \mid X = x]:

E[(Yf^(X))2]=(E[f^(X)]f(X))2Bias2+E[(f^(X)E[f^(X)])2]Variance+E[(Yf(X))2]Irreducible noiseE[(Y - \hat{f}(X))^2] = \underbrace{(E[\hat{f}(X)] - f(X))^2}_{\text{Bias}^2} + \underbrace{E[(\hat{f}(X) - E[\hat{f}(X)])^2]}_{\text{Variance}} + \underbrace{E[(Y - f(X))^2]}_{\text{Irreducible noise}}

This is Eve’s law in disguise: the total prediction error decomposes into a systematic component (bias) and a variability component (variance), plus noise that no model can remove.
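At the level of a single estimator (where there is no irreducible-noise term) the decomposition reads MSE = Bias² + Variance, and it can be verified by simulation. A sketch with a deliberately biased shrinkage estimator c · X̄ of a mean θ; all parameters and the seed are arbitrary choices:

```python
import random
import statistics

random.seed(2)
theta, sigma, n, c = 2.0, 1.0, 10, 0.8  # true mean, noise sd, sample size, shrinkage

# Repeatedly draw a sample of size n and apply the (biased) estimator c * xbar.
reps = 20_000
estimates = []
for _ in range(reps):
    xbar = statistics.fmean(random.gauss(theta, sigma) for _ in range(n))
    estimates.append(c * xbar)

mse = statistics.fmean((e - theta) ** 2 for e in estimates)
bias2 = (statistics.fmean(estimates) - theta) ** 2
var = statistics.variance(estimates)

# Theory: bias^2 = ((c - 1) * theta)^2 = 0.16, variance = c^2 * sigma^2 / n = 0.064.
print(round(mse, 3), round(bias2 + var, 3))  # the two agree up to simulation noise
```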

Three-panel figure showing polynomial fits: degree-1 (high bias, low variance), degree-5 (balanced), degree-15 (low bias, high variance) with Bias² and Var annotations
| Concept from this topic | ML application |
| --- | --- |
| E[X]E[X] (expectation) | Risk =E[(Y,f^(X))]= E[\ell(Y, \hat{f}(X))] — the expected loss that training minimizes |
| Linearity of EE | SGD: E[Li]=E[L]E[\nabla L_i] = \nabla E[L] — minibatch gradients are unbiased ( formalML: Optimization ) |
| Var(X)\text{Var}(X) | Variance of an estimator determines confidence interval width |
| Eve’s law | Bias-variance decomposition of prediction error ( formalML: Bias-Variance Tradeoff ) |
| E[XY]E[X \mid Y] | Optimal prediction function under squared loss |
| Jensen’s inequality | ELBO logp(x)\leq \log p(\mathbf{x}) in variational inference ( formalML: Information Theory ) |
| MGF uniqueness | Proof of the Central Limit Theorem (identifies the limit as N(0,1)N(0,1)) |
| Chebyshev’s inequality | PAC learning bounds; weak law of large numbers (Modes of Convergence; Law of Large Numbers) |

10. Summary

This topic completes Track 1: Foundations of Probability. We now have the complete toolkit:

| Topic | Core objects | Core results |
| --- | --- | --- |
| Sample Spaces | (Ω,F,P)(\Omega, \mathcal{F}, P) | Kolmogorov axioms, inclusion-exclusion |
| Conditional Probability | P(AB)P(A \mid B) | Bayes’ theorem, independence, total probability |
| Random Variables | X:ΩRX : \Omega \to \mathbb{R}, PMF, PDF, CDF | Distributions, joint/marginal/conditional, transformations |
| Expectation & Moments | E[X]E[X], Var(X)\text{Var}(X), Cov(X,Y)\text{Cov}(X,Y), MX(t)M_X(t) | Linearity, variance decomposition, Chebyshev, Jensen, tower property, Eve’s law, MGF uniqueness |

What comes next. Track 1’s machinery feeds directly into five parallel tracks:

  • Discrete Distributions and Continuous Distributions apply the expectation and variance formulas to every named distribution — Binomial, Poisson, Normal, Exponential, Gamma, Beta.
  • Modes of Convergence uses Markov and Chebyshev as the starting point for the law of large numbers and the central limit theorem.
  • Point Estimation & Bias-Variance defines bias as E[θ^]θE[\hat{\theta}] - \theta and MSE as Var(θ^)+Bias2\text{Var}(\hat{\theta}) + \text{Bias}^2.
  • Method of Moments equates sample moments Xˉ,X2,\bar{X}, \overline{X^2}, \ldots to population moments E[X],E[X2],E[X], E[X^2], \ldots.
