The center of mass, the spread, and the shape — the numerical summaries that reduce distributions to the quantities that drive all of statistical inference and machine learning.
In Topic 3, we built the machinery of random variables, PMFs, PDFs, and CDFs — the full description of how probability is distributed over numbers. But a full distribution is a lot of information. Often we need a single number that summarizes the “location” of a distribution: where is the probability concentrated? What value do we “expect” to see?
The expectation (or expected value, or mean) of a random variable answers this question. It is the center of mass of the distribution — the balance point. If you placed the PMF bars (or PDF curve) on a number line and balanced it on a fulcrum, the balance point would be E[X].
Definition 1 Expectation (Discrete and Continuous)
Let X be a random variable.
Discrete case. If X takes values in a countable set {x_1, x_2, …} with PMF p_X, the expectation of X is
E[X] = ∑_i x_i p_X(x_i),
provided the sum converges absolutely (E[∣X∣] < ∞).
Continuous case. If X has PDF f_X, then E[X] = ∫_{−∞}^{∞} x f_X(x) dx, under the same absolute convergence condition.
The absolute convergence condition E[∣X∣] < ∞ is not a technicality — without it, the expectation can depend on the order of summation or the way we partition the integral. The Cauchy distribution with PDF f(x) = 1/(π(1+x²)) is the standard example: ∫_{−∞}^{∞} ∣x∣ f(x) dx = ∞, so E[X] does not exist. If you compute the Cauchy principal value lim_{R→∞} ∫_{−R}^{R} x f(x) dx, you get 0 — but that’s a cancellation artifact, not a genuine expectation. The absolute convergence condition from formalCalculus: Sequences & Limits ensures the expectation is well-defined regardless of ordering.
The expectation is a weighted average of the values, weighted by their probabilities. For the fair die, E[X] = (1/6)(1+2+3+4+5+6) = 3.5. For a loaded die that favors high rolls, the balance point shifts rightward. For a continuous distribution, the sum becomes an integral but the idea is identical: multiply each value by its probability density and integrate.
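To make the balance-point picture concrete, here is a minimal sketch in Python; the loaded-die PMF is a hypothetical example, not one defined above:

```python
# Expectation as a probability-weighted average, using exact fractions.
from fractions import Fraction

# Fair die: each face has probability 1/6.
fair = {k: Fraction(1, 6) for k in range(1, 7)}
E_fair = sum(x * p for x, p in fair.items())
print(E_fair)  # 7/2, i.e. 3.5

# A hypothetical loaded die that favors high rolls shifts the balance point right.
loaded = {1: Fraction(1, 12), 2: Fraction(1, 12), 3: Fraction(1, 12),
          4: Fraction(1, 12), 5: Fraction(1, 3), 6: Fraction(1, 3)}
assert sum(loaded.values()) == 1  # a valid PMF must sum to 1
E_loaded = sum(x * p for x, p in loaded.items())
print(E_loaded)  # 9/2, i.e. 4.5 — the fulcrum moved right
```

Using `Fraction` keeps the arithmetic exact, so the balance point comes out as a clean rational number rather than a float.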
One of the most useful tools for computing expectations is LOTUS — the Law of the Unconscious Statistician. It lets us compute E[g(X)] directly from the distribution of X, without first finding the distribution of g(X).
Theorem 1 LOTUS (Law of the Unconscious Statistician)
Let X be a random variable and g:R→R a function.
Discrete case: E[g(X)] = ∑_x g(x) p_X(x)
Continuous case: E[g(X)] = ∫_{−∞}^{∞} g(x) f_X(x) dx
provided the sum/integral converges absolutely.
Proof
Discrete case. Let Y=g(X). We need to show that computing E[Y] via the PMF of Y gives the same result as summing g(x)pX(x) over the support of X.
The PMF of Y is pY(y)=P(g(X)=y)=∑x:g(x)=ypX(x). Therefore:
E[Y] = ∑_y y p_Y(y) = ∑_y y ∑_{x: g(x)=y} p_X(x)
Swapping the order of summation — every x appears in exactly one group (the group indexed by y = g(x)) — this becomes ∑_x g(x) p_X(x), which is the claimed formula. □
LOTUS is called the “law of the unconscious statistician” because students often apply it without thinking — and it works. The name is mildly pejorative, but the theorem is anything but trivial: it saves you from having to derive the distribution of g(X) before computing the expectation.
Example 1 Die roll expectation
Roll a fair die. X∈{1,2,3,4,5,6} with pX(k)=1/6 for each k.
E[X] = ∑_{k=1}^{6} k·(1/6) = (1/6)(1+2+3+4+5+6) = 21/6 = 3.5
Notice that E[X]=3.5 is not a value X can actually take — this is normal. The expectation is the center of mass, not a mode or a median.
Using LOTUS, E[X²] = ∑_{k=1}^{6} k²·(1/6) = (1/6)(1+4+9+16+25+36) = 91/6 ≈ 15.17.
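The shortcut LOTUS provides can be checked numerically — a small sketch comparing the direct sum against the long way through the distribution of g(X), using exact fractions:

```python
from collections import defaultdict
from fractions import Fraction

p = {k: Fraction(1, 6) for k in range(1, 7)}  # fair die PMF

# LOTUS: sum g(x) p_X(x) directly, without finding the PMF of g(X) = X^2.
E_X2_lotus = sum(k ** 2 * p[k] for k in p)

# The long way: derive the PMF of Y = X^2 first, then take its expectation.
pY = defaultdict(Fraction)
for k, prob in p.items():
    pY[k ** 2] += prob
E_X2_direct = sum(y * q for y, q in pY.items())

assert E_X2_lotus == E_X2_direct == Fraction(91, 6)
print(float(E_X2_lotus))  # 15.1666...
```

For an injective g the two routes are trivially the same; the `defaultdict` accumulation is what handles the general case where several x values map to one y.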
An Exponential(λ) random variable has mean 1/λ. If requests arrive at a server at rate λ = 5 per second, the mean inter-arrival time is 1/5 = 0.2 seconds.
Use the explorer below to visualize expectation as the balance point. Toggle between discrete and continuous distributions, or enter your own probability values:
[Expectation Balance Explorer — interactive. Toggle between discrete and continuous distributions or enter custom probabilities; for the fair die it reports E[X] = 3.5, E[X²] ≈ 15.1667, and Var(X) ≈ 2.9167.]
2. Properties of Expectation
Expectation is a linear operation — and this is its single most powerful property. Linearity holds without any independence assumption.
Theorem 2 Linearity of Expectation
For any random variables X and Y (with finite expectations) and constants a,b∈R:
E[aX+bY]=aE[X]+bE[Y]
Proof
We prove the discrete case; the continuous case is analogous with integrals replacing sums.
Let (X,Y) have joint PMF pX,Y(x,y). Then:
E[aX+bY] = ∑_x ∑_y (ax+by) p_{X,Y}(x,y)
Expanding the sum:
= a ∑_x ∑_y x p_{X,Y}(x,y) + b ∑_x ∑_y y p_{X,Y}(x,y)
The inner sum in the first term is ∑_y p_{X,Y}(x,y) = p_X(x) (the marginal PMF of X, from Topic 3), and symmetrically for the second term. So:
= a ∑_x x p_X(x) + b ∑_y y p_Y(y) = a E[X] + b E[Y]
No independence was used — only the existence of marginals from the joint. □
Remark Linearity requires no independence
This is worth emphasizing: E[X+Y]=E[X]+E[Y] always, even when X and Y are dependent. The proof uses only marginalization, not factorization of the joint. This makes linearity enormously useful — we can compute E[sum] as a sum of expectations even when the summands are tangled together in complex ways. The classic application: expected number of fixed points in a random permutation (Example 3 below).
Theorem 3 Monotonicity
If X≤Y almost surely (i.e., P(X≤Y)=1), then E[X]≤E[Y].
Proof
Define Z = Y − X. Since X ≤ Y a.s., we have Z ≥ 0 a.s. For a nonnegative random variable, E[Z] = ∑_z z p_Z(z) ≥ 0 (every term is nonnegative). So E[Y] − E[X] = E[Y−X] = E[Z] ≥ 0. □
Theorem 4 Expectation of Constants
For any constant c∈R: E[c]=c.
The proof is immediate: a constant random variable has PMF concentrated at a single point, so E[c]=c⋅1=c.
Theorem 5 Expectation of Independent Products
If X and Y are independent random variables with finite expectations, then
E[XY]=E[X]⋅E[Y]
Proof
Since X⊥Y, the joint PMF factors: pX,Y(x,y)=pX(x)⋅pY(y) (from Topic 2 and Topic 3). Then:
E[XY] = ∑_x ∑_y xy p_{X,Y}(x,y) = ∑_x ∑_y xy p_X(x) p_Y(y)
Factoring:
= (∑_x x p_X(x))(∑_y y p_Y(y)) = E[X]·E[Y]
□
Remark E[XY] = E[X]E[Y] does not imply independence
The converse is false. If X ∼ Uniform{−1, 0, 1} and Y = X², then E[XY] = E[X³] = 0 = E[X]·E[Y], but X and Y are clearly dependent (Y is a deterministic function of X). The condition E[XY] = E[X]E[Y] is called uncorrelatedness — it is strictly weaker than independence.
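This counterexample is easy to verify by direct computation; a minimal sketch:

```python
from fractions import Fraction

# X uniform on {-1, 0, 1}; Y = X^2 is a deterministic function of X.
support = [-1, 0, 1]
p = Fraction(1, 3)

E_X = sum(x * p for x in support)              # 0
E_Y = sum(x ** 2 * p for x in support)         # 2/3
E_XY = sum(x * x ** 2 * p for x in support)    # E[X^3] = 0

assert E_XY == E_X * E_Y == 0   # uncorrelated...
# ...yet dependent: P(Y = 0 | X = 0) = 1, while unconditionally P(Y = 0) = 1/3.
print(E_X, E_Y, E_XY)
```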
Example 3 Expected number of matches (linearity without independence)
Randomly shuffle n cards labeled 1, …, n. A match (or fixed point) occurs at position i if card i lands in position i. Let M = ∑_{i=1}^{n} X_i where X_i = 1{card i is in position i}.
The X_i’s are dependent (if card 1 is in position 1, the remaining cards are shuffled among n−1 positions, changing the probabilities for X_2, …, X_n). But linearity doesn’t care:
E[M] = ∑_{i=1}^{n} E[X_i] = ∑_{i=1}^{n} P(card i in position i) = ∑_{i=1}^{n} 1/n = 1
The expected number of matches is exactly 1, regardless of n. This surprising result — the same whether you shuffle 10 cards or 10 million — follows effortlessly from linearity.
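A quick simulation (hypothetical trial counts, seeded for reproducibility) confirms that the average number of matches hovers near 1 for any n:

```python
import random

random.seed(0)

def matches(n):
    """Number of fixed points of a uniformly random permutation of n cards."""
    perm = list(range(n))
    random.shuffle(perm)
    return sum(1 for pos, card in enumerate(perm) if pos == card)

# E[M] = 1 for every n, even though the indicator variables X_i are dependent.
trials = 5000
averages = {n: sum(matches(n) for _ in range(trials)) / trials for n in (10, 100)}
print(averages)  # both averages hover near 1
```

Since Var(M) is also close to 1 for large n, the Monte Carlo error at 5000 trials is roughly 1/√5000 ≈ 0.014, so the averages land well within a tenth of the theoretical value.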
3. Variance: Measuring Spread
The expectation tells us where a distribution is centered. But two distributions can have the same center and look completely different — one tightly concentrated, the other spread wide. We need a measure of spread.
Definition 2 Variance and Standard Deviation
The variance of a random variable X with mean μ=E[X] is
Var(X)=E[(X−μ)2]
The standard deviation is σ_X = √Var(X).
Variance is also written σ², σ_X², or Var(X).
Variance is the average squared distance from the mean. It measures how far a random variable typically falls from its expected value. The squaring ensures that deviations above and below the mean both contribute positively. The standard deviation σ returns us to the original units (if X is in meters, Var(X) is in meters² but σ_X is in meters).
Expanding Var(X+Y) = E[((X−μ_X)+(Y−μ_Y))²] produces three terms. The first two are Var(X) and Var(Y). The third term is 2Cov(X,Y) (Definition 3 below). When X ⊥ Y, E[XY] = E[X]E[Y] (Theorem 5), so the covariance term vanishes and Var(X+Y) = Var(X) + Var(Y). □
Remark Variance does NOT split for dependent variables
Property 3 requires independence. In general, Var(X+Y)=Var(X)+Var(Y)+2Cov(X,Y). If X and Y are positively correlated (Cov(X,Y)>0), the variance of their sum is larger than the sum of variances. If negatively correlated, it’s smaller. This is the mathematical foundation of portfolio diversification: combining negatively correlated assets reduces total variance.
Example 4 Die roll variance
For a fair die with E[X] = 3.5 and E[X²] = 91/6 ≈ 15.17, the computational formula gives Var(X) = E[X²] − (E[X])² = 91/6 − 12.25 = 35/12 ≈ 2.92. Now compare two payout schemes with the same mean but very different spreads:
Variant A: win 5 dollars with probability 0.4, else 0. E[A]=2, Var(A)=0.4⋅25−4=6.
Variant B: win 20 dollars with probability 0.1, else 0. E[B]=2, Var(B)=0.1⋅400−4=36.
Both have the same mean payout (2 dollars), but Variant B is 6x more variable. In an A/B test, you’d need far more samples to detect a treatment effect in B than in A — because the noise-to-signal ratio is much higher. This is why variance matters for experimental design.
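The two variants can be checked with a small helper; `mean_var` is a hypothetical utility, not something defined earlier:

```python
def mean_var(pmf):
    """Mean and variance of a finite PMF given as {value: probability}."""
    m = sum(x * p for x, p in pmf.items())
    v = sum((x - m) ** 2 * p for x, p in pmf.items())
    return m, v

# The two hypothetical payout schemes from the example.
A = {5: 0.4, 0: 0.6}    # win $5 with probability 0.4
B = {20: 0.1, 0: 0.9}   # win $20 with probability 0.1

print(mean_var(A))  # (2.0, 6.0)  -- same mean, small variance
print(mean_var(B))  # (2.0, 36.0) -- same mean, 6x the variance
```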
Variance Formulas (fair die)
Definition (average squared deviation): Var(X) = E[(X − μ)²]
= (1 − 3.5)²·(1/6) + (2 − 3.5)²·(1/6) + (3 − 3.5)²·(1/6) + (4 − 3.5)²·(1/6) + (5 − 3.5)²·(1/6) + (6 − 3.5)²·(1/6) = 2.9167
Computational formula: Var(X) = E[X²] − (E[X])²
= 15.1667 − (3.5)² = 15.1667 − 12.2500 = 2.9167
Both formulas agree: E[X] = 3.5, E[X²] ≈ 15.1667, Var(X) ≈ 2.9167, σ = √Var(X) ≈ 1.7078.
4. Covariance and Correlation
When we have two random variables, we want to quantify their linear association. Do they tend to be large together (positive association) or does one tend to be large when the other is small (negative association)?
For any t, let Z = X/σ_X − tY/σ_Y, so that Var(Z) = 1 − 2tρ + t². This quadratic in t is nonnegative for all t, so its discriminant must be ≤ 0:
4ρ² − 4 ≤ 0 ⟹ ρ² ≤ 1 ⟹ −1 ≤ ρ ≤ 1
Equality holds when Var(Z) = 0 for some t, meaning Z is constant a.s., i.e. X/σ_X = tY/σ_Y + c. □
Remark Zero covariance from independence; converse false
Independence ⟹ Cov(X,Y) = 0 ⟹ ρ(X,Y) = 0 (Theorem 5). But the converse fails: uncorrelatedness (ρ = 0) does not imply independence. The example from Remark 3 (X uniform on {−1, 0, 1}, Y = X²) has ρ = 0 but complete functional dependence. Correlation measures linear association only — it can miss nonlinear dependencies entirely. This distinction matters in ML: two features can be uncorrelated yet carry highly redundant information through nonlinear relationships.
5. Standard Inequalities
Probability bounds are the bread and butter of theoretical statistics and machine learning. When we can’t compute exact probabilities, we use inequalities to bound them from above. The three workhorses are Markov, Chebyshev, and Jensen.
Markov’s inequality states that for a nonnegative random variable X and any a > 0, P(X ≥ a) ≤ E[X]/a. It is very weak — but it uses almost no information (only E[X] and X ≥ 0). The bound is tight: for a random variable with P(X = n) = 1/n and P(X = 0) = 1 − 1/n, we have E[X] = 1, so the bound E[X]/n = 1/n exactly matches P(X ≥ n) = 1/n.
Theorem 12 Chebyshev's Inequality
For any random variable X with E[X]=μ and Var(X)=σ2<∞:
P(∣X−μ∣ ≥ kσ) ≤ 1/k²
for any k > 0. Equivalently, P(∣X−μ∣ ≥ ε) ≤ Var(X)/ε².
Proof
Apply Markov’s inequality to the nonnegative random variable (X−μ)² with threshold ε²:
P(∣X−μ∣ ≥ ε) = P((X−μ)² ≥ ε²) ≤ E[(X−μ)²]/ε² = Var(X)/ε²
Setting ε=kσ gives P(∣X−μ∣≥kσ)≤1/k2. □
Chebyshev uses both the mean and the variance, so it’s tighter than Markov. At k=2 standard deviations: Chebyshev gives ≤25%, while for the normal distribution the true probability is ≈4.6%. At k=3: Chebyshev gives ≤11.1%; normal gives ≈0.3%. Chebyshev applies to any distribution — that’s why it’s loose for the well-behaved normal.
Example 6 Chebyshev in practice
A quality control process produces items with mean weight μ=100g and standard deviation σ=2g. What fraction of items can weigh more than 106g?
Using Chebyshev with k=3 (since ∣106−100∣=6=3σ):
P(X > 106) ≤ P(∣X−100∣ ≥ 6) ≤ 1/3² = 1/9 ≈ 11.1%
If we know the weights are normally distributed, the true probability is P(∣Z∣≥3)≈0.27% — 40x smaller. Chebyshev’s power is that it works regardless of the distribution shape.
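The Chebyshev-versus-normal comparison can be reproduced with the standard library’s error function:

```python
import math

mu, sigma = 100.0, 2.0
threshold = 6.0            # |X - 100| >= 6
k = threshold / sigma      # k = 3 standard deviations

cheb = 1 / k ** 2          # distribution-free Chebyshev bound

def normal_two_sided(k):
    """Exact two-sided tail P(|Z| >= k) for a standard normal, via erf."""
    return 2 * (1 - 0.5 * (1 + math.erf(k / math.sqrt(2))))

print(cheb)                 # 0.1111... (~11.1%)
print(normal_two_sided(k))  # ~0.0027   (~0.27%), roughly 40x smaller
```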
Theorem 13 Jensen's Inequality
If g is a convex function and E[X] exists, then
g(E[X])≤E[g(X)]
If g is concave, the inequality reverses: g(E[X])≥E[g(X)].
Proof
Since g is convex, it lies above every tangent line. At the point μ=E[X], there exists a slope m (a subgradient) such that for all x:
g(x)≥g(μ)+m(x−μ)
Taking expectations of both sides (which preserves the inequality by monotonicity, Theorem 3):
E[g(X)]≥g(μ)+mE[X−μ]=g(μ)+m⋅0=g(μ)=g(E[X])
□
Example 7 Jensen and the AM-GM inequality
Let g(x)=−log(x) (convex on (0,∞)). Jensen gives:
−log(E[X]) ≤ E[−log(X)] = −E[log(X)]
So log(E[X]) ≥ E[log(X)], or equivalently E[X] ≥ e^{E[log X]}. For n equal-probability values x_1, …, x_n:
(x_1 + ⋯ + x_n)/n ≥ (x_1 ⋯ x_n)^{1/n}
This is the arithmetic-mean ≥ geometric-mean inequality — a pure consequence of Jensen.
ML application: Jensen’s inequality with g(x) = −log(x) is exactly what gives us the evidence lower bound (ELBO) in variational inference: log p(x) ≥ E_q[log p(x,z) − log q(z)]. See formalML: Information Theory for the full derivation.
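As a sanity check on the AM–GM consequence of Jensen, a small randomized sketch (hypothetical sample sizes, seeded):

```python
import math
import random

random.seed(1)

# AM >= GM for any positive values -- Jensen with the convex g(x) = -log(x).
for _ in range(100):
    xs = [random.uniform(0.1, 10) for _ in range(5)]
    am = sum(xs) / len(xs)                                  # arithmetic mean
    gm = math.exp(sum(math.log(x) for x in xs) / len(xs))   # geometric mean
    assert am >= gm
print("AM >= GM held in all trials")
```

Computing the geometric mean as exp of the mean log is also the numerically stable way to do it, which is itself the E[X] ≥ e^{E[log X]} form of the inequality.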
In Topic 2, we developed conditional probability P(A∣B) — the probability of an event given partial information. Now we extend this idea from events to random variables. The conditional expectation E[X∣Y] is our best guess of X given what Y tells us.
Definition 5 Conditional Expectation Given an Event
If B is an event with P(B)>0, the conditional expectation of X given B is
E[X∣B] = ∑_x x P(X=x ∣ B)  (discrete)
E[X∣B] = ∫_{−∞}^{∞} x f_{X∣B}(x) dx  (continuous)
This is just the ordinary expectation computed using the conditional distribution.
Definition 6 Conditional Expectation as a Function
For each y with P(Y=y) > 0, define E[X∣Y=y] = ∑_x x P(X=x ∣ Y=y) (with the analogous integral against f_{X∣Y}(x∣y) in the continuous case). Here E[X∣Y=y] is a function of y — we write it as h(y) = E[X∣Y=y].
Definition 7 Conditional Expectation as a Random Variable
The conditional expectation E[X∣Y] is the random variable obtained by evaluating the function h(y) = E[X∣Y=y] at Y:
E[X∣Y]=h(Y)
This is a random variable — it inherits its randomness from Y. Different realizations of Y produce different “best guesses” of X.
The progression from Definition 5 to Definition 7 is crucial: we start with a number (E[X∣B]), then a function of y (E[X∣Y=y]), then a random variable (E[X∣Y]). The random variable interpretation is what makes the tower property (Theorem 14) meaningful — we can take expectations of conditional expectations.
Example 8 Bivariate normal conditional expectation
Let (X,Y) be bivariate normal with means μX,μY, standard deviations σX,σY, and correlation ρ. From Topic 3, §8:
E[X∣Y=y] = μ_X + ρ(σ_X/σ_Y)(y − μ_Y)
This is a linear function of y — it’s the regression line. The slope is ρσ_X/σ_Y, and when ρ = 0, the conditional mean equals the unconditional mean μ_X (knowing Y provides no information about X).
The conditional variance is Var(X∣Y=y) = σ_X²(1 − ρ²), which does not depend on y. This homoscedasticity is special to the bivariate normal.
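A simulation sketch, using the conditional construction of the bivariate normal with hypothetical parameter values, recovers the regression line:

```python
import math
import random

random.seed(42)
# Hypothetical parameters for illustration.
mu_x, mu_y, sd_x, sd_y, rho = 1.0, -2.0, 3.0, 1.5, 0.6

# Sample (X, Y) by drawing Y first, then X given Y, using exactly the
# conditional mean and conditional variance quoted in the example.
n = 100_000
pairs = []
for _ in range(n):
    y = random.gauss(mu_y, sd_y)
    cond_mean = mu_x + rho * (sd_x / sd_y) * (y - mu_y)
    cond_sd = sd_x * math.sqrt(1 - rho ** 2)
    pairs.append((random.gauss(cond_mean, cond_sd), y))

# The empirical mean of X over a thin band of Y-values should sit on the line.
y0 = 0.0
band = [x for x, y in pairs if abs(y - y0) < 0.1]
empirical = sum(band) / len(band)
predicted = mu_x + rho * (sd_x / sd_y) * (y0 - mu_y)  # = 3.4 for these parameters
print(round(empirical, 2), predicted)
```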
7. The Law of Total Expectation and Eve’s Law
The law of total expectation (also called the tower property or Adam’s law) is one of the most powerful tools in probability. It says: to compute E[X], first compute E[X∣Y] for each value of Y, then average over Y.
Theorem 14 Law of Total Expectation (Tower Property)
E[X]=E[E[X∣Y]]
More precisely, if Y is discrete with values {y1,y2,…}:
E[X] = ∑_j E[X∣Y=y_j] P(Y=y_j)
Proof
We prove the discrete case. Start with the right side:
∑_j E[X∣Y=y_j] P(Y=y_j) = ∑_j (∑_i x_i P(X=x_i ∣ Y=y_j)) P(Y=y_j)
By the definition of conditional probability, P(X=x_i ∣ Y=y_j)·P(Y=y_j) = P(X=x_i, Y=y_j):
= ∑_j ∑_i x_i P(X=x_i, Y=y_j)
Swapping the order of summation:
= ∑_i x_i ∑_j P(X=x_i, Y=y_j) = ∑_i x_i P(X=x_i) = E[X]
The last step uses the law of total probability: ∑_j P(X=x_i, Y=y_j) = P(X=x_i) (marginalizing out Y). □
Notice the parallel with the law of total probability from Topic 2: P(A)=∑jP(A∣Bj)P(Bj). The tower property is the same idea applied to expectations.
Theorem 15 Law of Total Variance (Eve's Law)
Var(X)=E[Var(X∣Y)]+Var(E[X∣Y])
In words: total variance = expected within-group variance + between-group variance.
Proof
Use the computational formula Var(X)=E[X2]−(E[X])2 and apply the tower property to both terms.
By the tower property: E[X2]=E[E[X2∣Y]] and E[X]=E[E[X∣Y]].
Now note that Var(X∣Y) = E[X²∣Y] − (E[X∣Y])² (the computational formula applied conditionally), so E[E[X²∣Y]] = E[Var(X∣Y)] + E[(E[X∣Y])²]. Therefore:
Var(X) = E[X²] − (E[X])² = E[Var(X∣Y)] + E[(E[X∣Y])²] − (E[E[X∣Y]])²
The last two terms are E[Z2]−(E[Z])2 where Z=E[X∣Y], which is Var(Z)=Var(E[X∣Y]).
=E[Var(X∣Y)]+Var(E[X∣Y])
□
Eve’s law is the mathematical foundation of ANOVA (analysis of variance): total variation decomposes into within-group and between-group components. In ML, it underlies the bias-variance decomposition (§9).
Example 9 Mixture model (tower property)
A company has two customer segments: Casual (60%) with mean spending of 50, and Power Users (40%) with mean spending of 120. Let Y indicate the segment. By the tower property, E[X] = 0.6·50 + 0.4·120 = 30 + 48 = 78.
Between-group variance (variance of the conditional means): the conditional means are 50 and 120 with weights 0.6 and 0.4, and their mean is E[X] = 78. So Var(E[X∣Y]) = 0.6(50 − 78)² + 0.4(120 − 78)² = 470.4 + 705.6 = 1176.
With the within-segment variances used in the explorer below, E[Var(X∣Y)] = 1240, so Var(X) = 1240 + 1176 = 2416. About half the variance comes from within segments (customers vary within their segment) and half from between segments (the segments have different means).
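The tower property and Eve’s law for this mixture can be verified in a few lines; the within-segment variances (1000 and 1600) are illustrative assumptions chosen so the totals match the explorer:

```python
# Tower property (Adam's law) and Eve's law for the two-segment mixture.
# Within-segment variances of 1000 and 1600 are illustrative assumptions.
weights = {"casual": 0.6, "power": 0.4}
cond_mean = {"casual": 50.0, "power": 120.0}
cond_var = {"casual": 1000.0, "power": 1600.0}

# Adam's law: E[X] = E[E[X|Y]]
EX = sum(weights[s] * cond_mean[s] for s in weights)                    # 78.0

# Eve's law: Var(X) = E[Var(X|Y)] + Var(E[X|Y])
within = sum(weights[s] * cond_var[s] for s in weights)                 # 1240.0
between = sum(weights[s] * (cond_mean[s] - EX) ** 2 for s in weights)   # 1176.0

print(EX, within, between, within + between)  # total variance 2416.0
```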
[Law of Total Expectation Explorer — interactive. Two customer segments (Casual, Power User) with different spending patterns. It reports E[X] = E[E[X∣Y]] = 78.00, E[Var(X∣Y)] = 1240.00, Var(E[X∣Y]) = 1176.00, and Var(X) = 2416.00, verifying Eve’s law: Var(X) = E[Var(X∣Y)] + Var(E[X∣Y]).]
8. Moment-Generating Functions
A moment-generating function (MGF) packages all the moments of a distribution — E[X], E[X2], E[X3], and so on — into a single function. It’s the probabilist’s version of the Laplace transform.
Definition 8 Moment-Generating Function
The moment-generating function (MGF) of a random variable X is
MX(t)=E[etX]
defined for all t∈R where the expectation exists. Explicitly:
M_X(t) = ∑_x e^{tx} p_X(x)  (discrete)
M_X(t) = ∫_{−∞}^{∞} e^{tx} f_X(x) dx  (continuous)
The name “moment-generating function” is literal: the nth derivative of MX(t) evaluated at t=0 gives the nth moment E[Xn].
Theorem 16 Moments from the MGF
If M_X(t) exists in an open interval around t = 0, then M_X is infinitely differentiable at 0 and M_X^(n)(0) = E[Xⁿ] for all n ≥ 1.
Proof
Expand the exponential as a power series and take expectations term by term:
M_X(t) = E[e^{tX}] = E[∑_{n=0}^{∞} (tX)ⁿ/n!] = ∑_{n=0}^{∞} tⁿ E[Xⁿ]/n!
(The interchange of expectation and sum is justified by the assumption that M_X exists in an interval around 0, which provides the absolute convergence needed.)
This is a power series in t with coefficients E[Xⁿ]/n!. By the Taylor coefficient formula:
M_X^(n)(0)/n! = E[Xⁿ]/n!
So M_X^(n)(0) = E[Xⁿ]. In particular:
M_X(0) = 1 (always)
M_X′(0) = E[X] (the mean)
M_X″(0) = E[X²], so Var(X) = M_X″(0) − (M_X′(0))²
□
Theorem 17 Uniqueness of the MGF
If MX(t)=MY(t) for all t in some open interval (−δ,δ) around 0, then X and Y have the same distribution.
This uniqueness theorem is what makes MGFs a powerful proof tool: if you can show two random variables have the same MGF, you’ve shown they have the same distribution. We’ll use this in the proof of the Central Limit Theorem — MGF uniqueness is the final step that identifies the limiting distribution as N(0,1).
Theorem 18 MGF of Independent Sums
If X and Y are independent, then
MX+Y(t)=MX(t)⋅MY(t)
Proof
M_{X+Y}(t) = E[e^{t(X+Y)}] = E[e^{tX}·e^{tY}]
Since X ⊥ Y, the functions e^{tX} and e^{tY} are independent (functions of independent variables are independent). By Theorem 5:
= E[e^{tX}]·E[e^{tY}] = M_X(t)·M_Y(t)
□
Example 11 Bernoulli MGF
X∼Bernoulli(p):
M_X(t) = E[e^{tX}] = e^{t·0}(1−p) + e^{t·1}p = (1−p) + pe^t
Check: M_X′(0) = pe⁰ = p = E[X]. ✓
Example 12 Normal MGF
X∼N(μ,σ2). By completing the square in the exponent of the integral (a standard technique):
M_X(t) = exp(μt + σ²t²/2)
Check: M_X′(t) = (μ + σ²t)M_X(t), so M_X′(0) = μ. M_X″(0) = σ² + μ², so Var(X) = σ² + μ² − μ² = σ². ✓
Example 13 Exponential MGF
X∼Exp(λ):
M_X(t) = ∫_0^∞ e^{tx} λe^{−λx} dx = λ ∫_0^∞ e^{−(λ−t)x} dx = λ/(λ−t)
for t < λ (the integral diverges for t ≥ λ).
Check: M_X′(t) = λ/(λ−t)², so M_X′(0) = 1/λ = E[X]. ✓
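The moment-generating property can be checked numerically by differentiating M_X(t) = λ/(λ−t) with central finite differences at t = 0:

```python
lam = 5.0

def M(t):
    """Exponential(lam) MGF, valid for t < lam."""
    return lam / (lam - t)

h = 1e-4
M1 = (M(h) - M(-h)) / (2 * h)               # central difference ~ M'(0) = E[X]
M2 = (M(h) - 2 * M(0.0) + M(-h)) / h ** 2   # ~ M''(0) = E[X^2]

print(round(M1, 4))           # 0.2   (= 1/lam)
print(round(M2, 4))           # 0.08  (= 2/lam^2)
print(round(M2 - M1 ** 2, 4)) # 0.04  (= Var(X) = 1/lam^2)
```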
Example 14 Sum of independent normals via MGF
If X ∼ N(μ_1, σ_1²) and Y ∼ N(μ_2, σ_2²) are independent, then by Theorem 18:
M_{X+Y}(t) = M_X(t)·M_Y(t) = exp(μ_1 t + σ_1²t²/2)·exp(μ_2 t + σ_2²t²/2) = exp((μ_1+μ_2)t + (σ_1²+σ_2²)t²/2)
By the uniqueness theorem (Theorem 17), this is the MGF of N(μ_1+μ_2, σ_1²+σ_2²). Therefore:
X + Y ∼ N(μ_1+μ_2, σ_1²+σ_2²)
Independent normals sum to a normal — the mean adds, the variance adds. This is a property unique to the normal distribution and underlies much of classical statistics.
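A simulation sketch (hypothetical parameters, seeded) confirms that for independent normals the mean adds and the variance adds:

```python
import random
import statistics

random.seed(7)
# Hypothetical parameters.
mu1, s1, mu2, s2 = 2.0, 1.0, -1.0, 2.0

n = 100_000
sums = [random.gauss(mu1, s1) + random.gauss(mu2, s2) for _ in range(n)]

m = statistics.fmean(sums)     # should be near mu1 + mu2 = 1.0
v = statistics.variance(sums)  # should be near s1^2 + s2^2 = 5.0
print(round(m, 1), round(v, 1))
```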
[MGF Explorer — interactive. Plots M(t) = E[e^{tX}] for Bernoulli(p) with p = 0.50 and compares moments computed by central finite differences at t = 0 against exact values: M′(0) = E[X] = 0.5, M″(0) = E[X²] = 0.5, Var(X) = M″(0) − (M′(0))² = 0.25. The red dashed line is the tangent at t = 0 with slope M′(0) = E[X]; the green dot marks M(0) = 1, which holds for every distribution, since E[e^{0·X}] = 1.]
9. Connections to ML
Every concept in this topic has a direct counterpart in machine learning. Let us highlight the central connection: the bias-variance decomposition.
Theorem 19 Conditional Expectation Minimizes MSE
Among all functions g(Y) of Y, the conditional expectation E[X∣Y] minimizes the mean squared error:
E[X∣Y] = arg min_g E[(X − g(Y))²]
This is why supervised learning works: the optimal prediction of Y given features X (under squared loss) is E[Y∣X]. Every regression model is an approximation to this conditional expectation.
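A simulation sketch of this optimality, with a hypothetical linear model X = 3Y + noise so that E[X∣Y] = 3Y by construction:

```python
import random

random.seed(3)

# Construct X = 3Y + noise, so E[X | Y] = 3Y. Any other predictor g(Y)
# should incur a larger mean squared error.
n = 50_000
samples = []
for _ in range(n):
    y = random.choice([0, 1, 2])
    samples.append((y, 3 * y + random.gauss(0, 1)))

def mse(g):
    return sum((x - g(y)) ** 2 for y, x in samples) / n

mse_best = mse(lambda y: 3 * y)        # the conditional mean
mse_other = mse(lambda y: 2 * y + 1)   # some other function of Y
mse_const = mse(lambda y: 3.0)         # the best constant predictor, E[X] = 3

print(round(mse_best, 2), round(mse_other, 2), round(mse_const, 2))
```

The best achievable MSE equals the noise variance (here 1), exactly the irreducible error of the bias-variance decomposition below.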
The bias-variance decomposition connects Eve’s law (Theorem 15) to prediction error. For an estimator f̂(x) of a target f(x) = E[Y∣X=x], the expected squared error decomposes as
E[(Y − f̂(x))²] = (f(x) − E[f̂(x)])² + Var(f̂(x)) + Var(Y∣X=x) = Bias² + Variance + Noise
This is Eve’s law in disguise: the total prediction error decomposes into a systematic component (bias) and a variability component (variance), plus noise that no model can remove.
Concept from this topic → ML application
E[X] (expectation) → Risk = E[ℓ(Y, f̂(X))], the expected loss that training minimizes