
Discrete Distributions

Bernoulli, Binomial, Geometric, Negative Binomial, Poisson, Hypergeometric, and Discrete Uniform — the named PMFs that appear everywhere in statistical modeling and machine learning, each derived from a distinct probabilistic mechanism.

5.1 The Discrete Distribution Catalog

In Topic 3, we built the general framework: random variables as measurable functions, PMFs as probability assignments over countable supports, CDFs as cumulative summaries, and measurability as the bridge from probability spaces to number lines. In Topic 4, we built the tools: expectation as center of mass, variance as spread, covariance as linear association, and moment-generating functions as the engine for proving distributional results.

Those topics answered the question what is a distribution? This topic answers a different question: which distributions matter, and why?

The answer is seven specific PMFs, each arising from a distinct probabilistic mechanism. The Bernoulli models a single yes/no trial. The Binomial counts successes across independent trials. The Geometric waits for the first success. The Negative Binomial waits for the $r$-th. The Poisson counts rare events at a constant rate. The Hypergeometric counts successes when sampling without replacement. And the Discrete Uniform assigns equal probability to every outcome in a finite set.

Remark A catalog, not a narrative

Topics 1–4 followed a linear arc — each concept building toward the next. This topic is structurally different: it’s a parallel catalog. Each distribution gets the same systematic treatment (PMF → E[X] proof → Var(X) proof → MGF → key properties → ML connection), and the value comes from seeing how the same tools from Topics 3–4 produce different results when applied to different mechanisms. The repetition is the point — it’s how you internalize the tools.

Gallery of PMF bar charts for all seven discrete distributions with E[X] balance-point triangles

The table below summarizes what we’re about to build. For each distribution, the “mechanism” column describes the experiment that produces it, the “parameters” column identifies what you need to specify, and the “support” column lists which values the random variable can take.

| Distribution | Mechanism | Parameters | Support |
|---|---|---|---|
| Bernoulli | Single binary trial | $p \in (0,1)$ | $\{0, 1\}$ |
| Binomial | $n$ independent trials, count successes | $n \in \mathbb{N}$, $p \in (0,1)$ | $\{0, 1, \ldots, n\}$ |
| Geometric | Trials until first success | $p \in (0,1)$ | $\{1, 2, 3, \ldots\}$ |
| Negative Binomial | Trials until $r$-th success | $r \in \mathbb{N}$, $p \in (0,1)$ | $\{r, r+1, r+2, \ldots\}$ |
| Poisson | Count of rare events at rate $\lambda$ | $\lambda > 0$ | $\{0, 1, 2, \ldots\}$ |
| Hypergeometric | Sample without replacement | $N, K, n \in \mathbb{N}$ | $\{\max(0, n-N+K), \ldots, \min(n, K)\}$ |
| Discrete Uniform | Equal probability on finite set | $a, b \in \mathbb{Z}$ | $\{a, a+1, \ldots, b\}$ |

Five of these seven distributions — Bernoulli, Binomial, Geometric, Negative Binomial, and Poisson — belong to the exponential family. We’ll flag this for each one as we go, and Exponential Families unifies the pattern.

Interactive: Distribution Catalog Explorer — PMF and CDF plots with live readouts of $E[X]$, $\text{Var}(X)$, $\sigma$, mode, MGF, and exponential-family membership for each distribution. For example, for $\text{Bernoulli}(0.6)$: $E[X] = 0.6$, $\text{Var}(X) = 0.24$, $\sigma \approx 0.4899$, mode $= 1$, $M(t) = (1-p) + pe^t$.

5.2 The Bernoulli Distribution

The simplest possible random experiment: a single trial with two outcomes. Flip a coin, check if a user clicks, test whether a component passes quality control. The outcome is 1 (success) with probability $p$ and 0 (failure) with probability $1 - p$.

Definition 1 Bernoulli Distribution

A random variable $X$ has the Bernoulli distribution with parameter $p \in (0, 1)$, written $X \sim \text{Bernoulli}(p)$, if its PMF is

$$p_X(k) = p^k (1-p)^{1-k}, \quad k \in \{0, 1\}$$

Equivalently: $P(X = 1) = p$ and $P(X = 0) = 1 - p = q$.

The PMF is trivially a valid probability mass function: it assigns non-negative values to exactly two outcomes, and $p + (1-p) = 1$. Throughout this topic, we write $q = 1 - p$ for brevity.

Theorem 1 Bernoulli Moments and MGF

If $X \sim \text{Bernoulli}(p)$, then:

  1. $E[X] = p$
  2. $\text{Var}(X) = p(1-p) = pq$
  3. $M_X(t) = q + pe^t$

Proof

Expectation. Since $X$ takes only two values:

$$E[X] = 0 \cdot P(X = 0) + 1 \cdot P(X = 1) = 0 \cdot q + 1 \cdot p = p$$

Variance. First compute the second moment. Since $X \in \{0, 1\}$, we have $X^2 = X$, so $E[X^2] = E[X] = p$. By the variance shortcut from Topic 4:

$$\text{Var}(X) = E[X^2] - (E[X])^2 = p - p^2 = p(1-p)$$

MGF. By definition of the moment-generating function:

$$M_X(t) = E[e^{tX}] = e^{t \cdot 0} \cdot q + e^{t \cdot 1} \cdot p = q + pe^t$$

This is defined for all $t \in \mathbb{R}$. $\square$

Remark Bernoulli exponential family form

We can rewrite the Bernoulli PMF as

$$p_X(k) = (1-p) \exp\!\left(k \ln \frac{p}{1-p}\right)$$

This is the exponential family form with natural parameter $\eta = \ln(p/(1-p))$ — the logit of $p$. The inverse map $p = 1/(1 + e^{-\eta})$ is the sigmoid function. This connection is why logistic regression uses the Bernoulli distribution: the model is $Y \mid X \sim \text{Bernoulli}(\sigma(\beta^T X))$, where $\sigma$ is the sigmoid (see formalML: Logistic Regression).

Example 1 Bernoulli as logistic regression foundation

In binary classification, we model the response as $Y \mid X \sim \text{Bernoulli}(p(X))$ where $p(X) = \sigma(\beta^T X)$. The negative log-likelihood of a single observation $(x_i, y_i)$ is:

$$-\ln p(y_i \mid x_i) = -[y_i \ln \hat{p}_i + (1 - y_i) \ln(1 - \hat{p}_i)]$$

This is the cross-entropy loss — the loss function you minimize in logistic regression is literally the negative Bernoulli log-likelihood. Every gradient descent step in logistic regression is doing maximum likelihood estimation for Bernoulli parameters.
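As a concrete check, here is a minimal sketch (plain numpy; the toy labels and candidate probabilities are illustrative, not from the example) showing that the summed cross-entropy is smallest when the predicted probability matches the empirical success rate:

```python
import numpy as np

def bernoulli_nll(y, p_hat):
    """Summed negative Bernoulli log-likelihood = cross-entropy loss."""
    y = np.asarray(y, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    return -np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

y = np.array([1, 0, 1, 1, 0])            # 3 successes out of 5, so the MLE is p = 0.6
for p in (0.2, 0.6, 0.9):
    print(p, round(bernoulli_nll(y, np.full(5, p)), 4))
# The loss is minimized at p = 0.6 — the sample proportion, i.e. the Bernoulli MLE.
```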

Three-panel figure: Bernoulli PMF varying p, Binomial PMF varying p with fixed n=15, and Binomial PMF varying n with fixed p=0.3

5.3 The Binomial Distribution

What happens when we repeat a Bernoulli trial $n$ independent times and count the total number of successes? The answer is the Binomial distribution — the most natural generalization of the Bernoulli.

Definition 2 Binomial Distribution

A random variable $X$ has the Binomial distribution with parameters $n \in \mathbb{N}$ and $p \in (0,1)$, written $X \sim \text{Binomial}(n, p)$, if its PMF is

$$p_X(k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k \in \{0, 1, \ldots, n\}$$

where $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ is the binomial coefficient.

The PMF counts exactly the probability of getting $k$ specific successes (each contributing $p$) and $n-k$ specific failures (each contributing $q = 1-p$), multiplied by $\binom{n}{k}$ because we don't care which trials are the successes — only how many. The binomial theorem confirms the PMF sums to 1: $\sum_{k=0}^n \binom{n}{k} p^k q^{n-k} = (p + q)^n = 1$.

Theorem 2 Binomial as Sum of Bernoullis

If $X_1, X_2, \ldots, X_n \overset{\text{iid}}{\sim} \text{Bernoulli}(p)$ and $X = X_1 + X_2 + \cdots + X_n$, then $X \sim \text{Binomial}(n, p)$.

Proof

We need $P(X = k) = \binom{n}{k} p^k q^{n-k}$. The event $\{X = k\}$ means exactly $k$ of the $n$ independent Bernoulli trials are 1 and $n - k$ are 0. There are $\binom{n}{k}$ ways to choose which $k$ trials succeed. For any specific such pattern, the probability is $p^k q^{n-k}$ by independence. Summing over all $\binom{n}{k}$ patterns gives $P(X = k) = \binom{n}{k} p^k q^{n-k}$. $\square$

This representation is not just a derivation trick — it’s the key to computing moments.

Theorem 3 Binomial Expectation

If $X \sim \text{Binomial}(n, p)$, then $E[X] = np$.

Proof

Via linearity (the elegant proof). Write $X = X_1 + \cdots + X_n$ where each $X_i \sim \text{Bernoulli}(p)$. By linearity of expectation — which does not require independence:

$$E[X] = E[X_1] + E[X_2] + \cdots + E[X_n] = np$$

Direct PMF proof. We can also verify directly from the PMF. The $k = 0$ term vanishes, so:

$$E[X] = \sum_{k=1}^n k \binom{n}{k} p^k q^{n-k}$$

Using the identity $k \binom{n}{k} = n \binom{n-1}{k-1}$ and substituting $j = k - 1$:

$$= n \sum_{j=0}^{n-1} \binom{n-1}{j} p^{j+1} q^{(n-1)-j} = np \sum_{j=0}^{n-1} \binom{n-1}{j} p^j q^{(n-1)-j} = np \cdot 1 = np$$

where the last step uses the binomial theorem. $\square$

Theorem 4 Binomial Variance and MGF

If $X \sim \text{Binomial}(n, p)$, then:

  1. $\text{Var}(X) = np(1-p) = npq$
  2. $M_X(t) = (q + pe^t)^n$

Proof

Variance. Since $X = X_1 + \cdots + X_n$ with the $X_i$ independent, the variance is additive:

$$\text{Var}(X) = \text{Var}(X_1) + \cdots + \text{Var}(X_n) = n \cdot pq = npq$$

MGF. Again using the iid Bernoulli representation:

$$M_X(t) = E[e^{tX}] = E[e^{t(X_1 + \cdots + X_n)}] = \prod_{i=1}^n E[e^{tX_i}] = (q + pe^t)^n$$

where the product factorization uses independence, and each factor is the Bernoulli MGF. $\square$

Theorem 5 Binomial Reproductive Property

If $X \sim \text{Binomial}(n_1, p)$ and $Y \sim \text{Binomial}(n_2, p)$ are independent (same $p$), then

$$X + Y \sim \text{Binomial}(n_1 + n_2, p)$$

Proof

By the MGF uniqueness theorem from Topic 4:

$$M_{X+Y}(t) = M_X(t) \cdot M_Y(t) = (q + pe^t)^{n_1} \cdot (q + pe^t)^{n_2} = (q + pe^t)^{n_1 + n_2}$$

This is the MGF of $\text{Binomial}(n_1 + n_2, p)$, so by uniqueness, $X + Y \sim \text{Binomial}(n_1 + n_2, p)$. $\square$

Example 2 A/B testing: confidence interval for conversion rate

An e-commerce site runs an A/B test: 1200 users see design A, with 84 conversions. The sample proportion $\hat{p} = 84/1200 = 0.07$ estimates the Binomial parameter $p$. By the Central Limit Theorem, an approximate 95% confidence interval is

$$\hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} = 0.07 \pm 0.014$$

so we're 95% confident the true conversion rate is between 5.6% and 8.4%. The margin of error is driven by $\sqrt{\hat{p}(1-\hat{p})/n} = \sqrt{\text{Var}(\hat{p})}$ — the Binomial variance $npq$, divided by $n^2$, is doing all the work.
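A minimal sketch of the same calculation (plain numpy; the 1.96 normal quantile is hard-coded):

```python
import numpy as np

n, x = 1200, 84
p_hat = x / n                                  # sample proportion, the Binomial MLE
se = np.sqrt(p_hat * (1 - p_hat) / n)          # sqrt(Var(p_hat)) = sqrt(pq/n)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"p_hat = {p_hat:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")   # ~ (0.056, 0.084)
```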


5.4 The Geometric Distribution

The Binomial fixes $n$ (number of trials) and counts successes. The Geometric flips the question: we keep running independent Bernoulli trials until the first success. How long do we wait?

Definition 3 Geometric Distribution

A random variable $X$ has the Geometric distribution with parameter $p \in (0, 1)$, written $X \sim \text{Geometric}(p)$, if its PMF is

$$p_X(k) = p(1-p)^{k-1}, \quad k \in \{1, 2, 3, \ldots\}$$

Here $X$ counts the number of trials until (and including) the first success.

Remark Two conventions for the Geometric

Some textbooks define $X$ as the number of failures before the first success, giving $P(X = k) = p(1-p)^k$ for $k = 0, 1, 2, \ldots$ with $E[X] = (1-p)/p$. We use the trials-until-first-success convention (support starts at 1) because it parallels the Negative Binomial more naturally: NegBin counts trials until the $r$-th success, and Geometric is the $r = 1$ case. Always check which convention a text uses — the formulas for $E[X]$ and $\text{Var}(X)$ differ.

The PMF sums to 1 by the geometric series: $\sum_{k=1}^{\infty} p q^{k-1} = p \cdot \frac{1}{1 - q} = 1$, using the series formula from formalCalculus: Sequences & Limits.

Theorem 6 Geometric Expectation

If $X \sim \text{Geometric}(p)$, then $E[X] = 1/p$.

Proof

We need $\sum_{k=1}^{\infty} k p q^{k-1}$. Factor out $p$ and use the derivative of the geometric series. Since $\sum_{k=0}^{\infty} q^k = \frac{1}{1-q}$, differentiating both sides with respect to $q$:

$$\sum_{k=1}^{\infty} k q^{k-1} = \frac{1}{(1-q)^2} = \frac{1}{p^2}$$

Therefore:

$$E[X] = p \sum_{k=1}^{\infty} k q^{k-1} = p \cdot \frac{1}{p^2} = \frac{1}{p}$$

Sanity check: if $p = 1$ (certain success), $E[X] = 1$ — one trial suffices. If $p = 0.01$ (rare success), $E[X] = 100$ — we expect to wait 100 trials. $\square$

Theorem 7 Geometric Variance and MGF

If $X \sim \text{Geometric}(p)$, then:

  1. $\text{Var}(X) = \frac{1-p}{p^2} = \frac{q}{p^2}$
  2. $M_X(t) = \frac{pe^t}{1 - qe^t}$, defined for $t < -\ln(1-p)$

Proof

Variance. We need $E[X^2]$. Using LOTUS and the second derivative of the geometric series:

$$E[X^2] = \sum_{k=1}^{\infty} k^2 p q^{k-1}$$

Write $k^2 = k(k-1) + k$, so $E[X^2] = E[X(X-1)] + E[X]$. For $E[X(X-1)]$:

$$E[X(X-1)] = p \sum_{k=2}^{\infty} k(k-1) q^{k-1} = pq \sum_{k=2}^{\infty} k(k-1) q^{k-2}$$

The sum $\sum_{k=2}^{\infty} k(k-1) q^{k-2} = \frac{2}{(1-q)^3} = \frac{2}{p^3}$ (second derivative of the geometric series). So $E[X(X-1)] = pq \cdot \frac{2}{p^3} = \frac{2q}{p^2}$.

$$\text{Var}(X) = E[X^2] - (E[X])^2 = \left(\frac{2q}{p^2} + \frac{1}{p}\right) - \frac{1}{p^2} = \frac{2q + p - 1}{p^2} = \frac{q}{p^2}$$

MGF. By definition:

$$M_X(t) = \sum_{k=1}^{\infty} e^{tk} p q^{k-1} = pe^t \sum_{k=1}^{\infty} (qe^t)^{k-1} = \frac{pe^t}{1 - qe^t}$$

The geometric series converges when $|qe^t| < 1$, i.e., $t < \ln(1/q) = -\ln(1-p)$. $\square$

Now the most important structural property of the Geometric distribution — the one that makes it unique among discrete distributions.

Theorem 8 Memoryless Property of the Geometric Distribution

The Geometric distribution is the only discrete distribution with the memoryless property: for all $s, t \in \{0, 1, 2, \ldots\}$,

$$P(X > s + t \mid X > s) = P(X > t)$$

That is, given that we've already waited $s$ trials without success, the remaining waiting time has the same distribution as if we started fresh.

Proof

The Geometric is memoryless. The survival function is $P(X > n) = q^n$ (probability of $n$ consecutive failures). Then:

$$P(X > s + t \mid X > s) = \frac{P(X > s + t)}{P(X > s)} = \frac{q^{s+t}}{q^s} = q^t = P(X > t)$$

Uniqueness. Suppose $X$ takes values in $\{1, 2, \ldots\}$ and satisfies $P(X > s + t \mid X > s) = P(X > t)$ for all $s, t \geq 0$. Let $g(n) = P(X > n)$, so $g(0) = 1$. The memoryless property gives $g(s + t) = g(s) g(t)$. The only function satisfying this Cauchy functional equation on the non-negative integers with $g(0) = 1$ and $0 < g(1) < 1$ is $g(n) = g(1)^n$. Setting $q = g(1)$ gives $P(X > n) = q^n$, which is exactly the Geometric survival function with $p = 1 - q$. $\square$

Three-panel figure: Geometric PMF for various p, memoryless property P(X=k|X>5) overlay, and survival function on log scale showing linearity

The memoryless property means that Bernoulli trials have no “momentum” — the fact that you’ve failed 100 times doesn’t make the 101st trial any more likely to succeed. This is the mathematical content of “the coin has no memory.”

Interactive: Memoryless Property Explorer — the conditional PMF given $X > 5$, re-indexed to start at 1, has the same shape as the original PMF. For example, with $p = 0.25$: $P(X > 5 + t \mid X > 5) = P(X > t) = 0.75^t$ for $t = 1, \ldots, 5$ (0.75, 0.5625, 0.421875, 0.316406, 0.237305). The past doesn't help predict the future — the trials are independent.
Example 3 Expected trials until first click

A display ad has a click-through rate of $p = 0.02$. The number of impressions until the first click follows $X \sim \text{Geometric}(0.02)$. Expected impressions: $E[X] = 1/0.02 = 50$. Standard deviation: $\sigma = \sqrt{q/p^2} = \sqrt{0.98/0.0004} \approx 49.5$.

The memoryless property is practically important here: if an ad has been shown 100 times without a click, the expected number of additional impressions until first click is still 50 — the past doesn’t help predict the future, because each impression is an independent Bernoulli trial.
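A minimal numerical check of the memoryless property (a sketch using scipy.stats.geom, which follows the same trials-until-first-success convention as this section; $p = 0.25$ and $s = 5$ are illustrative):

```python
from scipy import stats

p, s = 0.25, 5
X = stats.geom(p)                         # support {1, 2, ...}, P(X > k) = (1-p)^k
for t in range(1, 6):
    conditional = X.sf(s + t) / X.sf(s)   # P(X > s+t | X > s)
    print(t, round(conditional, 6), round(X.sf(t), 6))   # both columns equal (1-p)^t
```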


5.5 The Negative Binomial Distribution

The Geometric waits for the first success. The Negative Binomial generalizes: wait for the $r$-th success.

Definition 4 Negative Binomial Distribution

A random variable $X$ has the Negative Binomial distribution with parameters $r \in \mathbb{N}$ and $p \in (0,1)$, written $X \sim \text{NegBin}(r, p)$, if its PMF is

$$p_X(k) = \binom{k-1}{r-1} p^r (1-p)^{k-r}, \quad k \in \{r, r+1, r+2, \ldots\}$$

Here $X$ counts the total number of trials until (and including) the $r$-th success.

The PMF logic: on trial $k$, the $r$-th success occurs. That means exactly $r - 1$ successes in the first $k - 1$ trials (there are $\binom{k-1}{r-1}$ ways to arrange these), each arrangement contributing $p^{r-1} q^{k-r}$, and then a final success on trial $k$ contributing $p$. Total: $\binom{k-1}{r-1} p^r q^{k-r}$.

Note: $\text{NegBin}(1, p) = \text{Geometric}(p)$. Setting $r = 1$ gives $\binom{k-1}{0} p q^{k-1} = p q^{k-1}$, which is exactly the Geometric PMF.

Theorem 9 Negative Binomial as Sum of Geometrics

If $Y_1, Y_2, \ldots, Y_r \overset{\text{iid}}{\sim} \text{Geometric}(p)$ and $X = Y_1 + Y_2 + \cdots + Y_r$, then $X \sim \text{NegBin}(r, p)$.

Proof

Think of $Y_i$ as the number of trials from the $(i-1)$-th success to the $i$-th success. By the memoryless property of the Geometric, after each success the process "resets" — the remaining waiting time is independent of the past. So $Y_1, \ldots, Y_r$ are independent, and $X = Y_1 + \cdots + Y_r$ is the total number of trials until the $r$-th success.

To verify the PMF, we can use MGFs. The MGF of $Y_i$ is $pe^t/(1 - qe^t)$, so by independence:

$$M_X(t) = \left(\frac{pe^t}{1 - qe^t}\right)^r$$

This uniquely determines the $\text{NegBin}(r, p)$ distribution. $\square$

Theorem 10 Negative Binomial Moments and MGF

If $X \sim \text{NegBin}(r, p)$, then:

  1. $E[X] = r/p$
  2. $\text{Var}(X) = rq/p^2$
  3. $M_X(t) = \left(\frac{pe^t}{1 - qe^t}\right)^r$, defined for $t < -\ln(1-p)$

Proof

Write $X = Y_1 + \cdots + Y_r$ with $Y_i \overset{\text{iid}}{\sim} \text{Geometric}(p)$. By linearity:

$$E[X] = r \cdot E[Y_1] = r \cdot \frac{1}{p} = \frac{r}{p}$$

By independence:

$$\text{Var}(X) = r \cdot \text{Var}(Y_1) = r \cdot \frac{q}{p^2} = \frac{rq}{p^2}$$

The MGF follows from the product of $r$ independent Geometric MGFs, as shown in Theorem 9. $\square$

Three-panel figure: NegBin PMF varying r, NegBin PMF varying p, and Geometric = NegBin(r=1) overlap

A key feature of the Negative Binomial emerges in the failures-counting convention used for count modeling: the number of failures $Y = X - r$ has mean $\mu = rq/p$ and variance $rq/p^2 = \mu + \mu^2/r$, so the variance always exceeds the mean when $q > 0$. This overdispersion (variance exceeding the mean) makes the Negative Binomial the go-to alternative when the Poisson is too restrictive.

Example 4 RNA-seq count modeling with overdispersion

In RNA-seq experiments, gene expression is measured as read counts. If counts followed a Poisson model, $\text{Var}(Y) = E[Y]$. But biological replicates consistently show $\text{Var}(Y) > E[Y]$ — overdispersion due to biological variability between samples. Tools like DESeq2 model counts as $Y \sim \text{NegBin}(\mu, r)$, where the dispersion parameter $r$ captures the excess variability. As $r \to \infty$, $\text{NegBin} \to \text{Poisson}$, so the Poisson model is a special case — and a testable hypothesis.
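A simulation sketch of Theorem 9 and the overdispersion property (numpy only; $r = 5$ and $p = 0.3$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
r, p = 5, 0.3

# NegBin(r, p) in the trials-until-r-th-success convention, built as a sum of
# r iid Geometric(p) waiting times (Theorem 9). numpy's geometric() uses the
# same trials convention, with support {1, 2, ...}.
draws = rng.geometric(p, size=(100_000, r)).sum(axis=1)

print("mean:", draws.mean(), " theory:", r / p)               # r/p ~ 16.67
print("var :", draws.var(),  " theory:", r * (1 - p) / p**2)  # rq/p^2 ~ 38.89 > mean
```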


5.6 The Poisson Distribution

The Poisson distribution arises in a completely different way from the Bernoulli family. Instead of counting successes in a fixed number of trials, it counts events occurring at a constant average rate in a continuous interval — photons hitting a detector, customers arriving at a store, typos per page, mutations per genome.

Definition 5 Poisson Distribution

A random variable $X$ has the Poisson distribution with parameter $\lambda > 0$, written $X \sim \text{Poisson}(\lambda)$, if its PMF is

$$p_X(k) = \frac{e^{-\lambda} \lambda^k}{k!}, \quad k \in \{0, 1, 2, \ldots\}$$

The parameter $\lambda$ is both the mean and the variance (equidispersion).

The PMF sums to 1 because $\sum_{k=0}^{\infty} \lambda^k / k! = e^{\lambda}$ (the exponential power series from formalCalculus: Taylor Series), so $\sum_{k=0}^{\infty} e^{-\lambda} \lambda^k / k! = e^{-\lambda} \cdot e^{\lambda} = 1$.

But where does this PMF come from? The answer is the Poisson limit theorem — arguably the most beautiful limit in discrete probability.

Theorem 11 Poisson Limit Theorem

If $X_n \sim \text{Binomial}(n, \lambda/n)$ with $\lambda > 0$ fixed, then for each $k \in \{0, 1, 2, \ldots\}$:

$$\lim_{n \to \infty} P(X_n = k) = \frac{e^{-\lambda} \lambda^k}{k!}$$

That is, $\text{Binomial}(n, \lambda/n) \xrightarrow{n \to \infty} \text{Poisson}(\lambda)$.

Proof

Fix $k$ and compute $P(X_n = k)$ with $p = \lambda/n$:

$$P(X_n = k) = \binom{n}{k} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n-k}$$

Rewrite the binomial coefficient:

$$= \frac{\lambda^k}{k!} \cdot \frac{n(n-1) \cdots (n-k+1)}{n^k} \cdot \left(1 - \frac{\lambda}{n}\right)^n \cdot \left(1 - \frac{\lambda}{n}\right)^{-k}$$

Now take $n \to \infty$ with $k$ and $\lambda$ fixed. The three factors converge:

$$\frac{n(n-1) \cdots (n-k+1)}{n^k} = 1 \cdot \left(1 - \frac{1}{n}\right) \cdots \left(1 - \frac{k-1}{n}\right) \to 1$$

$$\left(1 - \frac{\lambda}{n}\right)^n \to e^{-\lambda}$$

$$\left(1 - \frac{\lambda}{n}\right)^{-k} \to 1$$

The limit $(1 - \lambda/n)^n \to e^{-\lambda}$ uses the characterization of $e$ from formalCalculus: Sequences & Limits. Combining:

$$\lim_{n \to \infty} P(X_n = k) = \frac{\lambda^k}{k!} \cdot 1 \cdot e^{-\lambda} \cdot 1 = \frac{e^{-\lambda} \lambda^k}{k!}$$

which is the Poisson PMF. $\square$

Three-panel figure: Poisson PMFs for λ=1,3,7,15; Binomial(n,5/n)→Poisson(5) convergence; mean-variance equidispersion plot
Interactive: Poisson Limit Theorem — overlay of the $\text{Binomial}(n, \lambda/n)$ and $\text{Poisson}(\lambda)$ PMFs. For example, with $\lambda = 5$ and $n = 20$ (so $p = 0.25$): both have mean 5, but the Binomial variance is 3.75 versus the Poisson's 5, and the total-variation distance is about 0.0714.
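A quick numerical version of the same convergence (a sketch using scipy.stats; the grid cutoff at $k = 60$ is an assumption that truncates negligible tail mass):

```python
import numpy as np
from scipy import stats

lam = 5.0
ks = np.arange(60)                         # tail mass beyond k = 60 is negligible here
pois = stats.poisson.pmf(ks, lam)
for n in (10, 100, 1000, 10000):
    binom = stats.binom.pmf(ks, n, lam / n)
    tv = 0.5 * np.abs(binom - pois).sum()  # total-variation distance
    print(f"n = {n:>5}: TV = {tv:.5f}")    # shrinks toward 0 as n grows
```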
Theorem 12 Poisson Equidispersion

If $X \sim \text{Poisson}(\lambda)$, then $E[X] = \text{Var}(X) = \lambda$.

Proof

Expectation. By the definition of expectation (the $k = 0$ term vanishes):

$$E[X] = \sum_{k=0}^{\infty} k \cdot \frac{e^{-\lambda} \lambda^k}{k!} = \sum_{k=1}^{\infty} \frac{e^{-\lambda} \lambda^k}{(k-1)!}$$

Substituting $j = k - 1$:

$$= e^{-\lambda} \lambda \sum_{j=0}^{\infty} \frac{\lambda^j}{j!} = e^{-\lambda} \lambda \cdot e^{\lambda} = \lambda$$

Variance. Compute $E[X(X-1)]$ first:

$$E[X(X-1)] = \sum_{k=2}^{\infty} k(k-1) \frac{e^{-\lambda} \lambda^k}{k!} = e^{-\lambda} \lambda^2 \sum_{j=0}^{\infty} \frac{\lambda^j}{j!} = \lambda^2$$

So $E[X^2] = E[X(X-1)] + E[X] = \lambda^2 + \lambda$, and:

$$\text{Var}(X) = E[X^2] - (E[X])^2 = \lambda^2 + \lambda - \lambda^2 = \lambda$$

The mean equals the variance — this is the equidispersion property. $\square$

Theorem 13 Poisson MGF

If $X \sim \text{Poisson}(\lambda)$, then $M_X(t) = e^{\lambda(e^t - 1)}$, defined for all $t \in \mathbb{R}$.

Proof

$$M_X(t) = E[e^{tX}] = \sum_{k=0}^{\infty} e^{tk} \frac{e^{-\lambda} \lambda^k}{k!} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{(\lambda e^t)^k}{k!} = e^{-\lambda} \cdot e^{\lambda e^t} = e^{\lambda(e^t - 1)}$$

where we recognized the sum as the Taylor series of $e^{\lambda e^t}$. $\square$

Theorem 14 Poisson Reproductive Property

If $X \sim \text{Poisson}(\lambda_1)$ and $Y \sim \text{Poisson}(\lambda_2)$ are independent, then

$$X + Y \sim \text{Poisson}(\lambda_1 + \lambda_2)$$

Proof

By MGF uniqueness:

$$M_{X+Y}(t) = M_X(t) \cdot M_Y(t) = e^{\lambda_1(e^t - 1)} \cdot e^{\lambda_2(e^t - 1)} = e^{(\lambda_1 + \lambda_2)(e^t - 1)}$$

This is the $\text{Poisson}(\lambda_1 + \lambda_2)$ MGF. $\square$

Example 5 Website event counting

A website receives an average of $\lambda = 12$ clicks per minute. Model the count in any minute as $X \sim \text{Poisson}(12)$. Then $P(X = 0) = e^{-12} \approx 6 \times 10^{-6}$ — essentially no chance of a minute with zero clicks. The probability of an unusually high burst, $P(X \geq 20)$, can be computed by summing the PMF tail.

By the reproductive property, the count in a 5-minute window follows $\text{Poisson}(60)$. The equidispersion property ($\text{Var} = \text{mean}$) gives $\sigma = \sqrt{60} \approx 7.7$, so a count of 75 or more ($\approx 2\sigma$ above the mean) would be suspicious — a signal worth investigating for bot traffic.
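The tail sums in this example are one-liners with scipy.stats (a sketch; `sf` is the survival function $P(X > k)$):

```python
from scipy import stats

lam = 12
print(stats.poisson.pmf(0, lam))       # P(X = 0) = e^{-12}, about 6.1e-06
print(stats.poisson.sf(19, lam))       # P(X >= 20), the burst probability
print(stats.poisson.sf(74, 5 * lam))   # P(5-minute count >= 75) under Poisson(60)
```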


5.7 The Hypergeometric Distribution

Every distribution so far assumes either independent trials (Bernoulli, Binomial, Geometric, NegBin) or a rate process (Poisson). The Hypergeometric breaks the independence assumption: it models sampling without replacement from a finite population.

Definition 6 Hypergeometric Distribution

A random variable $X$ has the Hypergeometric distribution with parameters $N$ (population size), $K$ (number of success states), and $n$ (number of draws), written $X \sim \text{Hypergeometric}(N, K, n)$, if its PMF is

$$p_X(k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}, \quad k \in \{\max(0, n-N+K), \ldots, \min(n, K)\}$$

The PMF counts combinatorially: choose $k$ successes from the $K$ available ($\binom{K}{k}$ ways), choose $n - k$ failures from the $N - K$ available ($\binom{N-K}{n-k}$ ways), and divide by the total number of ways to choose $n$ items from $N$ ($\binom{N}{n}$). The support bounds ensure non-negative binomial coefficients.

Theorem 15 Hypergeometric Expectation

If $X \sim \text{Hypergeometric}(N, K, n)$, then $E[X] = nK/N$.

Proof

Define indicator variables $X_i = 1$ if the $i$-th draw is a success. Then $X = X_1 + \cdots + X_n$. By symmetry, $P(X_i = 1) = K/N$ for every $i$ — each draw is equally likely to pick any of the $N$ items, regardless of order. By linearity:

$$E[X] = \sum_{i=1}^n E[X_i] = n \cdot \frac{K}{N}$$

Note this is the same as $\text{Binomial}(n, K/N)$ — the mean doesn't know whether we're sampling with or without replacement. $\square$

Theorem 16 Hypergeometric Variance and Finite Population Correction

If $X \sim \text{Hypergeometric}(N, K, n)$, then

$$\text{Var}(X) = n \cdot \frac{K}{N} \cdot \frac{N-K}{N} \cdot \frac{N-n}{N-1}$$

The factor $\frac{N-n}{N-1}$ is the finite population correction (FPC).

Proof

Using the indicator decomposition $X = X_1 + \cdots + X_n$:

$$\text{Var}(X) = \sum_{i=1}^n \text{Var}(X_i) + 2 \sum_{i < j} \text{Cov}(X_i, X_j)$$

Each $X_i$ is Bernoulli with $p = K/N$, so $\text{Var}(X_i) = \frac{K}{N}\left(1 - \frac{K}{N}\right)$.

For the covariance, $P(X_i = 1, X_j = 1) = \frac{K}{N} \cdot \frac{K-1}{N-1}$, so:

$$\text{Cov}(X_i, X_j) = \frac{K(K-1)}{N(N-1)} - \left(\frac{K}{N}\right)^2 = \frac{K}{N} \cdot \frac{N-K}{N} \cdot \frac{-1}{N-1}$$

Combining $n$ variance terms and $\binom{n}{2}$ covariance terms yields the result. The FPC factor $(N-n)/(N-1)$ reduces the variance compared to the Binomial: sampling without replacement reduces uncertainty because each draw provides more information. $\square$

Remark The 5% rule for finite population correction

When $n/N < 0.05$ (sampling less than 5% of the population), the FPC is $(N-n)/(N-1) > 0.95$, and the Binomial approximation $\text{Var}(X) \approx npq$ is within 5% of the truth. This is why opinion polls of 1000 people can represent millions: $n/N$ is tiny, so whether we sample with or without replacement barely matters.

Theorem 17 Hypergeometric Converges to Binomial

If $N \to \infty$ with $K/N = p$ fixed and $n$ fixed, then

$$\text{Hypergeometric}(N, Np, n) \to \text{Binomial}(n, p)$$

in the sense that the PMF converges pointwise.

Proof

Fix $k$ and compute:

$$\frac{\binom{Np}{k}\binom{N - Np}{n-k}}{\binom{N}{n}} = \frac{(Np)! \, (N(1-p))! \, n! \, (N-n)!}{k! \, (Np - k)! \, (n-k)! \, (N(1-p) - n + k)! \, N!}$$

As $N \to \infty$, expanding the binomial coefficients into falling factorials, the ratio of success terms converges:

$$\frac{(Np)(Np-1)\cdots(Np-k+1)}{N(N-1)\cdots(N-k+1)} \to p^k$$

and similarly the failure terms converge to $(1-p)^{n-k}$, giving $\binom{n}{k} p^k (1-p)^{n-k}$ in the limit. $\square$

Three-panel figure: Hypergeometric PMFs, convergence to Binomial as N grows, and FPC curve with 5% rule annotation
Interactive: With vs. Without Replacement — side-by-side PMFs of $\text{Binomial}(n = 10, p = 0.4)$ and $\text{Hypergeometric}(N = 50, K = 20, n = 10)$. Both have mean 4, but the variances are 2.4 versus $\approx 1.9592$: with $n/N = 0.2$, the FPC is $(N-n)/(N-1) = 40/49 \approx 0.8163$, exactly the variance ratio.
Example 6 Fisher's exact test for drug trials

A clinical trial tests whether a drug reduces infection. Out of $N = 20$ patients, $K = 8$ recover. If the drug were irrelevant, the number of recoveries among the $n = 10$ treated patients would follow $X \sim \text{Hypergeometric}(20, 8, 10)$. If 7 of the 10 treated patients recovered, the p-value is $P(X \geq 7)$ under this null — computed exactly from the Hypergeometric PMF, with no normal approximation needed. This is Fisher's exact test: the gold standard for small-sample $2 \times 2$ contingency tables, and a standard tool in gene set enrichment analysis (GSEA).
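The p-value is a single survival-function call (a sketch with scipy.stats; note scipy's argument order is hypergeom(M, n, N) = (population size, success states, draws)):

```python
from scipy import stats

N, K, n = 20, 8, 10
p_value = stats.hypergeom.sf(6, N, K, n)   # P(X > 6) = P(X >= 7)
print(p_value)                             # about 0.0099
```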


5.8 The Discrete Uniform Distribution

The simplest discrete distribution: every outcome in a finite set is equally likely.

Definition 7 Discrete Uniform Distribution

A random variable $X$ has the Discrete Uniform distribution on $\{a, a+1, \ldots, b\}$, written $X \sim \text{DiscreteUniform}(a, b)$, if its PMF is

$$p_X(k) = \frac{1}{b - a + 1} = \frac{1}{n}, \quad k \in \{a, a+1, \ldots, b\}$$

where $n = b - a + 1$ is the number of outcomes.

Theorem 18 Discrete Uniform Moments

If $X \sim \text{DiscreteUniform}(a, b)$ with $n = b - a + 1$, then:

  1. $E[X] = \frac{a + b}{2}$ (the midpoint)
  2. $\text{Var}(X) = \frac{n^2 - 1}{12}$

Proof

Expectation. By symmetry of the uniform distribution around its midpoint:

$$E[X] = \frac{1}{n} \sum_{k=a}^{b} k = \frac{1}{n} \cdot \frac{n(a + b)}{2} = \frac{a + b}{2}$$

using the arithmetic series formula.

Variance. Shift to $Y = X - a \sim \text{DiscreteUniform}(0, n-1)$, so $E[Y] = (n-1)/2$. Then:

$$E[Y^2] = \frac{1}{n} \sum_{j=0}^{n-1} j^2 = \frac{(n-1)(2n-1)}{6}$$

using the sum-of-squares formula. Therefore:

$$\text{Var}(Y) = \frac{(n-1)(2n-1)}{6} - \left(\frac{n-1}{2}\right)^2 = \frac{(n-1)[2(2n-1) - 3(n-1)]}{12} = \frac{(n-1)(n+1)}{12} = \frac{n^2 - 1}{12}$$

Since $\text{Var}(X) = \text{Var}(Y)$ (variance is shift-invariant), the result holds for any $a, b$. For a fair die ($a = 1$, $b = 6$, $n = 6$): $E[X] = 3.5$, $\text{Var}(X) = 35/12 \approx 2.917$. $\square$
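A one-line sanity check of the fair-die numbers (a sketch with scipy.stats; note randint(low, high) excludes high, so a die is randint(1, 7)):

```python
from scipy import stats

die = stats.randint(1, 7)        # DiscreteUniform(1, 6)
print(die.mean(), die.var())     # 3.5 and 35/12, about 2.9167
```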

Three-panel figure: Uniform PMFs for three supports, step CDF, and entropy H=log(n) with landmarks
Remark Maximum entropy and non-informative priors

Among all distributions on a finite set $\{a, \ldots, b\}$, the Discrete Uniform maximizes entropy: $H(X) = \log n$. Entropy $H(X) = -\sum p(k) \log p(k)$ measures uncertainty, and the uniform distribution is the one that assumes the least — no value is favored over any other. This makes it the canonical non-informative prior in Bayesian statistics when all you know is the support. It's also the foundation for random initialization in ML: shuffling training data, random feature selection in bagging, and hash-based tricks all use discrete uniform randomness.


5.9 Probability-Generating Functions

Topic 4 introduced the moment-generating function $M_X(t) = E[e^{tX}]$ as a tool for packaging moments and proving distributional results. For discrete random variables taking non-negative integer values, there's an even more natural generating function: the probability-generating function.

Definition 8 Probability-Generating Function

Let $X$ be a non-negative integer-valued random variable with PMF $p_X$. The probability-generating function (PGF) of $X$ is

$$G_X(s) = E[s^X] = \sum_{k=0}^{\infty} p_X(k) \, s^k, \quad |s| \leq 1$$

The coefficients of the power series are the probabilities: $p_X(k) = G_X^{(k)}(0) / k!$.

The PGF is related to the MGF by $G_X(s) = M_X(\ln s)$ and $M_X(t) = G_X(e^t)$. But the PGF has a key advantage for discrete variables: its coefficients are the probabilities. It's a power series whose coefficient of $s^k$ is $P(X = k)$.

Theorem 19 Moments from the PGF

If $G_X(s)$ is the PGF of $X$, then:

  1. $G_X'(1) = E[X]$
  2. $G_X^{(r)}(1) = E[X(X-1) \cdots (X-r+1)]$ (the $r$-th factorial moment)
  3. $\text{Var}(X) = G_X''(1) + G_X'(1) - [G_X'(1)]^2$

Proof

Differentiate $G_X(s) = \sum_k p_X(k) s^k$ term by term:

$$G_X'(s) = \sum_{k=1}^{\infty} k \, p_X(k) \, s^{k-1}$$

Evaluating at $s = 1$: $G_X'(1) = \sum_k k \, p_X(k) = E[X]$.

For the second derivative:

$$G_X''(s) = \sum_{k=2}^{\infty} k(k-1) \, p_X(k) \, s^{k-2}$$

So $G_X''(1) = E[X(X-1)]$. Since $\text{Var}(X) = E[X^2] - (E[X])^2 = E[X(X-1)] + E[X] - (E[X])^2$:

$$\text{Var}(X) = G_X''(1) + G_X'(1) - [G_X'(1)]^2$$

The general $r$-th derivative follows by induction. $\square$
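Here is a symbolic sketch of the theorem for the Poisson PGF $G(s) = e^{\lambda(s-1)}$ (using sympy; the same pattern works for any PGF in the table below):

```python
import sympy as sp

s, lam = sp.symbols("s lam", positive=True)
G = sp.exp(lam * (s - 1))              # Poisson PGF

EX   = sp.diff(G, s).subs(s, 1)        # G'(1)  = E[X]        -> lam
fac2 = sp.diff(G, s, 2).subs(s, 1)     # G''(1) = E[X(X-1)]   -> lam**2
var  = sp.simplify(fac2 + EX - EX**2)  # G''(1) + G'(1) - G'(1)^2 -> lam
print(EX, fac2, var)                   # lam  lam**2  lam — equidispersion recovered
```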

Here are the PGFs for six of the seven distributions (the Hypergeometric PGF has no elementary closed form):

| Distribution | PGF $G(s)$ |
|---|---|
| $\text{Bernoulli}(p)$ | $q + ps$ |
| $\text{Binomial}(n, p)$ | $(q + ps)^n$ |
| $\text{Geometric}(p)$ | $ps/(1 - qs)$ |
| $\text{NegBin}(r, p)$ | $(ps/(1-qs))^r$ |
| $\text{Poisson}(\lambda)$ | $e^{\lambda(s-1)}$ |
| $\text{DiscreteUniform}(0, n-1)$ | $(1 - s^n)/(n(1 - s))$ |
Three-panel figure: PGF curves for four distributions, moment extraction via derivatives (Poisson demo), and compound distribution simulation
Theorem 20 PGF Sum Property

If $X$ and $Y$ are independent non-negative integer-valued random variables, then

$$G_{X+Y}(s) = G_X(s) \cdot G_Y(s)$$

Proof

$G_{X+Y}(s) = E[s^{X+Y}] = E[s^X s^Y] = E[s^X] \cdot E[s^Y] = G_X(s) \cdot G_Y(s)$, where the third equality uses independence. $\square$

The real payoff of PGFs is the compound distribution formula — the reason PGFs exist as a separate tool rather than just being a notational variant of MGFs.

Theorem 21 Compound Distribution Formula

Let $N$ be a non-negative integer-valued random variable, and let $X_1, X_2, \ldots$ be iid non-negative integer-valued random variables, independent of $N$. Define the random sum $S = X_1 + X_2 + \cdots + X_N$ (with $S = 0$ when $N = 0$). Then

$$G_S(s) = G_N(G_X(s))$$

Proof

Condition on $N$:

$$G_S(s) = E[s^S] = \sum_{n=0}^{\infty} E[s^S \mid N = n] \cdot P(N = n) = \sum_{n=0}^{\infty} E[s^{X_1 + \cdots + X_n}] \cdot P(N = n)$$

Since $X_1, \ldots, X_n$ are iid and independent of $N$:

$$= \sum_{n=0}^{\infty} [G_X(s)]^n \cdot P(N = n) = E[(G_X(s))^N] = G_N(G_X(s))$$

The last step recognizes $\sum_n [G_X(s)]^n P(N = n)$ as $E[z^N]$ evaluated at $z = G_X(s)$, which is $G_N(G_X(s))$. $\square$

Interactive: Probability-Generating Functions — the curve $G(s) = E[s^X]$ on $[0, 1]$, with slope at $s = 1$ equal to $E[X]$. For example, for $\text{Bernoulli}(0.4)$: $G'(1) = E[X] = 0.4$, $G''(1) = E[X(X-1)] = 0$, and $\text{Var}(X) = G''(1) + G'(1) - [G'(1)]^2 = 0.24$.
Example 7 Poisson thinning via compound distributions

A Poisson process produces $N \sim \text{Poisson}(\lambda)$ events per unit time. Each event is independently "kept" with probability $p$ (and discarded with probability $1-p$). The number of retained events is $S = X_1 + \cdots + X_N$ where $X_i \sim \text{Bernoulli}(p)$.

Applying the compound formula:

$$G_S(s) = G_N(G_X(s)) = e^{\lambda(G_X(s) - 1)} = e^{\lambda((q + ps) - 1)} = e^{\lambda p(s - 1)}$$

This is the PGF of $\text{Poisson}(\lambda p)$! Thinning a Poisson process by probability $p$ gives another Poisson process with rate $\lambda p$. This result is used in practice for network traffic modeling, rare event simulation, and the "thinning" algorithm for simulating non-homogeneous Poisson processes.
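A simulation sketch of the thinning result (numpy only; $\lambda = 10$ and $p = 0.3$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
lam, p = 10.0, 0.3

N = rng.poisson(lam, size=200_000)   # event counts per unit time
S = rng.binomial(N, p)               # keep each of the N events with probability p

# Mean and variance are both close to lam * p = 3, consistent with S ~ Poisson(lam * p).
print(S.mean(), S.var())
```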


5.10 Relationships Between Distributions

The seven distributions are not isolated — they form a web of connections through special cases, sums, and limiting relationships. The diagram below captures these connections.

Relationship web diagram: nodes for all 7 distributions, arrows for sums, limits, and special cases, with exponential family boundary

Special cases:

  • $\text{Bernoulli}(p) = \text{Binomial}(1, p)$
  • $\text{Geometric}(p) = \text{NegBin}(1, p)$

Sum relationships (independent, same $p$):

  • $\sum_{i=1}^n \text{Bernoulli}(p) \sim \text{Binomial}(n, p)$
  • $\sum_{i=1}^r \text{Geometric}(p) \sim \text{NegBin}(r, p)$

Reproductive properties (independent, same $p$ or $\lambda$):

  • $\text{Binomial}(n_1, p) + \text{Binomial}(n_2, p) \sim \text{Binomial}(n_1 + n_2, p)$
  • $\text{Poisson}(\lambda_1) + \text{Poisson}(\lambda_2) \sim \text{Poisson}(\lambda_1 + \lambda_2)$
  • $\text{NegBin}(r_1, p) + \text{NegBin}(r_2, p) \sim \text{NegBin}(r_1 + r_2, p)$

Limit relationships:

  • $\text{Binomial}(n, \lambda/n) \to \text{Poisson}(\lambda)$ as $n \to \infty$ (Poisson limit theorem)
  • $\text{Hypergeometric}(N, Np, n) \to \text{Binomial}(n, p)$ as $N \to \infty$ (infinite population limit)
Remark Five of seven are exponential family members

Bernoulli, Binomial, Geometric, Negative Binomial, and Poisson all belong to the exponential family — distributions whose PMFs can be written as $p(k \mid \theta) = h(k) \exp(\eta(\theta) \cdot T(k) - A(\theta))$. The Hypergeometric does not belong because its support $\{\max(0, n-N+K), \ldots, \min(n, K)\}$ depends on the parameters — violating the requirement that the support is fixed. The Discrete Uniform does not belong for the same reason: its support $\{a, \ldots, b\}$ depends on the parameters $a$ and $b$. Exponential Families makes this precise and shows why exponential family membership matters for estimation, testing, and GLMs.


5.11 Connections to ML

Every distribution in this topic appears in machine learning — not as an abstract exercise, but as a modeling choice that shapes loss functions, estimators, and inference procedures.

Three-panel figure: logistic regression sigmoid with Bernoulli data, A/B test sampling distribution, and Poisson vs NegBin overdispersion fit

Binary classification (Bernoulli): Logistic regression models $Y \mid X \sim \text{Bernoulli}(\sigma(\beta^T X))$. Cross-entropy loss = negative Bernoulli log-likelihood. The logit link $\eta = \ln(p/(1-p))$ comes directly from the exponential family natural parameter (see formalML: Logistic Regression).

A/B testing (Binomial): The sample proportion $\hat{p} = X/n$ is the MLE of the Binomial parameter. Confidence intervals use $\text{Var}(\hat{p}) = pq/n$. Power calculations for determining sample sizes reduce to Binomial tail probabilities.

Count regression (Poisson, NegBin): Poisson regression models $Y \mid X \sim \text{Poisson}(\exp(\beta^T X))$ via the log link. When data are overdispersed ($\text{Var}(Y) > E[Y]$), Negative Binomial regression provides the fix. The variance function $\text{Var}(Y) = \mu + \mu^2/r$ interpolates between Poisson ($r \to \infty$) and pure quadratic variance ($r$ finite) — see Topic 22 §22.5 (Poisson regression) for the worked Poisson treatment and §22.9 Rem 20 for the Negative Binomial overdispersion fix.

Sampling and exact tests (Hypergeometric): Fisher's exact test uses the Hypergeometric to compute exact p-values for $2 \times 2$ tables. Gene set enrichment analysis (GSEA) asks whether a gene set is overrepresented in a ranked list — a Hypergeometric calculation.

Random initialization (Discrete Uniform): Shuffling training data, selecting features in random forests, hash-based tricks for dimensionality reduction — all use discrete uniform randomness as a foundation (see formalML: Naive Bayes).

Example 8 Distribution choice flowchart for count data

When modeling count data in ML, the choice of distribution matters:

  1. Binary outcome (0 or 1)? → Bernoulli / Logistic regression
  2. Fixed number of trials ($n$ known), independent, same $p$? → Binomial
  3. Event counts in a continuous interval, no upper bound? → Poisson (check equidispersion: if $\text{Var} \gg \text{Mean}$, use Negative Binomial — see the dispersion check sketched after this example)
  4. Waiting time until the $r$-th success? → Negative Binomial (Geometric if $r = 1$)
  5. Sampling without replacement from a finite population? → Hypergeometric
  6. No prior information, finite outcomes? → Discrete Uniform

The choice determines the loss function, the link function, and the variance structure of your model. Getting it wrong — using Poisson when data are overdispersed, for instance — leads to overconfident inference and unreliable p-values.
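A minimal sketch of the equidispersion check from step 3 (the dispersion_ratio helper and the simulated data are illustrative, not a formal hypothesis test):

```python
import numpy as np

def dispersion_ratio(counts):
    """Sample variance over sample mean: near 1 is Poisson-like, well above 1 is overdispersed."""
    counts = np.asarray(counts, dtype=float)
    return counts.var(ddof=1) / counts.mean()

rng = np.random.default_rng(2)
print(dispersion_ratio(rng.poisson(4.0, 5000)))               # close to 1: Poisson plausible
print(dispersion_ratio(rng.negative_binomial(5, 0.3, 5000)))  # well above 1: prefer NegBin
```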


Summary

We’ve cataloged seven discrete distributions, each arising from a distinct probabilistic mechanism. Here’s the reference table:

| Distribution | PMF | $E[X]$ | $\text{Var}(X)$ | MGF $M_X(t)$ | Exp. family? |
|---|---|---|---|---|---|
| $\text{Bern}(p)$ | $p^k q^{1-k}$ | $p$ | $pq$ | $q + pe^t$ | Yes |
| $\text{Binom}(n,p)$ | $\binom{n}{k}p^k q^{n-k}$ | $np$ | $npq$ | $(q+pe^t)^n$ | Yes |
| $\text{Geom}(p)$ | $pq^{k-1}$ | $1/p$ | $q/p^2$ | $pe^t/(1-qe^t)$ | Yes |
| $\text{NB}(r,p)$ | $\binom{k-1}{r-1}p^r q^{k-r}$ | $r/p$ | $rq/p^2$ | $(pe^t/(1-qe^t))^r$ | Yes |
| $\text{Pois}(\lambda)$ | $e^{-\lambda}\lambda^k/k!$ | $\lambda$ | $\lambda$ | $e^{\lambda(e^t-1)}$ | Yes |
| $\text{HGeom}(N,K,n)$ | $\frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$ | $nK/N$ | $npq \cdot \text{FPC}$ | (no closed form) | No |
| $\text{DUnif}(a,b)$ | $1/n$ | $(a+b)/2$ | $(n^2-1)/12$ | $\frac{e^{ta}(1-e^{tn})}{n(1-e^t)}$ | No |

Key structural insights:

  • The Bernoulli is the atom from which Binomial and Geometric are built
  • The Poisson arises as the limit of the Binomial for rare events
  • The Hypergeometric reduces to the Binomial when the population is large
  • Five of seven distributions belong to the exponential family — the unifying framework of Exponential Families
  • PGFs provide a dedicated tool for discrete distributions, with the compound distribution formula as its signature application

What comes next. This topic cataloged the discrete distributions. The parallel treatment continues:

  • Continuous Distributions applies the same systematic approach to Normal, Exponential, Gamma, Beta, and Uniform — the continuous counterparts
  • Exponential Families unifies the five exponential family members here with their continuous counterparts, identifying natural parameters, sufficient statistics, and log-partition functions
  • Modes of Convergence makes the Poisson limit theorem rigorous using convergence in distribution, and previews the weak law of large numbers via Chebyshev
