Bernoulli, Binomial, Geometric, Negative Binomial, Poisson, Hypergeometric, and Discrete Uniform — the named PMFs that appear everywhere in statistical modeling and machine learning, each derived from a distinct probabilistic mechanism.
In Topic 3, we built the general framework: random variables as measurable functions, PMFs as probability assignments over countable supports, CDFs as cumulative summaries, and measurability as the bridge from probability spaces to number lines. In Topic 4, we built the tools: expectation as center of mass, variance as spread, covariance as linear association, and moment-generating functions as the engine for proving distributional results.
Those topics answered the question "what is a distribution?" This topic answers a different question: which distributions matter, and why?
The answer is seven specific PMFs, each arising from a distinct probabilistic mechanism. The Bernoulli models a single yes/no trial. The Binomial counts successes across independent trials. The Geometric waits for the first success. The Negative Binomial waits for the r-th. The Poisson counts rare events at a constant rate. The Hypergeometric counts successes when sampling without replacement. And the Discrete Uniform assigns equal probability to every outcome in a finite set.
Remark A catalog, not a narrative
Topics 1–4 followed a linear arc — each concept building toward the next. This topic is structurally different: it’s a parallel catalog. Each distribution gets the same systematic treatment (PMF → E[X] proof → Var(X) proof → MGF → key properties → ML connection), and the value comes from seeing how the same tools from Topics 3–4 produce different results when applied to different mechanisms. The repetition is the point — it’s how you internalize the tools.
The table below summarizes what we’re about to build. For each distribution, the “mechanism” column describes the experiment that produces it, the “parameters” column identifies what you need to specify, and the “support” column lists which values the random variable can take.
| Distribution | Mechanism | Parameters | Support |
| --- | --- | --- | --- |
| Bernoulli | Single binary trial | $p \in (0,1)$ | $\{0, 1\}$ |
| Binomial | n independent trials, count successes | $n \in \mathbb{N}$, $p \in (0,1)$ | $\{0, 1, \dots, n\}$ |
| Geometric | Trials until first success | $p \in (0,1)$ | $\{1, 2, 3, \dots\}$ |
| Negative Binomial | Trials until r-th success | $r \in \mathbb{N}$, $p \in (0,1)$ | $\{r, r+1, r+2, \dots\}$ |
| Poisson | Count of rare events at rate λ | $\lambda > 0$ | $\{0, 1, 2, \dots\}$ |
| Hypergeometric | Sample without replacement | $N, K, n \in \mathbb{N}$ | $\{\max(0, n-N+K), \dots, \min(n, K)\}$ |
| Discrete Uniform | Equal probability on a finite set | $a, b \in \mathbb{Z}$ | $\{a, a+1, \dots, b\}$ |
Five of these seven distributions — Bernoulli, Binomial, Geometric, Negative Binomial, and Poisson — belong to the exponential family. We’ll flag this for each one as we go, and Exponential Families unifies the pattern.
5.2 The Bernoulli Distribution
The simplest possible random experiment: a single trial with two outcomes. Flip a coin, check if a user clicks, test whether a component passes quality control. The outcome is 1 (success) with probability p and 0 (failure) with probability 1−p.
Definition 1 Bernoulli Distribution
A random variable X has the Bernoulli distribution with parameter p∈(0,1), written X∼Bernoulli(p), if its PMF is
$p_X(k) = p^k (1-p)^{1-k}, \qquad k \in \{0, 1\}$
Equivalently: P(X=1)=p and P(X=0)=1−p=q.
The PMF is trivially a valid probability mass function: it assigns non-negative values to exactly two outcomes, and p+(1−p)=1. Throughout this topic, we write q=1−p for brevity.
Theorem 1 Bernoulli Moments and MGF
If X∼Bernoulli(p), then:
E[X]=p
Var(X)=p(1−p)=pq
$M_X(t) = q + pe^t$
Proof
Expectation. Since X takes only two values:
E[X]=0⋅P(X=0)+1⋅P(X=1)=0⋅q+1⋅p=p
Variance. First compute the second moment. Since $X \in \{0,1\}$, we have $X^2 = X$, so $E[X^2] = E[X] = p$. By the variance shortcut from Topic 4:
$\mathrm{Var}(X) = E[X^2] - (E[X])^2 = p - p^2 = p(1-p)$
MGF. By definition of the moment-generating function:
$M_X(t) = E[e^{tX}] = e^{t\cdot 0}\cdot q + e^{t\cdot 1}\cdot p = q + pe^t$
This is defined for all t∈R. □
◼
Remark Bernoulli exponential family form
We can rewrite the Bernoulli PMF as
$p_X(k) = (1-p)\exp\left(k\ln\frac{p}{1-p}\right)$
This is the exponential family form with natural parameter $\eta = \ln(p/(1-p))$ — the logit of p. The inverse map $p = 1/(1 + e^{-\eta})$ is the sigmoid function. This connection is why logistic regression uses the Bernoulli distribution: the model is $Y \mid X \sim \text{Bernoulli}(\sigma(\beta^T X))$, where σ is the sigmoid (see formalML: Logistic Regression ).
Example 1 Bernoulli as logistic regression foundation
In binary classification, we model the response as $Y \mid X \sim \text{Bernoulli}(p(X))$ where $p(X) = \sigma(\beta^T X)$. The negative log-likelihood of a single observation $(x_i, y_i)$ is:
$-\ln p(y_i \mid x_i) = -\left[y_i \ln \hat{p}_i + (1 - y_i)\ln(1 - \hat{p}_i)\right]$
This is the cross-entropy loss — the loss function you minimize in logistic regression is literally the negative Bernoulli log-likelihood. Every gradient descent step in logistic regression is doing maximum likelihood estimation for Bernoulli parameters.
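To make the correspondence concrete, here is a minimal sketch in Python — the feature vector, weights, and label are illustrative, not from the text — computing the per-observation cross-entropy as the negative Bernoulli log-likelihood.

```python
# Sketch: cross-entropy loss as the negative Bernoulli log-likelihood.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_nll(y, p_hat):
    # -[y ln p_hat + (1 - y) ln(1 - p_hat)]
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# one hypothetical observation: features x_i, label y_i, weights beta
x_i = np.array([1.0, 2.0, -0.5])
beta = np.array([0.3, -0.1, 0.8])
y_i = 1
p_i = sigmoid(beta @ x_i)          # Bernoulli parameter sigma(beta^T x)
print(bernoulli_nll(y_i, p_i))     # cross-entropy loss for this example
```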
5.3 The Binomial Distribution
What happens when we repeat a Bernoulli trial n independent times and count the total number of successes? The answer is the Binomial distribution — the most natural generalization of Bernoulli.
Definition 2 Binomial Distribution
A random variable X has the Binomial distribution with parameters n∈N and p∈(0,1), written X∼Binomial(n,p), if its PMF is
$p_X(k) = \binom{n}{k} p^k (1-p)^{n-k}, \qquad k \in \{0, 1, \dots, n\}$
where $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$ is the binomial coefficient.
The PMF counts exactly the probability of getting k specific successes (each contributing p) and n−k specific failures (each contributing q = 1−p), multiplied by $\binom{n}{k}$ because we don't care which trials are the successes — only how many. The binomial theorem confirms the PMF sums to 1: $\sum_{k=0}^{n}\binom{n}{k} p^k q^{n-k} = (p+q)^n = 1$.
Theorem 2 Binomial as Sum of Bernoullis
If $X_1, X_2, \dots, X_n \overset{\text{iid}}{\sim} \text{Bernoulli}(p)$ and $X = X_1 + X_2 + \cdots + X_n$, then $X \sim \text{Binomial}(n, p)$.
Proof
We need $P(X = k) = \binom{n}{k} p^k q^{n-k}$. The event $\{X = k\}$ means exactly k of the n independent Bernoulli trials are 1 and n−k are 0. There are $\binom{n}{k}$ ways to choose which k trials succeed. For any specific such pattern, the probability is $p^k q^{n-k}$ by independence. Summing over all $\binom{n}{k}$ patterns gives $P(X = k) = \binom{n}{k} p^k q^{n-k}$. □
◼
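A quick way to see Theorem 2 in action is a simulation — a sketch assuming NumPy and SciPy are available; the specific n, p, and replication count are arbitrary.

```python
# Sketch: sum n iid Bernoulli(p) draws and compare sample frequencies
# to the Binomial(n, p) PMF.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, reps = 10, 0.3, 200_000
sums = rng.binomial(1, p, size=(reps, n)).sum(axis=1)   # sum of Bernoullis

for k in range(n + 1):
    empirical = np.mean(sums == k)
    exact = stats.binom.pmf(k, n, p)
    print(f"k={k}: empirical {empirical:.4f}  exact {exact:.4f}")
```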
This representation is not just a derivation trick — it’s the key to computing moments.
Theorem 3 Binomial Expectation
If X∼Binomial(n,p), then E[X]=np.
Proof
Via linearity (the elegant proof). Write X=X1+⋯+Xn where each Xi∼Bernoulli(p). By linearity of expectation — which does not require independence:
E[X]=E[X1]+E[X2]+⋯+E[Xn]=np
Direct PMF proof. We can also verify directly from the PMF. The k=0 term vanishes, so:
$E[X] = \sum_{k=1}^{n} k\binom{n}{k} p^k q^{n-k}$
Using the identity $k\binom{n}{k} = n\binom{n-1}{k-1}$ and substituting $j = k-1$:
$E[X] = np\sum_{j=0}^{n-1}\binom{n-1}{j} p^j q^{(n-1)-j} = np\,(p+q)^{n-1} = np$ □
The same sum-of-Bernoullis representation also gives $\mathrm{Var}(X) = npq$ (independent variances add) and $M_X(t) = (q + pe^t)^n$ (independent MGFs multiply). In particular, if $X \sim \text{Binomial}(n_1, p)$ and $Y \sim \text{Binomial}(n_2, p)$ are independent, their MGFs multiply to $(q + pe^t)^{n_1+n_2}$. This is the MGF of $\text{Binomial}(n_1+n_2, p)$, so by uniqueness, $X + Y \sim \text{Binomial}(n_1+n_2, p)$. □
◼
Example 2 A/B testing: confidence interval for conversion rate
An e-commerce site runs an A/B test: 1200 users see design A, with 84 conversions. The sample proportion p^=84/1200=0.07 estimates the Binomial parameter p. By the Central Limit Theorem, an approximate 95% confidence interval is
$\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.07 \pm 0.014$
so we're 95% confident the true conversion rate is between 5.6% and 8.4%. The quantity $\hat{p}(1-\hat{p})/n$ under the square root estimates $\mathrm{Var}(\hat{p}) = pq/n$ — the Binomial variance $npq$, scaled down by $n^2$ because $\hat{p} = X/n$, is doing all the work.
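A sketch of the same calculation in Python, using the numbers from the example and the Wald construction described above:

```python
# Sketch: 95% Wald confidence interval for the conversion rate in Example 2.
import numpy as np

n, x = 1200, 84
p_hat = x / n                              # 0.07
se = np.sqrt(p_hat * (1 - p_hat) / n)      # sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")     # roughly [0.056, 0.084]
```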
5.4 The Geometric Distribution
The Binomial fixes n (number of trials) and counts successes. The Geometric flips the question: we keep running independent Bernoulli trials until the first success. How long do we wait?
Definition 3 Geometric Distribution
A random variable X has the Geometric distribution with parameter p∈(0,1), written X∼Geometric(p), if its PMF is
$p_X(k) = p(1-p)^{k-1}, \qquad k \in \{1, 2, 3, \dots\}$
Here X counts the number of trials until (and including) the first success.
Remark Two conventions for the Geometric
Some textbooks define X as the number of failures before the first success, giving $P(X = k) = p(1-p)^k$ for $k = 0, 1, 2, \dots$ with $E[X] = (1-p)/p$. We use the "trials until first success" convention (support starts at 1) because it parallels the Negative Binomial more naturally: NegBin counts trials until the r-th success, and Geometric is the r = 1 case. Always check which convention a text uses — the formulas for E[X] and Var(X) differ.
The same tools give the Geometric moments: $E[X] = 1/p$, $\mathrm{Var}(X) = q/p^2$, and $M_X(t) = \frac{pe^t}{1 - qe^t}$, each obtained by summing (or differentiating) a geometric series. The geometric series converges when $|qe^t| < 1$, i.e., $t < \ln(1/q) = -\ln(1-p)$. □
◼
Now the most important structural property of the Geometric distribution — the one that makes it unique among discrete distributions.
Theorem 8 Memoryless Property of the Geometric Distribution
The Geometric distribution is the only discrete distribution with the memoryless property: for all s,t∈{0,1,2,…},
P(X>s+t∣X>s)=P(X>t)
That is, given that we’ve already waited s trials without success, the remaining waiting time has the same distribution as if we started fresh.
Proof
The Geometric is memoryless. The survival function is $P(X > n) = q^n$ (probability of n consecutive failures). Then:
$P(X > s+t \mid X > s) = \frac{P(X > s+t)}{P(X > s)} = \frac{q^{s+t}}{q^s} = q^t = P(X > t)$
Uniqueness. Suppose X takes values in $\{1, 2, \dots\}$ and satisfies $P(X > s+t \mid X > s) = P(X > t)$ for all $s, t \ge 0$. Let $g(n) = P(X > n)$, so $g(0) = 1$. The memoryless property gives $g(s+t) = g(s)\,g(t)$. The only function satisfying this Cauchy functional equation on the non-negative integers with $g(0) = 1$ and $0 < g(1) < 1$ is $g(n) = g(1)^n$. Setting $q = g(1)$ gives $P(X > n) = q^n$, which is exactly the Geometric survival function with $p = 1-q$. □
◼
The memoryless property means that Bernoulli trials have no “momentum” — the fact that you’ve failed 100 times doesn’t make the 101st trial any more likely to succeed. This is the mathematical content of “the coin has no memory.”
Interactive: Memoryless Property Explorer — shows the full Geometric PMF alongside the conditional PMF given X > 5; re-indexed, the conditional has the same shape as the original. A numerical check confirms $P(X > 5+t \mid X > 5) = P(X > t)$ for t = 1, …, 5 (with p = 0.25, both equal $0.75^t$). The past doesn't help predict the future — the trials are independent.
Example 3 Expected trials until first click
A display ad has a click-through rate of p = 0.02. The number of impressions until the first click follows $X \sim \text{Geometric}(0.02)$. Expected impressions: $E[X] = 1/0.02 = 50$. Standard deviation: $\sigma = \sqrt{q/p^2} = \sqrt{0.98/0.0004} \approx 49.5$.
The memoryless property is practically important here: if an ad has been shown 100 times without a click, the expected number of additional impressions until first click is still 50 — the past doesn’t help predict the future, because each impression is an independent Bernoulli trial.
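A small simulation sketch (assuming NumPy; the conditioning threshold of 100 impressions is illustrative) checks both E[X] = 1/p and the memoryless property:

```python
# Sketch: Geometric(p) waiting times (trials until first success) — NumPy's
# rng.geometric uses the same "trials" convention with support {1, 2, ...}.
import numpy as np

rng = np.random.default_rng(1)
p, reps = 0.02, 500_000
draws = rng.geometric(p, size=reps)

print(draws.mean())                        # ~ 1/p = 50
# memorylessness: P(X > 100 + t | X > 100) vs P(X > t)
t = 25
cond = np.mean(draws[draws > 100] > 100 + t)
uncond = np.mean(draws > t)
print(cond, uncond)                        # both ~ (1 - p) ** t
```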
5.5 The Negative Binomial Distribution
The Geometric waits for the first success. The Negative Binomial generalizes: wait for the r-th success.
Definition 4 Negative Binomial Distribution
A random variable X has the Negative Binomial distribution with parameters r∈N and p∈(0,1), written X∼NegBin(r,p), if its PMF is
$p_X(k) = \binom{k-1}{r-1} p^r (1-p)^{k-r}, \qquad k \in \{r, r+1, r+2, \dots\}$
Here X counts the total number of trials until (and including) the r-th success.
The PMF logic: on trial k, the r-th success occurs. That means exactly r−1 successes in the first k−1 trials (there are $\binom{k-1}{r-1}$ ways to arrange these), each arrangement contributing $p^{r-1}q^{k-r}$, and then a final success on trial k contributing p. Total: $\binom{k-1}{r-1} p^r q^{k-r}$.
Note: NegBin(1, p) = Geometric(p). Setting r = 1 gives $\binom{k-1}{0} p\, q^{k-1} = p\, q^{k-1}$, which is exactly the Geometric PMF.
Theorem 9 Negative Binomial as Sum of Geometrics
If $Y_1, Y_2, \dots, Y_r \overset{\text{iid}}{\sim} \text{Geometric}(p)$ and $X = Y_1 + Y_2 + \cdots + Y_r$, then $X \sim \text{NegBin}(r, p)$.
Proof
Think of Yi as the number of trials from the (i−1)-th success to the i-th success. By the memoryless property of the Geometric, after each success the process “resets” — the remaining waiting time is independent of the past. So Y1,…,Yr are independent, and X=Y1+⋯+Yr is the total number of trials until the r-th success.
To verify the PMF, we can use MGFs. The MGF of $Y_i$ is $\frac{pe^t}{1 - qe^t}$, so by independence:
$M_X(t) = \left(\frac{pe^t}{1 - qe^t}\right)^r$
This uniquely determines the NegBin(r,p) distribution. □
◼
Theorem 10 Negative Binomial Moments and MGF
If X∼NegBin(r,p), then:
E[X]=r/p
$\mathrm{Var}(X) = rq/p^2$
$M_X(t) = \left(\frac{pe^t}{1 - qe^t}\right)^r$, defined for $t < -\ln(1-p)$
Proof
Write $X = Y_1 + \cdots + Y_r$ with $Y_i \overset{\text{iid}}{\sim} \text{Geometric}(p)$. By linearity:
$E[X] = r\cdot E[Y_1] = r\cdot\frac{1}{p} = \frac{r}{p}$
By independence:
$\mathrm{Var}(X) = r\cdot\mathrm{Var}(Y_1) = r\cdot\frac{q}{p^2} = \frac{rq}{p^2}$
The MGF follows from the product of r independent Geometric MGFs, as shown in Theorem 9. □
◼
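The sum-of-Geometrics construction is easy to check numerically — a sketch with arbitrary r and p, assuming NumPy:

```python
# Sketch: build NegBin(r, p) as a sum of r iid Geometric(p) waiting times
# (Theorem 9) and compare sample mean/variance to r/p and r*q/p^2.
import numpy as np

rng = np.random.default_rng(2)
r, p, reps = 5, 0.3, 200_000
waits = rng.geometric(p, size=(reps, r)).sum(axis=1)   # total trials to r-th success

q = 1 - p
print(waits.mean(), r / p)             # ~ 16.67
print(waits.var(), r * q / p**2)       # ~ 38.89
```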
A key feature of the Negative Binomial is overdispersion. In the failures-counting parameterization used for count modeling (Y = X − r, the number of failures before the r-th success), $E[Y] = rq/p$ and $\mathrm{Var}(Y) = rq/p^2$, so $\mathrm{Var}(Y) = E[Y]/p > E[Y]$ whenever $q > 0$. This overdispersion (variance exceeding the mean) makes the Negative Binomial the go-to alternative when the Poisson is too restrictive.
Example 4 RNA-seq count modeling with overdispersion
In RNA-seq experiments, gene expression is measured as read counts. If counts followed a Poisson model, Var(Y)=E[Y]. But biological replicates consistently show Var(Y)>E[Y] — overdispersion due to biological variability between samples. Tools like DESeq2 model counts as Y∼NegBin(μ,r), where the dispersion parameter r captures the excess variability. As r→∞, NegBin→Poisson, so the Poisson model is a special case — and a testable hypothesis.
5.6 The Poisson Distribution
The Poisson distribution arises in a completely different way from the Bernoulli family. Instead of counting successes in a fixed number of trials, it counts events occurring at a constant average rate in a continuous interval — photons hitting a detector, customers arriving at a store, typos per page, mutations per genome.
Definition 5 Poisson Distribution
A random variable X has the Poisson distribution with parameter λ>0, written X∼Poisson(λ), if its PMF is
$p_X(k) = \frac{e^{-\lambda}\lambda^k}{k!}, \qquad k \in \{0, 1, 2, \dots\}$
The parameter λ is both the mean and the variance (equidispersion).
The PMF sums to 1 because $\sum_{k=0}^{\infty} \lambda^k/k! = e^{\lambda}$ (the exponential power series from formalCalculus: Taylor Series ), so $\sum_{k=0}^{\infty} e^{-\lambda}\lambda^k/k! = e^{-\lambda}\cdot e^{\lambda} = 1$.
But where does this PMF come from? The answer is the Poisson limit theorem — arguably the most beautiful limit in discrete probability.
Theorem 11 Poisson Limit Theorem
If $X_n \sim \text{Binomial}(n, \lambda/n)$ with $\lambda > 0$ fixed, then for each $k \in \{0, 1, 2, \dots\}$:
$\lim_{n\to\infty} P(X_n = k) = \frac{e^{-\lambda}\lambda^k}{k!}$
That is, $\text{Binomial}(n, \lambda/n) \to \text{Poisson}(\lambda)$ as $n \to \infty$.
Proof
Fix k and compute $P(X_n = k)$ with $p = \lambda/n$:
$P(X_n = k) = \binom{n}{k}\left(\frac{\lambda}{n}\right)^k\left(1 - \frac{\lambda}{n}\right)^{n-k}$
Rewrite the binomial coefficient:
$= \frac{\lambda^k}{k!}\cdot\frac{n(n-1)\cdots(n-k+1)}{n^k}\cdot\left(1 - \frac{\lambda}{n}\right)^n\cdot\left(1 - \frac{\lambda}{n}\right)^{-k}$
Now take $n \to \infty$ with k and λ fixed. The three factors converge: $\frac{n(n-1)\cdots(n-k+1)}{n^k} \to 1$, $\left(1 - \frac{\lambda}{n}\right)^n \to e^{-\lambda}$, and $\left(1 - \frac{\lambda}{n}\right)^{-k} \to 1$. Hence $P(X_n = k) \to \frac{e^{-\lambda}\lambda^k}{k!}$. □
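The convergence is fast enough to see numerically. A sketch assuming SciPy, with λ = 3 and k = 2 chosen arbitrarily:

```python
# Sketch: Binomial(n, lambda/n) PMF values approach the Poisson(lambda) PMF.
from scipy import stats

lam, k = 3.0, 2
for n in (10, 100, 1000, 10000):
    print(n, stats.binom.pmf(k, n, lam / n))
print("Poisson:", stats.poisson.pmf(k, lam))   # e^{-3} * 3^2 / 2! ~ 0.224
```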
Example 5 Click counts and burst detection
A website receives an average of λ = 12 clicks per minute. Model the count in any minute as $X \sim \text{Poisson}(12)$. Then $P(X = 0) = e^{-12} \approx 6\times 10^{-6}$ — essentially no chance of a minute with zero clicks. The probability of an unusually high burst, $P(X \ge 20)$, can be computed by summing the PMF tail.
By the reproductive property (sums of independent Poisson counts are Poisson, with the rates adding), the count in a 5-minute window follows Poisson(60). The equidispersion property (Var = mean) gives $\sigma = \sqrt{60} \approx 7.7$, so a count of 75 or more (≈2σ above the mean) would be suspicious — a signal worth investigating for bot traffic.
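A sketch of these tail calculations with standard scipy.stats usage (the exact tail values are left to the reader to run):

```python
# Sketch: the Poisson tail probabilities from the example above.
from scipy import stats

lam = 12
print(stats.poisson.pmf(0, lam))        # P(X = 0) = e^{-12}, roughly 6e-6
print(stats.poisson.sf(19, lam))        # P(X >= 20), the burst probability
# 5-minute window: reproductive property gives Poisson(60)
print(stats.poisson.sf(74, 60))         # P(count >= 75) in 5 minutes
```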
5.7 The Hypergeometric Distribution
Every distribution so far assumes either independent trials (Bernoulli, Binomial, Geometric, NegBin) or a rate process (Poisson). The Hypergeometric breaks the independence assumption: it models sampling without replacement from a finite population.
Definition 6 Hypergeometric Distribution
A random variable X has the Hypergeometric distribution with parameters N (population size), K (number of success states), and n (number of draws), written X∼Hypergeometric(N,K,n), if its PMF is
$p_X(k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}, \qquad k \in \{\max(0,\, n-N+K), \dots, \min(n, K)\}$
The PMF counts combinatorially: choose k successes from the K available ($\binom{K}{k}$ ways), choose n−k failures from the N−K available ($\binom{N-K}{n-k}$ ways), and divide by the total number of ways to choose n items from N ($\binom{N}{n}$). The support bounds ensure non-negative binomial coefficients.
Theorem 15 Hypergeometric Expectation
If X∼Hypergeometric(N,K,n), then E[X]=nK/N.
Proof
Define indicator variables Xi=1 if the i-th draw is a success. Then X=X1+⋯+Xn. By symmetry, P(Xi=1)=K/N for every i — each draw is equally likely to pick any of the N items, regardless of order. By linearity:
$E[X] = \sum_{i=1}^{n} E[X_i] = n\cdot\frac{K}{N}$
Note this is the same mean as for Binomial(n, K/N) — the mean doesn't know whether we're sampling with or without replacement. □
◼
Theorem 16 Hypergeometric Variance and Finite Population Correction
If X∼Hypergeometric(N,K,n), then
$\mathrm{Var}(X) = n\cdot\frac{K}{N}\cdot\frac{N-K}{N}\cdot\frac{N-n}{N-1}$
The factor $\frac{N-n}{N-1}$ is the finite population correction (FPC).
Proof
Using the indicator decomposition X=X1+⋯+Xn:
$\mathrm{Var}(X) = \sum_{i=1}^{n}\mathrm{Var}(X_i) + 2\sum_{i<j}\mathrm{Cov}(X_i, X_j)$
Each $X_i$ is Bernoulli with $p = K/N$, so $\mathrm{Var}(X_i) = \frac{K}{N}\left(1 - \frac{K}{N}\right)$.
For the covariance, $P(X_i = 1, X_j = 1) = \frac{K}{N}\cdot\frac{K-1}{N-1}$, so:
$\mathrm{Cov}(X_i, X_j) = \frac{K}{N}\cdot\frac{K-1}{N-1} - \left(\frac{K}{N}\right)^2 = -\frac{K(N-K)}{N^2(N-1)}$
Combining the n variance terms and the $\binom{n}{2}$ covariance terms yields the result. The FPC factor $(N-n)/(N-1)$ reduces the variance compared to the Binomial: sampling without replacement reduces uncertainty because each draw provides more information. □
◼
Remark The 5% rule for finite population correction
When n/N<0.05 (sampling less than 5% of the population), the FPC is (N−n)/(N−1)>0.95, and the Binomial approximation Var(X)≈npq is within 5% of the truth. This is why opinion polls of 1000 people can represent millions: n/N is tiny, so whether we sample with or without replacement barely matters.
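To see the 5% rule in numbers, here is a sketch with a hypothetical poll of 1,000 respondents drawn from a population of 1,000,000:

```python
# Sketch: the finite population correction for a large-population poll is negligible.
N, n, p = 1_000_000, 1_000, 0.5
fpc = (N - n) / (N - 1)
var_binom = n * p * (1 - p)          # with replacement: npq
var_hyper = var_binom * fpc          # without replacement: npq * FPC
print(fpc)                           # ~ 0.999
print(var_binom, var_hyper)          # 250.0 vs ~ 249.75 — barely different
```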
As the population grows, the Hypergeometric converges to the Binomial: if $K = Np$ with $p$ fixed, then $\text{Hypergeometric}(N, Np, n) \to \text{Binomial}(n, p)$ as $N \to \infty$. Using Stirling-type asymptotics on the factorials, the ratio of falling factorials converges:
$\frac{(Np)(Np-1)\cdots(Np-k+1)}{N(N-1)\cdots(N-k+1)} \to p^k$
and similarly for the failure terms, giving $\binom{n}{k}p^k(1-p)^{n-k}$ in the limit. □
◼
Interactive: With vs. Without Replacement — compares Binomial(n = 10, p = 0.40) (with replacement) against Hypergeometric(N = 50, K = 20, n = 10) (without replacement). Both have E[X] = 4.0000, but the variances are 2.4000 vs. 1.9592; the ratio equals the FPC, (N−n)/(N−1) = 0.8163. With n/N = 0.20 — sampling 20% of the population — the correction clearly matters.
Example 6 Fisher's exact test for drug trials
A clinical trial tests whether a drug reduces infection. Out of N=20 patients, K=8 recover. If the drug were irrelevant, the n=10 treated patients’ recoveries would follow X∼Hypergeometric(20,8,10). If 7 of the 10 treated patients recovered, the p-value is P(X≥7) under this null — computed exactly from the Hypergeometric PMF, with no normal approximation needed. This is Fisher’s exact test: the gold standard for small-sample 2×2 contingency tables, and a standard tool in gene set enrichment analysis (GSEA).
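A sketch of the p-value computation with scipy.stats.hypergeom (note its argument order: population size, number of success states, number of draws):

```python
# Sketch: Fisher's exact (upper-tail) p-value for Example 6.
from scipy import stats

M, K, n = 20, 8, 10                          # population, successes, draws
p_value = stats.hypergeom.sf(6, M, K, n)     # P(X >= 7) = 1 - P(X <= 6)
print(p_value)                               # roughly 0.0099
```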
5.8 The Discrete Uniform Distribution
The simplest discrete distribution: every outcome in a finite set is equally likely.
Definition 7 Discrete Uniform Distribution
A random variable X has the Discrete Uniform distribution on {a,a+1,…,b}, written X∼DiscreteUniform(a,b), if its PMF is
$p_X(k) = \frac{1}{b - a + 1} = \frac{1}{n}, \qquad k \in \{a, a+1, \dots, b\}$
where n=b−a+1 is the number of outcomes.
Theorem 18 Discrete Uniform Moments
If X∼DiscreteUniform(a,b) with n=b−a+1, then:
$E[X] = \frac{a+b}{2}$ (the midpoint)
$\mathrm{Var}(X) = \frac{n^2 - 1}{12}$
Proof
Expectation. By symmetry of the uniform distribution around its midpoint:
$E[X] = \frac{1}{n}\sum_{k=a}^{b} k = \frac{1}{n}\cdot\frac{n(a+b)}{2} = \frac{a+b}{2}$
using the arithmetic series formula.
Variance. Shift to $Y = X - a \sim \text{DiscreteUniform}(0, n-1)$, so $E[Y] = (n-1)/2$. Then:
$E[Y^2] = \frac{1}{n}\sum_{k=0}^{n-1} k^2 = \frac{(n-1)(2n-1)}{6}, \qquad \mathrm{Var}(Y) = \frac{(n-1)(2n-1)}{6} - \left(\frac{n-1}{2}\right)^2 = \frac{n^2-1}{12}$
Since $\mathrm{Var}(X) = \mathrm{Var}(Y)$ (variance is shift-invariant), the result holds for any a, b. For a fair die ($a = 1, b = 6, n = 6$): $E[X] = 3.5$, $\mathrm{Var}(X) = 35/12 \approx 2.917$. □
◼
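The fair-die numbers from the proof can be checked in a couple of lines (a sketch assuming NumPy):

```python
# Sketch: mean and (population) variance of a fair die, DiscreteUniform(1, 6).
import numpy as np

faces = np.arange(1, 7)
print(faces.mean())        # 3.5
print(faces.var())         # 35/12 ~ 2.9167
```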
Remark Maximum entropy and non-informative priors
Among all distributions on a finite set {a,…,b}, the Discrete Uniform maximizes entropy: H(X)=logn. Entropy H(X)=−∑p(k)logp(k) measures uncertainty, and the uniform distribution is the one that assumes the least — no value is favored over any other. This makes it the canonical non-informative prior in Bayesian statistics when all you know is the support. It’s also the foundation for random initialization in ML: shuffling training data, random feature selection in bagging, and hash-based tricks all use discrete uniform randomness.
5.9 Probability-Generating Functions
Topic 4 introduced the moment-generating function $M_X(t) = E[e^{tX}]$ as a tool for packaging moments and proving distributional results. For discrete random variables taking non-negative integer values, there's an even more natural generating function: the probability-generating function.
Definition 8 Probability-Generating Function
Let X be a non-negative integer-valued random variable with PMF pX. The probability-generating function (PGF) of X is
$G_X(s) = E[s^X] = \sum_{k=0}^{\infty} p_X(k)\, s^k, \qquad |s| \le 1$
The coefficients of the power series are the probabilities: $p_X(k) = G_X^{(k)}(0)/k!$.
The PGF is related to the MGF by $G_X(s) = M_X(\ln s)$ and $M_X(t) = G_X(e^t)$. But the PGF has a key advantage for discrete variables: its coefficients are the probabilities. It's a power series whose coefficient of $s^k$ is $P(X = k)$.
Theorem 19 Moments from the PGF
If GX(s) is the PGF of X, then:
$G_X'(1) = E[X]$
$G_X^{(r)}(1) = E[X(X-1)\cdots(X-r+1)]$ (the r-th factorial moment)
$\mathrm{Var}(X) = G_X''(1) + G_X'(1) - [G_X'(1)]^2$
Proof
Differentiate $G_X(s) = \sum_k p_X(k)\, s^k$ term by term:
$G_X'(s) = \sum_{k=1}^{\infty} k\, p_X(k)\, s^{k-1}$
Evaluating at $s = 1$: $G_X'(1) = \sum_k k\, p_X(k) = E[X]$.
For the second derivative:
$G_X''(s) = \sum_{k=2}^{\infty} k(k-1)\, p_X(k)\, s^{k-2}$
So $G_X''(1) = E[X(X-1)]$. Since $\mathrm{Var}(X) = E[X^2] - (E[X])^2 = E[X(X-1)] + E[X] - (E[X])^2$:
$\mathrm{Var}(X) = G_X''(1) + G_X'(1) - [G_X'(1)]^2$
The general r-th derivative follows by induction. □
◼
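A symbolic sketch (assuming SymPy is available) recovers the Binomial moments from its PGF exactly as Theorem 19 prescribes:

```python
# Sketch: E[X] and Var(X) for Binomial(n, p) from the PGF G(s) = (q + p*s)^n.
import sympy as sp

s, n, p = sp.symbols("s n p", positive=True)
G = (1 - p + p * s) ** n                      # Binomial PGF

G1 = sp.diff(G, s).subs(s, 1)                 # G'(1) = E[X]
G2 = sp.diff(G, s, 2).subs(s, 1)              # G''(1) = E[X(X-1)]
var = sp.simplify(G2 + G1 - G1**2)

print(sp.simplify(G1))    # n*p
print(var)                # n*p*(1 - p), i.e. npq (possibly printed as -n*p*(p - 1))
```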
Here are the PGFs for our seven distributions:
| Distribution | PGF G(s) |
| --- | --- |
| Bernoulli(p) | $q + ps$ |
| Binomial(n, p) | $(q + ps)^n$ |
| Geometric(p) | $ps/(1 - qs)$ |
| NegBin(r, p) | $\left(ps/(1 - qs)\right)^r$ |
| Poisson(λ) | $e^{\lambda(s-1)}$ |
| DiscreteUniform(0, n−1) | $(1 - s^n)/(n(1 - s))$ |
Theorem 20 PGF Sum Property
If X and Y are independent non-negative integer-valued random variables, then
GX+Y(s)=GX(s)⋅GY(s)
Proof
$G_{X+Y}(s) = E[s^{X+Y}] = E[s^X s^Y] = E[s^X]\cdot E[s^Y] = G_X(s)\cdot G_Y(s)$, where the third equality uses independence. □
◼
The real payoff of PGFs is the compound distribution formula — the reason PGFs exist as a separate tool rather than just being a notational variant of MGFs.
Theorem 21 Compound Distribution Formula
Let N be a non-negative integer-valued random variable, and let $X_1, X_2, \dots$ be iid non-negative integer-valued random variables, independent of N. Define the random sum $S = X_1 + X_2 + \cdots + X_N$ (with $S = 0$ when $N = 0$). Then
$G_S(s) = G_N(G_X(s))$
Proof
Condition on N and use independence of the $X_i$:
$G_S(s) = E[s^S] = \sum_{n=0}^{\infty} E\!\left[s^{X_1 + \cdots + X_n}\right] P(N = n) = \sum_{n=0}^{\infty} [G_X(s)]^n P(N = n)$
The last step recognizes $\sum_n [G_X(s)]^n P(N = n)$ as $E[z^N]$ evaluated at $z = G_X(s)$, which is $G_N(G_X(s))$. □
◼
Interactive: Probability-Generating Functions — plots $G(s) = E[s^X]$ and reads off moments numerically, matching the closed forms via $G'(1) = E[X]$, $G''(1) = E[X(X-1)]$, and $\mathrm{Var}(X) = G''(1) + G'(1) - [G'(1)]^2$ (e.g., for Bernoulli(0.4): $G'(1) = 0.4$, $G''(1) = 0$, $\mathrm{Var} = 0.24$).
Example 7 Poisson thinning via compound distributions
A Poisson process produces $N \sim \text{Poisson}(\lambda)$ events per unit time. Each event is independently "kept" with probability p (and discarded with probability 1−p). The number of retained events is $S = X_1 + \cdots + X_N$ where $X_i \sim \text{Bernoulli}(p)$. By the compound distribution formula:
$G_S(s) = G_N(G_X(s)) = e^{\lambda(G_X(s) - 1)} = e^{\lambda(q + ps - 1)} = e^{\lambda p (s - 1)}$
This is the PGF of Poisson(λp)! Thinning a Poisson process by probability p gives another Poisson process with rate λp. This result is used in practice for network traffic modeling, rare event simulation, and the "thinning" algorithm for simulating non-homogeneous Poisson processes.
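The thinning result is also easy to confirm by simulation — a sketch with arbitrary λ and p, assuming NumPy:

```python
# Sketch: thin a Poisson(lambda) count by keeping each event with probability p;
# the retained count should behave like Poisson(lambda * p).
import numpy as np

rng = np.random.default_rng(3)
lam, p, reps = 10.0, 0.3, 200_000
N = rng.poisson(lam, size=reps)              # events per unit time
S = rng.binomial(N, p)                       # keep each event with probability p
print(S.mean(), S.var())                     # both ~ lambda * p = 3.0 (equidispersion)
```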
5.10 Relationships Between Distributions
The seven distributions are not isolated — they form a web of connections through special cases, sums, and limiting relationships. The diagram below captures these connections.
Special cases:
Bernoulli(p)=Binomial(1,p)
Geometric(p)=NegBin(1,p)
Sum relationships (independent, same p):
$\sum_{i=1}^{n} \text{Bernoulli}(p) \sim \text{Binomial}(n, p)$
$\sum_{i=1}^{r} \text{Geometric}(p) \sim \text{NegBin}(r, p)$
Reproductive properties (independent, same p or λ):
$\text{Binomial}(n_1, p) + \text{Binomial}(n_2, p) \sim \text{Binomial}(n_1 + n_2, p)$
$\text{Poisson}(\lambda_1) + \text{Poisson}(\lambda_2) \sim \text{Poisson}(\lambda_1 + \lambda_2)$
Limiting relationships:
$\text{Binomial}(n, \lambda/n) \to \text{Poisson}(\lambda)$ as $n \to \infty$ (Poisson limit theorem)
$\text{Hypergeometric}(N, Np, n) \to \text{Binomial}(n, p)$ as $N \to \infty$ (infinite population limit)
Remark Five of seven are exponential family members
Bernoulli, Binomial, Geometric, Negative Binomial, and Poisson all belong to the exponential family — distributions whose PMFs can be written as $p(k \mid \theta) = h(k)\exp(\eta(\theta)\cdot T(k) - A(\theta))$. The Hypergeometric does not belong because its support $\{\max(0, n-N+K), \dots, \min(n, K)\}$ depends on the parameters — violating the requirement that the support is fixed. The Discrete Uniform does not belong for the same reason: its support $\{a, \dots, b\}$ depends on the parameters a and b. Exponential Families makes this precise and shows why exponential family membership matters for estimation, testing, and GLMs.
5.11 Connections to ML
Every distribution in this topic appears in machine learning — not as an abstract exercise, but as a modeling choice that shapes loss functions, estimators, and inference procedures.
Binary classification (Bernoulli): Logistic regression models $Y \mid X \sim \text{Bernoulli}(\sigma(\beta^T X))$. Cross-entropy loss = negative Bernoulli log-likelihood. The logit link $\eta = \ln(p/(1-p))$ comes directly from the exponential family natural parameter (see formalML: Logistic Regression ).
A/B testing (Binomial): The sample proportion $\hat{p} = X/n$ is the MLE of the Binomial parameter. Confidence intervals use $\mathrm{Var}(\hat{p}) = pq/n$. Power calculations for determining sample sizes reduce to Binomial tail probabilities.
Count regression (Poisson, NegBin): Poisson regression models $Y \mid X \sim \text{Poisson}(\exp(\beta^T X))$ via the log link. When data are overdispersed ($\mathrm{Var}(Y) > E[Y]$), Negative Binomial regression provides the fix. The variance function $\mathrm{Var}(Y) = \mu + \mu^2/r$ interpolates between Poisson ($r \to \infty$) and pure quadratic variance (r finite) — see Topic 22 §22.5 (Poisson regression) for the worked Poisson treatment and §22.9 Rem 20 for the Negative Binomial overdispersion fix.
Sampling and exact tests (Hypergeometric): Fisher’s exact test uses the Hypergeometric to compute exact p-values for 2×2 tables. Gene set enrichment analysis (GSEA) asks whether a gene set is overrepresented in a ranked list — a Hypergeometric calculation.
Random initialization (Discrete Uniform): Shuffling training data, selecting features in random forests, hash-based tricks for dimensionality reduction — all use discrete uniform randomness as a foundation (see formalML: Naive Bayes ).
Example 8 Distribution choice flowchart for count data
When modeling count data in ML, the choice of distribution matters:
Binary outcome (0 or 1)? → Bernoulli / Logistic regression
Fixed number of trials (n known), independent, same p? → Binomial
Event counts in a continuous interval, no upper bound? → Poisson (check equidispersion: if Var≫Mean, use Negative Binomial)
Waiting time until the r-th success? → Negative Binomial (Geometric if r=1)
Sampling without replacement from a finite population? → Hypergeometric
No prior information, finite outcomes? → Discrete Uniform
The choice determines the loss function, the link function, and the variance structure of your model. Getting it wrong — using Poisson when data are overdispersed, for instance — leads to overconfident inference and unreliable p-values.
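As a small illustration of putting that flowchart into practice, here is a sketch of a dispersion check for count data (the 1.5 cutoff is an arbitrary illustrative threshold, not a standard rule):

```python
# Sketch: sample variance-to-mean ratio as a quick Poisson vs. NegBin screen.
import numpy as np

def dispersion_ratio(counts):
    counts = np.asarray(counts, dtype=float)
    return counts.var(ddof=1) / counts.mean()

y = np.array([0, 3, 1, 7, 2, 0, 9, 4, 1, 12])   # hypothetical counts
ratio = dispersion_ratio(y)
print(ratio)
print("NegBin (overdispersed)" if ratio > 1.5 else "Poisson plausible")
```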
Summary
We’ve cataloged seven discrete distributions, each arising from a distinct probabilistic mechanism. Here’s the reference table:
| Distribution | PMF | E[X] | Var(X) | MGF $M_X(t)$ | Exp. Family? |
| --- | --- | --- | --- | --- | --- |
| Bern(p) | $p^k q^{1-k}$ | $p$ | $pq$ | $q + pe^t$ | Yes |
| Binom(n, p) | $\binom{n}{k}p^k q^{n-k}$ | $np$ | $npq$ | $(q + pe^t)^n$ | Yes |
| Geom(p) | $p q^{k-1}$ | $1/p$ | $q/p^2$ | $pe^t/(1 - qe^t)$ | Yes |
| NB(r, p) | $\binom{k-1}{r-1}p^r q^{k-r}$ | $r/p$ | $rq/p^2$ | $\left(pe^t/(1 - qe^t)\right)^r$ | Yes |
| Pois(λ) | $e^{-\lambda}\lambda^k/k!$ | $\lambda$ | $\lambda$ | $e^{\lambda(e^t - 1)}$ | Yes |
| HGeom(N, K, n) | $\binom{K}{k}\binom{N-K}{n-k}\big/\binom{N}{n}$ | $nK/N$ | $npq\cdot\text{FPC}$ (with $p = K/N$) | (no closed form) | No |
| DUnif(a, b) | $1/n$ | $(a+b)/2$ | $(n^2-1)/12$ | $e^{ta}(1 - e^{tn})/(n(1 - e^t))$ | No |
Key structural insights:
The Bernoulli is the atom from which Binomial and Geometric are built
The Poisson arises as the limit of the Binomial for rare events
The Hypergeometric reduces to the Binomial when the population is large
Five of seven distributions belong to the exponential family — the unifying framework of Exponential Families
PGFs provide a dedicated tool for discrete distributions, with the compound distribution formula as its signature application
What comes next. This topic cataloged the discrete distributions. The parallel treatment continues:
Continuous Distributions applies the same systematic approach to Normal, Exponential, Gamma, Beta, and Uniform — the continuous counterparts
Exponential Families unifies the five exponential family members here with their continuous counterparts, identifying natural parameters, sufficient statistics, and log-partition functions
Modes of Convergence makes the Poisson limit theorem rigorous using convergence in distribution, and previews the weak law of large numbers via Chebyshev