Intermediate · 60 min read · April 21, 2026

Bayesian Model Comparison & BMA

Topic 25 introduced the marginal likelihood m(y) as the normalizer of the posterior; Topic 26 gave us MCMC machinery that samples the posterior without ever needing m(y). Topic 27 asks the opposite question: when we do need m(y) — for Bayes factors, posterior model probabilities, Bayesian model averaging — how do we estimate it from MCMC draws that only know the posterior's shape? Four estimators (Laplace, bridge sampling, path sampling, nested sampling), three predictive scores (DIC, WAIC, PSIS-LOO), the Lindley paradox (featured), the Occam-factor ⇔ BIC equivalence that closes Topic 24 §24.4, local FDR as the Bayesian counterpart to Topic 20's BH procedure, and posterior-predictive checks — all built on Topic 26's draws.

27.1 The hidden object: marginal likelihood as Bayesian normalizer

Every Bayesian computation in Topics 25 and 26 ran through the same unnormalized identity:

$$p(\theta \mid \mathbf{y}) \;\propto\; L(\theta)\, \pi(\theta).$$

The proportionality constant — the integral

$$m(\mathbf{y}) \;=\; \int_\Theta L(\theta)\, \pi(\theta)\, d\theta$$

— didn't matter for posterior inference. Metropolis–Hastings cancels $m(\mathbf{y})$ in the acceptance ratio. Gibbs samplers sample from full conditionals that depend only on the unnormalized joint density. HMC follows gradients of $\log L + \log \pi$ and never touches the normalizer. Topic 26's machinery delivers posterior expectations, credible intervals, and predictive draws — all without ever computing $m(\mathbf{y})$.

Topic 27 is about what happens when you need $m(\mathbf{y})$ anyway. Three tasks force the issue:

  1. Bayes factors. Comparing two models $\mathcal{M}_0$ and $\mathcal{M}_1$ via the posterior odds ratio requires the ratio $m_0(\mathbf{y}) / m_1(\mathbf{y})$. The individual normalizers don't cancel — they're the whole game.
  2. Bayesian model averaging. Predictions weighted by posterior model probability require $p(\mathcal{M}_k \mid \mathbf{y}) \propto p(\mathcal{M}_k)\, m_k(\mathbf{y})$. Again, the marginal likelihoods themselves are the computational object.
  3. Evidence-level hypothesis testing. "How much does this dataset support $\mathcal{M}_1$ over $\mathcal{M}_0$?" — a question Neyman–Pearson answers with a decision rule, and Bayesian analysis answers with a number derived from $m(\mathbf{y})$.
Figure 1. The marginal likelihood as a partition: left panel shows the unnormalized posterior L(θ)π(θ) as a curve over Θ, with the shaded area equal to m(y). Right panel shows the normalized posterior p(θ | y). The marginal likelihood is the constant that converts unnormalized to normalized — the Bayesian analog of the partition function Z in statistical physics.

The figure above captures the central idea: $m(\mathbf{y})$ is the partition function — the integral over parameter space that turns the unnormalized joint $L(\theta)\pi(\theta)$ into a probability density. Physicists face the same object in the Gibbs distribution $p(x) \propto e^{-\beta H(x)}$; Bayesians face it in the posterior. The computational challenge is identical — integrate an exponential-family-ish object over a high-dimensional space — and the techniques developed in physics (bridge sampling originates in free-energy estimation) transfer directly.

Remark 1 Marginal likelihood vs posterior predictive — don't confuse them

Topic 25 §25.5 introduced two distinct quantities that both begin with the word “marginal”:

  • Marginal likelihood $m(\mathbf{y}) = \int L(\theta)\pi(\theta)\,d\theta$: integrates the data likelihood against the prior. A single scalar summarizing the fit of the model.
  • Posterior predictive $p(\tilde y \mid \mathbf{y}) = \int p(\tilde y \mid \theta)\, p(\theta \mid \mathbf{y})\,d\theta$: integrates a new-data likelihood against the posterior. A full density over new observations.

The marginal likelihood is the evidence supporting the model; the posterior predictive is the model’s prediction under that evidence. §27.2 formalizes the first, §27.7 develops the second.

Remark 2 Why this is harder than posterior sampling

MCMC works because it cancels $m(\mathbf{y})$. That cancellation is exactly what makes $m(\mathbf{y})$ hard to estimate from MCMC draws — the draws carry information about the posterior's shape but not its scale. Every technique in §27.6 is a way to recover the lost scale: Laplace uses the analytic Gaussian-integral formula, bridge and path sampling tie posterior draws to a calibrated proposal, nested sampling transforms the integral into a 1-D quadrature over prior-mass levels. Each has failure modes; the one-size-fits-all estimator does not exist.

27.2 Models and marginal likelihood

Formalize the setting. Let $\mathcal{M} = \{\mathcal{M}_1, \ldots, \mathcal{M}_M\}$ be a candidate set of models — each $\mathcal{M}_k$ specifies a parameter space $\Theta_k$, a likelihood $L_k(\theta_k) = p(\mathbf{y} \mid \theta_k, \mathcal{M}_k)$, and a prior $\pi_k(\theta_k \mid \mathcal{M}_k)$. Each model produces its own marginal likelihood:

$$m_k(\mathbf{y}) \;=\; \int_{\Theta_k} L_k(\theta_k)\, \pi_k(\theta_k \mid \mathcal{M}_k)\, d\theta_k.$$

A prior over models $p(\mathcal{M}_k)$ (typically uniform, $1/M$) turns the collection into a full joint model. Bayes' theorem then applies at the model level.

Theorem 1 Posterior model probabilities

Given a candidate set $\mathcal{M}$, prior model probabilities $\{p(\mathcal{M}_k)\}_{k=1}^M$, and marginal likelihoods $\{m_k(\mathbf{y})\}_{k=1}^M$, the posterior probability of model $k$ is

$$p(\mathcal{M}_k \mid \mathbf{y}) \;=\; \frac{p(\mathcal{M}_k)\, m_k(\mathbf{y})}{\sum_{j=1}^M p(\mathcal{M}_j)\, m_j(\mathbf{y})}.$$

The denominator is the total evidence $p(\mathbf{y}) = \sum_j p(\mathcal{M}_j)\, m_j(\mathbf{y})$ — a number independent of $k$, acting purely as a normalizer.

This is the standard Bayes-theorem application at the model level: the likelihood of model $k$ having generated the data is $m_k(\mathbf{y})$ (by definition of the marginal likelihood); the prior is $p(\mathcal{M}_k)$; the posterior is their normalized product. The function posteriorModelProbabilities in bayes.ts implements this softmax-like normalization in log space for numerical stability — a precaution that matters when $\log m_k$ values differ by tens or hundreds of units across a candidate set.
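A minimal sketch of that log-space normalization (the actual posteriorModelProbabilities signature in bayes.ts may differ); the log-sum-exp shift is what keeps the softmax from underflowing when the $\log m_k$ values are large and negative:

```typescript
// Posterior model probabilities from log marginal likelihoods and prior weights.
// Sketch only — argument names and defaults here are assumptions.
function posteriorModelProbs(logMarginals: number[], priors?: number[]): number[] {
  const M = logMarginals.length;
  const logPriors = (priors ?? Array(M).fill(1 / M)).map(Math.log);
  const logJoint = logMarginals.map((lm, k) => lm + logPriors[k]); // log p(M_k) m_k(y)
  const maxLog = Math.max(...logJoint);                            // log-sum-exp shift
  const shifted = logJoint.map((l) => Math.exp(l - maxLog));
  const total = shifted.reduce((a, b) => a + b, 0);
  return shifted.map((s) => s / total);                            // softmax in log space
}

// Example: three models whose log marginal likelihoods differ by ~100 units.
console.log(posteriorModelProbs([-3120.4, -3217.9, -3119.8]));
// ≈ [0.35, ~0, 0.65] — direct exponentiation of −3120 would underflow to 0.
```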

Example 1 Beta-Binomial m(y) in closed form

Consider $y \mid \theta \sim \text{Binomial}(n, \theta)$ with $\theta \sim \text{Beta}(\alpha, \beta)$. The marginal likelihood is

$$m(y) \;=\; \int_0^1 \binom{n}{y}\, \theta^y (1-\theta)^{n-y} \cdot \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}\, d\theta.$$

The integrand is proportional to $\text{Beta}(\alpha + y, \beta + n - y)$, so the integral evaluates to

$$m(y) \;=\; \binom{n}{y} \cdot \frac{B(\alpha + y, \beta + n - y)}{B(\alpha, \beta)}.$$

Taking logs: $\log m(y) = \log \binom{n}{y} + \log B(\alpha + y, \beta + n - y) - \log B(\alpha, \beta)$. For $(n, y, \alpha, \beta) = (20, 12, 1, 1)$ the answer is $\log(1/21) \approx -3.04452$ — the closed-form reference we use in §27.6 to verify the bridge / path / Laplace estimators (brief §6.2 T27.4). The function betaBinomialLogMarginal in bayes.ts computes this with stable lnGamma arithmetic.
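A self-contained sketch of that computation (the bayes.ts signature is an assumption); the Lanczos approximation stands in for whatever lnGamma routine the library uses:

```typescript
// Lanczos approximation to log Γ(x), accurate to ~1e-10 for x > 0.
function logGamma(x: number): number {
  const g = [676.5203681218851, -1259.1392167224028, 771.32342877765313,
    -176.61502916214059, 12.507343278686905, -0.13857109526572012,
    9.9843695780195716e-6, 1.5056327351493116e-7];
  if (x < 0.5) return Math.log(Math.PI / Math.sin(Math.PI * x)) - logGamma(1 - x);
  x -= 1;
  let a = 0.99999999999980993;
  const t = x + 7.5;
  for (let i = 0; i < g.length; i++) a += g[i] / (x + i + 1);
  return 0.5 * Math.log(2 * Math.PI) + (x + 0.5) * Math.log(t) - t + Math.log(a);
}

const logBeta = (a: number, b: number) => logGamma(a) + logGamma(b) - logGamma(a + b);
const logChoose = (n: number, k: number) =>
  logGamma(n + 1) - logGamma(k + 1) - logGamma(n - k + 1);

// log m(y) = log C(n,y) + log B(α+y, β+n−y) − log B(α, β)
function betaBinomialLogMarginal(n: number, y: number, alpha: number, beta: number): number {
  return logChoose(n, y) + logBeta(alpha + y, beta + n - y) - logBeta(alpha, beta);
}

console.log(betaBinomialLogMarginal(20, 12, 1, 1)); // ≈ -3.0445 = log(1/21)
```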

Remark 3 M-closed, M-open, M-complete

Bernardo–Smith (1994) classify model-comparison settings by the status of truth. M-closed: the true data-generating process is in $\mathcal{M}$. M-open: the truth is outside $\mathcal{M}$; we're choosing the best approximating model. M-complete: truth lies in a (possibly unenumerated) extension of $\mathcal{M}$, but we treat $\mathcal{M}$ as a working set. BMA is principled in M-closed and M-complete (it averages over the posterior within $\mathcal{M}$); for M-open problems, stacking (formalml) or predictive-weight optimization is often preferred because BMA weights concentrate on a single "best" model that may still be far from truth. Most applied settings are M-open; the distinction matters when interpreting BMA weights.

Remark 4 Marginal likelihood is not the likelihood

A common pitfall: calling $m(\mathbf{y})$ the "likelihood of the model." It is not. The likelihood of the model at $\theta$ is $L(\theta) = p(\mathbf{y} \mid \theta)$ — a function of $\theta$. The marginal likelihood $m(\mathbf{y})$ is a scalar — the likelihood $L$ averaged against the prior. Two models with identical MLE fits (same peak of $L$) can have very different $m$ if one has a concentrated prior that matches the data and the other has a diffuse prior that allocates probability mass to regions where $L$ is low.

Remark 5 Model-set sensitivity

$p(\mathcal{M}_k \mid \mathbf{y})$ depends on the entire set $\mathcal{M}$ through the denominator. Adding or removing candidate models — even ones with negligible weight — can shift the reported posterior probabilities. This is not a bug; it's a statement about what Bayesian model comparison actually asks. The question "which model explains the data best?" is inherently relative: best compared to what? A model with $p(\mathcal{M}_1 \mid \mathbf{y}) = 0.95$ in a pair and $p(\mathcal{M}_1 \mid \mathbf{y}) = 0.03$ in an eight-model set is not contradictory — it answers different questions.

27.3 Notation

Topic 27 inherits Topic 25 §25.3 and Topic 26 §26.3 notation verbatim. The new symbols introduced here:

| Object | Symbol | Interpretation |
| --- | --- | --- |
| Candidate model | $\mathcal{M}_k$ | Indexed by $k = 1, \ldots, M$ over a candidate set $\mathcal{M}$. |
| Marginal likelihood | $m_k(\mathbf{y}) = m(\mathbf{y} \mid \mathcal{M}_k)$ | Integral of $L_k \pi_k$ over $\Theta_k$. |
| Prior / posterior model probability | $p(\mathcal{M}_k)$ / $p(\mathcal{M}_k \mid \mathbf{y})$ | Prior weight / BMA weight $w_k^{\mathrm{BMA}}$. |
| Bayes factor | $\mathrm{BF}_{ij} = m_i / m_j$ | Data-driven update of $\mathcal{M}_i$ vs $\mathcal{M}_j$ prior odds. |
| Bridge function | $h(\theta)$ | User-chosen function in the Meng–Wong identity (§27.6 Thm 5). |
| Normalizing constant | $Z_i$ | Generic — the integral $\int q_i\,d\theta$ for an unnormalized $q_i$. |
| Inverse temperature | $\beta$ | Path-sampling parameter, $0 \le \beta \le 1$. |
| Power posterior | $p_\beta(\theta) \propto L(\theta)^\beta \pi(\theta)$ | Interpolates prior ($\beta=0$) and posterior ($\beta=1$). |
| Occam factor | $\mathrm{Occam}(\mathcal{M})$ | Ratio of posterior volume to prior volume; asymptotically $n^{-k/2}$. |
| Effective parameter count | $p_{\mathrm{DIC}}$ / $p_{\mathrm{WAIC}}$ | "Effective" parameters accounting for posterior concentration. |
| Expected log-predictive density | $\mathrm{elpd}$ | $\sum_i \mathbb{E}_{\tilde y_i}[\log p(\tilde y_i \mid \mathbf{y})]$, expectation over new data from the DGP; the target of LOO / WAIC. |
| PSIS-LOO estimate | $\widehat{\mathrm{elpd}}_{\mathrm{loo}}$ | Pareto-smoothed importance-sampling LOO. |
| Pareto diagnostic | $\hat k$ | Per-observation PSIS reliability flag; $\hat k > 0.7$ is problematic. |
| Local false discovery rate | $\mathrm{fdr}(z)$ | Two-groups posterior $p(H_0 \mid z)$ (Efron 2010). |
| Posterior-predictive test statistic | $T(\mathbf{y}, \theta)$ | Discrepancy evaluated at observed or replicated data. |
| Bayesian p-value | $p_B$ | $P(T(\mathbf{y}^{\mathrm{rep}}, \theta) \geq T(\mathbf{y}, \theta) \mid \mathbf{y})$. |
Remark 6 Independence conventions

Independence and conditional independence continue to be written with the Topic 16 §16.9 convention $\perp\!\!\!\perp$. The M-open / M-closed / M-complete trichotomy is used narratively (§27.2 Rem 3) without a dedicated symbol. When $\mathcal{M}_k$ and $\mathcal{M}_{k'}$ share parameters (nested models), we'll say so explicitly — the shared parameter is integrated against the joint prior, not multiplied.

27.4 Bayes factors and Jeffreys’ scale

The Bayes factor is the ratio of marginal likelihoods:

$$\mathrm{BF}_{10}(\mathbf{y}) \;=\; \frac{m_1(\mathbf{y})}{m_0(\mathbf{y})} \;=\; \frac{p(\mathbf{y} \mid \mathcal{M}_1)}{p(\mathbf{y} \mid \mathcal{M}_0)}.$$

This is the data's multiplicative update to the prior odds of $\mathcal{M}_1$ over $\mathcal{M}_0$:

Theorem 2 Bayes factor as posterior odds update

For any two models $\mathcal{M}_0$ and $\mathcal{M}_1$ with positive prior probabilities,

$$\underbrace{\frac{p(\mathcal{M}_1 \mid \mathbf{y})}{p(\mathcal{M}_0 \mid \mathbf{y})}}_{\text{posterior odds}} \;=\; \underbrace{\frac{p(\mathcal{M}_1)}{p(\mathcal{M}_0)}}_{\text{prior odds}} \;\times\; \underbrace{\mathrm{BF}_{10}(\mathbf{y})}_{\text{data's update}}.$$

The Bayes factor is thus prior-free in one sense — it's the part of the comparison due to the data, independent of prior model probabilities — and prior-dependent in another: each $m_k$ depends on the within-model prior $\pi_k$. Rem 7 below elaborates the second sense; Rem 12–14 follow the consequences.

The identity is immediate from $p(\mathcal{M}_k \mid \mathbf{y}) \propto p(\mathcal{M}_k)\, m_k(\mathbf{y})$ (§27.2 Thm 1): take the ratio and the $p(\mathbf{y})$ normalizer cancels.

Jeffreys (1961, Appendix B) tabulated a heuristic scale for interpreting Bayes factors:

| $\mathrm{BF}_{10}$ range | Strength of evidence for $\mathcal{M}_1$ |
| --- | --- |
| $< 1$ | Supports $\mathcal{M}_0$ |
| $1$ to $3.2$ | Barely worth mentioning |
| $3.2$ to $10$ | Substantial |
| $10$ to $100$ | Strong |
| $> 100$ | Decisive |
Figure 2. Jeffreys' scale for Bayes factor interpretation. Horizontal axis: BF_10 on log scale from 0.01 to 1000. Color bands mark 'supports H_0' (BF < 1), 'barely worth mentioning' (1 to 3.2), 'substantial' (3.2 to 10), 'strong' (10 to 100), 'decisive' (> 100). Numerical BF values from several hypothesis-testing scenarios are overlaid to show where typical applied analyses fall.
Example 2 One-sample Normal Bayes factor

Let $y_1, \ldots, y_n \mid \theta \sim \mathcal{N}(\theta, 1)$ with $H_0: \theta = 0$ and $H_1: \theta \sim \mathcal{N}(0, \tau^2)$. Then $\bar y \mid H_0 \sim \mathcal{N}(0, 1/n)$ and $\bar y \mid H_1 \sim \mathcal{N}(0, 1/n + \tau^2)$ (by the convolution of likelihood and prior). The Bayes factor is

$$\mathrm{BF}_{10} \;=\; \frac{\phi(\bar y;\, 0,\, 1/n + \tau^2)}{\phi(\bar y;\, 0,\, 1/n)} \;=\; \frac{1}{\sqrt{1 + n\tau^2}}\, \exp\!\left(\frac{\bar y^2}{2}\left[\frac{1}{1/n} - \frac{1}{1/n + \tau^2}\right]\right).$$

This is the Lindley setup — the featured case of §27.5. For $\bar y = 0.3$, $n = 100$, $\tau = 1$: $z = \bar y \sqrt n = 3$, $r = n\tau^2 = 100$, and $\mathrm{BF}_{10} \approx 8.57$ (moderate evidence for $\mathcal{M}_1$; lindleyBayesFactor(3, 100, 1) in bayes.ts). Same $z$ with $\tau = 100$: $\mathrm{BF}_{10} \approx 0.09$ (strong evidence for $\mathcal{M}_0$). The paradox lives between these two lines.
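A minimal sketch of that calculation in the $(z, n, \tau)$ parameterization of Thm 3 (the exact bayes.ts signature and the optional $\sigma$ argument are assumptions):

```typescript
// BF_10 for the point-null Normal test of Ex 2 / Thm 3.
function lindleyBayesFactor(z: number, n: number, tau: number, sigma = 1): number {
  const r = (n * tau * tau) / (sigma * sigma);            // r = n τ² / σ²
  const logBF = -0.5 * Math.log(1 + r) + (z * z / 2) * (r / (1 + r));
  return Math.exp(logBF);
}

console.log(lindleyBayesFactor(3, 100, 1));    // ≈ 8.57  (moderate evidence for M1)
console.log(lindleyBayesFactor(3, 100, 100));  // ≈ 0.09  (strong evidence for M0)
```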

Remark 7 Bayes factors depend on within-model priors

The Bayes factor is prior-free over model probabilities but prior-dependent over parameter distributions. The $\pi_k$ in $m_k(\mathbf{y}) = \int L_k(\theta)\, \pi_k(\theta)\,d\theta$ is a hard input: a diffuse $\pi_k$ under $\mathcal{M}_1$ diluting probability mass over a large parameter volume yields a smaller $m_1$ than a concentrated $\pi_k$ matching the data. This is not a nuisance — it's the Occam factor (§27.6 Thm 4) acting as a model-complexity penalty. But it does mean you must think about your within-model priors when you use Bayes factors; vague "non-informative" priors are actively informative here.

Remark 8 Jeffreys' scale is a guide, not a law

The scale above is Jeffreys' original proposal and remains common in applied work, but nothing in Bayesian probability requires you to use it — $\mathrm{BF}_{10} = 5$ has a specific probabilistic meaning (multiplies the prior odds by 5) regardless of whether a convention calls it "substantial." Treat the labels as rough communication tools; report the numerical $\mathrm{BF}$ and let the reader apply their own threshold. Kass & Raftery 1995 §3.2 proposes a slightly coarser alternative scale (thresholds at 3, 20, 150); the distinction rarely matters in practice.

Remark 9 BF as the Bayesian counterpart to the LRT

Under $H_0: \theta = \theta_0$ and $H_1: \theta \neq \theta_0$ with prior $\pi_1$ on the alternative,

$$\mathrm{BF}_{10} \;=\; \int \frac{L(\theta)}{L(\theta_0)}\, \pi_1(\theta)\, d\theta \;=\; \mathbb{E}_{\pi_1}\!\left[\Lambda(\mathbf{y}; \theta, \theta_0)\right],$$

where $\Lambda$ is the Topic 18 likelihood ratio. The BF is thus the prior-averaged LRT — Neyman–Pearson picks a specific alternative, Bayes averages over the prior. This is why §17.10 Rem 15 and §18 Rem 25 point here: Bayes factors are the evidence-level analog of the LRT decision rule.

27.5 The Lindley paradox

Section §27.4's Ex 2 already showed the paradox numerically: at $z = 3$, changing the prior scale $\tau$ from 1 to 100 flips the Bayes factor from $\approx 8.57$ (favoring $\mathcal{M}_1$) to $\approx 0.09$ (favoring $\mathcal{M}_0$) — even though the frequentist z-test flags $z = 3$ as a rejection at $p \approx 0.003$ regardless of $\tau$. Lindley (1957) proved this is general.

Theorem 3 Lindley paradox (Lindley 1957)

Let $Y_1, \ldots, Y_n \overset{\mathrm{iid}}{\sim} \mathcal{N}(\theta, \sigma^2)$ with $\sigma^2$ known. Consider $H_0: \theta = 0$ against $H_1: \theta \neq 0$ with prior $\theta \mid H_1 \sim \mathcal{N}(0, \tau^2)$. Let $z = \bar Y \sqrt{n}/\sigma$ be the observed z-statistic. Then

$$\mathrm{BF}_{10}(z) \;=\; \frac{1}{\sqrt{1 + r}}\, \exp\!\left(\frac{z^2}{2} \cdot \frac{r}{1+r}\right), \qquad r := \frac{n\tau^2}{\sigma^2}.$$

In particular, for any fixed $z$ and $n$, $\mathrm{BF}_{10}(z) \to 0$ as $\tau \to \infty$ — the frequentist rejects $H_0$ (a function of $z$ alone) while the Bayesian overwhelmingly favors $H_0$.

Proof 1 Proof of Thm 3 (Lindley paradox)

Step 1 — marginal under $H_0$. Under $H_0: \theta = 0$, $\bar Y \sim \mathcal{N}(0, \sigma^2/n)$, so

$$m_0(\bar y) \;=\; \phi\!\left(\bar y;\, 0,\, \sigma^2/n\right),$$

where $\phi(\cdot;\mu, v)$ denotes the $\mathcal{N}(\mu, v)$ density.

Step 2 — marginal under $H_1$. Under $H_1$, $\bar Y \mid \theta \sim \mathcal{N}(\theta, \sigma^2/n)$ and $\theta \sim \mathcal{N}(0, \tau^2)$. The marginal is the convolution

$$m_1(\bar y) \;=\; \int \phi(\bar y;\, \theta,\, \sigma^2/n)\, \phi(\theta;\, 0,\, \tau^2)\, d\theta \;=\; \phi\!\left(\bar y;\, 0,\, \sigma^2/n + \tau^2\right),$$

using the standard Gaussian convolution identity: the sum of a sampling-noise term with variance $\sigma^2/n$ and an independent prior draw with variance $\tau^2$ is Gaussian with variance $\sigma^2/n + \tau^2$.

Step 3 — ratio. The Bayes factor is

$$\mathrm{BF}_{10} \;=\; \frac{m_1(\bar y)}{m_0(\bar y)} \;=\; \sqrt{\frac{\sigma^2/n}{\sigma^2/n + \tau^2}} \,\cdot\, \exp\!\left(\frac{\bar y^2}{2} \left[\frac{1}{\sigma^2/n} - \frac{1}{\sigma^2/n + \tau^2}\right]\right).$$

Step 4 — substitute standardized quantities. Let $r = n\tau^2/\sigma^2$ and $z^2 = n\bar y^2/\sigma^2$. Then $\sigma^2/n + \tau^2 = (\sigma^2/n)(1+r)$, so

$$\sqrt{\frac{\sigma^2/n}{\sigma^2/n + \tau^2}} \;=\; \frac{1}{\sqrt{1+r}}, \qquad \bar y^2 \left[\frac{1}{\sigma^2/n} - \frac{1}{\sigma^2/n + \tau^2}\right] \;=\; z^2 \cdot \frac{r}{1+r}.$$

Substituting into Step 3 gives the stated formula.

Step 5 — limit. Fix $z, n, \sigma$ and let $\tau \to \infty$. Then $r \to \infty$, $r/(1+r) \to 1$, so the exponential factor converges to $e^{z^2/2}$ — a finite constant determined by the observed data. The prefactor $1/\sqrt{1+r} \to 0$. Hence the product $\mathrm{BF}_{10} \to 0$.

∎ — using LIN1957; KAS1995 §3.4.

The paradox’s resolution — developed in Rem 10–12 — is that a prior “vague enough to be non-committal” on the estimation question is a prior that makes substantial commitments on the model-comparison question. The prior is part of the model.

Example 3 Numerical Lindley at n = 100, z = 3

Three values of $\tau$, same $z = 3$ (p-value $\approx 0.0027$, strong frequentist rejection):

  • $\tau = 1$: $r = 100$. $\mathrm{BF}_{10} = (1/\sqrt{101}) \exp(4.5 \cdot 100/101) \approx 0.0995 \cdot 86.1 \approx 8.57$.
  • $\tau = 10$: $r = 10{,}000$. $\mathrm{BF}_{10} = (1/\sqrt{10001}) \exp(4.5 \cdot 10000/10001) \approx 0.01 \cdot 90.0 \approx 0.90$.
  • $\tau = 100$: $r = 10^6$. $\mathrm{BF}_{10} \approx 0.001 \cdot 90.0 \approx 0.0900$.

Same data, same likelihood, same model — the prior-scale parameter alone flips the conclusion from "moderate evidence for $\mathcal{M}_1$" to "strong evidence for $\mathcal{M}_0$." This is not a numerical accident; it is the diffuse-prior penalty in action. The LindleyParadoxExplorer component below traces the full $\mathrm{BF}_{10}$ curve across $\tau$ at adjustable $(z, n)$.

Figure 3. Lindley paradox curve (featured): BF_10 plotted against the prior scale τ on log axes, for a fixed z-statistic z = 3 and sample size n = 100. The curve rises to a maximum around τ ≈ 1 (favoring H_1 with BF ≈ 8.57), then falls monotonically as τ increases, crossing BF = 1 near τ ≈ 10 and approaching 0 as τ → ∞. Horizontal dashed lines at Jeffreys' scale thresholds (BF = 3.2, 10, 100) show the regimes. Annotations mark the three numerical examples from Ex 3.
[Interactive: LindleyParadoxExplorer — at the default preset ($z = 3$, $n = 100$, $\tau = 100$) the two-sided frequentist z-test gives $p \approx 0.0027$, while $\mathrm{BF}_{10} \approx 0.09$ favors $H_0$ by roughly 11 : 1 under a uniform model prior — the canonical paradox.]
Remark 10 The paradox depends on point-null H₀

Lindley's paradox is specific to the setting where $H_0$ assigns a point mass to a single parameter value (here $\theta = 0$). If instead $H_0: |\theta| < \epsilon$ (a small interval) — an "interval null" — the paradox dissolves: both $m_0$ and $m_1$ scale similarly as $\tau$ grows, so their ratio stabilizes. Applied Bayesian practice often recommends interval nulls for this reason (cf. Berger & Sellke 1987). Topic 27 keeps the point-null framing because that's what sharpens the contrast with frequentist hypothesis testing, but the pragmatic advice is: if your null is approximate, make the interval null explicit.

Remark 11 Bartlett's paradox — the n → ∞ version

A related phenomenon: fix $\bar y$ (not $z$) and let $n \to \infty$. Then $z = \bar y \sqrt n \to \infty$ too, and one might expect $\mathrm{BF}_{10} \to \infty$ always. Not so — for large $\tau$, the $1/\sqrt{1+r}$ prefactor can dominate the exponential. Bartlett (1957, "A comment on D. V. Lindley's statistical paradox") notes this: the paradox arises because the prior scale $\tau$ and sample scale $\sqrt n$ interact non-trivially. Keeping $\tau$ fixed as $n$ grows is itself a modeling choice — often the wrong one.

Remark 12 Resolution: pick a prior that means something

The paradox doesn't reveal a flaw in Bayesian inference; it reveals a demand for care in prior specification. A diffuse $\tau \gg 1$ under $H_1$ says: "before seeing data, I believed $\theta$ could plausibly be anywhere in $[-3\tau, 3\tau]$ — a range of hundreds of standard deviations." Under that prior, observing $z = 3$ is modest evidence for $\mathcal{M}_1$ relative to the prior commitment, and $\mathcal{M}_0$'s concentrated mass at 0 wins the comparison. The resolution: calibrate $\tau$ to what you actually believe about the effect size. For a standardized coefficient in a well-designed study, $\tau \sim 1$ is realistic. For a "diffuse" effect in an exploratory analysis, $\tau \sim 10$ might be appropriate — but don't go to $\tau = 100$ unless your substantive knowledge genuinely says so. This is the advice of applied Bayesian workflow: prior elicitation matters precisely because Bayes factors take priors seriously.

27.6 Marginal-likelihood computation

With Bayes factors motivated, we turn to the hard problem: computing $m(\mathbf{y})$ from MCMC draws. Four estimators, in rough order of effort:

  1. Laplace approximation — closed-form Gaussian integral around the posterior mode, $O(1)$ cost after finding the mode.
  2. Bridge sampling — iterative fixed-point estimator using posterior draws and a proposal density.
  3. Path sampling (thermodynamic integration) — integrate the expected log-likelihood over a power-posterior temperature schedule.
  4. Nested sampling — transform the volumetric integral into a 1-D quadrature over prior-mass levels.

All four consume Topic 26’s MCMC draws; each has a failure mode illuminated in the examples below.

Theorem 4 Occam factor ⇔ BIC equivalence (asymptotic)

Let $\mathcal{M}$ be a regular parametric model with $k$ parameters and prior $\pi$ continuous and positive at the posterior mode $\hat\theta$. Under the regularity conditions of Topic 14 §14.7 (existence of $\hat\theta_{\mathrm{MLE}}$; positive-definite observed information $\mathbf{J}_n = n \mathbf{I}(\hat\theta) + O_p(\sqrt n)$),

$$-2 \log m(\mathbf{y}) \;=\; -2 \log L(\hat\theta_{\mathrm{MLE}}) \,+\, k \log n \,+\, O_p(1).$$

The right-hand side is BIC; the $O_p(1)$ remainder absorbs the Occam factor (prior volume, Hessian determinant, Gaussian-integral constants) that doesn't scale with $n$.

Proof 2 Proof of Thm 4 (Occam ⇔ BIC)

Step 1 — start from the Laplace expansion. Topic 25 §25.8 Rem 16 (citing Topic 24 §24.4 Proof 2) gives

$$\log m(\mathbf{y}) \;=\; \log L(\hat\theta) \,+\, \log \pi(\hat\theta) \,+\, \frac{k}{2}\log(2\pi) \,-\, \frac{1}{2}\log |\mathbf{H}(\hat\theta)| \,+\, O(1/n),$$

where $\mathbf{H} = -\nabla^2[\log L + \log \pi]$ is the Hessian of the unnormalized log-posterior at the mode.

Step 2 — dominant Hessian term. Under regularity, the prior's second-derivative contribution $\nabla^2 \log \pi$ is $O(1)$ (the prior does not scale with $n$), while the log-likelihood Hessian satisfies $-\nabla^2 \log L = n \mathbf{I}(\hat\theta) + O_p(\sqrt{n})$. Hence

$$\mathbf{H}(\hat\theta) \;=\; n \mathbf{I}(\hat\theta) \,+\, O_p(\sqrt n),$$

so

$$\log |\mathbf{H}(\hat\theta)| \;=\; k \log n \,+\, \log |\mathbf{I}(\hat\theta)| \,+\, O_p(1/\sqrt n).$$

Step 3 — MAP vs MLE. $\hat\theta \to \hat\theta_{\mathrm{MLE}}$ in probability (the prior's score and Hessian are $O(1)$ while the likelihood's are $O(n)$, so the MAP's perturbation from the MLE is $O_p(1/n)$ — the MAP side of the Bernstein–von Mises argument, Topic 25 §25.8), so $\log L(\hat\theta) = \log L(\hat\theta_{\mathrm{MLE}}) + O_p(1/n)$.

Step 4 — substitute.

$$\log m(\mathbf{y}) \;=\; \log L(\hat\theta_{\mathrm{MLE}}) \,-\, \frac{k}{2}\log n \,+\, \underbrace{\log \pi(\hat\theta) + \frac{k}{2}\log(2\pi) - \frac{1}{2}\log |\mathbf{I}(\hat\theta)|}_{O_p(1)} \,+\, o_p(1).$$

Step 5 — multiply by $-2$.

$$-2 \log m(\mathbf{y}) \;=\; -2 \log L(\hat\theta_{\mathrm{MLE}}) \,+\, k \log n \,+\, O_p(1) \;=\; \mathrm{BIC}(\mathbf{y}, \mathcal{M}) \,+\, O_p(1).$$

∎ — using Topic 24 §24.4 Proof 2; Topic 25 §25.8 Rem 16; Topic 14 §14.11 Rem 9; KAS1995 §4.1.

Pedagogical reading. The $-\frac{1}{2}\log|\mathbf{H}|$ term is the log posterior volume around the mode — a Gaussian with covariance $\mathbf{H}^{-1}$ occupies volume $\propto |\mathbf{H}|^{-1/2}$. The prior volume is the integrated prior mass over the parameter space; for a bounded-support or weakly-informative prior this is $O(1)$. The Occam factor

$$\mathrm{Occam}(\mathcal{M}) \;:=\; \frac{\text{posterior volume around } \hat\theta}{\text{prior volume}} \;\propto\; |\mathbf{H}|^{-1/2} \;\propto\; n^{-k/2}$$

is the multiplicative penalty a model pays for having $k$ parameters — the factor by which the parameter space must shrink to concentrate on the data. Taking $-2\log$ turns this into the additive $k \log n$ penalty of BIC. BIC is not a heuristic; it is the asymptotic Occam factor.

Example 4 Laplace approximation on Beta-Binomial

For $(n, y, \alpha, \beta) = (20, 12, 1, 1)$, the unnormalized log-posterior is $\log L + \log \pi = 12 \log\theta + 8 \log(1-\theta)$ (ignoring the binomial coefficient). The mode is $\hat\theta = 12/20 = 0.6$, and the second derivative at the mode is $-12/\hat\theta^2 - 8/(1-\hat\theta)^2 = -12/0.36 - 8/0.16 = -33.33 - 50 = -83.33$. So $H = 83.33$ and the Laplace estimate (after adding the binomial normalization) is

$$\log m \approx \log \binom{20}{12} + 12 \log 0.6 + 8 \log 0.4 + \tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}\log 83.33 \approx -3.01,$$

close to the closed-form $-3.0445$ (Ex 1). The Laplace approximation is accurate to $O(1/n)$ under regularity — impressive given its closed-form simplicity — but deteriorates for small $n$ or multimodal posteriors. marginalLikelihoodLaplace in bayes.ts handles the general multivariate case via Cholesky log-determinant.
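A one-dimensional sketch of that computation using a finite-difference Hessian (the bayes.ts marginalLikelihoodLaplace handles the multivariate case; the function and argument names here are assumptions):

```typescript
// 1-D Laplace approximation: log m ≈ log L(θ̂) + log π(θ̂) + ½ log(2π) − ½ log H,
// with H = −d²/dθ² [log L + log π] evaluated at the posterior mode θ̂.
function laplaceLogMarginal1D(
  logPost: (t: number) => number,  // unnormalized log posterior log L(θ) + log π(θ)
  mode: number,
  h = 1e-4,                        // finite-difference step for the second derivative
): number {
  const second =
    (logPost(mode + h) - 2 * logPost(mode) + logPost(mode - h)) / (h * h);
  const H = -second;               // observed information at the mode
  return logPost(mode) + 0.5 * Math.log(2 * Math.PI) - 0.5 * Math.log(H);
}

// Beta-Binomial (n, y, α, β) = (20, 12, 1, 1):
// log L + log π = log C(20,12) + 12 log θ + 8 log(1−θ), mode at θ̂ = 0.6.
const logC = Math.log(125970);     // C(20,12) = 125970
const lp = (t: number) => logC + 12 * Math.log(t) + 8 * Math.log(1 - t);
console.log(laplaceLogMarginal1D(lp, 0.6));  // close to the closed-form -3.0445 (Ex 1)
```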

Theorem 5 Bridge sampling identity (Meng–Wong 1996)

Let $q_1, q_2$ be unnormalized densities on $\Theta$ with normalizing constants $Z_i = \int q_i(\theta)\, d\theta$ and normalized densities $p_i = q_i/Z_i$. For any bridge function $h: \Theta \to \mathbb{R}$ such that both expectations below exist and are nonzero,

$$\frac{Z_1}{Z_2} \;=\; \frac{\mathbb{E}_{p_2}[\,q_1(\theta)\, h(\theta)\,]}{\mathbb{E}_{p_1}[\,q_2(\theta)\, h(\theta)\,]}.$$

Proof 3 Proof of Thm 5 (bridge identity)

Step 1 — expand numerator. Write the expectation as an integral against the base density $p_2 = q_2/Z_2$:

$$\mathbb{E}_{p_2}[q_1 h] \;=\; \int q_1(\theta)\, h(\theta)\, p_2(\theta)\, d\theta \;=\; \frac{1}{Z_2} \int q_1(\theta)\, q_2(\theta)\, h(\theta)\, d\theta.$$

Step 2 — expand denominator. Similarly

$$\mathbb{E}_{p_1}[q_2 h] \;=\; \int q_2(\theta)\, h(\theta)\, p_1(\theta)\, d\theta \;=\; \frac{1}{Z_1} \int q_1(\theta)\, q_2(\theta)\, h(\theta)\, d\theta.$$

Step 3 — ratio. The common integral $\int q_1 q_2 h\, d\theta$ cancels:

$$\frac{\mathbb{E}_{p_2}[q_1 h]}{\mathbb{E}_{p_1}[q_2 h]} \;=\; \frac{1/Z_2}{1/Z_1} \;=\; \frac{Z_1}{Z_2}.$$

∎ — using MEN1996 Eq. 2.1; GEL1998 §3.

Application to marginal-likelihood estimation. Take $q_1(\theta) = L(\theta)\pi(\theta)$ (unnormalized posterior with $Z_1 = m(\mathbf{y})$) and $q_2(\theta) = g(\theta)$ (a proposal density with known $Z_2 = 1$). With $n_1$ posterior draws $\{\theta_s^{(1)}\}$ from $p(\theta \mid \mathbf{y})$ (via Topic 26 MCMC) and $n_2$ proposal draws $\{\theta_s^{(2)}\}$ from $g$, the estimator is

$$\hat m(\mathbf{y}) \;=\; \frac{n_2^{-1} \sum_{s=1}^{n_2} L(\theta_s^{(2)})\, \pi(\theta_s^{(2)})\, h(\theta_s^{(2)})}{n_1^{-1} \sum_{s=1}^{n_1} g(\theta_s^{(1)})\, h(\theta_s^{(1)})}.$$

Corollary 5.1 (Meng–Wong optimal bridge, stated). The bridge function minimizing the relative mean-squared error of $\hat m(\mathbf{y})$ is

$$h^\star(\theta) \;\propto\; \frac{1}{s_1\, L(\theta)\pi(\theta) + s_2\, m(\mathbf{y})\, g(\theta)}, \qquad s_i \;=\; \frac{n_i}{n_1 + n_2}.$$

Since $h^\star$ depends on the unknown $m(\mathbf{y})$, it is computed by fixed-point iteration: initialize $\hat m^{(0)}$ (e.g., importance-sampling estimate), update $h$, recompute $\hat m^{(t+1)}$ from the theorem, iterate until $|\hat m^{(t+1)} - \hat m^{(t)}| < \varepsilon$. The function bridgeSamplingEstimate in bayes.ts implements the equal-sample-size ($s_1 = s_2 = 1/2$) case with logsumexp throughout for numerical stability. Full derivation of the optimality: MEN1996 §3, Theorem 1.
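A minimal sketch of that fixed-point iteration for the equal-weight case, with assumed argument names. For readability it works in ratio space after forming log ratios; the bayes.ts implementation is described as using logsumexp throughout, which is the safer choice in higher dimensions:

```typescript
// Iterative bridge-sampling estimate of log m(y), equal weights s1 = s2 = 1/2.
// Inputs: log q1 = log L + log π and log g evaluated at the two sets of draws.
function bridgeSamplingLogMarginal(
  logQ1AtPosterior: number[],  // log L(θ)+log π(θ) at draws θ ~ p(θ|y)
  logGAtPosterior: number[],   // log g(θ) at the same posterior draws
  logQ1AtProposal: number[],   // log L(θ)+log π(θ) at draws θ ~ g
  logGAtProposal: number[],    // log g(θ) at the same proposal draws
  iters = 50,
): number {
  const wStar = logQ1AtPosterior.map((lq, i) => Math.exp(lq - logGAtPosterior[i]));
  const wTilde = logQ1AtProposal.map((lq, i) => Math.exp(lq - logGAtProposal[i]));
  const s1 = 0.5, s2 = 0.5;
  // crude importance-sampling start: m̂ ≈ mean over proposal draws of q1/g
  let r = wTilde.reduce((a, b) => a + b, 0) / wTilde.length;
  for (let t = 0; t < iters; t++) {
    // numerator: (1/n2) Σ over proposal draws of q1 h* = w / (s1 w + s2 m̂)
    const num = wTilde.map((w) => w / (s1 * w + s2 * r)).reduce((a, b) => a + b, 0)
      / wTilde.length;
    // denominator: (1/n1) Σ over posterior draws of g h* = 1 / (s1 w + s2 m̂)
    const den = wStar.map((w) => 1 / (s1 * w + s2 * r)).reduce((a, b) => a + b, 0)
      / wStar.length;
    const next = num / den;
    if (Math.abs(next - r) < 1e-12) { r = next; break; }
    r = next;
  }
  return Math.log(r); // log m̂(y)
}
```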

Example 5 Bridge sampling on Beta-Binomial (verification)

Run bridge sampling on $(n, y, \alpha, \beta) = (20, 12, 1, 1)$ with 8000 posterior draws from Beta(13, 9) and 8000 proposal draws from a deliberately-offset Beta(12.7, 9.1). The iterative fixed point converges in 3 iterations to $\log \hat m \approx -3.0446$, matching the closed-form $-3.04452$ (Ex 1) to $\sim 10^{-4}$. Over 30 independent MC seeds the standard deviation is $\approx 7 \times 10^{-4}$ — two orders of magnitude tighter than naive importance sampling on the same sample budget. The BridgeSamplingConvergence component below traces the iteration on three preset problems.

[Interactive: BridgeSamplingConvergence — T27.6 calibration preset: closed-form $\log m = \log(1/21) \approx -3.044522$; the bridge estimate converges to $\approx -3.0435$ within 5 iterations, vs a naive-IS estimate of $\approx -3.0423$ on the same budget.]
Figure 4. Laplace vs exact log m for Beta-Binomial as n grows: at fixed k/n ratio, the Laplace approximation converges to the closed-form value as n increases. Error decays as O(1/n) (visible on log-log inset), matching the theoretical rate of the Stirling-style asymptotic expansion.
Figure 5. Bridge sampling convergence: iteration number (x-axis) vs log m̂ estimate (y-axis). Three curves for three Beta-Binomial preset problems. All three converge within ~5 iterations to the closed-form reference (dashed line). A fourth curve shows the naive IS estimator for comparison — higher variance, no convergence.
Theorem 6 Path sampling / thermodynamic integration (Gelman–Meng 1998, stated)

Let $p_\beta(\theta) \propto L(\theta)^\beta \pi(\theta)$ be the power posterior at inverse temperature $\beta \in [0, 1]$. Then

$$\log m(\mathbf{y}) \;=\; \int_0^1 \mathbb{E}_{p_\beta}\!\left[\log L(\theta)\right]\, d\beta.$$

The estimator replaces the integral with trapezoidal quadrature over a discrete $\beta$-grid $\{0 = \beta_0 < \beta_1 < \cdots < \beta_K = 1\}$, and each expectation with an MCMC average over draws from $p_{\beta_j}$.

Proof sketch. Differentiate $\log m(\beta) := \log \int L(\theta)^\beta \pi(\theta)\, d\theta$ with respect to $\beta$ under the integral; $d/d\beta\,[\log m(\beta)] = \mathbb{E}_{p_\beta}[\log L]$. Integrate from $0$ to $1$; $\log m(0) = \log \int \pi = 0$ (prior is normalized), $\log m(1) = \log m(\mathbf{y})$. Full derivation: GEL1998 §3. The function pathSamplingEstimate in bayes.ts implements the trapezoidal quadrature.
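A sketch of the quadrature step only, assuming the per-temperature expectations have already been estimated by MCMC at each $\beta$ (the function name and inputs are assumptions, not the bayes.ts API):

```typescript
// Path-sampling (thermodynamic integration) estimate of log m(y):
// trapezoidal quadrature of Ê_{p_β}[log L] over a β-grid.
function pathSamplingLogMarginal(
  betaGrid: number[],    // 0 = β_0 < β_1 < … < β_K = 1
  meanLogLik: number[],  // Ê_{p_β}[log L(θ)] at each grid point, same length
): number {
  let logM = 0;
  for (let j = 1; j < betaGrid.length; j++) {
    const width = betaGrid[j] - betaGrid[j - 1];
    logM += 0.5 * width * (meanLogLik[j] + meanLogLik[j - 1]); // trapezoid rule
  }
  return logM;
}
// Ex 6 uses an 11-point grid {0, 0.1, …, 1} with 2000 MCMC draws per temperature.
```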

Example 6 Path sampling on Beta-Binomial

Same inputs as Ex 5. Discretize $\beta$ on $\{0, 0.1, 0.2, \ldots, 1\}$ (11 points), sample 2000 draws per $\beta$ via MCMC targeting $L(\theta)^\beta \pi(\theta)$, compute $\widehat{\mathbb{E}_{p_\beta}[\log L]}$ as the sample mean at each temperature, and trapezoidally integrate. Result: $\log \hat m \approx -3.05$, within $\sim 5 \times 10^{-3}$ of the closed form — slightly looser than bridge at the same total sample budget because path sampling spends compute on off-target temperatures. Where path sampling wins: the integrand $\mathbb{E}_{p_\beta}[\log L]$ is smooth in $\beta$, so even a coarse grid gives a decent answer; where it loses: choosing the $\beta$-schedule is a tuning decision with no universal rule (equidistant is fine for quick-and-dirty, geometric $\beta_j \propto 2^{j/K}$ is better for skewed posteriors).

Figure 6. Path sampling integrand: E_{p_β}[log L(θ)] plotted against β ∈ [0, 1] for the Beta-Binomial Ex 6 problem. Smooth monotone curve; area under the curve is log m(y). Trapezoidal-quadrature approximation overlaid at 11 β-points; quadrature error < 0.01 on this problem.
Theorem 7 Nested sampling (Skilling 2006, stated)

Define the prior-mass function $X(\lambda) := \int_{\{\theta : L(\theta) > \lambda\}} \pi(\theta)\,d\theta$ — the prior volume at likelihood level $\lambda$ or higher. $X$ is monotone decreasing in $\lambda$ with $X(0) = 1$ and $X(\infty) = 0$. Then

$$m(\mathbf{y}) \;=\; \int_0^\infty X(\lambda)\, d\lambda \;=\; \int_0^1 \lambda(X)\, dX,$$

where $\lambda(X)$, the inverse function of $X(\lambda)$, is the likelihood at prior-mass level $X$.

Skilling's estimator simulates iid live-point draws from the prior, removes the lowest-likelihood point, replaces it with a constrained-prior draw at higher likelihood, and records the removed point's likelihood. After $K$ iterations with $N$ live points, the dead-point likelihoods $L_i$ are associated with expected prior-mass shrinkage $X_i = \exp(-i/N)$, and

$$\hat m(\mathbf{y}) \;\approx\; \sum_{i=1}^K L_i\, (X_{i-1} - X_i) \,+\, \text{live-point remainder}.$$

The function nestedSamplingEstimate in bayes.ts implements the log-space version of this sum. Nested sampling’s selling point is that it handles multimodal posteriors gracefully (the constrained-prior sampling naturally explores all modes above the current likelihood threshold) and produces the marginal likelihood as its primary output rather than a byproduct.
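A sketch of the evidence accumulation only, under the assumption that the constrained-prior sampling loop has already produced the dead-point log-likelihoods; unlike the bayes.ts log-space version, this sketch works in direct space and omits the live-point remainder:

```typescript
// Nested-sampling evidence from dead-point log-likelihoods:
// X_i = exp(-i/N) is the expected prior-mass shrinkage after i removals,
// and each dead point contributes L_i · (X_{i-1} − X_i).
function nestedSamplingLogEvidence(logLDead: number[], nLive: number): number {
  let m = 0;
  let xPrev = 1;                               // X_0 = 1: the full prior mass
  for (let i = 0; i < logLDead.length; i++) {
    const x = Math.exp(-(i + 1) / nLive);      // expected shrinkage X_{i+1}
    m += Math.exp(logLDead[i]) * (xPrev - x);  // L_i · (X_{i-1} − X_i)
    xPrev = x;
  }
  return Math.log(m);                          // live-point remainder omitted here
}
```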

Figure 7. Nested sampling level sets: contour plot of a two-mode likelihood L(θ_1, θ_2) on a bounded prior. Colored bands show the live-point cloud at successive iterations — as iterations proceed, the constrained-prior sampling concentrates live points into higher-likelihood level sets. Right panel shows the 1-D integral log λ vs log X (prior mass) that the estimator evaluates.
Remark 13 Horseshoe and spike-and-slab priors

The four estimators above all assume a continuous prior $\pi$. For sparsity-inducing priors — horseshoe (Carvalho–Polson–Scott 2010), spike-and-slab (Mitchell–Beauchamp 1988), Bayesian lasso — the prior has a sharp peak or mixture structure that Laplace handles poorly and for which bridge sampling requires careful proposal tuning. Path sampling and nested sampling cope better but need fine $\beta$-grids or dense live-point arrangements near the origin. §27.10 Rem 26 points to formalml for the full development.

Remark 14 Harmonic-mean estimator: known pathological

Newton & Raftery's (1994) harmonic-mean estimator $\hat m = 1 / \mathbb{E}_{p(\theta \mid \mathbf{y})}[1/L(\theta)]$ converts posterior draws into a marginal-likelihood estimate with no proposal — just $1/L(\theta_s)$ averaged over posterior draws $\theta_s$. Elegant and seductive, but almost always has infinite variance: the integrand $1/L(\theta)$ has a heavy right tail under the posterior (since the posterior is proportional to $L\pi$, not $1/L$). Radford Neal famously called it "the worst Monte Carlo method ever proposed." harmonicMeanEstimate in bayes.ts is kept for pedagogical contrast — use it to demonstrate the pathology, never in production.
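For the pedagogical contrast, a log-space sketch of the estimator (names and layout are assumptions; run it repeatedly on fresh posterior draws to watch the estimate wander):

```typescript
// Harmonic-mean estimator: log m̂ = −log( mean_s exp(−log L(θ_s)) ),
// kept only to demonstrate the heavy-tailed 1/L pathology.
function harmonicMeanLogMarginal(logLikAtPosteriorDraws: number[]): number {
  const n = logLikAtPosteriorDraws.length;
  const maxNeg = Math.max(...logLikAtPosteriorDraws.map((l) => -l));
  const sum = logLikAtPosteriorDraws
    .map((l) => Math.exp(-l - maxNeg))     // shifted for numerical stability
    .reduce((a, b) => a + b, 0);
  return -(maxNeg + Math.log(sum / n));    // −[ logsumexp(−logL) − log n ]
}
```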

Remark 15 Bridge vs path vs nested — rules of thumb
  • Bridge sampling is usually the default — stable, fast, handles non-Gaussian posteriors well, works for any dimension where MCMC works.
  • Path sampling is the right choice when you already have power-posterior draws (e.g., from parallel tempering) — the marginal-likelihood estimate is a cheap byproduct.
  • Nested sampling wins for strongly multimodal posteriors or when the marginal likelihood is the primary target (astrophysics model comparison is where nested sampling originated).
  • Laplace is the starting point: a 30-second sanity check you should run before investing in any MCMC-based estimator. If Laplace and bridge disagree by orders of magnitude, one of them is wrong and you need to figure out which.
Remark 16 The Stone–Dawid–Stone–Zidek paradox

An even sharper variant of Lindley’s paradox, due to Stone (1982), Dawid, Stone & Zidek (1973): with improper (non-integrable) priors, the marginal likelihood is undefined, and any Bayes factor between models with improper priors is arbitrary up to a constant. The consequence: never use improper priors for model comparison. Proper priors with sensible scale (Rem 12) are the robust default. Topic 27 doesn’t belabor this case because the brief prescribed scope ends at proper priors; formalml’s “Objective Bayes” topic develops the full story.

Remark 17 The forward map: what uses these estimators

Four downstream constructions consume $\hat m(\mathbf{y})$: (i) posterior model probabilities via §27.2 Thm 1, (ii) Bayes factors via §27.4 Thm 2, (iii) BMA weights via §27.7, (iv) evidence-level hypothesis testing as in §27.5's Lindley paradox. All four treat $\hat m$ as a deterministic input, but MC variance in $\hat m$ propagates to each — bridge's $\sim 10^{-3}$ log-scale SE on Beta-Binomial becomes $\sim 10^{-3}$ SE in the log-BF, which at $\mathrm{BF} = 8.57$ translates to $\pm 0.01$ in BF space. Good enough for applied work; worth propagating via bootstrap for high-stakes comparisons.

27.7 Bayesian model averaging

When multiple models have nonzero posterior probability, the principled Bayesian answer is not “pick the best” but average:

$$p(\tilde y \mid \mathbf{y}) \;=\; \sum_{k=1}^M p(\tilde y \mid \mathcal{M}_k, \mathbf{y})\, p(\mathcal{M}_k \mid \mathbf{y}).$$

The predictive distribution for new data $\tilde y$ is a mixture of per-model predictives weighted by posterior model probabilities. This is Bayesian model averaging (BMA), and it is the honest way to produce predictions under model uncertainty.
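A minimal sketch of that mixture; bmaPredictive in bayes.ts is described as taking pre-computed weights, but the signature and example densities below are assumptions:

```typescript
// BMA predictive density: a weighted mixture of per-model predictive densities.
function bmaPredictiveDensity(
  weights: number[],                                    // posterior model probabilities, sum to 1
  perModelPredictive: Array<(yNew: number) => number>,  // p(ỹ | M_k, y) as density functions
): (yNew: number) => number {
  return (yNew) =>
    weights.reduce((acc, w, k) => acc + w * perModelPredictive[k](yNew), 0);
}

// Usage sketch with two Gaussian per-model predictives.
const phi = (x: number, mu: number, sd: number) =>
  Math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * Math.sqrt(2 * Math.PI));
const mix = bmaPredictiveDensity([0.7, 0.3], [(y) => phi(y, 0, 1), (y) => phi(y, 0, 2)]);
console.log(mix(1.5)); // 0.7·N(0,1) density + 0.3·N(0,4) density, evaluated at 1.5
```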

Theorem 8 BMA predictive under log-loss (stated)

Under the log-score loss $\mathcal{L}(\tilde y, q) = -\log q(\tilde y)$, the optimal predictive distribution — the one minimizing expected loss under the joint posterior over $(\mathcal{M}_k, \theta_k)$ — is the BMA mixture

$$q^*(\tilde y) \;=\; \sum_{k=1}^M w_k^{\mathrm{BMA}}\, p(\tilde y \mid \mathcal{M}_k, \mathbf{y}), \qquad w_k^{\mathrm{BMA}} = p(\mathcal{M}_k \mid \mathbf{y}).$$

Any non-BMA predictive — hard model selection, stacking, equal-weight averaging — incurs strictly greater expected log-loss against the posterior-truth. Proof: standard Bayesian-decision-theory argument; see HOE1999 §2.

Thm 8's caveat is M-closed: the theorem assumes the true data-generating process is in $\mathcal{M}$. Under M-open (§27.2 Rem 3), BMA's weights concentrate on whichever model is closest to truth — which may still be far from truth, leaving BMA over-confident in a single approximation. Stacking (formalml) directly optimizes predictive weights and often outperforms BMA under M-open.

Example 7 BMA over three nested polynomial models

Synthetic data: $y_i = \sin(2\pi x_i / 10) + 0.3 x_i + \varepsilon_i$, $\varepsilon_i \sim \mathcal{N}(0, 0.25^2)$, $x_i \sim U(0, 10)$, $n = 20$. Fit three polynomial models — degree 1 (linear), degree 2 (quadratic), degree 3 (cubic) — and compare via BIC-approximated marginal likelihoods. For this specific sample, BIC strongly prefers degree 3 (posterior model probability $\approx 1.00$, weights $(0, 0, 1)$), so BMA collapses to hard selection. The BMAPredictiveComparison component lets you flip to uniform weights $(1/3, 1/3, 1/3)$ or single-model weights and watch the predictive fan change — uniform BMA produces a noticeably wider predictive band in the mid-range where the three models disagree most. The lesson: BMA's predictive uncertainty reflects model-set uncertainty; hard selection under-reports it.

[Interactive: BMAPredictiveComparison — posterior model probabilities $\propto \exp(-\mathrm{BIC}/2)$; on this sample BIC strongly prefers degree 3, so the weights collapse to $(0, 0, 1)$.]
Figure 8. BMA predictive for the three-polynomial example. Left panel: model weights as a bar chart (BIC-weighted vs uniform). Middle panel: predictive fan from hard-selected winner (degree 3); 95% predictive band. Right panel: BMA predictive fan using uniform weights (1/3, 1/3, 1/3); noticeably wider, reflecting cross-model uncertainty.
Remark 18 BMA as a decision-theoretic optimum, not a truth oracle

Thm 8 says BMA is the log-loss-optimal predictive — it does not say BMA's posterior model probabilities are calibrated measures of "which model is true." A prior over models times a marginal likelihood gives a posterior over models; calling one of those probabilities "the probability that $\mathcal{M}_1$ is true" requires M-closed. In applied work, treat BMA weights as predictive weights (sum to 1, used for averaging) rather than truth probabilities (calibrated measures of model correctness). The distinction matters when you report "the probability that the effect is real is 0.95" — that claim is BMA's weight, not a frequentist coverage guarantee.

Remark 19 Occam's window — the practical prune

Applied BMA over $M \gg 10$ candidate models is expensive and often fruitless — most models have negligible posterior probability. Madigan & Raftery (1994) propose Occam's window: restrict the candidate set to models whose posterior probability is within a factor $O$ (typically 20 or 100) of the best model. This loses strict decision-theoretic optimality but is tractable. Most applied BMA software implements Occam's window by default; bmaPredictive in bayes.ts takes pre-computed weights and assumes the pruning (if any) has already happened.

Remark 20 BMA vs stacking

Stacking (Breiman 1996, and Yao, Vehtari, Simpson & Gelman 2018 for the Bayesian version) directly optimizes predictive weights on held-out data rather than deriving them from marginal likelihoods. Under M-closed, stacking and BMA converge asymptotically; under M-open, stacking typically wins (it’s optimizing the thing you care about — predictive performance — rather than assuming the true model is enumerated). The BMAPredictiveComparison component sticks to BMA because Topic 27 is about model comparison under Bayesian principles; formalml’s stacking topic develops the M-open alternative.

27.8 Predictive scoring: DIC, WAIC, PSIS-LOO

Marginal likelihood comparison (Bayes factors, BMA) is one regime; predictive-score comparison is another. Predictive scores target a different question: how well will the model predict new data from the same DGP? This is the Topic 24 question in Bayesian dress. Three estimators dominate applied work, in increasing order of sophistication:

  1. DIC (Deviance Information Criterion, Spiegelhalter 2002) — $\mathrm{DIC} = -2 \log L(\hat\theta) + 2 p_{\mathrm{DIC}}$, where $\hat\theta$ is the posterior mean and $p_{\mathrm{DIC}}$ is the effective parameter count. Fast, requires only posterior-mean log-likelihood and an expectation over posterior draws.
  2. WAIC (Widely Applicable Information Criterion, Watanabe 2010) — pointwise $\mathrm{lppd}_i = \log \mathbb{E}_{p(\theta \mid \mathbf{y})}[L(y_i \mid \theta)]$, $p_{\mathrm{WAIC},i} = \mathrm{Var}_{p(\theta \mid \mathbf{y})}[\log L(y_i \mid \theta)]$, $\mathrm{WAIC} = -2 \sum_i (\mathrm{lppd}_i - p_{\mathrm{WAIC},i})$. Unlike DIC, uses the full posterior distribution rather than just the mean.
  3. PSIS-LOO (Pareto-Smoothed Importance Sampling LOO, Vehtari 2017) — leave-one-out cross-validation via importance-weighting posterior draws by $1/L(y_i \mid \theta)$, with Pareto smoothing of extreme weights. The gold standard for Bayesian predictive-score comparison when feasible.

All three target $\mathrm{elpd} = \sum_i \mathbb{E}_{\tilde y_i}[\log p(\tilde y_i \mid \mathbf{y})]$, the expected log pointwise predictive density for new data $\tilde y_i$ from the same DGP. Vehtari–Gelman–Gabry 2017 argues PSIS-LOO > WAIC > DIC in applied practice (more accurate, better uncertainty quantification), but all three are consistent under regularity and all three beat AIC/BIC for posterior-honest predictive comparison.

Example 8 DIC on a Normal mean

Back to the canonical tiny problem: $n = 5$ Normal observations, $\sigma = 1$, posterior $\mu \mid \mathbf{y} \sim \mathcal{N}(\bar y, \sigma^2/n) = \mathcal{N}(0.04, 0.2)$ with 2000 posterior draws. Compute the deviance $D(\mu) = -2 \sum \log \phi(y_i \mid \mu, 1)$ at each draw, take the posterior mean $\bar D$, compare to $D(\bar\mu)$ at the posterior mean $\bar\mu = 0.04$. The effective parameter count $p_{\mathrm{DIC}} = \bar D - D(\bar\mu) \approx 1$ (as expected for a one-parameter model under weak prior regularity), and $\mathrm{DIC} = D(\bar\mu) + 2 p_{\mathrm{DIC}} \approx 12.06$. The dic function in bayes.ts computes this in one line per input.
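A minimal sketch of that computation (the dic function in bayes.ts may take different inputs; here the per-draw and at-posterior-mean log-likelihoods are assumed to be precomputed):

```typescript
// DIC from posterior draws: D(θ) = −2 log L(θ); p_DIC = D̄ − D(θ̄); DIC = D(θ̄) + 2 p_DIC.
function dic(
  logLikAtDraws: number[],       // log L(θ_s) for each posterior draw θ_s
  logLikAtPosteriorMean: number, // log L(θ̄) at the posterior-mean parameter
): { dic: number; pDic: number } {
  const meanDeviance =
    logLikAtDraws.reduce((a, l) => a + -2 * l, 0) / logLikAtDraws.length; // D̄
  const devianceAtMean = -2 * logLikAtPosteriorMean;                      // D(θ̄)
  const pDic = meanDeviance - devianceAtMean;
  return { dic: devianceAtMean + 2 * pDic, pDic };
}
```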

Example 9 WAIC vs PSIS-LOO on regression

Fit a simple linear regression $y \mid \beta \sim \mathcal{N}(X\beta, \sigma^2)$ with flat prior and compute both WAIC and PSIS-LOO on the same posterior draws. For a correctly-specified well-conditioned problem both land within 0.1% of each other; for a poorly-specified or outlier-heavy regression, PSIS-LOO can be $\sim 1$ larger (deviance scale) with high Pareto-$\hat k$ diagnostics on the outliers. The waic and psisLoo functions in bayes.ts both accept the same pointwise log-likelihood matrix and return compatible outputs — a direct drop-in comparison with no re-sampling required.
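A sketch of the WAIC half of that comparison from a pointwise log-likelihood matrix; the PSIS-LOO half needs Pareto-tail smoothing of the importance weights and is not reproduced here. The matrix layout (rows = draws, columns = observations) is an assumption, not necessarily the bayes.ts convention:

```typescript
// WAIC from logLik[s][i] = log p(y_i | θ_s):
// lppd_i = log( mean_s exp(logLik[s][i]) ), p_WAIC,i = Var_s[ logLik[s][i] ],
// WAIC = −2 Σ_i (lppd_i − p_WAIC,i).
function waic(logLik: number[][]): number {
  const S = logLik.length, n = logLik[0].length;
  let elpd = 0;
  for (let i = 0; i < n; i++) {
    const col = logLik.map((row) => row[i]);
    const maxL = Math.max(...col);
    const lppd = maxL + Math.log(col.reduce((a, l) => a + Math.exp(l - maxL), 0) / S);
    const mean = col.reduce((a, l) => a + l, 0) / S;
    const pWaic = col.reduce((a, l) => a + (l - mean) ** 2, 0) / (S - 1);
    elpd += lppd - pWaic;
  }
  return -2 * elpd; // deviance scale; elpd_WAIC is the un-negated sum
}
```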

Example 10 Pareto-k diagnostic

PSIS-LOO's reliability hinges on the Pareto shape parameter $\hat k$ per observation. Vehtari's rule of thumb: $\hat k < 0.5$ means PSIS is reliable; $0.5 \le \hat k < 0.7$ is borderline; $\hat k \ge 0.7$ flags the observation as an outlier whose LOO estimate is unreliable — usually a sign of model misspecification on that point. The psisLoo function returns both the aggregate LOO value and a per-observation $\hat k$ vector plus a nProblematic count. Operationally: always check the $\hat k$ distribution before reporting LOO values.

Figure 9. WAIC vs PSIS-LOO comparison. Left panel: scatter of per-observation WAIC contribution vs LOO contribution across a regression example; near-diagonal for most points, with outliers falling above diagonal (LOO penalizes more heavily). Right panel: Pareto-k diagnostic distribution; most k values below 0.5, with a few above 0.7 flagged as problematic.
Remark 21 DIC's known issues

DIC has well-documented pathologies: $p_{\mathrm{DIC}}$ can be negative (spurious), doesn't handle mixture models or hierarchical models well, and depends on the choice of posterior summary ($\bar\mu$ vs posterior median, say). Spiegelhalter himself later (Spiegelhalter et al. 2014) recommended treating DIC as a legacy tool — useful for historical comparison with BUGS-era literature, not a default for new analyses. Modern Bayesian practice defaults to PSIS-LOO or WAIC.

Remark 22 WAIC ≈ LOO — a statement, not a proof

Watanabe (2010) proved $\mathrm{WAIC} = \mathrm{LOO} + O_p(1/n^2)$ under singular learning theory — an asymptotic equivalence that holds under substantially weaker regularity than DIC/AIC/BIC (no requirement of Fisher-information regularity). The proof requires machinery outside Topic 27's scope; the result is stated-and-cited here. Practically: WAIC and LOO agree to 3–4 significant figures on most well-behaved problems.

Remark 23 Predictive-score vs marginal-likelihood comparison — when each wins
  • Bayes factors answer: “Under the prior over models, how did the data update my beliefs?” Prior-informative question; sensitive to within-model priors (Rem 7).
  • Predictive scores answer: “Which model predicts new data from the same DGP best?” Data-informative question; insensitive to within-model priors as long as they’re proper and the posterior is informed.

Both are valid, neither universally best. Applied workflow (GEL2013 §7): report both, note where they agree and disagree, and interpret the disagreement. Predictive scores for “which model to use for forecasts”; Bayes factors for “which model best explains the observed data.”

27.9 Local FDR, Bayesian multiplicity, and PPCs

Topic 20 developed the Benjamini–Hochberg procedure for controlling the false discovery rate across many simultaneous tests. That's a frequentist procedure: it controls the expected proportion of false discoveries under $H_0$. The Bayesian counterpart, developed in Efron (2010), is the local false discovery rate:

$$\mathrm{fdr}(z) \;=\; P(\mathcal{M}_0 \mid z) \;=\; \frac{(1-\pi_1)\, f_0(z)}{f(z)},$$

where $f_0$ is the null density (typically $\mathcal{N}(0, 1)$), $f_1$ the alternative, $\pi_1$ the prior probability of the alternative, and $f = (1-\pi_1) f_0 + \pi_1 f_1$ the mixture density of observed z-scores. The two-groups model assumes observed z-scores come from a mixture: null scores from $f_0$, alternative scores from $f_1$, and the task is to recover per-observation posterior probabilities of being from the null.
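A minimal sketch of that formula under the theoretical null and a Normal alternative (the bayes.ts localFdr also supports an empirical-null fit, Rem 24; the names below are assumptions):

```typescript
// Two-groups local FDR: fdr(z) = (1 − π₁) f₀(z) / f(z) with f = (1−π₁) f₀ + π₁ f₁.
const normPdf = (z: number, mu = 0, sd = 1) =>
  Math.exp(-0.5 * ((z - mu) / sd) ** 2) / (sd * Math.sqrt(2 * Math.PI));

function localFdr(z: number, pi1: number, muAlt: number, sdAlt = 1): number {
  const f0 = normPdf(z);                 // theoretical null N(0,1)
  const f1 = normPdf(z, muAlt, sdAlt);   // alternative density
  const f = (1 - pi1) * f0 + pi1 * f1;   // marginal mixture density of z-scores
  return ((1 - pi1) * f0) / f;
}

console.log(localFdr(3.0, 0.1, 2.5)); // ≈ 0.102 (brief T27.13)
console.log(localFdr(2.0, 0.1, 2.5)); // ≈ 0.580 (brief T27.14)
```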

Example 11 BH vs local FDR on sparse signal

Simulate 1000 z-scores with $\pi_1 = 0.1$, $\mu_{\mathrm{alt}} = 2.5$, $\sigma_{\mathrm{alt}} = 1$ — Efron 2010's running example. Apply BH at $q = 0.1$ (expect $\le 10\%$ false discoveries in the rejection set) and local FDR at an $\mathrm{fdr}(z) < 0.2$ threshold. The two methods produce overlapping but distinct rejection sets: BH rejects a contiguous tail $|z| > z^*$; local FDR rejects individual z-scores wherever the posterior $P(\mathcal{M}_0 \mid z) < 0.2$, which can leave moderate-z scores unrejected while rejecting extreme outliers. The LocalFDRExplorer component visualizes both on adjustable mixtures. Closed-form checks at specific z: $\mathrm{fdr}(3.0) \approx 0.102$, $\mathrm{fdr}(2.0) \approx 0.580$ (brief §6.2 T27.13/T27.14).

[Interactive: LocalFDRExplorer — default preset 10% signal, $\mu_{\mathrm{alt}} = 2.5$ (the notebook Cell 12 default): BH $|z|$ cutoff $\approx \pm 2.81$, local-FDR right-tail cutoff $\approx 2.68$; the two procedures disagree most around $z \approx 2$.]
Figure 10. BH vs local FDR on the Efron two-groups example. Upper panel: histogram of 1000 z-scores with theoretical null density (dashed) and fitted mixture density (solid) overlaid. Lower panel: rejection regions under BH (shaded blue in the tails) vs local FDR (shaded orange wherever fdr(z) < 0.2). Disagreement zone near z ≈ 2 highlighted.
Example 12 Posterior predictive check — Bayesian p-value

Gelman, Meng & Stern (1996) define the Bayesian posterior predictive p-value as

$$p_B \;=\; P\!\left(T(\mathbf{y}^{\mathrm{rep}}, \theta) \geq T(\mathbf{y}, \theta) \;\middle|\; \mathbf{y}\right),$$

averaged over the posterior of $\theta$. For $\theta$-independent statistics (sample mean, max, min, quantiles), this simplifies to the empirical proportion of replicated datasets exceeding the observed $T$. Interpretation: $p_B \approx 0.5$ means the observed $T$ is typical under the posterior; extreme $p_B$ (near 0 or 1) signals model misfit. Under a well-specified model, $p_B$ is approximately uniform — but crucially not sub-uniform like a frequentist p-value, so $p_B < 0.05$ does not have the same Type I error guarantee. Use PPCs as a model-criticism tool, not as a hypothesis test.
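A minimal sketch of the $\theta$-independent case, assuming the replicated datasets (one per posterior draw) have already been generated:

```typescript
// Posterior predictive p-value for a θ-independent test statistic T:
// p_B = proportion of replicated datasets with T(y_rep) ≥ T(y_obs).
function bayesianPValue(
  yObs: number[],
  yReps: number[][],              // y_rep^(s) ~ p(ỹ | θ_s), θ_s ~ p(θ | y)
  T: (y: number[]) => number,     // e.g. the mean, max, or a quantile
): number {
  const tObs = T(yObs);
  const exceed = yReps.filter((rep) => T(rep) >= tObs).length;
  return exceed / yReps.length;
}

// Example statistic: the sample maximum, a common tail-behavior check.
const maxStat = (y: number[]) => Math.max(...y);
```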

Remark 24 Empirical null vs theoretical null

Efron's two-groups model assumes $f_0 = \mathcal{N}(0, 1)$ (the theoretical null) — what you'd expect under standard z-test assumptions. In practice, real-world z-score distributions often have a wider null due to miscalibration, unmodeled correlations, or dependence structure — Efron calls this the empirical null. Fitting $(\mu_0, \sigma_0)$ from the central 50% of the z-distribution (central matching) recovers the empirical null; substituting it for $\mathcal{N}(0, 1)$ often changes rejection sets dramatically. The localFdr function accepts a theoreticalNull: false flag to switch on empirical-null estimation.

Remark 25 PPCs are not hypothesis tests

The Bayesian posterior predictive p-value looks like a frequentist p-value but behaves differently. GEL1996 warned: $p_B$ is "prior-predictive calibrated" rather than "frequentist-calibrated" — under a correctly specified model with a prior, $p_B$ is uniform on $[0, 1]$ averaged over the prior, not for a fixed true $\theta$. In particular, $p_B < 0.05$ does not control Type I error at 5%. PPCs are graphical and diagnostic — use them to locate model inadequacy (tail behavior, skewness, dependence), not to make accept/reject decisions.

27.10 Forward-look

Topic 27 delivers Track 7’s computational and inferential capstone. Four threads extend into the remaining curriculum:

Remark 26 Topic 28 — Hierarchical Bayes & Partial Pooling

Topic 28 closes Track 7 by combining Topic 26’s MCMC machinery with Topic 27’s model-comparison framework for hierarchical models — partial pooling across groups, hyperparameter inference, hierarchical BMA. The 8-schools example from §26.9 gets its proper BMA treatment there: different group-level priors compared via Bayes factors, weighted by posterior model probabilities, predictive-averaged for out-of-sample forecasts.

Remark 27 Stacking and M-open alternatives (formalml)

When truth lies outside $\mathcal{M}$ (the M-open case, §27.2 Rem 3), BMA's weights concentrate on the best approximating model — which may still be a bad approximation. Stacking, developed by Wolpert (1992) and extended to Bayesian form by Yao et al. (2018), directly optimizes predictive weights on held-out data. Generally outperforms BMA on M-open problems; often underperforms on strictly M-closed problems. The choice: know which regime you're in, report both, interpret the disagreement.

Remark 28 Variational Bayes for model selection (formalml)

When MCMC + bridge sampling is too expensive (large models, big data), the variational lower bound (ELBO) gives a tractable approximation to $\log m(\mathbf{y})$. Variational Bayes for model selection develops the KL-projection arguments that justify the approximation, the mean-field decomposition that makes it tractable, and the automatic-differentiation tools that make it routine. The trade-off: the ELBO is always an under-estimate of the true $\log m$, so ELBO-based model selection can be biased. Modern normalizing-flow variational families close the gap.

Remark 29 Reversible-jump MCMC (formalml)

When the model space is discrete and high-dimensional (e.g., variable selection with $2^p$ candidate models), separate MCMC + bridge-sampling runs for each model are infeasible. Reversible-jump MCMC (Green 1995) samples directly from the joint posterior over $(\mathcal{M}_k, \theta_k)$, providing BMA weights as a byproduct. Implementation is delicate — proposal distributions across models of different dimension require careful Jacobian bookkeeping — but the payoff is enormous on structured model spaces.

Remark 30 Bernardo reference priors and objective Bayes (formalml)

The “prior matters” lesson of §27.5 motivates an entire subfield: Bernardo reference priors systematically construct priors that maximize the expected KL-divergence between prior and posterior — priors that “let the data speak as much as possible.” Under regularity, reference priors coincide with Jeffreys priors in 1-D and generalize to multi-parameter settings via sequential conditional reference. Bayes factors under reference priors have stability properties the Lindley paradox notably lacks.

Topic 27 discharges the forward promises made in Topics 17, 18, 20, 24, 25, and 26 — every hypothesis-testing counterpart, every marginal-likelihood estimator, every local-FDR framework, every BMA pointer. Topic 28 closes Track 7 by applying this machinery to hierarchical structures, where model uncertainty and parameter uncertainty compound and BMA becomes the natural framework rather than an added layer.


References

  1. Robert E. Kass & Adrian E. Raftery. (1995). Bayes Factors. Journal of the American Statistical Association, 90(430), 773–795.
  2. Xiao-Li Meng & Wing Hung Wong. (1996). Simulating Ratios of Normalizing Constants via a Simple Identity: A Theoretical Exploration. Statistica Sinica, 6(4), 831–860.
  3. Andrew Gelman & Xiao-Li Meng. (1998). Simulating Normalizing Constants: From Importance Sampling to Bridge Sampling to Path Sampling. Statistical Science, 13(2), 163–185.
  4. John Skilling. (2006). Nested Sampling for General Bayesian Computation. Bayesian Analysis, 1(4), 833–859.
  5. Sumio Watanabe. (2010). Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory. Journal of Machine Learning Research, 11, 3571–3594.
  6. Aki Vehtari, Andrew Gelman & Jonah Gabry. (2017). Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC. Statistics and Computing, 27(5), 1413–1432.
  7. Bradley Efron. (2010). Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge University Press.
  8. David J. Spiegelhalter, Nicola G. Best, Bradley P. Carlin & Angelika van der Linde. (2002). Bayesian Measures of Model Complexity and Fit. Journal of the Royal Statistical Society: Series B, 64(4), 583–639.
  9. Dennis V. Lindley. (1957). A Statistical Paradox. Biometrika, 44(1–2), 187–192.
  10. Harold Jeffreys. (1961). Theory of Probability (3rd ed.). Oxford University Press.
  11. Jennifer A. Hoeting, David Madigan, Adrian E. Raftery & Chris T. Volinsky. (1999). Bayesian Model Averaging: A Tutorial. Statistical Science, 14(4), 382–401.
  12. Andrew Gelman, Xiao-Li Meng & Hal Stern. (1996). Posterior Predictive Assessment of Model Fitness via Realized Discrepancies. Statistica Sinica, 6(4), 733–760.
  13. Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari & Donald B. Rubin. (2013). Bayesian Data Analysis (3rd ed.). CRC Press.