Bayesian Model Comparison & BMA
Topic 25 introduced the marginal likelihood m(y) as the normalizer of the posterior; Topic 26 gave us MCMC machinery that samples the posterior without ever needing m(y). Topic 27 asks the opposite question: when we do need m(y) — for Bayes factors, posterior model probabilities, Bayesian model averaging — how do we estimate it from MCMC draws that only know the posterior's shape? Four estimators (Laplace, bridge sampling, path sampling, nested sampling), three predictive scores (DIC, WAIC, PSIS-LOO), the Lindley paradox (featured), the Occam-factor ⇔ BIC equivalence that closes Topic 24 §24.4, local FDR as the Bayesian counterpart to Topic 20's BH procedure, and posterior-predictive checks — all built on Topic 26's draws.
27.1 The hidden object: marginal likelihood as Bayesian normalizer
Every Bayesian computation in Topics 25 and 26 ran through the same unnormalized identity:

$$\pi(\theta \mid y) \propto p(y \mid \theta)\,\pi(\theta).$$

The proportionality constant — the integral

$$m(y) = \int_\Theta p(y \mid \theta)\,\pi(\theta)\,d\theta$$

— didn’t matter for posterior inference. Metropolis–Hastings cancels $m(y)$ in the acceptance ratio. Gibbs samplers sample from full conditionals that depend only on the unnormalized joint density. HMC follows gradients of $\log[p(y \mid \theta)\,\pi(\theta)]$ and never touches the normalizer. Topic 26’s machinery delivers posterior expectations, credible intervals, and predictive draws — all without ever computing $m(y)$.
Topic 27 is about what happens when you need $m(y)$ anyway. Three tasks force the issue:
- Bayes factors. Comparing two models $\mathcal{M}_1$ and $\mathcal{M}_0$ via the posterior odds ratio requires the ratio $m_1(y)/m_0(y)$. The individual normalizers don’t cancel — they’re the whole game.
- Bayesian model averaging. Predictions weighted by posterior model probability require $p(\mathcal{M}_k \mid y) \propto p(\mathcal{M}_k)\,m_k(y)$. Again, the marginal likelihoods themselves are the computational object.
- Evidence-level hypothesis testing. “How much does this dataset support $\mathcal{M}_1$ over $\mathcal{M}_0$?” — a question Neyman–Pearson answers with a decision rule, and Bayesian analysis answers with a number derived from $m_1(y)/m_0(y)$.
The figure above captures the central idea: $m(y)$ is the partition function — the integral over parameter space that turns the unnormalized joint $p(y \mid \theta)\,\pi(\theta)$ into a probability density. Physicists face the same object in the Gibbs distribution $p(x) = Z^{-1} e^{-E(x)/kT}$ with $Z = \int e^{-E(x)/kT}\,dx$; Bayesians face it in the posterior. The computational challenge is identical — integrate an exponential-family-ish object over a high-dimensional space — and the techniques developed in physics (bridge sampling originates in free-energy estimation) transfer directly.
Topic 25 §25.5 introduced two distinct quantities that both begin with the word “marginal”:
- Marginal likelihood $m(y) = \int p(y \mid \theta)\,\pi(\theta)\,d\theta$: integrates the data likelihood against the prior. A single scalar summarizing the fit of the model.
- Posterior predictive $p(\tilde y \mid y) = \int p(\tilde y \mid \theta)\,\pi(\theta \mid y)\,d\theta$: integrates a new-data likelihood against the posterior. A full density over new observations.
The marginal likelihood is the evidence supporting the model; the posterior predictive is the model’s prediction under that evidence. §27.2 formalizes the first, §27.7 develops the second.
MCMC works because it cancels $m(y)$. That cancellation is exactly what makes $m(y)$ hard to estimate from MCMC draws — the draws carry information about the posterior’s shape but not its scale. Every technique in §27.6 is a way to recover the lost scale: Laplace uses the analytic Gaussian-integral formula, bridge and path sampling tie posterior draws to a calibrated proposal, nested sampling transforms the integral into a 1-D quadrature over prior-mass levels. Each has failure modes; the one-size-fits-all estimator does not exist.
27.2 Models and marginal likelihood
Formalize the setting. Let $\{\mathcal{M}_1, \dots, \mathcal{M}_K\}$ be a candidate set of models — each $\mathcal{M}_k$ specifies a parameter space $\Theta_k$, a likelihood $p(y \mid \theta_k, \mathcal{M}_k)$, and a prior $\pi_k(\theta_k)$. Each model produces its own marginal likelihood:

$$m_k(y) = \int_{\Theta_k} p(y \mid \theta_k, \mathcal{M}_k)\,\pi_k(\theta_k)\,d\theta_k.$$

A prior over models (typically uniform, $p(\mathcal{M}_k) = 1/K$) turns the collection into a full joint model. Bayes’ theorem then applies at the model level.
Given a candidate set $\{\mathcal{M}_1, \dots, \mathcal{M}_K\}$, prior model probabilities $p(\mathcal{M}_k)$, and marginal likelihoods $m_k(y)$, the posterior probability of model $\mathcal{M}_k$ is

$$p(\mathcal{M}_k \mid y) = \frac{p(\mathcal{M}_k)\,m_k(y)}{\sum_{j=1}^{K} p(\mathcal{M}_j)\,m_j(y)}.$$

The denominator is the total evidence $p(y)$ — a number independent of $k$, acting purely as a normalizer.
This is the standard Bayes-theorem application at the model level: the likelihood of model $\mathcal{M}_k$ having generated the data is $m_k(y)$ (by definition of the marginal likelihood); the prior is $p(\mathcal{M}_k)$; the posterior is their normalized product. The function posteriorModelProbabilities in bayes.ts implements this softmax-like normalization in log space for numerical stability — a precaution that matters when $\log m_k(y)$ values differ by tens or hundreds of units across a candidate set.
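A minimal sketch of that log-space normalization (the actual posteriorModelProbabilities in bayes.ts may differ in signature; the logsumexp trick is the point):

```typescript
// Posterior model probabilities from log marginal likelihoods, computed in
// log space so evidences differing by hundreds of log-units don't underflow.
// Signature is an assumption for illustration.
function posteriorModelProbabilities(
  logMarginals: number[],
  logPriors?: number[],
): number[] {
  const K = logMarginals.length;
  const lp = logPriors ?? new Array(K).fill(-Math.log(K)); // uniform p(M_k) = 1/K
  const logJoint = logMarginals.map((lm, k) => lm + lp[k]); // log p(M_k) + log m_k(y)
  const max = Math.max(...logJoint); // logsumexp: subtract the max first
  const logZ = max + Math.log(logJoint.reduce((s, v) => s + Math.exp(v - max), 0));
  return logJoint.map((v) => Math.exp(v - logZ)); // softmax-like weights, sum to 1
}

// Evidences tens of log-units apart stay numerically stable:
console.log(posteriorModelProbabilities([-1520.3, -1508.9, -1551.2]));
```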
Consider $y \mid \theta \sim \mathrm{Binomial}(n, \theta)$ with $\theta \sim \mathrm{Beta}(a, b)$. The marginal likelihood is

$$m(y) = \binom{n}{y} \int_0^1 \theta^{y}(1-\theta)^{n-y}\,\frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a,b)}\,d\theta.$$

The integrand is proportional to $\theta^{y+a-1}(1-\theta)^{n-y+b-1}$, so the integral evaluates to

$$m(y) = \binom{n}{y}\,\frac{B(y+a,\ n-y+b)}{B(a,b)}.$$

Taking logs: $\log m(y) = \log\binom{n}{y} + \log B(y+a,\ n-y+b) - \log B(a,b)$. For $y = 12$, $n = 20$, $a = b = 1$ the answer is $\log m(y) = -\log 21 \approx -3.0445$ — the closed-form reference we use in §27.6 to verify the bridge / path / Laplace estimators (brief §6.2 T27.4). The function betaBinomialLogMarginal in bayes.ts computes this with stable lnGamma arithmetic.
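A self-contained sketch of the closed form, with lnGamma via a standard Lanczos approximation (the bayes.ts helper is assumed equivalent):

```typescript
// log m(y) = log C(n,y) + log B(y+a, n−y+b) − log B(a,b), all via lnGamma.
const LANCZOS = [676.5203681218851, -1259.1392167224028, 771.32342877765313,
  -176.61502916214059, 12.507343278686905, -0.13857109526572012,
  9.9843695780195716e-6, 1.5056327351493116e-7];
function lnGamma(x: number): number {
  if (x < 0.5) return Math.log(Math.PI / Math.sin(Math.PI * x)) - lnGamma(1 - x);
  x -= 1;
  let acc = 0.99999999999980993;
  for (let i = 0; i < LANCZOS.length; i++) acc += LANCZOS[i] / (x + i + 1);
  const t = x + 7.5;
  return 0.5 * Math.log(2 * Math.PI) + (x + 0.5) * Math.log(t) - t + Math.log(acc);
}
const lnBeta = (p: number, q: number) => lnGamma(p) + lnGamma(q) - lnGamma(p + q);

function betaBinomialLogMarginal(y: number, n: number, a: number, b: number): number {
  const logChoose = lnGamma(n + 1) - lnGamma(y + 1) - lnGamma(n - y + 1);
  return logChoose + lnBeta(y + a, n - y + b) - lnBeta(a, b);
}

console.log(betaBinomialLogMarginal(12, 20, 1, 1)); // ≈ −3.0445 = −log 21
```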
Bernardo–Smith (1994) classify model-comparison settings by the status of truth. M-closed: the true data-generating process is in $\{\mathcal{M}_1, \dots, \mathcal{M}_K\}$. M-open: the truth is outside the candidate set; we’re choosing the best approximating model. M-complete: truth lies in a (possibly unenumerated) extension of the candidate set, but we treat the set as a working one. BMA is principled in M-closed and M-complete (it averages over the posterior within the set); for M-open problems, stacking (formalml) or predictive-weight optimization is often preferred because BMA weights concentrate on a single “best” model that may still be far from truth. Most applied settings are M-open; the distinction matters when interpreting BMA weights.
A common pitfall: calling $m_k(y)$ the “likelihood of the model.” It is not. The likelihood of the model at $\theta$ is $p(y \mid \theta, \mathcal{M}_k)$ — a function of $\theta$. The marginal likelihood is a scalar — the likelihood averaged against the prior. Two models with identical MLE fits (same peak of $p(y \mid \theta)$) can have very different $m_k(y)$ if one has a concentrated prior that matches the data and the other has a diffuse prior that allocates probability mass to regions where $p(y \mid \theta)$ is low.
$p(\mathcal{M}_k \mid y)$ depends on the entire candidate set through the denominator. Adding or removing candidate models — even ones with negligible weight — can shift the reported posterior probabilities. This is not a bug; it’s a statement about what Bayesian model comparison actually asks. The question “which model explains the data best?” is inherently relative: best compared to what? The same model can carry a high posterior probability in a two-model comparison and a modest one in an eight-model set without contradiction — the two numbers answer different questions.
27.3 Notation
Topic 27 inherits Topic 25 §25.3 and Topic 26 §26.3 notation verbatim. The new symbols introduced here:
| Object | Symbol | Interpretation |
|---|---|---|
| Candidate model | $\mathcal{M}_k$ | Indexed by $k = 1, \dots, K$ over a candidate set. |
| Marginal likelihood | $m_k(y)$ | Integral of $p(y \mid \theta_k, \mathcal{M}_k)\,\pi_k(\theta_k)$ over $\Theta_k$. |
| Prior / posterior model probability | $p(\mathcal{M}_k)$ / $p(\mathcal{M}_k \mid y)$ | Prior weight / BMA weight. |
| Bayes factor | $BF_{10} = m_1(y)/m_0(y)$ | Data-driven update of $\mathcal{M}_1$ vs $\mathcal{M}_0$ prior odds. |
| Bridge function | $h(\theta)$ | User-chosen function in the Meng–Wong identity (§27.6 Thm 5). |
| Normalizing constant | $Z$ | Generic — the integral $\int q(\theta)\,d\theta$ for an unnormalized $q$. |
| Inverse temperature | $\beta \in [0, 1]$ | Path-sampling parameter. |
| Power posterior | $\pi_\beta(\theta) \propto p(y \mid \theta)^\beta\,\pi(\theta)$ | Interpolates prior ($\beta = 0$) and posterior ($\beta = 1$). |
| Occam factor | $\pi(\tilde\theta)\,(2\pi)^{d/2}\,\lvert H\rvert^{-1/2}$ | Ratio of posterior volume to prior volume; asymptotically $n^{-d/2}$. |
| Effective parameter count | $p_D$ / $p_{\text{WAIC}}$ | “Effective” parameters accounting for posterior concentration. |
| Expected log-predictive density | $\mathrm{elpd}$ | $\sum_i E[\log p(\tilde y_i \mid y)]$; the target of LOO / WAIC. |
| PSIS-LOO estimate | $\widehat{\mathrm{elpd}}_{\text{loo}}$ | Pareto-smoothed importance-sampling LOO. |
| Pareto diagnostic | $\hat k$ | Per-observation PSIS-reliability flag; $\hat k > 0.7$ is problematic. |
| Local false discovery rate | $\mathrm{fdr}(z)$ | Two-groups posterior $\Pr(\text{null} \mid z)$ (Efron 2010). |
| Posterior-predictive test statistic | $T(y, \theta)$ | Discrepancy evaluated at observed or replicated data. |
| Bayesian p-value | $p_B$ | $\Pr\big(T(y^{\mathrm{rep}}, \theta) \ge T(y, \theta) \mid y\big)$. |
Independence and conditional independence continue to be written with the Topic 16 §16.9 convention $X \perp\!\!\!\perp Y$ (and $X \perp\!\!\!\perp Y \mid Z$). The M-open / M-closed / M-complete trichotomy is used narratively (§27.2 Rem 3) without a dedicated symbol. When $\mathcal{M}_1$ and $\mathcal{M}_0$ share parameters (nested models), we’ll say so explicitly — the shared parameter is integrated against the joint prior, not multiplied.
27.4 Bayes factors and Jeffreys’ scale
The Bayes factor is the ratio of marginal likelihoods:

$$BF_{10} = \frac{m_1(y)}{m_0(y)}.$$

This is the data’s multiplicative update to the prior odds of $\mathcal{M}_1$ over $\mathcal{M}_0$:

For any two models $\mathcal{M}_1$ and $\mathcal{M}_0$ with positive prior probabilities,

$$\frac{p(\mathcal{M}_1 \mid y)}{p(\mathcal{M}_0 \mid y)} = BF_{10} \cdot \frac{p(\mathcal{M}_1)}{p(\mathcal{M}_0)}.$$
The Bayes factor is thus prior-free in one sense — it’s the part of the comparison due to the data, independent of prior model probabilities — and prior-dependent in another: each $m_k(y)$ depends on the within-model prior $\pi_k(\theta_k)$. Rem 7 below elaborates the second sense; Rem 12–14 follow the consequences.
The identity is immediate from §27.2 Thm 1: take the ratio of the two posterior model probabilities and the normalizer $p(y)$ cancels.
Jeffreys (1961, Appendix B) tabulated a heuristic scale for interpreting Bayes factors:

| $BF_{10}$ range | Strength of evidence for $\mathcal{M}_1$ |
|---|---|
| $< 1$ | Supports $\mathcal{M}_0$ |
| $1$ to $10^{1/2} \approx 3.2$ | Barely worth mentioning |
| $10^{1/2}$ to $10$ | Substantial |
| $10$ to $100$ | Strong |
| $> 100$ | Decisive |
Let $y_1, \dots, y_n \mid \mu \overset{\text{iid}}{\sim} N(\mu, \sigma^2)$ with $\sigma^2$ known, $\mathcal{M}_0: \mu = 0$ and $\mathcal{M}_1: \mu \sim N(0, \tau^2)$. Then $\bar y \mid \mathcal{M}_0 \sim N(0, \sigma^2/n)$ and $\bar y \mid \mathcal{M}_1 \sim N(0, \sigma^2/n + \tau^2)$ (by the convolution of likelihood and prior). The Bayes factor is

$$BF_{01} = \sqrt{1 + \frac{n\tau^2}{\sigma^2}}\;\exp\!\left(-\frac{z^2}{2}\cdot\frac{n\tau^2/\sigma^2}{1 + n\tau^2/\sigma^2}\right), \qquad z = \frac{\sqrt{n}\,\bar y}{\sigma}.$$

This is the Lindley setup — the featured case of §27.5. For $z = 3$, $n = 100$, $\sigma = 1$, $\tau = 1$: $n\tau^2/\sigma^2 = 100$, and $BF_{01} \approx 0.117$, i.e. $BF_{10} \approx 8.6$ (moderate evidence for $\mathcal{M}_1$; lindleyBayesFactor(3, 100, 1) in bayes.ts). Same $z$ with $\tau = 100$: $BF_{01} \approx 11.1$ (strong evidence for $\mathcal{M}_0$). The paradox lives between these two lines.
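A sketch consistent with the lindleyBayesFactor(3, 100, 1) call above, assuming the argument order (z, n, τ) with σ = 1 by default:

```typescript
// BF01 for the point-null Normal problem (Thm 3 in §27.5):
// BF01 = sqrt(1 + r) · exp(−z²/2 · r/(1+r)),  r = nτ²/σ².
function lindleyBayesFactor(z: number, n: number, tau: number, sigma = 1): number {
  const r = (n * tau * tau) / (sigma * sigma); // prior-to-sampling variance ratio
  return Math.sqrt(1 + r) * Math.exp(-0.5 * z * z * (r / (1 + r)));
}

console.log(lindleyBayesFactor(3, 100, 1));   // ≈ 0.117 → favors M1
console.log(lindleyBayesFactor(3, 100, 100)); // ≈ 11.1  → favors M0
```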
The Bayes factor is prior-free over model probabilities but prior-dependent over parameter distributions. The prior $\pi_1$ in $m_1(y) = \int p(y \mid \theta)\,\pi_1(\theta)\,d\theta$ is a hard input: a diffuse $\pi_1$ under $\mathcal{M}_1$, diluting probability mass over a large parameter volume, yields a smaller $m_1(y)$ than a concentrated $\pi_1$ matching the data. This is not a nuisance — it’s the Occam factor (§27.6 Thm 4) acting as a model-complexity penalty. But it does mean you must think about your within-model priors when you use Bayes factors; vague “non-informative” priors are actively informative here.
The scale above is Jeffreys’ original proposal and remains common in applied work, but nothing in Bayesian probability requires you to use it — $BF_{10} = 5$ has a specific probabilistic meaning (it multiplies the prior odds by 5) regardless of whether a convention calls it “substantial.” Treat the labels as rough communication tools; report the numerical $BF$ and let the reader apply their own threshold. Kass & Raftery (1995 §3.2) propose a slightly coarser alternative scale (thresholds at 3, 20, 150); the distinction rarely matters in practice.
Under $H_0: \theta = \theta_0$ and $H_1: \theta \neq \theta_0$ with prior $\pi_1$ on the alternative,

$$BF_{10} = \int \frac{p(y \mid \theta)}{p(y \mid \theta_0)}\,\pi_1(\theta)\,d\theta = E_{\pi_1}[\Lambda(\theta)],$$

where $\Lambda(\theta) = p(y \mid \theta)/p(y \mid \theta_0)$ is the Topic 18 likelihood ratio. The BF is thus the prior-averaged LRT — Neyman–Pearson picks a specific alternative, Bayes averages over the prior. This is why §17.10 Rem 15 and §18 Rem 25 point here: Bayes factors are the evidence-level analog of the LRT decision rule.
27.5 The Lindley paradox
Section §27.4’s Ex 2 already showed the paradox numerically: at $z = 3$, changing the prior scale $\tau$ from 1 to 100 flips the Bayes factor from $BF_{01} \approx 0.12$ (favoring $\mathcal{M}_1$) to $BF_{01} \approx 11$ (favoring $\mathcal{M}_0$) — even though the frequentist z-test flags $z = 3$ as a rejection (a function of $z$ alone) regardless of $\tau$. Lindley (1957) proved this is general.
Let $y_1, \dots, y_n \mid \mu \overset{\text{iid}}{\sim} N(\mu, \sigma^2)$ with $\sigma^2$ known. Consider $\mathcal{M}_0: \mu = 0$ against $\mathcal{M}_1$ with prior $\mu \sim N(0, \tau^2)$. Let $z = \sqrt{n}\,\bar y/\sigma$ be the observed z-statistic. Then

$$BF_{01} = \sqrt{1 + \frac{n\tau^2}{\sigma^2}}\;\exp\!\left(-\frac{z^2}{2}\cdot\frac{n\tau^2/\sigma^2}{1 + n\tau^2/\sigma^2}\right).$$

In particular, for any fixed $z$ and $n$, $BF_{01} \to \infty$ as $\tau \to \infty$ — the frequentist rejects (a function of $z$ alone) while the Bayesian overwhelmingly favors $\mathcal{M}_0$.
Proof 1 — Thm 3 (Lindley paradox)
Step 1 — marginal under $\mathcal{M}_0$. Under $\mathcal{M}_0$, $\bar y \sim N(0, \sigma^2/n)$, so

$$m_0(y) = \varphi\big(\bar y;\ 0,\ \sigma^2/n\big),$$

where $\varphi(\cdot\,;\ m, v)$ denotes the $N(m, v)$ density.

Step 2 — marginal under $\mathcal{M}_1$. Under $\mathcal{M}_1$, $\mu \sim N(0, \tau^2)$ and $\bar y \mid \mu \sim N(\mu, \sigma^2/n)$. The marginal is the convolution

$$m_1(y) = \varphi\big(\bar y;\ 0,\ \sigma^2/n + \tau^2\big),$$

using the standard Gaussian convolution identity: the sum of a sampling-noise term with variance $\sigma^2/n$ and an independent prior draw with variance $\tau^2$ is Gaussian with variance $\sigma^2/n + \tau^2$.

Step 3 — ratio. The Bayes factor is

$$BF_{01} = \frac{m_0(y)}{m_1(y)} = \sqrt{\frac{\sigma^2/n + \tau^2}{\sigma^2/n}}\;\exp\!\left(-\frac{\bar y^2}{2}\left[\frac{1}{\sigma^2/n} - \frac{1}{\sigma^2/n + \tau^2}\right]\right).$$

Step 4 — substitute standardized quantities. Let $z = \sqrt{n}\,\bar y/\sigma$ and $r = n\tau^2/\sigma^2$. Then $\bar y^2 = z^2\sigma^2/n$, so

$$\frac{\bar y^2}{2}\left[\frac{1}{\sigma^2/n} - \frac{1}{\sigma^2/n + \tau^2}\right] = \frac{z^2}{2}\cdot\frac{r}{1 + r}, \qquad \sqrt{\frac{\sigma^2/n + \tau^2}{\sigma^2/n}} = \sqrt{1 + r}.$$

Substituting into Step 3 gives the stated formula.

Step 5 — limit. Fix $z$ and $n$, and let $\tau \to \infty$. Then $r \to \infty$, $r/(1+r) \to 1$, so the exponential factor converges to $e^{-z^2/2}$ — a finite constant bounded by the observed data. The prefactor $\sqrt{1 + r} \to \infty$. Hence the product $BF_{01} \to \infty$.
∎ — using LIN1957; KAS1995 §3.4.
The paradox’s resolution — developed in Rem 10–12 — is that a prior “vague enough to be non-committal” on the estimation question is a prior that makes substantial commitments on the model-comparison question. The prior is part of the model.
Three values of $\tau$, same $z = 3$ with $n = 100$, $\sigma = 1$ (p-value $\approx 0.0027$, strong frequentist rejection):
- $\tau = 1$: $BF_{01} \approx 0.12$. Moderate evidence for $\mathcal{M}_1$.
- $\tau = 10$: $BF_{01} \approx 1.1$. Essentially equivocal.
- $\tau = 100$: $BF_{01} \approx 11$. Strong evidence for $\mathcal{M}_0$.
Same data, same likelihood, same model — the prior-scale parameter alone flips the conclusion from “moderate evidence for $\mathcal{M}_1$” to “strong evidence for $\mathcal{M}_0$.” This is not a numerical accident; it is the diffuse-prior penalty in action. The LindleyParadoxExplorer component below traces the full $BF_{01}(\tau)$ curve across $\tau$ at adjustable $z$ and $n$.
Lindley’s paradox is specific to the setting where $\mathcal{M}_0$ assigns a point mass to a single parameter value (here $\mu = 0$). If instead $\mathcal{M}_0: \mu \in [-\epsilon, \epsilon]$ for a small interval — an “interval null” — the paradox dissolves: both $m_0(y)$ and $m_1(y)$ scale similarly as $\tau$ grows, so their ratio stabilizes. Applied Bayesian practice often recommends interval nulls for this reason (cf. Berger & Sellke 1987). Topic 27 keeps the point-null framing because that’s what sharpens the contrast with frequentist hypothesis testing, but the pragmatic advice is: if your null is approximate, make the interval null explicit.
A related phenomenon: fix $\tau$ (not $z$) and let $n \to \infty$. Then $r = n\tau^2/\sigma^2 \to \infty$ too, and one might expect the accumulating data to settle the comparison. Not so — for fixed $z$, the $\sqrt{1+r}$ prefactor dominates the bounded exponential, and $BF_{01} \to \infty$ again. Bartlett (1957, “A comment on D. V. Lindley’s statistical paradox”) notes this: the prior scale and the sample scale interact non-trivially. Keeping $\tau$ fixed as $n$ grows is itself a modeling choice — often the wrong one.
The paradox doesn’t reveal a flaw in Bayesian inference; it reveals a demand for care in prior specification. A diffuse $\pi_1$ under $\mathcal{M}_1$ says: “before seeing data, I believed $\mu$ could plausibly be anywhere across hundreds of standard deviations.” Under that prior, observing $z = 3$ is modest evidence for $\mathcal{M}_1$ relative to the prior commitment, and $\mathcal{M}_0$’s concentrated mass at 0 wins the comparison. The resolution: calibrate $\tau$ to what you actually believe about the effect size. For a standardized coefficient in a well-designed study, $\tau \approx 1$ is realistic. For a “diffuse” effect in an exploratory analysis, $\tau \approx 10$ might be appropriate — but don’t go to $\tau = 100$ unless your substantive knowledge genuinely says so. This is the advice of applied Bayesian workflow: prior elicitation matters precisely because Bayes factors take priors seriously.
27.6 Marginal-likelihood computation
With Bayes factors motivated, we turn to the hard problem: computing $m(y)$ from MCMC draws. Four estimators, in rough order of effort:
- Laplace approximation — closed-form Gaussian integral around the posterior mode, negligible cost after finding the mode.
- Bridge sampling — iterative fixed-point estimator using posterior draws and a proposal density.
- Path sampling (thermodynamic integration) — integrate the expected log-likelihood over a power-posterior temperature schedule.
- Nested sampling — transform the volumetric integral into a 1-D quadrature over prior-mass levels.
All four consume Topic 26’s MCMC draws; each has a failure mode illuminated in the examples below.
Let $p(y \mid \theta)$, $\theta \in \mathbb{R}^d$, be a regular parametric model with $d$ parameters and prior $\pi$ continuous and positive at the posterior mode. Under the regularity conditions of Topic 14 §14.7 (existence of the MLE $\hat\theta$; positive-definite observed information),

$$\log m(y) = \log p(y \mid \hat\theta) - \frac{d}{2}\log n + O_p(1).$$

The right-hand side (times $-2$) is BIC; the $O_p(1)$ remainder absorbs the Occam factor (prior volume, Hessian determinant, Gaussian-integral constants) that doesn’t scale with $n$.
Proof 2 — Thm 4 (Occam ⇔ BIC)
Step 1 — start from the Laplace expansion. Topic 25 §25.8 Rem 16 (citing Topic 24 §24.4 Proof 2) gives

$$m(y) = p(y \mid \tilde\theta)\,\pi(\tilde\theta)\,(2\pi)^{d/2}\,|H|^{-1/2}\,\big(1 + O(n^{-1})\big),$$

where $H = -\nabla^2 \log\big[p(y \mid \theta)\,\pi(\theta)\big]\big|_{\theta = \tilde\theta}$ is the Hessian of the unnormalized log-posterior at the mode $\tilde\theta$.

Step 2 — dominant Hessian term. Under regularity, the prior’s second-derivative contribution is $O(1)$ (the prior does not scale with $n$), while the log-likelihood Hessian satisfies $-\nabla^2 \log p(y \mid \theta) = n\,\bar J_n(\theta)$ with $\bar J_n$ converging to the positive-definite Fisher information. Hence

$$|H| = n^d\,\big|\bar J_n(\tilde\theta)\big|\,\big(1 + o_p(1)\big),$$

so

$$-\tfrac12\log|H| = -\tfrac{d}{2}\log n + O_p(1).$$

Step 3 — MAP vs MLE. $\tilde\theta - \hat\theta \to 0$ in probability (the prior’s score and Hessian are $O(1)$ while the likelihood’s are $O(n)$, so the MAP’s perturbation from the MLE is $O_p(n^{-1})$ — the MAP side of the Bernstein–von Mises argument, Topic 25 §25.8), so $\log p(y \mid \tilde\theta) = \log p(y \mid \hat\theta) + O_p(1)$.

Step 4 — substitute.

$$\log m(y) = \log p(y \mid \hat\theta) - \frac{d}{2}\log n + O_p(1).$$

Step 5 — multiply by $-2$.

$$-2\log m(y) = -2\log p(y \mid \hat\theta) + d\log n + O_p(1) = \mathrm{BIC} + O_p(1).$$
∎ — using Topic 24 §24.4 Proof 2; Topic 25 §25.8 Rem 16; Topic 14 §14.11 Rem 9; KAS1995 §4.1.
Pedagogical reading. The term $\tfrac{d}{2}\log(2\pi) - \tfrac12\log|H|$ is the log posterior volume around the mode — a Gaussian with covariance $H^{-1}$ occupies volume $(2\pi)^{d/2}|H|^{-1/2}$. The prior volume is the integrated prior mass over the parameter space; for a bounded-support or weakly-informative prior this is $O(1)$. The Occam factor

$$\mathrm{Occam} = \frac{\text{posterior volume}}{\text{prior volume}} \approx \pi(\tilde\theta)\,(2\pi)^{d/2}\,|H|^{-1/2} \asymp n^{-d/2}$$

is the multiplicative penalty a model pays for having $d$ parameters — the factor by which the parameter space must shrink to concentrate on the data. Taking $-2\log$ turns this into the additive $d\log n$ penalty of BIC. BIC is not a heuristic; it is the asymptotic Occam factor.
For $y = 12$, $n = 20$ with uniform prior, the unnormalized log-posterior is $\ell(\theta) = y\log\theta + (n - y)\log(1 - \theta)$ (ignoring the binomial coefficient). The mode is $\tilde\theta = y/n = 0.6$, and the second derivative at the mode is $\ell''(\tilde\theta) = -y/\tilde\theta^2 - (n - y)/(1 - \tilde\theta)^2 \approx -83.3$. So $H \approx 83.3$ and the Laplace estimate (after adding the binomial normalization $\log\binom{20}{12} \approx 11.744$) is

$$\log \hat m_{\text{Lap}} = \ell(\tilde\theta) + \tfrac12\log(2\pi) - \tfrac12\log H + \log\binom{20}{12} \approx -3.01,$$

close to the closed-form $-3.0445$ (Ex 1). The Laplace approximation is accurate to relative error $O(n^{-1})$ under regularity — impressive given its closed-form simplicity — but deteriorates for small $n$ or multimodal posteriors. marginalLikelihoodLaplace in bayes.ts handles the general multivariate case via Cholesky log-determinant.
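A 1-D sketch of the Laplace recipe on this example (the multivariate marginalLikelihoodLaplace in bayes.ts is assumed to replace the scalar second derivative with a Cholesky log-determinant):

```typescript
// log m̂ = logJoint(mode) + ½log(2π) − ½log H, with H = −d²/dθ² logJoint at
// the mode, estimated here by a central finite difference.
function laplaceLogMarginal1D(
  logJoint: (t: number) => number, mode: number, h = 1e-4,
): number {
  const d2 = (logJoint(mode + h) - 2 * logJoint(mode) + logJoint(mode - h)) / (h * h);
  return logJoint(mode) + 0.5 * Math.log(2 * Math.PI) - 0.5 * Math.log(-d2);
}

// Beta-Binomial y = 12, n = 20, uniform prior; log C(20,12) ≈ 11.7438.
const y12 = 12, n20 = 20, logChoose = 11.7438;
const logJoint = (t: number) =>
  logChoose + y12 * Math.log(t) + (n20 - y12) * Math.log(1 - t);
console.log(laplaceLogMarginal1D(logJoint, y12 / n20)); // ≈ −3.01 (closed form −3.0445)
```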
Let $q_1, q_2$ be unnormalized densities on a common support with normalizing constants $Z_1, Z_2$ and normalized densities $p_i = q_i/Z_i$. For any bridge function $h$ such that both expectations below exist and are nonzero,

$$\frac{Z_1}{Z_2} = \frac{E_{p_2}\big[q_1(\theta)\,h(\theta)\big]}{E_{p_1}\big[q_2(\theta)\,h(\theta)\big]}.$$
Proof 3 — Thm 5 (bridge identity)
Step 1 — expand numerator. Write the expectation as an integral against the base density $p_2 = q_2/Z_2$:

$$E_{p_2}[q_1 h] = \int q_1(\theta)\,h(\theta)\,\frac{q_2(\theta)}{Z_2}\,d\theta = \frac{1}{Z_2}\int q_1(\theta)\,q_2(\theta)\,h(\theta)\,d\theta.$$

Step 2 — expand denominator. Similarly

$$E_{p_1}[q_2 h] = \frac{1}{Z_1}\int q_1(\theta)\,q_2(\theta)\,h(\theta)\,d\theta.$$

Step 3 — ratio. The common integral cancels:

$$\frac{E_{p_2}[q_1 h]}{E_{p_1}[q_2 h]} = \frac{Z_1}{Z_2}.$$
∎ — using MEN1996 Eq. 2.1; GEL1998 §3.
Application to marginal-likelihood estimation. Take $q_1(\theta) = p(y \mid \theta)\,\pi(\theta)$ (unnormalized posterior with $Z_1 = m(y)$) and $q_2 = g$ (a proposal density with known normalizer $Z_2 = 1$). With $N_1$ posterior draws $\theta_i$ from $p(\theta \mid y)$ (via Topic 26 MCMC) and $N_2$ proposal draws $\tilde\theta_j$ from $g$, the estimator is

$$\hat m(y) = \frac{\frac{1}{N_2}\sum_{j=1}^{N_2} q_1(\tilde\theta_j)\,h(\tilde\theta_j)}{\frac{1}{N_1}\sum_{i=1}^{N_1} g(\theta_i)\,h(\theta_i)}.$$
Corollary 5.1 (Meng–Wong optimal bridge, stated). The bridge function minimizing the relative mean-squared error of the estimator is

$$h^{\mathrm{opt}}(\theta) \propto \frac{1}{s_1\,q_1(\theta) + s_2\,Z_1\,q_2(\theta)}, \qquad s_i = \frac{N_i}{N_1 + N_2}.$$

Since $h^{\mathrm{opt}}$ depends on the unknown $Z_1 = m(y)$, it is computed by fixed-point iteration: initialize $\hat Z_1^{(0)}$ (e.g., an importance-sampling estimate), update $h$, recompute $\hat Z_1$ from the theorem, iterate until the change falls below tolerance. The function bridgeSamplingEstimate in bayes.ts implements the equal-sample-size ($N_1 = N_2$) case with logsumexp throughout for numerical stability. Full derivation of the optimality: MEN1996 §3, Theorem 1.
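A log-space sketch of the equal-sample-size fixed point, on 1-D draws for simplicity (the bayes.ts version is assumed to generalize):

```typescript
function logSumExp(xs: number[]): number {
  const m = Math.max(...xs);
  return m + Math.log(xs.reduce((s, x) => s + Math.exp(x - m), 0));
}

// logQ1: unnormalized log-posterior; logQ2: log-density of the proposal
// (normalized, Z2 = 1, so the fixed point converges to log m(y)).
function bridgeSamplingEstimate(
  postDraws: number[], propDraws: number[],
  logQ1: (t: number) => number, logQ2: (t: number) => number,
  maxIter = 100, tol = 1e-10,
): number {
  const N = postDraws.length; // equal sample sizes: s1 = s2 = 1/2 cancels in the ratio
  const l1 = postDraws.map((t) => logQ1(t) - logQ2(t)); // log(q1/q2) at posterior draws
  const l2 = propDraws.map((t) => logQ1(t) - logQ2(t)); // log(q1/q2) at proposal draws
  let logM = 0; // current estimate of log m(y)
  for (let it = 0; it < maxIter; it++) {
    // optimal bridge h ∝ 1/(q1 + m·q2); numerator E_g[(q1/q2)/((q1/q2)+m)]
    const num = logSumExp(l2.map((l) => l - logSumExp([l, logM]))) - Math.log(N);
    // denominator E_post[1/((q1/q2)+m)]
    const den = logSumExp(l1.map((l) => -logSumExp([l, logM]))) - Math.log(N);
    const next = num - den; // updated log m(y)
    if (Math.abs(next - logM) < tol) return next;
    logM = next;
  }
  return logM;
}
```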
Run bridge sampling on the Beta-Binomial problem with 8000 posterior draws from Beta(13, 9) and 8000 proposal draws from a deliberately-offset Beta(12.7, 9.1). The iterative fixed point converges in 3 iterations to the closed-form $\log m(y) = -\log 21 \approx -3.0445$ (Ex 1) within Monte-Carlo error. Over 30 independent MC seeds the spread is two orders of magnitude tighter than naive importance sampling on the same sample budget. The BridgeSamplingConvergence component below traces the iteration on three preset problems.
Let $\pi_\beta(\theta) \propto p(y \mid \theta)^\beta\,\pi(\theta)$ be the power posterior at inverse temperature $\beta \in [0, 1]$. Then

$$\log m(y) = \int_0^1 E_{\pi_\beta}\big[\log p(y \mid \theta)\big]\,d\beta.$$

The estimator replaces the integral with trapezoidal quadrature over a discrete $\beta$-grid $0 = \beta_0 < \beta_1 < \dots < \beta_T = 1$, and each expectation with an MCMC average over draws from $\pi_{\beta_t}$.
Proof sketch. Let $Z_\beta = \int p(y \mid \theta)^\beta\,\pi(\theta)\,d\theta$. Differentiate with respect to $\beta$ under the integral: $\frac{d}{d\beta}\log Z_\beta = E_{\pi_\beta}[\log p(y \mid \theta)]$. Integrate from $\beta = 0$ to $\beta = 1$; $Z_0 = 1$ (prior is normalized), $Z_1 = m(y)$. Full derivation: GEL1998 §3. The function pathSamplingEstimate in bayes.ts implements the trapezoidal quadrature.
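A sketch of the quadrature step, assuming the per-temperature expectations have already been estimated from power-posterior draws:

```typescript
// Trapezoidal thermodynamic integration: log m(y) ≈ Σ_t ½(Ē_t + Ē_{t−1}) Δβ_t,
// where Ē_t estimates E_{π_{β_t}}[log p(y|θ)] from MCMC draws at temperature β_t.
function pathSamplingEstimate(betas: number[], meanLogLik: number[]): number {
  let logM = 0;
  for (let t = 1; t < betas.length; t++) {
    logM += 0.5 * (meanLogLik[t] + meanLogLik[t - 1]) * (betas[t] - betas[t - 1]);
  }
  return logM;
}
```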
Same inputs as Ex 5. Discretize $\beta$ on $\{0, 0.1, \dots, 1\}$ (11 points), sample 2000 draws per $\beta_t$ via MCMC targeting $\pi_{\beta_t}$, compute $E_{\pi_{\beta_t}}[\log p(y \mid \theta)]$ as the sample mean at each temperature, and trapezoidally integrate. Result: an estimate within Monte-Carlo error of the closed form — slightly looser than bridge at the same total sample budget because path sampling spends compute on off-target temperatures. Where path sampling wins: the integrand is smooth in $\beta$, so even a coarse grid gives a decent answer; where it loses: choosing the $\beta$-schedule is a tuning decision with no universal rule (equidistant is fine for quick-and-dirty, geometric is better for skewed posteriors).
Define the prior-mass function

$$X(\lambda) = \int_{\{\theta:\ p(y \mid \theta) > \lambda\}} \pi(\theta)\,d\theta$$

— the prior volume at likelihood level $\lambda$ or higher. $X$ is monotone decreasing in $\lambda$ with $X(0) = 1$ and $X(\lambda) \to 0$ as $\lambda$ approaches the likelihood maximum. Then

$$m(y) = \int_0^1 L(X)\,dX,$$

where $L(X)$ — the inverse of $X(\lambda)$ — is the likelihood at prior-mass level $X$.

Skilling’s estimator simulates iid live-point draws from the prior, removes the lowest-likelihood point, replaces it with a constrained-prior draw at higher likelihood, and records the removed point’s likelihood. After $M$ iterations with $N$ live points, the dead-point likelihoods $L_i$ are associated with expected prior-mass shrinkage $X_i \approx e^{-i/N}$, and

$$\hat m(y) = \sum_{i=1}^{M} L_i\,(X_{i-1} - X_i).$$
The function nestedSamplingEstimate in bayes.ts implements the log-space version of this sum. Nested sampling’s selling point is that it handles multimodal posteriors gracefully (the constrained-prior sampling naturally explores all modes above the current likelihood threshold) and produces the marginal likelihood as its primary output rather than a byproduct.
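A log-space sketch of Skilling’s sum, given the dead-point log-likelihoods in removal order (the bayes.ts signature is assumed):

```typescript
// m̂(y) = Σ_i L_i (X_{i−1} − X_i) with deterministic shrinkage X_i = e^{−i/N},
// computed entirely in log space: log w_i = log(X_{i−1} − X_i).
function nestedSamplingEstimate(logL: number[], nLive: number): number {
  const terms = logL.map((ll, i) => {
    const logXprev = -i / nLive;
    const logXcur = -(i + 1) / nLive;
    const logW = logXprev + Math.log(1 - Math.exp(logXcur - logXprev));
    return ll + logW;
  });
  const m = Math.max(...terms); // logsumexp over the quadrature terms
  return m + Math.log(terms.reduce((s, t) => s + Math.exp(t - m), 0)); // log m̂(y)
}
```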
The four estimators above all assume a continuous prior $\pi(\theta)$. For sparsity-inducing priors — horseshoe (Carvalho–Polson–Scott 2010), spike-and-slab (Mitchell–Beauchamp 1988), Bayesian lasso — the prior has a sharp peak or mixture structure that Laplace handles poorly and for which bridge sampling requires careful proposal tuning. Path sampling and nested sampling cope better but need fine $\beta$-grids or dense live-point arrangements near the origin. §27.10 Rem 26 points to formalml for the full development.
Newton & Raftery’s (1994) harmonic-mean estimator converts posterior draws into a marginal-likelihood estimate with no proposal — just $1/p(y \mid \theta_s)$ averaged over posterior draws $\theta_s$. Elegant and seductive, but almost always has infinite variance: the integrand $1/p(y \mid \theta)$ has a heavy right tail under the posterior, which is proportional to $p(y \mid \theta)\,\pi(\theta)$ and so places little mass exactly where the integrand explodes. Radford Neal famously called it “the worst Monte Carlo method ever proposed.” harmonicMeanEstimate in bayes.ts is kept for pedagogical contrast — use it to demonstrate the pathology, never in production.
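For contrast, a log-space sketch of the pathological estimator; note that the result is dominated by the single smallest log-likelihood in the sample, which is the infinite-variance problem in miniature:

```typescript
// Harmonic mean: 1/m̂ = (1/S) Σ_s 1/p(y|θ_s), θ_s ~ posterior. Demo only.
function harmonicMeanEstimate(logLikAtPosteriorDraws: number[]): number {
  const neg = logLikAtPosteriorDraws.map((l) => -l);
  const m = Math.max(...neg); // = −(smallest log-likelihood drawn)
  const S = neg.length;
  return -(m + Math.log(neg.reduce((s, v) => s + Math.exp(v - m), 0) / S));
}
```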
- Bridge sampling is usually the default — stable, fast, handles non-Gaussian posteriors well, works for any dimension where MCMC works.
- Path sampling is the right choice when you already have power-posterior draws (e.g., from parallel tempering) — the marginal-likelihood estimate is a cheap byproduct.
- Nested sampling wins for strongly multimodal posteriors or when the marginal likelihood is the primary target (astrophysics model comparison is where nested sampling originated).
- Laplace is the starting point: a 30-second sanity check you should run before investing in any MCMC-based estimator. If Laplace and bridge disagree by orders of magnitude, one of them is wrong and you need to figure out which.
An even sharper variant of Lindley’s paradox, due to Stone (1982) and Dawid, Stone & Zidek (1973): with improper (non-integrable) priors, the marginal likelihood is undefined, and any Bayes factor between models with improper priors is arbitrary up to a constant. The consequence: never use improper priors for model comparison. Proper priors with sensible scale (Rem 12) are the robust default. Topic 27 doesn’t belabor this case because the brief prescribed scope ends at proper priors; formalml’s “Objective Bayes” topic develops the full story.
Four downstream constructions consume $\hat m(y)$: (i) posterior model probabilities via §27.2 Thm 1, (ii) Bayes factors via §27.4 Thm 2, (iii) BMA weights via §27.7, (iv) evidence-level hypothesis testing as in §27.5’s Lindley paradox. All four treat $\hat m(y)$ as a deterministic input, but MC variance in $\hat m$ propagates to each — the MC standard error of $\log \hat m$ carries through additively to the log-BF, and a log-scale error multiplies through in BF space. Good enough for applied work; worth propagating via bootstrap for high-stakes comparisons.
27.7 Bayesian model averaging
When multiple models have nonzero posterior probability, the principled Bayesian answer is not “pick the best” but average:

$$p(\tilde y \mid y) = \sum_{k=1}^{K} p(\mathcal{M}_k \mid y)\;p(\tilde y \mid y, \mathcal{M}_k).$$

The predictive distribution for new data is a mixture of per-model predictives weighted by posterior model probabilities. This is Bayesian model averaging (BMA), and it is the honest way to produce predictions under model uncertainty.
Under the log-score loss $-\log q(\tilde y)$, the optimal predictive distribution — the one minimizing expected loss under the joint posterior over $(\mathcal{M}_k, \theta_k)$ — is the BMA mixture

$$q^*(\tilde y) = \sum_{k} p(\mathcal{M}_k \mid y)\,p(\tilde y \mid y, \mathcal{M}_k).$$

Any non-BMA predictive — hard model selection, stacking, equal-weight averaging — incurs strictly greater expected log-loss under the joint posterior. Proof: standard Bayesian-decision-theory argument; see HOE1999 §2.
Thm 8’s caveat is M-closed: the theorem assumes the true data-generating process is in . Under M-open (§27.2 Rem 3), BMA’s weights concentrate on whichever model is closest to truth — which may still be far from truth, leaving BMA over-confident in a single approximation. Stacking (formalml) directly optimizes predictive weights and often outperforms BMA under M-open.
Synthetic data: $n$ observations from a cubic trend plus Gaussian noise. Fit three polynomial models — degree 1 (linear), degree 2 (quadratic), degree 3 (cubic) — and compare via BIC-approximated marginal likelihoods. For this specific sample, BIC strongly prefers degree 3 (posterior model probability near 1), so BMA effectively collapses to hard selection. The BMAPredictiveComparison component lets you flip to uniform weights or single-model weights and watch the predictive-fan change — uniform BMA produces a noticeably wider predictive band in the mid-range where the three models disagree most. The lesson: BMA’s predictive uncertainty reflects model-set uncertainty; hard selection under-reports it.
Thm 8 says BMA is the log-loss optimal predictive — it does not say BMA’s posterior model probabilities are calibrated measures of “which model is true.” A prior over models times a marginal likelihood gives a posterior over models; calling one of those probabilities “the probability that is true” requires M-closed. In applied work, treat BMA weights as predictive weights (sum to 1, used for averaging) rather than truth probabilities (calibrated measures of model correctness). The distinction matters when you report “the probability that the effect is real is 0.95” — that claim is BMA’s weight, not a frequentist coverage guarantee.
Applied BMA over large candidate sets is expensive and often fruitless — most models have negligible posterior probability. Madigan & Raftery (1994) propose Occam’s window: restrict the candidate set to models whose posterior probability is within a factor $C$ (typically 20 or 100) of the best model’s. This loses strict decision-theoretic optimality but is tractable. Most applied BMA software implements Occam’s window by default; bmaPredictive in bayes.ts takes pre-computed weights and assumes the pruning (if any) has already happened.
Stacking (Breiman 1996, and Yao, Vehtari, Simpson & Gelman 2018 for the Bayesian version) directly optimizes predictive weights on held-out data rather than deriving them from marginal likelihoods. Under M-closed, stacking and BMA converge asymptotically; under M-open, stacking typically wins (it’s optimizing the thing you care about — predictive performance — rather than assuming the true model is enumerated). The BMAPredictiveComparison component sticks to BMA because Topic 27 is about model comparison under Bayesian principles; formalml’s stacking topic develops the M-open alternative.
27.8 Predictive scoring: DIC, WAIC, PSIS-LOO
Marginal likelihood comparison (Bayes factors, BMA) is one regime; predictive-score comparison is another. Predictive scores target a different question: how well will the model predict new data from the same DGP? This is the Topic 24 question in Bayesian dress. Three estimators dominate applied work, in increasing order of sophistication:
- DIC (Deviance Information Criterion, Spiegelhalter 2002) — $\mathrm{DIC} = D(\bar\theta) + 2p_D$, where $\bar\theta$ is the posterior mean, $D(\theta) = -2\log p(y \mid \theta)$ is the deviance, and $p_D = \overline{D(\theta)} - D(\bar\theta)$ is the effective parameter count. Fast, requires only posterior-mean log-likelihood and an expectation over posterior draws.
- WAIC (Widely Applicable Information Criterion, Watanabe 2010) — pointwise $\widehat{\mathrm{lppd}} = \sum_i \log\big(\tfrac{1}{S}\sum_s p(y_i \mid \theta_s)\big)$, $p_{\mathrm{WAIC}} = \sum_i \operatorname{Var}_s\big[\log p(y_i \mid \theta_s)\big]$, $\mathrm{WAIC} = -2\big(\widehat{\mathrm{lppd}} - p_{\mathrm{WAIC}}\big)$. Unlike DIC, uses the full posterior distribution rather than just the mean.
- PSIS-LOO (Pareto-Smoothed Importance Sampling LOO, Vehtari 2017) — leave-one-out cross-validation via importance-weighting posterior draws by $w_s \propto 1/p(y_i \mid \theta_s)$, with Pareto smoothing of extreme weights. The gold standard for Bayesian predictive-score comparison when feasible.
All three target $\mathrm{elpd}$, the expected log pointwise predictive density. Vehtari–Gelman–Gabry 2017 argues for PSIS-LOO over WAIC over DIC in applied practice (more accurate, better uncertainty quantification), but all three are consistent under regularity and all three beat AIC/BIC for posterior-honest predictive comparison.
Back to the canonical tiny problem: Normal observations with a conjugate prior on the mean, and 2000 posterior draws. Compute the deviance $D(\theta_s)$ at each draw, take the posterior mean $\overline{D}$, compare to $D(\bar\theta)$ at the posterior-mean $\bar\theta$. The effective parameter count $p_D = \overline{D} - D(\bar\theta) \approx 1$ (as expected for a one-parameter model under weak prior regularity), and $\mathrm{DIC} = D(\bar\theta) + 2p_D$. The dic function in bayes.ts computes this in one line per input.
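A sketch matching that description (the dic signature in bayes.ts is an assumption):

```typescript
// DIC = D(θ̄) + 2 p_D, with p_D = mean deviance − deviance at the posterior
// mean. Inputs: D(θ_s) = −2 log p(y|θ_s) per posterior draw, and D(θ̄).
function dic(
  devianceDraws: number[], devianceAtPostMean: number,
): { pD: number; dic: number } {
  const meanDev = devianceDraws.reduce((s, d) => s + d, 0) / devianceDraws.length;
  const pD = meanDev - devianceAtPostMean;
  return { pD, dic: devianceAtPostMean + 2 * pD };
}
```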
Fit a simple linear regression with flat prior and compute both WAIC and PSIS-LOO on the same posterior draws. For a correctly-specified well-conditioned problem both land within 0.1% of each other; for a poorly-specified or outlier-heavy regression, PSIS-LOO can be noticeably larger on the deviance scale, with high Pareto-$\hat k$ diagnostics on the outliers. The waic and psisLoo functions in bayes.ts both accept the same pointwise log-likelihood matrix and return compatible outputs — a direct drop-in comparison with no re-sampling required.
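A sketch of the WAIC computation from the S×n pointwise log-likelihood matrix (the shared input format mentioned above; the bayes.ts return shape is assumed):

```typescript
// lppd_i = log((1/S) Σ_s exp(logLik[s][i])), p_i = Var_s(logLik[s][i]);
// WAIC = −2(Σ_i lppd_i − Σ_i p_i).
function waic(logLik: number[][]): { elpd: number; pWaic: number; waic: number } {
  const S = logLik.length;
  const n = logLik[0].length;
  let lppd = 0, pWaic = 0;
  for (let i = 0; i < n; i++) {
    const col = logLik.map((row) => row[i]);
    const m = Math.max(...col); // logsumexp for the pointwise predictive density
    lppd += m + Math.log(col.reduce((s, v) => s + Math.exp(v - m), 0) / S);
    const mean = col.reduce((s, v) => s + v, 0) / S;
    pWaic += col.reduce((s, v) => s + (v - mean) ** 2, 0) / (S - 1);
  }
  return { elpd: lppd - pWaic, pWaic, waic: -2 * (lppd - pWaic) };
}
```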
PSIS-LOO’s reliability hinges on the Pareto shape parameter $\hat k$ per observation. Vehtari’s rule of thumb: $\hat k < 0.5$ means PSIS is reliable; $0.5 < \hat k \le 0.7$ is borderline; $\hat k > 0.7$ flags the observation as an outlier whose LOO estimate is unreliable — usually a sign of model misspecification on that point. The psisLoo function returns both the aggregate LOO value and a per-observation $\hat k$ vector plus a nProblematic count. Operationally: always check the $\hat k$ distribution before reporting LOO values.
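For intuition about why the smoothing is needed, here is a sketch of the raw (unsmoothed) importance-sampling LOO that PSIS stabilizes: with weights $w_s \propto 1/p(y_i \mid \theta_s)$, the self-normalized estimate collapses to a per-observation harmonic mean, the same heavy-tailed object as the harmonic-mean estimator of §27.6.

```typescript
// Raw IS-LOO (no Pareto smoothing; for pedagogy only — psisLoo is assumed
// to fit a generalized Pareto to the largest weights before averaging).
function isLooElpd(logLik: number[][]): number {
  const S = logLik.length;
  let elpd = 0;
  for (let i = 0; i < logLik[0].length; i++) {
    const negLl = logLik.map((row) => -row[i]); // log weights, up to a constant
    const m = Math.max(...negLl);
    const logMeanInv =
      m + Math.log(negLl.reduce((s, v) => s + Math.exp(v - m), 0) / S);
    elpd += -logMeanInv; // log harmonic mean of p(y_i | θ_s)
  }
  return elpd;
}
```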
DIC has well-documented pathologies: $p_D$ can be negative (spurious), doesn’t handle mixture models or hierarchical models well, and depends on the choice of posterior summary (posterior mean $\bar\theta$ vs posterior median, say). Spiegelhalter himself later (Spiegelhalter et al. 2014) recommended treating DIC as a legacy tool — useful for historical comparison with BUGS-era literature, not a default for new analyses. Modern Bayesian practice defaults to PSIS-LOO or WAIC.
Watanabe (2010) proved the asymptotic equivalence of WAIC and Bayesian leave-one-out cross-validation under singular learning theory — an equivalence that holds under substantially weaker regularity than DIC/AIC/BIC (no requirement of Fisher-information regularity). The proof requires machinery outside Topic 27’s scope; the result is stated-and-cited here. Practically: WAIC and LOO agree to 3–4 significant figures on most well-behaved problems.
- Bayes factors answer: “Under the prior over models, how did the data update my beliefs?” Prior-informative question; sensitive to within-model priors (Rem 7).
- Predictive scores answer: “Which model predicts new data from the same DGP best?” Data-informative question; insensitive to within-model priors as long as they’re proper and the posterior is informed.
Both are valid, neither universally best. Applied workflow (GEL2013 §7): report both, note where they agree and disagree, and interpret the disagreement. Predictive scores for “which model to use for forecasts”; Bayes factors for “which model best explains the observed data.”
27.9 Local FDR, Bayesian multiplicity, and PPCs
Topic 20 developed the Benjamini–Hochberg procedure for controlling the false discovery rate across many simultaneous tests. That’s a frequentist procedure: it controls the expected proportion of false discoveries under $H_0$. The Bayesian counterpart, developed in Efron (2010), is the local false discovery rate:

$$\mathrm{fdr}(z) = \Pr(\text{null} \mid z) = \frac{\pi_0\,f_0(z)}{f(z)},$$

where $f_0$ is the null density (typically $N(0,1)$), $f_1$ the alternative, $\pi_1 = 1 - \pi_0$ the prior probability of an alternative, and $f(z) = \pi_0 f_0(z) + \pi_1 f_1(z)$ the mixture density of observed z-scores. The two-groups model assumes observed z-scores come from this mixture: null scores from $f_0$, alternative scores from $f_1$, and the task is to recover per-observation posterior probabilities of being from the null.
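A sketch with a fully specified mixture; in practice $f$ is estimated from the z-scores themselves, so the closed-form $f$ here (and the parameterization) is an illustration assumption:

```typescript
// Two-groups local FDR under the theoretical null f0 = N(0,1):
// fdr(z) = π0 f0(z) / f(z), with f(z) = π0 f0(z) + (1−π0) f1(z).
const normalPdf = (z: number, mu = 0, sd = 1) =>
  Math.exp(-0.5 * ((z - mu) / sd) ** 2) / (sd * Math.sqrt(2 * Math.PI));

function localFdrAt(z: number, pi0: number, mu1: number, sd1: number): number {
  const f0 = normalPdf(z);
  const f = pi0 * f0 + (1 - pi0) * normalPdf(z, mu1, sd1);
  return (pi0 * f0) / f; // posterior probability of "null" given z
}
```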
Simulate 1000 z-scores from the two-groups mixture $\pi_0 N(0,1) + (1 - \pi_0)N(\mu_1, 1)$ — Efron 2010’s running example. Apply BH at level $q$ (expect a proportion $q$ of false discoveries in the rejection set) and local FDR at an $\mathrm{fdr}(z)$ threshold. The two methods produce overlapping but distinct rejection sets: BH rejects a contiguous tail $\{|z| \ge c\}$; local FDR rejects individual z-scores wherever the posterior $\mathrm{fdr}(z)$ falls below the threshold, which can leave moderate-z scores unrejected while rejecting extreme outliers. The LocalFDRExplorer component visualizes both on adjustable mixtures. Closed-form checks at specific z values are pinned in brief §6.2 T27.13/T27.14.
Gelman, Meng & Stern (1996) define the Bayesian posterior predictive p-value as

$$p_B = \Pr\big(T(y^{\mathrm{rep}}, \theta) \ge T(y, \theta) \,\big|\, y\big),$$

averaged over the posterior of $\theta$. For $\theta$-independent statistics (sample mean, max, min, quantiles), this simplifies to the empirical proportion of replicated datasets exceeding the observed $T(y)$. Interpretation: $p_B \approx 0.5$ means the observed $T(y)$ is typical under the posterior; extreme $p_B$ (near 0 or 1) signals model misfit. Under a well-specified model, $p_B$ is approximately uniform — but crucially not sub-uniform like a frequentist p-value, so a small $p_B$ does not have the same Type I error guarantee. Use PPCs as a model-criticism tool, not as a hypothesis test.
Efron’s two-groups model assumes $f_0 = N(0,1)$ (the theoretical null) — what you’d expect under standard z-test assumptions. In practice, real-world z-score distributions often have a wider null due to miscalibration, unmodeled correlations, or dependence structure — Efron calls this the empirical null. Fitting $f_0$ from the central 50% of the z-distribution (central-matching) recovers the empirical null; substituting it for the theoretical null often changes rejection sets dramatically. The localFdr function accepts a theoreticalNull: false flag to switch on empirical-null estimation.
The Bayesian posterior predictive p-value looks like a frequentist p-value but behaves differently. GEL1996 warned: $p_B$ is “prior-predictive calibrated” rather than “frequentist-calibrated” — under a correctly specified model with a prior, $p_B$ is uniform on $[0, 1]$ averaged over the prior, not for a fixed true $\theta$. In particular, thresholding $p_B$ at 0.05 does not control Type I error at 5%. PPCs are graphical and diagnostic — use them to locate model inadequacy (tail behavior, skewness, dependence), not to make accept/reject decisions.
27.10 Forward-look
Topic 27 delivers Track 7’s computational and inferential capstone. Four threads extend into the remaining curriculum:
Topic 28 closes Track 7 by combining Topic 26’s MCMC machinery with Topic 27’s model-comparison framework for hierarchical models — partial pooling across groups, hyperparameter inference, hierarchical BMA. The 8-schools example from §26.9 gets its proper BMA treatment there: different group-level priors compared via Bayes factors, weighted by posterior model probabilities, predictive-averaged for out-of-sample forecasts.
When truth lies outside the candidate set (the M-open case, §27.2 Rem 3), BMA’s weights concentrate on the best approximating model — which may still be a bad approximation. Stacking, developed by Wolpert (1992) and extended to Bayesian form by Yao et al. (2018), directly optimizes predictive weights on held-out data. Generally outperforms BMA on M-open problems; often underperforms on strictly M-closed problems. The choice: know which regime you’re in, report both, interpret the disagreement.
When MCMC + bridge sampling is too expensive (large models, big data), the variational lower bound (ELBO) gives a tractable approximation to $\log m(y)$. Variational Bayes for model selection develops the KL-projection arguments that justify the approximation, the mean-field decomposition that makes it tractable, and the automatic-differentiation tools that make it routine. The trade-off: the ELBO is always an under-estimate of the true $\log m(y)$, so ELBO-based model selection can be biased. Modern normalizing-flow variational families close the gap.
When the model space is discrete and high-dimensional (e.g., variable selection with $2^p$ candidate models), separate MCMC + bridge-sampling runs for each model are infeasible. Reversible-jump MCMC (Green 1995) samples directly from the joint posterior over $(\mathcal{M}_k, \theta_k)$, providing BMA weights as a byproduct. Implementation is delicate — proposal distributions across models of different dimension require careful Jacobian bookkeeping — but the payoff is enormous on structured model spaces.
The “prior matters” lesson of §27.5 motivates an entire subfield: Bernardo reference priors systematically construct priors that maximize the expected KL-divergence between prior and posterior — priors that “let the data speak as much as possible.” Under regularity, reference priors coincide with Jeffreys priors in 1-D and generalize to multi-parameter settings via sequential conditional reference. Bayes factors under reference priors have stability properties the Lindley paradox notably lacks.
Topic 27 discharges the forward promises made in Topics 17, 18, 20, 24, 25, and 26 — every hypothesis-testing counterpart, every marginal-likelihood estimator, every local-FDR framework, every BMA pointer. Topic 28 closes Track 7 by applying this machinery to hierarchical structures, where model uncertainty and parameter uncertainty compound and BMA becomes the natural framework rather than an added layer.
References
- Robert E. Kass & Adrian E. Raftery. (1995). Bayes Factors. Journal of the American Statistical Association, 90(430), 773–795.
- Xiao-Li Meng & Wing Hung Wong. (1996). Simulating Ratios of Normalizing Constants via a Simple Identity: A Theoretical Exploration. Statistica Sinica, 6(4), 831–860.
- Andrew Gelman & Xiao-Li Meng. (1998). Simulating Normalizing Constants: From Importance Sampling to Bridge Sampling to Path Sampling. Statistical Science, 13(2), 163–185.
- John Skilling. (2006). Nested Sampling for General Bayesian Computation. Bayesian Analysis, 1(4), 833–859.
- Sumio Watanabe. (2010). Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory. Journal of Machine Learning Research, 11, 3571–3594.
- Aki Vehtari, Andrew Gelman & Jonah Gabry. (2017). Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC. Statistics and Computing, 27(5), 1413–1432.
- Bradley Efron. (2010). Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge University Press.
- David J. Spiegelhalter, Nicola G. Best, Bradley P. Carlin & Angelika van der Linde. (2002). Bayesian Measures of Model Complexity and Fit. Journal of the Royal Statistical Society: Series B, 64(4), 583–639.
- Dennis V. Lindley. (1957). A Statistical Paradox. Biometrika, 44(1–2), 187–192.
- Harold Jeffreys. (1961). Theory of Probability (3rd ed.). Oxford University Press.
- Jennifer A. Hoeting, David Madigan, Adrian E. Raftery & Chris T. Volinsky. (1999). Bayesian Model Averaging: A Tutorial. Statistical Science, 14(4), 382–401.
- Andrew Gelman, Xiao-Li Meng & Hal Stern. (1996). Posterior Predictive Assessment of Model Fitness via Realized Discrepancies. Statistica Sinica, 6(4), 733–760.
- Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari & Donald B. Rubin. (2013). Bayesian Data Analysis (3rd ed.). CRC Press.