## What does the “Evidence” Accomplish?

Here are two contradictory statements about the evidence (aka marginal likelihood), by two experienced practitioners of Bayesian inference.

“the evidence value is … a quantity of central importance.” – John Skilling

“You never really want to compute the Bayes evidence (fully marginalized likelihood).” – David Hogg

Who is right? If a quantity is of central importance, why would you never want to compute it? To resolve this, we’ll need the definition. If I fit model $M_1$, which has parameters $\theta_1$, to some data $x$, the evidence is

$p(x|M_1) = \int p(\theta_1|M_1)p(x|\theta_1, M_1) \, d\theta_1$.

Similarly, for a different model $M_2$ fitted to the same data, the evidence is

$p(x|M_2) = \int p(\theta_2|M_2)p(x|\theta_2, M_2) \, d\theta_2$.

The reason the evidence is of central importance is that you use it to compute the posterior probabilities of $M_1$ and $M_2$:

$\frac{p(M_1|x)}{p(M_2|x)} = \frac{p(M_1)}{p(M_2)}\times\frac{p(x|M_1)}{p(x|M_2)}$.
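As a rough illustration of the two integrals and the odds formula, here is a minimal sketch on a toy problem. Everything specific in it is an assumption made for the example, not something from the discussion above: a single observation `x` with Gaussian noise of known standard deviation, zero-mean Gaussian priors of different widths for the two models, equal prior model probabilities, and a simple Riemann sum in place of a real integration method.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Density of Normal(mean, sd) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def evidence(x, prior_sd, noise_sd=1.0, n_grid=200_001):
    """Riemann-sum approximation of p(x|M) = ∫ p(θ|M) p(x|θ, M) dθ.

    Toy assumptions: θ ~ Normal(0, prior_sd) and x ~ Normal(θ, noise_sd).
    """
    # Grid wide enough to cover both the prior and the likelihood.
    half_width = 10.0 * (prior_sd + noise_sd) + abs(x)
    theta = np.linspace(-half_width, half_width, n_grid)
    integrand = normal_pdf(theta, 0.0, prior_sd) * normal_pdf(x, theta, noise_sd)
    return float(np.sum(integrand) * (theta[1] - theta[0]))

x = 1.5                            # a single made-up observation
ev_1 = evidence(x, prior_sd=1.0)   # M_1: theta_1 ~ Normal(0, 1)
ev_2 = evidence(x, prior_sd=3.0)   # M_2: theta_2 ~ Normal(0, 3)

prior_odds = 1.0                   # p(M_1) / p(M_2), assumed equal here
bayes_factor = ev_1 / ev_2         # p(x|M_1) / p(x|M_2)
posterior_odds = prior_odds * bayes_factor

print(f"p(x|M_1) = {ev_1:.4f}, p(x|M_2) = {ev_2:.4f}")
print(f"posterior odds M_1 : M_2 = {posterior_odds:.3f}")
```

For this particular toy setup the integral also has a closed form (the evidence is a Gaussian density with variance `prior_sd**2 + noise_sd**2`), which makes it a convenient check on the numerical result. In realistic problems no such shortcut exists, which is where the computational cost comes from.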

So, if the question is “which model is more plausible”, the answer is going to involve the evidences. Why would someone be against computing them? Well, there are two main reasons. One is that it is computationally intensive and there might be some approximation that is cheaper (no argument there). The other reason is that it is possible to get absurd answers by using the evidence blindly without careful thought. No argument there either.

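The main issue is that $\int p(\theta|M)p(x|\theta, M) \, d\theta$ is quite sensitive to the choice of $p(\theta|M)$. Once the prior is much wider than the likelihood, widening it further by some factor lowers the evidence by roughly the same factor, because more of the prior mass sits on parameter values that fit the data poorly. The absurdity (which is sometimes called the Jeffreys-Lindley “paradox”) occurs if you make $p(\theta|M)$ really wide for one of the models, but not so wide for the other model.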
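The sensitivity to prior width can be seen in a small sketch. The setup is hypothetical: one observation `x` with unit Gaussian noise, a Normal(0, 1) prior for the "narrow" model, and a Normal(0, w) prior for the "wide" model, with `w` swept over several values. With Gaussian prior and likelihood the evidence has a closed form, so no numerical integration is needed here.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of Normal(mean, sd) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def evidence(x, prior_sd, noise_sd=1.0):
    """Closed-form evidence for x ~ Normal(theta, noise_sd), theta ~ Normal(0, prior_sd)."""
    return normal_pdf(x, 0.0, math.sqrt(prior_sd**2 + noise_sd**2))

x = 2.5            # made-up observation
narrow_sd = 1.0    # prior width for the "narrow" model (assumed)
widths = [1.0, 10.0, 100.0, 1000.0]   # prior widths tried for the "wide" model

# Bayes factor in favour of the narrow model for each choice of wide prior.
bayes_factors = [evidence(x, narrow_sd) / evidence(x, w) for w in widths]

for w, bf in zip(widths, bayes_factors):
    print(f"wide prior sd = {w:7.1f}   Bayes factor (narrow : wide) = {bf:9.3f}")
```

The data never change in this sweep. Only the width of one prior does, and for large widths each tenfold widening shifts the Bayes factor by about a factor of ten against the wide model.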

The resolution of the Jeffreys-Lindley paradox is not to say the evidence is somehow not the right thing to calculate. Simple mathematics shows that it is the right thing. The resolution is found by not having silly priors. This implies actually testing the consequences of your priors instead of just saying “uniform will do”.

Doing Bayesian model selection is equivalent to having a bigger hypothesis space which consists of $M_1$ and $M_2$ sticky-taped together. A broad prior on $\theta_1$ but not $\theta_2$ would imply that $M_2$ makes much more specific predictions about the data than $M_1$ does. You should only do that if it’s actually a true statement about your prior beliefs!
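In the same notation as above, the sticky-taped hypothesis space can be written as a joint posterior over the model label and that model's parameters (the index $i$ is introduced here just to cover both models at once):

$p(M_i, \theta_i|x) \propto p(M_i)\,p(\theta_i|M_i)\,p(x|\theta_i, M_i)$.

Integrating out the parameters gives

$p(M_i|x) \propto p(M_i)\int p(\theta_i|M_i)p(x|\theta_i, M_i) \, d\theta_i = p(M_i)\,p(x|M_i)$,

and taking the ratio for $i = 1$ and $i = 2$ recovers the posterior odds formula. The priors $p(\theta_i|M_i)$ are therefore just pieces of one big prior, and they have to make sense together.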

This issue shows up a lot when people are interested in whether a parameter takes a specific value or not. For example, is the universe flat ($M_1: \Omega = 1$) or not ($M_2: \Omega \neq 1$)? Doing the model selection is usually like having a prior

$p(\Omega) = \frac{1}{2}\delta(\Omega - 1) + \frac{1}{2}f(\Omega)$.

$f(\Omega)$ should almost never be a uniform distribution. In most applications it should be something with heavy tails, like a Cauchy distribution (as Jeffreys used on these kinds of problems). Here is one problem with the uniform distribution. If $f$ is a Uniform(0.9, 1.1) distribution then $P(0.999 \leq \Omega \leq 1.001 | M_2) = 0.002/0.2 = 1/100$, which is absurdly low. This is like saying “if the universe is not perfectly flat then it is probably not even close to flat”, and consequently “if the universe is not perfectly flat then the data I would expect to see are totally different from what I’d expect if the universe is perfectly flat”.
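The interval probability under $f$ alone is easy to check directly. The sketch below does so for Uniform(0.9, 1.1), and for comparison evaluates a Cauchy distribution centred on $\Omega = 1$. The Cauchy scale of 0.01 is purely an assumption chosen for illustration, and the helper names are hypothetical; the discussion above does not specify either.

```python
import math

def uniform_mass(a, b, lo, hi):
    """P(a <= X <= b) for X ~ Uniform(lo, hi)."""
    overlap = max(0.0, min(b, hi) - max(a, lo))
    return overlap / (hi - lo)

def cauchy_cdf(x, loc, scale):
    """CDF of a Cauchy distribution with the given location and scale."""
    return 0.5 + math.atan((x - loc) / scale) / math.pi

def cauchy_mass(a, b, loc, scale):
    """P(a <= X <= b) for X ~ Cauchy(loc, scale)."""
    return cauchy_cdf(b, loc, scale) - cauchy_cdf(a, loc, scale)

near_flat = (0.999, 1.001)

# Mass that f assigns to "very nearly flat" under each choice of f.
p_uniform = uniform_mass(*near_flat, lo=0.9, hi=1.1)
p_cauchy = cauchy_mass(*near_flat, loc=1.0, scale=0.01)   # scale is an assumption

# Mass the Cauchy leaves outside (0.9, 1.1), where the uniform has none.
p_cauchy_far = 1.0 - cauchy_mass(0.9, 1.1, loc=1.0, scale=0.01)

print(f"Uniform(0.9, 1.1): P(near flat) = {p_uniform:.4f}")
print(f"Cauchy(1, 0.01):   P(near flat) = {p_cauchy:.4f}, P(outside 0.9..1.1) = {p_cauchy_far:.4f}")
```

With this assumed scale, the Cauchy puts more mass close to $\Omega = 1$ than the uniform does, while still allowing values outside the uniform's hard limits. Whether that scale is appropriate is exactly the kind of question that testing the consequences of a prior is meant to answer.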

Conclusion: the evidence values don’t give us a “paradox”, they behave exactly as they should.