What does the “Evidence” Accomplish?

Here are two contradictory statements about the evidence (aka marginal likelihood), by two experienced practitioners of Bayesian inference.

“the evidence value is … a quantity of central importance.” – John Skilling

“You never really want to compute the Bayes evidence (fully marginalized likelihood).” – David Hogg

Who is right? If a quantity is of central importance, why would you never want to compute it? To resolve this, we’ll need the definition. If I fit model M_1, which has parameters \theta_1, to some data x, the evidence is

p(x|M_1) = \int p(\theta_1|M_1)p(x|\theta_1, M_1) \, d\theta_1.

Similarly, for a different model M_2 fitted to the same data, the evidence is

p(x|M_2) = \int p(\theta_2|M_2)p(x|\theta_2, M_2) \, d\theta_2.

The reason the evidence is of central importance is that you use it to compute the posterior probabilities of M_1 and M_2:

\frac{p(M_1|x)}{p(M_2|x)} = \frac{p(M_1)}{p(M_2)}\times\frac{p(x|M_1)}{p(x|M_2)}.

So, if the question is “which model is more plausible”, the answer is going to involve the evidences. Why would someone be against it? Well, there are two main reasons. One is that it is computationally intensive and there might be some approximation that is cheaper (no argument there). The other reason is that it is possible to get absurd answers by using the evidence blindly without careful thought. No argument there either.
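To make the posterior odds formula concrete, here is a minimal sketch in Python. Everything specific in it is an assumption made for illustration: a single measurement x with unit Gaussian noise, Gaussian priors centred at 0 (for M_1) and 3 (for M_2), equal prior model probabilities, and a hypothetical `evidence` helper that does the integral with a simple midpoint rule. Real problems usually need something much smarter than a one-dimensional grid.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a Normal(mean, sd) distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def evidence(prior_pdf, likelihood, lo, hi, n=20000):
    """Midpoint-rule approximation to the integral of prior * likelihood over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        theta = lo + (i + 0.5) * h
        total += prior_pdf(theta) * likelihood(theta)
    return total * h

# Hypothetical toy problem: one measurement x with unit Gaussian noise.
x = 0.5
likelihood = lambda theta: normal_pdf(x, theta, 1.0)

# Assumed priors (illustrative only): M_1 expects theta near 0, M_2 near 3.
prior_1 = lambda theta: normal_pdf(theta, 0.0, 1.0)
prior_2 = lambda theta: normal_pdf(theta, 3.0, 1.0)

# Integration limits are chosen wide enough that the integrand is negligible outside.
z1 = evidence(prior_1, likelihood, -10.0, 10.0)
z2 = evidence(prior_2, likelihood, -7.0, 13.0)

prior_odds = 1.0  # p(M_1) / p(M_2), assumed equal here
posterior_odds = prior_odds * z1 / z2
print(f"p(x|M_1) = {z1:.4f}, p(x|M_2) = {z2:.4f}, posterior odds = {posterior_odds:.2f}")
```

For this particular toy setup the integrals also have a closed form (marginally, x is Normal with the prior mean and standard deviation \sqrt{2}), which gives a handy check on the numerical answer.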

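The main issue is that \int p(\theta|M)p(x|\theta, M) \, d\theta is quite sensitive to the choice of p(\theta|M). The absurdity (which is sometimes called the Jeffreys-Lindley “paradox”) occurs if you make p(\theta|M) really wide for one of the models, but not so wide for the other model. Widening a prior beyond the region where the likelihood is appreciable just dilutes the prior mass there, so that model’s evidence can be driven as low as you like.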

The resolution of the Jeffreys-Lindley paradox is not to say the evidence is somehow not the right thing to calculate. Simple mathematics shows that it is the right thing. The resolution is found by not having silly priors. This implies actually testing the consequences of your priors instead of just saying “uniform will do”.
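One way to test the consequences of a prior is simply to vary it and watch what the evidence does. The sketch below is a hypothetical toy case, not anything from a real analysis: a single measurement x = 2 with unit Gaussian noise, a model that fixes \theta = 0 exactly, and a rival model with a Uniform(-w/2, w/2) prior on \theta. The function names and the widths tried are my own choices for illustration.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def evidence_uniform_prior(x, width):
    """Evidence for x ~ Normal(theta, 1) with theta ~ Uniform(-width/2, width/2)."""
    half = width / 2.0
    # Integral of the likelihood over the prior range, times the prior density 1/width.
    return (normal_cdf(half - x) - normal_cdf(-half - x)) / width

x = 2.0                          # assumed measurement, "two sigma" away from zero
evidence_point = normal_pdf(x)   # evidence for the model with theta fixed at 0

# Bayes factor (free-theta model over fixed-theta model) for several prior widths.
bayes_factors = {}
for width in (10.0, 100.0, 1000.0):
    bayes_factors[width] = evidence_uniform_prior(x, width) / evidence_point
    print(f"width = {width:6.0f}: Bayes factor = {bayes_factors[width]:.4f}")
```

Once the prior is wide enough to contain essentially all of the likelihood, the evidence is very nearly 1/w, so every tenfold widening costs the free-parameter model a factor of ten. The data have not changed at all; only the claim the prior makes about them has.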

Doing Bayesian model selection is equivalent to having a bigger hypothesis space which consists of M_1 and M_2 sticky-taped together. A broad prior on \theta_1 but not \theta_2 would imply that M_2 makes much more specific predictions about the data than M_1 does. You should only do that if it’s actually a true statement about your prior beliefs!

This issue shows up a lot when people are interested in whether a parameter takes a specific value or not. For example, is the universe flat (M_1: \Omega = 1) or not (M_2: \Omega \neq 1)? Doing the model selection is usually like having a prior

p(\Omega) = \frac{1}{2}\delta(\Omega - 1) + \frac{1}{2}f(\Omega).

f(\Omega) should almost never be a uniform distribution. In most applications it should be something with heavy tails, like a Cauchy distribution (as Jeffreys used on these kinds of problems). Here is one problem with the uniform distribution. If f is a Uniform(0.9, 1.1) distribution then, under f, P(0.999 \leq \Omega \leq 1.001) = 0.002/0.2 = 1/100, which is absurdly low. This is like saying “if the universe is not perfectly flat then it is probably not even close to flat”, and consequently “if the universe is not perfectly flat then the data I would expect to see are totally different from what I’d expect if the universe is perfectly flat”.
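To spell out why the choice of f matters so much, here is a short worked step. It assumes, as the mixture prior above does, prior probabilities of 1/2 for each model, so the prior odds cancel. The evidence for M_1 is just the likelihood at \Omega = 1, the evidence for M_2 is the likelihood averaged over f, and the posterior odds become

\frac{p(M_1|x)}{p(M_2|x)} = \frac{p(x|\Omega = 1)}{\int f(\Omega)p(x|\Omega) \, d\Omega}.

The denominator is an f-weighted average of the likelihood, so any prior mass that f places where the likelihood is negligible pulls that average down and shifts the odds towards M_1.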

Conclusion: the evidence values don’t give us a “paradox”, they behave exactly as they should.


About Brendon J. Brewer

I am a senior lecturer in the Department of Statistics at The University of Auckland. Any opinions expressed here are mine and are not endorsed by my employer.

3 Responses to What does the “Evidence” Accomplish?

  1. Dan F-M says:

    Your first quote is missing a bit of context. I think that it’s supposed to mean the opposite of what it sounds like (given the way you’ve written it).

  2. You’re right, it’s possible to misinterpret the quote because Skilling is stating things that he disagrees with (optional by-product). I am sure he has said similar things in a more quotable way elsewhere, but I haven’t found any better quotes yet. If I do, I’ll edit this post.

  3. I hate to admit that there *are* circumstances in which you *might* want to compute the fully marginalized likelihood. Just not for *model selection*!
