This is my first proper blog post for a while. Apologies for the gap. I have been busy with visits from three of my favourite colleagues (Kevin Knuth, Daniela Huppenkothen, and Dan Foreman-Mackey), followed by teaching an undergraduate course for which I had to learn HTML+CSS, XML, and databases (aside: SQL is cool and I wish I had learned it earlier). Somewhere in there, Lianne and I managed to buy our first house as well. Hopefully that’s enough excuses!

Earlier this year, the physics department had a visit from prominent astrostatistician Daniel Mortlock, who gave a good introductory talk about “Bayesian model selection”. He gave the standard version of the story where the goal is to calculate posterior model probabilities (as opposed to a literal *selection* of a model, which is a decision theory problem). During the presentation, he claimed that you shouldn’t use this theory to calculate the posterior probability of a hypothesis *you only thought of because of the data*. I thought this was a weird claim, so I disputed it, which was fun, but didn’t resolve the issue on the spot.

Here’s why I think Mortlock’s advice is wrong. Probabilities measure how plausible a proposition is, in the context of another proposition being known. Equivalently, they measure the degree to which one proposition implies another. For example, a posterior probability $P(H | D, I)$ is the probability of statement $H$ given $D$ and $I$, or the degree to which $D$ implies $H$ in the context of $I$. To calculate it, you use Bayes’ rule. The posterior probability of $H$ equals the prior times the likelihood divided by the marginal likelihood. There’s no term in the equation for when or why you thought of $H$.
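
Written out, Bayes’ rule looks like this. The symbols are my assumed labels: $H$ for the hypothesis, $D$ for the data, and $I$ for the prior information.

$$
P(H | D, I) = \frac{P(H | I) \, P(D | H, I)}{P(D | I)}
$$

The three ingredients on the right are the prior, the likelihood, and the marginal likelihood. None of them depends on when $H$ occurred to you.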

Still, I can see why Mortlock would have given his recommendation; it was a warning against the Bayesian equivalent of “p-hacking”. Every dataset will contain *some* meaningless anomalies, and it’s possible to construct an analysis that makes an anomaly appear meaningful when it isn’t.

A super-simple example will help here (I’ve used this example before, and it’s basically Ed Jaynes’ “sure thing hypothesis”). Consider a lottery with a million tickets. Consider the hypotheses $H_0$: the lottery was fair, and $H_1$: the lottery was rigged to make ticket number 227,354 win. And let $D$ be the proposition that 227,354 indeed won. The likelihoods are $P(D | H_0, I) = 10^{-6}$ and $P(D | H_1, I) = 1$. Wow! A strong likelihood ratio in favour of $H_1$. With prior probabilities of 0.5 each, the posterior probabilities of $H_0$ and $H_1$ are 1/1,000,001 and 1,000,000/1,000,001 respectively. Whoa. The lottery was almost certainly rigged!
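
If you want to check the arithmetic, here is a minimal sketch in Python. The function name `posterior` and the variable names are just illustrative choices; exact fractions are used so the output matches the numbers quoted above.

```python
from fractions import Fraction


def posterior(priors, likelihoods):
    """Bayes' rule over a finite set of hypotheses."""
    # Prior times likelihood for each hypothesis
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # Marginal likelihood: sum over all hypotheses considered
    evidence = sum(joint)
    return [j / evidence for j in joint]


n_tickets = 1_000_000

# Order: [lottery was fair, lottery was rigged for ticket 227,354]
priors = [Fraction(1, 2), Fraction(1, 2)]
# Probability that ticket 227,354 wins under each hypothesis
likelihoods = [Fraction(1, n_tickets), Fraction(1)]

post_fair, post_rigged = posterior(priors, likelihoods)
print(post_fair)    # 1/1000001
print(post_rigged)  # 1000000/1000001
```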

Common sense says this conclusion is silly, and Mortlock’s warning would have prevented it. Okay, but is there a better way to prevent it? There is. We can assign more sensible prior probabilities. $P(H_1 | I) = 0.5$ is silly because it would have implied $P(D | I) \approx 0.5$, i.e. that we had some reason to suspect ticket number 227,354 (and assign a 50% probability to it winning) before we knew that was the outcome. If, for example, we had considered a set of “rigged lottery” hypotheses $\{H_k\}$, one for each ticket, and divided half the prior probability among them, then we’d have gotten the “obvious” result, that $D$ is uninformative about whether the lottery was fair or not.
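
Here is the same kind of sketch with the more sensible prior: half the prior probability on a fair lottery, and the other half shared equally among a million “rigged for ticket k” hypotheses. The variable names are mine. Only the hypothesis rigged for the winning ticket has a nonzero likelihood, so there is no need to loop over all million of them.

```python
from fractions import Fraction

n_tickets = 1_000_000

# Prior: 1/2 on "fair", 1/2 split evenly over the rigged hypotheses
prior_fair = Fraction(1, 2)
prior_rigged_each = Fraction(1, 2) / n_tickets

# Likelihood of the observed winner under each kind of hypothesis
like_fair = Fraction(1, n_tickets)  # any given ticket is equally likely to win
like_rigged_for_winner = Fraction(1)  # rigged for the ticket that actually won
# Hypotheses rigged for any other ticket have likelihood 0 and drop out

joint_fair = prior_fair * like_fair
joint_rigged = prior_rigged_each * like_rigged_for_winner

# Marginal likelihood: probability of this ticket winning before seeing the data
evidence = joint_fair + joint_rigged

post_fair = joint_fair / evidence
post_rigged = joint_rigged / evidence

print(evidence)     # 1/1000000
print(post_fair)    # 1/2
print(post_rigged)  # 1/2
```

The posterior probability of a fair lottery stays at 1/2, the same as its prior, and the prior predictive probability of ticket 227,354 winning is the sensible one in a million rather than roughly one half.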

The take-home message here is that you can use Bayesian inference to calculate the plausibility of whatever hypotheses you want, no matter when you thought of them. The only risk is that you might be inclined to assign bad prior probabilities that sneakily include information from the data. The prior probabilities should describe the extent to which hypotheses are implied by the prior information alone. If they do that, you’ll be fine.

Jaynes is the master at bringing sensible thinking to physical problems and resolving apparent weaknesses or paradoxes based on faulty thinking. Great post! Thanks.

Hi Brendon, thanks for the great post. The only thing I’d maybe change is the suggestion that the equal weighting of the two hypotheses in the lottery example is merely silly. Assuming that the information “I” is what you’ve provided (that there is a lottery involving a million tickets and that it may or may not be rigged), doesn’t the principle of indifference _mandate_ that P(H_i|I) be equal for all H_i where the indices label hypotheses that differ only by a permutation of ticket labels?

I agree the principle of indifference is compelling in this example. In more complicated scenarios, it’s less obvious whether there is such a symmetry in the prior information, and if so, on what “level” the symmetry applies (e.g. sometimes it seems reasonable to assert that a marginal distribution should have a certain symmetry, rather than the full joint distribution having the symmetry).