Right now I’m in Canberra for the MaxEnt2013 meeting. It’s been an interesting and highly idiosyncratic conference so far, although I’m a bit tired because the schedule is very packed.
On Wednesday, John Skilling gave an interesting presentation criticising “entropic priors” and “information geometry”, two different concepts that have been getting a little bit of attention in recent years. In this post I will defend entropic priors, but not information geometry: I know the rationale for the former, whereas I’ve never been convinced why the latter might be a good idea.
Suppose there are two quantities, $x$ and $y$, and you are uncertain about them, which can be described by a prior $p(x, y)$. You might not even be aware of any connection between the two, so the prior could even be independent: $p(x, y) = p(x)p(y)$. Then you get some more information. Some oracle tells you that there is a relationship between these quantities, and expresses this by saying:
“I know what your conditional distributions should be! You’d better update from $p(x, y)$ to some other $q(x, y)$, so that you have the right conditionals $q(y|x)$ that I am imploring you to use”.
How should you choose your new distribution $q(x, y)$? Well, since the conditionals $q(y|x)$ are given to you by the oracle, the only freedom left is to fiddle with the marginal $q(x)$. The right way to update probabilities given this kind of information is to use ME, which implements “minimal updating”: you should stay as close as possible to your prior $p(x, y)$, while also incorporating the constraint given by the oracle. If you solve this problem you will find that your marginal should be changed from $p(x)$ to

$$q(x) \propto p(x)\, e^{S(x)}$$
where $S(x) = -\int q(y|x) \log\left[\frac{q(y|x)}{p(y|x)}\right] dy$, which is the entropy of the new conditional $q(y|x)$ with respect to the old one $p(y|x)$. Intuitively, this prior upweights values of $x$ that give wide predictions about $y$, since exponentials of entropies are pretty much just volumes.
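For completeness, here is a sketch of the ME calculation that produces this result (written with sums; integrals work the same way). Fixing the conditionals $q(y|x)$ and writing $q(x, y) = q(x)\,q(y|x)$, the divergence from the prior splits into two pieces:

$$D(q\,\|\,p) = \sum_{x,y} q(x)\,q(y|x)\,\log\frac{q(x)\,q(y|x)}{p(x)\,p(y|x)} = \sum_x q(x)\log\frac{q(x)}{p(x)} \;-\; \sum_x q(x)\,S(x)$$

where $S(x) = -\sum_y q(y|x)\log\left[q(y|x)/p(y|x)\right]$. Minimising over the remaining freedom $q(x)$, with a Lagrange multiplier for normalisation, gives $q(x) \propto p(x)\,e^{S(x)}$.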
This is a seductive idea, because if you think of $x$ as the parameters in a Bayesian inference problem, and $y$ as data, then it appears we have a method for choosing a prior that “takes into account the information contained in the choice of the likelihood function $q(y|x)$”. This is true. The derivation does what it says on the box. However, the premises are a bit weird: it’s unusual that you’d have a good (privileged) prior over data sets $p(y)$ before you learned what experiment you were doing. It’s usually more sensible to just do the normal Bayesian thing and assign the prior for $x$ in a way that agrees with your judgments.
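Here is a small numerical sketch of the update. The setup is my own toy assumption: $x$ and $y$ both live on $\{1, \dots, 5\}$, the prior is independent and uniform, and the oracle’s conditionals make $y$ uniform on $\{1, \dots, x\}$ — so wider conditionals (larger $x$) should get upweighted.

```python
import math

xs = range(1, 6)
ys = range(1, 6)

p_x = {x: 1 / 5 for x in xs}                           # old marginal for x
p_y_given_x = {x: {y: 1 / 5 for y in ys} for x in xs}  # independent prior: p(y|x) = p(y)

# Hypothetical oracle conditionals: y uniform on {1, ..., x}
q_y_given_x = {x: {y: (1 / x if y <= x else 0.0) for y in ys} for x in xs}

# S(x): entropy of the new conditional relative to the old one
def S(x):
    return -sum(q * math.log(q / p_y_given_x[x][y])
                for y, q in q_y_given_x[x].items() if q > 0)

# Updated marginal: q(x) ∝ p(x) exp(S(x))
weights = {x: p_x[x] * math.exp(S(x)) for x in xs}
Z = sum(weights.values())
q_x = {x: w / Z for x, w in weights.items()}

print(q_x)  # here exp(S(x)) = x/5, so q(x) ∝ x
```

The “volume” reading is literal here: $e^{S(x)}$ works out to $x/5$, the fraction of $y$-space each conditional occupies.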
If you think the entropic prior is weird, here is a straightforward Bayesian situation that has the same properties. On the left we have a flat prior for two quantities $x$ and $y$ that are both known to be integers between 1 and 5 inclusive. If we learn that $y \leq x$ then we’ll update to the posterior on the right (black = more probable), and hey presto, the marginal distribution for $x$ is different by a “volume” factor!
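The grid is small enough to do this by brute force. A sketch, taking the learned fact to be $y \leq x$ as above:

```python
from fractions import Fraction

xs = range(1, 6)
ys = range(1, 6)

# Flat prior over the 5x5 grid of (x, y) pairs
prior = {(x, y): Fraction(1, 25) for x in xs for y in ys}

# Condition on the learned fact y <= x, then renormalise
posterior = {xy: w for xy, w in prior.items() if xy[1] <= xy[0]}
Z = sum(posterior.values())
posterior = {xy: w / Z for xy, w in posterior.items()}

# Marginal for x: proportional to the number of allowed y values, i.e. to x
marginal_x = {x: sum(w for (xx, _), w in posterior.items() if xx == x) for x in xs}
print(marginal_x)  # the marginal probability of x is x/15
```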
ME updating will give the same answer if the constraint is $q(y|x) = 1/x$ for $y \in \{1, \dots, x\}$, i.e. that $y$ is uniform given $x$. But we can also do softer versions of this, such as $q(y|x) \propto e^{-y/x}$, i.e. given $x$, $y$ will have an exponential distribution with scale proportional to $x$. Then you get this result:
which looks perfectly sensible. Marginally, larger $x$ is more probable though, because of the “entropic prior”. But it should be, by analogy with the first example.
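The soft version can be sketched the same way (my assumptions again: uniform independent prior on the $5 \times 5$ grid, oracle conditional $q(y|x) \propto e^{-y/x}$ normalised over $y \in \{1, \dots, 5\}$):

```python
import math

xs = range(1, 6)
ys = range(1, 6)

p_y = 1 / 5  # old (independent) distribution for y

# Oracle's conditional: q(y|x) ∝ exp(-y/x), normalised on the grid
def q_y_given_x(x):
    w = {y: math.exp(-y / x) for y in ys}
    Z = sum(w.values())
    return {y: v / Z for y, v in w.items()}

# Entropic-prior weight exp(S(x)), with S(x) relative to the old conditional
def weight(x):
    q = q_y_given_x(x)
    S = -sum(qy * math.log(qy / p_y) for qy in q.values())
    return math.exp(S)

weights = [weight(x) for x in xs]
Z = sum(weights)
q_x = [w / Z for w in weights]

print(q_x)  # increases with x: flatter conditionals earn more marginal weight
```

Larger $x$ gives a flatter (wider) conditional for $y$, so its relative entropy penalty is smaller and it picks up more marginal probability, just as in the hard-constraint case.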
Here are two final points I’d like to make:
- The weird behaviour of the entropic prior that Skilling showed in his talk was mostly caused by the extra “information geometry” factor, not by the entropy factor.
- Note that this version of an entropic prior has no relation to the “prior over images” from 1980s-style “MaxEnt image reconstruction”. That prior does not have any special status and there is no good reason to use it! I wish it wasn’t in introductory textbooks, waiting to confuse everybody.