## Entropic priors are okay (but you probably shouldn’t use them)

Right now I’m in Canberra for the MaxEnt2013 meeting. It’s been an interesting and highly idiosyncratic conference so far, although I’m a bit tired because the schedule is very packed.

On Wednesday, John Skilling gave an interesting presentation criticising “entropic priors” and “information geometry”, two different concepts that have been getting a little bit of attention in recent years. In this post I will defend entropic priors, but not information geometry, because I know their rationale, whereas I’ve never been convinced why information geometry might be a good idea.

Suppose there are two quantities, $\theta$ and $x$, and your uncertainty about them is described by a prior $\pi(\theta, x)$. You might not even be aware of any connection between the two, in which case your prior would be independent: $\pi(\theta, x) = \pi(\theta)\pi(x)$. Then you get some more information. Some oracle tells you that there is a relationship between these quantities, and expresses this by saying

“I know what your conditional distributions $p(x|\theta)$ should be! You’d better update from $\pi(\theta, x)$ to some other $p(\theta, x)$, so that you have the right conditionals $p(x|\theta)$ that I am imploring you to use”.

How should you choose your new distribution $p(\theta, x)$? Well, since the conditionals $p(x|\theta)$ are given to you by the oracle, the only freedom left is to fiddle with the marginal $p(\theta)$. The right way to update probabilities given this kind of information is to use ME (maximum entropy updating), which implements "minimal updating": you should stay as close as possible to your prior $\pi(\theta, x)$ while also incorporating the constraint given by the oracle. If you solve this problem you will find that your marginal should be changed from $\pi(\theta)$ to

$p(\theta) \propto \pi(\theta)\exp\left[S(\theta)\right]$

where $S(\theta) = -\int p(x|\theta) \log \left(\frac{p(x|\theta)}{\pi(x)}\right) \, dx$ is the entropy of the conditional $p(x|\theta)$ relative to the old marginal $\pi(x)$ (i.e. minus the KL divergence from $\pi(x)$ to $p(x|\theta)$). Intuitively, this prior upweights $\theta$ values that give wide predictions about $x$, since exponentials of entropies are pretty much just volumes.
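To see the volume intuition numerically, here is a small sketch. The discrete family $p(x|\theta)$ below is an illustrative choice of mine, not anything from the derivation:

```python
import numpy as np

# Toy illustration (my own choice of family, not from the post):
# x takes values 0..9; theta in {1,...,5} controls the width of the
# conditional p(x|theta), a truncated geometric-style distribution.
xs = np.arange(10)
thetas = np.arange(1, 6)

pi_x = np.full(len(xs), 1.0 / len(xs))              # old marginal pi(x): flat
pi_theta = np.full(len(thetas), 1.0 / len(thetas))  # old marginal pi(theta): flat

def conditional(theta):
    """p(x|theta): wider (higher entropy) for larger theta."""
    w = np.exp(-xs / theta)
    return w / w.sum()

# S(theta): entropy of p(x|theta) relative to pi(x)
S = np.array([-(conditional(t) * np.log(conditional(t) / pi_x)).sum()
              for t in thetas])

# Entropic prior: p(theta) proportional to pi(theta) * exp(S(theta))
p_theta = pi_theta * np.exp(S)
p_theta /= p_theta.sum()
print(p_theta)  # mass increases with theta: wide conditionals are upweighted
```

Larger $\theta$ gives a flatter conditional, hence a larger $S(\theta)$, hence more prior mass.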

This is a seductive idea, because if you think of $\theta$ as the parameters in a Bayesian inference problem and $x$ as data, then it appears we have a method for choosing a prior $p(\theta)$ that "takes into account the information contained in the choice of the likelihood function $p(x|\theta)$". This is true. The derivation does what it says on the box. However, the premises are a bit unusual: it's rare that you'd have a good (privileged) prior over data sets $\pi(x)$ before you knew what experiment you were doing, i.e. before knowing $p(x|\theta)$. It's usually more sensible to do the normal Bayesian thing and assign $p(\theta)$ in a way that agrees with your judgments.

If you think the entropic prior is weird, here is a straightforward Bayesian situation with the same properties. Start with a flat prior for two quantities $x$ and $y$, both known to be integers between 1 and 5 inclusive. If we learn that $y \leq x$, we update to a posterior that is uniform over the allowed pairs, and hey presto, the marginal distribution for $x$ has changed by a "volume" factor: $p(x) \propto x$, since there are $x$ allowed values of $y$ for each $x$. ME updating gives the same answer if the constraint is $p(y | x) \sim \textnormal{Uniform}(1, x)$. We can also do softer versions of this, such as $p(y | x) \propto \exp(-2(y-1)/x)$; that is, given $x$, $y$ has an exponential distribution with scale proportional to $x$. The result looks perfectly sensible. Marginally, $x=5$ is the most probable value because of the "entropic prior" factor, but it should be, by analogy with the hard-constraint example.
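Both versions of this toy example can be checked numerically; here is a quick sketch (variable names are mine) that computes the marginals for $x$:

```python
import numpy as np

# Grid of integer pairs (x, y), each from 1 to 5.
vals = np.arange(1, 6)
X, Y = np.meshgrid(vals, vals, indexing='ij')  # X varies along axis 0

# Hard constraint: flat prior on the 5x5 grid, then learn y <= x.
post = (Y <= X).astype(float)
post /= post.sum()
marg_x = post.sum(axis=1)  # marginal for x: the "volume" factor
print(marg_x)              # [1, 2, 3, 4, 5] / 15

# Soft constraint: the oracle hands us p(y|x) proportional to exp(-2(y-1)/x).
cond = np.exp(-2.0 * (Y - 1) / X)
cond /= cond.sum(axis=1, keepdims=True)

# ME updating keeps these conditionals and multiplies the (flat) old
# marginal for x by exp[S(x)], the entropy of p(y|x) relative to pi(y).
pi_y = np.full(5, 1.0 / 5)
S = -(cond * np.log(cond / pi_y)).sum(axis=1)
p_x = np.exp(S) / np.exp(S).sum()
print(p_x)  # increasing in x: the wide conditionals are favoured
```

In both cases the marginal for $x$ rises with $x$, with $x=5$ the most probable value, matching the description above.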

Here are two final points I’d like to make:

• In Skilling's talk, the weird behaviour of the entropic prior that he demonstrated was mostly caused by the extra "information geometry" factor, not by the entropy factor.
• Note that this version of an entropic prior has no relation to the "prior over images" from 1980s-style "MaxEnt image reconstruction". That prior does not have any special status and there is no good reason to use it! I wish it wasn't in introductory textbooks, waiting to confuse everybody.

I am a senior lecturer in the Department of Statistics at The University of Auckland. Any opinions expressed here are mine and are not endorsed by my employer.

### 3 Responses to Entropic priors are okay (but you probably shouldn’t use them)

1. scriminus says:

Sounds like an interesting meeting! Do you know if they are making slides public?
I’ve read a bit about information geometry too, but struggled to get beyond the theory. I’d be curious to see an example of it being used in the wild.

2. Brendon J. Brewer says:

MaxEnt is great fun. You get a great mix of awesome insight and strangeness. This one was a bit stressful due to being on the organising committee, and having an overly ambitious schedule. But it’s worth going to one if you get the chance.

I don’t know much about information geometry either except for when people use it as a justification for a Jeffreys prior (the square root of det G type). There was a guy giving a talk about how he used it in some data analysis problems, but my impression was that standard Bayesian analysis would have been more appropriate.

3. Prasanna K Gyawali (@pk_gyawali) says:

Can you suggest some simple resources for a starter on entropic priors? I have seen people using them to maximize the information content of parameters. Can't this be used as a prior instead of N(0, I) when I want to do variational inference, etc.? Would love to hear your thoughts.