The prior isn’t the only prior

One of my favorite pastimes is railing against certain word choices when they may imply things that aren’t true. An example of this is the usage of the word “prior” in Bayesian inference. If a quantity $\theta$ is unknown, then the prior $p(\theta)$ describes how a reasoner’s plausibility is spread among all the possible values that might be true, before taking into account the information in some data $x$.

While it is true that the prior describes prior information, this terminology seems to suggest it is the only place where prior information enters an analysis. This causes people to say things like “I’ll use Bayesian inference if I have prior information, but if I don’t, I’ll use something else”. If someone says that, it’s a red flag suggesting they don’t know what they’re talking about. Another part of the recipe that describes prior information is the “sampling distribution” or likelihood, $p(x|\theta)$. Here are some things that $p(x|\theta)$ is not:

• The process that generated the data
• The pdf that your data kinda looks like when you plot a histogram

What $p(x|\theta)$ really represents is a reasoner’s prior beliefs about the data, imagining $\theta$ is known, as a function of the possible $\theta$ values. I will attempt to make this more obvious with a basic example.

Consider estimating a quantity $\theta$ from $N$ noisy measurements of it, called $\mathbf{x} = \{x_1, x_2, ..., x_N\}$. For the sampling distribution, let’s use independent normal distributions, $x_i|\theta \sim \textnormal{Normal}(\theta, 1)$ for each measurement. Once you specify a prior $p(\theta)$, it is straightforward to calculate the posterior $p(\theta | \mathbf{x})$. With an improper flat prior, the posterior is normal (Gaussian), centred at the arithmetic mean of the $\mathbf{x}$ values and with a standard deviation of $1/\sqrt{N}$. This is the standard presentation of this problem.
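The standard result is easy to verify numerically. Here is a minimal sketch (the seed, sample size, and true value are assumptions, chosen only for illustration) that compares the analytic posterior moments against a brute-force grid evaluation of the flat-prior posterior:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20
theta_true = 3.0                                  # hypothetical true value, used only to simulate data
x = theta_true + rng.normal(0.0, 1.0, size=N)

# Analytic result with an improper flat prior
post_mean = x.mean()
post_sd = 1.0 / np.sqrt(N)

# Brute-force check: evaluate the likelihood on a grid and normalise
theta_grid = np.linspace(post_mean - 5 * post_sd, post_mean + 5 * post_sd, 10001)
dt = theta_grid[1] - theta_grid[0]
log_like = -0.5 * ((x[None, :] - theta_grid[:, None]) ** 2).sum(axis=1)
post = np.exp(log_like - log_like.max())
post /= post.sum() * dt                           # normalise numerically

grid_mean = (theta_grid * post).sum() * dt
grid_sd = np.sqrt(((theta_grid - grid_mean) ** 2 * post).sum() * dt)
print(grid_mean - post_mean, grid_sd - post_sd)   # both differences are tiny
```

The grid moments agree with $\bar{x}$ and $1/\sqrt{N}$ up to discretisation error.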

I will now redo this example in a different way, which exposes the logical status of $p(x|\theta)$ as a description of prior information. Define the “errors” as $\epsilon_i = x_i - \theta$. These are fixed numbers: just the difference between each measurement and the true $\theta$ value. There is nothing stopping us from treating $\boldsymbol{\epsilon} = \{\epsilon_i\}$ as additional parameters we can infer from the data, along with $\theta$. Bayes’ rule gives us the posterior

$p(\theta, \boldsymbol{\epsilon} | \mathbf{x}) \propto p(\theta, \boldsymbol{\epsilon})p(\mathbf{x} | \theta, \boldsymbol{\epsilon})$

For consistency with the standard analysis, we can use a flat prior for $\theta$ and $\textnormal{Normal}(0, 1)$ priors for the $\epsilon$s (the “normality” assumption is now clearly a prior). To choose the sampling distribution, imagine we knew $\theta$ and $\boldsymbol{\epsilon}$. If we knew these things, we’d be certain about the data, since $x_i = \theta + \epsilon_i$. Our probability distribution for the data is a delta function:

$p(\mathbf{x} | \theta, \boldsymbol{\epsilon}) = \prod_{i=1}^N\delta\left[x_i -(\theta + \epsilon_i)\right]$

Multiplying the prior by the likelihood we get

$p(\theta, \boldsymbol{\epsilon} | \mathbf{x}) \propto \prod_{i=1}^N \exp(-\frac{1}{2}\epsilon_i^2)\delta\left[x_i - (\theta + \epsilon_i)\right]$

If you marginalise out the $\epsilon$s you just end up with the usual posterior distribution for $\theta$.

$p(\theta | \mathbf{x}) \propto \int \prod_{i=1}^N \exp\left(-\frac{1}{2}\epsilon_i^2\right)\delta\left[x_i - (\theta + \epsilon_i)\right] \, d^N\boldsymbol{\epsilon}$

$p(\theta | \mathbf{x}) \propto \exp\left[-\frac{1}{2}\sum_{i=1}^N(\theta - x_i)^2\right]$
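This marginalisation can also be checked numerically. A standard trick is to regard each delta function as the limit of a narrow $\textnormal{Normal}(0, s)$ density as $s \to 0$. The sketch below (the data, the width $s$, and the grid sizes are all assumptions for illustration) does the $\epsilon_i$ integrals on a grid with a small but finite $s$, and compares the result to the usual posterior:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
x = rng.normal(3.0, 1.0, size=N)   # hypothetical data
s = 0.05                           # narrow Gaussian standing in for the delta function

eps_grid = np.linspace(-8.0, 8.0, 8001)
de = eps_grid[1] - eps_grid[0]
theta_grid = np.linspace(x.mean() - 2.0, x.mean() + 2.0, 801)
dt = theta_grid[1] - theta_grid[0]

# Marginalise each epsilon_i numerically:
# integral of exp(-eps^2/2) * (delta approximation at x_i - theta - eps) d eps
log_post = np.zeros_like(theta_grid)
for i in range(N):
    resid = x[i] - theta_grid[:, None] - eps_grid[None, :]
    integrand = np.exp(-0.5 * eps_grid[None, :] ** 2 - 0.5 * (resid / s) ** 2)
    log_post += np.log(integrand.sum(axis=1) * de)
post = np.exp(log_post - log_post.max())
post /= post.sum() * dt

# The usual posterior, proportional to exp(-0.5 * sum_i (theta - x_i)^2)
usual = np.exp(-0.5 * N * (theta_grid - x.mean()) ** 2)
usual /= usual.sum() * dt

print(np.abs(post - usual).max())  # small compared to the peak height
```

As $s$ shrinks (with the $\epsilon$ grid refined accordingly), the two curves coincide.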

The fact that you get the same result when the normal distribution is clearly a prior distribution is instructive: it suggests the sampling distribution really is a prior, just like the usual prior is. Because of this, I dislike it when I read statements suggesting that the sampling distribution is the “data generation mechanism” or something like that.

Interestingly, the posterior distribution for the $\epsilon$s is very different from the prior. The $\epsilon$s were independent in the prior, implying that learning $\epsilon_3$ would tell you nothing about $\epsilon_2$. But the posterior for the $\epsilon$s is very dependent: since the data fix $x_3 = \theta + \epsilon_3$, learning $\epsilon_3$ would tell you $\theta$, and hence all the other $\epsilon$s, with certainty.
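This prior-to-posterior change in the dependence structure is easy to see by simulation. A minimal sketch (the data and seed are assumptions for illustration): because the delta functions pin $\epsilon_i = x_i - \theta$, a draw from the joint posterior is obtained by drawing $\theta$ from its marginal posterior and setting the $\epsilon$s deterministically.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
x = rng.normal(3.0, 1.0, size=N)   # hypothetical data

# Draw theta from its marginal posterior Normal(mean(x), 1/sqrt(N)),
# then the delta functions force eps_i = x_i - theta.
theta = rng.normal(x.mean(), 1.0 / np.sqrt(N), size=100_000)
eps = x[None, :] - theta[:, None]            # shape (samples, N)

# In the prior the epsilons were independent; in the posterior they move
# in lockstep, since each one determines theta and hence all the others.
corr = np.corrcoef(eps[:, 1], eps[:, 2])[0, 1]
print(corr)  # essentially 1
```

The posterior correlation between any pair of $\epsilon$s is 1 (up to floating point), even though their prior correlation was zero.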

I am a senior lecturer in the Department of Statistics at The University of Auckland. Any opinions expressed here are mine and are not endorsed by my employer.
This entry was posted in Inference, Information.

9 Responses to The prior isn’t the only prior

1. This is an excellent point which, as pointed out, is often overlooked. It is also important to stress that the choice of the shape of the likelihood function is probably the most fundamental element in the analysis, but that too is often overlooked. Many assume the standard normal, which so many seem to feel comfortable with and are therefore attracted to, but it is often not adequate, and it is unfortunately used as a rather blind assumption by Bayesians and frequentists alike, I’m afraid. Silvia is a good example of this.

2. “If you were to learn $\epsilon_3$ then that would tell you all other $\epsilon$s with certainty.”
Surely this is only true if there’s only one parameter $\theta$?

These “errors” $\epsilon$ are only going to mean something sensible when you have a nice simple kind of “sampling distribution”/likelihood/generating function/whatever. Why not just go ahead and replace the $\epsilon$s with the data itself? Then the joint prior $p(\theta, x)$ is both the traditional prior as well as the likelihood.

I’d much rather class these all as “Assumptions”:
A1: This model (involving parameters $\theta$) is true
A2: Before knowing the data $x$, we have some PDF on the $\theta$s (prior)
A3: If the $\theta$s are known, then this is the PDF for measuring data $x$ (likelihood)

• Sorry, actually A3 should probably just be absorbed into A1. Besides that I’m just arguing about terminology: that what you call “Prior Information”, I call “Assumptions”.

• “Why not just go ahead and replace the $\epsilon$s with the data itself? Then the joint prior $p(\theta, x)$ is both the traditional prior as well as the likelihood.”

Agreed. $p(\theta, x)$ is the joint prior. It models prior beliefs about $\theta$ and $x$ before knowing the value of $x$. The epsilons aren’t necessary for recognising this, but I thought it was a cool example.

“what you call “Prior Information”, I call “Assumptions”.”

IMO ‘assumptions’ has the advantage of emphasising the fact that they’re usually asserted, rather than derived explicitly from prior information. But they are assumptions *about what is known prior to the data*, not about an ontological frequency distribution, about the actual data itself, or anything like that.

• Compromise and call them ‘Prior Assumptions’?

“The epsilons aren’t necessary for recognising this, but I thought it was a cool example.”
Sure, in fact, starting with an example in a simpler context is definitely a good choice. I was just trying to understand how to generalise this. Can’t really generalise an error term…

3. “I was just trying to understand how to generalise this. Can’t really generalise an error term…”

The general idea is discussed here: http://arxiv.org/abs/0808.0012 (pages 33–35, and it comes back later on). I’m also working on a draft blog post on this point (it’s one of my favourite hobby horses :)).