One of my favorite pastimes is railing against certain word choices that may imply things that aren't true. An example of this is the usage of the word "prior" in Bayesian inference. If a quantity is unknown, then the prior describes how a reasoner's plausibility is spread among all the possible values that might be true, before taking into account the information in some data $x$.
While it is true that the prior describes prior information, this terminology seems to suggest it is the only place where prior information enters an analysis. This causes people to say things like "I'll use Bayesian inference if I have prior information, but if I don't, I'll use something else". If someone says that, it's a red flag suggesting they don't know what they're talking about. Another part of the recipe that describes prior information is the "sampling distribution" or likelihood, $p(x \mid \theta)$. Some things $p(x \mid \theta)$ is *not* are listed below:
- The process that generated the data
- The pdf that your data kinda looks like when you plot a histogram
What $p(x \mid \theta)$ really represents is a reasoner's prior beliefs about the data $x$, imagining $\theta$ is known, as a function of the possible $\theta$ values. I will attempt to make this more obvious with a basic example.
Consider estimating a quantity $\mu$ from $N$ noisy measurements of it, called $x = \{x_1, x_2, \ldots, x_N\}$. For the sampling distribution, let's use independent normal distributions for each measurement,

$$p(x \mid \mu) = \prod_{i=1}^N \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right),$$

with $\sigma$ known. Once you specify a prior $p(\mu)$, it is straightforward to calculate the posterior $p(\mu \mid x)$. With an improper flat prior, the posterior is normal/gaussian, centered at the arithmetic mean $\bar{x}$ of the values, and with a standard deviation of $\sigma/\sqrt{N}$. This is the standard presentation of this problem.
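If you want to see the standard result concretely, here is a quick numerical sketch (the data values and $\sigma$ are made up purely for illustration): it evaluates the flat-prior posterior on a grid and compares it with the $\text{Normal}(\bar{x},\, \sigma/\sqrt{N})$ form.

```python
import numpy as np
from scipy import stats

# Made-up data for illustration: N noisy measurements with known sigma
x = np.array([4.8, 5.3, 4.6, 5.1, 5.2])
sigma = 0.5
N = len(x)

# With an improper flat prior, the posterior is proportional to the likelihood.
mu_grid = np.linspace(3.0, 7.0, 2001)
log_like = -0.5 * np.sum((x[None, :] - mu_grid[:, None])**2, axis=1) / sigma**2
post = np.exp(log_like - log_like.max())
post /= np.trapz(post, mu_grid)          # normalise numerically

# Compare with the analytic result: Normal(mean=xbar, sd=sigma/sqrt(N))
analytic = stats.norm.pdf(mu_grid, loc=x.mean(), scale=sigma / np.sqrt(N))
print(np.max(np.abs(post - analytic)))   # ~0, up to grid discretisation error
```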
I will now redo this example in a different way which exposes the logical status of the sampling distribution as a description of prior information. Define the "errors" as $\epsilon_i = x_i - \mu$. These are fixed numbers, just the difference between each measurement and the true value. There is nothing stopping us from considering the $\epsilon_i$s as additional parameters we can infer from the data, along with $\mu$. Bayes' rule gives us the posterior

$$p(\mu, \epsilon \mid x) \propto p(\mu, \epsilon)\,p(x \mid \mu, \epsilon).$$
For consistency with the standard analysis, we can use a flat prior for $\mu$ and $\text{Normal}(0, \sigma^2)$ priors for the $\epsilon_i$s (the "normality" assumption is now clearly a prior). To choose the sampling distribution, imagine we knew $\mu$ and the $\epsilon_i$s. If we knew these things, we'd be sure about the data, since $x_i = \mu + \epsilon_i$. Our probability distribution for the data is a delta function:

$$p(x \mid \mu, \epsilon) = \prod_{i=1}^N \delta\big(x_i - (\mu + \epsilon_i)\big).$$
Multiplying the prior by the likelihood, we get

$$p(\mu, \epsilon \mid x) \propto \prod_{i=1}^N \exp\left(-\frac{\epsilon_i^2}{2\sigma^2}\right)\delta\big(x_i - (\mu + \epsilon_i)\big).$$
If you marginalise out the $\epsilon_i$s, each delta function just sets $\epsilon_i = x_i - \mu$ in its integral, and you end up with the usual posterior distribution for $\mu$:

$$p(\mu \mid x) \propto \prod_{i=1}^N \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right).$$
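If you'd rather not trust the delta-function algebra, here is a quick symbolic sanity check of one factor of that integral, sketched with SymPy:

```python
import sympy as sp

mu, x_i, eps = sp.symbols('mu x_i epsilon', real=True)
sigma = sp.symbols('sigma', positive=True)

# One factor of the joint posterior: the Normal(0, sigma^2) prior on the
# error times the delta-function likelihood tying x_i to mu + epsilon_i.
factor = sp.exp(-eps**2 / (2 * sigma**2)) * sp.DiracDelta(x_i - (mu + eps))

# Integrating the error out recovers the usual normal likelihood factor,
# exp(-(x_i - mu)^2 / (2*sigma^2)).
print(sp.integrate(factor, (eps, -sp.oo, sp.oo)))
```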
The fact that you get the same result when the normal distributions appear explicitly as prior distributions is instructive: it suggests the sampling distribution really is a prior, just like the usual prior is. Because of this, I dislike it when I read statements suggesting that the sampling distribution is the "data generation mechanism" or something like that.
Interestingly, the posterior distribution for the $\epsilon_i$s is very different from the prior. The $\epsilon_i$s were independent in the prior, implying that learning $\epsilon_1$ would tell you nothing about $\epsilon_2$. But the posterior for the $\epsilon_i$s is very dependent. If you were to learn $\epsilon_1$, that would tell you $\mu = x_1 - \epsilon_1$, and hence all the other $\epsilon_i$s with certainty.
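This perfect posterior dependence is easy to see numerically. A sketch, reusing the made-up data from above: $\mu$ is drawn from its marginal posterior, and the delta function then fixes the errors exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same made-up data as before
x = np.array([4.8, 5.3, 4.6, 5.1, 5.2])
sigma = 0.5
N = len(x)

# Draw joint posterior samples: mu from its marginal posterior,
# then the delta function forces eps_i = x_i - mu exactly.
mu = rng.normal(x.mean(), sigma / np.sqrt(N), size=100_000)
eps = x[None, :] - mu[:, None]

# In the prior the errors were independent; in the posterior they are
# perfectly dependent, since every difference eps_i - eps_j = x_i - x_j
# is pinned down by the data.
print(np.corrcoef(eps[:, 0], eps[:, 1])[0, 1])   # 1.0, up to rounding
```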