I tend to think (and complain) about terminology a lot, as you may have noticed. This is particularly important when a given term has some chance of conveying an incorrect message. When speaking informally with experts, we can be sloppier, and it won’t matter, but I still enjoy (for example) trying to convince David Hogg to stop saying “generative model”. I’m still unhappy with the standard terms for the object $p(D \mid \theta)$, where $D$ is data and $\theta$ is parameters. “Sampling distribution” implies the data is literally sampled from some population, a popular metaphor that is very strained in most problems. On the other hand, “likelihood” is often reserved for the function of $\theta$ you get when you plug in the observed value of $D$, rather than the family of probability distributions over datasets that was there originally. I’ve been leaning towards “conditional prior for the data”, since that’s what it is.
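To make the two readings of the same object concrete, here is a minimal Python sketch. The binomial model, the values of `n`, `theta_fixed` and `k_observed`, and the function name `binomial_pmf` are all assumptions chosen for illustration; nothing here comes from a particular problem. With the parameter held fixed, the object is a probability distribution over datasets and sums to one. With the data held fixed, it is a function of the parameter that need not integrate to one.

```python
from math import comb

def binomial_pmf(k, n, theta):
    """p(D | theta), where the dataset D is k successes in n trials."""
    return comb(n, k) * theta**k * (1.0 - theta)**(n - k)

n = 10  # hypothetical number of trials

# Fix theta and vary the data: a probability distribution over datasets.
theta_fixed = 0.3
total_over_data = sum(binomial_pmf(k, n, theta_fixed) for k in range(n + 1))

# Fix the data and vary theta: the likelihood function.
# It is not a probability distribution over theta.
k_observed = 7
grid_size = 10_000
thetas = [(i + 0.5) / grid_size for i in range(grid_size)]  # midpoint rule on (0, 1)
area_over_theta = sum(binomial_pmf(k_observed, n, t) for t in thetas) / grid_size

print(f"sum over datasets at fixed theta:  {total_over_data:.6f}")
print(f"integral over theta at fixed data: {area_over_theta:.6f}")
```

For this binomial example the integral over the parameter is $1/(n+1)$ whatever the observed count, so the second number is well short of one.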
To demonstrate sloppy language, imagine I wanted to call something simple, say, “the distribution of the data”. The problem here is the number of other things that could also easily deserve that name. I will now play something like a probabilistic version of one of those “how many squares are there?” puzzles.
Suppose we have a standard hierarchical model situation with hyperparameters $\alpha$, parameters $\theta$, nuisance parameters $\eta$, and data $D$. The probability distributions we need are $p(\alpha)$, $p(\theta, \eta \mid \alpha)$, and $p(D \mid \theta, \eta)$. Here are some things I could conceivably call the distribution of the data:
* $p(D \mid \theta, \eta)$, usually known as the sampling distribution. This models prior beliefs about the data that you would have if you knew the parameters and the nuisance parameters.
* $p(D \mid \theta = \theta_{\rm true}, \eta = \eta_{\rm true})$, the particular sampling distribution that happens to correspond to the true values of the parameters and the nuisance parameters.
* $p(D \mid \theta = \theta_{\rm true})$, as above but with the nuisance parameters marginalised out.
* $p(D)$, the marginal prior for the data. This models prior beliefs about the data.
* $\delta(D - D_{\rm obs})$, a delta function at the observed value of the data. This models posterior beliefs about the data.
* $\frac{1}{N}\sum_{i=1}^{N} \delta(x - x_i)$, the empirical measure of the data $D = \{x_1, \ldots, x_N\}$: a deterministic functional of the data, basically an infinite-resolution histogram. Note that this is a frequency distribution over $x$, whereas the above are all probability distributions over $D$.
* $\frac{1}{M}\sum_{i=1}^{M} \delta(x - x_i)$, where $M \gg N$, the empirical measure of the greater population from which the data was drawn (if we were literally in a sampling situation, which we often are not).
I could go on. The distribution of the data is a vague and basically meaningless term, and if someone uses it, only the context will tell you which of these they’re actually talking about. Strangely, only one of these (the second last) is actually a property of the data, the others being properties of states of knowledge or hypothetical larger data sets.
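A small simulation shows how different some of these candidates are in practice. The sketch below is only an illustration under assumed choices: a Gaussian model with mean `theta` (the parameter) and noise standard deviation `eta` (the nuisance parameter), uniform and Gaussian priors picked arbitrarily, and hypothetical helper names `draw_dataset` and `draw_from_prior`. It draws datasets from one particular sampling distribution, draws datasets from the marginal prior for the data, and builds the empirical measure of a single observed dataset.

```python
import random
import statistics

rng = random.Random(0)
N = 5  # number of points in one dataset

def draw_dataset(theta, eta):
    """One dataset drawn from the sampling distribution at (theta, eta)."""
    return [rng.gauss(theta, eta) for _ in range(N)]

def draw_from_prior():
    """One draw of (theta, eta) from the assumed, purely illustrative priors."""
    alpha = rng.uniform(1.0, 5.0)  # hyperparameter: prior scale of theta
    theta = rng.gauss(0.0, alpha)  # parameter, given the hyperparameter
    eta = rng.uniform(0.5, 2.0)    # nuisance parameter: noise standard deviation
    return theta, eta

# Sampling distribution at one particular (theta, eta).
fixed = [draw_dataset(1.0, 1.0) for _ in range(20_000)]

# Marginal prior for the data: every unknown is drawn from its prior first.
marginal = [draw_dataset(*draw_from_prior()) for _ in range(20_000)]

# Empirical measure of one observed dataset: weight 1/N on each observed value.
observed = draw_dataset(1.0, 1.0)
empirical_measure = {x: 1.0 / N for x in observed}

# Compare the spread of the first data point under the two probability distributions.
sd_fixed = statistics.pstdev([d[0] for d in fixed])
sd_marginal = statistics.pstdev([d[0] for d in marginal])

print(f"sd of x_1 under the sampling distribution: {sd_fixed:.2f}")
print(f"sd of x_1 under the marginal prior:        {sd_marginal:.2f}")
print("empirical measure:", empirical_measure)
```

The first two objects are probability distributions over whole datasets, and the marginal prior is much wider because it also carries the uncertainty about the parameters. The third is just a set of weights over individual values and is computed from the observed data alone.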
Pay attention to this when you’re reading papers or listening to a talk. You may eventually start to notice how many people (including professional statisticians) will equate one with another without any justification, or without even realising that they’re doing it.