How many “distributions of the data” are there?

I tend to think (and complain) about terminology a lot, as you may have noticed. This is particularly important when a given term has some chance of conveying an incorrect message. When speaking informally with experts, we can be sloppier, and it won’t matter, but I still enjoy (for example) trying to convince David Hogg to stop saying generative model. I’m still unhappy with the standard terms for the $p(D|\theta)$ object, where $D$ is data and $\theta$ is parameters. “Sampling distribution” implies the data is literally sampled from some population, a popular metaphor that is very strained in most problems. On the other hand, “likelihood” is often reserved for the function of $\theta$ you get when you plug in the observed value of $D$, rather than the family of probability distributions over datasets that was there originally. I’ve been leaning towards “conditional prior for the data”, since that’s what it is.

To demonstrate sloppy language, imagine I wanted to call $p(D|\theta)$ something simple, say, “the distribution of the data”. The problem here is the number of other things that could also easily deserve that name. I will now play something like a probabilistic version of one of those “how many squares are there?” puzzles.

Suppose we have a standard hierarchical model situation with hyperparameters $\alpha$, parameters $\theta$, nuisance parameters $\eta$, and data $D = \{x_1, x_2, \ldots, x_n\}$. The probability distributions we need are $p(\alpha)$, $p(\theta, \eta | \alpha)$, and $p(D | \theta, \eta)$. Here are some things I could conceivably call the distribution of the data:

* $p(D|\theta, \eta)$, usually known as the sampling distribution. This models prior beliefs about the data that you would have if you knew the parameters and the nuisance parameters.
* $p(D|\theta=\theta_{\rm true}, \eta=\eta_{\rm true})$, the particular sampling distribution that happens to correspond to the true values of the parameters and the nuisance parameters.
* $p(D|\theta)$, as above but with the nuisance parameters marginalised out.
* $p(D) = \int p(\alpha)p(\theta, \eta | \alpha)p(D | \theta, \eta) \, d\alpha \, d\theta \, d\eta$, the marginal prior for the data. This models prior beliefs about the data.
* $\delta(D - D_{\rm observed})$, a delta function at the observed value of the data. This models posterior beliefs about the data.
* $\sum_{i=1}^n \delta(x - x_i)$, the empirical measure of the data: a deterministic functional of the data, basically an infinite-resolution histogram. Note that this is a frequency distribution over $\mathbb{R}$, whereas the above are all probability distributions over $\mathbb{R}^n$.
* $\sum_{i=1}^N \delta(x - x_i)$, where $N > n$, the empirical measure of the greater population from which the data was drawn (if we were literally in a sampling situation, which we often are not).

I could go on. The distribution of the data is a vague and basically meaningless term, and if someone uses it, only the context will tell you which of these they’re actually talking about. Strangely, only one of these (the second last) is actually a property of the data, the others being properties of states of knowledge or hypothetical larger data sets.
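To make a few of the items in the list concrete, here is a minimal sketch in Python. Everything in it is an assumption made purely for illustration: a hypothetical toy model with binary observations, a success probability `theta`, a nuisance "record flipped" rate `eta`, small discrete priors, and made-up "true" values and observed data. The hyperparameter level $\alpha$ is left out to keep it short. The function names are likewise invented for this sketch.

```python
from collections import Counter
from itertools import product

# Hypothetical toy model (an assumption for illustration, not from the post):
# each x_i is 0 or 1; theta is the underlying success probability and the
# nuisance parameter eta is the chance that a record gets flipped.
THETAS = {0.2: 1 / 3, 0.5: 1 / 3, 0.8: 1 / 3}  # prior p(theta)
ETAS = {0.0: 0.5, 0.1: 0.5}                    # prior p(eta), independent of theta
THETA_TRUE, ETA_TRUE = 0.8, 0.1                # made-up "true" values


def sampling_dist(D, theta, eta):
    """p(D | theta, eta): a probability for a whole dataset D (a tuple of 0/1)."""
    q = theta * (1 - eta) + (1 - theta) * eta  # P(x_i = 1 | theta, eta)
    k = sum(D)
    return q**k * (1 - q) ** (len(D) - k)


def marginal_over_eta(D, theta):
    """p(D | theta): the nuisance parameter marginalised out."""
    return sum(w * sampling_dist(D, theta, eta) for eta, w in ETAS.items())


def marginal_prior(D):
    """p(D): the marginal prior for the data."""
    return sum(w * marginal_over_eta(D, theta) for theta, w in THETAS.items())


def posterior_for_data(D, D_observed):
    """Discrete analogue of delta(D - D_observed)."""
    return 1.0 if D == D_observed else 0.0


def empirical_measure(D):
    """Counts of each value among the x_i: a frequency distribution over
    single values, not a probability distribution over datasets."""
    return Counter(D)


D_observed = (1, 0, 1, 1)  # made-up observed data, n = 4
all_datasets = list(product([0, 1], repeat=len(D_observed)))

print("p(D_obs | theta_true, eta_true):", sampling_dist(D_observed, THETA_TRUE, ETA_TRUE))
print("p(D_obs | theta = 0.5):         ", marginal_over_eta(D_observed, 0.5))
print("p(D_obs):                       ", marginal_prior(D_observed))
print("empirical measure:              ", empirical_measure(D_observed))
```

The first four functions all take an entire dataset as their argument and sum to one over `all_datasets`, whereas `empirical_measure` takes the one observed dataset and returns counts over individual values. In this toy setting, that is the distinction between probability distributions over datasets and a frequency distribution over single values.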

Pay attention to this when you’re reading papers or listening to a talk. You may eventually start to notice how many people (including professional statisticians) will equate one with another without any justification, or without even realising that they’re doing it.

I am a senior lecturer in the Department of Statistics at The University of Auckland. Any opinions expressed here are mine and are not endorsed by my employer.

7 Responses to How many “distributions of the data” are there?

1. Right on schedule, I received a talk abstract in my inbox whose first sentence refers to the distribution of the data (sense 6 in my list: the talk’s about histograms).

2. I have some sympathy for the term “generative model” as a model for the “data generating process”, the latter being well-established in the Bayesian literature (e.g. http://drsmorey.org/bibtex/upload/OHagan:1997.pdf ).

• But doesn’t this confuse the notion of model with that of the data? Are models intellectual constructs that we come up with to try to explain data that we measure? Using a generative model to mean a data generating process seems to me quite confusing.

• In theoretical statistics papers the argument usually begins simply by supposing we have data generated by a given process, whereas in applications we might suppose this but then need to justify, with posterior checks, that this is a sensible statistical model for the actual data. Remembering always that it is just a model …

• I don’t like generative model. p(data | theta) is literally a prior state of knowledge. It is related to the data generation mechanism, in that the choice of what p(data | theta) should be is usually dictated by what we know about the mechanism. But it’s not identical to the mechanism — probabilities can’t generate anything.

3. Is it necessary to interpret the likelihood function as something that implies the existence of a larger hypothetical set from which the data are only a subset? Isn’t it reasonable to consider that, no matter what kind of measurements we make, they will always be distributed in some way, whether we know what that is or not? And if we have prior information about the measurements we are making, and in particular some indications about the nature of the statistical process that gives rise to these data, then we can justify using a distribution function that best describes the distribution that we expect (or even have seen in previous experiments) for these measurements, and from it, calculate the likelihood function using the set of measurements we have made. Is making an assumption about the function according to which our measurements will be distributed the same as assuming the existence of a larger population from which we draw a sample? Isn’t this just assuming that the physical process that gives rise to these data is of a certain kind as far as its statistical properties are concerned? Aren’t we perfectly free to make as many measurements of a given process as we like? Does this imply that we are assuming that this data set is a subset of a hypothetical population? I’m not totally decided on these matters, but these are some thoughts that came up when reading your post.

• “Is it necessary to interpret the likelihood function as something that implies the existence of a larger hypothetical set from which the data are only a subset?”

It’s definitely not necessary! p(data | theta) describes (conditional) prior beliefs about the particular data set that you are working with, and that is all. You can do all this without any notion of hypothetical bigger datasets. If your dataset really is a subset of some bigger population, p(data | theta) may equal the frequency distribution of the bigger population, or it may not (getting from the frequency distribution of a population to the probability distribution for the dataset invokes the principle of indifference).