What is a statistical model?

What is a statistical model?

This question was posed recently by the excellent “Stats Fact” Twitter account, which linked to a paper that was too complicated for me to understand, involving category theory. Coincidentally, the same question also came up on a thread in the Astrostatistics group on Facebook, during a discussion of posterior predictive checks and other similar practices. There was a strong consensus that we should participate in such “model checking” to ensure the statistical model used is a good representation of the data.

My own views are different. Firstly, I don’t think statistical models are representations of the data at all (barring one exception, which I will discuss later). Instead, they are representations of the prior information that our analysis is assuming — and this is a simple thing to see, not something that requires advanced pure mathematics skills. Therefore, if we want to find out whether our choices are appropriate, we need to look at the prior information, not the data. Of course, in practice the prior information isn’t usually specified explicitly, so this is difficult to do. Secondly, I don’t think model checking is useful in the way most people imagine it is. These procedures don’t tell us whether or not we should trust the results of an inference, which is what we usually want to know.

To make things more concrete, let’s look at a simple example where we want to infer an unknown quantity $\theta$ from data $D$. We are therefore interested in the plausibility of propositions about the values of $\theta$ and $D$, and those plausibilities are related according to the rules of probability theory. To keep things simple, suppose there are only three possibilities for $\theta$ and three for $D$. Therefore there are nine (=3×3) mutually exclusive possibilities overall, of the form $(\theta = \theta_i) \wedge (D = D_j)$ where $i, j \in \{1, 2, 3\}$. Before we learn $D$, we not know the value of $\theta$, but we also don’t know the value of $D$ (thanks Ariel). To model this we assign prior probabilities over the grid of nine possibilities. Here’s an example probability assignment:

For argument’s sake, let’s assume $\theta$ is on the x-axis and $D$ is on the y-axis (i.e. each column corresponds to one of the three possible $\theta$ values and each row corresponds to a possible $D$ value). By the laws of probability theory, this joint distribution implies a bunch of other probabilities. For example, the marginal distribution of $\theta$ can be found by summing the columns, i.e. the marginal probabilities are $\{0.137, 0.325, 0.539\}$. This marginal distribution is what we usually mean by the term prior. If we take the joint distribution and divide by the marginal distribution for $\theta$, we get the conditional distributions $p(D|\theta)$ (each column represents a probability distribution for $D$):

If you look for people using the term “statistical model”, they are usually using it to refer to the choice of $p(D|\theta)$, the probability distribution for the data given the parameters, the conditional distributions shown above (aka the sampling distribution or sometimes [a bit misleadingly] the likelihood). However, these distributions are implied by the joint distribution $p(\theta, D)$ which is explicitly a model of uncertainty prior to knowing the data, not the data itself.

In Bayesian inference, data plays one role and only one role. When we learn the value of $D$ our knowledge of $\theta$ switches from the marginal distribution to the conditional distribution corresponding to the actual value of $D$. Equivalently, we can continue with the 3×3 grid of possibilities, and just delete all the false possibilities. So if we learned that $D$ took the value corresponding to row three, our state of knowledge would change, and the procedure is to set all the probabilities from rows 1 and 2 to zero, and renormalise the result:

This is the posterior distribution, but it’s not the posterior for $\theta$, it’s the posterior for $\theta$ and $D$ jointly. The posterior for $\theta$ is a marginal distribution here, which is trivially equal to the values in row three on the right. Anyway, the point here is that the thing people usually call the “statistical model” is really part of the prior information, so we shouldn’t go looking for it in the data. Incidentally, this updating procedure is equivalent to a certain way of using MaxEnt, which means that Bayesian updating is a way of staying as close to the prior as possible: a fun fact that is the opposite of the way many people think about statistics.

Posterior predictive checks

Posterior predictive checks should not be interpreted as validating or invalidating the prior distribution by using the data. Such ideas make no logical sense except in the case where the prior distribution assigned probability zero to the observed data. However, this doesn’t render them totally useless. Posterior predictive checks (and similar practices) can play a role in bringing to light consequences of your prior that you may not have realised were there. For example, assuming a “gaussian noise” sampling distribution implies a very high prior probability that the noise vector in your one dataset “looks like white noise”. You may or may not want that property in your prior. If you can’t decide, the place to look is in your prior information, not at adhockeries based on your current dataset.

Models “of the data”

There’s one situation where I’m willing to concede that a statistical model represents data itself, rather than prior information. Suppose I have a large set of numbers and I want to summarise them, like we do in “descriptive statistics”. I could present the mean and standard deviation of the data, or other quantities. Another thing I could present is the result of a maximum likelihood fit of a distribution (maybe one without sufficient statistics, like a Cauchy distribution) to the data. What I’d be doing there is merely presenting a summary, and not doing an inference or trying to inductively reason from the data. It can also be viewed as a kind of reverse Monte Carlo approximation — instead of replacing a probability distribution with an empirical measure, you’re doing the opposite.