Data are Nuisance Parameters

In inference, the term “nuisance parameters” refers to quantities you feel like you need to put in your model, but that you don’t actually care about. For example, you might be fitting some data D=\{y_i\} with a straight line y_i=mx_i + b, but you might only care about the slope m of the straight line. It’s totally possible to do the inference with only m and the data present, but it’s conceptually easier to put b in and then marginalize it out later, like so:

p(m|D) = \int p(m, b | D) db
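Here is a minimal sketch of that marginalization on a grid. The simulated data, the known noise level `sigma`, the flat priors and the grid ranges are all assumptions made up for illustration, not part of the argument above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: true slope 2, true intercept 1, known noise sigma = 0.5
x = np.linspace(0.0, 10.0, 20)
sigma = 0.5
y = 2.0 * x + 1.0 + sigma * rng.normal(size=x.size)

# Grids over the slope m and the nuisance intercept b (flat priors assumed)
m_grid = np.linspace(1.0, 3.0, 401)
b_grid = np.linspace(-4.0, 6.0, 401)
dm = m_grid[1] - m_grid[0]
db = b_grid[1] - b_grid[0]
M, B = np.meshgrid(m_grid, b_grid, indexing="ij")

# Gaussian log-likelihood evaluated at every (m, b) grid point
resid = y[None, None, :] - (M[..., None] * x + B[..., None])
log_like = -0.5 * np.sum((resid / sigma) ** 2, axis=-1)

# Joint posterior p(m, b | D), normalized on the grid
joint = np.exp(log_like - log_like.max())
joint /= joint.sum() * dm * db

# Marginal posterior p(m | D): integrate the joint over b
marginal_m = joint.sum(axis=1) * db

# Posterior mean of the slope, computed from the marginal alone
m_mean = np.sum(m_grid * marginal_m) * dm
print(f"posterior mean slope: {m_mean:.3f}")
```

The intercept never appears in the final answer: once `joint` has been summed along the b axis, everything you say about m comes from `marginal_m`.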

Any inference involves a choice of a prior p(\theta) describing uncertainty about the parameter(s) \theta, as well as a choice (which should also be called a prior) of p(D|\theta), which describes uncertainty about what data will be observed and encodes how the data are connected to the parameters.

In undergrad I took an information theory course, which I loved. I particularly liked the fact that you could take the log of a probability, call it information, and then sound like you understood profound truths. But some things didn’t make sense. In the class we defined the “information content of an outcome i” as -\log(p_i). For example, if I learn the outcome of a die roll I get \log\left(6\right) nats of information. However, in data analysis, it’s natural to ask questions like “how much information is in the data”, and if you apply this then you get nonsensical answers (most data sets had a ridiculously tiny probability before you got them, implying that all data sets contain a lot of information). Conclusion: the -\log(p_i) definition of information is useless.
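To put numbers on this, here is a small sketch using the -\log(p_i) definition. The die roll is the example above; the data set of 1000 fair coin flips is a hypothetical stand-in for "most data sets".

```python
import math

# Information content of one outcome of a fair six-sided die, in nats
die_info = -math.log(1.0 / 6.0)

# Hypothetical data set: one particular sequence of 1000 fair coin flips.
# Every such sequence has prior probability 0.5**1000, whatever it looks like.
n_flips = 1000
dataset_info = -n_flips * math.log(0.5)

print(f"die roll: {die_info:.3f} nats")      # log(6), about 1.792
print(f"data set: {dataset_info:.1f} nats")  # 1000 * log(2), about 693.1
```

Under this definition every possible sequence of flips scores the same large number, regardless of whether it changed your mind about anything.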

The real definition of information is in terms of relative entropy (or KL divergence), as shown by Shannon, Jaynes, etc. If I started with a prior p_0(x) and updated to a posterior p_1(x), then the amount of information gained is H = \int p_1(x)\log\left(\frac{p_1(x)}{p_0(x)}\right) \, dx (this is essentially the number of times the prior had to be compressed by a factor e to get to the posterior). When we have nuisance parameters, there is an ambiguity. We could compute the information either including or excluding the nuisance parameters (i.e. we could calculate either the joint or the marginal compression ratio). In other words we could ask “how much did I learn about the interesting parameters?” or “how much did I learn about all the parameters, interesting and otherwise?”.
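As a rough numerical check of the formula for H, the sketch below compresses a wide Gaussian prior into a narrow Gaussian posterior and compares the grid integral with the standard closed form for two Gaussians. The particular means and widths are assumptions chosen only for illustration.

```python
import numpy as np

def gaussian_pdf(x, mu, sd):
    """Normal density with mean mu and standard deviation sd."""
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

# Assumed prior p_0 = N(0, 10^2) and posterior p_1 = N(1, 1^2)
mu0, sd0 = 0.0, 10.0
mu1, sd1 = 1.0, 1.0

x = np.linspace(-60.0, 60.0, 200001)
dx = x[1] - x[0]
p0 = gaussian_pdf(x, mu0, sd0)
p1 = gaussian_pdf(x, mu1, sd1)

# H = integral of p_1 log(p_1 / p_0); points where p_1 underflows to 0 contribute 0
mask = p1 > 0
H_numeric = np.sum(p1[mask] * np.log(p1[mask] / p0[mask])) * dx

# Closed-form KL divergence between two Gaussians, for comparison
H_exact = np.log(sd0 / sd1) + (sd1**2 + (mu1 - mu0) ** 2) / (2.0 * sd0**2) - 0.5

print(f"numeric: {H_numeric:.6f} nats, closed form: {H_exact:.6f} nats")
```

The dominant term is \log(10), i.e. the width shrank by a factor of ten, which is the "compression" reading of H.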

I learnt from Ariel Caticha (SUNY Albany) that inference boils down to forming the “joint prior” p(\theta)p(D|\theta) and then deleting all the D values that are known to be false (i.e. all but the actual data). So we can consider the prior and the posterior in the joint space of possible parameters and possible data sets. If you marginalize over data sets you get the regular posterior for the parameters. If you compute the “information” in this joint space, you will find that there is a lot of it. This is a measure of the information content of the data, but it’s a measure of how much the data tells you about the parameters and the data! You could also compute the amount of information the data contains about the data, and you’ll get -\log[p(D)] which is the silly “information” definition from the undergrad course. Both of these definitions are useless, unless your scientific question was about the noise in pixel (545, 253) on your detector! The real information content that we care about is the compression of the marginal prior p(\theta) to the marginal posterior p(\theta|D), because the purpose of obtaining data is to learn about parameters. Therefore data are just like nuisance parameters: we don’t care how much information we learned about the data itself, only how it affects our state of knowledge about the parameters.
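The three quantities in this paragraph can be compared in a toy problem. The sketch below assumes a coin with unknown bias \theta on a small grid, a uniform prior, and one particular observed sequence of 20 flips containing 14 heads; all of these choices are hypothetical.

```python
import numpy as np

# Assumed discrete grid of coin biases with a uniform prior p(theta)
theta = np.linspace(0.05, 0.95, 19)
prior = np.full(theta.size, 1.0 / theta.size)

# Hypothetical observed data: one specific sequence of 20 flips with 14 heads
n_flips, n_heads = 20, 14
like = theta**n_heads * (1.0 - theta) ** (n_flips - n_heads)  # p(D | theta)

evidence = np.sum(prior * like)   # p(D)
post = prior * like / evidence    # p(theta | D)

# Joint space: the prior mass on (theta, D_obs) is p(theta) p(D_obs | theta),
# and the posterior mass there is p(theta | D_obs); every other D gets zero.
info_joint = np.sum(post * np.log(post / (prior * like)))

# Information "about the data": prior p(D) compressed onto the observed D
info_data = -np.log(evidence)

# Information about the parameters: marginal prior to marginal posterior
info_params = np.sum(post * np.log(post / prior))

print(f"joint:      {info_joint:.3f} nats")
print(f"data:       {info_data:.3f} nats")
print(f"parameters: {info_params:.3f} nats")
```

In this toy case the joint figure comes out equal to -\log[p(D)], which follows from p(\theta|D) = p(\theta)p(D|\theta)/p(D), and both grow with the number of flips. The marginal figure for \theta is much smaller, and it is the one that answers "how much did I learn about the coin?".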


About Brendon J. Brewer

I am a senior lecturer in the Department of Statistics at The University of Auckland. Any opinions expressed here are mine and are not endorsed by my employer.

3 Responses to Data are Nuisance Parameters

  1. bayes, james bayes. says:

    you had me at nuisance.

  2. Phil Marshall says:

    WHO USES NATS? Seriously, am excited about you having a blog. Following! 🙂

  3. Nats are natural, unlike those artificial “bits”. 😉
