In inference, the term “nuisance parameters” refers to some quantity you feel like you need to put in your model, but that you don’t actually care about. For example, you might be fitting some data with a straight line $y = mx + b$, but you might only care about the slope $m$ of the straight line. It’s totally possible to do the inference with only $m$ and the data present, but it’s conceptually easier to put in the intercept $b$ and then marginalize it out later, like so:

$$p(m \mid D) = \int p(m, b \mid D) \, db.$$
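A minimal numerical sketch of this marginalization, using made-up data, a Gaussian noise model with known standard deviation, and a crude grid approximation (purely for illustration, not how you'd do this at scale):

```python
# Marginalizing out the intercept b to get the posterior for the slope m.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=x.size)   # true m = 2, b = 1

m_grid = np.linspace(0.0, 4.0, 200)
b_grid = np.linspace(-1.0, 3.0, 200)
M, B = np.meshgrid(m_grid, b_grid, indexing="ij")

# Flat priors over the grid; Gaussian likelihood with known sigma = 0.1.
resid = y[None, None, :] - (M[:, :, None] * x + B[:, :, None])
log_like = -0.5 * np.sum((resid / 0.1) ** 2, axis=2)
post = np.exp(log_like - log_like.max())
post /= post.sum()                      # joint posterior p(m, b | D) on the grid

marginal_m = post.sum(axis=1)           # sum over b: the marginal p(m | D)
print("posterior mean of m:", np.sum(m_grid * marginal_m))
```

Summing the joint posterior over the `b` axis is the discrete analogue of the integral above.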
Any inference involves a choice of a prior $p(\theta)$ describing uncertainty about the parameter(s) $\theta$, as well as a choice of sampling distribution $p(D \mid \theta)$ (which should also be called a prior), which describes uncertainty about what data will be observed and how the data are connected to the parameters.
In undergrad I took an information theory course, which I loved. I particularly liked the fact that you could take the log of a probability, call it information, and then sound like you understood profound truths. But some things didn’t make sense. In the class we defined the “information content of an outcome $x$” as $-\log p(x)$. For example, if I learn the outcome of a fair die roll I get $\log 6 \approx 1.79$ nats of information. However, in data analysis, it’s natural to ask questions like “how much information is in the data?”, and if you apply this definition you get nonsensical answers (most data sets had a ridiculously tiny probability before you got them, implying that all data sets contain a lot of information). Conclusion: this definition of information is useless.
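To make the complaint concrete, here is the naive definition in action (the numbers are my own, not from the course):

```python
# The naive definition: information of an outcome x is -log p(x), in nats.
import math

die_info = -math.log(1.0 / 6.0)    # learning a fair die roll: log 6, about 1.79 nats
print(round(die_info, 3))

# Applied to a whole data set it gives silly answers: any particular sequence
# of 1000 fair coin flips has probability 2^-1000, so every such data set
# "contains" about 693 nats, regardless of what it actually teaches you.
data_info = -1000 * math.log(0.5)
print(round(data_info, 1))
```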
The real definition of information is in terms of relative entropy (or KL divergence), as shown by Shannon, Jaynes, etc. If I started with a prior $p(\theta)$ and updated to a posterior $p(\theta \mid D)$, then the amount of information gained is

$$D_{\rm KL}\!\left(p(\theta \mid D) \,\big\|\, p(\theta)\right) = \int p(\theta \mid D) \log \frac{p(\theta \mid D)}{p(\theta)} \, d\theta$$

(this is essentially the number of times the prior had to be compressed by a factor of $e$ to get to the posterior). When we have nuisance parameters $\eta$, there is an ambiguity. We could compute the information either including or excluding the nuisance parameters (i.e. we could calculate either the joint or the marginal compression ratio, using either $p(\theta, \eta \mid D)$ or $p(\theta \mid D)$). In other words we could ask “how much did I learn about the interesting parameters?” or “how much did I learn about all the parameters, interesting and otherwise?”.
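As a sketch of the two compression ratios, here is a toy discrete model with an interesting parameter theta and a nuisance parameter eta; the prior and posterior are invented for illustration:

```python
# KL divergence from prior to posterior, computed two ways: over the joint
# (theta, eta) space, and over the marginal for theta alone.
import numpy as np

def kl(p, q):
    """KL divergence sum p log(p/q) in nats, over entries with p > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Uniform joint prior on a 50x50 grid; a made-up Gaussian-shaped posterior
# that is informative about theta but only weakly informative about eta.
prior = np.full((50, 50), 1.0 / 2500.0)
theta, eta = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50),
                         indexing="ij")
post = np.exp(-0.5 * ((theta - 0.3) / 0.05) ** 2
              - 0.5 * ((eta - 0.7) / 0.3) ** 2)
post /= post.sum()

joint_info = kl(post, prior)                         # about (theta, eta) together
marg_info = kl(post.sum(axis=1), prior.sum(axis=1))  # about theta alone
print(joint_info, marg_info)
```

The joint figure can never be smaller than the marginal one (by the chain rule for KL divergence), which is why the two questions genuinely differ.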
I learnt from Ariel Caticha (SUNY Albany) that inference boils down to forming the “joint prior” $p(\theta, D)$ and then deleting all the values that are known to be false (i.e. all but the actual data). So we can consider the prior and the posterior in the joint space of possible parameters and possible data sets. If you marginalize over data sets you get the regular posterior for the parameters. If you compute the “information” in this joint space, you will find that there is a lot of it. This is a measure of the information content of the data, but it’s a measure of how much the data tells you about the parameters and the data! You could also compute the amount of information the data contains about the data, and you’ll get $-\log p(D)$, which is the silly “information” definition from the undergrad course. Both of these definitions are useless, unless your scientific question was about the noise in pixel (545, 253) on your detector! The real information content that we care about is the compression of the marginal prior $p(\theta)$ to the marginal posterior $p(\theta \mid D)$, because the purpose of obtaining data is to learn about parameters. Therefore data are just like nuisance parameters: we don’t care how much information we learned about the data itself, only how it affects our state of knowledge about the parameters.
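To see the contrast numerically, here is a hypothetical coin-flipping example comparing $-\log p(D)$ with the KL divergence from the marginal prior to the marginal posterior (all numbers invented for illustration):

```python
# The naive information -log p(D) vs. the KL information gained about the
# parameter, for a coin with unknown bias theta and a discretized uniform prior.
import numpy as np

theta = np.linspace(0.01, 0.99, 99)
prior = np.full(theta.size, 1.0 / theta.size)

rng = np.random.default_rng(1)
flips = rng.random(100) < 0.6                   # 100 flips, true bias 0.6
k = int(flips.sum())                            # number of heads

# Likelihood of this particular sequence for each theta value
log_like = k * np.log(theta) + (100 - k) * np.log(1.0 - theta)
p_D = np.sum(prior * np.exp(log_like))          # marginal probability of the data
post = prior * np.exp(log_like)
post /= post.sum()                              # posterior over theta

naive_info = -np.log(p_D)                       # huge, and grows with data size
mask = post > 0
kl_info = np.sum(post[mask] * np.log(post[mask] / prior[mask]))
print(naive_info, kl_info)
```

The naive figure comes out to tens of nats and would keep growing with every extra flip, while the KL figure stays modest: it measures only what was learned about the bias.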