I just met up with a new student of mine, and gave her some warmup questions to get familiar with some of the things we’ll be working on. This involved “differential entropy”, which is basically Shannon entropy but for continuous distributions.
An intuitive interpretation of this quantity is, loosely speaking, as the generalisation of “log-volume” to non-uniform distributions. If you changed parameterisation from x to x’, you’d stretch the axes and end up with different volumes, so differential entropy is not invariant under changes of coordinates. Using this quantity implicitly brings in a notion of volume, based on a flat measure that would not remain flat under arbitrary coordinate changes. In principle, you should give the measure explicitly (or use relative entropy/KL divergence instead), but it’s no big deal if you don’t, as long as you know what you’re doing.
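To see the stretching at work, here is a quick numerical sketch (not from the post; plain numpy, with hypothetical helper names): a standard Gaussian has differential entropy ½ log(2πe) ≈ 1.419, and after the coordinate change x’ = 2x the entropy goes up by exactly log 2, the log of the Jacobian.

```python
import numpy as np

def diff_entropy(pdf, xs):
    """Approximate -∫ p(x) log p(x) dx on an evenly spaced grid by a Riemann sum."""
    p = pdf(xs)
    dx = xs[1] - xs[0]
    # Clip before taking logs so that points where p underflows to 0
    # contribute 0 * log(clip) = 0, matching the convention 0 log 0 = 0.
    plogp = p * np.log(np.clip(p, 1e-300, None))
    return -plogp.sum() * dx

def gauss_pdf(sigma):
    """Density of a zero-mean Gaussian with standard deviation sigma."""
    return lambda x: np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

xs = np.linspace(-50, 50, 200001)
h1 = diff_entropy(gauss_pdf(1.0), xs)  # ½ log(2πe) ≈ 1.419
h2 = diff_entropy(gauss_pdf(2.0), xs)  # same shape, axes stretched by 2
print(h1, h2, h2 - h1)                 # the difference is log 2 ≈ 0.693
```

A discrete Shannon entropy wouldn’t budge here: relabelling outcomes doesn’t change their probabilities. The continuous version picks up log 2 precisely because the flat reference measure got stretched along with the axis.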
For some reason, in this particular situation only, people wring their hands over the lack of invariance and say this makes the differential entropy wrong or bad or something. For example, the Wikipedia article states
Differential entropy (also referred to as continuous entropy) is a concept in information theory that began as an attempt by Shannon to extend the idea of (Shannon) entropy, a measure of average surprisal of a random variable, to continuous probability distributions. Unfortunately, Shannon did not derive this formula, and rather just assumed it was the correct continuous analogue of discrete entropy, but it is not. The actual continuous version of discrete entropy is the limiting density of discrete points (LDDP).
(Aside: this LDDP thing comes from Jaynes, who I think was awesome, but that doesn’t mean that he was always right or that people who’ve read him are always right)
Later on, the Wikipedia article suggests something is strange about differential entropy because it can be negative, whereas discrete Shannon entropy is non-negative. Well, volumes can be less than 1, whereas counts of possibilities cannot be. Scandalous!
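To make the volume analogy concrete (my illustration, not the article’s): a uniform distribution on an interval of width less than 1 has negative differential entropy, exactly as a box with volume less than 1 has negative log-volume.

```python
import math

# Differential entropy of Uniform(a, b) is log(b - a): the log-volume
# of the support. For a width below 1 it comes out negative -- no
# paradox, just a volume smaller than 1.
width = 0.5
h = math.log(width)
print(h)  # log(0.5) ≈ -0.693 < 0
```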
This is probably a side effect of the common (and unnecessary) shroud of mystery surrounding information theory. Nobody would be tempted to edit the Wikipedia page on circles to say things like “the area of a circle is not really πr², because if you do a nonlinear transformation the area will change.” On second thoughts, many maths Wikipedia pages do degenerate into fibre bundles rather quickly, so maybe I shouldn’t say nobody would be tempted.