This week I gave a presentation in the astronomy group here at the University of Auckland, about some work I’ve been doing over the past year. As usual, that work involves fitting fairly complex models to datasets.

One question I got was about overfitting. I find all the warnings we hear about overfitting a little odd, along with all the methods we supposedly have to use to avoid it. These messages completely clash with my own experience, which is that overfitting basically never happens and you don’t have to *do* anything to avoid it. In fact, the first time I ever fitted a gravitational lens model to an astronomical image, I had a big problem with *underfitting* caused by my naive prior. I had used a Uniform(0, 1E6) prior applied independently to some pixels, and it turns out that this implies a very strong commitment to the sky being bright, which it isn’t.
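To see why independent Uniform(0, 1E6) priors amount to a strong commitment, here is a minimal sketch that draws whole images from such a prior and looks at their average brightness. The function name, the pixel count and the number of draws are all assumptions chosen for illustration; this is not the lens model itself.

```python
import random
import statistics

def prior_image_means(n_pixels, n_images, upper=1e6, seed=1):
    """Draw images from independent Uniform(0, upper) pixel priors
    and return the mean brightness of each image.

    All names and defaults here are hypothetical, for illustration only.
    """
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.uniform(0.0, upper) for _ in range(n_pixels))
        for _ in range(n_images)
    ]

# 100 prior draws of a hypothetical 50x50 image (2500 pixels)
means = prior_image_means(n_pixels=2500, n_images=100)

# The mean of n independent Uniform(0, upper) values has standard
# deviation upper / sqrt(12 * n), so it is pinned close to upper / 2.
analytic_sd = 1e6 / (12 * 2500) ** 0.5

print(f"smallest image mean: {min(means):.0f}")
print(f"largest image mean:  {max(means):.0f}")
print(f"analytic sd of mean: {analytic_sd:.0f}")
```

Every image drawn from this prior has a mean brightness close to 5E5, so a mostly dark sky is effectively ruled out before any data arrive, even though each individual pixel's prior looks vague.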

Most examples of ‘overfitting’ are caused by attempts to solve inference problems with optimisation methods. If an optimisation-based method breaks (overfits), that’s telling you something important. Inference is not an optimisation problem, so you’re using the wrong tool for the job.


## About Brendon J. Brewer

I am a senior lecturer in the Department of Statistics at The University of Auckland. Any opinions expressed here are mine and are not endorsed by my employer.

While I imagine overfitting is possible in Bayesian contexts with extraordinarily flat priors, in general your experience is consistent with both mine and this author’s. Kruschke, in a 2012 comment on a paper by Gelman and Shalizi, refers to a “Bayesian Occam’s razor effect”, a term and insight he attributes to MacKay in *Information Theory, Inference, and Learning Algorithms* (Cambridge University Press, 2003). Finally, Gelman writes about how he avoids overfitting.

Some practitioners avoid Bayesian methods because of what they see as excessive computational demands. But, in my view, along the lines of No Free Lunch theorems, without the Bayesian approach it’s entirely possible to be trapped in subsets of the sample space which don’t capture important but unlikely behaviors or features, yet come away feeling that the scores say your model is good. Fraser complains about this early in his otherwise excellent *Hidden Markov Models and Dynamical Systems*, in Section 1.2.1, when he laments that an HMM used for simulating, and so extrapolating, 100-year floods may not be robust, citing the example of a laser model which had stable orbits in its behavior but missed an unstable orbit which was actually seen.

That all sounds complicated (not to you, but to a more general audience here), but consider relative frequency (or MLE) predictions of the composition of colored balls in 3-color bins where one of the colors is rare. Initial samples may not pick up the rare color at all, so predictors of composition will neglect it. But a multinomial with a Dirichlet conjugate prior will always put some weight on that logically possible rare color. Sure, maybe the bin has none of that color because it is itself a sample of a larger population, but if having a single instance of that color has high value, it’s nice to know that it remains possible and should be considered in decision making.
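To make the ball example concrete, here is a minimal sketch comparing the relative-frequency (MLE) estimate with the posterior mean under a symmetric Dirichlet prior. The sample, the color names and the choice of `alpha = 1` (a flat prior over compositions) are assumptions made up for illustration, not data from any real bin.

```python
from collections import Counter

def mle_probs(counts, colors):
    """Relative-frequency (maximum likelihood) estimate of the composition."""
    total = sum(counts.get(c, 0) for c in colors)
    return {c: counts.get(c, 0) / total for c in colors}

def dirichlet_posterior_mean(counts, colors, alpha=1.0):
    """Posterior mean of the composition under a symmetric Dirichlet(alpha)
    prior with a multinomial likelihood (the conjugate update simply adds
    alpha to every observed count)."""
    total = sum(counts.get(c, 0) for c in colors) + alpha * len(colors)
    return {c: (counts.get(c, 0) + alpha) / total for c in colors}

colors = ["red", "green", "blue"]

# Hypothetical sample of 20 draws in which the rare color (blue) never appears
sample = ["red"] * 12 + ["green"] * 8
counts = Counter(sample)

mle = mle_probs(counts, colors)
post = dirichlet_posterior_mean(counts, colors, alpha=1.0)

print("MLE estimate for blue:      ", mle["blue"])   # exactly zero
print("Posterior mean for blue:    ", post["blue"])  # (0 + 1) / (20 + 3)
```

The MLE declares blue impossible, while the posterior mean, which is also the predictive probability that the next ball is blue, stays at 1/23. How much weight the rare color keeps depends on `alpha`, so in a real decision problem that choice would need to be justified rather than defaulted.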