Overfitting

This week I gave a presentation to the astronomy group here at the University of Auckland about some work I’ve been doing over the past year. As usual, that work involves fitting fairly complex models to datasets.

One question I got related to overfitting. I find all the warnings we hear about overfitting, and all the methods we supposedly have to use to avoid it, a little odd. These messages completely clash with my own experience, which is that overfitting basically never happens and you don’t have to do anything to avoid it. In fact, the first time I ever fitted a gravitational lens model to an astronomical image, I had a big problem with underfitting, caused by my naive prior. I had used a Uniform(0, 1E6) prior applied independently to the pixel brightnesses, and it turns out that implies a very strong commitment to the sky being bright, which it isn’t.
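
To see what that prior commits you to, here is a minimal sketch (the image size and the darkness threshold are invented for illustration): a typical image drawn from independent Uniform(0, 1E6) pixel priors is a glowing, noisy sky, with essentially no prior mass on a dark one.

```python
# A minimal sketch of the prior described above. The image size and the
# 'dark' threshold are invented; the point is that independent
# Uniform(0, 1E6) priors on pixel brightnesses put nearly all prior
# mass on a bright sky.
import numpy as np

rng = np.random.default_rng(0)
n_pixels = 10_000

# One sample image from the prior.
prior_draw = rng.uniform(0.0, 1e6, size=n_pixels)

print(prior_draw.mean())          # about 5e5: a glowing sky
print((prior_draw < 1e3).mean())  # only ~0.1% of pixels are anywhere near dark
```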

Most examples of ‘overfitting’ are caused by attempts to solve inference problems with optimisation methods. If an optimisation-based method breaks (overfits), that’s telling you something important. Inference is not an optimisation problem, so you’re using the wrong tool for the job.
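
As a sketch of that distinction (the data, the polynomial degree, and the prior scale here are all made up, and this is not the lensing model from the talk): least squares on a flexible model chases the noise, while the posterior mean under a Gaussian prior on the same model does not.

```python
# A hedged sketch contrasting optimisation (least squares) with Bayesian
# inference on the same flexible model. All numbers are invented.
import numpy as np

rng = np.random.default_rng(1)

# Ten noisy samples from a smooth function.
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)

# Degree-9 polynomial features: enough freedom to interpolate the noise.
X = np.vander(x, 10, increasing=True)

# Optimisation: maximum likelihood / least squares. The coefficients grow
# large so the curve threads every data point -- classic 'overfitting'.
w_mle = np.linalg.lstsq(X, y, rcond=None)[0]

# Inference: conjugate Bayesian linear regression with a N(0, tau^2 I)
# prior on the coefficients and known noise sd sigma. The posterior mean
# shrinks the coefficients instead of chasing the noise.
sigma, tau = 0.2, 1.0
A = X.T @ X / sigma**2 + np.eye(10) / tau**2
w_post = np.linalg.solve(A, X.T @ y / sigma**2)

print(np.abs(w_mle).max())   # large
print(np.abs(w_post).max())  # modest
```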

About Brendon J. Brewer

I am a senior lecturer in the Department of Statistics at The University of Auckland. Any opinions expressed here are mine and are not endorsed by my employer.

One Response to Overfitting

  1. While I imagine overfitting is possible in Bayesian contexts with extraordinarily flat priors, in general your experience is consistent with both mine and that of this author. Kruschke, in a 2012 comment on a paper by Gelman and Shalizi, refers to a “Bayesian Occam’s razor effect”, a term and insight he attributes to MacKay in Information Theory, Inference, and Learning Algorithms (Cambridge University Press, 2003); a toy illustration of that effect appears after this comment. Finally, Gelman writes about how he avoids overfitting.

    Some practitioners avoid Bayesian methods because of what they see as excessive computational demands. But, in my view, along the lines of the No Free Lunch theorems, without the Bayesian approach it’s entirely possible to be trapped in subsets of the sample space which don’t capture important but unlikely behaviors or features, yet come away feeling that the scores say your model is good. Fraser complains about this early in his otherwise excellent Hidden Markov Models and Dynamical Systems (Section 1.2.1), where he laments that an HMM used for simulating, and so extrapolating, 100-year floods may not be robust, citing the example of a laser model which had stable orbits in its behavior but missed an unstable orbit which was actually observed.

    That all sounds complicated (not to you, but to a more general audience here), but consider relative-frequency (or MLE) predictions of the compositions of colored balls in 3-color bins where one of the colors is rare. Initial samples may not pick up the rare color at all and, so, predictors of composition will neglect it. But a multinomial with a Dirichlet conjugate prior will always put some weight on that logically possible rare color, as sketched below. Sure, maybe the bin has none of that color because it is itself a sample of a larger population, but if having a single instance of that color has high value, it’s nice to know that it remains possible and should be considered in decision making.
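
A minimal sketch of that colored-balls point (the counts are invented): the relative-frequency estimate declares an unseen color impossible, while the posterior mean under a Dirichlet(1, 1, 1) prior never does.

```python
# Relative frequency (MLE) vs Dirichlet-multinomial posterior mean for
# three colors, one of which has not yet been observed. Counts invented.
counts = [7, 3, 0]  # draws of each color; the rare one unseen so far
n = sum(counts)

# MLE / relative frequency: the rare color is declared impossible.
mle = [c / n for c in counts]

# Dirichlet(alpha=1) conjugate prior: the posterior mean adds one
# pseudo-count per color, so the rare color keeps nonzero probability.
alpha = 1.0
posterior_mean = [(c + alpha) / (n + 3 * alpha) for c in counts]

print(mle)             # [0.7, 0.3, 0.0]
print(posterior_mean)  # [0.615..., 0.307..., 0.0769...]
```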
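
The “Bayesian Occam’s razor effect” mentioned above can be seen in an even smaller example, not taken from MacKay directly (the flip counts are invented): with unremarkable coin-flip data, the marginal likelihood automatically favours the simpler fair-coin model over a model with a free bias parameter.

```python
# Marginal likelihoods for two models of ten coin flips with six heads.
from math import comb

n, k = 10, 6  # unremarkable data

# M0: fair coin. The marginal likelihood is just the binomial likelihood.
evidence_fair = comb(n, k) * 0.5**n

# M1: unknown bias p with a Uniform(0, 1) prior. Integrating the binomial
# likelihood over the prior gives exactly 1 / (n + 1) for any k.
evidence_uniform = 1.0 / (n + 1)

print(evidence_fair, evidence_uniform)
print(evidence_fair / evidence_uniform)  # Bayes factor > 1: simpler model wins
```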
