If you train a model on a set of data, it should fit that data well. The hope, however, is that it will also fit a new set of data well. So in machine learning and statistics, people split their data into two parts. They train the model on one half, and see how well it fits on the other half. This is called cross validation, and it helps prevent over-fitting, that is, fitting a model too closely to the peculiarities of a data set.
For example, suppose you have measured the value of a function at 100 points. Unbeknownst to you, the data come from a cubic polynomial plus some noise. You can fit these 100 points exactly with a 99th degree polynomial, but this gives you the illusion that you’ve learned more than you really have. But if you divide your data into test and training sets of 50 points each, overfitting on the training set will result in a terrible fit on the test set. If you fit a cubic polynomial to the training data, you should do well on the test set. If you fit a 49th degree polynomial to the training data, you’ll fit it perfectly, but do a horrible job with the test data.
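Here’s a minimal numpy sketch of that experiment. The cubic’s coefficients and the noise level are made up for illustration, and np.polyfit will typically warn that the degree-49 fit is numerically ill-conditioned.

```python
import numpy as np

rng = np.random.default_rng(0)

# 100 points from a cubic polynomial plus noise
# (coefficients and noise level invented for illustration)
x = np.linspace(-1, 1, 100)
y = 2*x**3 - x**2 + 0.5*x + 1 + rng.normal(scale=0.2, size=x.size)

# Shuffle, then split into 50-point training and test halves
idx = rng.permutation(100)
train, test = idx[:50], idx[50:]

for degree in (3, 49):
    # Fit the polynomial to the training half only
    coeffs = np.polyfit(x[train], y[train], degree)
    train_mse = np.mean((np.polyval(coeffs, x[train]) - y[train])**2)
    test_mse = np.mean((np.polyval(coeffs, x[test]) - y[test])**2)
    # The high-degree fit drives the training error down but typically
    # does far worse on the held-out half.
    print(degree, train_mse, test_mse)
```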
Now suppose we have two kinds of models to fit. We train each on the training set, and pick the one that does better on the test set. We’re not over-fitting because we haven’t used the test data to fit our model. Except we really are: we used the test set to select a model, though we didn’t use the test set to fit the parameters in the two models. Think of a larger model as a tree. The top of the tree tells you which model to pick, and under that are the parameters for each model. When we think of this new hierarchical model as “the model,” then we’ve used our test data to fit part of the model, namely to fit the bit at the top.
With only two models under consideration, this isn’t much of a problem. But if you have a machine learning package that tries millions of models, you can be over-fitting in a subtle way, and this can give you more confidence in your final result than is warranted.
The distinction between parameters and models is fuzzy. That’s why “Bayesian model averaging” is ultimately just Bayesian modeling. You could think of the model selection as simply another parameter. Or you could go the other way around and think of each parameter value as an index for a family of models. So if you say you’re only using the test data to select models, not parameters, you could be fooling yourself.
For example, suppose you want to fit a linear regression to a data set. That is, you want to pick m and b so that y = mx + b is a good fit to the data. But now I tell you that you are only allowed to fit models with one degree of freedom. You’re allowed to do cross validation, but you’re only allowed to use the test data for model selection, not model fitting.
So here’s what you could do. Pick a constant value of b, call it b0. Fit the one-parameter model y = mx + b0 on your training data, choosing m only to minimize the error on the training set, then see how well the result does on the test set. Now pick another value of b, call it b1, and repeat. Keep going until you’ve found the value of b that does best on the test set. You’ve essentially used the training and test data together to fit a two-parameter model, albeit awkwardly.
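To make the procedure concrete, here’s a small sketch in Python. The synthetic data, the grid of candidate intercepts, and the closed-form slope formula are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from y = 2x + 3 plus noise (all values invented for illustration)
x = rng.uniform(-5, 5, 200)
y = 2*x + 3 + rng.normal(scale=1.0, size=x.size)
x_train, y_train = x[:100], y[:100]
x_test, y_test = x[100:], y[100:]

def fit_slope(x, y, b0):
    """Least-squares slope m for the one-parameter model y = m*x + b0."""
    return np.sum(x * (y - b0)) / np.sum(x * x)

best_b, best_err = None, np.inf
for b0 in np.linspace(-10, 10, 201):       # "model selection" over the intercept...
    m = fit_slope(x_train, y_train, b0)    # ...parameter fitting on the training set...
    err = np.mean((m * x_test + b0 - y_test)**2)  # ...scored on the test set
    if err < best_err:
        best_b, best_err = b0, err

# The loop has effectively fit both m and b, using the test data to choose b.
print(best_b, fit_slope(x_train, y_train, best_b))
```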
Related: Probability modeling
Or, one could use tri-fold sampling…?
“If you train a model on a set of data, it should fit that data well.”
That depends on where you live. In the hard sciences, where models can be fairly exact, this is absolutely true. But in the softer sciences, and this includes clinical trials, this often doesn’t hold. It’s not uncommon for epidemiological studies to have R^2 values on the order of 0.2; my understanding is that in the world of, say, physical chemistry such a result would be unpublishable. So, it’s actually very instructive to go back and assess the goodness-of-fit of the model on the training set, and it’s often remarkable how poor the predictive value is, even on the data used to construct the model.
Isn’t this precisely the motivation for splitting data not into 2 groups but 3: train, validation, and test? (i.e. the ‘proper’ way)
Run 2-fold cross-validation using the train and validation sets, choose the best (hyper)parameter, and then simply evaluate on test. You can’t take the test results and then relearn parameters since that’d be cheating and would formally relabel the test set as a validation set.
(Of course, 5-fold or 10-fold cross validation are better, but same idea).
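For what it’s worth, here’s roughly how that workflow looks in scikit-learn, using the 5-fold variant; the dataset, the estimator, and the parameter grid are arbitrary stand-ins.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy data; any real dataset would take its place
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

# Hold out a test set that is never touched while tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation on the training portion picks the hyperparameter
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# One evaluation on the untouched test set; no relearning after this point
print(search.best_params_, search.score(X_test, y_test))
```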
This is very like a conversation I had with Michael Nielsen a few months back. Information about datasets is stored subtly in all kinds of places you might not think to look. It might even be stored diffusely in the “best practices” or folklore of an entire community. For example there’s some standard dataset that’s used for developing optical flow algorithms. When you publish an optical flow algorithm you publish figures showing how well your algorithm performs on that set. This means that an entire community of people think they’re developing good flow algorithms (and their associated parameters) but are actually learning quite a few properties specific to this dataset.
Happens a lot in finance too.
If you take this idea far enough, you realize that explicitly testing anything is impossible.
You optimize your parameters on a training set. You optimize the hyperparameters on a validation set. You optimize model type on a further validation set. Journal editors optimize over approaches based on test set performance. Experts develop their mental shortcuts and heuristics over many previous research projects. At a sufficiently high level, all learning is over-learning.
The only real test is application and the “natural selection” of methods in action. This is not unlike the problem of induction and the no free lunch theorems. If you build a self-driving car that performs well, that’s useful. But you can never say that because it’s been driving for years without any problems it is a good method. Maybe its approach was overfitted to the particular area or climate etc.
So I’d say, use that algorithm in the wild and see if it’s performing well. Then you can generalize from that to the full extent that we can ever hope to generalize from anything.
Good discussion! It inspired me to write this up, on what I call out-of-sample data snooping:
http://eranraviv.com/sample-data-snooping/
Several people have mentioned breaking down the data into some larger number of subsets, but this isn’t going to prevent the problem from occurring. If I use observations 1:500 for training and observations 501:1000 for testing to try lots of functional forms, I’m really just searching for the feature set that works the best in 501:1000 when trained on 1:500. If I then try the model on observations 1001:1500 to measure out of sample error, it doesn’t change my previous step. It was still a specification search.
But, what is the alternative? If my loss function is error variance in the data I have at the time, I’m going to create a model that fits that data really well, probably better than the model will fit new data. If my loss function is some mix of error variance in the data I have at the time, plus other stuff (e.g. deviation from my preexisting beliefs, deviation from philosophical commitments like a preference for parsimony) I’ll wind up with a different model which is just overfitted in terms of a different loss function.
Since I can’t tell which features are unique to the data in front of me and which are characteristics of the process that generated the data I have, I’m pretty certain to overfit to something. The choice I’m facing at that point isn’t whether to overfit, it’s what to overfit my model to.
James: I like your last line. I’d add that in general we face the choice of whether to overstate our confidence or have some humility.
John:
Nice post. Just 2 comments:
1. I’d prefer you didn’t use a polynomial example, but rather something more “realistic” such as y = A*exp(-a*t) + B*exp(-b*t). I just hate how in certain fields such as physics and economics, polynomials are the default model, even though we just about never see anything that is usefully modeled by a polynomial of degree higher than 2.
2. Cross-validation is a funny thing. When people tune their models using cross-validation, they sometimes think that because it’s an optimum, it must be the best. Two things I like to say, in an attempt to shake people out of this attitude:
(a) The cross-validation estimate is itself a statistic, i.e. it is a function of the data; it has a standard error, etc. (see the sketch just after this comment).
(b) We have a sample and we’re interested in a population. Cross-validation tells us what performs best on the sample, or maybe on the hold-out sample, but our goal is to use what works best on the population. A cross-validation estimate might have good statistical properties for the goal of prediction for the population, or maybe it won’t.
Just cos it’s “cross-validation,” that doesn’t necessarily make it a good estimate. An estimate is an estimate, and it can and should be evaluated based on its statistical properties. We can accept cross-validation as a useful heuristic for estimation (just as Bayes is another useful heuristic) without buying into it as necessarily best.
I think the above is consistent with your post.
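To put a rough number on point (a), here’s a small sketch, with an arbitrary toy dataset and model, showing that the cross-validation estimate changes with the random fold assignment, i.e. it is itself a random quantity with its own variability.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=20.0, random_state=0)

scores = []
for seed in range(50):
    # Re-randomize the fold assignment and recompute the 5-fold CV score
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores.append(cross_val_score(Ridge(alpha=1.0), X, y, cv=cv).mean())

scores = np.array(scores)
# The spread across repeated splits shows the CV estimate is itself a random quantity
print(scores.mean(), scores.std())
```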
I was quite surprised how susceptible some kernel learning methods are to over-fitting the model selection criterion (whether cross-validation or evidence maximisation) and wrote up some illustrative experiments here (using kernel ridge regression / least-squares support vector machine):
G. C. Cawley and N. L. C. Talbot, “Over-fitting in model selection and subsequent selection bias in performance evaluation,” Journal of Machine Learning Research, vol. 11, pp. 2079-2107, July 2010. (http://jmlr.csail.mit.edu/papers/v11/cawley10a.html)
At the very least it is a good idea to use something like nested cross-validation to get a performance estimate for the whole model-fitting procedure (including optimising hyper-parameters, feature selection, etc).
[comment copied from http://andrewgelman.com/2015/06/02/cross-validation-magic/#comment-220394]
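As a rough sketch of what nested cross-validation can look like in practice (kernel ridge regression is used here because that’s the model the paper studies, but the data and parameter grid are invented): an inner loop tunes the hyper-parameters, and an outer loop estimates the performance of the whole tuning procedure.

```python
from sklearn.datasets import make_regression
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_regression(n_samples=300, n_features=15, noise=10.0, random_state=0)

# Inner loop: 5-fold CV chooses the hyper-parameters
inner = GridSearchCV(
    KernelRidge(kernel="rbf"),
    {"alpha": [0.01, 0.1, 1.0], "gamma": [0.01, 0.1, 1.0]},
    cv=5)

# Outer loop: 5-fold CV scores the entire select-then-fit procedure
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean(), outer_scores.std())
```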
Dear Dr. Cook,
Thanks for your post!
I have a question: under what circumstances is the error rate on a test set so much higher than the k-fold cross-validation error rate?
For example, I have an n >> p data set, and I randomly split it into a train set and a test set. I did 5-fold cross validation on the train set and found that the estimated error rate was very low. However, when I applied the same parameter settings to build a model on the entire train set and tested it on the test set, I found that the test set error rate was significantly higher than the CV error rate.
I don’t understand why this happens. Have you ever encountered this issue? Would you please help me understand where I went wrong?
Thanks!