I hear a lot of people saying that simple models work better than complex models when you have enough data. For example, here’s a tweet from Giuseppe Paleologo this morning:
Isn’t it ironic that almost all known results in asymptotic statistics don’t scale well with data?
There are several things people could mean when they say that complex models don’t scale well.
First, they may mean that the implementation of complex models doesn’t scale. The computational effort required to fit the model increases disproportionately with the amount of data.
Second, they could mean that complex models aren’t necessary. A complex model might do even better than a simple model, but simple models work well enough given lots of data.
A third possibility, less charitable than the first two, is that the complex models are a bad fit, and this becomes apparent given enough data. The data calls the model’s bluff. If a statistical model performs poorly with lots of data, it must have performed poorly with a small amount of data too, but you couldn’t tell. It’s simple over-fitting.
I believe that’s what Giuseppe had in mind in his remark above. When I replied that the problem is modeling error, he said “Yes, big time.” The results of asymptotic statistics scale beautifully when the model is correct. But giving a poorly fitting model more data isn’t going to make it perform better.
The wrong conclusion would be to say that complex models work well for small data. I think the conclusion is that you can’t tell that complex models are not working well with small data. It’s a researcher’s paradise. You can fit a sequence of ever more complex models, getting a publication out of each. Evaluate your model using simulations based on your assumptions and you can avoid the accountability of the real world.
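As an illustration of how small samples hide the problem, here is a small simulation sketch in Python (my own toy example, not from the post): the true relationship is linear, the fitted model is a needlessly high-degree polynomial, and only a larger sample makes the gap between in-sample and out-of-sample error obvious.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_and_score(n, degree=8, noise=0.5):
    """Fit an over-complex polynomial to data whose true relationship is linear."""
    x = rng.uniform(-1, 1, n)
    y = 2.0 * x + rng.normal(0, noise, n)            # simple truth plus noise
    coeffs = np.polyfit(x, y, deg=degree)            # deliberately over-complex model
    x_new = rng.uniform(-1, 1, 10_000)               # fresh data from the same source
    y_new = 2.0 * x_new + rng.normal(0, noise, 10_000)
    in_sample = np.mean((np.polyval(coeffs, x) - y) ** 2)
    out_of_sample = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    return in_sample, out_of_sample

for n in (15, 150, 15_000):
    fit, test = fit_and_score(n)
    print(f"n={n:6d}  in-sample MSE={fit:.3f}  out-of-sample MSE={test:.3f}")
```

In runs like this, the tiny sample gives an in-sample error that flatters the model while the out-of-sample error is considerably worse; with thousands of points the two agree and the extra polynomial terms buy nothing.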
If the robustness of simple models is important with huge data sets, it’s even more important with small data sets.
Model complexity should increase with data, not decrease. I don’t mean that it must increase, only that it can. With more data, you have the ability to test the fit of more complex models. When people say that simple models scale better, they may mean that they haven’t been able to do better, that the data has exposed the problems with other things they’ve tried.
I’ve usually understood that the complexity being referred to was conceptual, rather than complexity as degrees of freedom.
When you have little data, you have to be very careful how you use it: go Bayesian, use justifiable priors, and so on, so the model ends up looking complicated.
If you have lots of data, you can use something conceptually simple like logistic regression, throw in everything you can think of as a feature, and fit the model without over-fitting. The complexity measured in number of parameters is high, but it’s a logistic regression model, easy to wrap your brain around – “simple”.
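A minimal sketch of that regime, assuming scikit-learn and made-up data (my illustration, not the commenter’s): a conceptually simple logistic regression with hundreds of coefficients that still generalizes because there is far more data than parameters.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Lots of rows, lots of features: high parameter count, conceptually simple model.
n, p = 100_000, 300
X = rng.normal(size=(n, p))
true_beta = rng.normal(size=p) * (rng.random(p) < 0.1)   # most features are irrelevant
prob = 1.0 / (1.0 + np.exp(-(X @ true_beta)))
y = (rng.random(n) < prob).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("held-out accuracy:", round(model.score(X_test, y_test), 3))
```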
I’m curious about your opinion of AIC as a guide in situations like this. Personally I rarely calculate an actual AIC, but I do like to keep it in mind as a gauge of how much better the fit needs to be to justify increasing the complexity of a model. Which sometimes means coming up with a hand-wavy number for the degrees of freedom represented by the added complexity.
The AIC and various *ICs have the virtue of including an explicit penalty for model complexity. And it’s reassuring when your model is optimal by some objective standard. But the choice of models to compare and the choice of information criterion are not as objective, so there can be an inflated impression of objectivity.
Sometimes AIC, BIC, DIC, etc. give you a different perspective, and that may be when they’re most valuable. Maybe you think two models are a toss up, but then you see that some *IC seems to believe there’s a greater difference between models than you’d expect. In the process of finding out why that is, you learn more about your data.
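For reference, the penalty the question above is reaching for is explicit in the standard definition. With $k$ parameters and maximized likelihood $\hat{L}$,

$$\mathrm{AIC} = 2k - 2\ln\hat{L},$$

so each added parameter has to buy at least one unit of log-likelihood before it lowers the AIC. That gives a concrete answer to “how much extra fit do I need” even when the effective degrees of freedom are only a hand-wavy estimate.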
Since I have been slashdotted by John, I thought I could expand my tweet into a more coherent thought. My criticism of asymptotic statistics *in practical application* is primarily of the first kind: most published methods become more and more useful as more data becomes available, but their computational requirements are not linear or quasilinear. This statement applies to almost all non-trivial statistics, but I would like to mention one that is really central to econometrics: the Generalized Method of Moments (GMM). Asymptotically, GMM addresses issues of serial dependence, heteroskedasticity, and dependent residuals. In practice, however, it’s extremely hard to run it on large datasets, say high-frequency data, because it would require decomposition and multiplication of very large matrices.
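For readers outside econometrics, a textbook statement of the estimator (my summary, not part of the comment): given moment conditions $E[g(x_t,\theta)] = 0$ and sample moments $\bar{g}_T(\theta) = \frac{1}{T}\sum_{t=1}^{T} g(x_t,\theta)$, GMM solves

$$\hat{\theta} = \arg\min_{\theta}\ \bar{g}_T(\theta)^\top \hat{W}_T\, \bar{g}_T(\theta),$$

where the efficient weighting matrix $\hat{W}_T$ is the inverse of an estimate of the long-run covariance of the moments (a HAC estimator when there is serial dependence and heteroskedasticity). Estimating, decomposing, and inverting that covariance matrix is exactly the step that becomes painful on large, high-frequency data sets.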
I would also add that model complexity doesn’t equate to computational complexity. See the work of Leon Bottou (and A. Tsybakov) on scalable estimation. In his approach, computational complexity is an integral part of the analysis, not grafted onto it at the end (say, by stochastic approximation of matrix products). Simple models in the presence of infinite data are still subject to approximation error, even if the estimation error is zero. In this respect, I disagree with Norvig, who stated that simple models and lots of data beat complex models.
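To make the approximation/estimation distinction precise, here is the standard decomposition of excess risk from statistical learning theory (the scalable-estimation work mentioned above adds a third term for optimization error). With $f^*$ the best possible predictor, $f_{\mathcal{F}}$ the best predictor in the chosen model class $\mathcal{F}$, and $\hat{f}_n$ the model fitted to $n$ observations,

$$R(\hat{f}_n) - R(f^*) \;=\; \underbrace{R(f_{\mathcal{F}}) - R(f^*)}_{\text{approximation error}} \;+\; \underbrace{R(\hat{f}_n) - R(f_{\mathcal{F}})}_{\text{estimation error}}.$$

More data drives the estimation term toward zero, but the approximation term is fixed by the model class: a simple model keeps that error no matter how much data arrives.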
Like Dr. Paleologo, I also disagree with Norvig. This was a topic at a meeting of the Boston Chapter of the ASA, where it drew quite a bit of discussion. A common premise among much of the so-called “data mining” community is that massive distributed search over huge amounts of data can supplant statistical sophistication. I disagree with that. But that’s getting off track.
The other kind of “model complexity” which may come to bear in some cases is the amount of physical reality the model-builder wants to capture. Whether it is oceanic turbulence or the vibration modes of a triatomic molecule, the physical scale, the forces, and the time scale are also choices to be considered. Some are determined, by experience or by modeling, to contribute little to the phenomenon being studied, or to contribute below the level of accuracy with which the data are captured. If the model has physical complexity the data cannot convey, it is sometimes necessary to “drop terms” from the model, lest those components be erroneously forced by noise in the data. Alternatively, the data might be smoothed, but, with good reason, people are sometimes uncomfortable doing that. I think they would rather pull out components of the data and expose models to each separately, under the constraint that recombining the components yields the original data signal to high fidelity.
AIC, BIC, and other model selection criteria, per Burnham and Anderson (*), can be useful guides, especially where physics does not dictate the model components. But they argue, and I agree, and Dr. Cook, I think, says as much, that it’s important for all models to have their day in the court of data, and there are ways of scoring them when they do.
——
(*) K. P. Burnham, D. R. Anderson, Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, 2nd edition, 2002.