**Overfitting** happens when a model does too good a job of matching a particular data set and so does a poor job on new data. The way traditional statistical models address the danger of overfitting is to limit the number of parameters. For example, you might fit a straight line (two parameters) to 100 data points, rather than using a 99-degree polynomial that could match the input data exactly and probably do a terrible job on new data. You find the best fit you can to a model that doesn’t have enough flexibility to match the data too closely.
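The line-versus-polynomial contrast is easy to see numerically. Here is a minimal sketch on synthetic data (the degrees, noise level, and sample sizes are illustrative choices, not recommendations):

```python
import numpy as np

rng = np.random.default_rng(0)

# 20 noisy samples from an underlying straight line.
x = np.linspace(0, 1, 20)
y = 2 * x + 1 + rng.normal(scale=0.3, size=x.size)

# Fresh samples from the same process, standing in for "new data".
x_new = np.linspace(0.025, 0.975, 20)
y_new = 2 * x_new + 1 + rng.normal(scale=0.3, size=x_new.size)

errors = {}
for degree in (1, 15):
    # Least-squares polynomial fit; Polynomial.fit rescales the domain
    # internally, which keeps the high-degree fit numerically sane.
    p = np.polynomial.Polynomial.fit(x, y, degree)
    errors[degree] = (
        np.mean((p(x) - y) ** 2),         # error on the data we fit
        np.mean((p(x_new) - y_new) ** 2), # error on new data
    )

print(errors)
```

The degree-15 fit always matches the training data at least as well as the line (the line is a special case of it), but that flexibility is exactly what lets it chase the noise.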

Deep neural networks have enough parameters to overfit the data, but there are various strategies to keep this from happening. A common way to avoid overfitting is to **deliberately do a mediocre job** of fitting the model.

When it works well, the shortcomings of the optimization procedure yield a solution that differs from the optimal solution in a beneficial way. But the solution could fail to be useful in several ways. It might be too far from optimal, or deviate from the optimal solution in an unhelpful way, or the optimization method might accidentally do too good a job.

In a nutshell, the disturbing thing is that you have a negative criterion for what constitutes a good solution: one that’s not too close to optimal. But there are a lot of ways to not be too close to optimal. In practice, you experiment until you find **an optimally suboptimal solution**, i.e. the intentionally suboptimal fit that performs best in validation.

A reasonable fitting procedure like maximum likelihood, applied to multiple candidate models, can be combined with the Akaike information criterion (AIC) to give a good view of overfitting in some cases. Frequently, an expanded set of parameters will destroy information.
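For a Gaussian model fit by least squares, AIC reduces (up to an additive constant) to n·log(RSS/n) + 2k, so comparing polynomial degrees by AIC takes only a few lines. A sketch, with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = np.linspace(0, 1, n)
y = 2 * x + 1 + rng.normal(scale=0.3, size=n)

def aic(degree):
    # For Gaussian errors, AIC is n*log(RSS/n) + 2k up to a constant;
    # k counts the polynomial coefficients plus the noise variance.
    p = np.polynomial.Polynomial.fit(x, y, degree)
    rss = np.sum((p(x) - y) ** 2)
    k = degree + 2
    return n * np.log(rss / n) + 2 * k

scores = {d: aic(d) for d in (1, 5, 15)}
print(scores)  # the 2k term penalizes the extra parameters
```

Adding parameters always lowers RSS, but the 2k penalty charges for them, which is how AIC flags fits that buy training accuracy at the price of complexity.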

Have you seen this talk? https://youtu.be/bLqJHjXihK8

Yes, that’s a great talk. I think he’s really on to something. That line of reasoning could lead to a more deliberate way to train a neural net.

Can’t you just look at regularization and observe the training vs evaluation errors?

Yes, but it’s a kind of **passive regularization**. It’s not like using an explicit penalty on parameter size as in ridge regression. It’s hoping that doing a mediocre job of fitting the model will provide a useful form of regularization. And often it does!

It’s like creating an abstract painting by commissioning someone to create a Renaissance-style painting, but interrupting them halfway through. Depending on how the artist works, you might get something you like. But you might get a better abstract painting by commissioning an abstract painting.
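The “interrupted painter” move is essentially early stopping: run an ordinary optimizer on an overly flexible model, watch the error on held-out data, and keep the parameters from the point where that error stopped improving. A minimal sketch with gradient descent on polynomial features (the learning rate, patience, and data are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy samples of a smooth function, plus a held-out validation set.
x = np.linspace(-1, 1, 40)
y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=x.size)
x_val = np.linspace(-1, 1, 20)
y_val = np.sin(np.pi * x_val) + rng.normal(scale=0.2, size=x_val.size)

def design(t):
    # Deliberately overparameterized: degree-15 polynomial features.
    return np.vander(t, 16, increasing=True)

A, A_val = design(x), design(x_val)
w = np.zeros(A.shape[1])
lr = 0.05
best_val, best_w, patience = np.inf, w.copy(), 0

for step in range(5000):
    grad = A.T @ (A @ w - y) / len(y)
    w -= lr * grad
    val_err = np.mean((A_val @ w - y_val) ** 2)
    if val_err < best_val:
        best_val, best_w, patience = val_err, w.copy(), 0
    else:
        patience += 1
        if patience >= 100:  # validation error has stalled: interrupt the painter
            break

print(best_val)
```

Note that nothing here penalizes the parameters explicitly; the regularization comes entirely from not letting the optimizer finish.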

This is a well-known phenomenon, described by Geoffrey Hinton as a “beautiful free lunch”. More details in the context of linear regression can be found in the book Hands-On Machine Learning with Scikit-Learn and TensorFlow.

It seems like you’d be interested in this paper from earlier this year: “Understanding deep learning requires rethinking generalization” https://arxiv.org/abs/1611.03530

I used to feel the same way, and used to go around bashing early stopping.

If you analyze early stopping in the context of a linear model, it’s very similar to L2 regularization.
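That correspondence is easy to check numerically: for least squares started from zero, the gradient-descent path sweeps out solutions that behave like ridge solutions, with an effective penalty that shrinks as training proceeds (roughly λ ≈ 1/(learning rate × steps)). A rough sketch, with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 10
A = rng.normal(size=(n, p))
y = A @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

def gd(steps, lr=0.01):
    # Plain gradient descent on least squares, started at zero.
    w = np.zeros(p)
    for _ in range(steps):
        w -= lr * A.T @ (A @ w - y) / n
    return w

def ridge(lam):
    # Closed-form L2-regularized solution for penalty lam.
    return np.linalg.solve(A.T @ A / n + lam * np.eye(p), A.T @ y / n)

# Stopping early acts like a strong ridge penalty: the iterates start
# at zero (maximal shrinkage) and grow toward the unregularized
# least-squares solution as training continues.
w_early, w_late = gd(20), gd(10000)
print(np.linalg.norm(w_early), np.linalg.norm(w_late))
```

Each eigendirection of the problem relaxes toward the least-squares solution at its own rate, which is the same qualitative picture as ridge shrinking each eigendirection by an amount that depends on λ.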

I’m new to AI, so this might be a misunderstanding, but overfitting sounds like memorization to me. I mean, if you treated it as a memory system, couldn’t you use some sort of fitting criterion to compare the similarity of two overfitted inputs? Wouldn’t that be similar to how some methods compare data points based on distance, or am I out of my depth here?

Memorization would be the extreme of overfitting: your model would match your training data perfectly, but be useless for future input.

Now if you compare new input to old, that’s the basis of nearest neighbor methods.
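A toy illustration of that idea, where the “model” is nothing but stored training points plus a distance comparison (the points and labels here are made up):

```python
import numpy as np

# 1-nearest-neighbor: "memorize" the training set, then label a new
# point by copying the label of its closest stored example.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
y_train = np.array([0, 0, 1, 1])

def predict(x):
    dists = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(dists)]

print(predict(np.array([0.2, 0.1])))  # near the first cluster, so label 0
```

This is pure memorization by construction, and it generalizes exactly as far as “new inputs look like old inputs” holds.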

As Phillip above mentions, you can analyse the regularisation effects of early stopping algorithms mathematically, prove theorems etc. So perhaps not as bad as sometimes thought (I also used to look down on this approach but not anymore).

Re: memorization, that is the common wisdom, but the paper I linked earlier (“Understanding deep learning requires rethinking generalization”) made me re-evaluate it. Common network architectures are fully capable of completely memorizing their training sets *and yet* generalize to new inputs anyway. As a community, we’re not sure why.