The term overfitting usually describes fitting too complex a model to available data. But it is possible to overfit a model before there are any data.
An experimental design, such as a clinical trial, proposes some model to describe the data that will be collected. For simple, well-known models the behavior of the design may be known analytically. For more complex or novel methods, the behavior is evaluated via simulation.
If an experimental design makes strong assumptions about data, and is then simulated with scenarios that follow those assumptions, the design should work well. So designs must be evaluated using scenarios that do not exactly follow the model assumptions. Here lies a dilemma: how far should scenarios deviate from model assumptions? If they do not deviate at all, you don’t have a fair evaluation. But deviating too far is unreasonable as well: no method can be expected to work well when its assumptions are flagrantly violated.
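To make the dilemma concrete, here is a minimal sketch, not any particular trial design. It assumes a toy "design" that tests whether a mean is zero using a z-test that takes the standard deviation to be exactly 1. The function name `false_positive_rate`, the sample size, and the three scenarios are all made up for illustration. The simulation estimates how often the test wrongly rejects a true null as the simulated data drift away from the assumed standard deviation.

```python
import math
import random

def false_positive_rate(true_sd, n=30, n_sims=2000, seed=0):
    """Estimate how often a z-test that ASSUMES sd = 1 rejects a true null.

    Hypothetical toy design: data are drawn with mean 0 (so every rejection
    is a false positive) but with standard deviation `true_sd`.
    """
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided 5% critical value under the assumed model
    rejections = 0
    for _ in range(n_sims):
        sample = [rng.gauss(0.0, true_sd) for _ in range(n)]
        mean = sum(sample) / n
        z = mean / (1.0 / math.sqrt(n))  # the design plugs in sd = 1
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sims

# Illustrative scenarios only: how far to deviate is exactly the open question.
scenarios = {
    "matches assumptions": 1.0,
    "mild deviation": 1.5,
    "flagrant violation": 3.0,
}
rates = {name: false_positive_rate(sd) for name, sd in scenarios.items()}
for name, rate in rates.items():
    print(f"{name}: estimated false positive rate = {rate:.3f}")
```

The first scenario flatters the design because it simulates exactly what the design assumes. The other two show the error rate climbing, but nothing in the code says which of them is a "reasonable" deviation. That choice is made by whoever writes the scenario list.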
With complex designs, it may not be clear to what extent scenarios deviate from modeling assumptions. The method may be robust to some kinds of deviations but not to others. Simulation scenarios for complex designs are samples from a high-dimensional space, and it is impossible to adequately explore a high-dimensional space with a small number of points. Even if these scenarios were chosen at random (which would be an improvement over manually selecting scenarios that present a method in the best light), how do you specify a probability distribution on the scenarios? You’re back to a variation on the previous problem.
Once you have the data in hand, you can try a complex model and see how well it fits. But with experimental design, the model is determined before there are any data, and thus there is no possibility of rejecting the model for being a poor fit. You might decide after it’s too late, after the data have been collected, that the model was a poor fit. However, retrospective model criticism is complicated for adaptive experimental designs because the model influenced which data were collected.
This is especially a problem for one-of-a-kind experimental designs. When evaluating experimental designs (not the data in the experiment but the experimental design itself), each experiment is one data point. With only one data point, it’s hard to criticize a design. This means we must rely on simulation, where it is possible to obtain many data points. However, this brings us back to the arbitrary choice of simulation scenarios. In this case there are no empirical data to test the model assumptions.
My younger brother did his math MS testing the robustness of a watershed fishery model (fish in a stream). His approach was to first determine which parts of the model were most sensitive to random input data perturbations, then repeat the analysis with various combinations of input perturbations.
The sensitivities he found were then individually tested against incoming data, to see which perturbations existed in nature, and their effect(s). This approach permitted some very subtle interactions to be detected and quantified, then correlated with the model.
When perturbation analysis revealed sensitivities, my brother was free to tailor synthetic input data sets with the goal of intentionally breaking the model. If he succeeded, the model-breaking data were then compared to the available input data to determine the likelihood of such events in nature.
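A rough sketch of that kind of perturbation analysis, under loud assumptions: `toy_model`, its three inputs, the baseline values, and the 10% perturbation size are all invented here for illustration and have nothing to do with the actual fishery model. The code perturbs each input on its own, then perturbs inputs in pairs, and reports the relative change in the output.

```python
import itertools

def toy_model(flow, temperature, nutrients):
    """Hypothetical stand-in for a real model; NOT the fishery model."""
    return flow * nutrients / (1.0 + (temperature - 12.0) ** 2)

# Made-up baseline inputs for the toy model.
baseline = {"flow": 10.0, "temperature": 12.0, "nutrients": 2.0}

def relative_change(inputs):
    """Fractional change in model output relative to the baseline output."""
    base = toy_model(**baseline)
    return (toy_model(**inputs) - base) / base

def perturb(names, scale=0.10):
    """Return a copy of the baseline with the named inputs scaled up by `scale`."""
    inputs = dict(baseline)
    for name in names:
        inputs[name] *= 1.0 + scale
    return inputs

# Step 1: one-at-a-time perturbations to find the most sensitive inputs.
single = {name: relative_change(perturb([name])) for name in baseline}

# Step 2: repeat with combinations of inputs to look for interactions.
pairs = {
    combo: relative_change(perturb(combo))
    for combo in itertools.combinations(baseline, 2)
}

most_sensitive = max(single, key=lambda name: abs(single[name]))
print("single-input sensitivities:", single)
print("pairwise sensitivities:", pairs)
print("most sensitive input:", most_sensitive)
```

Pairs are only the first step toward testing inputs as a complete set. The number of combinations grows quickly with the number of inputs, so a real analysis has to choose which combinations to run.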
Not only did the model survive the analysis, it also became “the most thoroughly tested model of its kind” (whatever that means).
One key aspect was that the model was based on years of observational data (and years of computer time), with more data arriving about 4 times each year, greatly reducing the risks associated with partitioning existing data for use in test sets.
About 3 years after my brother’s graduation, the stream being modeled experienced an unprecedented drought. The perturbation testing had indeed covered each of the affected input parameters to this degree individually, but not all of them together as a complete set.
Based on the robustness testing, the model was trusted enough to be used as a guide to predict the stream’s recovery, and to evaluate what actions, if any, would be most helpful, and, more importantly, which should be avoided.
The model did predict very well the path of the recovery in the relatively small area it covered. My take-away from this is that not only was the initial model of unusually great quality, but it was the testing my brother did that built trust in it. The model itself was groundbreaking (the most complex and comprehensive micro-ecological model of its time), which was what directly motivated the innovative testing my brother developed and performed.
What’s “fair to the model” may not be the best perspective: “identify, quantify and test the model’s weaknesses” may be more useful.
Your description of your brother’s model had several phrases that warm my heart: “robustness,” “years of observational data,” “thoroughly tested.”
Those things are missing from the kind of experimental design that I’m criticizing.
If I understand rightly, this seems analogous to the problem of perception and learning in general.
Making sense of what you see requires not only models to fit to the data, but also eyes and attention focused on the right part of the scene to gather the data needed to confirm or refute those models. Different models require different data, which guides our attention and gives us a feedback loop between collecting data and forming models.
In the brain, this loop is rapid and continuous. When conducting research studies, less so.