The two-sample t-test is a way to test whether two data sets come from distributions with the same mean. I wrote a few days ago about how the test performs under ideal circumstances, as well as less than ideal circumstances.
This is an analogous post for testing whether two data sets come from distributions with the same variance. Statistics texts books often present the F-test for this task, then warn in a footnote that the test is highly dependent on the assumption that both data sets come from normal distributions.
Sensitivity and robustness
Statistics texts give too little attention to robustness in my opinion. Modeling assumptions never hold exactly, so it’s important to know how procedures perform when assumptions don’t hold exactly. Since the F-test is one of the rare instances where textbooks warn about a lack of robustness, I expected the F-test to perform terribly under simulation, relative to its recommended alternatives Bartlett’s test and Levene’s test. That’s not exactly what I found.
For my simulations I selected 35 samples from each of two distributions. I selected significance levels for the F-test, Bartlett’s test, and Levene’s test so that each would have roughly a 5% error rate under a null scenario, both sets of data coming from the same distribution, and a 20% error rate under an alternative scenario.
I chose my initial null and alternative scenarios to use normal (Gaussian) distributions, i.e. to satisfy the assumptions of the F-test. Then I used the same designs for data coming from a heavy-tailed distribution to see how well each of the tests performed.
For the normal null scenario, both data sets were drawn from a normal distribution with mean 0 and standard deviation 15. For the normal alternative scenario I used normal distributions with standard deviations 15 and 25.
Normal distribution calibration
Here are the results from the normal distribution simulations.
| Test | Alpha | Type I | Type II |
| F | 0.13 | 0.0390 | 0.1863 |
| Bartlett | 0.04 | 0.0396 | 0.1906 |
| Levene | 0.06 | 0.0439 | 0.2607 |
Here the Type I column is the proportion of times the test incorrectly concluded that identical distributions had unequal variances. The Type II column reports the proportion of times the test failed to conclude that distributions with different variances indeed had unequal variances. Results were based on simulating 10,000 experiments.
The three tests had roughly equal operating characteristics. The only difference that stands out above simulation noise is that the Levene test had larger Type II error than the other tests when calibrated to have the same Type I error.
To calibrate the operating characteristics, I used alpha levels 0.15, 0.04, and 0.05 respectively for the F, Bartlett, and Levene tests.
Heavy-tail simulation results
Next I used the design parameters above, i.e. the alpha levels for each test, but drew data from distributions with a heavier tail. For the null scenario, both data sets were drawn from a Student t distribution with 4 degrees of freedom and scale 15. For the alternative scenario, the scale of one of the distributions was increased to 25. Here are the results, again based on 10,000 simulations.
| Test | Alpha | Type I | Type II |
| F | 0.13 | 0.2417 | 0.2852 |
| Bartlett | 0.04 | 0.2165 | 0.2859 |
| Levene | 0.06 | 0.0448 | 0.4537 |
The operating characteristics degraded when drawing samples from a heavy-tailed distribution, t with 4 degrees of freedom, but they didn’t degrade uniformly.
Compared to the F-test, the Bartlett test had slightly better Type I error and the same Type II error.
The Levene test, had a much lower Type I error than the other tests, hardly higher than it was when drawing from a normal distribution, but had a higher Type II error.
Conclusion: The F-test is indeed sensitive to departures from the Gaussian assumption, but Bartlett’s test doesn’t seem much better in these particular scenarios. Levene’s test, however, does perform better than the F-test, depending on the relative importance you place on Type I and Type II error.