The two-sample *t*-test is a way to test whether two data sets come from distributions with the same mean. I wrote a few days ago about how the test performs under ideal circumstances, as well as less than ideal circumstances.

This is an analogous post for testing whether two data sets come from distributions with the same variance. Statistics textbooks often present the *F*-test for this task, then warn in a footnote that the test is highly dependent on the assumption that both data sets come from normal distributions.

## Sensitivity and robustness

Statistics texts give too little attention to robustness in my opinion. Modeling assumptions never hold exactly, so it’s important to know how procedures perform when assumptions don’t hold exactly. Since the *F*-test is one of the rare instances where textbooks warn about a lack of robustness, I expected the *F*-test to perform terribly under simulation, relative to its recommended alternatives **Bartlett’s test** and **Levene’s test**. That’s not exactly what I found.

## Simulation design

For my simulations I drew 35 observations from each of two distributions. I selected significance levels for the *F*-test, Bartlett’s test, and Levene’s test so that each would have roughly a 5% error rate under a null scenario, in which both data sets come from the same distribution, and a 20% error rate under an alternative scenario.

I chose my initial null and alternative scenarios to use normal (Gaussian) distributions, i.e. to satisfy the assumptions of the *F*-test. Then I used the same designs for data coming from a fat-tailed distribution to see how well each of the tests performed.

For the normal null scenario, both data sets were drawn from a normal distribution with mean 0 and standard deviation 15. For the normal alternative scenario I used normal distributions with standard deviations 15 and 25.
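A minimal sketch of one simulated experiment under each normal scenario might look like the following. The post does not say how the *F*-test p-value was computed, so the two-sided variance-ratio form in `f_test_pvalue` is an assumption, as are the helper names, the seed, and the mean of 0 in the alternative scenario. The Bartlett and Levene calls use SciPy's `scipy.stats.bartlett` and `scipy.stats.levene` with default options.

```python
import numpy as np
from scipy import stats


def f_test_pvalue(x, y):
    """Two-sided p-value for the variance-ratio F-test (assumed form)."""
    f = np.var(x, ddof=1) / np.var(y, ddof=1)
    cdf = stats.f.cdf(f, len(x) - 1, len(y) - 1)
    return 2 * min(cdf, 1 - cdf)


def three_pvalues(x, y):
    """P-values of the three equal-variance tests for one pair of samples."""
    return {
        "F": f_test_pvalue(x, y),
        "Bartlett": stats.bartlett(x, y).pvalue,
        "Levene": stats.levene(x, y).pvalue,  # center="median" is the default
    }


rng = np.random.default_rng(12345)  # arbitrary seed, for repeatability only
n = 35

# Null scenario: both samples from N(0, 15^2).
x_null, y_null = rng.normal(0, 15, n), rng.normal(0, 15, n)
# Alternative scenario: standard deviations 15 and 25 (mean 0 assumed).
x_alt, y_alt = rng.normal(0, 15, n), rng.normal(0, 25, n)

p_null = three_pvalues(x_null, y_null)
p_alt = three_pvalues(x_alt, y_alt)
print("null:       ", p_null)
print("alternative:", p_alt)
```

A single pair of samples says little on its own; the error rates come from repeating this many times and counting how often each p-value falls below its threshold.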

## Normal distribution calibration

Here are the results from the normal distribution simulations.

| Test     | Alpha | Type I | Type II |
|----------|-------|--------|---------|
| F        | 0.13  | 0.0390 | 0.1863  |
| Bartlett | 0.04  | 0.0396 | 0.1906  |
| Levene   | 0.06  | 0.0439 | 0.2607  |

Here the Type I column is the proportion of times the test incorrectly concluded that identical distributions had unequal variances. The Type II column reports the proportion of times the test failed to conclude that distributions with different variances indeed had unequal variances. Results were based on simulating 10,000 experiments.

The three tests had roughly equal operating characteristics. The only difference that stands out above simulation noise is that the Levene test had larger Type II error than the other tests when calibrated to have the same Type I error.

To calibrate the operating characteristics, I used alpha levels 0.13, 0.04, and 0.06 respectively for the *F*, Bartlett, and Levene tests.
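The error rates for a given set of thresholds can be estimated along these lines. This is a sketch under assumptions rather than the code behind the table: the two-sided form of the *F*-test, the function names, the seed, and the smaller number of simulations are all choices made here, and the alpha values are the ones shown in the Alpha column of the table. Because the post does not specify how the *F*-test p-value was computed, the *F* row in particular should not be expected to reproduce the reported figures.

```python
import numpy as np
from scipy import stats


def f_test_pvalue(x, y):
    """Two-sided variance-ratio F-test p-value (an assumed form of the test)."""
    f = np.var(x, ddof=1) / np.var(y, ddof=1)
    cdf = stats.f.cdf(f, len(x) - 1, len(y) - 1)
    return 2 * min(cdf, 1 - cdf)


TESTS = {
    "F": f_test_pvalue,
    "Bartlett": lambda x, y: stats.bartlett(x, y).pvalue,
    "Levene": lambda x, y: stats.levene(x, y).pvalue,  # median-centered by default
}


def error_rates(alphas, sd_null=15.0, sd_alt=25.0, n=35, n_sim=1000, seed=0):
    """Estimate (Type I, Type II) error rates for each test by simulation."""
    rng = np.random.default_rng(seed)
    false_pos = dict.fromkeys(TESTS, 0)
    false_neg = dict.fromkeys(TESTS, 0)
    for _ in range(n_sim):
        # The same x is paired with a null and an alternative y; each
        # scenario's error rate is still estimated from n_sim experiments.
        x = rng.normal(0.0, sd_null, n)
        y_null = rng.normal(0.0, sd_null, n)
        y_alt = rng.normal(0.0, sd_alt, n)
        for name, pvalue in TESTS.items():
            if pvalue(x, y_null) < alphas[name]:   # rejected a true null
                false_pos[name] += 1
            if pvalue(x, y_alt) >= alphas[name]:   # missed a real difference
                false_neg[name] += 1
    return {name: (false_pos[name] / n_sim, false_neg[name] / n_sim)
            for name in TESTS}


alphas = {"F": 0.13, "Bartlett": 0.04, "Levene": 0.06}
rates = error_rates(alphas)
for name, (type1, type2) in rates.items():
    print(f"{name:9s} alpha={alphas[name]:.2f}  Type I={type1:.4f}  Type II={type2:.4f}")
```

Calibrating then amounts to trying different values in `alphas` until the two error rates land near the targets.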

## Fat-tail simulation results

Next I used the design parameters above, i.e. the alpha levels for each test, but drew data from distributions with a heavier tail. For the null scenario, both data sets were drawn from a Student *t* distribution with 4 degrees of freedom and scale 15. For the alternative scenario, the scale of one of the distributions was increased to 25. Here are the results, again based on 10,000 simulations.
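One way to draw such data, assuming "scale 15" means a standard Student *t* variate multiplied by 15 with location 0, is sketched below. The helper name and seed are hypothetical. Note that the scale is not the standard deviation: a *t* distribution with 4 degrees of freedom has variance 4/(4 − 2) = 2, so the standard deviation is the scale times √2.

```python
import numpy as np
from scipy import stats

DF = 4  # degrees of freedom used in the fat-tailed scenarios


def draw_scaled_t(rng, scale, size, df=DF):
    """Draw from a Student t distribution stretched by `scale` (location 0)."""
    return scale * rng.standard_t(df, size=size)


rng = np.random.default_rng(0)  # arbitrary seed
x = draw_scaled_t(rng, 15, 35)
y_null = draw_scaled_t(rng, 15, 35)   # same scale: null scenario
y_alt = draw_scaled_t(rng, 25, 35)    # larger scale: alternative scenario

# Theoretical standard deviations implied by each scale.
sd_15 = stats.t.std(DF, scale=15)
sd_25 = stats.t.std(DF, scale=25)
print(f"sd at scale 15: {sd_15:.2f}, sd at scale 25: {sd_25:.2f}")

print("Levene p-value, null:       ", stats.levene(x, y_null).pvalue)
print("Levene p-value, alternative:", stats.levene(x, y_alt).pvalue)
```

With only 4 degrees of freedom the fourth moment of the distribution does not exist, which is part of what makes this a demanding scenario for variance tests.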

| Test     | Alpha | Type I | Type II |
|----------|-------|--------|---------|
| F        | 0.13  | 0.2417 | 0.2852  |
| Bartlett | 0.04  | 0.2165 | 0.2859  |
| Levene   | 0.06  | 0.0448 | 0.4537  |

The operating characteristics degraded when drawing samples from a fat-tailed distribution, *t* with 4 degrees of freedom, but they didn’t degrade uniformly.

Compared to the *F*-test, the Bartlett test had slightly lower Type I error and essentially the same Type II error.

The Levene test had a much lower Type I error than the other tests, hardly higher than it was when drawing from a normal distribution, but had a higher Type II error.

**Conclusion**: The *F*-test is indeed sensitive to departures from the Gaussian assumption, but Bartlett’s test doesn’t seem much better in these particular scenarios. Levene’s test, however, does perform better than the *F*-test, depending on the relative importance you place on Type I and Type II error.

John,

Thanks for the clear post. Presumably this is the vanilla Levene test with the mean rather than median or trimmed mean (Brown & Forsythe)?

James

It is using the median. I’m using the implementation in SciPy with default options.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.levene.html

How come you had to tune the alpha level under the normal condition? Shouldn’t the Type I error = alpha in that case?

I didn’t start with a significance level and solve for a sample size. Instead I arbitrarily fixed a sample size and found a p-value threshold that gave me approximately the error rates I was after.

It’s well known that the Bartlett test is sensitive to departures from normality. A good reference is Non-Normality and Tests on Variance by Box (https://www.jstor.org/stable/pdf/2333350.pdf). Following Box, I’ve found that the Bartlett test can be ‘fixed’ in some cases by adjusting for the excess kurtosis of the distribution. Under certain conditions, the Bartlett test statistic is distributed as (1+k/2)X, where X has a chi-square distribution with the appropriate degrees of freedom and k is the excess kurtosis of the distribution. (Excess kurtosis is zero for the normal distribution.) One can then ‘fix’ the Bartlett test by dividing the test statistic by (1+k/2).
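The adjustment described in this comment could be sketched as follows. The comment does not say how k should be obtained, so estimating it from the pooled, per-group standardized residuals is an assumption made here, and `bartlett_kurtosis_adjusted` is a hypothetical helper, not a SciPy function. The degrees of freedom are taken as the number of groups minus one, as in the standard Bartlett test.

```python
import numpy as np
from scipy import stats


def bartlett_kurtosis_adjusted(*samples):
    """Bartlett statistic divided by (1 + k/2), with k estimated from the data."""
    samples = [np.asarray(s, dtype=float) for s in samples]
    stat = stats.bartlett(*samples).statistic

    # Assumed estimator of k: center and scale each group separately, pool the
    # residuals, and take their sample excess kurtosis (0 for a normal).
    resid = np.concatenate([(s - s.mean()) / s.std(ddof=1) for s in samples])
    k = stats.kurtosis(resid, fisher=True, bias=True)

    factor = 1 + k / 2
    if factor <= 0:
        raise ValueError("estimated kurtosis too low for the adjustment")

    adjusted = stat / factor
    pvalue = stats.chi2.sf(adjusted, df=len(samples) - 1)
    return {"statistic": stat, "excess_kurtosis": k,
            "adjusted_statistic": adjusted, "pvalue": pvalue}


rng = np.random.default_rng(1)  # arbitrary seed
a = 15 * rng.standard_t(4, size=35)
b = 15 * rng.standard_t(4, size=35)
result = bartlett_kurtosis_adjusted(a, b)
print(result)
```

For heavy-tailed data the estimated k tends to be positive, so the adjusted statistic is smaller and the test rejects less often. With small samples, or with distributions such as *t* with 4 degrees of freedom whose fourth moment does not exist, the kurtosis estimate is noisy, so the adjustment should be read as approximate.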