I was reading Rupert Miller’s book _Beyond ANOVA_ when I ran across this line:

> I never use the Kolmogorov-Smirnov test (or one of its cousins) or the χ² test as a preliminary test of normality. … I have a feeling they are more likely to detect irregularities in the middle of the distribution than in the tails.

Miller wrote these words in 1986, when it would have been difficult to test his hunch. Now it’s easy, so I wrote a little simulation to see whether his feeling was justified. I’m sure this has been done before, but since it’s now so easy (it would not have been in 1986), I wanted to do it myself.

I’ll compare the **Kolmogorov-Smirnov test**, a popular goodness-of-fit test, with the **Shapiro-Wilk test** that Miller preferred. I’ll run each test 10,000 times on non-normal data and count how often each test produces a *p*-value less than 0.05.

To produce departures from normality in the tails, I’ll look at samples from a **Student t distribution**. This distribution has one parameter, the number of degrees of freedom. The fewer degrees of freedom, the thicker the tails and so the further from normality in the tails.
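As a quick illustration of how the tails thicken (my aside, not part of Miller’s argument), we can compare how much probability mass lies beyond ±3 for the normal distribution and for *t* distributions with decreasing degrees of freedom:

```python
from scipy.stats import norm, t

# Probability mass beyond +/-3 for the normal distribution and for
# Student t distributions with decreasing degrees of freedom.
print("normal", 2 * norm.sf(3))
for df in [100, 10, 5, 2]:
    print(df, 2 * t.sf(3, df))
```

The normal puts roughly 0.27% of its mass beyond ±3, while the *t* with 2 degrees of freedom puts more than thirty times as much there.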

Then I’ll look at a **mixture** of a normal and uniform distribution. This will have thin tails like a normal distribution, but will be flatter in the middle.

If Miller was right, we should expect the Shapiro-Wilk test to be more sensitive for fat-tailed *t* distributions, and the K-S test to be more sensitive for mixtures.

First we import some library functions we’ll need and define our two random sample generators.

```python
from numpy import where
from scipy.stats import norm, t, uniform, kstest, shapiro

def mixture(p, size=100):
    # With probability p, draw from uniform(0, 1); otherwise standard normal.
    u = uniform.rvs(size=size)
    v = uniform.rvs(size=size)
    n = norm.rvs(size=size)
    x = where(u < p, v, n)
    return x

def fat_tail(df, size=100):
    # Student t samples; fewer degrees of freedom means fatter tails.
    return t.rvs(df, size=size)
```

Next is the heart of the code. It takes a sample generator and compares the two tests, Kolmogorov-Smirnov and Shapiro-Wilk, on 10,000 samples of 100 points each. It returns the proportion of the time each test detected the anomaly at the 0.05 level.

```python
def test(generator, parameter):
    # Run both tests on 10,000 samples of 100 points each and count
    # how often each rejects normality at the 0.05 level.
    ks_count = 0
    sw_count = 0
    N = 10_000
    for _ in range(N):
        x = generator(parameter, 100)
        stat, p = kstest(x, "norm")
        if p < 0.05:
            ks_count += 1
        stat, p = shapiro(x)
        if p < 0.05:
            sw_count += 1
    return (ks_count/N, sw_count/N)
```

Finally, we call the test runner with a variety of distributions.

```python
for df in [100, 10, 5, 2]:
    print(test(fat_tail, df))

for p in [0.05, 0.10, 0.15, 0.2]:
    print(test(mixture, p))
```

Note that the *t* distribution with 100 degrees of freedom is essentially normal, at least as far as a sample of 100 points can tell, and so we should expect both tests to report a lack of fit around 5% of the time since we’re using 0.05 as our cutoff.
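As a sanity check on that 5% figure (my aside), the simulation noise is small. The number of rejections is binomial, so with 10,000 replications the standard error of an observed rejection proportion is:

```python
from math import sqrt

# Standard error of an observed rejection proportion when the true
# rejection rate is 0.05 and we run N = 10,000 simulations.
N, p = 10_000, 0.05
se = sqrt(p * (1 - p) / N)
print(se)
```

That comes to about 0.002, so observed rates within a few multiples of 0.002 of 0.05 are consistent with the nominal level.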

Here’s what we get for the fat-tailed samples.

```
(0.0483, 0.0554)
(0.0565, 0.2277)
(0.1207, 0.8799)
(0.8718, 1.0000)
```

So with 100 degrees of freedom, we do indeed reject the null hypothesis of normality about 5% of the time. As the degrees of freedom decrease, and the fatness of the tails increases, both tests reject the null hypothesis of normality more often. However, in each case the Shapiro-Wilk test picks up on the non-normality more often than the K-S test, about four times as often with 10 degrees of freedom and about seven times as often with 5 degrees of freedom. So Miller was right about the tails.

Now for the middle. Here’s what we get for mixture distributions.

```
(0.0731, 0.0677)
(0.1258, 0.1051)
(0.2471, 0.1876)
(0.4067, 0.3041)
```

We would expect both goodness-of-fit tests to reject more often as the mixture probability goes up, i.e. as we sample from the uniform distribution more often, and that is what we see. But the K-S test outperforms the S-W test each time: both tests have rejection rates that increase with the mixture probability, but the rates increase faster for the K-S test. Miller wins again.

I thought that this was well known and the solution was to use Kuiper’s variant of the KS test. At least this seems to be what _Numerical Recipes_ (3rd edition, section 14.3.4) is saying.

“Power comparisons of the Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests” by Razali and Wah agrees with you ;-)

Thanks, Amos. I’m not familiar with Kuiper’s variant. Miller thought that variations on KS had the same problem, but I don’t know whether he knew of Kuiper’s version.

I was going to repeat my simulations with the test you recommend, but apparently there’s no implementation of it in SciPy, and I don’t want to put more time into this by searching for implementations or writing one myself.
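For anyone who does want to experiment, the Kuiper statistic itself is short to compute. Here is a sketch of my own (the statistic only; the asymptotic *p*-value is the part that takes real work, so this is not a complete test):

```python
import numpy as np
from scipy.stats import norm

def kuiper_statistic(x, cdf=norm.cdf):
    # Kuiper's V = D+ + D-, the sum of the largest deviations of the
    # empirical CDF above and below the hypothesized CDF. The K-S
    # statistic is max(D+, D-); per Numerical Recipes, V is equally
    # sensitive at all parts of the distribution, including the tails.
    x = np.sort(np.asarray(x))
    n = len(x)
    F = cdf(x)
    d_plus = np.max(np.arange(1, n + 1) / n - F)
    d_minus = np.max(F - np.arange(0, n) / n)
    return d_plus + d_minus
```

To turn this into a test you would still need the null distribution of V, which is where an off-the-shelf implementation would earn its keep.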