A **two-sample t-test** is intended to determine whether there’s evidence that two samples have come from distributions with different means. The test assumes that both samples come from normal distributions.

## Robust to non-normality, not to asymmetry

It is fairly well known that the *t*-test is robust to departures from a normal distribution, as long as the actual distribution is symmetric. That is, the test works more or less as advertised as long as the distribution is symmetric like a normal distribution, but it may not work as expected if the distribution is asymmetric.

This post will explore the robustness of the *t*-test via simulation. How far can you be from a normal distribution and still do OK? Can you have *any* distribution as long as it’s symmetric? Does a little asymmetry ruin everything? If something does go wrong, *how* does it go wrong?

## Experiment design

For the purposes of this post, we’ll compare the null hypothesis that two groups both have mean 100 to the alternative hypothesis that one group has mean 100 and the other has mean 110. We’ll assume both distributions have a standard deviation of 15. There is a variation on the two-sample *t*-test that does not assume equal variance, but for simplicity we will keep the variance the same in both groups.

We’ll do a typical design, 0.05 significance and 0.80 power. That is, under the null hypothesis, that is if the two groups do come from the same distribution, we expect the test to wrongly conclude that the two distributions are different 5% of the time. And under the alternative hypothesis, when the groups do come from distributions with means 100 and 110, we expect the test to conclude that the two means are indeed different 80% of the time.

Under these assumptions, you’d need a sample of 36 in each group.

## Python simulation code

Here’s the code we’ll use for our simulation.

from scipy.stats import norm, t, gamma, uniform, ttest_ind num_per_group = 36 def simulate_trials(num_trials, group1, group2): num_reject = 0 for _ in range(num_trials): a = group1.rvs(num_per_group) b = group2.rvs(num_per_group) stat, pvalue = ttest_ind(a, b) if pvalue < 0.05: num_reject += 1 return(num_reject)

## Normal simulation

Let’s see how the two-sample *t*-test works under ideal conditions by simulating from the normal distributions that the method assumes.

First we simulate from the null, i.e. we draw the data for both groups from the same distribution

n1 = norm(100, 15) n2 = norm(100, 15) print( simulate_trials(1000, n1, n2) )

Out of 1000 experiment simulations, the method rejected the null 54 times, very close to the expected number of 50 based on 0.05 significance.

Next we simulate under the alternative, i.e. we draw data from different distributions.

n1 = norm(100, 15) n2 = norm(110, 15) print( simulate_trials(1000, n1, n2) )

This time the method rejected the null 804 times in 1000 simulations, very close to the expected 80% based on the power.

## Gamma simulation

Next we draw data from a gamma distribution. First we draw both groups from a gamma distribution with mean 100 and standard deviation 15. This requires a shape parameter of 44.44 and a scale of 2.25.

g1 = gamma(a = 44.44, scale=2.25) g2 = gamma(a = 44.44, scale=2.25) print( simulate_trials(1000, g1, g2) )

Under the null simulation, we rejected the null 64 times out of 1000 simulations, higher than expected, but not that remarkable.

Under the alternative simulation, we pick the second gamma distribution to have mean 110 and standard deviation 15, which requires shape 53.77 and scale 2.025.

g1 = gamma(a = 44.44, scale=2.25) g2 = gamma(a = 53.77, scale=2.045) print( simulate_trials(1000, g1, g2) )

We rejected the null 805 times in 1000 simulations, in line with what you’d expect from the power.

When the gamma distribution has such large shape parameters, the distributions are approximately normal, though slightly asymmetric. To increase the asymmetry, we’ll use a couple gamma distributions with smaller mean but shifted over to the right. Under the null, we create two gamma distributions with mean 10 and standard deviation 15, then shift them to the right by 90.

g1 = gamma(a = 6.67, scale=1.5, loc=90) g2 = gamma(a = 6.67, scale=1.5, loc=90) print( simulate_trials(1000, g1, g2) )

Under this simulation we reject the null 56 times out of 1000 simulations, in line with what you’d expect.

For the alternative simulation, we pick the second distribution to have a mean of 20 and standard deviation 15, and then shift it to the right by 90 so tht it has mean 110. This distribution has quite a long tail.

g1 = gamma(a = 6.67, scale=1.5, loc=90) g2 = gamma(a = 1.28, scale=11.25, loc=90) print( simulate_trials(1000, g1, g2) )

This time we rejected the null 499 times in 1000 simulations. This is a serious departure from what’s expected. Under the alternative hypothesis, we reject the null 50% of the time rather than the 80% we’d expect. If higher mean is better, this means that half the time you’d fail to conclude that the better group really is better.

## Uniform distribution simulation

Next we use uniform random samples from the interval [74, 126]. This is a symmetric distribution with mean 100 and standard deviation 15. When we draw both groups from this distribution we rejected the null 48 times, in line with what we’d expect.

u1 = uniform(loc=74, scale=52) u2 = uniform(loc=74, scale=52) print( simulate_trials(1000, u1, u2) )

If we move the second distribution over by 10, drawing from [84, 136], we rejected the null 790 times out of 1000, again in line with what you’d get from a normal distribution.

u1 = uniform(loc=74, scale=52) u2 = uniform(loc=84, scale=52) print( simulate_trials(1000, u1, u2) )

In this case, we’ve made a big departure from normality and the test still worked as expected. But that’s not always the case, as in the *t*(3) distribution below.

## Student-*t* simulation

Finally we simulate from a Student-*t* distribution. This is a symmetric distribution, but heavier in the tails than the normal distribution.

First, we simulate from a *t* distribution with 6 degrees of freedom and scale 12.25, making the standard deviation 15. We shift the location of the distribution by 100 to make the mean 100.

t1 = t(df=6, loc=100, scale=12.25) t2 = t(df=6, loc=100, scale=12.25) print( simulate_trials(1000, t1, t2) )

When both groups come from this distribution, we rejected the null 46 times. When we shifted the second distribution to have mean 110, we rejected the null 777 times out of 1000.

t1 = t(df=6, loc=100, scale=12.25) t2 = t(df=6, loc=110, scale=12.25) print( simulate_trials(1000, t1, t2) )

In both cases, the results are in line what we’d expect given a 5% significance level and 80% power.

A *t* distribution with 6 degrees of freedom has a heavy tail compared to a normal. Let’s try making the tail even heavier by using 3 degrees of freedom. We make the scale 15 to keep the standard deviation at 15.

t1 = t(df=3, loc=100, scale=15) t2 = t(df=3, loc=100, scale=15) print( simulate_trials(1000, t1, t2) )

When we draw both samples from the same distribution, we rejected the null 37 times out of 1000, less than the 50 times we’d expect.

t1 = t(df=3, loc=100, scale=15) t2 = t(df=3, loc=110, scale=15) print( simulate_trials(1000, t1, t2) )

When we shift the second distribution over to have mean 110, we rejected the null 463 times out of 1000, far less than the 80% rejection you’d expect if the data were normally distributed.

Just looking at a plot of the PDFs, a *t*(3) looks much closer to a normal distribution than a uniform distribution does. But the tail behavior is different. The tails of the uniform are as thin as you can get—they’re zero!—while the *t*(3) has heavy tails.

These two examples show that you can replace the normal distribution with a moderately heavy tailed symmetric distribution, but don’t overdo it. When the data some from a heavy tailed distribution, even one that is symmetric, the two-sample *t*-test may not have the operating characteristics you’d expect.

Thank you, enlightening and useful article.

Interesting analysis. In the last example, with the heavy tails, one might argue that the t-test does exactly what it should: the heavy tailed example actually has less power than the Gaussian case, and the test confirms it.

“The test assumes that both samples come from normal distributions” : I believe you only need to assume that the *means* of the distributions are normally distributed. This makes quite a difference, because with the central limit theorem you get this normality approximately for asymptotic n, assuming the first two moments exist.

I think this should be said, as some people might be afraid they can not use the t-test because their data is not at all normally distributed ; but if they have enough data it is not an unreasonable thing to do.

Thanks for the nice post! I think there is an error, however, as far as the uniform distribution is concerned. The mean of uniform random variables converges to a normal distribution very quickly (even for n as small as 20, you get something that closely looks like a normal distribution). I’m not very fluent in Python, but with a simulation done in R, I get 79% power, and not 36% power as you do. Could you check whether there is an error in your code? (sorry if there’s one in mine, or if I have misunderstood what your simulation aimed to do)

Here’s my code:

nsim <- 1e4

result <- vector(mode = "double", length = nsim)

for(i in 1:nsim){

low <- runif(n=36, min = 74, max = 126)

high <- runif(n=36, min = 84, max = 136)

result[i] qt(df=70, p=0.975))/length(result)

@Arthur: It’s true that eventually the t-test will tell two distributions apart if the CLT applies. But if you do a sample size calculation assuming normality, but your distribution has a heavier tail, your study will be under-powered.

@Aurélien: Thanks for your comment. You were right. I’d misunderstood how SciPy parameterizes the mean. I had assumed the first two parameters were the end points of the interval, but that’s not how it does things. I’ve corrected the code and the text above, and my results are now similar to yours.

thanks John. Question: how’d you get to the sample size?

I used a program called STPLAN.