The US Supreme Court’s criticism of significance testing has been in the news lately. Here’s a criticism of significance testing involving the US Congress. Consider the following syllogism.
- If a person is an American, he is not a member of Congress.
- This person is a member of Congress.
- Therefore he is not American.
The initial premise is false, but the reasoning is correct if we assume the initial premise is true.
The premise that Americans are never members of Congress is clearly false. But it’s almost true! The probability of an American being a member of Congress is quite small, about 535/309,000,000. So what happens if we try to salvage the syllogism above by inserting “probably” in the initial premise and conclusion?
- If a person is an American, he is probably not a member of Congress.
- This person is a member of Congress.
- Therefore he is probably not American.
What went wrong? The probability is backward. We want to know the probability that someone is American given he is a member of Congress, not the probability he is a member of Congress given he is American.
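The reversal can be made concrete with Bayes' rule. The sketch below uses the post's figures (535 members, about 309 million Americans) plus two assumptions that are not in the post: a world population of roughly 6.9 billion, and a probability of zero that a non-American sits in Congress. The function name is hypothetical, chosen only for illustration.

```python
# Figures from the post: 535 members of Congress, about 309 million Americans.
MEMBERS_OF_CONGRESS = 535
US_POPULATION = 309_000_000
# Assumption (not from the post): a world population of roughly 6.9 billion.
WORLD_POPULATION = 6_900_000_000


def posterior_american_given_congress(p_congress_given_not_american=0.0):
    """P(American | member of Congress) by Bayes' rule.

    The default of 0.0 for non-Americans is an assumption of this sketch.
    """
    p_american = US_POPULATION / WORLD_POPULATION
    p_congress_given_american = MEMBERS_OF_CONGRESS / US_POPULATION
    numerator = p_congress_given_american * p_american
    denominator = numerator + p_congress_given_not_american * (1 - p_american)
    return numerator / denominator


p_forward = MEMBERS_OF_CONGRESS / US_POPULATION   # P(Congress | American)
p_backward = posterior_american_given_congress()  # P(American | Congress)
print(f"P(Congress | American) = {p_forward:.2e}")
print(f"P(American | Congress) = {p_backward:.3f}")
```

Under these assumptions the two conditional probabilities sit at opposite ends of the scale, which is why swapping one for the other ruins the argument.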
Science continually uses flawed reasoning analogous to the example above. We start with a “null hypothesis,” a hypothesis we seek to disprove. If our data are highly unlikely assuming this hypothesis, we reject that hypothesis.
- If the null hypothesis is correct, then these data are highly unlikely.
- These data have occurred.
- Therefore, the null hypothesis is highly unlikely.
Again the probability is backward. We want to know the probability of the hypothesis given the data, not the probability of the data given the hypothesis.
We can’t reject a null hypothesis just because we’ve seen data that are rare under this hypothesis. Maybe our data are even more rare under the alternative. It is rare for an American to be in Congress, but it is even more rare for someone who is not American to be in the US Congress!
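A small sketch of this point, with made-up likelihoods and an assumed 50/50 prior (none of these numbers come from Cohen or from the post): the same "rare under the null" data can leave the null either very probable or very improbable, depending on how rare the data are under the alternative.

```python
def posterior_null(p_data_given_null, p_data_given_alt, prior_null=0.5):
    """P(null | data) by Bayes' rule for two competing hypotheses.

    All inputs are illustrative assumptions, not values from the post.
    """
    numerator = p_data_given_null * prior_null
    denominator = numerator + p_data_given_alt * (1.0 - prior_null)
    return numerator / denominator


# Data are rare under the null (0.01) but even rarer under the alternative.
rarer_under_alt = posterior_null(0.01, 0.001)
# Same rarity under the null, but the data are common under the alternative.
common_under_alt = posterior_null(0.01, 0.5)

print(f"alternative makes data rarer:  P(null | data) = {rarer_under_alt:.3f}")
print(f"alternative makes data common: P(null | data) = {common_under_alt:.3f}")
```

In both cases the probability of the data under the null is identical; only the comparison with the alternative changes the conclusion.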
I found this illustration in The Earth is Round (p < 0.05) by Jacob Cohen (1994). Cohen in turn credits Pollard and Richardson (1987) in his references.
That’s a well-argued point, and it’s generally true, but in an attempt to defend significance, it’s also true that (given some assumptions) if we were to run this test twenty times (assuming independent random samples each time, which of course is unlikely), and the null hypothesis were true, we’d expect to accidentally reject it only about once. It’s all about risk and the a priori likelihood of the null hypothesis being true (in practice, of course, the null hypothesis is rarely ever true). So for determining whether a drug is more effective than placebo, being conservative makes a certain amount of sense, as we don’t want to accept a new drug too easily. For determining whether a drug has a nasty side effect, having a null hypothesis that it doesn’t is potentially quite a bad idea.
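The "twenty times" arithmetic can be checked directly. This sketch assumes a significance level of 0.05, a true null hypothesis, and fully independent tests, which the comment itself notes is an idealization.

```python
ALPHA = 0.05   # significance level, as in "p < 0.05"
N_TESTS = 20   # the twenty repetitions mentioned above

# Each test falsely rejects a true null with probability ALPHA, so the number
# of false rejections is Binomial(N_TESTS, ALPHA) under independence.
expected_false_rejections = N_TESTS * ALPHA
p_no_rejection = (1 - ALPHA) ** N_TESTS
p_at_least_one = 1 - p_no_rejection
p_exactly_one = N_TESTS * ALPHA * (1 - ALPHA) ** (N_TESTS - 1)

print(f"expected false rejections: {expected_false_rejections:.2f}")
print(f"P(no false rejection):     {p_no_rejection:.3f}")
print(f"P(at least one):           {p_at_least_one:.3f}")
print(f"P(exactly one):            {p_exactly_one:.3f}")
```

So "once" is the long-run average: any particular batch of twenty tests may produce zero, one, or several false rejections.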
I agree, and this is a great example, but to me the most striking part of that article was the quotes! Hypothesis testing is “a potent but sterile intellectual rake who leaves in his merry path a long train of ravished maidens but no viable scientific offspring”? Wow.
mister k: I agree that if the null hypothesis were true, it would not often be rejected. But the kicker is if the null hypothesis were true. Very often the null hypothesis is something nobody believes. For example, the hypothesis that two treatments have the same effect. Surely they don’t, whatever they are. There must be some difference. So why do we base decisions on inferences drawn from a premise we agree is false?
Jerzy: Agreed. Quite a spicy quote!
What I find interesting is that it illustrates the common misunderstanding of the Frequentist point of view, in step 3 of the last syllogism. Although it is commonly viewed this way, the Frequentist position is that the data are highly unlikely conditional on the null hypothesis being true; it does not actually state anything about the probability of the null hypothesis being true. In fact, from a Frequentist perspective even talking about the probability of a null hypothesis being true is a category error, since it either is true or it is not, and there is nothing probabilistic or stochastic about whether it is true or not. Similarly, p-values are often misunderstood as the probability that the null hypothesis is true. By misunderstood I mean from the Frequentist perspective.
I don’t think there is anything inherently wrong with a null hypothesis which nobody believes. The reason is that the experiment is not about proving the null hypothesis to be true, but seeing if there is sufficient evidence to reject it as being true, again from the Frequentist perspective. Bear in mind that the Frequentist design is not complete without a statement about the power of a test conditional on an alternative hypothesis being true.
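As an illustration of what a power statement looks like, here is a minimal sketch for a one-sided z-test with known standard deviation. The effect size, sigma, and sample sizes are hypothetical values chosen for the example, and the function name is not from any library; only `statistics.NormalDist` from the Python standard library (3.8+) is used.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mean = 0 when the true mean is `effect`.

    Assumes normally distributed data with known standard deviation `sigma`.
    """
    std_normal = NormalDist()
    critical_z = std_normal.inv_cdf(1.0 - alpha)  # reject H0 when z exceeds this
    shift = effect * sqrt(n) / sigma              # mean of z under the alternative
    return 1.0 - std_normal.cdf(critical_z - shift)


# Hypothetical alternative: a true effect of 0.2 standard deviations.
for n in (10, 50, 200):
    print(f"n = {n:3d}: power = {z_test_power(effect=0.2, sigma=1.0, n=n):.3f}")
```

When the effect is exactly zero the function returns alpha, the false rejection rate; for a fixed nonzero effect, power grows with the sample size.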
If someone offers me payment in the form of gold, and lets me choose which of two offered gold bars to accept as payment, would it be reasonable for me to test them to see if I can find evidence that would lead me to reject the null hypothesis that they contain the same number of atoms? No one would believe they have the same number of atoms, but if I can find no compelling evidence that they do not, I think it would be rational to accept either gold bar as payment. In other words, by constructing a test which uses a null hypothesis that no one believes, I can still make a rational decision. Note that by no means do I ever conclude that they have the same number of atoms, just that I can find no evidence that they do not. I certainly do not make any statement about the probability of any proposed difference in number of atoms!
Certainly I would agree that it would be more useful to construct a Bayesian experimental design and to calculate a posterior distribution to be used to make my decision. But I don’t think that makes the Frequentist experimental design irrational simply because everyone would agree a priori that the null hypothesis is exceedingly unlikely.
Besides, two treatments could very well have the same effect, if they are the same treatment! I’m not really being facetious. What if both treatments have zero effect? For example, what if treatment A is to have someone on the other side of the world say, “abracadabra,” and treatment B is to have that person say, “shazam.” How sure are we that there is some difference between these two different treatments?
Was it irrational to think that gravity has exactly the same effect on mass regardless of the type of mass? Surely there must be some difference between the effect of gravity on a given mass of lead versus the same mass of feathers, right? If you set up an experiment with the null hypothesis that the difference in effect is exactly zero, could we fairly criticize the experiment for using a null hypothesis which no one believes?
With frequentist statistics, you typically compute the probability of a statistic being at least the observed value under the null hypothesis. So it is reasonable to assume that the null hypothesis includes enough information to compute this probability. If it is unlikely that an American is a member of Congress, suppose, as an example, that I can assume that the null hypothesis implies random sampling. Therefore, given the information that the person sampled is a member of Congress, I reject with high confidence the null hypothesis “The person sampled is randomly chosen from a collection of Americans.”
I find this rejection unsurprising: if I asked for a survey of a random sample of Americans and I found that those sampled all happened to be working for the survey company and in their head office at the time I requested the sample I would reject the survey.
I would similarly find unsurprising the rejection of a null hypothesis that presumed some non-random sampling process that was supposed to produce members of Congress with small probability but in fact produced one.
Rewrite the first syllogism as
1. If [person] is {from USA}, [person] is probably not (member of Congress).
2. This [person] is (member of Congress).
3. Therefore, [person] is probably not {from USA}.
This is wrong because we can write a contradiction to 3 as
0. (member of Congress) is probably {from USA}
but we can not write
0. (member of Congress) must be either {from USA} or not {from USA}.
Now, rewrite the second syllogism as
1. If [data] is {from H0}, [data] is probably not (observed).
2. This [data] is (observed).
3. Therefore, [data] is probably not {from H0}
This is right because we can not write a contradiction to 3 as
0. (observed) is probably {from H0}
but we can write
0. (observed) must be either {from H0} or not {from H0}.
Conclusion: the two syllogisms are not comparable.
Even worse is when someone “accepts” the null hypothesis when p>0.05 without doing an appropriate prospective power calculation, or finds p>0.05, does a retrospective power calculation, and then spits in the face of the experiment.
I found the comments very illuminating, but wonder where Devil’s Advocate learned to develop such arguments. I’d very much like to brush up on logic, and find the argument difficult to follow.
For anyone else that tried to reach the Cohen paper and thought to look at the comments before jumping over to Google Scholar: http://www.ics.uci.edu/~sternh/courses/210/cohen94_pval.pdf