I recently found out about a book that was published earlier this year, The Cult of Statistical Significance by Stephen Ziliak and Deirdre McCloskey. The subtitle is sure to stir up controversy: How the Standard Error Costs Us Jobs, Justice, and Lives.
From the parts I’ve read it sounds like the central criticism of the book is that statistical significance is not necessarily scientific significance. Statistical significance questions whether an effect exists and is unconcerned with the size or importance of the effect.
Significance testing errs in two directions. First, in practice many people believe that any hypothesis with a p-value less than 0.05 is very likely true and important, though often such hypotheses are untrue and unimportant. Second, many act as if a hypothesis with a p-value greater than 0.05 is “insignificant” regardless of context. Not only is the 0.05 cutoff arbitrary, it is quite common to say there is evidence if p = 0.049 and to say there is no evidence if p = 0.051. Common sense tells you that if 0.049 provides evidence then 0.051 provides slightly less evidence rather than no evidence.
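To put a number on how close p = 0.049 and p = 0.051 are, here is a minimal sketch that converts each two-sided p-value back to the z statistic that would produce it. It assumes a simple two-sided test with a normal test statistic, which is an assumption made for illustration and not something taken from the book; the function name is hypothetical.

```python
from statistics import NormalDist

def z_from_two_sided_p(p):
    """Return the |z| statistic whose two-sided normal p-value equals p."""
    # A two-sided p-value puts p/2 in each tail of the standard normal.
    return NormalDist().inv_cdf(1 - p / 2)

z_049 = z_from_two_sided_p(0.049)
z_051 = z_from_two_sided_p(0.051)

print(f"p = 0.049 corresponds to |z| = {z_049:.3f}")
print(f"p = 0.051 corresponds to |z| = {z_051:.3f}")
print(f"difference in |z| = {z_049 - z_051:.3f}")
```

Under this assumption the two test statistics sit on either side of the familiar 1.96 and differ by only a few hundredths of a standard error, which is why treating one as evidence and the other as no evidence makes little sense.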
The book gives the example of Merck saying there is “no evidence” that Vioxx has a higher probability of causing heart attacks than naproxen because their study did not achieve the magical 0.05 significance level. The book argues that “significance” should depend on context. When the stakes are higher, such as people suffering heart attacks, it should take less evidence before we declare an effect significant. Also, if you don’t want to find significance, you can always reduce the size of your study to decrease your chances of finding significance. [I have not followed the Vioxx case and have no opinion on its specifics.] In addition to the Vioxx case, Ziliak and McCloskey provide case studies in economics, psychology, and medicine.
Whenever someone raises objections to significance testing the reaction is always “Yes, everyone knows that.” Everyone agrees that the 0.05 cutoff is arbitrary, everyone agrees that effect sizes matter, etc. And yet nearly everyone continues to play the p < 0.05 game.
Related posts:
Origin of “statistically significant”
Five criticisms of significance testing


Scott 12.03.08 at 09:35
I could not agree more with this, John. It amazes me that everyone “knows” that effect size is important and the .05 cutoff is arbitrary, but published research continues to center around significance anyway. As you know, not only does decreasing the sample size remove significance, even the tiniest effect becomes significant if your sample size is large enough. In the field of education this means that almost anything is significant when thousands of children are tested. (Just about anything in a classroom has at least a tiny effect on just about anything else.) I am currently distressed over certain questionable practices that are being pushed because research has shown their “significance” even though effect sizes are clearly very small. Unfortunately the currently published research on these practices is not talking about the effect sizes at all.
John, how can academia’s knowledge of statistics be so sophisticated and their research be so ignorant of these basics?
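The dependence of significance on sample size can be sketched with a short calculation. This is a rough illustration under stated assumptions: two equal-sized groups, a known standard deviation of 1, and a two-sided z-test. The effect size of 0.02 standard deviations and the sample sizes are hypothetical values chosen for the example, not figures from any study.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_p_value(effect_size, n_per_group):
    """Two-sided z-test p-value when the observed standardized mean
    difference equals effect_size and each group has n_per_group subjects
    (assumes known unit variance in both groups)."""
    # Standard error of a difference of two means with unit variance.
    standard_error = sqrt(2 / n_per_group)
    z = effect_size / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

tiny_effect = 0.02  # hypothetical: 2% of a standard deviation

p_small_study = two_sample_p_value(tiny_effect, n_per_group=100)
p_large_study = two_sample_p_value(tiny_effect, n_per_group=100_000)

print(f"n = 100 per group:     p = {p_small_study:.3f}")
print(f"n = 100,000 per group: p = {p_large_study:.2e}")
```

The observed effect is identical in both runs. Only the sample size changes, yet the small study is nowhere near the 0.05 cutoff while the large one clears it easily.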
John Venier 12.03.08 at 12:28
Amen to that. Some examples come to mind — Linus Pauling’s testing of the benefits of vitamin C showed statistical significance which was clinically trivial — his study had a large sample size.
Every time you see an ad on TV which says something like “no toothpaste was shown to be better than ours” mentally add “in our test with no power”.
One professor I know did some consulting for an unnamed hospital which was being sued by patients who contracted infections after hip replacement surgery. The professor was hired as an expert witness to testify whether the frequency of such infections at that hospital was worse than expected or not. When he analyzed the data as a whole there was no significant evidence that their rate was higher than expected. But the hospital had three surgical teams which performed those surgeries. When the data were grouped according to teams, one team was clearly much more likely to leave patients with an infection. As it turned out, that team was relatively new compared to the other two, and evidently it makes a big difference how long a team has been working together.
I’ve seen other cutoffs in other fields — 0.10 and 0.20 in fields such as psychology, sociology, and biology. But the papers I saw still used cutoffs to assess significance. I recall that they did publish the actual p-values that they obtained.
ekzept 12.03.08 at 22:31
There is a heartening discussion of this in Kraemer and Thiemann, How Many Subjects? Statistical Power Analysis In Research. The other context in which it is dealt with is the careful comparison of Bayesian vs Frequentist decision theory, per J.O.Berger’s treatment, Statistical Decision Theory and Bayesian Analysis, which I found thorough but difficult. Finally, a recent one: D.J.Murdoch, Y.-L. Tsai, J.Adcock, “P-values are random variables”, The American Statistician, 62(3), 242-245.
John Johnson 12.05.08 at 12:18
What amazes me is all the Phase 2 studies that “fail to meet endpoint” or “miss statistical significance” when those studies aren’t even powered in the first place. Or, even worse, the practice of “p-value fishing,” i.e. the attempt to cover up a worthless drug by showing a p-value less than 0.05.
This “cult of statistical significance” has led to many bizarre behaviors and habits in the scientific community, ones I think we would do well to expunge.
John Venier 12.05.08 at 13:13
It seems that a lot of lucrative drugs are found to have ‘better’ replacements just as their patents are expiring. The manufacturer holds the patent on the replacement, naturally. Often these replacements are metabolites of the original drug which are found to have the same action as the original drug. I suspect that the studies which demonstrate a (statistically) significant increase in effectiveness are extremely overpowered and that the clinical significance is nil.
Another ‘improvement’ strategy is simply using the more active stereoisomer (enantiomer) exclusively instead of both. The non-racemic drug can be patented, named, and marketed as a new drug, resulting in literally billions of dollars of profit. Prilosec and Nexium are a great example of this. I would imagine that with billions of dollars at stake AstraZeneca tried every way imaginable to demonstrate any statistically significant improvement, however meager. From what I understand (and I am no specialist) there is actually no clinically meaningful difference between the drugs, but Nexium has a shiny new patent and Prilosec is now available as a generic OTC.
Luis 11.16.10 at 20:28
Of course ‘everyone knows that’ but what are the alternatives? I think the biggest problem is not showing that the current use of ‘statistically significant’ is flawed: that is very easy to see with your example of 0.049 versus 0.051. The issue is what researchers can easily use instead.
The almost automatic reaction from researchers is to look for small p-values (years of reading papers highlighting them) and, suddenly, you want them to introduce a not so clearcut difference between significant (and therefore many times publishable) or not. It is not a matter of just following a recipe, we know that, but besides the statement of the obvious we need alternatives that we can sell to those pesky journal editors demanding small p-values.
John 11.16.10 at 20:39
Luis: One possibility would be to use estimation rather than hypothesis testing and report estimated effect sizes. And in a Bayesian context you could do more, reporting not just the posterior mean but also the probability that the estimated parameter exceeds a threshold of scientific significance.
Also in the Bayesian context, you could report posterior model probabilities based on Bayes factors. That’s what most people think a p-value is. See How loud is the evidence?
Another is to report probable effect sizes, or the probability that an effect exceeds a certain threshold.
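As a rough sketch of the last suggestion, the probability that an effect exceeds a threshold can be computed in closed form in the simplest Bayesian setting: a normal prior on the effect combined with a normally distributed estimate. Everything numeric below (the prior, the estimate, its standard error, and the threshold of scientific significance) is a hypothetical value chosen for illustration, and the function name is made up for this example.

```python
from math import sqrt
from statistics import NormalDist

def prob_effect_exceeds(threshold, estimate, std_error, prior_mean, prior_sd):
    """Posterior P(effect > threshold) under a normal prior and a normal
    likelihood with known standard error (conjugate normal-normal model)."""
    # Precisions (inverse variances) add in the conjugate normal model.
    prior_precision = 1 / prior_sd**2
    data_precision = 1 / std_error**2
    post_precision = prior_precision + data_precision

    # Posterior mean is the precision-weighted average of prior and data.
    post_mean = (prior_precision * prior_mean
                 + data_precision * estimate) / post_precision
    post_sd = sqrt(1 / post_precision)

    posterior = NormalDist(post_mean, post_sd)
    return 1 - posterior.cdf(threshold)

# Hypothetical inputs: a vague prior centered at zero, an estimated effect
# of 0.5 with standard error 0.2, and a scientific threshold of 0.3.
prob = prob_effect_exceeds(threshold=0.3, estimate=0.5, std_error=0.2,
                           prior_mean=0.0, prior_sd=10.0)
print(f"P(effect > 0.3 | data) = {prob:.3f}")
```

The output is a statement about the size of the effect, not just its existence, so the threshold can be set by what matters scientifically and no 0.05 cutoff is involved.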
Luis 11.18.10 at 20:01
Thanks for your answer. Effect sizes still prompt the question ‘but are the effects really significantly different?’ I do agree that the probability that an effect exceeds a certain threshold could be more meaningful. Thanks for the interesting blog posts.