The cult of significance testing

I recently found out about a book that was published earlier this year, The Cult of Statistical Significance by Stephen Ziliak and Deirdre McCloskey. The subtitle is sure to stir up controversy: How the Standard Error Costs Us Jobs, Justice, and Lives.

From the parts I’ve read it sounds like the central criticism of the book is that statistical significance is not necessarily scientific significance. Statistical significance questions whether an effect exists and is unconcerned with the size or importance of the effect.
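To make the distinction concrete, here is a minimal sketch of my own (it is not from the book) using a two-sample z-test with a known standard deviation. The function name, the effect of 0.01 standard deviations, and the sample sizes are all hypothetical choices for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_p_value(effect, n, sd=1.0):
    """Two-sided p-value for an observed mean difference `effect`
    between two groups of size n each, assuming a known common sd.
    (Hypothetical helper for illustration only.)"""
    # Standard error of the difference between two group means.
    standard_error = sd * sqrt(2.0 / n)
    z = effect / standard_error
    # Use the lower tail directly to avoid cancellation for large |z|.
    return 2.0 * NormalDist().cdf(-abs(z))


# The same tiny effect (0.01 standard deviations) at three sample sizes.
for n in (100, 10_000, 1_000_000):
    print(f"n per group = {n:>9,}: p = {two_sample_p_value(0.01, n):.3g}")
```

Under these assumptions the effect size never changes, yet the p-value goes from well above 0.9 with 100 subjects per group to far below 0.05 with a million per group. The p-value is tracking the sample size, not the importance of the effect.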

Significance testing errs in two directions. First, in practice many people believe that any hypothesis with a p-value less than 0.05 is very likely true and important, though often such hypotheses are untrue and unimportant. Second, many act as if a hypothesis with a p-value greater than 0.05 is “insignificant” regardless of context. Not only is the 0.05 cutoff arbitrary, it is quite common to say there is evidence if p = 0.049 and to say there is no evidence if p = 0.051. Common sense tells you that if 0.049 provides evidence then 0.051 provides slightly less evidence rather than no evidence.
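One way to see how little separates p = 0.049 from p = 0.051 is to convert each to the test statistic that would produce it. The sketch below assumes a two-sided z-test, which is my choice for illustration; the function name is hypothetical.

```python
from statistics import NormalDist


def z_for_two_sided_p(p):
    """Absolute z-score whose two-sided p-value equals p.
    (Hypothetical helper; assumes a standard normal test statistic.)"""
    return NormalDist().inv_cdf(1.0 - p / 2.0)


z_significant = z_for_two_sided_p(0.049)
z_not_significant = z_for_two_sided_p(0.051)

print(f"z for p = 0.049: {z_significant:.4f}")
print(f"z for p = 0.051: {z_not_significant:.4f}")
print(f"difference:      {z_significant - z_not_significant:.4f}")
```

The two z-scores sit on either side of the familiar 1.96 and differ by less than 0.02 standard errors, a gap that would be treated as noise in almost any other context.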

The book gives the example of Merck saying there is “no evidence” that Vioxx has a higher probability of causing heart attacks than naproxen because their study did not achieve the magical 0.05 significance level. The book argues that “significance” should depend on context. When the stakes are higher, such as people suffering heart attacks, it should take less evidence before we declare an effect significant. Also, if you don’t want to find significance, you can always reduce the size of your study to decrease your chances of finding significance. [I have not followed the Vioxx case and have no opinion on its specifics.] In addition to the Vioxx case, Ziliak and McCloskey provide case studies in economics, psychology, and medicine.
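The point about shrinking a study can be sketched as a power calculation. The numbers below are hypothetical and have nothing to do with the Vioxx data: I assume a two-sided, two-sample z-test at the 0.05 level, a known standard deviation of 1, and a true effect of 0.3 standard deviations.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample_z(effect, n, alpha=0.05, sd=1.0):
    """Probability that a two-sided two-sample z-test rejects the null
    when the true mean difference is `effect` and each group has n
    subjects. (Hypothetical helper for illustration only.)"""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # Mean of the z statistic under the assumed true effect.
    shift = effect / (sd * sqrt(2.0 / n))
    # Reject when the statistic lands in either tail.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


for n in (200, 50, 20):
    print(f"n per group = {n:>3}: power = {power_two_sample_z(0.3, n):.2f}")
```

With these assumed values, a study with 200 subjects per group detects the effect more than 80% of the time, while one with 20 per group detects it less than 20% of the time. The effect is equally real in both cases; only the chance of calling it "significant" has changed.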

Whenever someone raises objections to significance testing the reaction is always “Yes, everyone knows that.” Everyone agrees that the 0.05 cutoff is arbitrary, everyone agrees that effect sizes matter, etc. And yet nearly everyone continues to play the p < 0.05 game.

12 thoughts on “The cult of significance testing”

  1. I could not agree more with this, John. It amazes me that everyone “knows” that effect size is important and the .05 cutoff is arbitrary, but published research continues to center around significance anyway. As you know, not only does decreasing the sample size remove significance, even the tiniest effect becomes significant if your sample size is large enough. In the field of education this means that almost anything is significant when thousands of children are tested. (Just about anything in a classroom has at least a tiny effect on just about anything else.) I am currently distressed over certain questionable practices that are being pushed because research has shown their “significance” even though effect sizes are clearly very small. Unfortunately the currently published research on these practices is not talking about the effect sizes at all.

    John, how can academia’s knowledge of statistics be so sophisticated and their research be so ignorant of these basics?

  2. Amen to that. Some examples come to mind — Linus Pauling’s testing of the benefits of vitamin C showed statistical significance which was clinically trivial — his study had a large sample size.

    Every time you see an ad on TV which says something like “no toothpaste was shown to be better than ours” mentally add “in our test with no power”.

    One professor I know did some consulting for an unnamed hospital which was being sued by patients who contracted infections after hip replacement surgery. The professor was hired as an expert witness to testify whether the frequency of such infections at that hospital was worse than expected or not. When he analyzed the data as a whole there was no significant evidence that their rate was higher than expected. But the hospital had three surgical teams which performed those surgeries. When the data were grouped according to teams, one team was clearly much more likely to leave patients with an infection. As it turned out, that team was relatively new compared to the other two, and evidently it makes a big difference how long a team has been working together.

    I’ve seen other cutoffs in other fields — 0.10 and 0.20 in fields such as psychology, sociology, and biology. But the papers I saw still used cutoffs to assess significance. I recall that they did publish the actual p-values that they obtained.

  3. There is a heartening discussion of this in Kraemer and Thiemann, How Many Subjects? Statistical Power Analysis In Research. It is also dealt with in the careful comparison of Bayesian vs. frequentist decision theory in J. O. Berger’s Statistical Decision Theory and Bayesian Analysis, which I found thorough but difficult. Finally, a recent one: D. J. Murdoch, Y.-L. Tsai, J. Adcock, “P-values are random variables”, The American Statistician, 62(3), 242-245.

  4. What amazes me is all the Phase 2 studies that “fail to meet endpoint” or “miss statistical significance” when those studies aren’t even powered in the first place. Or, even worse, the practice of “p-value fishing,” i.e. the attempt to cover up a worthless drug by showing a p-value less than 0.05.

    This “cult of statistical significance” has led to many bizarre behaviors and habits in the scientific community, ones I think we would do well to expunge.

  5. It seems that a lot of lucrative drugs are found to have ‘better’ replacements just as their patents are expiring. The manufacturer holds the patent on the replacement, naturally. Often these replacements are metabolites of the original drug which are found to have the same action as the original drug. I suspect that the studies which demonstrate a (statistically) significant increase in effectiveness are extremely overpowered and that the clinical significance is nil.

    Another ‘improvement’ strategy is simply using the more active stereoisomer (enantiomer) exclusively instead of both. The non-racemic drug can be patented, named, and marketed as a new drug, resulting in literally billions of dollars of profit. Prilosec and Nexium are a great example of this. I would imagine that with billions of dollars at stake AstraZeneca tried every way imaginable to demonstrate any statistically significant improvement, however meager. From what I understand (and I am no specialist) there is actually no clinically meaningful difference between the drugs, but Nexium has a shiny new patent and Prilosec is now available as a generic OTC.

  6. Of course ‘everyone knows that’ but what are the alternatives? I think the biggest problem is not showing that the current use of ‘statistically significant’ is flawed: that is very easy to see with your example of 0.049 versus 0.051. The issue is what researchers can easily use instead.

    The almost automatic reaction from researchers is to look for small p-values (after years of reading papers that highlight them), and suddenly you want them to accept a not-so-clear-cut distinction between results that are significant (and therefore often publishable) and results that are not. It is not a matter of just following a recipe, we know that, but beyond stating the obvious we need alternatives that we can sell to those pesky journal editors demanding small p-values.

  7. Luis: One possibility would be to use estimation rather than hypothesis testing and report estimated effect sizes. And in a Bayesian context you could do more, reporting not just the posterior mean but also the probability that the estimated parameter exceeds a threshold of scientific significance.

    Also in the Bayesian context, you could report posterior model probabilities based on Bayes factors. That’s what most people think a p-value is. See How loud is the evidence?

    Another is to report probable effect sizes, or the probability that an effect exceeds a certain threshold.

  8. Thanks for your answer. Effect sizes still prompt the question ‘but are the effects really significantly different?’ I do agree that the probability that an effect exceeds a certain threshold could be more meaningful. Thanks for the interesting blog posts.

  9. “And in a Bayesian context you could do more, reporting not just the posterior mean but also the probability that the estimated parameter exceeds a threshold of scientific significance.”

    The Bayesian approach is not a universal solution. How do you handle an unknown prior in the Bayesian approach? I have heard that a uniform prior is used instead. Isn’t that as ad hoc as a 0.05 p-value cutoff?

    Of course, 0.05 is a heuristic; it does not replace a well-designed experimental setup.

  10. There are countless implicit sources of subjectivity in any statistical analysis, and yet everyone focuses on the one explicitly subjective decision, the choice of prior.

    There are many ways to address this issue. You could do a sensitivity analysis to see whether the choice of prior has much effect on the conclusion; often it doesn’t. You could use “reference priors,” priors justified by some optimization property. You could use a subjective prior based on a consensus of expert opinion. You could look for analogous data to base a prior on.

    There are often much weaker links in the analysis chain.

  11. I run into this issue in my work fairly often, too, in that we often have millions of records with which to conduct analysis. As a result, basically everything is highly significant, even if the difference between levels of a factor is very small. I try to avoid the issue as much as feasible in my papers, but sometimes you have to shrug, put in the p-value, and move on. The best way forward probably is to go Bayesian as discussed above.

  12. I think that P-values and significance testing get a pretty bad rap and have said so before. But I completely agree that when possible, including a measure of scientific significance such as a confidence interval is both beneficial and appropriate.

    The main criticisms of P-values and hypothesis testing focus on errors in properly interpreting the results of these procedures. Regardless of what statistic people choose (Bayesian or Frequentist or Likelihoodist or Algorithmic), scientists and business people will want to use that statistic to make decisions. It isn’t clear to me that replacing the P-value with statistic X (posterior probability, etc.) will not lead to just as many issues with interpretation.

    For example, I don’t think that the problem with Bayesian approaches is the subjectivity of the prior. John is spot on when he says that most of statistics is based on subjective decisions, regardless of your framework. The problem is that I’m not sure Bayesian inference will actually lead to fewer mistakes in interpretation.

    I think rather than focusing on criticizing the use of specific statistics, we should be spending our energy on increasing statistical literacy. I think this would have a much larger positive impact than just outlawing significance testing/P-values.
