John Ioannidis wrote an article in Chance magazine a couple years ago with the provocative title Why Most Published Research Findings are False. [Update: Here’s a link to the PLoS article reprinted by Chance. And here are some notes on the details of the paper.] Are published results really that bad? If so, what’s going wrong?
Whether “most” published results are false depends on context, but a large percentage of published results are indeed false. Ioannidis published a report in JAMA looking at some of the most highly-cited studies from the most prestigious journals. Of the studies he considered, 32% were found to have either incorrect or exaggerated results. Of those studies with a 0.05 p-value, 74% were incorrect.
The underlying causes of the high false-positive rate are subtle, but one problem is the pervasive use of p-values as measures of evidence.
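Folklore has it that a “p-value” is the probability that a study’s conclusion is wrong, and so a 0.05 p-value would mean the researcher should be 95 percent sure that the results are correct. In this case, folklore is absolutely wrong. A p-value is the probability of seeing data at least as extreme as what was observed, assuming the null hypothesis is true. It is not the probability that the null hypothesis is true given the data. And yet most journals accept a p-value of 0.05 or smaller as sufficient evidence.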
Here’s an example that shows how p-values can be misleading. Suppose you have 1,000 totally ineffective drugs to test. About 1 out of every 20 trials will produce a p-value of 0.05 or smaller by chance, so about 50 trials out of the 1,000 will have a “significant” result. Suppose only those trials are published. The error rate in the lab was indeed 5 percent, but the error rate in the literature coming out of the lab is 100 percent!
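The arithmetic in this example can be checked with a short simulation. This is only a sketch under stated assumptions: every drug is ineffective, each trial's p-value is drawn uniformly from [0, 1] (which is how p-values behave under a true null hypothesis with a continuous test statistic), and only trials with p ≤ 0.05 get published. The function name and the seed are made up for illustration.

```python
import random

def simulate_null_trials(n_trials=1000, alpha=0.05, seed=0):
    """Return the number of 'significant' trials among n_trials ineffective drugs.

    Assumption: under a true null hypothesis the p-value is uniform on [0, 1],
    so each p-value is drawn directly instead of simulating patient data.
    """
    rng = random.Random(seed)
    p_values = [rng.random() for _ in range(n_trials)]
    return sum(1 for p in p_values if p <= alpha)

n_trials = 1000
published = simulate_null_trials(n_trials)

# Error rate in the lab: false positives as a share of all trials run.
lab_error_rate = published / n_trials

# Error rate in the literature: every drug is ineffective, so every
# published "significant" result is a false positive.
false_positives = published
literature_error_rate = false_positives / published if published else float("nan")

print(f"significant trials: {published} of {n_trials}")
print(f"error rate in the lab: {lab_error_rate:.1%}")
print(f"error rate in the literature: {literature_error_rate:.0%}")
```

The count of significant trials varies from run to run around 50, but the share of published results that are wrong stays at 100 percent whenever anything is published.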
The example above is exaggerated, but look at the JAMA study results again. In a sample of real, highly cited medical studies, 32% had incorrect or exaggerated results. And among those with a p-value of 0.05, which just barely showed significance, 74% were incorrect.
See Jim Berger’s criticisms of p-values for more technical depth.
12 thoughts on “Most published research results are false”
Many times I see research saying that “something” was tested on thousands of people, which, compared to the world population, is less than 0.001%… so how can I believe that the investigation is correct?
Basing a conclusion on a very small subset of the world population may be legitimate. It all depends on whether the sample is representative. One of the surprising results from statistics is that the quality of an inference depends only on the size of the sample, not on the size of the population the sample was drawn from. (Assuming the population is so large that you can safely ignore the difference between sampling with and without replacement, which is true of the world population.)
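To make this concrete, here is a small sketch that computes the standard error of a sample proportion for a fixed sample size and several population sizes, using the standard finite population correction for sampling without replacement. The function name, the sample size of 1,000, and the rough figure of 8 billion for the world population are assumptions chosen for illustration.

```python
import math

def standard_error(p, n, population=None):
    """Standard error of a sample proportion from a simple random sample of size n.

    If `population` is given, apply the finite population correction
    sqrt((N - n) / (N - 1)) for sampling without replacement.
    """
    se = math.sqrt(p * (1 - p) / n)
    if population is not None:
        se *= math.sqrt((population - n) / (population - 1))
    return se

n = 1000  # hypothetical sample size
for population in (10_000, 1_000_000, 8_000_000_000):
    se = standard_error(0.5, n, population)
    print(f"population {population:>13,}: standard error {se:.5f}")

# With replacement (equivalently, an infinite population): no correction.
print(f"no correction:            standard error {standard_error(0.5, n):.5f}")
```

The correction only matters when the sample is a noticeable fraction of the population. For a population in the billions it is negligible, and the precision is driven by the sample size alone.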
See also Cosma Shalizi’s post, “The Neutral Model of Inquiry”: http://cscs.umich.edu/~crshalizi/weblog/698.html
Filtering publishable results by desirable p-value (last paragraph) sounds like a classic case of Survivor Bias. Correct?
You could say there’s a survival bias: only those experiments that meet some statistical requirement are published. But it’s more subtle than that. You’ve got to have some procedure for deciding which results are correct. I would argue that p-values are not the right filter, or that at a minimum p-values are incorrectly interpreted.
It seems relevant to me to note that, though the .05 p-value indicates a 1 in 20 chance that the null hypothesis should not have been rejected, in cases where one is testing whether treatment A performs better than treatment B, the results reporting that A was better with p-value < .05 include a high proportion of cases in which A was at least as good as B. Put another way, I am unlikely to be making a truly bad decision by choosing A in preference to B, even though I may be mistaken that A is truly better. To place this in context, some of us are primarily concerned with making a choice, and not primarily with the certainty that our choice is superior. Our certainty is not unimportant; it is just of less importance. I find the discussion and observation itself fascinating.
But, proving it is wrong can be wrong as well.
xkcd illustrated this well:
Here’s a fresh report that says the situation is bad, but not as bad as all that.
The more optimistic report seems to be based only on a theoretical model, but the more pessimistic result has empirical support which the article discusses at the bottom.
Many suggest using confidence intervals instead of p-values. I also really like empirical methods like the bootstrap. These methods often give me “significant results” (CIs not crossing zero) in cases where classic (no resampling) p-value-based inference does not.
What should I infer from this phenomenon? Does CI-based significance, with or without resampling, have the same type I error rate?
What are the improvements of using CIs over p-values? Maybe just that we’d get rid of an artificial threshold that is filtering out good science with negative results?
(Yes, there are many questions mixed in here.)
The question sometimes comes up: “How do you know the follow-up studies that test the first conclusion are right?” We don’t, but follow-up studies are much more likely to be correct than the original studies. Here’s why.
The original study had lots of choices to make, sometimes called “researcher degrees of freedom.” Every choice gives another possibility to find something statistically significant. Rejected hypotheses and rejected ways of evaluating hypotheses fall on the cutting room floor never to be seen again. But the researcher who sets out to reproduce a finding has fewer choices. His hypothesis is dictated by what he’s trying to reproduce.
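A rough simulation shows how quickly these choices add up. It is a sketch under simplifying assumptions: there is no real effect, each choice amounts to one more independent test at the 0.05 level (real analyses of the same data are correlated, so the true inflation would be smaller), and the study counts as a “finding” if any one test comes out significant. The function name and the numbers of analyses are hypothetical.

```python
import random

def false_positive_rate(n_analyses, alpha=0.05, n_studies=20_000, seed=0):
    """Fraction of null studies in which at least one analysis has p <= alpha.

    Assumption: analyses are independent and p-values are uniform on [0, 1]
    because no real effect exists.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        if any(rng.random() <= alpha for _ in range(n_analyses)):
            hits += 1
    return hits / n_studies

for k in (1, 5, 20):
    simulated = false_positive_rate(k)
    exact = 1 - (1 - 0.05) ** k  # chance that at least one of k tests is significant
    print(f"{k:>2} analyses: simulated {simulated:.3f}, exact {exact:.3f}")
```

A researcher reproducing a finding is closer to the single-analysis case, since the hypothesis and the way of evaluating it are already fixed.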