A few years ago the scientific community suddenly realized that a lot of scientific papers were wrong. I imagine a lot of people knew this all along, but suddenly it became a topic of discussion and people realized the problem was bigger than imagined.
The layman’s first response was “Are you saying scientists are making stuff up?” and the response from the scientific community was “No, that’s not what we’re saying. There are subtle reasons why an honest scientist can come to the wrong conclusion.” In other words, don’t worry about fraud. It’s something else.
Well, if it’s not fraud, what is it? The most common explanations are sloppiness and poor statistical practice.
The sloppiness hypothesis says that irreproducible results may be the result of errors. Or maybe the results are essentially correct, but the analysis is not reported in sufficient detail for someone to verify it. I first wrote about this in 2008.
While I was working for MD Anderson Cancer Center, a couple of my colleagues dug into irreproducible papers and tried to reverse engineer the mistakes and omissions. For example, this post mentioned some of the erroneous probability formulas that were implicitly used in journal articles.
The bad statistics hypothesis was championed by John Ioannidis in his now-famous paper Why Most Published Research Findings Are False. The article could have been titled “Why most research findings will be false, even if everyone is honest and careful.” For a cartoon version of Ioannidis’s argument, see xkcd’s explanation of why jelly beans cause acne. In a nutshell, the widespread use of p-values makes it too easy to find spurious but publishable results.
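A rough way to see the argument is the positive predictive value formula from Ioannidis’s paper, in its simplest form with no bias term: PPV = (1 − β)R / ((1 − β)R + α), where R is the pre-study odds that a tested relationship is real, 1 − β is the power, and α is the significance threshold. Here is a minimal Python sketch. The function name and the input values are my own illustrative assumptions, not numbers from the paper.

```python
def positive_predictive_value(prior_odds: float, power: float, alpha: float = 0.05) -> float:
    """Fraction of statistically significant findings that are actually true.

    prior_odds: ratio of true to false relationships among those tested (R)
    power:      probability of detecting a true relationship (1 - beta)
    alpha:      probability of a false positive when there is no relationship
    """
    true_positives = power * prior_odds   # proportional to true findings that reach significance
    false_positives = alpha               # proportional to null findings that reach significance
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Hypothetical scenario: 1 in 11 tested hypotheses is true, studies are underpowered.
    print(positive_predictive_value(prior_odds=0.1, power=0.2))
    # Hypothetical scenario: even odds that a hypothesis is true, well-powered studies.
    print(positive_predictive_value(prior_odds=1.0, power=0.8))
```

Under the first set of assumptions, fewer than a third of the “significant” findings are true, with no fraud or sloppiness involved. Under the second, the large majority are.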
Ioannidis argued from statistical theory that most results could be false, but potentially things could be better in practice than in theory. Unfortunately they are not. Numerous studies have tried to estimate empirically what proportion of papers cannot be reproduced. The estimate depends on context, but it’s high.
For example, ScienceNews reported this week on an attempt to reproduce 193 experiments in cancer biology. Only 50 of the experiments could be reproduced, and of those, the reported effects were found to be 85% smaller than initially reported. Here’s the full report.
This post started out by putting fraud aside. In a sort of scientific version of Hanlon’s razor, we agreed not to attribute to fraud what could be adequately explained by sloppiness and bad statistics. But what about fraud?
There was a spectacular case of fraud in The Lancet last year.
The article was published May 22, 2020 and retracted on June 4, 2020. I forget the details, but the fraud was egregious. For example, if I remember correctly, the study claimed to have data on more than 100% of the population in some regions. Peer review didn’t catch the fraud but journalists did.
Who knows how common fraud is? I see articles occasionally that try to estimate it. But exposing fraud takes a lot of work, and it does not advance your career.
I said above that my former colleagues were good at reverse engineering errors. They also ended up exposing fraud. They started out trying to figure out how Anil Potti could have come to the results he did, and finally determined that he could not have. This ended up being reported in The Economist and on 60 Minutes.
As Nick Brown recently said on Twitter,
At some point I think we’re going to have to accept that “widespread fraud” is both a plausible and parsimonious explanation for the huge number of failed replications we see across multiple scientific disciplines.
That’s a hard statement to accept, but that doesn’t mean it’s wrong.
If an attempt to reproduce a study fails, how do we know which one was right? The second study could be wrong, but it’s probably not. Verification is generally easier than discovery. The original authors probably explored multiple hypotheses looking for a publishable result, while the replicators tested precisely the published hypothesis.
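To put a number on the asymmetry, here is a small sketch of how quickly false positives accumulate when several hypotheses are explored, as in the jelly bean cartoon. It assumes the tests are independent, every null hypothesis is true, and the threshold is 0.05. The function name and the choice of 20 tests are illustrative assumptions.

```python
def prob_at_least_one_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of num_tests independent tests of a true
    null hypothesis comes out significant at level alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # A replicator tests one pre-specified hypothesis; an explorer may test many.
    for num_tests in (1, 5, 20):
        print(num_tests, round(prob_at_least_one_false_positive(num_tests), 3))
```

With a single pre-specified hypothesis the chance of a spurious hit is 5%. With 20 hypotheses it is about 64%, so under these assumptions an exploratory study is far more likely to report something that isn’t there.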
Andrew Gelman suggested a thought experiment. When a large follow-up study fails to replicate a smaller initial study, imagine if the timeline were reversed. If someone ran a small study and came up with a different result than a previous large study, which study would have more credibility?