Three reasons scientific results cannot be reproduced

A few years ago the scientific community suddenly realized that a lot of scientific papers were wrong. I imagine a lot of people knew this all along, but suddenly it became a topic of discussion and people realized the problem was bigger than imagined.

The layman’s first response was “Are you saying scientists are making stuff up?” and the response from the scientific community was “No, that’s not what we’re saying. There are subtle reasons why an honest scientist can come to the wrong conclusion.” In other words, don’t worry about fraud. It’s something else.

Well, if it’s not fraud, what is it? The most common explanations are sloppiness and poor statistical practice.

Sloppiness

The sloppiness hypothesis says that irreproducible results may be the result of errors. Or maybe the results are essentially correct, but the analysis is not reported in sufficient detail for someone to verify it. I first wrote about this in 2008.

While I was working for MD Anderson Cancer Center, a couple of my colleagues dug into irreproducible papers and tried to reverse engineer the mistakes and omissions. For example, this post mentioned some of the erroneous probability formulas that were implicitly used in journal articles.

Bad statistics

The bad statistics hypothesis was championed by John Ioannidis in his now-famous paper Why Most Published Research Findings Are False. The article could have been titled “Why most research findings will be false, even if everyone is honest and careful.” For a cartoon version of Ioannidis’s argument, see xkcd’s explanation of why jelly beans cause acne. In a nutshell, the widespread use of p-values makes it too easy to find spurious but publishable results.
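The arithmetic behind both the cartoon and the paper can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the first function assumes the tests are independent and that every null hypothesis is true, as in the jelly bean cartoon; the second uses the simplest form of the positive predictive value formula from Ioannidis’s paper, PPV = (1 − β)R / (R − βR + α), with no bias term. The function names and the input values (20 tests, pre-study odds of 1 to 10, power of 80% and 20%) are made up for illustration and are not taken from any study.

```python
def prob_any_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of num_tests independent tests of true null
    hypotheses comes out 'significant' at level alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests


def positive_predictive_value(prior_odds: float, alpha: float, power: float) -> float:
    """Probability that a 'significant' finding is actually true.

    Implements PPV = (1 - beta) R / (R - beta R + alpha), where
    power = 1 - beta and R = prior_odds is the pre-study odds that a
    tested relationship is real. Bias is ignored (an assumption).
    """
    return power * prior_odds / (power * prior_odds + alpha)


if __name__ == "__main__":
    # Hypothetical inputs chosen only to illustrate the formulas.
    print(f"{prob_any_false_positive(20):.2f}")
    for power in (0.8, 0.2):
        print(power, f"{positive_predictive_value(0.1, 0.05, power):.2f}")
```

With 20 tests at the 0.05 level, the chance of at least one spurious “discovery” is about 64%. With the assumed pre-study odds of 1 to 10, a significant result is true about 62% of the time at 80% power, and only about 29% of the time at 20% power.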

Ioannidis argued on statistical grounds that most results could be false, though things could conceivably be better in practice than in theory. Unfortunately they are not. Numerous studies have tried to empirically estimate [1] what proportion of papers cannot be reproduced. The estimate depends on context, but it’s high.

For example, ScienceNews reported this week on an attempt to reproduce 193 experiments in cancer biology. Only 50 of the experiments could be reproduced, and of those, the reported effects were found to be 85% smaller than initially reported. Here’s the full report.

Fraud

This post started out by putting fraud aside. In a sort of scientific version of Hanlon’s razor, we agreed not to attribute to fraud what could be adequately explained by sloppiness and bad statistics. But what about fraud?

There was a spectacular case of fraud in The Lancet last year.

[Image: article summary with RETRACTED stamped on top in red]

The article was published May 22, 2020 and retracted on June 4, 2020. I forget the details, but the fraud was egregious. For example, if I remember correctly, the study claimed to have data on more than 100% of the population in some regions. Peer review didn’t catch the fraud but journalists did.

Who knows how common fraud is? I see articles occasionally that try to estimate it. But exposing fraud takes a lot of work, and it does not advance your career.

I said above that my former colleagues were good at reverse engineering errors. They also ended up exposing fraud. They started out trying to figure out how Anil Potti could have come to the results he did, and finally determined that he could not have. This ended up being reported in The Economist and on 60 Minutes.

As Nick Brown recently said on Twitter,

At some point I think we’re going to have to accept that “widespread fraud” is both a plausible and parsimonious explanation for the huge number of failed replications we see across multiple scientific disciplines.

That’s a hard statement to accept, but that doesn’t mean it’s wrong.

[1] If an attempt to reproduce a study fails, how do we know which one was right? The second study could be wrong, but it’s probably not. Verification is generally easier than discovery. The original authors probably explored multiple hypotheses looking for a publishable result, while the replicators tested precisely the published hypothesis.

Andrew Gelman suggested a thought experiment. When a large follow-up study fails to replicate a smaller initial study, imagine if the timeline were reversed. If someone ran a small study and came up with a different result than a previous large study, which study would have more credibility?

6 thoughts on “Fraud, Sloppiness, and Statistics”

Ross

11 December 2021 at 11:28

Richard McElreath does some fascinating work on this topic in his YouTube series and book “Statistical Rethinking.” He deeply covers the (to put it nicely) suboptimal incentives in the academic publishing world.

Alvaro Castro Quesada

11 December 2021 at 11:51

This post reminds me about R. A. Fisher on Gregor Mendel’s work.

Frank Wilhoit

11 December 2021 at 16:31

There is a fourth category: performance.

BobC

12 December 2021 at 21:30

This always brings to mind my experience as an engineering undergrad in the early ’80’s who had just learned engineering stats, had started digging into what would later be called “Design of Experiments”, and wanted to find some good non-engineering examples of applied stats to broaden my exposure.

I naturally turned to medical studies from the ’70’s having lots of citations, which I figured had to be the gold standard for careful technique and analysis. Only to be severely disappointed, even shocked, at the hash the authors had made of what I’d been taught were the “first principles” of applied stats. I brought my list of papers, observations and questions to my stats prof, who started trying to explain some key domain differences, only to pull himself up short and say “They’re wrong.”

He sent my observations along with his notes over to the medical school faculty, seeking clarification. To receive no reply whatsoever. I moved on from the issue and finished my degree.

A decade ago I started coaching adult beginner triathletes, and needed to learn the real effects of obesity on folks who were fit but heavy. I found zero studies specifically investigating this topic, but did stumble across one fascinating story of a study that was in trouble because early data broke their statistical model. The large longitudinal study sought to answer, once and for all, which medical outcomes were most strongly associated with high BMI. While early data looked good overall, the number of outliers was excessive, and was not handled by their analytical model. They called in some stats troubleshooters from Harvard, who found nearly all the outliers were heavy folks who were very active. The troubleshooters wrote a short paper or letter describing their analysis, and then coached the study to monitor a few more activity-related parameters.

The New York Times stumbled across the short paper, and covered it in a brief piece with one of my favorite headlines ever: “Fitness Beats Fatness.” It turned out that, if you set aside BMI, athletes (high-activity folks) in the study deemed “obese” are statistically indistinguishable from non-obese athletes when it comes to medical problems (excluding only bone and joint issues). There were no correlations with BMI for any other medical issues among the heavy folks.

I am so glad to see “Fitness Beats Fatness” has become a meme in the medical community. Once again reminding physicians to “treat the symptoms, not the test result”.

Sam Hardwick

20 December 2021 at 07:42

I think sloppiness and fraud blend into each other pretty seamlessly. People try to get the best results for their career with the least effort, and because they are competing with other people who output a lot of research, they too must output a lot of research. They have good intentions, and would never commit fraud, but they don’t try very hard to find problems with their methods and conclusions. In fact, the worse their methods, the easier it may be to generate research.

A couple of decades in, they’ve become completely jaded and cynical and are knowingly perpetrating what they would once have thought of as fraud.

KT2

20 December 2021 at 20:50

BobC said “the number of outliers was excessive, and was not handled by their analytical model.”

I’ve asked Gelman and others to redo LSV, QALY’s & DALY’s as the outliers – à la a pandemic – were “cleaned #! too noisy” from used datasets.

Perhaps someone here might know. Is “too noisy” fraud or bias?

And a related “we were truthful in the headline” science communication piece
https://astralcodexten.substack.com/p/the-phrase-no-evidence-is-a-red-flag

Thanks JDC
