Keith Baggerly and Kevin Coombes just wrote a paper about the analysis errors they commonly see in bioinformatics articles. From the abstract:
One theme that emerges is that the most common errors are simple (e.g. row or column offsets); conversely, it is our experience that the most simple errors are common.
The full title of the article is “Deriving chemosensitivity from cell lines: forensic bioinformatics and reproducible research in high-throughput biology.” The article will appear in the next issue of Annals of Applied Statistics and is available here. The key phrase in the title is forensic bioinformatics: reverse engineering the statistical analysis of bioinformatics data. The authors give five case studies of data analyses that cannot be reproduced and infer what analysis was actually carried out.
One of the more egregious errors came from the creative application of probability. One paper uses innovative probability results such as
P(ABCD) = P(A) + P(B) + P(C) + P(D) – P(A) P(B) P(C) P(D)
P(AB) = max( P(A), P(B) ).
Baggerly and Coombes were remarkably understated in their criticism: “None of these rules are standard.” In less diplomatic language, the rules are wrong.
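A quick way to see that neither rule can hold in general is to compare them with exact probabilities on a small example. The sketch below is not from the paper. It assumes four independent events, each with probability 1/5 (an arbitrary choice), and the names `exact_probability`, `rule_four`, and `rule_max` are made up for illustration. Since the notation P(ABCD) could be read as "all of" or "any of" the events, both readings are checked.

```python
from fractions import Fraction
from itertools import product

# Assumption for illustration: four independent events, each with probability 1/5.
P = [Fraction(1, 5)] * 4


def exact_probability(probs, combine):
    """Exact probability that combine(outcome) is true for independent events.

    Enumerates every happened/did-not-happen pattern and sums the weights
    of the patterns for which `combine` (for example all or any) holds.
    """
    total = Fraction(0)
    for outcome in product([True, False], repeat=len(probs)):
        weight = Fraction(1)
        for happened, p in zip(outcome, probs):
            weight *= p if happened else 1 - p
        if combine(outcome):
            total += weight
    return total


def rule_four(probs):
    """The inferred rule for P(ABCD): sum of the probabilities minus their product."""
    prod = Fraction(1)
    for p in probs:
        prod *= p
    return sum(probs) - prod


def rule_max(pa, pb):
    """The inferred rule P(AB) = max(P(A), P(B))."""
    return max(pa, pb)


print("rule for four events:", rule_four(P))                   # 499/625
print("exact P(all four):   ", exact_probability(P, all))      # 1/625
print("exact P(any of four):", exact_probability(P, any))      # 369/625

print("max rule:            ", rule_max(P[0], P[1]))           # 1/5
print("exact P(A and B):    ", exact_probability(P[:2], all))  # 1/25
print("exact P(A or B):     ", exact_probability(P[:2], any))  # 9/25
```

Under these assumptions neither rule matches either reading. The max rule does agree with P(A or B) in the special case where one event contains the other, but that is not a general law of probability.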
To be fair, Baggerly and Coombes point out
These rules are not explicitly stated in the methods; we inferred them either from formulae embedded in Excel files … or from exploratory data analysis …
So, the authors didn’t state false theorems; they just used them. And nobody would have noticed if Baggerly and Coombes had not tried to reproduce their results.
17 thoughts on “Make up your own rules of probability”
I have a paper in the review process right now that points out an error in a previous paper. I put the author of that paper down as a suggested reviewer. Maybe this is why the reviews are taking so long.
Thanks for pointing this article out, John. It’s in equal parts fascinating and terrifying. I saw Keith Baggerly talk about this at BioConductor — it’s a fascinating detective story how they pieced together the original data and figured out the response. But the lack of access to that data, and the refusal of the journal to correct the record, is scary.
One of the worries is that even if the stats are internally correct, the authors may have left out factors that affect the reliability of the results because they didn’t think them important enough to report. For example, how they changed their models five times before getting the analysis they wanted. This is already a major problem in epidemiology, and bioinformatics just multiplies the problem. Which of a hundred food groups causes cancer becomes which of thousands of genes causes cancer.
Went to a talk at the International Biometrics a few years ago where someone (Mark Segal?) pulled apart a bioinformatics paper by comparing what the authors did to what they put in the paper, and showed how the difference totally destroyed their significance calculations.
… but … but … the formulas are in the spreadsheet, so they must be right!
This is a well-known problem, and it’s not restricted to statistics (nor, would I assume, to bioinformatics papers). Just pick any paper that uses differential equations to describe a biochemical system. Chances are that the numerical analysis is either completely wrong or at least incomplete. In fact, the functions often describe variations of quantities of chemicals in the cells – and you should be very wary when these functions suddenly become negative. Or take papers with analyses of gene regulatory networks.
And all those papers that assume some variation of P(A|B)=P(B|A), turning significance levels into the probability of a hypothesis being true.
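The gap between P(A|B) and P(B|A) can be made concrete with Bayes’ theorem. The sketch below uses made-up numbers that do not come from any paper discussed here: a 5% significance level, 80% power, and a 10% prior chance that the effect is real. The function name `prob_null_given_significant` is hypothetical.

```python
def prob_null_given_significant(alpha, power, prior_real):
    """P(null is true | result is significant), by Bayes' theorem.

    alpha      : P(significant | null is true), the significance level
    power      : P(significant | effect is real)
    prior_real : P(effect is real) before seeing the data
    """
    prior_null = 1.0 - prior_real
    # Total probability of seeing a significant result.
    p_significant = alpha * prior_null + power * prior_real
    return alpha * prior_null / p_significant


# Hypothetical numbers chosen only for illustration.
alpha, power, prior_real = 0.05, 0.80, 0.10
posterior_null = prob_null_given_significant(alpha, power, prior_real)

print(f"P(significant | null) = {alpha:.2f}")           # 0.05
print(f"P(null | significant) = {posterior_null:.2f}")  # 0.36
```

With these assumed inputs, a result significant at the 5% level still leaves roughly a one in three chance that the null hypothesis is true, so the two conditional probabilities are far from interchangeable.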
My research indicates that most claims coming from medical observational studies that have been tested in randomized clinical trials fail to replicate. Some of the problems are covered in a lecture which can be found here:
Statistically speaking, we all make mistakes.
…but if the statistics are wrong, am I perfect?
Now if only the AGW scientists would release the models and data used to ‘prove’ man made global warming, we would have a treasure trove of new statistical methods…
Let’s not forget all the errors that are simply misapplications of statistics, for example the application of probability statistics to nonprobability samples. It’s the wild west out there.
Annie: I completely agree. See Three reasons to distrust microarray results.
The RHS of the second rule, max( P(A), P(B) ), jumps out as what is the standard rule for disjunction (the T-conorm) in fuzzy logic. And it’s not unusual in that domain to be sloppy and just call everything a probability even if you’re considering something less constrained. So, one could be more charitable and suppose that the authors had something like this in mind, but even so, that’s probably still a bit too much mind-reading.
That link is now: http://www.reproducibleresearch.net/blog/2008/12/10/three-reasons-to-distrust-microarray-results/comment-page-1/
I think this was reported on Andrew Gelman’s blog in the last couple of weeks, too. It was pointed out that the unusual rule you mentioned is a standard rule in “fuzzy logic”; I don’t know if that’s the case or not, but it makes the choice to use that rule seem a little less arbitrary. I also don’t know anything about “fuzzy logic”, except that it’s not Bayesian, and so I assume it’s simply wrong. I’d love to know what you, John, or any of the rest of you think about the legitimacy of “fuzzy logic”.
There are certain reasonable axioms for how to quantify and reason about uncertainty that imply the laws of probability. (Cox’s theorem?) So any other system, such as fuzzy logic, must violate some of these axioms. That implies that fuzzy logic will give unreasonable results in some circumstances. This may not be a problem in practice, but my interest in fuzzy logic plummeted when I learned that result.
Also, I read in a fuzzy logic book once that one must be careful to keep fuzzy logic from collapsing to probability. My reply would be “why avoid ‘collapsing’ to probability?”
In statistics, we take an average. Nobody tells us that the data might not be normal, so that average might not be meaningful. Eventually, all is revealed, but the intro to statistics class never changes and does little more than start you down the road to ruin.
The amazing thing about how wrong the formula for P(ABCD) is … is how easy it is to see that the incorrect right-hand side of the equation can be bigger than 1.
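The point about exceeding 1 takes only a few lines to check. This is a minimal sketch; the function name `rule_rhs` and the choice of equal probabilities on a 0.01 grid are assumptions made for illustration.

```python
def rule_rhs(pa, pb, pc, pd):
    """Right-hand side of the inferred rule for P(ABCD)."""
    return pa + pb + pc + pd - pa * pb * pc * pd


# Four events that each have probability 1/2.
print(rule_rhs(0.5, 0.5, 0.5, 0.5))  # 1.9375

# With all four probabilities equal to p, find the smallest p on a
# 0.01 grid for which the "probability" is already larger than 1.
smallest = next(k / 100 for k in range(101) if rule_rhs(*[k / 100] * 4) > 1)
print(smallest)  # 0.26
```

So under this rule, four events that are each only slightly more likely than one in four already produce a result that cannot be a probability.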