Make up your own rules of probability

Keith Baggerly and Kevin Coombes just wrote a paper about the analysis errors they commonly see in bioinformatics articles. From the abstract:

One theme that emerges is that the most common errors are simple (e.g. row or column offsets); conversely, it is our experience that the most simple errors are common.

The full title of the article is “Deriving chemosensitivity from cell lines: forensic bioinformatics and reproducible research in high-throughput biology.” It will appear in the next issue of Annals of Applied Statistics and is available here. The key phrase in the title is forensic bioinformatics: reverse engineering the statistical analysis of bioinformatics data. The authors give five case studies of data analyses that cannot be reproduced, and they infer what analysis was actually carried out.

One of the more egregious errors came from the creative application of probability. One paper uses innovative probability results such as

P(ABCD) = P(A) + P(B) + P(C) + P(D) – P(A) P(B) P(C) P(D)

and

P(AB) = max( P(A), P(B) ).

Baggerly and Coombes were remarkably understated in their criticism: “None of these rules are standard.” In less diplomatic language, the rules are wrong.
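
A quick sanity check makes the point concrete. The sketch below is only an illustration: the fair-die sample space, the events A and B, and the function names are my assumptions, not anything taken from the paper or its spreadsheets. Since P(AB) normally means the probability that both events occur, but the max rule might have been meant for "A or B", the check compares it against both readings.

```python
from fractions import Fraction

# Illustrative sample space (an assumption): one roll of a fair six-sided die.
OMEGA = frozenset(range(1, 7))

def prob(event):
    """Probability of an event (a subset of OMEGA) under the uniform measure."""
    return Fraction(len(event & OMEGA), len(OMEGA))

def rule_sum(*ps):
    """First inferred rule: sum of the probabilities minus their product."""
    product = Fraction(1)
    for p in ps:
        product *= p
    return sum(ps) - product

def rule_max(p_a, p_b):
    """Second inferred rule: the larger of the two probabilities."""
    return max(p_a, p_b)

# Four events, each with probability 1/2: the first rule exceeds 1,
# so its output cannot be a probability under any interpretation.
half = Fraction(1, 2)
four_event_value = rule_sum(half, half, half, half)
print("sum rule for four events of probability 1/2:", four_event_value)

# Two disjoint events: A = roll is 1, 2 or 3; B = roll is 4, 5 or 6.
A = frozenset({1, 2, 3})
B = frozenset({4, 5, 6})
print("P(A and B):", prob(A & B))
print("P(A or B): ", prob(A | B))
print("max rule:  ", rule_max(prob(A), prob(B)))
```

With four events of probability 1/2, the first rule returns 31/16, which is greater than 1. For the two disjoint events, the probability of both is 0 and the probability of either is 1, while the max rule gives 1/2, so it matches neither reading.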

To be fair, Baggerly and Coombes point out

These rules are not explicitly stated in the methods; we inferred them either from formulae embedded in Excel files … or from exploratory data analysis …

So, the authors didn’t state false theorems; they just used them. And nobody would have noticed if Baggerly and Coombes had not tried to reproduce their results.

Posted in Science, Statistics
15 comments on “Make up your own rules of probability”
  1. John S. says:

    I have a paper in the review process right now that points out an error in a previous paper. I put the author of that paper down as a suggested reviewer. Maybe this is why the reviews are taking so long.

  2. Ken says:

    One of the worries is that even if the stats are internally correct, the authors may have decided that some factors affecting the reliability of the results weren’t important enough to tell you about. For example, how they changed their models five times before getting the analysis they wanted. This is something that is already a major problem in epidemiology, and bioinformatics just multiplies the problem. Which of a hundred food groups causes cancer becomes which of thousands of genes causes cancer.

    Went to a talk at the International Biometrics a few years ago where someone (Mark Segal ?) pulled apart a bioinformatics paper by comparing what the authors did to what they put in the paper and how it totally destroyed their significance calculations.

  3. David Smith says:

    Thanks for pointing this article out, John. It’s in equal parts fascinating and terrifying. I saw Keith Baggerly talk about this at BioConductor — it’s a fascinating detective story how they pieced together the original data and figured out the response. But the lack of access to that data, and the refusal of the journal to correct the record, is scary.

  4. EastwoodDC says:

    … but … but … the formulas are in the spreadsheet, so they must be right!


  5. Konrad says:

    This is a well-known problem, and it’s not restricted to statistics (nor, would I assume, to bioinformatics papers). Just pick any paper that uses differential equations to describe a biochemical system. Chances are that the numerical analysis is either completely wrong or at least incomplete. In fact, the functions often describe variations of quantities of chemicals in the cells – and you should be very wary when these functions suddenly become negative. Or take papers with analyses of gene regulatory networks.

  6. Dan P says:

    And all those papers that assume some variation of P(A|B)=P(B|A), turning significance levels into the probability of a hypothesis being true.

  7. Stan Young says:

    My research indicates that most claims coming from medical observational studies that have been tested in randomized clinical trials fail to replicate. Some of the problems are covered in a lecture which can be found here:

  8. Lane says:

    Statistically speaking, we all make mistakes.
    …but if the statistics are wrong, am I perfect?

  9. The Man says:

    Now if only the AGW scientists would release the models and data used to ‘prove’ man made global warming, we would have a treasure trove of new statistical methods…

  10. Annie Pettit says:

    Let’s not forget all the errors that are simply misapplications of statistics, for example the application of probability statistics to nonprobability samples. It’s the wild west out there.

  11. Joey says:

    The RHS of the second rule, max( P(A), P(B) ), jumps out as what is the standard rule for disjunction (the T-conorm) in fuzzy logic. And it’s not unusual in that domain to be sloppy and just call everything a probability even if you’re considering something less constrained. So, one could be more charitable and suppose that the authors had something like this in mind, but even so, that’s probably still a bit too much mind-reading.

  12. Tom Pollard says:

    I think this was reported on Andrew Gelman’s blog in the last couple of weeks, too. It was pointed out that the unusual rule you pointed out is a standard rule in “fuzzy logic”; I don’t know if that’s the case or not, but it makes the choice to use that rule seem a little less arbitrary. I also don’t know anything about “fuzzy logic”, except that it’s not Bayesian, and so I assume it’s simply wrong. I’d love to know what you, John, or any of the rest of you think about the legitimacy of “fuzzy logic”.

  13. John says:

    There are certain reasonable axioms for how to quantify and reason about uncertainty that imply the laws of probability. (Cox’s theorem?) So any other system, such as fuzzy logic, must violate some of these axioms. That implies that fuzzy logic will give unreasonable results in some circumstances. This may not be a problem in practice, but my interest in fuzzy logic plummeted when I learned that result.

    Also, I read in a fuzzy logic book once that one must be careful to keep fuzzy logic from collapsing to probability. My reply would be “why avoid ‘collapsing’ to probability?”

6 Pings/Trackbacks for "Make up your own rules of probability"
  1. [...] blog post Make up your own rules of probability discusses a couple of the innovative rules of probability Baggerly and Coombes discovered while [...]

  3. [...] It escaped my attention last year, in part because “Annals of Applied Statistics” is not high on my journal radar. However, other bloggers did pick it up: see posts at Reproducible Research Ideas and The Endeavour. [...]

  4. [...] based on simple visual inspection and counting, and are not documented further.” There are also gross statistical errors. Continuing in the usual theme of my occasional posts, I’ll share what reproducible research [...]

  5. [...] A new paper in Annals of Applied Statistics by Keith Baggerly and Kevin Coombes discusses Forensic bioinformatics. I heard Keith talk about this at MCMSki3. Basically, Baggerly and Coombes looked at a few bioinformatics papers, tried to reproduce some statistical analyses, and failed. In failing, Baggerly and Coombes realized the authors of those papers had not done what they said they had done. So Baggerly and Coombes reverse engineered what was actually done with astounding results, e.g. training data with sensitive/resistant labels reversed and bizarre probability laws [P(A,B) = max(P(A),P(B))]. [...]

  6. [...] some creative extensions on computing the probability of the intersection of events as highlighted here). Rather many of the issues (but not all) had more to do with data preprocessing. After working [...]