Make up your own rules of probability

by John on September 18, 2009

Keith Baggerly and Kevin Coombes just wrote a paper about the analysis errors they commonly see in bioinformatics articles. From the abstract:

One theme that emerges is that the most common errors are simple (e.g. row or column offsets); conversely, it is our experience that the most simple errors are common.

The full title of the article is “Deriving chemosensitivity from cell lines: forensic bioinformatics and reproducible research in high-throughput biology.” The article will appear in the next issue of Annals of Applied Statistics and is available here. The key phrase in the title is forensic bioinformatics: reverse engineering the statistical analysis of bioinformatics data. The authors give five case studies of data analyses that cannot be reproduced and infer what analysis was actually carried out.

One of the more egregious errors came from the creative application of probability. One paper uses innovative probability results such as

P(ABCD) = P(A) + P(B) + P(C) + P(D) – P(A) P(B) P(C) P(D)

and

P(AB) = max( P(A), P(B) ).

Baggerly and Coombes were remarkably understated in their criticism: “None of these rules are standard.” In less diplomatic language, the rules are wrong.
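
To see concretely why neither rule can be right, here is a minimal sketch in Python. It is my own toy check, not anything taken from the paper. It assumes a sample space of four independent fair coin flips, with A, B, C, D the events that each flip comes up heads, and it reads P(AB) in the usual way as the probability that both A and B occur. The names prob, heads, rule_sum, and rule_max are made up for this illustration.

```python
from fractions import Fraction
from itertools import product

# Assumed toy sample space: four independent fair coin flips,
# so all 16 outcomes are equally likely.
outcomes = list(product([0, 1], repeat=4))

def prob(event):
    """Exact probability of an event (a predicate on outcomes) by counting."""
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

# Event i: flip i comes up heads. Each has probability 1/2.
heads = [lambda w, i=i: w[i] == 1 for i in range(4)]
p = [prob(e) for e in heads]

# First rule: P(A) + P(B) + P(C) + P(D) - P(A) P(B) P(C) P(D)
rule_sum = sum(p) - p[0] * p[1] * p[2] * p[3]
p_all = prob(lambda w: all(w))   # all four heads
p_any = prob(lambda w: any(w))   # at least one head

# Second rule: max( P(A), P(B) )
rule_max = max(p[0], p[1])
p_both = prob(lambda w: w[0] == 1 and w[1] == 1)
p_either = prob(lambda w: w[0] == 1 or w[1] == 1)

print("first rule:", rule_sum, " all four:", p_all, " at least one:", p_any)
print("second rule:", rule_max, " both:", p_both, " either:", p_either)
```

Under these assumptions the first rule gives 31/16, which is greater than 1 and so cannot be a probability under any reading of P(ABCD). The second rule gives 1/2, while the probability of both heads is 1/4 and the probability of either is 3/4. In general P(AB) can be no larger than min( P(A), P(B) ), so the max rule can hold only in the degenerate case where A and B almost surely occur together.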

To be fair, Baggerly and Coombes point out

These rules are not explicitly stated in the methods; we inferred them either from formulae embedded in Excel files … or from exploratory data analysis …

So, the authors didn’t state false theorems; they just used them. And nobody would have noticed if Baggerly and Coombes had not tried to reproduce their results.

{ 4 comments }

1

John S. 09.18.09 at 12:55

I have a paper in the review process right now that points out an error in a previous paper. I put the author of that paper down as a suggested reviewer. Maybe this is why the reviews are taking so long.

2

Ken 09.18.09 at 16:50

One of the worries is that even if the stats are internally correct, the authors may have left out factors that affect the reliability of the results because they didn’t think them important enough to mention. For example, how they changed their models five times before getting the analysis they wanted. This is already a major problem in epidemiology, and bioinformatics just multiplies it. Which of a hundred food groups causes cancer becomes which of thousands of genes causes cancer.

I went to a talk at the International Biometrics a few years ago where someone (Mark Segal?) pulled apart a bioinformatics paper by comparing what the authors did to what they put in the paper, and showed how the difference totally destroyed their significance calculations.

3

David Smith 09.18.09 at 17:23

Thanks for pointing this article out, John. It’s in equal parts fascinating and terrifying. I saw Keith Baggerly talk about this at BioConductor — it’s a fascinating detective story how they pieced together the original data and figured out the response. But the lack of access to that data, and the refusal of the journal to correct the record, is scary.

4

EastwoodDC 09.21.09 at 06:53

… but … but … the formulas are in the spreadsheet, so they must be right!

Sheesh.