Microarray technology makes it possible to examine the expression levels of thousands of genes at once. So one way to do cancer research is to run microarray analyses on cancer and normal tissue samples, hoping to discover genes that are more highly expressed in one or the other. If, for example, a few genes are highly expressed in cancer samples, the proteins these genes code for may be targets for new therapies.

For numerous reasons, cancer research is more complicated than simply running millions of microarray experiments and looking for differences. One complication is that false positives are very likely.

A previous post gives a formula for the probability of a reported result being true. The most important term in that formula is the prior odds R that a hypothesis in a certain context is correct. John Ioannidis gives a hypothetical but realistic example in the paper mentioned earlier (*). In his example, he supposes that 100,000 gene polymorphisms are being tested for association with schizophrenia. If 10 polymorphisms truly are associated with schizophrenia, the pre-study odds that a given polymorphism is associated are about 0.0001. If a study has 60% power (β = 0.4) and significance level α = 0.05, the post-study probability that a polymorphism determined to be associated really is associated is 0.0012. That is, a gene reported to be associated with schizophrenia is 12 times more likely to actually be associated with the disease than a gene chosen at random. However, the bad news is that 12 times 0.0001 is only 0.0012. There’s a 99.88% chance that the result is false.

The example above is extreme, but it shows that a completely brute-force approach isn’t going to get you very far. Nobody actually believes that 100,000 polymorphisms are equally likely to be associated with any disease. Biological information makes it possible to narrow down the list of things to test, increasing the value of R. Suppose it were possible to narrow the list down to 1,000 polymorphisms to test, but a couple of important genes were left out, leaving 8 true associations among the candidates. Then R increases to about 0.008, and the probability of a reported association being correct increases to 0.088. This is a great improvement, though reported results still have more than a 90% chance of being wrong.
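The two scenarios above can be checked with a few lines of code. The sketch below uses the positive predictive value formula from the Ioannidis paper, PPV = (1 − β)R / (R − βR + α); the function name `ppv` is just a label chosen here.

```python
def ppv(R, alpha=0.05, beta=0.4):
    """Post-study probability that a significant finding is true,
    given prior odds R, significance level alpha, and beta = 1 - power.
    Formula: PPV = (1 - beta) * R / (R - beta * R + alpha)."""
    return (1 - beta) * R / (R - beta * R + alpha)

# Brute-force scan: 10 true associations among 100,000 polymorphisms
R1 = 10 / 99990
print(round(ppv(R1), 4))  # about 0.0012

# Narrowed list: 8 true associations among roughly 1,000 candidates
R2 = 8 / 992
print(round(ppv(R2), 3))  # about 0.088
```

Both printed values match the numbers in the text, which makes it easy to experiment with how much narrowing the candidate list (raising R) helps compared with, say, demanding a smaller α.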

(*) John P. A. Ioannidis, Why most published research findings are false. CHANCE volume 18, number 4, 2005.

I’m new at this, but isn’t this sort of problem the thing that false discovery rate adjusted p-values (q-values) are supposed to account for?

You might also want to look at Wacholder et al.’s False Positive Report Probability (FPRP) (J Natl Cancer Inst. 2004 Mar 17;96(6):434-42), which is an attempt to directly address some of the issues you raise in a quasi-Bayesian framework.

Why wouldn’t repeat runs, with some mod of sampling and procedure, weed out false positives?

The advantage of microarrays is that they’re cheap. If you do enough independent runs to make quality inferences, they’re no longer cheap.

I know this is a rather old blog entry, but since the paper by Wacholder has been cited by Abhijit, I also wanted to cite this strong criticism of Wacholder’s paper: Lucke JF, A critique of the false-positive report probability, Genet Epidemiol. 2009 Feb;33(2):145-50. (http://www.ncbi.nlm.nih.gov/pubmed/18720477).