Microarrays | John D. Cook

Microarray technology makes it possible to examine the expression levels of thousands of genes at once. So one way to do cancer research is to run microarray analyses on cancer and normal tissue samples, hoping to discover genes that are more highly expressed in one or the other. If, for example, a few genes are highly expressed in cancer samples, the proteins these genes code for may be targets for new therapies.

For numerous reasons, cancer research is more complicated than simply running millions of microarray experiments and looking for differences. One complication is that false positives are very likely.

A previous post gives a formula for the probability of a reported result being true. The most important term in that formula is the prior odds R that a hypothesis in a certain context is correct. John Ioannidis gives a hypothetical but realistic example in the paper mentioned earlier (*). In his example, he supposes that 100,000 gene polymorphisms are being tested for association with schizophrenia. If 10 polymorphisms truly are associated with schizophrenia, the pre-study probability that a given polymorphism is associated is 0.0001. If a study has 60% power (β = 0.4) and significance level α = 0.05, the post-study probability that a polymorphism determined to be associated really is associated is 0.0012. That is, a polymorphism reported to be associated with schizophrenia is 12 times more likely to actually be associated with the disease than one chosen at random. However, the bad news is that 12 times 0.0001 is only 0.0012. There’s about a 99.9% chance that the result is false.
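The arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes the formula in question is the bias-free one from Ioannidis's paper, (1 − β)R / (R − βR + α), and the function name `post_study_probability` is made up for illustration.

```python
def post_study_probability(R, alpha=0.05, beta=0.4):
    """Probability that a reported association is real, given prior odds R.

    Assumes the bias-free formula (1 - beta) R / (R - beta R + alpha).
    """
    power = 1 - beta
    return power * R / (power * R + alpha)


# Ioannidis's example: 10 true associations among 100,000 polymorphisms.
# Strictly the odds are 10/99,990, but that is 0.0001 to the precision used here.
R = 10 / 100_000
ppv = post_study_probability(R)

print(f"post-study probability: {ppv:.4f}")
print(f"improvement over a random pick: {ppv / R:.0f}x")
print(f"chance a reported result is false: {1 - ppv:.1%}")
```

Since R is tiny compared to α, the result is roughly (1 − β)R/α, which is where the factor of 12 comes from: 0.6/0.05 = 12.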

The example above is extreme, but it shows that a completely brute-force approach isn’t going to get you very far. Nobody actually believes that 100,000 polymorphisms are equally likely to be associated with any disease. Biological information makes it possible to narrow down the list of things to test, increasing the value of R. Suppose it were possible to narrow the list down to 1,000 polymorphisms to test, but a couple of important ones were left out, leaving 8 of the original 10. Then R increases to 0.008. Now the probability of a reported association being correct increases to 0.088. This is a great improvement, though reported results still have more than a 90% chance of being wrong.
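Another way to see where 0.088 comes from is to count the positives you would expect in the narrowed-down study. The sketch below assumes, as in the example, that every truly associated polymorphism is detected with probability 0.6 and every other one is falsely flagged with probability 0.05. The function name `expected_positives` is hypothetical.

```python
def expected_positives(n_tested, n_true, alpha=0.05, beta=0.4):
    """Expected numbers of true and false positives in one study."""
    true_pos = n_true * (1 - beta)             # real associations detected
    false_pos = (n_tested - n_true) * alpha    # null polymorphisms flagged anyway
    return true_pos, false_pos


# Narrowed list: 1,000 polymorphisms tested, 8 of them truly associated.
true_pos, false_pos = expected_positives(1_000, 8)
share_real = true_pos / (true_pos + false_pos)

print(f"expected true positives:  {true_pos:.1f}")
print(f"expected false positives: {false_pos:.1f}")
print(f"share of reported results that are real: {share_real:.3f}")
```

On average fewer than 5 real associations are buried among about 50 false alarms, which is why more than 90% of reported results are still wrong.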

(*) John P. A. Ioannidis, Why most published research findings are false. CHANCE volume 18, number 4, 2005.