False positives for medical tests

The most commonly given example of Bayes’ theorem is testing for rare diseases. The results are not intuitive. If a disease is rare, then your probability of having the disease given a positive test result remains low. For example, suppose a disease affects 0.1% of the population and a test for the disease is 95% accurate. Then your probability of having the disease given that you test positive is only about 2%.
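To see where the 2% comes from, here is a quick back-of-the-envelope calculation in Python. I’m assuming “95% accurate” means the test has both 95% sensitivity and 95% specificity.

```python
# Posterior probability of disease given a positive test, via Bayes' theorem.
# Assumes "95% accurate" means 95% sensitivity and 95% specificity.

prevalence  = 0.001   # 0.1% of the population has the disease
sensitivity = 0.95    # P(positive | disease)
specificity = 0.95    # P(negative | no disease)

p_pos_given_disease    = sensitivity
p_pos_given_no_disease = 1 - specificity

# Total probability of testing positive
p_pos = p_pos_given_disease * prevalence + p_pos_given_no_disease * (1 - prevalence)

# Bayes' theorem: P(disease | positive)
p_disease_given_pos = p_pos_given_disease * prevalence / p_pos

print(f"P(disease | positive test) = {p_disease_given_pos:.3f}")  # about 0.019, roughly 2%
```

The false positives from the 99.9% of people who are healthy swamp the true positives from the 0.1% who are sick.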

Textbooks typically rush through the medical testing example, though I believe it takes more detail and numeric examples for it to sink in. I know I didn’t really get it the first couple of times I saw it presented.

I just posted an article that goes over the medical testing example slowly and in detail: Canonical example of Bayes’ theorem in detail. I take what may be rushed through in half a page of a textbook and expand it to six pages, and I use more numbers and graphs than equations. It’s worth going over this example slowly because once you understand it, you’re well on your way to understanding Bayes’ theorem.

Most published research results are false

John Ioannidis wrote an article in Chance magazine a couple years ago with the provocative title Why Most Published Research Findings are False.  [Update: Here’s a link to the PLoS article reprinted by Chance. And here are some notes on the details of the paper.] Are published results really that bad? If so, what’s going wrong?

Whether “most” published results are false depends on context, but a large percentage of published results are indeed false. Ioannidis published a report in JAMA looking at some of the most highly-cited studies from the most prestigious journals. Of the studies he considered, 32% were found to have either incorrect or exaggerated results. Of those studies with a 0.05 p-value, 74% were incorrect.

The underlying causes of the high false-positive rate are subtle, but one problem is the pervasive use of p-values as measures of evidence.

Folklore has it that a “p-value” is the probability that a study’s conclusion is wrong, and so a 0.05 p-value would mean the researcher should be 95 percent sure that the results are correct. In this case, folklore is absolutely wrong. And yet most journals accept a p-value of 0.05 or smaller as sufficient evidence.

Here’s an example that shows how p-values can be misleading. Suppose you have 1,000 totally ineffective drugs to test. About 1 out of every 20 trials will produce a p-value of 0.05 or smaller by chance, so about 50 trials out of the 1,000 will have a “significant” result, and only those studies will publish their results. The error rate in the lab was indeed 5%, but the error rate in the literature coming out of the lab is 100 percent!
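Here is a small simulation of that scenario, assuming the p-values are computed correctly, so that under the null hypothesis each one is uniformly distributed. Every drug is ineffective, so every “significant” result that gets published is a false positive.

```python
import random

random.seed(42)

n_drugs = 1000
alpha = 0.05

# Under the null hypothesis (the drug does nothing), a correctly computed
# p-value is uniformly distributed on [0, 1].
p_values = [random.random() for _ in range(n_drugs)]

# Only the "significant" results make it into the literature.
published = [p for p in p_values if p <= alpha]

print(f"Trials run:            {n_drugs}")
print(f"Published (p <= 0.05): {len(published)}")   # roughly 50
print(f"Error rate in the lab: {len(published) / n_drugs:.0%}")
print("Error rate in print:   100% (none of the drugs work)")
```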

The example above is exaggerated, but look at the JAMA study results again. In a sample of real medical experiments, 32% of those with “significant” results were wrong. And among those that just barely showed significance, 74% were wrong.

See Jim Berger’s criticisms of p-values for more technical depth.

Proofs of false statements

Mark Dominus brought up an interesting question last month: have there been major screw-ups in mathematics? He defines a “major screw-up” to be a flawed proof of an incorrect statement that was accepted for a significant period of time. He excludes the case of incorrect proofs of statements that were nevertheless true.

It’s remarkable that he can even ask the question. Can you imagine someone asking with a straight face whether there have ever been major screw-ups in, say, software development? And yet it takes some hard thought to come up with examples of really big blunders in math.

No doubt there are plenty of flawed proofs of false statements in areas too obscure for anyone to care about. But in mainstream areas of math, blunders are usually uncovered very quickly. And there are examples of theorems that were essentially correct but neglected some edge case. Proofs of statements that are just plain wrong are hard to think of. But Mark Dominus came up with a few.

Yesterday he gave an example of a statement by Kurt Gödel that was flat-out wrong but accepted for over 30 years. Warning: reader discretion advised. His post is not suitable for those who get queasy at the sight of symbolic logic.

A bigger clipboard

Imagine you find a paragraph on the web that you want to email to a friend. You copy the paragraph. Then you think you should send a link to the full article, so you copy that too. You start composing your email and you type Ctrl-V to paste in the paragraph, but to your disappointment you just paste the link. So you go back and copy the paragraph again.

The problem is that the Windows clipboard only holds the most recent thing you copied. Jeff Atwood posted an article a few days ago called Reinventing the Clipboard where he recommends a utility called ClipX that extends the clipboard. After you install ClipX, typing Ctrl-Shift-V brings up a little menu of recent clippings available for pasting.

I’ve been using ClipX for a few days now. It’s simple and unobtrusive. The only slight challenge at first is remembering that it’s available. Once you’ve thought to yourself a couple of times, “Oh wait, I don’t have to go back and copy that again,” you’re hooked.

Footnote to interruption post

In my post yesterday about interruptions I quoted Mary Czerwinski from Microsoft Research. She told me afterward that two of the applications mentioned in the interview have been released. They are publicly available from the Microsoft Research download site.

I haven’t had a chance to use either of these tools yet. If you try them out, let me know what you think.

Faith, hope, love, and marketing

Seth Godin has a post this morning about what he calls the three key marketing levers: fear, hope, and love. He concludes

The easiest way to build a brand is to sell fear. The best way, though, may be to deliver on hope while aiming for love …

People don’t want fear, they want faith. They want to buy something they can place their trust in to alleviate their fears. Replacing the term “fear” with “faith” makes the three levers more parallel, stating each in terms of positive aspiration. With this edit you could summarize Seth Godin’s marketing advice as follows.

Now abide faith, hope, and love. But the greatest of these is love.

I think I’ve read that somewhere before.

Rethinking interruptions

If you read a few personal productivity articles you’ll run into this advice: Interruptions are bad, so eliminate interruptions. That’s OK as far as it goes. But are interruptions necessarily bad? And when you are interrupted, what can you do to recover faster?

Not all interruptions are created equal. Paul Graham talks about this in his essay Holding a Program in One’s Head.

The danger of a distraction depends not on how long it is, but on how much it scrambles your brain. A programmer can leave the office and go and get a sandwich without losing the code in his head. But the wrong kind of interruption can wipe your brain in 30 seconds.

In her interview with Jon Udell, Mary Czerwinski points out that while interruptions are detrimental to the productivity of the person being interrupted, maybe 75% of the time interruptions are beneficial for the organization as a whole. If one person is stuck and another person can get them unstuck by answering a question, the productivity of the person asking the question may go up more than the productivity of the person being asked goes down.

Given that interruptions are good, or at least inevitable, how can you manage them? Czerwinski uses the phrase context reacquisition to describe getting back to your previous state of mind following an interruption. Czerwinski and others at Microsoft Research are looking at software for context reacquisition. For example, one of the ideas they are trying out is software that takes snapshots of your desktop. If you could see what your desktop looked like before the phone rang, it could help you get back into the frame of mind you had before you started helping the person on the other end of the line.

Have you discovered a tool or habit that helps with context reacquisition? If so, please leave a comment.

Task switching

If you’re working on three projects, you’re probably spending 40% of your time task switching. Task switching is the dark matter of life: there’s a lot of it, but we’re hardly aware of it.

I’m not talking about multitasking, such as replying to email while you’re on the phone. People are starting to realize that multitasking isn’t as productive as it seems. I’m talking about having multiple assignments at work.

John Maeda posted a note about multiple projects in which he gives a link to a PowerPoint slide graphing percentage of productive time as a function of the number of concurrent assignments. According to the graph, the optimal number of projects is two. With two projects, you can do something else when one project is stuck waiting for input or when you need variety. But any more than that and productivity tanks.

Johanna Rothman has an interview on the Pragmatic Programmer podcast where she discusses, among other things, having multiple concurrent projects. She thought it was absurd when she was asked to work 50% on one project, 30% on another, and 20% on another. Research environments are worse. Because of grant funding, people are sometimes allocated 37% to this project, 5% to that project, etc.

We’re not nearly as good at task switching as we think we are. I hear people talking about how it may take 15 or 30 minutes to get back into the flow after an interruption. Maybe that’s true if you were interrupted from something simple. A colleague who works on complex statistical problems says it takes her about two or three days to recover from switching projects. In his article Good and Bad Procrastination, Paul Graham says “You probably only have to interrupt someone a couple times a day before they’re unable to work on hard problems at all.”

Population drift

The goal of a clinical trial is to determine what treatment will be most effective in a given population. What if the population changes while you’re conducting your trial? Say you’re treating patients with Drug X and Drug Y; initially more patients respond to X, but later more respond to Y. Maybe you’re just seeing random fluctuation, but maybe things really are changing and the rug is being pulled out from under your feet.

Advances in disease detection could cause a trial to enroll more patients with early stage disease as the trial proceeds. Changes in the standard of care could also make a difference. Patients often enroll in a clinical trial because standard treatments have been ineffective. If the standard of care changes during a trial, the early patients might be resistant to one therapy while later patients are resistant to another therapy. Often population drift is slow compared to the duration of a trial and doesn’t affect your conclusions, but that is not always the case.

My interest in population drift comes from adaptive randomization. In an adaptive randomized trial, the probability of assigning patients to a treatment goes up as evidence accumulates in favor of that treatment. The goal of such a trial design is to assign more patients to the more effective treatments. But what if patient response changes over time? Could your efforts to assign the better treatments more often backfire? A trial could assign more patients to what was the better treatment rather than what is now the better treatment.
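Adaptive randomization comes in several flavors. As a minimal sketch, here is one common Bayesian version, a Thompson-sampling style rule with Beta posteriors on each arm’s response rate. The rule and the numbers are illustrative; they are not necessarily the design analyzed in the report mentioned below or in my simulations.

```python
import random

random.seed(1)

def adaptive_trial(true_rates, n_patients=200):
    """Assign each patient by Thompson sampling: draw a response rate from each
    arm's Beta posterior and treat the patient with the arm whose draw is larger."""
    successes = [0, 0]   # responses observed on each arm
    failures  = [0, 0]   # non-responses observed on each arm
    responders = 0

    for _ in range(n_patients):
        # Beta(1 + successes, 1 + failures) posterior for each arm's response rate
        draws = [random.betavariate(1 + successes[a], 1 + failures[a]) for a in (0, 1)]
        arm = draws.index(max(draws))                 # favor the arm that looks better so far
        responded = random.random() < true_rates[arm] # simulate this patient's outcome
        if responded:
            successes[arm] += 1
            responders += 1
        else:
            failures[arm] += 1

    return responders

# Arm 1 is truly better (50% vs 30% response); adaptive assignment should, on
# average, treat more patients effectively than a fixed 50/50 split would.
print(adaptive_trial(true_rates=[0.30, 0.50]))
```

The drift question is what happens when `true_rates` is not constant but changes as the trial enrolls patients.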

On average, adaptively randomized trials do treat more patients effectively than do equally randomized trials. The report Power and bias in adaptive randomized clinical trials shows this is the case in a wide variety of circumstances, but it assumes constant response rates, i.e. it does not address population drift.

I did some simulations to see whether adaptive randomization could do more harm than good. I looked at more extreme population drift than one is likely to see in practice in order to exaggerate any negative effect. I looked at gradual changes and sudden changes. In all my simulations, the adaptive randomization design treated more patients effectively on average than the comparable equal randomization design. I wrote up my results in The Effect of Population Drift on Adaptively Randomized Trials.
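For illustration only, “gradual” and “sudden” drift might be modeled as response-rate schedules like the ones below, which could be plugged into a simulation such as the sketch above. These particular curves are hypothetical, not the specific scenarios from the paper.

```python
def gradual_drift(t, n, start=0.30, end=0.50):
    """Response rate ramps linearly from `start` to `end` over the trial.
    t = index of the current patient, 0 .. n-1."""
    return start + (end - start) * t / (n - 1)

def sudden_drift(t, n, before=0.30, after=0.50):
    """Response rate jumps from `before` to `after` halfway through the trial."""
    return before if t < n // 2 else after
```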

Related: Adaptive clinical trial design