Between 1966 and 1995, there were forty-seven studies of acupuncture in China, Taiwan, and Japan, and every single trial concluded that acupuncture was an effective treatment. During the same period, there were ninety-four clinical trials of acupuncture in the United States, Sweden, and the U.K., and only fifty-six per cent of these studies found any therapeutic benefits.
Back in March I wrote a blog post asking whether gaining weight makes you taller. Weight and height are clearly associated, and from that data alone one might speculate that gaining weight could make you taller. Of course causation is in the other direction: becoming taller generally makes you gain weight.
In the 1980’s, cardiologists discovered that patients with irregular heart beats for the first 12 days following a heart attack were much more likely to die. Antiarrythmic drugs became standard therapy. But in the next decade cardiologist discovered this was a bad idea. According to Philip Devereaux, “The trial didn’t just show that the drugs weren’t saving lives, it showed they were actually killing people.”
David Freedman relates the story above in his book Wrong. Freedman says
In fact, notes Devereaux, the drugs killed more Americans than the Vietnam War did—roughly an average of forty thousand a year died from the drugs in the United States alone.
Cardiologists had good reason to suspect that antiarrythmic drugs would save lives. In retrospect, it may be that heart-attack patients with poor prognosis have arrhythmia rather than arrhythmia causing poor prognosis. Or it may be that the association is more complicated than either explanation.
How well can you predict height based on genetic markers?
A 2009 study came up with a technique for predicting the height of a person based on looking at the 54 genes found to be correlated with height in 5,748 people — and discovered the results were one-tenth as accurate as the 125–year-old technique of averaging the heights of both parents and adjusting for sex.
For highly heritable traits such as height, we conclude that in applications in which parental phenotypic information is available (eg, medicine), the Victorian Galton’s method will long stay unsurpassed, in terms of both discriminative accuracy and costs.
Jon Udell’s latest Interviews with Innovators podcast features Randall Julian of Indigo BioSystems. I found this episode particularly interesting because it deals with issues I have some experience with.
The problems in managing biological data begin with how to store the raw experiment data. As Julian says
… without buying into all the hype around semantic web and so on, you would argue that a flexible schema makes more sense in a knowledge gathering or knowledge generation context than a fixed schema does.
So you need something less rigid than a relational database and something with more structure than a set of Excel spreadsheets. That’s not easy, and I don’t know whether anyone has come up with an optimal solution yet. Julian said that he has seen many attempts to put vast amounts of biological data into a rigid relational database schema but hasn’t seen this approach succeed yet. My experience has been similar.
Representing raw experimental data isn’t enough. In fact, that’s the easy part. As Jon Udell comments during the interview
It’s easy to represent data. It’s hard to represent the experiment.
That is, the data must come with ample context to make sense of the data. Julian comments that without this context, the data may as well be a list of zip codes. And not only must you capture experimental context, you must describe the analysis done to the data. (See, for example, this post about researchers making up their own rules of probability.)
Julian comments on how electronic data management is not nearly as common as someone unfamiliar with medical informatics might expect.
So right now maybe 50% of the clinical trials in the world are done using electronic data capture technology. … that’s the thing that maybe people don’t understand about health care and the life sciences in general is that there is still a huge amount of paper out there.
Part of the reason for so much paper goes back to the belief that one must choose between highly normalized relational data stores and unstructured files. Given a choice between inflexible bureaucracy and chaos, many people choose chaos. It may work about as well, and it’s much cheaper to implement. I’ve seen both extremes. I’ve also been part of a project using a flexible but structured approach that worked quite well.
I recently ran across this quote from Mithat Gönen of Memorial Sloan-Kettering Cancer Center:
While there are certainly some at other centers, the bulk of applied Bayesian clinical trial design in this country is largely confined to a single zip code.
from “Bayesian clinical trials: no more excuses,” Clinical Trials 2009; 6; 203.
The zip code Gönen alludes to is 77030, the zip code of M. D. Anderson Cancer Center. I can’t say how much activity there is elsewhere, but certainly we design and conduct a lot of Bayesian clinical trials at MDACC.
Keith Baggerly and Kevin Coombes just wrote a paper about the analysis errors they commonly see in bioinformatics articles. From the abstract:
One theme that emerges is that the most common errors are simple (e.g. row or column offsets); conversely, it is our experience that the most simple errors are common.
The full title of the article by Keith Baggerly and Kevin Coombes is “Deriving chemosensitivity from cell lines: forensic bioinformatics and reproducible research in high-throughput biology.” The article will appear in the next issue of Annals of Applied Statistics and is available here. The key phrase in the title is forensic bioinformatics: reverse engineering statistical analysis of bioinformatics data. The authors give five case studies of data analyses that cannot be reproduced and infer what analysis actually was carried out.
One of the more egregious errors came from the creative application of probability. One paper uses innovative probability results such as
Here are some of my favorite posts from the Reproducible Ideas blog.
Three reasons to distrust microarray results Provenance in art and science Forensic bioinformatics (continued) Preserving (the memory of) documents Programming is understanding Musical chairs and reproducibility drills Taking your code out for a walk
The most popular and most controversial was the first in the list, reasons to distrust microarray results.
The emphasis shifts from science to software development as you go down the list, though science and software are intertwined throughout the posts.
Before I started working for a cancer center, I was not aware of the tension between science and medicine. Popular perception is that the two go together hand and glove, but that’s not always true.
Physicians are trained to use their subjective judgment and to be decisive. And for good reason: making a fairly good decision quickly is often better than making the best decision eventually. But scientists must be tentative, withhold judgment, and follow protocols.
Sometimes physician-scientists can reconcile their two roles, but sometimes they have to choose to wear one hat or the other at different times.
The physician-scientist tension is just one facet of the constant tension between treating each patient effectively and learning how to treat future patients more effectively. Sometimes the interests of current patients and future patients coincide completely, but not always.
This ethical tension is part of what makes biostatistics a separate field of statistics. In manufacturing, for example, you don’t need to balance the interests of current light bulbs and future light bulbs. If you need to destroy 1,000 light bulbs to find out how to make better bulbs in the future, no big deal. But different rules apply when experimenting on people. Clinical trials will often use statistical designs that sacrifice some statistical power in order to protect the people participating in the trial. Ethical constraints make biostatistics interesting.
Hardly anyone cares about statistics directly. People more often care about decisions they need to make with the help of statistics. This suggests that the statistics and decision-making process should be explicitly integrated. The name for this integrated approach is “decision theory.” Problems in decision theory are set up with the goal of maximizing “utility,” the benefit you expect to get from a decision. Equivalently, problems are set up to minimize expected cost. Cost may be a literal monetary cost, but it could be some other measure of something you want to avoid.
I was at a conference this morning where David Draper gave an excellent talk entitled Bayesian Decision Theory in Biostatistics: the Utility of Utility. Draper presented an example of selecting variables for a statistical model. But instead of just selecting the most important variables in a purely statistical sense, he factored in the cost of collecting each variable. So if two variables make nearly equal contributions to a model, for example, the procedure would give preference to the variable that is cheaper to collect. In short, Draper recommended a cost-benefit analysis rather than the typical (statistical) benefit-only analysis. Very reasonable.
Why don’t people always take this approach? One reason is that it’s hard to assign utilities to outcomes. Dollar costs are often easy to account for, but it can be much harder to assign values to benefits. For example, you have to ask “Benefit for whom?” In a medical context, do you want to maximize the benefit to patients? Doctors? Insurance companies? Tax payers? Regulators? Statisticians? If you want to maximize some combination of these factors, how do you weight the interests of the various parties?
Assigning utilities is hard work, and you can never make everyone happy. No matter how good of a job you do, someone will criticize you. Nearly everyone agrees in the abstract that considering utilities is the way to go, but in practice it is hardly ever done. Anyone who proposes a way to quantify utility is immediately shot down by people who have a better idea. The net result is that rather than using a reasonable but imperfect idea of utility, no utility is used at all. Or rather no explicit definition of utility is used. There is usually some implicit idea of utility, chosen for mathematical convenience, and that one wins by default. In general, people much prefer to leave utilities implicit.
In the Q&A after his talk, Draper said something to the effect that the status quo persists for a very good reason: thinking is hard work, and it opens you up to criticism.
Tomorrow morning I’m giving a talk on how to subject fewer patients to ineffective treatment in clinical trials. I should have used something like the title of this post as the title of my talk, but instead my talk is called “Clinical Trial Monitoring With Bayesian Hypothesis Testing.” Classic sales mistake: emphasizing features rather than benefits. But the talk is at a statistical conference, so maybe the feature-oriented title isn’t so bad.
Ethical concerns are the main consideration that makes biostatistics a separate branch of statistics. You can’t test experimental drugs on people the way you test experimental fertilizers on crops. In human trials, you want to stop the trial early if it looks like the experimental treatment is not as effective as a comparable established treatment, but you want to keep going if it looks like the new treatment might be better. You need to establish rules before the trial starts that quantify exactly what it means to look like a treatment is doing better or worse than another treatment. There are a lot of ways of doing this quantification, and some work better than others. Within its context (single-arm phase II trials with binary or time-to-event endpoints) the method I’m presenting stops ineffective trials sooner than the methods we compare it to while stopping no more often in situations where you’d want the trial to continue.
If you’re not familiar with statistics, this may sound strange. Why not always stop when a treatment is worse and never stop when it’s better? Because you never know with certainty that one treatment is better than another. The more patients you test, the more sure you can be of your decision, but some uncertainty always remains. So you face a trade-off between being more confident of your conclusion and experimenting on more patients. If you think a drug is bad, you don’t want to treat thousands more patients with it in order to be extra confident that it’s bad, so you stop. But you run the risk of shutting down a trial of a treatment that really is an improvement but by chance appeared to be worse at the time you made the decision to stop. Statistics is all about such trade-offs.
In a recent interview on the PowerScripting Podcast, Jeffrey Snover said that software versioning isn’t a problem, it’s a dilemma. The difference is that problems can be solved, but dilemmas can only be managed. No versioning system can do everything that everyone would like.
The same phenomena exists in biostatistics. As I’d mentioned in Galen and clinical trials, biostatistics is filled with dilemmas. There are problems along the way that can be solved, but fundamentally biostatistics manages dilemmas.
Here’s a quote from the Greek physician Galen (c. 130-210 A.D.)
All who drink of this remedy recover in a short time, except those whom it does not help, who all die. Therefore, it is obvious that it fails only in incurable cases.
Imagine a dialog between Galen and a modern statistician.
Stat: You say your new treatment is better than the previous one?
Stat: But more people died on the new treatment.
Galen: Those patients don’t count because they were incurable. They would have died anyway.
The problem with Galen’s line of reasoning is that it is not falsifiable: no experiment could disprove it. He could call any treatment superior by claiming that evidence against it doesn’t count. Still, Galen might have been right.
Now suppose our statistician has a long talk with Galen and tells him about modern statistical technique.
Galen: Can’t you look back at my notes and see whether there was something different about the patients who didn’t respond to the new treatment? There’s got to be some explanation. Maybe my new treatment isn’t better for everyone, but there must be a group for whom it’s better.
Stat: Well, that’s tricky business. Advocates call that “subset analysis.” Critics call it “data dredging.” The problem is that the more clever you are with generating after-the-fact explanations, the more likely you’ll come up with one that seems true but isn’t.
Galen: I’ll have to think about that one. What do you propose we do?
Stat: We’ll have to do a randomized experiment. When each patient arrives, we’ll flip a coin to decide whether to give them the old or the new treatment. That way we expect about the same number of incurable patients to receive each treatment.
Galen: But the new treatment is better. Why should I give half my patients the worse treatment?
Stat: We don’t really know that the new treatment is better. Maybe it’s not. A randomized experiment will give us more confidence one way or another.
Galen: But couldn’t we be unlucky and assign more incurable patients to the better treatment?
Stat: Yes, that’s possible. But it’s not likely we will assign too many more incurable patients to either treatment. That’s just a chance we’ll have to take.
The issues in these imaginary dialogs come up all the time. There are people who believe their treatment is superior despite evidence to the contrary. But sometimes they’re right. New treatments are often tested on patients with poor prognosis, so the complaints of receiving more incurable patients are justified. And yet until there’s some evidence that a new treatment may be at least as good as standard, it’s unethical to give that treatment to patients with better prognosis. Sometimes post-hoc analysis finds a smoking gun, and sometimes it’s data dredging. Sometimes randomized trials fail to balance on important patient characteristics. There are no simple answers. Context is critical, and dilemmas remain despite our best efforts. That’s what makes biostatistics interesting.
As with many interventions intended to prevent ill health, the effectiveness of parachutes has not been subjected to rigorous evaluation by using randomised controlled trials. Advocates of evidence based medicine have criticised the adoption of interventions evaluated by using only observational data. We think that everyone might benefit if the most radical protagonists of evidence based medicine organised and participated in a double blind, randomised, placebo controlled, crossover trial of the parachute.
You’ve got a new drug and it’s time to test it on patients. How much of the drug do you give? That’s the question dose-finding trials attempt to answer.
The typical dose-finding procedure starts by selecting a small number of dose levels, say four or five. The trial begins by giving the lowest dose to the first few patients, and there is some procedure for deciding when to try higher doses. Convention says it is unethical to start at any dose other than lowest dose. I will give several reasons to question convention.
Suppose you want to run a clinical trial to test the following four doses of Agent X: 10 mg, 20 mg, 30 mg, 50 mg. You want to start with 20 mg. Your trial goes for statistical review and the reviewer says your trial is unethical because you are not starting at the lowest dose. You revise your protocol saying you only want to test three doses: 20 mg, 30 mg, and 50 mg. Now suddenly it is perfectly ethical to start with a dose of 20 mg because it is the lowest dose.
The more difficult but more important question is whether a dose of 20 mg of Agent X is medically reasonable. The first patient in the trial does not care whether higher or lower doses will be tested later. He only cares about the one dose he’s about to receive. So rather than asking “Why are you starting at dose 2?” reviewers should ask “How did you come up with this list of doses to test?”
A variation of the start-at-the-lowest-dose rule is the rule to always start at “dose 1”. Suppose you revise the original protocol to say dose 1 is 20 mg, dose 2 is 30 mg, and dose 3 is 50 mg. The protocol also includes a “dose -1” of 10 mg. You explain that you do not intend to give dose -1, but have included it as a fallback in case the lowest dose (i.e. 20 mg) turns out to be too toxic. Now because you call 20 mg “dose 1” it is ethical to begin with that dose. You could even begin with 30 mg if you were to label the two smaller doses “dose -2” and “dose -1.” With this reasoning, it is ethical to start at any dose, as long as you call it “dose 1.” This approach is justified only if the label “dose 1” carries the implicit endorsement of an expert that it is a medically reasonable starting dose.
Part of the justification for starting at the lowest dose is that the earliest dose-finding methods would only search in one direction. This explains why some people still speak of “dose escalation” rather than “dose-finding.” More modern dose-finding methods can explore up and down a dose range.
The primary reason for starting at the lowest dose is fear of toxicity. But when treating life-threatening diseases, one could as easily justify starting at the highest dose for fear of under treatment. (Some trials do just that.) Depending on the context, it could be reasonable to start at the lowest, highest, or any dose in between.
The idea of first selecting a range of doses and then deciding where to start exploring seems backward. It makes more sense to first pick the starting dose, then decide what other doses to consider.