Subjecting fewer patients to ineffective treatments

Tomorrow morning I’m giving a talk on how to subject fewer patients to ineffective treatment in clinical trials. I should have used something like the title of this post as the title of my talk, but instead my talk is called “Clinical Trial Monitoring With Bayesian Hypothesis Testing.” Classic sales mistake: emphasizing features rather than benefits. But the talk is at a statistical conference, so maybe the feature-oriented title isn’t so bad.

Ethical concerns are the main consideration that makes biostatistics a separate branch of statistics. You can’t test experimental drugs on people the way you test experimental fertilizers on crops. In human trials, you want to stop the trial early if it looks like the experimental treatment is not as effective as a comparable established treatment, but you want to keep going if it looks like the new treatment might be better. You need to establish rules before the trial starts that quantify exactly what it means to look like a treatment is doing better or worse than another treatment. There are a lot of ways of doing this quantification, and some work better than others. Within its context (single-arm phase II trials with binary or time-to-event endpoints) the method I’m presenting stops ineffective trials sooner than the methods we compare it to while stopping no more often in situations where you’d want the trial to continue.

If you’re not familiar with statistics, this may sound strange. Why not always stop when a treatment is worse and never stop when it’s better? Because you never know with certainty that one treatment is better than another. The more patients you test, the more sure you can be of your decision, but some uncertainty always remains. So you face a trade-off between being more confident of your conclusion and experimenting on more patients. If you think a drug is bad, you don’t want to treat thousands more patients with it in order to be extra confident that it’s bad, so you stop. But you run the risk of shutting down a trial of a treatment that really is an improvement but by chance appeared to be worse at the time you made the decision to stop. Statistics is all about such trade-offs.
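The trade-off can be made concrete with a small calculation. The sketch below is not the method from the talk. It uses a simpler, generic Bayesian futility rule for a single-arm trial with a binary endpoint: stop when the posterior probability that the response rate beats a standard rate falls below a cutoff. The function names, the uniform prior, the standard rate of 20%, the 5% cutoff, and the patient limits are all assumptions chosen for illustration.

```python
from math import comb

def prob_rate_exceeds(successes, failures, p0):
    """P(theta > p0 | data) under an assumed uniform Beta(1, 1) prior.

    The posterior is Beta(a, b) with a = 1 + successes, b = 1 + failures.
    For integer a and b, P(theta <= p0) = P(Binomial(a + b - 1, p0) >= a),
    so P(theta > p0) = P(Binomial(a + b - 1, p0) <= a - 1).
    """
    a, b = 1 + successes, 1 + failures
    n = a + b - 1
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(a))

def stopping_probability(true_rate, p0=0.2, cutoff=0.05, min_n=10, max_n=40):
    """Exact chance the futility rule stops the trial at or before max_n patients."""
    # alive[s] = probability the trial is still running with s responses so far
    alive = {0: 1.0}
    stopped = 0.0
    for n in range(1, max_n + 1):
        # Treat one more patient: each running state gains a response or not.
        updated = {}
        for s, pr in alive.items():
            updated[s + 1] = updated.get(s + 1, 0.0) + pr * true_rate
            updated[s] = updated.get(s, 0.0) + pr * (1 - true_rate)
        # Apply the stopping rule once enough patients have been treated.
        alive = {}
        for s, pr in updated.items():
            if n >= min_n and prob_rate_exceeds(s, n - s, p0) < cutoff:
                stopped += pr
            else:
                alive[s] = pr
    return stopped

for rate in (0.1, 0.2, 0.4):
    print(f"true response rate {rate}: P(stop early) = {stopping_probability(rate):.3f}")
```

A treatment with a 10% true response rate is stopped far more often than one with a 40% rate, but the chance of stopping the good treatment is never exactly zero. Tightening the cutoff lowers that risk at the price of treating more patients with the bad one.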

Related: Adaptive clinical trial design

Problems versus dilemmas

In a recent interview on the PowerScripting Podcast, Jeffrey Snover said that software versioning isn’t a problem, it’s a dilemma. The difference is that problems can be solved, but dilemmas can only be managed. No versioning system can do everything that everyone would like.

The same phenomenon exists in biostatistics. As I mentioned in Galen and clinical trials, biostatistics is filled with dilemmas. There are problems along the way that can be solved, but fundamentally biostatistics manages dilemmas.

Galen and clinical trials

Here’s a quote from the Greek physician Galen (c. 130-210 A.D.):

All who drink of this remedy recover in a short time, except those whom it does not help, who all die. Therefore, it is obvious that it fails only in incurable cases.

Imagine a dialog between Galen and a modern statistician.

Stat: You say your new treatment is better than the previous one?

Galen: Yes.

Stat: But more people died on the new treatment.

Galen: Those patients don’t count because they were incurable. They would have died anyway.

The problem with Galen’s line of reasoning is that it is not falsifiable: no experiment could disprove it. He could call any treatment superior by claiming that evidence against it doesn’t count. Still, Galen might have been right.

Now suppose our statistician has a long talk with Galen and tells him about modern statistical technique.

Galen: Can’t you look back at my notes and see whether there was something different about the patients who didn’t respond to the new treatment? There’s got to be some explanation. Maybe my new treatment isn’t better for everyone, but there must be a group for whom it’s better.

Stat: Well, that’s tricky business. Advocates call that “subset analysis.” Critics call it “data dredging.” The problem is that the more clever you are with generating after-the-fact explanations, the more likely you’ll come up with one that seems true but isn’t.

Galen: I’ll have to think about that one. What do you propose we do?

Stat: We’ll have to do a randomized experiment. When each patient arrives, we’ll flip a coin to decide whether to give them the old or the new treatment. That way we expect about the same number of incurable patients to receive each treatment.

Galen: But the new treatment is better. Why should I give half my patients the worse treatment?

Stat: We don’t really know that the new treatment is better. Maybe it’s not. A randomized experiment will give us more confidence one way or another.

Galen: But couldn’t we be unlucky and assign more incurable patients to the better treatment?

Stat: Yes, that’s possible. But it’s not likely we will assign too many more incurable patients to either treatment. That’s just a chance we’ll have to take.
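The statistician's last claim can be checked with a short calculation. The sketch below assumes each incurable patient is assigned by an independent fair coin flip and asks how likely it is that one arm ends up with many more of them than the other. The function name and the example numbers are illustrative assumptions.

```python
from math import comb

def prob_imbalance_at_least(num_incurable, gap):
    """P(|new - old| >= gap) when each incurable patient is assigned by a fair coin."""
    favorable = 0
    for new in range(num_incurable + 1):
        old = num_incurable - new
        if abs(new - old) >= gap:
            # Number of coin-flip sequences that put exactly `new` patients on the new arm
            favorable += comb(num_incurable, new)
    return favorable / 2**num_incurable

# Hypothetical example: 20 incurable patients, imbalance of 10 or more (15 vs 5 or worse)
print(f"{prob_imbalance_at_least(20, 10):.3f}")
```

With 20 incurable patients, a split as lopsided as 15 to 5 happens about 4% of the time. Unlikely, as the statistician says, but not impossible, which is Galen's point.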

The issues in these imaginary dialogs come up all the time. There are people who believe their treatment is superior despite evidence to the contrary. But sometimes they’re right. New treatments are often tested on patients with poor prognosis, so the complaints of receiving more incurable patients are justified. And yet until there’s some evidence that a new treatment may be at least as good as standard, it’s unethical to give that treatment to patients with better prognosis. Sometimes post-hoc analysis finds a smoking gun, and sometimes it’s data dredging. Sometimes randomized trials fail to balance on important patient characteristics. There are no simple answers. Context is critical, and dilemmas remain despite our best efforts. That’s what makes biostatistics interesting.

Randomized trials of parachute use

It is widely assumed that parachute use improves your chances of surviving a leap from an airplane. However, a meta-analysis suggests this practice is not adequately supported by controlled experiments. See the article Parachute use to prevent death and major trauma related to gravitational challenge: systematic review of randomized controlled trials by Gordon C S Smith and Jill P Pell. The authors summarize their conclusions in the abstract.

As with many interventions intended to prevent ill health, the effectiveness of parachutes has not been subjected to rigorous evaluation by using randomised controlled trials. Advocates of evidence based medicine have criticised the adoption of interventions evaluated by using only observational data. We think that everyone might benefit if the most radical protagonists of evidence based medicine organised and participated in a double blind, randomised, placebo controlled, crossover trial of the parachute.

Dose-finding: why start at the lowest dose?

You’ve got a new drug and it’s time to test it on patients. How much of the drug do you give? That’s the question dose-finding trials attempt to answer.

The typical dose-finding procedure starts by selecting a small number of dose levels, say four or five. The trial begins by giving the lowest dose to the first few patients, and there is some procedure for deciding when to try higher doses. Convention says it is unethical to start at any dose other than the lowest dose. I will give several reasons to question convention.

Suppose you want to run a clinical trial to test the following four doses of Agent X: 10 mg, 20 mg, 30 mg, 50 mg. You want to start with 20 mg. Your trial goes for statistical review and the reviewer says your trial is unethical because you are not starting at the lowest dose. You revise your protocol saying you only want to test three doses: 20 mg, 30 mg, and 50 mg. Now suddenly it is perfectly ethical to start with a dose of 20 mg because it is the lowest dose.

The more difficult but more important question is whether a dose of 20 mg of Agent X is medically reasonable. The first patient in the trial does not care whether higher or lower doses will be tested later. He only cares about the one dose he’s about to receive. So rather than asking “Why are you starting at dose 2?” reviewers should ask “How did you come up with this list of doses to test?”

A variation of the start-at-the-lowest-dose rule is the rule to always start at “dose 1”. Suppose you revise the original protocol to say dose 1 is 20 mg, dose 2 is 30 mg, and dose 3 is 50 mg. The protocol also includes a “dose -1” of 10 mg. You explain that you do not intend to give dose -1, but have included it as a fallback in case the lowest dose (i.e. 20 mg) turns out to be too toxic. Now because you call 20 mg “dose 1” it is ethical to begin with that dose. You could even begin with 30 mg if you were to label the two smaller doses “dose -2” and “dose -1.” With this reasoning, it is ethical to start at any dose, as long as you call it “dose 1.” This approach is justified only if the label “dose 1” carries the implicit endorsement of an expert that it is a medically reasonable starting dose.

Part of the justification for starting at the lowest dose is that the earliest dose-finding methods would only search in one direction. This explains why some people still speak of “dose escalation” rather than “dose-finding.” More modern dose-finding methods can explore up and down a dose range.

The primary reason for starting at the lowest dose is fear of toxicity. But when treating life-threatening diseases, one could as easily justify starting at the highest dose for fear of undertreatment. (Some trials do just that.) Depending on the context, it could be reasonable to start at the lowest, highest, or any dose in between.

The idea of first selecting a range of doses and then deciding where to start exploring seems backward. It makes more sense to first pick the starting dose, then decide what other doses to consider.

False positives for medical papers

My previous two posts have been about false research conclusions and false positives in medical tests. The two are closely related.

With medical testing, the prevalence of the disease in the population at large matters greatly when deciding how much credibility to give a positive test result. Clinical studies are similar. The proportion of potential genuine improvements in the class of treatments being tested is an important factor in deciding how credible a conclusion is.

In medical tests and clinical studies, we’re often given the opposite of what we want to know. We’re given the probability of the evidence given the conclusion, but we want to know the probability of the conclusion given the evidence. These two probabilities may be similar, or they may be very different.
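Here is a minimal sketch of how the two probabilities can differ for clinical studies. It treats a "significant" study like a positive test, with the proportion of genuine improvements playing the role of disease prevalence. The function name, the 80% power, and the 5% significance level are assumptions for illustration, not numbers from any particular study.

```python
def prob_true_given_significant(prior, power=0.8, alpha=0.05):
    """P(treatment really works | study was significant), by Bayes' theorem.

    prior: assumed fraction of tested treatments that are genuine improvements
    power: assumed P(significant | treatment works)
    alpha: assumed P(significant | treatment does not work)
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)

for prior in (0.5, 0.1, 0.01):
    print(f"prior {prior}: P(works | significant) = {prob_true_given_significant(prior):.2f}")
```

Under these assumed inputs, the probability of the evidence given the conclusion is the same 80% in every row, while the probability of the conclusion given the evidence falls from about 94% to 64% to about 14% as genuine improvements become rarer.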

The analogy between false positives in medical testing and false positives in clinical studies is helpful, because the former is easier to understand than the latter. But the problem of false conclusions in clinical studies is more complicated. For one thing, there is no publication bias in medical tests: patients get the results, whether positive or negative. In research, negative results are usually not published.

False positives for medical tests

The most commonly given example of Bayes’ theorem is testing for rare diseases. The results are not intuitive. If a disease is rare, then your probability of having the disease given a positive test result remains low. For example, suppose a disease affects 0.1% of the population and a test for the disease is 95% accurate. Then your probability of having the disease given that you test positive is only about 2%.
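Counting people rather than manipulating probabilities makes the 2% figure easier to see. The sketch below assumes "95% accurate" means the test is right 95% of the time for both sick and healthy people (95% sensitivity and 95% specificity). The function name and the population of 100,000 are illustrative choices.

```python
def positive_test_breakdown(population, prevalence, sensitivity, specificity):
    """Return the expected numbers of true and false positives in a population."""
    sick = population * prevalence
    healthy = population - sick
    true_positives = sick * sensitivity            # sick people who test positive
    false_positives = healthy * (1 - specificity)  # healthy people who test positive
    return true_positives, false_positives

tp, fp = positive_test_breakdown(100_000, 0.001, 0.95, 0.95)
prob_sick = tp / (tp + fp)
print(f"{tp:.0f} true positives, {fp:.0f} false positives")
print(f"P(disease | positive) = {prob_sick:.3f}")
```

Out of 100,000 people, about 95 of the 100 who are sick test positive, but so do about 4,995 of the 99,900 who are healthy. The sick are a small minority of the positives: 95 out of 5,090, or about 1.9%.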

Textbooks typically rush through the medical testing example, though I believe it takes more detail and more numeric examples for it to sink in. I know I didn’t really get it the first couple of times I saw it presented.

I just posted an article that goes over the medical testing example slowly and in detail: Canonical example of Bayes’ theorem in detail. I take what may be rushed through in half a page of a textbook and expand it to six pages, and I use more numbers and graphs than equations. It’s worth going over this example slowly because once you understand it, you’re well on your way to understanding Bayes’ theorem.

Most published research results are false

John Ioannidis wrote an article in Chance magazine a couple of years ago with the provocative title Why Most Published Research Findings are False. [Update: Here’s a link to the PLoS article reprinted by Chance. And here are some notes on the details of the paper.] Are published results really that bad? If so, what’s going wrong?

Whether “most” published results are false depends on context, but a large percentage of published results are indeed false. Ioannidis published a report in JAMA looking at some of the most highly-cited studies from the most prestigious journals. Of the studies he considered, 32% were found to have either incorrect or exaggerated results. Of those studies with a 0.05 p-value, 74% were incorrect.

The underlying causes of the high false-positive rate are subtle, but one problem is the pervasive use of p-values as measures of evidence.

Folklore has it that a “p-value” is the probability that a study’s conclusion is wrong, and so a 0.05 p-value would mean the researcher should be 95 percent sure that the results are correct. In this case, folklore is absolutely wrong. And yet most journals accept a p-value of 0.05 or smaller as sufficient evidence.

Here’s an example that shows how p-values can be misleading. Suppose you have 1,000 totally ineffective drugs to test. About 1 out of every 20 trials will produce a p-value of 0.05 or smaller by chance, so about 50 trials out of the 1,000 will have a “significant” result. Suppose only those studies publish their results. The error rate in the lab was indeed 5%, but the error rate in the literature coming out of the lab is 100%!
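The example is easy to simulate. The sketch below relies on one assumption beyond the text: when a drug truly has no effect, the p-value of a well-behaved test is uniformly distributed between 0 and 1, so each trial can be represented by a single uniform random draw. The function name and the seed are arbitrary.

```python
import random

def simulate_lab(num_drugs=1000, alpha=0.05, seed=1):
    """Simulate a lab that tests only ineffective drugs and publishes significant results."""
    rng = random.Random(seed)
    effective = [False] * num_drugs  # every drug is a dud in this example
    # Assumption: under the null hypothesis, p-values are uniform on (0, 1).
    p_values = [rng.random() for _ in range(num_drugs)]
    # Only trials with p <= alpha are published.
    published = [works for works, p in zip(effective, p_values) if p <= alpha]
    lab_error_rate = len(published) / num_drugs
    literature_error_rate = published.count(False) / len(published) if published else 0.0
    return lab_error_rate, literature_error_rate

lab, literature = simulate_lab()
print(f"error rate in the lab:        {lab:.1%}")
print(f"error rate in the literature: {literature:.0%}")
```

The lab's error rate hovers around 5% from run to run, while every published result is wrong, because publication selected exactly the trials that erred.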

The example above is exaggerated, but look at the JAMA study results again. In a sample of real medical experiments, 32% of those with “significant” results were wrong. And among those that just barely showed significance, 74% were wrong.

See Jim Berger’s criticisms of p-values for more technical depth.

Population drift

The goal of a clinical trial is to determine what treatment will be most effective in a given population. What if the population changes while you’re conducting your trial? Say you’re treating patients with Drug X and Drug Y, and initially more patients were responding to X, but later more responded to Y. Maybe you’re just seeing random fluctuation, but maybe things really are changing and the rug is being pulled out from under your feet.

Advances in disease detection could cause a trial to enroll more patients with early stage disease as the trial proceeds. Changes in the standard of care could also make a difference. Patients often enroll in a clinical trial because standard treatments have been ineffective. If the standard of care changes during a trial, the early patients might be resistant to one therapy while later patients are resistant to another therapy. Often population drift is slow compared to the duration of a trial and doesn’t affect your conclusions, but that is not always the case.

My interest in population drift comes from adaptive randomization. In an adaptive randomized trial, the probability of assigning patients to a treatment goes up as evidence accumulates in favor of that treatment. The goal of such a trial design is to assign more patients to the more effective treatments. But what if patient response changes over time? Could your efforts to assign the better treatments more often backfire? A trial could assign more patients to what was the better treatment rather than what is now the better treatment.

On average, adaptively randomized trials do treat more patients effectively than do equally randomized trials. The report Power and bias in adaptive randomized clinical trials shows this is the case in a wide variety of circumstances, but it assumes constant response rates, i.e. it does not address population drift.

I did some simulations to see whether adaptive randomization could do more harm than good. I looked at more extreme population drift than one is likely to see in practice in order to exaggerate any negative effect. I looked at gradual changes and sudden changes. In all my simulations, the adaptive randomization design treated more patients effectively on average than the comparable equal randomization design. I wrote up my results in The Effect of Population Drift on Adaptively Randomized Trials.
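For readers who want to experiment, here is a rough sketch of this kind of simulation. It is not the design or the code behind the report. It assumes a two-arm trial with binary responses and uses Thompson sampling with uniform priors as a stand-in for adaptive randomization, which assigns each arm with probability equal to the posterior probability that it is the better one. The function names, response rates, trial size, and drift pattern are all hypothetical.

```python
import random

def run_trial(rates_at, n_patients, adaptive, rng):
    """Return the number of responders in one simulated two-arm trial.

    rates_at(i) gives the (arm 0, arm 1) response probabilities for patient i,
    which lets the caller model population drift.
    """
    successes = [0, 0]
    failures = [0, 0]
    responders = 0
    for i in range(n_patients):
        if adaptive:
            # Thompson sampling: draw from each arm's Beta posterior and
            # assign the patient to the arm with the larger draw.
            draws = [rng.betavariate(1 + successes[a], 1 + failures[a]) for a in (0, 1)]
            arm = 0 if draws[0] > draws[1] else 1
        else:
            arm = rng.randrange(2)  # equal randomization
        if rng.random() < rates_at(i)[arm]:
            successes[arm] += 1
            responders += 1
        else:
            failures[arm] += 1
    return responders

def mean_responders(rates_at, adaptive, n_patients=100, reps=500, seed=0):
    """Average number of responders per trial over many simulated trials."""
    rng = random.Random(seed)
    return sum(run_trial(rates_at, n_patients, adaptive, rng) for _ in range(reps)) / reps

def constant_rates(i):
    return (0.3, 0.5)

def sudden_drift(i):
    # Hypothetical extreme drift: the better arm flips halfway through the trial.
    return (0.3, 0.5) if i < 50 else (0.5, 0.3)

for name, scenario in (("constant", constant_rates), ("sudden drift", sudden_drift)):
    equal = mean_responders(scenario, adaptive=False)
    adapt = mean_responders(scenario, adaptive=True)
    print(f"{name}: equal {equal:.1f}, adaptive {adapt:.1f} responders per 100 patients")
```

Changing `sudden_drift` to a gradual schedule, or varying how early the flip occurs, gives a feel for when adaptation helps most. Results from this toy version need not match those in the report, since the allocation rule and drift scenarios differ.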
