# Bayesian adaptive clinical trials: promise and pitfalls

This afternoon I’m giving a talk at the Houston INFORMS chapter entitled “Bayesian adaptive clinical trials: promise and pitfalls.”

When I started working in adaptive clinical trials, I was very excited about the potential of such methods. The clinical trial methods most commonly used are very crude, and there’s plenty of room for improvement.

Over time I became concerned about overly complex methods, methods which were good for academic publication but may not be best for patients. Such methods are extremely time-consuming to develop and may not perform as well in practice as simpler methods.

There’s a great deal of opportunity between the extremes, methods that are more sophisticated than the status quo without being unnecessarily complex.

# Frequentist properties of Bayesian methods

Bayesian methods for designing clinical trials have become more common, and yet these Bayesian designs are almost always evaluated by frequentist criteria. For example, a trial may be designed to stop early 95% of the time under some bad scenario and stop no more than 20% of the time under some good scenario.

These criteria are arbitrary, since the “good” and “bad” scenarios are arbitrary, and because the stopping probability requirements of 95% and 20% are arbitrary. Still, there’s an idea lurking in the background that in every design there must be something that is shown to happen no more than 5% of the time.

It takes a great deal of effort to design Bayesian methods with desired frequentist properties. It’s an inverse problem, searching for the parameters in a high-dimensional design space, usually via lengthy simulation, that cause the method to satisfy some criteria. Of course frequentist methods satisfy frequentist criteria by design and so meet these criteria with far less effort. It’s rare to see the tables turned, evaluating frequentist methods by Bayesian criteria.
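To make the simulation step concrete, here is a minimal sketch of estimating the frequentist operating characteristics of one simple Bayesian rule. Everything specific in it is an assumption for illustration, not a design from the talk: a single-arm trial with binary response, a uniform prior, a futility rule that stops when the posterior probability of beating a 30% target rate drops below 5%, at most 40 patients, and “bad” and “good” scenarios with true response rates of 10% and 40%.

```python
import random
from math import comb

def prob_exceeds(a, b, x):
    """P(theta > x) for theta ~ Beta(a, b) with integer a, b >= 1.

    Uses the identity between the beta CDF and a binomial tail.
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x)**(n - j) for j in range(a))

def stops_early(true_rate, rng, max_n=40, target=0.3, cutoff=0.05):
    """Simulate one trial; return True if the (hypothetical) futility rule fires."""
    successes = 0
    for n in range(1, max_n + 1):
        successes += rng.random() < true_rate
        # Uniform prior, so the posterior is Beta(1 + successes, 1 + failures).
        if prob_exceeds(1 + successes, 1 + n - successes, target) < cutoff:
            return True
    return False

def stopping_probability(true_rate, n_sims=1000, seed=0):
    """Estimate how often the trial stops early under a given true rate."""
    rng = random.Random(seed)
    return sum(stops_early(true_rate, rng) for _ in range(n_sims)) / n_sims

bad = stopping_probability(0.1)    # assumed "bad" scenario
good = stopping_probability(0.4)   # assumed "good" scenario
print(f"stop early: bad {bad:.2f}, good {good:.2f}")
```

Adjusting `cutoff`, `max_n`, and the prior until `bad` and `good` land on targets like 95% and 20% is the inverse problem described above, and real designs have many more knobs than this one.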

Sometimes the effort to beat frequentist designs at their own game is futile because the frequentist designs are optimal by their own criteria. More often, however, the Bayesian and frequentist methods being compared are not direct competitors but only analogs. The aim in this case is to match the frequentist method’s operating characteristics by one criterion while doing better by a new criterion.

Sometimes a Bayesian method can be shown to have better frequentist operating characteristics than its frequentist counterpart. This puts dogmatic frequentists in the awkward position of admitting that what they see as an unjustified approach to statistics has nevertheless produced a superior product. Some anti-Bayesians are fine with this, happy to have a procedure with better frequentist properties, even though it happened to be discovered via a process they view as illegitimate.

Related post: Bayesian clinical trials in one zip code

# Skin in the game for observational studies

The article Deming, data and observational studies by S. Stanley Young and Alan Karr opens with

Any claim coming from an observational study is most likely to be wrong.

They back up this assertion with data about observational studies later contradicted by prospective studies.

Much has been said lately about the assertion that most published results are false, particularly observational studies in medicine, and I won’t rehash that discussion here. Instead I want to cut to the process Young and Karr propose for improving the quality of observational studies. They summarize their proposal as follows.

The main technical idea is to split the data into two data sets, a modelling data set and a holdout data set. The main operational idea is to require the journal to accept or reject the paper based on an analysis of the modelling data set without knowing the results of applying the methods used for the modelling set on the holdout set and to publish an addendum to the paper giving the results of the analysis of the holdout set.

They then describe an eight-step process in detail. One step is that cleaning the data and dividing it into a modelling set and a holdout set would be done by different people than the modelling and analysis. They then explain why this would lead to more truthful publications.
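The technical half of the proposal is easy to sketch. The snippet below is only an illustration of the split itself: the 50/50 fraction, the fixed seed, and the function name are my assumptions, not details taken from the article.

```python
import random

def split_modelling_holdout(records, holdout_fraction=0.5, seed=2011):
    """Shuffle records and return (modelling, holdout) lists.

    holdout_fraction and seed are hypothetical choices for this sketch.
    """
    rng = random.Random(seed)
    shuffled = list(records)          # copy so the caller's data is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]

records = list(range(100))            # stand-in for cleaned study records
modelling, holdout = split_modelling_holdout(records)
print(len(modelling), len(holdout))
```

The point of having someone other than the analyst run this step is that the analyst never sees `holdout` until the paper has been accepted or rejected.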

The holdout set is the key. Both the author and the journal know there is a sword of Damocles over their heads. Both stand to be embarrassed if the holdout set does not support the original claims of the author.

* * *

The full title of the article is Deming, data and observational studies: A process out of control and needing fixing. It appeared in the September 2011 issue of Significance.

Update: The article can be found here.

# Clinical trial software

This week’s resource post lists some of the projects I managed or contributed to while working at MD Anderson Cancer Center in biostatistics.

If you’d like help with the above software or would like help with clinical trial design, please contact me.

Last week’s resource post: Stand-alone numerical code

# Finding the best dose

In a dose-finding clinical trial, you have a small number of doses to test, and you hope to find the one with the best response. Here “best” may mean most effective, least toxic, closest to a target toxicity, some combination of criteria, etc.

Since your goal is to find the best dose, it seems natural to compare dose-finding methods by how often they find the best dose.  This is what is most often done in the clinical trial literature. But this seemingly natural criterion is actually artificial.

Suppose a trial is testing doses of 100, 200, 300, and 400 milligrams of some new drug. Suppose further that on some scale of goodness, these doses rank 0.1, 0.2, 0.5, and 0.51. (Of course these goodness scores are unknown; the point of the trial is to estimate them. But you might make up some values for simulation, pretending with half your brain that these are the true values and pretending with the other half that you don’t know what they are.)

Now suppose you’re evaluating two clinical trial designs, running simulations to see how each performs. The first design picks the 400 mg dose, the best dose, 20% of the time and picks the 300 mg dose, the second best dose, 50% of the time. The second design picks each dose with equal probability. The latter design picks the best dose more often, but it picks a good dose less often.

In this scenario, the two largest doses are essentially equally good; it hardly matters how often a method distinguishes between them. The first method picks one of the two good doses 70% of the time while the second method picks one of the two good doses only 50% of the time.

This example was exaggerated to make a point: obviously it doesn’t matter how often a method can pick the better of two very similar doses, not when it very often picks a bad dose. But there are less obvious situations that are quantitatively different but qualitatively the same.

The goal is actually to find a good dose. Finding the absolute best dose is impossible. The most you could hope for is that a method finds with high probability the best of the four arbitrarily chosen doses under consideration. Maybe the best dose is 350 mg, 843 mg, or some other dose not under consideration.

A simple way to make evaluating dose-finding methods less arbitrary would be to estimate the benefit to patients. Finding the best dose is only a matter of curiosity in itself unless you consider how that information is used. Knowing the best dose is important because you want to treat future patients as effectively as you can. (And patients in the trial itself as well, if it is an adaptive trial.)

Suppose the measure of goodness in the scenario above is probability of successful treatment and that 1,000 patients will be treated at the dose level picked by the trial. Under the first design, there’s a 20% chance that 51% of the future patients will be treated successfully, and a 50% chance that 50% will be. The expected number of successful treatments from the two best doses is 352. Under the second design, the corresponding number is 252.5.

(To simplify the example above, I didn’t say how often the first design picks each of the two lowest doses. But the first design will result in at least 382 expected successes and the second design 327.5.)
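The arithmetic can be checked in a few lines. This sketch uses only the numbers from the example; the one added assumption is the worst case for the first design, where the remaining 30% of its picks all go to the 100 mg dose.

```python
# Success probabilities by dose (mg), as made up in the example above.
goodness = {100: 0.10, 200: 0.20, 300: 0.50, 400: 0.51}
horizon = 1000   # future patients treated at the selected dose

def expected_successes(pick_probs):
    """Expected successes given the probability of selecting each dose."""
    return horizon * sum(prob * goodness[dose] for dose, prob in pick_probs.items())

first_top_two = expected_successes({400: 0.20, 300: 0.50})
second_top_two = expected_successes({400: 0.25, 300: 0.25})

# Worst case for the first design: the other 30% of picks all land on 100 mg.
first_lower_bound = expected_successes({400: 0.20, 300: 0.50, 100: 0.30})
second_total = expected_successes({dose: 0.25 for dose in goodness})

print(first_top_two, second_top_two, first_lower_bound, second_total)
```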

You never know how many future patients will be treated according to the outcome of a clinical trial, but there must be some implicit estimate. If this estimate is zero, the trial is not worth conducting. In the example given here, the estimate of 1,000 future patients is irrelevant: the future patient horizon cancels out in a comparison of the two methods. The patient horizon matters when you want to include the benefit to patients in the trial itself. The patient horizon serves as a way to weigh the interests of current versus future patients, an ethically difficult comparison usually left implicit.

Dose-finding trials of chemotherapy agents look for the MTD: maximum tolerated dose. The idea is to give patients as much chemotherapy as they can tolerate, hoping to do maximum damage to tumors without doing too much damage to patients.

But “maximum tolerated dose” implies a degree of personalization that rarely exists in clinical trials. Phase I chemotherapy trials don’t try to find the maximum dose that any particular patient can tolerate. They try to find a dose that is toxic to a certain percentage of the trial participants, say 30%. (This rate may seem high, but it’s typical. It’s not far from the toxicity rate implicit in the so-called 3+3 rule or from the explicit rate given in many CRM (“continual reassessment method”) designs.)

It’s tempting to think of “30% toxicity rate” as meaning that each patient experiences a 30% toxic reaction. But that’s not what it means. It means that each patient has a 30% chance of a toxicity, however toxicity is defined in a particular trial. If toxicity were defined as kidney failure, for example, then 30% toxicity rate means that each patient has a 30% probability of kidney failure, not that they should expect a 30% reduction in kidney function.


# New development in cancer research scandal

My interest in the Anil Potti scandal started when my former colleagues could not reproduce the analysis in one of Potti’s papers. (Actually, they did reproduce the analysis, at great effort, in the sense of forensically determining the erroneous steps that were carried out.) Two years ago, the story was on 60 Minutes. The straw that broke the camel’s back was not bad science but résumé padding.

It looks like the story is a matter of fraud rather than sloppiness. This is unfortunate because sloppiness is much more pervasive than fraud, and this could have made a great case study of bad analysis. However, one could look at it as a case study in how good analysis (by the folks at MD Anderson) can uncover fraud.

Now there’s a new development in the Potti saga. The latest issue of The Cancer Letter contains letters by whistle-blower Bradford Perez who warned officials at Duke about problems with Potti’s research.

# Robust in one sense, sensitive in another

When you sort data and look at which sample falls in a particular position, that’s called order statistics. For example, you might want to know the smallest, largest, or middle value.

Order statistics are robust in a sense. The median of a sample, for example, is a very robust measure of central tendency. If Bill Gates walks into a room with a large number of people, the mean wealth jumps tremendously but the median hardly budges.

But order statistics are not robust in this sense: the identity of the sample in any given position can be very sensitive to perturbation. Suppose a room has an odd number of people so that someone has the median wealth. When Bill Gates and Warren Buffett walk into the room later, the value of the median wealth may not change much, but the person corresponding to that median will change.
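Here is a small illustration of both senses of robustness. The names and net worths are invented for the example.

```python
from statistics import mean

# Hypothetical net worths in dollars; all values are made up.
room = {"Ann": 40_000, "Bob": 55_000, "Cat": 60_000, "Dan": 75_000, "Eve": 90_000}

def median_person(wealth):
    """Return (name, value) at the median position; assumes an odd head count."""
    ranked = sorted(wealth.items(), key=lambda item: item[1])
    return ranked[len(ranked) // 2]

mean_before = mean(room.values())
before = median_person(room)

# Two billionaires walk in (round illustrative figures).
room.update({"Gates": 100e9, "Buffett": 90e9})
mean_after = mean(room.values())
after = median_person(room)

print(before, after)            # the median value moves modestly, the person changes
print(mean_before, mean_after)  # the mean jumps by orders of magnitude
```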

One way to evaluate machine learning algorithms is by how often they pick the right winner in some sense. For example, dose-finding algorithms are often evaluated on how often they pick the best dose from a set of doses being tested. This can be a terrible criterion, causing researchers to be misled by a particular set of simulation scenarios. It’s more important how often an algorithm makes a good choice than how often it makes the best choice.

Suppose five drugs are being tested. Two are nearly equally effective, and three are much less effective. A good experimental design will lead to picking one of the two good drugs most of the time. But if the best drug is only slightly better than the next best, it’s too much to expect any design to pick the best drug with high probability. In this case it’s better to measure the expected utility of a decision rather than how often a design makes the best decision.

# A priori overfitting

The term overfitting usually describes fitting too complex a model to available data. But it is possible to overfit a model before there are any data.

An experimental design, such as a clinical trial, proposes some model to describe the data that will be collected. For simple, well-known models the behavior of the design may be known analytically. For more complex or novel methods, the behavior is evaluated via simulation.

If an experimental design makes strong assumptions about data, and is then simulated with scenarios that follow those assumptions, the design should work well. So designs must be evaluated using scenarios that do not exactly follow the model assumptions. Here lies a dilemma: how far should scenarios deviate from model assumptions? If they do not deviate at all, you don’t have a fair evaluation. But deviating too far is unreasonable as well: no method can be expected to work well when its assumptions are flagrantly violated.

With complex designs, it may not be clear to what extent scenarios deviate from modeling assumptions. The method may be robust to some kinds of deviations but not to others. Simulation scenarios for complex designs are samples from a high dimensional space, and it is impossible to adequately explore a high dimensional space with a small number of points. Even if these scenarios were chosen at random—which would be an improvement over manually selecting scenarios that present a method in the best light—how do you specify a probability distribution on the scenarios? You’re back to a variation on the previous problem.

Once you have the data in hand, you can try a complex model and see how well it fits. But with experimental design, the model is determined before there are any data, and thus there is no possibility of rejecting the model for being a poor fit. You might decide after it’s too late, after the data have been collected, that the model was a poor fit. However, retrospective model criticism is complicated for adaptive experimental designs because the model influenced which data were collected.

This is especially a problem for one-of-a-kind experimental designs. When evaluating experimental designs (not the data in the experiment but the experimental design itself), each experiment is one data point. With only one data point, it’s hard to criticize a design. This means we must rely on simulation, where it is possible to obtain many data points. However, this brings us back to the arbitrary choice of simulation scenarios. In this case there are no empirical data to test the model assumptions.


# Probability of long runs

Suppose you’ve written a program that randomly assigns test subjects to one of two treatments, A or B, with equal probability. The researcher using your program calls you to tell you that your software is broken because it has assigned treatment A to seven subjects in a row.

You might argue that the probability of seven A’s in a row is 1/2^7 or about 0.008. Not impossible, but pretty small. Maybe the software is broken.

But this line of reasoning grossly underestimates the probability of a run of 7 identical assignments. If someone asked the probability that the next 7 assignments would all be A’s, then 1/2^7 would be the right answer. But that’s not the same as asking whether an experiment is likely to see a run of length 7 because the run could start any time, not just on the next assignment. Also, the phone didn’t ring out of the blue: it rang precisely because there had just been a run.
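The difference between the two questions can be computed exactly with a small dynamic program. This is a sketch of my own, not something from the reference cited later in this post, and the trial size of 200 is just a convenient example.

```python
def prob_run_at_least(m, n, p=0.5):
    """Exact P(some run of m or more A's occurs in n independent assignments).

    p is the probability of assigning A on any one assignment.
    """
    # state[k] = P(no run of length m so far, and the current run of A's has length k)
    state = [1.0] + [0.0] * (m - 1)
    for _ in range(n):
        new = [0.0] * m
        new[0] = (1 - p) * sum(state)      # a B resets the current run
        for k in range(m - 1):
            new[k + 1] = p * state[k]      # an A extends the current run
        state = new                        # mass that reaches length m is dropped
    return 1 - sum(state)

next_seven = prob_run_at_least(7, 7)       # the next 7 assignments are all A
anywhere = prob_run_at_least(7, 200)       # a run of 7 A's somewhere in 200
print(next_seven, anywhere)
```

The first number is 1/2^7; the second is many times larger because the run may start at any of nearly 200 positions.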

Suppose you have a coin that has probability of heads p and you flip this coin n times. A rule of thumb says that the expected length of the longest run of heads is about

$$\log_{1/p}\bigl(n(1-p)\bigr) = -\frac{\log\bigl(n(1-p)\bigr)}{\log p}$$

provided that n(1-p) is much larger than 1.

So in a trial of n = 200 subjects with p = 0.5, you’d expect the longest run of heads to be about seven in a row. When p is larger than 0.5, the longest expected run will be longer. For example, if p = 0.6, you’d expect a run of about 9.

The standard deviation of the longest run length is roughly 1/log(1/p), independent of n. For coin flips with equal probability of heads or tails, this says an approximate 95% confidence interval would be about 3 either side of the point estimate. So for 200 tosses of a fair coin, you’d expect the longest run of heads to be about 7 ± 3, or between 4 and 10.
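As a quick numerical check of the figures quoted above, the sketch below takes the rule of thumb to be the base-1/p logarithm of n(1-p), which is the same quantity computed as `r` in the code further down. The function names are mine.

```python
from math import log

def expected_longest_run(n, p):
    """Rule-of-thumb expected longest run of heads in n flips: log base 1/p of n(1-p)."""
    return log(n * (1 - p)) / log(1 / p)

def rough_sd(p):
    """Rough standard deviation of the longest run length, independent of n."""
    return 1 / log(1 / p)

for p in (0.5, 0.6):
    print(p, round(expected_longest_run(200, p), 2), round(rough_sd(p), 2))
```

For a fair coin the rough standard deviation is about 1.44, so two standard deviations is just under 3, which is where the "7 ± 3" comes from.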

The following Python code gives an estimate of the probability that the longest run is between a and b inclusive, based on an extreme value distribution.

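```
from math import exp, log

def prob(a, b, n, p):
    # r is the rule-of-thumb expected length of the longest run
    r = -log(n*(1-p))/log(p)
    # extreme value approximation to the CDF, centered at r
    cdf = lambda x: exp(-p**x)
    return cdf(b + 1 - r) - cdf(a - r)
```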

What if you were interested in the longest run of heads or tails? With a fair coin, this just adds 1 to the estimates above. To see this, consider a success to be when consecutive coins turn up the same way. This new sequence has the same expected run lengths, but a run of length m in this sequence corresponds to a run of length m + 1 in the original sequence.

For more details, see “The Surprising Predictability of Long Runs” by Mark F. Schilling, Mathematics Magazine 85 (2012), number 2, pages 141–149.

Randomized clinical trials essentially flip a coin to assign patients to treatment arms. Outcome-adaptive randomization “bends” the coin to favor what appears to be the better treatment at the time each randomized assignment is made. The method aims to treat more patients in the trial effectively, and on average it succeeds.

However, looking only at the average number of patients assigned to each treatment arm conceals the fact that the number of patients assigned to each arm can be surprisingly variable compared to equal randomization.

Suppose we have 100 patients to enroll in a clinical trial. If we assign each patient to a treatment arm with probability 1/2, there will be about 50 patients on each treatment. The following histogram shows the number of patients assigned to the first treatment arm in 1000 simulations. The standard deviation is about 5.

Next we let the randomization probability vary. Suppose the true probability of response is 50% on one arm and 70% on the other. We model the probability of response on each arm as a beta distribution, starting from a uniform prior. We randomize to an arm with probability equal to the posterior probability that that arm has higher response. The histogram below shows the number of patients assigned to the better treatment in 1000 simulations.

The standard deviation in the number of patients is now about 17. Note that while most trials assign 50 or more patients to the better treatment, some trials in this simulation put fewer than 20 patients on this treatment. Not only will these trials treat patients less effectively, they will also have low statistical power (as will the trials that put nearly all the patients on the better arm).
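A simulation along these lines can be sketched with the standard library alone. This is not the code behind the original figures; it is a minimal version under the stated setup (100 patients, response rates 50% and 70%, uniform priors). It relies on one fact: drawing a value from each arm's beta posterior and choosing the arm with the larger draw selects an arm with probability equal to the posterior probability that it is the better arm. The seed and function name are arbitrary.

```python
import random
from statistics import pstdev

def patients_on_better_arm(rng, n=100, rates=(0.5, 0.7), adaptive=True):
    """Simulate one trial; return the number of patients on arm 1, the better arm."""
    successes = [0, 0]
    failures = [0, 0]
    count_better = 0
    for _ in range(n):
        if adaptive:
            # One draw per posterior; taking the larger picks arm 1 with
            # probability P(theta_1 > theta_0 | data so far).
            draws = [rng.betavariate(1 + successes[k], 1 + failures[k]) for k in (0, 1)]
            arm = 1 if draws[1] > draws[0] else 0
        else:
            arm = rng.randrange(2)         # equal randomization
        if rng.random() < rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
        count_better += arm
    return count_better

rng = random.Random(0)
equal = [patients_on_better_arm(rng, adaptive=False) for _ in range(1000)]
adapt = [patients_on_better_arm(rng, adaptive=True) for _ in range(1000)]
sd_equal, sd_adapt = pstdev(equal), pstdev(adapt)
print(f"SD equal: {sd_equal:.1f}, SD adaptive: {sd_adapt:.1f}, min adaptive: {min(adapt)}")
```

Exact values will vary with the seed, but the adaptive standard deviation should come out several times larger than the equal-randomization one.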

The reason for this volatility is that the method can easily be misled by early outcomes. With one or two early failures on an arm, the method could assign more patients to the other arm and not give the first arm a chance to redeem itself.

Because of this dynamic, various methods have been proposed to add “ballast” to adaptive randomization. See a comparison of three such methods here. These methods reduce the volatility in adaptive randomization, but do not eliminate it. For example, the following histogram shows the effect of adding a burn-in period to the example above, randomizing the first 20 patients equally.

The standard deviation is now 13.8, less than without the burn-in period, but still large compared to a standard deviation of 5 for equal randomization.

Another approach is to transform the randomization probability. If we use an exponential tuning parameter of 0.5, the sample standard deviation of the number of patients on the better arm is essentially the same, 13.4. If we combine a burn-in period of 20 and an exponential parameter of 0.5, the sample standard deviation is 11.7, still more than twice that of equal randomization.
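The post does not spell out the transformation, so the sketch below assumes a common power form: if p is the posterior probability that an arm is better, randomize to it with probability p^c / (p^c + (1-p)^c). The function names and the way burn-in is combined with the transform are likewise my assumptions.

```python
def tuned_probability(p, c=0.5):
    """Pull a randomization probability p toward 1/2 (assumed power transform)."""
    return p**c / (p**c + (1 - p)**c)

def randomization_probability(p, patients_enrolled, burn_in=20, c=0.5):
    """Randomize equally during burn-in, then use the tuned probability."""
    if patients_enrolled < burn_in:
        return 0.5
    return tuned_probability(p, c)

for p in (0.6, 0.9, 0.99):
    print(p, round(tuned_probability(p), 3))
```

With c = 0.5, a posterior probability of 0.9 becomes a randomization probability of 0.75. Setting c = 1 leaves p unchanged, and c = 0 gives equal randomization, so c acts as a dial between the two.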


# Personalized medicine

When I hear someone say “personalized medicine” I want to ask “as opposed to what?”

All medicine is personalized. If you are in an emergency room with a broken leg and the person next to you is lapsing into a diabetic coma, the two of you will be treated differently.

The aim of personalized medicine is to increase the degree of personalization, not to introduce personalization. In particular, there is the popular notion that it will become routine to sequence your DNA any time you receive medical attention, and that this sequence data will enable treatment uniquely customized for you. All we have to do is collect a lot of data and let computers sift through it. There are numerous reasons why this is incredibly naive. Here are three to start with.

• Maybe the information relevant to treating your malady is in how DNA is expressed, not in the DNA per se, in which case a sequence of your genome would be useless. Or maybe the most important information is not genetic at all. The data may not contain the answer.
• Maybe the information a doctor needs is not in one gene but in the interaction of 50 genes or 100 genes. Unless a small number of genes are involved, there is no way to explore the combinations by brute force. For example, the number of ways to select 5 genes out of 20,000 is 26,653,335,666,500,004,000. The number of ways to select 32 genes is over a googol, and there isn’t a googol of anything in the universe. Moore’s law will not get us around this impasse.
• Most clinical trials use no biomarker information at all. It is exceptional to incorporate information from one biomarker. Investigating a handful of biomarkers in a single trial is statistically dubious. Blindly exploring tens of thousands of biomarkers is out of the question, at least with current approaches.
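The counts in the second point are easy to verify with exact integer arithmetic. The sketch assumes the same round figure of 20,000 genes.

```python
from math import comb

genes = 20_000
googol = 10**100

five = comb(genes, 5)   # ways to choose 5 genes out of 20,000

# Smallest subset size whose number of combinations exceeds a googol.
smallest = next(k for k in range(1, 100) if comb(genes, k) > googol)

print(f"{five:,}")
print(smallest)
```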

Genetic technology has the potential to incrementally increase the degree of personalization in medicine. But these discoveries will require new insight, and not simply more data and more computing power.


# How long will there be computer science departments?

The first computer scientists resided in math departments. When universities began to form computer science departments, there was some discussion over how long computer science departments would exist. Some thought that after a few years, computer science departments would have served their purpose and computer science would be absorbed into other departments that applied it.

It looks like computer science departments are here to stay, but that doesn’t mean that there are not territorial disputes. If other departments are not satisfied with the education their students are getting from the computer science department, they will start teaching their own computer science classes. This is happening now, to different extents in different places.

Some institutions have departments of bioinformatics or biomathematics. Will they always? Or will “bioinformatics” and “biomathematics” simply be “biology” in a few years?

Statisticians sometimes have their own departments, sometimes reside in mathematics departments, and sometimes are scattered to the four winds with de facto statisticians working in departments of education, political science, etc. It would be interesting to see which of these three options grows in the wake of “big data.” A fourth possibility is the formation of “data science” departments, essentially statistics departments with more respect for machine learning and with better marketing.

No doubt computer science, bioinformatics, and statistics will be hot areas for years to come, but the scope of academic departments by these names will change. At different institutions they may grow, shrink, or even disappear.

Academic departments argue that because their subject is important, their department is important. And any cut to their departmental budget is framed as a cut to the budget for their subject. But neither of these is necessarily true. Matt Briggs wrote about this yesterday in regard to philosophy. He argues that philosophy is important but that philosophy departments are not. He quotes Peter Kreeft:

Philosophy was not a “department” to its founders. They would have regarded the expression “philosophy department” as absurd as “love department.”

Love is important, but it doesn’t need to be a department. In fact, it’s so important that the idea of quarantining it to a department is absurd.

Computer science and statistics departments may shrink as their subjects diffuse throughout the academy. Their departments may not go away, but they may become more theoretical and more specialized. Already most statistics education takes place outside of statistics departments, and the same may be true of computer science soon if it isn’t already.

* * *

For a daily dose of computer science and related topics, follow @CompSciFact on Twitter.

# How do you justify that distribution?

Someone asked me yesterday how people justify probability distribution assumptions. Sometimes the most mystifying assumption is the first one: “Assume X is normally distributed …” Here are a few answers.

1. Sometimes distribution assumptions are not justified.
2. Sometimes distributions can be derived from fundamental principles. For example, there are axioms that uniquely specify a Poisson distribution.
3. Sometimes distributions are justified on theoretical grounds. For example, large samples and the central limit theorem together may justify assuming that something is normally distributed.
4. Often the choice of distribution is somewhat arbitrary, chosen by intuition or for convenience, and then empirically shown to work well enough.
5. Sometimes a distribution can be a bad fit and still work well, depending on what you’re asking of it.

The last point is particularly interesting. It’s not hard to imagine that a poor fit would produce poor results. It’s surprising when a poor fit produces good results. Here’s an example of the latter.

Suppose you are testing a new drug and hoping that it improves how long patients live. You want to stop the clinical trial early if it looks like patients are living no longer than they would have on standard treatment. There is a Bayesian method for monitoring such experiments that assumes survival times have an exponential distribution. But survival times are not exponentially distributed, not even close.

The method works well because of the question being asked. The method is not being asked to accurately model the distribution of survival times for patients in the trial. It is only being asked to determine whether a trial should continue or stop, and it does a good job of doing so. Simulations show that the method makes the right decision with high probability, even when the actual survival times are not exponentially distributed.


# Small data

Big data is getting a lot of buzz lately, but small data is interesting too. In some ways it’s more interesting. Because of limit theorems, a lot of things become dull in the large that are more interesting in the small.

When working with small data sets you have to accept that you will very often draw the wrong conclusion. You just can’t have high confidence in inference drawn from a small amount of data, unless you can do magic. But you do the best you can with what you have. You have to be content with the accuracy of your method relative to the amount of data available.

For example, a clinical trial may try to find the optimal dose of some new drug by giving the drug to only 30 patients. When you have five doses to test and only 30 patients, you’re just not going to find the right dose very often. You might want to assign 6 patients to each dose, but you can’t count on that. For safety reasons, you have to start at the lowest dose and work your way up cautiously, and that usually results in uneven allocation to doses, and thus less statistical power. And you might not treat all 30 patients. You might decide — possibly incorrectly — to stop the trial early because it appears that all doses are too toxic or ineffective. (This gives a glimpse of why testing drugs on people is a harder statistical problem than testing fertilizers on crops.)

Maybe your method finds the right answer 60% of the time, hardly a satisfying performance. But if alternative methods find the right answer 50% of the time under the same circumstances, your 60% looks great by comparison.

Related post: The law of medium numbers