Dose finding != dose escalation

You’ll often hear Phase I dose-finding trials referred to as dose escalation studies. This is because simple dose-finding methods can only explore in one direction: they can only escalate.

Three-plus-three rule

The most common dose finding method is the 3+3 rule. There are countless variations on this theme, but the basic idea is that you give a dose of an experimental drug to three people. If all three are OK, you go up a dose next time. If two out of three are OK, you give that dose again. If only one out of three is OK, you stop [1].
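To make the rule concrete, here is a minimal sketch of the basic decision logic in Python. It ignores the many variations on the theme (expanding a cohort to six patients, for example) and just encodes the rule as stated above.

```python
# A minimal sketch of the 3+3 decision rule as described above.
# `toxicities` is the number of patients in a cohort of three who had
# an adverse reaction at the current dose level.

def three_plus_three_decision(toxicities):
    if toxicities == 0:
        return "escalate"     # all three OK: go up a dose next time
    elif toxicities == 1:
        return "repeat dose"  # two out of three OK: give that dose again
    else:
        return "stop"         # at most one out of three OK: stop
```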

Deterministic thinking

The 3+3 algorithm implicitly assumes deterministic thinking, at least in part. The assumption is that if three out of three patients respond well, we know the dose is safe [2].

If you increase the dose level and the next three patients experience adverse events, you stop the trial. Why? Because you know that the new dose is dangerous, and you know the previous dose was safe. You can only escalate because you assume you have complete knowledge based on three samples.

But if we treat three patients at a particular dose level and none have an adverse reaction, we do not know for certain that the dose level is safe, though we may have sufficient confidence in its safety to try the next dose level. Similarly, if we treat three patients at a dose and all have an adverse reaction, we do not know for certain that the dose is toxic.

Bayesian dose-finding

A Bayesian dose-finding method estimates toxicity probabilities given the data available. It might decide at one point that a dose appears safe, then reverse its decision later based on more data. Similarly, it may reverse an initial assessment that a dose is unsafe.

A dose-finding method based on posterior probabilities of toxicity is not strictly a dose escalation method because it can explore in two directions. It may decide that the next dose level to explore is higher or lower than the current level.
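As a rough illustration, here is what the core calculation might look like for a single dose level, assuming binary toxicity outcomes, a uniform Beta(1, 1) prior, and a target toxicity rate of 0.3. These specifics, and the function name, are assumptions for the sake of the example, not a particular published design.

```python
from scipy import stats

# Illustrative sketch: Beta-Binomial model of toxicity at one dose level.
# Assumes a uniform Beta(1, 1) prior and a target toxicity rate of 0.3.

def prob_too_toxic(toxicities, patients, target=0.3, a=1, b=1):
    """Posterior probability that the toxicity rate exceeds the target."""
    posterior = stats.beta(a + toxicities, b + patients - toxicities)
    return 1 - posterior.cdf(target)

# Example: 1 toxicity out of 3 patients at the current dose.
p = prob_too_toxic(1, 3)
# Escalate, stay, or de-escalate depending on how this probability
# compares to pre-specified thresholds; more data can reverse the decision.
```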

Starting at the lowest dose

In Phase I studies of chemotherapeutics, you conventionally start at the lowest dose. This makes sense. These are toxic agents, and you naturally want to start at a dose you have reason to believe isn’t too toxic. (NB: I say “too toxic” because chemotherapy is toxic. You hope that it’s toxic to a tumor without being too toxic for the patient host.)

But on closer inspection maybe you shouldn’t start at the lowest dose. Suppose you want to test 100 mg, 200 mg, and 300 mg of some agent. Then 100 mg is the lowest dose, and it’s ethical to start at 100 mg. Now what if we add a dose of 50 mg to the possibilities? Did the 100 mg dose suddenly become unethical as a starting dose?

If you have reason to believe that 100 mg is a tolerable dose, why not start with that dose, even if you add a lower dose in case you’re wrong? This makes sense if you think of dose-finding, but not if you think only in terms of dose escalation. If you can only escalate, then it’s impossible to ever give a dose below the starting dose.


[1] I have heard, but I haven’t been able to confirm, that the 3+3 method has its origin in a method proposed by John Tukey during WWII for testing bombs. When testing a mechanical system, like a bomb, there is much less uncertainty than when testing a drug in a human. In a mechanical setting, you may have a lot more confidence from three samples than you would in a medical setting.

[2] How do you explain the situation where one out of three has an adverse reaction? Is the dose safe or not? Here you naturally switch to probabilistic thinking because deterministic thinking leads to a contradiction.

 

Multi-arm adaptively randomized clinical trials

This post will look at adaptively randomized trial designs. In particular, we want to focus on multi-arm trials, i.e. trials of more than two treatments. The aim is to drop the less effective treatments quickly so the trial can focus on determining which of the better treatments is best.

We’ll briefly review our approach to adaptive randomization but not go into much detail. For a more thorough introduction, see, for example, this report.

Why adaptive randomization

Adaptive randomization designs allow the randomization probabilities to change in response to accumulated outcome data so that more subjects are assigned to (what appear to be) more effective treatments. They also allow for continuous monitoring, so one can stop a trial early if we’re sufficiently confident that we’ve found the best treatment.

Adapting randomization probabilities

Of course we don’t know what treatments are more effective or else we wouldn’t be running a clinical trial. We have an idea based on the data seen so far at any point in the trial, and that assessment may change as more data become available.

We could simply put each patient on what appears to be the best arm (the “play-the-winner” strategy), but this would forgo the benefits of randomization. Instead we compromise, continuing to randomize, but increasing the randomization probability for what appears to be the best treatment at the time a subject enters the trial.
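As an illustration, here is one common way to compute the adapted randomization probabilities for binary outcomes: make each arm's probability proportional to the posterior probability that it is the best arm. The uniform priors and Monte Carlo details below are assumptions for the sketch, not necessarily what the report linked above uses.

```python
import numpy as np

# Sketch of outcome-adaptive randomization for binary outcomes.
# Assumes independent Beta(1, 1) priors on each arm's response rate and
# randomization probabilities proportional to the posterior probability
# that each arm is best. Details vary by design; this is illustrative.

def randomization_probs(successes, failures, n_draws=10_000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # Monte Carlo draws from each arm's posterior
    draws = rng.beta(1 + np.array(successes), 1 + np.array(failures),
                     size=(n_draws, len(successes)))
    # Probability that each arm has the highest response rate
    p_best = np.bincount(draws.argmax(axis=1), minlength=len(successes)) / n_draws
    return p_best

# Example: three arms with accumulating data.
probs = randomization_probs(successes=[5, 8, 12], failures=[10, 9, 6])
# Assign the next patient to arm i with probability probs[i].
```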

Continuous monitoring

By monitoring the performance of each treatment arm, we can drop a poorly performing arm and assign subjects to the better treatment arms. This is particularly important for multi-arm trials. We want to weed out the poor treatments quickly so we can focus on the more promising treatments.

Continuous monitoring opens the possibility of stopping trials early if there is a clear winner. If all treatments perform similarly, more patients will be used. The maximum number of patients will be enrolled only if necessary.

Multi-arm trials

Randomizing an equal number of patients to each of several treatment arms would require a lot of subjects. A multi-arm adaptive trial turns into a two-arm trial once the other arms are dropped. We’ll present simulation results below that demonstrate this.

Running a big trial with several treatment arms could be more cost effective than running several smaller trials because there is a certain fixed cost associated with running any trial, no matter how small: protocol review, IRB approval, etc.

There has been some skepticism about whether two-arm adaptively randomized trials live up to their hype. Trial design is a multi-objective optimization problem, and it’s easy to claim victory by doing better by one criterion while doing worse by another. In my opinion, adaptive randomization is more promising for multi-arm trials than for two-arm trials.

In my experience, multi-arm trials benefit more from early stopping than from adapting randomization probabilities. That is, one may treat more patients effectively by randomizing equally but dropping poorly performing treatments. Instead of reducing the probability of assigning patients to a poor treatment arm, continue to randomize equally so you can more quickly gather enough evidence to drop the arm.

I initially thought that gradually decreasing the randomization probability of a poorly performing arm would be better than keeping the randomization probability equal until it suddenly drops to zero. But experience suggests this intuition was wrong.

Simulation study

I designed a 4-arm trial with a uniform prior on the probability of response on each arm. The maximum accrual was set to 400 patients.

An arm is suspended if the posterior probability of it being best drops below 0.05. (Note: A suspended arm is not necessarily closed. It may become active again in response to more data.)

Subjects are randomized equally to all available arms. If only one arm is available, the trial stops. Each trial was simulated 1,000 times.
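Here is a rough sketch of how one simulated trial under this design might be coded. The cohort size, the number of posterior draws, and exactly when the suspension rule is checked are assumptions for illustration; the actual simulation software differs in detail.

```python
import numpy as np

# Rough sketch of one simulated trial: uniform priors, equal randomization
# among active arms, and arms suspended when their posterior probability of
# being best drops below the cutoff. A suspended arm can become active again
# as other arms accrue data. Cohort size and draw count are assumptions.

def simulate_trial(true_rates, max_n=400, cutoff=0.05, cohort=10, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    k = len(true_rates)
    succ, fail = np.zeros(k), np.zeros(k)
    enrolled = 0
    while enrolled < max_n:
        # Posterior probability each arm is best (Monte Carlo)
        draws = rng.beta(1 + succ, 1 + fail, size=(5000, k))
        p_best = np.bincount(draws.argmax(axis=1), minlength=k) / 5000
        active = np.where(p_best >= cutoff)[0]
        if len(active) <= 1:
            break  # only one arm available: stop the trial
        # Randomize the next cohort equally among active arms
        arms = rng.choice(active, size=min(cohort, max_n - enrolled))
        for a in arms:
            if rng.random() < true_rates[a]:
                succ[a] += 1
            else:
                fail[a] += 1
        enrolled += len(arms)
    return enrolled, succ + fail

n_used, patients_per_arm = simulate_trial([0.3, 0.4, 0.5, 0.6])
```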

In the first scenario, I assume the true probabilities of successful response on the treatment arms are 0.3, 0.4, 0.5, and 0.6 respectively. The treatment arm with 30% response was dropped early in 99.5% of the simulations, and on average only 12.8 patients were assigned to this treatment.

|-----+----------+----------------+----------|
| Arm | Response | Pr(early stop) | Patients |
|-----+----------+----------------+----------|
|   1 |      0.3 |          0.995 |     12.8 |
|   2 |      0.4 |          0.968 |     25.7 |
|   3 |      0.5 |          0.754 |     60.8 |
|   4 |      0.6 |          0.086 |     93.9 |
|-----+----------+----------------+----------|

An average of 193.2 patients were used out of the maximum accrual of 400. Note that 80% of the subjects were allocated to the two best treatments.

Here are the results for the second scenario. Note that in this scenario there are two bad treatments and two good treatments. As we’d hope, the two bad treatments are dropped early and the trial concentrates on deciding the better of the two good treatments.

|-----+----------+----------------+----------|
| Arm | Response | Pr(early stop) | Patients |
|-----+----------+----------------+----------|
|   1 |     0.35 |          0.999 |     11.7 |
|   2 |     0.45 |          0.975 |     22.5 |
|   3 |     0.60 |          0.502 |     85.6 |
|   4 |     0.65 |          0.142 |    111.0 |
|-----+----------+----------------+----------|

An average of 230.8 patients were used out of the maximum accrual of 400. Now 85% of patients were assigned to the two best treatments. More patients were used in this scenario because the two best treatments were harder to tell apart.


Valuing results and information

Chris Wiggins gave an excellent talk at Rice this afternoon on data science at the New York Times. In the Q&A afterward, someone asked how you would set up a machine learning algorithm where you’re trying to optimize for outcomes and for information.

Here’s how I’ve approached this dilemma in the past. Information and outcomes are not directly comparable, so any objective function combining the two is going to add incommensurate things. One way around this is to put a value not on information per se but on what you’re going to do with the information.

In the context of a clinical trial, you want to treat patients in the trial effectively, and you want a high probability of picking the best treatment at the end. You can’t add patients and probabilities meaningfully. But why do you want to know which treatment is best? So you can treat future patients effectively. The value of knowing which treatment is best is the increase in the expected number of future successful treatments. This gives you a meaningful objective: maximize the expected number of effective treatments, counting both the patients in the trial and the future patients treated as a result of the trial.
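To make this concrete, here is a small sketch of such an objective. The inputs, especially the number of future patients whose care depends on the trial and the probability of picking the best treatment, are hypothetical quantities you would have to estimate, as discussed below.

```python
# Illustrative sketch of a value-of-information objective: the expected number
# of effectively treated patients, counting trial patients and future patients.
# All inputs are hypothetical quantities to be estimated.

def expected_effective_treatments(trial_successes, p_select_best,
                                  best_rate, runner_up_rate, n_future):
    # trial_successes: expected number of successes among patients in the trial
    # p_select_best:   probability the trial identifies the truly best treatment
    # best_rate:       response rate of the best treatment
    # runner_up_rate:  expected response rate of the treatment chosen otherwise
    # n_future:        number of future patients treated based on the trial
    future = n_future * (p_select_best * best_rate +
                         (1 - p_select_best) * runner_up_rate)
    return trial_successes + future

# Example: a design that treats trial patients slightly worse but is more
# likely to pick the best treatment can win once future patients are counted.
print(expected_effective_treatments(55, 0.80, 0.6, 0.5, 1000))  # 635.0
print(expected_effective_treatments(60, 0.70, 0.6, 0.5, 1000))  # 630.0
```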

The hard part is guessing what will be done with the information learned in the trial. Of course this isn’t exactly known, but it’s the key to estimating the value of the information. If nobody will be treated any differently in the future no matter how the trial turns out—and unfortunately this may be the case—then the trial should be optimized strictly for the benefit of the people in the trial. On the other hand, if a trial will determine the standard of care for thousands of future patients, then there is more value in picking the best treatment.

Interim analysis, futility monitoring, and predictive probability

An interim analysis of a clinical trial is an unusual analysis. At the end of the trial you want to estimate how well some treatment X works. For example, you want to know how likely it is that treatment X works better than the control treatment Y. But in the middle of the trial you want to know something more subtle.

It’s possible that treatment X is doing so poorly that you want to end the trial without going any further. It’s also possible that X is doing so well that you want to end the trial early. Both of these are rare. Most of the time an interim analysis is more concerned with futility. You might want to stop the trial early not because the results are really good, or really bad, but because the results are really mediocre! That is, treatments X and Y are performing so similarly that you’re afraid that you won’t be able to declare one or the other better.

Maybe treatment X is doing a little better than Y, but not so much better that you can declare with confidence that X is better. You might want to stop for futility if you project that not only do you not have enough evidence now, you don’t believe you will have enough evidence by the end of the trial.

Futility analysis is more about resources than ethics. If X is doing poorly, ethics might dictate that you stop giving X to patients so you stop early. If X is doing spectacularly well, ethics might dictate that you stop giving the control treatment, if there is an active control. But if X is doing so-so, there’s usually not an ethical reason to stop, unless X is worse than Y on some secondary criteria, such as having worse side effects. You want to end futile studies so you can save resources and get on with the next study, and you could argue that’s an ethical consideration, though less direct.

Futility analysis isn’t about your current estimate of effectiveness. It’s about what you think your estimate of effectiveness will be in the future. That is, it’s a second order prediction. You’re trying to understand the effectiveness of the trial, not of the treatment per se. You’re not trying to estimate a parameter, for example, but trying to estimate what range of estimates you’re likely to make.

This is why predictive probability is natural for interim analysis. You’re trying to predict outcomes, not parameters. (This is subtle: you’re trying to estimate the probability of outcomes that lead to certain estimates of parameters, namely those that allow you to reach a conclusion with pre-specified significance.)

Predictive probability is a Bayesian concept, but it is useful in analyzing frequentist trial designs. You may have frequentist conclusion criteria, such as a p-value threshold or some requirements on a confidence interval, but you want to know how likely it is that if the trial continues, you’ll see data that lead to meeting your criteria. In that case you want to compute the (Bayesian) predictive probability of meeting your frequentist criteria!
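As a sketch of the idea, here is how one might compute the predictive probability of eventual success for a two-arm trial with binary outcomes. The Beta(1, 1) priors, the final Fisher exact test, and the fixed final sample size per arm are all illustrative assumptions, not a specific trial design.

```python
import numpy as np
from scipy import stats

# Sketch: predictive probability that a two-arm trial with binary outcomes
# ends with a significant result. Beta(1, 1) priors, a final Fisher exact
# test, and a fixed final sample size per arm are illustrative assumptions.

def predictive_probability(x_trt, n_trt, x_ctl, n_ctl, n_final,
                           alpha=0.05, n_sim=2_000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    wins = 0
    for _ in range(n_sim):
        # Draw a response rate for each arm from its current posterior.
        p_trt = rng.beta(1 + x_trt, 1 + n_trt - x_trt)
        p_ctl = rng.beta(1 + x_ctl, 1 + n_ctl - x_ctl)
        # Simulate the outcomes of the patients yet to be enrolled.
        xt = x_trt + rng.binomial(n_final - n_trt, p_trt)
        xc = x_ctl + rng.binomial(n_final - n_ctl, p_ctl)
        # Apply the frequentist test to the (simulated) completed trial.
        _, pval = stats.fisher_exact([[xt, n_final - xt],
                                      [xc, n_final - xc]])
        wins += pval < alpha
    return wins / n_sim

# Example: interim data of 20/30 vs 12/30, with 60 patients per arm planned.
print(predictive_probability(x_trt=20, n_trt=30, x_ctl=12, n_ctl=30, n_final=60))
```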


Mathematical modeling for medical devices

We’re about to see a lot of new, powerful, inexpensive medical devices come out. And to my surprise, I’ve contributed to a few of them.

Growing compute power and shrinking sensors open up possibilities we’re only beginning to explore. Even when the things we want to observe elude direct measurement, we may be able to infer them from other things that we can now measure accurately, inexpensively, and in high volume.

In order to infer what you’d like to measure from what you can measure, you need a mathematical model. Or if you’d like to make predictions about the future from data collected in the past, you need a model. And that’s where I come in. Several companies have hired me to help them create medical devices by working on mathematical models. These might be statistical models, differential equations, or a combination of the two. I can’t say much about the projects I’ve worked on, at least not yet. I hope that I’ll be able to say more once the products come to market.

I started my career doing mathematical modeling (partial differential equations) but wasn’t that interested in statistics or medical applications. Then through an unexpected turn of events, I ended up spending a dozen years working in the biostatistics department of the world’s largest cancer center.

Since leaving MD Anderson and starting my consultancy, I’ve been approached by several companies for help with mathematical problems associated with their ideas for medical devices. These are ideal projects because they combine my earlier experience in mathematical modeling with my more recent experience with medical applications.

If you have an idea for a medical device, or know someone who does, let’s talk. I’d like to help.

 

Bayesian adaptive clinical trials: promise and pitfalls

This afternoon I’m giving a talk at the Houston INFORMS chapter entitled “Bayesian adaptive clinical trials: promise and pitfalls.”

When I started working in adaptive clinical trials, I was very excited about the potential of such methods. The clinical trial methods most commonly used are very crude, and there’s plenty of room for improvement.

Over time I became concerned about overly complex methods, methods which were good for academic publication but may not be best for patients. Such methods are extremely time-consuming to develop and may not perform as well in practice as simpler methods.

There’s a great deal of opportunity between the extremes, methods that are more sophisticated than the status quo without being unnecessarily complex.


Frequentist properties of Bayesian methods

Bayesian methods for designing clinical trials have become more common, and yet these Bayesian designs are almost always evaluated by frequentist criteria. For example, a trial may be designed to stop early 95% of the time under some bad scenario and stop no more than 20% of the time under some good scenario.

These criteria are arbitrary, since the “good” and “bad” scenarios are arbitrary, and because the stopping probability requirements of 95% and 20% are arbitrary. Still, there’s an idea lurking in the background that in every design there must be something that is shown to happen no more than 5% of the time.

It takes a great deal of effort to design Bayesian methods with desired frequentist properties. It’s an inverse problem, searching for the parameters in a high-dimensional design space, usually via lengthy simulation, that cause the method to satisfy some criteria. Of course frequentist methods satisfy frequentist criteria by design and so meet these criteria with far less effort. It’s rare to see the tables turned, evaluating frequentist methods by Bayesian criteria.
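As a toy illustration of this calibration process, here is a sketch that tunes a posterior-probability cutoff for a single-arm futility rule by simulating its frequentist behavior under a “bad” and a “good” scenario. The scenarios, sample sizes, and rule are assumptions for the example, not a design anyone has published.

```python
import numpy as np
from scipy import stats

# Toy calibration of a Bayesian futility rule by simulation. The rule stops a
# single-arm trial when the posterior probability that the response rate is
# below 0.3 exceeds a cutoff. The scenarios (0.2 "bad", 0.4 "good"), the looks
# every 10 patients, and the maximum of 40 patients are assumptions.

def prob_early_stop(true_rate, cutoff, n_max=40, look_every=10,
                    null_rate=0.3, n_sim=2_000, seed=0):
    rng = np.random.default_rng(seed)
    stops = 0
    for _ in range(n_sim):
        x = 0
        for n in range(look_every, n_max + 1, look_every):
            x += rng.binomial(look_every, true_rate)
            # Posterior probability the response rate is below null_rate,
            # under a uniform Beta(1, 1) prior.
            p_bad = stats.beta(1 + x, 1 + n - x).cdf(null_rate)
            if p_bad > cutoff:
                stops += 1
                break
    return stops / n_sim

# Search for a cutoff whose frequentist behavior meets the requirements,
# e.g. stopping at least 95% of the time when bad and at most 20% when good.
for cutoff in np.arange(0.80, 0.99, 0.02):
    print(round(cutoff, 2),
          prob_early_stop(0.2, cutoff),
          prob_early_stop(0.4, cutoff))
```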

Sometimes the effort to beat frequentist designs at their own game is futile because the frequentist designs are optimal by their own criteria. More often, however, the Bayesian and frequentist methods being compared are not direct competitors but only analogs. The aim in this case is to match the frequentist method’s operating characteristics by one criterion while doing better by a new criterion.

Sometimes a Bayesian method can be shown to have better frequentist operating characteristics than its frequentist counterpart. This puts dogmatic frequentists in the awkward position of admitting that what they see as an unjustified approach to statistics has nevertheless produced a superior product. Some anti-Bayesians are fine with this, happy to have a procedure with better frequentist properties, even though it happened to be discovered via a process they view as illegitimate.


 

Related post: Bayesian clinical trials in one zip code

Reproducible randomized controlled trials

“Reproducible” and “randomized” don’t seem to go together. If something was unpredictable the first time, shouldn’t it be unpredictable if you start over and run it again? As is often the case, we want incompatible things.

But the combination of reproducible and random can be reconciled. Why would we want a randomized controlled trial (RCT) to be random, and why would we want it to be reproducible?

One of the purposes in randomized experiments is the hope of scattering complicating factors evenly between two groups. For example, one way to test two drugs on 1000 people would be to gather 1000 people and give the first drug to all the men and the second to all the women. But maybe a person’s sex has something to do with how the drug acts. If we randomize between two groups, it’s likely that about the same number of men and women will be in each group.

The example of sex as a factor is oversimplified because there’s reason to suspect a priori that sex might make a difference in how a drug performs. The bigger problem is that factors we can’t anticipate or control may matter, and we’d like them scattered evenly between the two treatment groups. If we knew what the factors were, we could assure that they’re evenly split between the groups. The hope is that randomization will do that for us with things we’re unaware of. For this purpose we don’t need a process that is “truly random,” whatever that means, but a process that matches our expectations of how randomness should behave. So a pseudorandom number generator (PRNG) is fine. No need, for example, to randomize using some physical source of randomness like radioactive decay.

Another purpose in randomization is for the assignments to be unpredictable. We want a physician, for example, to enroll patients on a clinical trial without knowing what treatment they will receive. Otherwise there could be a bias, presumably unconscious, against assigning patients with poor prognosis if the physicians know the next treatment will be the one they hope or believe is better. Note here that the randomization only has to be unpredictable from the perspective of the people participating in and conducting the trial. The assignments could be predictable, in principle, by someone not involved in the study.

And why would you want randomization assignments to be reproducible? One reason would be to test whether randomization software is working correctly. Another might be to satisfy a regulatory agency or some other oversight group. Still another reason might be to defend your randomization in a lawsuit. A physical random number generator, such as using the time down to the millisecond at which the randomization is conducted, would achieve random assignments and unpredictability, but not reproducibility.

Computer algorithms for generating random numbers (technically pseudo-random numbers) can achieve reproducibility, practically random allocation, and unpredictability. The randomization outcomes are predictable, and hence reproducible, to someone with access to the random number generator and its state, but unpredictable in practice to those involved in the trial. The internal state of the random number generator has to be saved between assignments and passed back into the randomization software each time.

Random number generators such as the Mersenne Twister have good statistical properties, but they also carry a large amount of state. The random number generator described here has very small state, 64 bits, and so storing and returning the state is simple. If you needed to generate a trillion random samples, the Mersenne Twister would be preferable, but since RCTs usually have fewer than a trillion subjects, the RNG in the article is perfectly fine. I have run the DieHarder random number generator quality tests on this generator and it performs quite well.
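Here is an illustrative sketch of reproducible treatment assignment with a 64-bit-state generator. I use SplitMix64 for concreteness; it is not necessarily the generator described in the article linked above.

```python
# Illustrative sketch of reproducible treatment assignment with a generator
# whose entire state is a single 64-bit integer. SplitMix64 is used here for
# concreteness; it is not necessarily the generator in the linked article.

MASK = (1 << 64) - 1

def splitmix64(state):
    """Advance the 64-bit state and return (new_state, 64-bit output)."""
    state = (state + 0x9E3779B97F4A7C15) & MASK
    z = state
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK
    return state, z ^ (z >> 31)

def assign_treatment(state, arms=("A", "B")):
    """Randomize one subject; the returned state is all that must be stored."""
    state, r = splitmix64(state)
    return state, arms[r % len(arms)]

# The state is persisted between assignments (e.g. in a database), so the
# whole assignment sequence is reproducible from the initial seed, yet
# unpredictable in practice to those involved in the trial.
state = 20240101
for subject in range(1, 6):
    state, arm = assign_treatment(state)
    print(subject, arm)
```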

Need help with randomized trials? Let’s talk.

Skin in the game for observational studies

The article Deming, data and observational studies by S. Stanley Young and Alan Karr opens with

Any claim coming from an observational study is most likely to be wrong.

They back up this assertion with data about observational studies later contradicted by prospective studies.

Much has been said lately about the assertion that most published results are false, particularly observational studies in medicine, and I won’t rehash that discussion here. Instead I want to cut to the process Young and Karr propose for improving the quality of observational studies. They summarize their proposal as follows.

The main technical idea is to split the data into two data sets, a modelling data set and a holdout data set. The main operational idea is to require the journal to accept or reject the paper based on an analysis of the modelling data set without knowing the results of applying the methods used for the modelling set on the holdout set and to publish an addendum to the paper giving the results of the analysis of the holdout set.

They then describe an eight-step process in detail. One step is that cleaning the data and dividing it into a modelling set and a holdout set would be done by different people than those doing the modelling and analysis. They go on to explain why this would lead to more truthful publications.

The holdout set is the key. Both the author and the journal know there is a sword of Damocles over their heads. Both stand to be embarrassed if the holdout set does not support the original claims of the author.

* * *

The full title of the article is Deming, data and observational studies: A process out of control and needing fixing. It appeared in the September 2011 issue of Significance.

Update: The article can be found here.

Adaptive clinical trials and machine learning

Arguments over the difference between statistics and machine learning are often pointless. There is a huge overlap between the two approaches to analyzing data, sometimes obscured by differences in vocabulary. However, there is one distinction that is helpful. Statistics aims to build accurate models of phenomena, implicitly leaving the exploitation of these models to others. Machine learning aims to solve problems more directly, and sees its models as intermediate artifacts; if an unrealistic model leads to good solutions, it’s good enough.

This distinction is valid in broad strokes, though things are fuzzier than it admits. Some statisticians are content with constructing models, while others look further down the road to how the models are used. And machine learning experts vary in their interest in creating accurate models.

Clinical trial design usually comes under the heading of statistics, though in spirit it’s more like machine learning. The goal of a clinical trial is to answer some question, such as whether a treatment is safe or effective, while also having safeguards in place to stop the trial early if necessary. There is an underlying model—implicit in some methods, more often explicit in newer methods—that guides the conduct of the trial, but the accuracy of this model per se is not the primary goal. Some designs have been shown to be fairly robust, leading to good decisions even when the underlying probability model does not fit well. For example, I’ve done some work with clinical trial methods that model survival times with an exponential distribution. No one believes that an exponential distribution, i.e. one with constant hazard, accurately models survival times in cancer trials, and yet methods using these models do a good job of stopping trials early that should stop early and letting trials continue that should be allowed to continue.

Experts in machine learning are more accustomed to the idea of inaccurate models sometimes producing good results. The best example may be naive Bayes classifiers. The word “naive” in the name is a frank admission that these classifiers treat as independent events that are known not to be independent. These methods can do well at their ultimate goal, such as distinguishing spam from legitimate email, even though they make a drastic simplifying assumption.
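For concreteness, here is a minimal naive Bayes sketch in the spam-filtering setting: the score for each class is a log prior plus a sum of per-word log likelihoods, which is exactly where the independence assumption enters. The data format and the Laplace smoothing are incidental choices for the example.

```python
import math
from collections import Counter

# Minimal naive Bayes spam classifier, illustrating the independence
# assumption: each word contributes an independent log-likelihood term.

def train(docs, labels):
    """docs: list of token lists; labels: parallel list of 'spam'/'ham'."""
    counts = {"spam": Counter(), "ham": Counter()}
    priors = Counter(labels)
    for tokens, label in zip(docs, labels):
        counts[label].update(tokens)
    return counts, priors

def classify(tokens, counts, priors):
    vocab = set(counts["spam"]) | set(counts["ham"])
    scores = {}
    for label in ("spam", "ham"):
        total = sum(counts[label].values())
        # Log prior plus sum of log likelihoods, treating words as independent.
        score = math.log(priors[label] / sum(priors.values()))
        for t in tokens:
            # Laplace smoothing to avoid log(0) for unseen words.
            score += math.log((counts[label][t] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Tiny usage example with made-up training data.
docs = [["win", "cash", "now"], ["meeting", "at", "noon"]]
labels = ["spam", "ham"]
counts, priors = train(docs, labels)
print(classify(["cash", "now"], counts, priors))  # -> 'spam'
```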

There have been papers that look at why naive Bayes works surprisingly well. Naive Bayes classifiers work well when the errors due to wrongly assuming independence affect positive and negative examples roughly equally. The inaccuracies of the model sort of wash out when the model is reduced to a binary decision, classifying as positive or negative. Something similar happens with the clinical trial methods mentioned above. The ultimate goal is to make correct go/no-go decisions, not to accurately model survival times. The naive exponential assumption affects both trials that should and should not stop, and the model predictions are reduced to a binary decision.

Related: Adaptive clinical trial design