Multi-arm adaptively randomized clinical trials

This post will look at adaptively randomized trial designs. In particular, we want to focus on multi-arm trials, i.e. trials of more than two treatments. The aim is to drop the less effective treatments quickly so the trial can focus on determining which of the better treatments is best.

We’ll briefly review our approach to adaptive randomization but not go into much detail. For a more thorough introduction, see, for example, this report.

Why adaptive randomization

Adaptive randomization designs allow the randomization probabilities to change in response to accumulated outcome data so that more subjects are assigned to (what appear to be) more effective treatments. They also allow for continuous monitoring, so one can stop a trial early if we’re sufficiently confident that we’ve found the best treatment.

Adapting randomization probabilities

Of course we don’t know which treatments are more effective, or else we wouldn’t be running a clinical trial. At any point in the trial we have an idea based on the data seen so far, and that assessment may change as more data become available.

We could simply put each patient on what appears to be the best arm (the “play-the-winner” strategy), but this would forgo the benefits of randomization. Instead we compromise, continuing to randomize, but increasing the randomization probability for what appears to be the best treatment at the time a subject enters the trial.
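Here’s a minimal sketch of one way to do this, assuming binary outcomes and independent uniform (Beta(1, 1)) priors on each arm’s response rate: estimate each arm’s posterior probability of being best by Monte Carlo, then randomize in proportion to a tempered power of that probability. The post doesn’t pin down a particular adaptation rule, so the prior and the tempering exponent here are illustrative assumptions, not necessarily the rule used below.

```python
import numpy as np

def prob_best(successes, failures, n_draws=10_000, rng=None):
    """Monte Carlo estimate of each arm's posterior probability of being best,
    assuming binary outcomes and independent Beta(1, 1) (uniform) priors."""
    rng = rng or np.random.default_rng()
    successes = np.asarray(successes)
    failures = np.asarray(failures)
    # One row of posterior draws per Monte Carlo sample, one column per arm
    draws = rng.beta(1 + successes, 1 + failures, size=(n_draws, len(successes)))
    best = np.argmax(draws, axis=1)
    return np.bincount(best, minlength=len(successes)) / n_draws

# Example: three arms with the outcomes accumulated so far
p_best = prob_best(successes=[3, 5, 8], failures=[7, 5, 2])

# One common adaptation rule (an assumption, not necessarily the post's):
# randomize the next subject in proportion to a tempered power of Pr(best),
# which damps the swings in allocation early in the trial.
weights = p_best ** 0.5
randomization_probs = weights / weights.sum()
```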

Continuous monitoring

By monitoring the performance of each treatment arm, we can drop a poorly performing arm and assign subjects to the better treatment arms. This is particularly important for multi-arm trials. We want to weed out the poor treatments quickly so we can focus on the more promising treatments.

Continuous monitoring also opens the possibility of stopping a trial early if there is a clear winner. If all treatments perform similarly, more patients will be needed to tell them apart; the maximum number of patients is enrolled only if necessary.

Multi-arm trials

Randomizing an equal number of patients to each of several treatment arms would require a lot of subjects. A multi-arm adaptive trial turns into a two-arm trial once the other arms are dropped. We’ll present simulation results below that demonstrate this.

Running a big trial with several treatment arms could be more cost effective than running several smaller trials because there is a certain fixed cost associated with running any trial, no matter how small: protocol review, IRB approval, etc.

There has been some skepticism about whether two-arm adaptively randomized trials live up to their hype. Trial design is a multi-objective optimization problem, and it’s easy to claim victory by doing better by one criterion while doing worse by another. In my opinion, adaptive randomization is more promising for multi-arm trials than for two-arm trials.

In my experience, multi-arm trials benefit more from early stopping than from adapting randomization probabilities. That is, one may treat more patients effectively by randomizing equally but dropping poorly performing treatments. Instead of reducing the probability of assigning patients to a poor treatment arm, continue to randomize equally so you can more quickly gather enough evidence to drop the arm.

I initially thought that gradually decreasing the randomization probability of a poorly performing arm would be better than keeping the randomization probability equal until it suddenly drops to zero. But experience suggests this intuition was wrong.

Simulation study

I designed a 4-arm trial with a uniform prior on the probability of response on each arm. The maximum accrual was set to 400 patients.

An arm is suspended if the posterior probability of it being best drops below 0.05. (Note: A suspended arm is not necessarily closed. It may become active again in response to more data.)

Subjects are randomized equally to all available arms. If only one arm is available, the trial stops. Each trial was simulated 1,000 times.
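Here’s a sketch of this design in code, under a couple of assumptions the description above doesn’t spell out: subjects accrue one at a time, each binary response is observed before the next subject is randomized, and the posterior probabilities are estimated from a fixed number of Monte Carlo draws.

```python
import numpy as np

def simulate_trial(true_response, max_n=400, threshold=0.05,
                   n_draws=5_000, rng=None):
    """One simulated trial of the design above: uniform priors, equal
    randomization among active arms, arms suspended (not closed) when their
    posterior probability of being best falls below `threshold`, and early
    stopping when only one arm remains active."""
    rng = rng or np.random.default_rng()
    k = len(true_response)
    successes = np.zeros(k, dtype=int)
    failures = np.zeros(k, dtype=int)

    for _ in range(max_n):
        # Posterior draws under independent Beta(1, 1) priors
        draws = rng.beta(1 + successes, 1 + failures, size=(n_draws, k))
        p_best = np.bincount(np.argmax(draws, axis=1), minlength=k) / n_draws

        # Suspended arms can re-enter later if the data shift in their favor
        active = np.flatnonzero(p_best >= threshold)
        if len(active) <= 1:
            break  # only one arm still available: stop the trial early

        # Randomize the next subject equally among the active arms
        arm = rng.choice(active)
        if rng.random() < true_response[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1

    return successes + failures  # patients assigned to each arm

# First scenario below (true response rates 0.3, 0.4, 0.5, 0.6). The post used
# 1,000 replications; fewer are used here to keep the example quick.
per_arm = np.mean([simulate_trial([0.3, 0.4, 0.5, 0.6]) for _ in range(100)], axis=0)
print(per_arm, per_arm.sum())
```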

In the first scenario, I assume the true probabilities of successful response on the treatment arms are 0.3, 0.4, 0.5, and 0.6 respectively. The treatment arm with 30% response was dropped early in 99.5% of the simulations, and on average only 12.8 patients were assigned to this treatment.

|-----+----------+----------------+----------|
| Arm | Response | Pr(early stop) | Patients |
|-----+----------+----------------+----------|
|   1 |      0.3 |          0.995 |     12.8 |
|   2 |      0.4 |          0.968 |     25.7 |
|   3 |      0.5 |          0.754 |     60.8 |
|   4 |      0.6 |          0.086 |     93.9 |
|-----+----------+----------------+----------|

An average of 193.2 patients were used out of the maximum accrual of 400. Note that 80% of the subjects were allocated to the two best treatments.

Here are the results for the second scenario. Note that in this scenario there are two bad treatments and two good treatments. As we’d hope, the two bad treatments are dropped early and the trial concentrates on deciding which of the two good treatments is better.

|-----+----------+----------------+----------|
| Arm | Response | Pr(early stop) | Patients |
|-----+----------+----------------+----------|
|   1 |     0.35 |          0.999 |     11.7 |
|   2 |     0.45 |          0.975 |     22.5 |
|   3 |     0.60 |          0.502 |     85.6 |
|   4 |     0.65 |          0.142 |    111.0 |
|-----+----------+----------------+----------|

An average of 230.8 patients were used out of the maximum accrual of 400. Now 85% of patients were assigned to the two best treatments. More patients were used in this scenario because the two best treatments were harder to tell apart.

One thought on “Multi-arm adaptively randomized clinical trials”

  1. I’ve been thinking about this post in the context of my prior work developing and testing “new physics” sensors. Each candidate material goes through two rounds of testing, the first to determine fundamental material behaviors, the second to test prototype sensor designs. Both test rounds are often destructive, and the materials are always expensive and often rare, so it is important to get as much useful data as possible before running out of material and/or the time to process/fabricate/test it.

    The test list includes temperature, shock/vibration, voltage, current, pressure, light, RF, radiation (alpha, beta, gamma, neutron) and more. It is most cost-effective to test in batches, but that risks a single extreme event ruining an entire batch. It is time-efficient to run multiple tests in parallel (including both batches and tests).

    We would create batch sizes and test sequences by the seat of our pants, relying primarily on experience from prior testing and theoretical assessments/models of the material. The goal, of course, was to have as many samples as possible survive the entire process, while testing them hard enough to warrant continued effort.

    We tweaked our testing as devices failed or failed to fail: we wanted to test some to destruction, but only when doing so generated useful data. We often reached “crunch points” when early failure rates were unexpectedly high, and we wondered how much and what kinds of testing were warranted for the remaining precious samples.

    Despite the extensive test preparation effort, we never simulated our test plans prior to implementing them. I now see that as a significant lost opportunity.

    If we view the samples as patients and the tests as therapies, and assign probability distributions to our test stress estimates, we would have been both more concretely aware of the higher risks and better prepared to react to them.
