Posts tagged as:

Bayesian

Four pillars of Bayesian statistics

by John on April 7, 2009

Anthony O’Hagan’s book Bayesian Inference lists four basic principles of Bayesian statistics at the end of the first chapter:

  1. Prior information. Bayesian statistics provides a systematic way to incorporate what is known about parameters before an experiment is conducted. As a colleague of mine says, if you’re going to measure the distance to the moon, you know not to pick up a yard stick. You always know something before you do an experiment.
  2. Subjective probability. Some Bayesians don’t agree with the subjective probability interpretation, but most do, in practice if not in theory. If you write down reasonable axioms for quantifying degrees of belief, you inevitably end up with Bayesian statistics.
  3. Self-consistency. Even critics of Bayesian statistics acknowledge that Bayesian statistics has a rigorous self-consistent foundation. As O’Hagan says in his book, the difficulties with Bayesian statistics are practical, not foundational, and the practical difficulties are being resolved.
  4. No adhockery. Bruno de Finetti coined the term “adhockery” to describe the profusion of frequentist methods. More on this below.

This year I’ve had the chance to teach a mathematical statistics class primarily focusing on frequentist methods. Teaching frequentist statistics has increased my appreciation for Bayesian statistics. In particular, I better understand the criticism of frequentist adhockery.

For example, consider point estimation. Frequentist statistics to some extent has standardized on minimum variance unbiased estimators as the gold standard. But why? And what do you do when such estimators don’t exist?

Why focus on unbiased estimators? Granted, lack of bias sounds like a good thing to have. All things being equal, it would be better to be unbiased than biased. But all things are not equal. Sometimes unbiased estimators are ridiculous. Why only consider biased vs. unbiased rather, a binary choice, rather than degree of bias, a continuous choice? Efficiency is also important, and someone may reasonably accept a small amount of bias in exchange for a large increase in efficiency.

Why minimize expected mean squared error? Efficiency in classical statistics is typically measured by expected mean squared error. But why not minimize some other measure of error? Why use an exponent of 2 and not 1, or 4, or 2.738? Or why limit yourself to power functions at all? The theory is simplest for squared error, and while this is a reasonable choice in many applications, it is still an arbitrary choice.

How much emphasis should be given to robustness? Once you consider robustness, there are infinitely many ways to compromise between efficiency and robustness.

Many frequentists are asking the same questions and are investigating alternatives. But I believe these alternatives are exactly what de Finetti had in mind: there are an infinite number of ad hoc choices you can make. Bayesian methods are criticized because prior distributions are explicitly subjective. But there are myriad subjective choices that go into frequentist statistics as well, though these choices are often implicit.

There is a great deal of latitude in Bayesian statistics as well, but the latitude is confined to fit within a universal framework: specify a likelihood and prior distribution, then update the model with data to compute the posterior distribution. There are many ways to construct a likelihood (exactly as in frequentist statistics), many ways to specify a prior, and many ways to summarize the information contained in the posterior distribution. But the basic framework is fixed. (In fact, the framework is inevitable given certain common-sense rules of inference.)

Related posts:

Probability and information
What is a confidence interval?

{ 8 comments }

Finding the shortest interval with given mass

by John on February 23, 2009

Here’s an elegant little theorem applied in statistics but useful more generally. Suppose you have a density function f(x) with one hump. Suppose a and b are two points on opposite sides of the hump with f(a) = f(b). Then [a, b] is the shortest interval with its mass. That is, any other interval of length b-a will have less mass than the interval [a, b]. (Here the “mass” of an interval is just the integral of f(x) over that interval.)

Suppose we want to find the shortest interval that has a given mass k. Start by imagining a horizontal line sitting on top of the graph of f(x).

graph of gamma(4,1) pdf

Now lower this horizontal line so that it intersects the graph in two places.

graph of gamma(4,1) pdf cut at height 0.1

Draw vertical lines down from these two points of intersection to find their x-coordinates.

graph showing how to find the x coordinates of the two intersection points

In this example, the two x-coordinates are about 1.30 and 5.77. So the interval [1.30, 5.77] is the shortest interval with its mass. In other words, no other interval of length 4.47 can contain more mass than this interval does.

We can find the shortest interval of mass k by lowering this horizontal line until the interval it defines has mass k. The lower the horizontal line, the greater the mass. So for any given mass less than the total mass f(x) assigns, there is a unique height of the horizontal line that defines an interval with that mass.

This procedure could be used to find the shortest confidence interval or the shortest Bayesian credible interval. In that case the “mass” is probability, and the task is to find the shortest interval containing a specified probability. The theorem says that the shortest confidence interval or credible interval has equal probability density at each end of the interval.

A proof of this theorem is given in Statistical Inference, chapter 9. Technically, f(x) must be unimodal and positive with finite integral. A homework exercise in the same chapter outlines a simpler proof using the additional assumption that f(x) is continuous.

Related post: What is a confidence interval?

{ 0 comments }

What is a confidence interval?

by John on February 3, 2009

At first glance, a confidence interval is simple. If we say [3, 4] is a 95% confidence interval for a parameter θ, then there’s a 95% chance that θ is between 3 and 4. That explanation is not correct, but it works better in practice than in theory.

If you’re a Bayesian, the explanation above is correct if you change the terminology from “confidence” interval to “credible” interval. But if you’re a frequentist, you can’t make probability statements about parameters.

Confidence intervals take some delicate explanation. I took a look at Andrew Gelman and Deborah Nolan’s book Teaching Statistics: a bag of tricks,  to see what they had to say about teaching confidence intervals. They begin their section on the topic by saying “Confidence intervals are complicated …” That made me feel better. Some folks with more experience teaching statistics also find this challenging to teach. And according to The Lady Testing Tea, confidence intervals were controversial when they were first introduced.

From a frequentist perspective, confidence intervals are random, parameters are not, exactly the opposite of what everyone naturally thinks. You can’t talk about the probability that θ is in an interval because θ isn’t random. But in that case, what good is a confidence interval? As L. J. Savage once said,

The only use I know for a confidence interval is to have confidence in it.

In practice, people don’t go too wrong using the popular but technically incorrect notion of a confidence interval. Frequentist confidence intervals often approximate Bayesian credibility intervals; the frequentist approach is more useful in practice than in theory.

It’s interesting to see a sort of détente between frequentist and Bayesian statisticians. Some frequentists say that the Bayesian interpretation of statistics is nonsense, but the methods these crazy Bayesians come up with often have good frequentist properties. And some Bayesians say that frequentist methods, such as confidence intervals, are useful because they can come up with results that often approximate Bayesian results.

{ 2 comments }

Hardly anyone cares about statistics directly. People more often care about decisions they need to make with the help of statistics. This suggests that the statistics and decision-making process should be explicitly integrated. The name for this integrated approach is “decision theory.” Problems in decision theory are set up with the goal of maximizing “utility,” the benefit you expect to get from a decision. Equivalently, problems are set up to minimize expected cost. Cost may be a literal monetary cost, but it could be some other measure of something you want to avoid.

I was at a conference this morning where David Draper gave an excellent talk entitled Bayesian Decision Theory in Biostatistics: the Utility of Utility.  Draper presented an example of selecting variables for a statistical model. But instead of just selecting the most important variables in a purely statistical sense, he factored in the cost of collecting each variable. So if two variables make nearly equal contributions to a model, for example, the procedure would give preference to the variable that is cheaper to collect. In short, Draper recommended a cost-benefit analysis rather than the typical (statistical) benefit-only analysis. Very reasonable.

Why don’t people always take this approach? One reason is that it’s hard to assign utilities to outcomes. Dollar costs are often easy to account for, but it can be much harder to assign values to benefits. For example, you have to ask “Benefit for whom?” In a medical context, do you want to maximize the benefit to patients? Doctors? Insurance companies? Tax payers? Regulators? Statisticians? If you want to maximize some combination of these factors, how do you weight the interests of the various parties?

Assigning utilities is hard work, and you can never make everyone happy. No matter how good of a job you do, someone will criticize you. Nearly everyone agrees in the abstract that considering utilities is the way to go, but in practice it is hardly ever done. Anyone who proposes a way to quantify utility is immediately shot down by people who have a better idea. The net result is that rather than using a reasonable but  imperfect idea of utility, no utility is used at all. Or rather no explicit definition of utility is used. There is usually some implicit idea of utility, chosen for mathematical convenience, and that one wins by default. In general, people much prefer to leave utilities implicit.

In the Q&A after his talk, Draper said something to the effect that the status quo persists for a very good reason: thinking is hard work, and it opens you up to criticism.

{ 3 comments }

The probability that Shakespeare wrote a play

by John on December 18, 2008

Some people object to asking about the probability that Shakespeare wrote this or that play. One objection is that someone has already written the play, either Shakespeare or someone else. If Shakespeare wrote it, then the probability is one that he did. Otherwise the probability is zero. By this reasoning, no one can make probability statements about anything that has already happened. Another objection is that probability only applies to random processes. We cannot apply probability to questions about document authorship because documents are not random.

I just ran across a blog post by Ted Dunning that weighs in on this question. He writes

The statement “It cannot be probability …” is essentially a tautology. It should read, “We cannot use the word probability to describe our state of knowledge because we have implicitly accepted the assumption that probability cannot be used to describe our state of knowledge”.

He goes on to explain that if we think about statements of knowledge in terms of probabilities, we get a consistent system, so we might as well reason as if it’s OK to use probability theory.

The uncertainty about the authorship of the play does not exist in history — an omniscient historian would know who wrote it. Nor does it exist in nature — the play was not created by a random process. The uncertainty is in our heads. We don’t know who wrote it. But if we use numbers to represent our uncertainty, and we agree to certain common-sense axioms about how this should be done, we inevitably get probability theory.

As E. T. Jaynes once put it, “probabilities do not describe reality — only our information about reality.”

{ 1 comment }

Three math diagrams

by John on October 24, 2008

Here are three pages containing diagrams that each summarize several theorems.

The distribution relationships page summarizing around 40 relationships between around 20 statistical distributions.

The modes of convergence page has three diagrams like the following that explain when one kind of convergence implies another: almost everywhere, almost uniform, Lp, and convergence in measure.

The page conjugate prior relationships is similar to the probability distribution page, but concentrates on relationships in Bayesian statistics.

What are some of your favorite diagrams? Do you have any suggestions for diagrams that you’d like to see someone make?

{ 1 comment }

Diagram of conjugate prior relationships

by John on October 8, 2008

Here is a diagram to summarize some well-known conjugate prior relationships.

See conjugate prior relationships for details regarding distributions and posterior parameters.

{ 0 comments }

The previous posts in this series have looked at P(X > Y), the probability that a sample from a random variable X is greater than a sample from an independent random variable Y. In applications, X and Y have different distributions but come from the same distribution family.

Sometimes applications require computing P(X > max(Y, Z)). For example, an adaptively randomized trial of three treatments may be designed to assign a treatment with probability equal to the probability that that treatment has the best response. In a trial with a binary outcome, the variables X, Y, and Z may be beta random variables representing the probability of response. In a trial with a time-to-event outcome, the variables might be gamma random variables representing survival time.

Sometimes we’re interested in the opposite inequality, P(X < min(Y,Z)). This would be the case if we thought in terms of failures rather than responses, or wanted to minimize the time to a desirable event rather than maximizing the time to an undesirable event.

The maximum and minimum inequalities are related by the following equation:

P(X < min(Y,Z)) = P(X > max(Y, Z)) + 1 – P(X > Y) – P(X > Z).

These inequalities are used for safety monitoring rules as well as to determine randomization probabilities. In a trial seeking to maximize responses, a treatment arm X might be dropped if P(X > max(Y,Z)) becomes too small.

In principle one could design an adaptively randomized trial with n treatment arms for any n ≥ 2 based on P(X1 > max(X2, …, Xn)). In practice, the most common value of n by far is 2. Sometimes n is 3. I’m not familiar with an adaptively randomized trial with more than three arms. I’ve heard of an adaptively randomized trial that was designed with five arms, but I don’t believe the trial ran.

Computing P(X1 > max(X2, …, Xn)) by numerical integration becomes more difficult as n increases. For large n, simulation may be more efficient than integration. Computing P(X1 > max(X2, …, Xn)) for gamma random variables with n=3 was unacceptably slow in a previous version of our adaptive randomization software. The search for a faster algorithm lead to this paper: Numerical Evaluation of Gamma Inequalities.

Previous posts on random inequalities:

Introduction
Analytical results
Numerical results
Cauchy distributions
Beta distributions
Gamma distributions

{ 1 comment }

Bayesian statisticians often talk about models “learning” as data accumulate. Here’s an example that applies information theory to quantify how much you can learn from an experiment using the same likelihood function but two different priors: a conjugate prior and a robust prior.

Here’s an example from a paper Luis Pericchi and I wrote recently. Suppose X ~ Normal(θ, 1) where the prior on θ is either a standard Cauchy distribution or a normal distribution with mean 0 and variance 2.19. (The variance on the normal was chosen following an example by Jim Berger so that both priors put half their mass on the interval [-1, 1].)

The expected information gain from a single observation using the normal (conjugate) prior was 0.58. The corresponding gain for the Cauchy (robust) prior was 1.20. Because robust priors are more responsive to data, the expected gain in information is larger (in this case twice as large) when using these priors.

{ 0 comments }

Valen Johnson and I recently posted a working paper on a method for stopping trials of ineffective drugs earlier. For Bayesians, we argue that our method is more consistently Bayesian than other methods in common use. For frequentists, we show that our method has better frequentist operating characteristics than the most commonly used safety monitoring method.

The paper looks at binary and time-to-event trials. The results are most dramatic for the time-to-event analog of the Thall-Simon method, the Thall-Wooten method, as shown below.

This graph plots the probability of concluding that an experimental treatment is inferior when simulating from true mean survival times ranging from 2 to 12 months. The trial is designed to test a null hypothesis of 6 months mean survival against an alternative hypothesis of 8 months mean survival. When the true mean survival time is less than the alternative hypothesis of 8 months, the Bayes factor design is more likely to stop early. And when the true mean survival time is greater than the alternative hypothesis, the Bayes factor method is less likely to stop early.

The Bayes factor method also outperforms the Thall-Simon method for monitoring single-arm trials with binary outcomes. The Bayes factor method stops more often when it should and less often when it should not. However, the difference in operating characteristics is not as pronounced as in the time-to-event case.

The paper also compares the Bayes factor method to the frequentist mainstay, the Simon two-stage design. Because the Bayes factor method uses continuous monitoring, the method is able to use fewer patients while maintaining the type I and type II error rates of the Simon design as illustrated in the graph below.

bayes factor vs simon two-stage designs

The graph above plots the number of patients used in a trial testing a null hypothesis of a 0.2 response rate against an alternative of a 0.4 response rate. Design 8 is the Bayes factor method advocated in the paper. Designs 7a and 7b are variations on the Simon two-stage design. The horizontal axis gives the true probabilities of response. We simulated true probabilities of response varying from 0 to 1 in increments of 0.05. The vertical axis gives the number of patients treated before the trial was stopped. When the true probability of response is less than the alternative hypothesis, the Bayes factor method treats fewer patients. When the true probability of response is better than the alternative hypothesis, the Bayes factor method treats slightly more patients.

Design 7a is the strict interpretation of the Simon method: one interim look at the data and another analysis at the end of the trial. Design 7b is the Simon method as implemented in practice, stopping when the criteria for continuing cannot be met at the next analysis. (For example, if the design says to stop if there are three or fewer responses out of the first 15 patients, then the method would stop after the 12th patient if there have been no responses.) In either case, the Bayes factor method uses fewer patients. The rejection probability curves, not shown here, show that the Bayes factor method matches (actually, slightly improves upon) the type I and type II error rates for the Simon two-stage design.

{ 3 comments }

Random inequalities V: beta distributions

by John on August 21, 2008

I’ve put a lot of effort into writing software for evaluating random inequality probabilities with beta distributions because such inequalities come up quite often in application. For example, beta inequalities are at the heart of the Thall-Simon method for monitoring single-arm trials and adaptively randomized trials with binary endpoints.

It’s not easy to evaluate P(X > Y) accurately and efficiently when X and Y are independent random variables. I’ve seen several attempts that were either inaccurate or slow, including a few attempts on my part. Efficiency is important because this calculation is often in the inner loop of a simulation study. Part of the difficulty is that the calculation depends on four parameters and no single algorithm will work well for all parameter combinations.

Let g(a, b, c, d) equal P(X > Y) where X ~ beta(a, b) and Y ~ beta(c, d). Then the function g has several symmetries.

  • g(a, b, c, d) = 1 – g(c, d, a, b)
  • g(a, b, c, d) = g(d, c, b, a)
  • g(a, b, c, d) = g(d, b, c, a)

The first two relations were published by W. R. Thompson in 1933, but as far as I know the third relation first appeared in this technical report in 2003.

For special values of the parameters, the function g(a, b, c, d) can be computed in closed form. Some of these special cases are when

  • one of the four parameters is an integer
  • a + b + c + d = 1
  • a + b = c + d = 1.

The function g(a, b, c, d) also satisfies several recurrence relations that make it possible to bootstrap the latter two special cases into more results. Define the beta function B(a, b) as Γ(a, b)/(Γ(a) Γ(b)) and define h(a, b, c, d) as B(a+c, b+d)/( B(a, b) B(c, d) ). Then the following recurrence relations hold.

  • g(a+1, b, c, d) = g(a, b, c, d) + h(a, b, c, d)/a
  • g(a, b+1, c, d) = g(a, b, c, d) – h(a, b, c, d)/b
  • g(a, b, c+1, d) = g(a, b, c, d) – h(a, b, c, d)/c
  • g(a, b, c, d+1) = g(a, b, c, d) + h(a, b, c, d)/d

For more information about beta inequalities, see these papers:

Numerical computation of stochastic inequality probabilities
Exact calculation of beta inequalities

Previous posts on random inequalities:

Introduction
Analytical results
Numerical results
Cauchy distributions

{ 0 comments }

Tomorrow morning I’m giving a talk on how to subject fewer patients to ineffective treatment in clinical trials. I should have used something like the title of this post as the title of my talk, but instead my talk is called “Clinical Trial Monitoring With Bayesian Hypothesis Testing.” Classic sales mistake: emphasizing features rather than benefits. But the talk is at a statistical conference, so maybe the feature-oriented title isn’t so bad.

Ethical concerns are the main consideration that makes biostatistics a separate branch of statistics. You can’t test experimental drugs on people the way you test experimental fertilizers on crops. In human trials, you want to stop the trial early if it looks like the experimental treatment is not as effective as a comparable established treatment, but you want to keep going if it looks like the new treatment might be better. You need to establish rules before the trial starts that quantify exactly what it means to look like a treatment is doing better or worse than another treatment. There are a lot of ways of doing this quantification, and some work better than others. Within its context (single-arm phase II trials with binary or time-to-event endpoints) the method I’m presenting stops ineffective trials sooner than the methods we compare it to while stopping no more often in situations where you’d want the trial to continue.

If you’re not familiar with statistics, this may sound strange. Why not always stop when a treatment is worse and never stop when it’s better? Because you never know with certainty that one treatment is better than another. The more patients you test, the more sure you can be of your decision, but some uncertainty always remains. So you face a trade-off between being more confident of your conclusion and experimenting on more patients. If you think a drug is bad, you don’t want to treat thousands more patients with it in order to be extra confident that it’s bad, so you stop. But you run the risk of shutting down a trial of a treatment that really is an improvement but by chance appeared to be worse at the time you made the decision to stop. Statistics is all about such trade-offs.

{ 1 comment }

My previous post introduced random inequalities and their application to Bayesian clinical trials. This post will discuss how to evaluate random inequalities analytically. The next post in the series will discuss numerical evaluation when analytical evaluation is not possible.

For independent random variables X and Y, how would you compute P(X>Y), the probability that a sample from X will be larger than a sample from Y? Let fX be the probability density function (PDF) of X and let FX be the cumulative distribution function (CDF) of X. Define fY and FY similarly. Then the probability P(X > Y) is the integral of fX(x) fY(y) over the part of the x-y plane below the diagonal line x = y.

\begin{eqnarray*} P(X \geq Y) &=& \int \!\int _{[x > y]} f_X(x) f_Y(y) \, dy \, dx \\ &=& \int_{-\infty}^\infty \! \int_{-\infty}^x f_X(x) f_Y(y) \, dy \, dx \\ &=& \int_{-\infty}^\infty f_X(x) F_Y(x) \, dx \end{eqnarray*}

This result makes intuitive sense: fX(x) is the density for x and FY(x)  is the probability that Y is less than x. Sometimes this integral can be evaluated analytically, though in general it must be evaluated numerically. The technical report Numerical computation of stochastic inequality probabilities explains how P(X > Y) can be computed in closed form for several common distribution families as well as how to evaluate inequalities involving other distributions numerically.

Exponential: If X and Y are exponential random variables with mean μX and μY respectively, then P(X > Y) = μX/(μX + μY).

Normal: If X and Y are normal random variables with mean and standard deviation (μX, σX) and (μY, σY) respectively, then P(X > Y) = Φ((μX – μY)/√(σX2 + σY2)) where Φ is the CDF of a standard normal distribution.

Gamma:  If X and Y are gamma random variables with shape and scale (αX, βX) and (αY, βY) respectively, then P(X > Y) = IxX/(βX + βY)) where Ix is the incomplete beta function with parameters αY and αX, i.e. the CDF of a beta distribution with parameters αY and αX.

The inequality P(X > Y) where X and Y are beta random variables comes up very often in applications. This inequality cannot be computed in closed form in general, though there are closed-form solutions for special values of the beta parameters. If X ~ beta(a, b) and Y ~ beta(c, d), the probability P(X > Y) can be evaluated in closed form if (1) one of the parameters a, b, c, or d is an integer, (2) a + b + c + d = 1, or (3) a + b = c + d = 1. These last two cases can be combined with a recurrence relation to compute other probabilities. See Exact calculation of beta inequalities for more details.

Sometimes you need to calculate P(X > max(Y, Z)) for three independent random variables. This comes up, for example, when computing adaptive randomization probabilities for a three-arm clinical trial. For a time-to-event trial as implemented here, the random variables have a gamma distribution. See Numerical evaluation of gamma inequalities for analytical as well as numerical results for computing P(X > max(Y, Z)) in that case.

The next post in this series will discuss how to evaluate random inequalities numerically when closed-form integration is not possible.

Update: See Part IV of this series for results with the Cauchy distribution.

{ 0 comments }

Random inequalities I: introduction

by John on July 26, 2008

Many Bayesian clinical trial methods have at their core a random inequality. Some examples from M. D. Anderson: adaptive randomization, binary safety monitoring, time-to-event safety monitoring. These method depends critically on evaluating P(X > Y) where X and Y are independent random variables. Roughly speaking, P(X > Y) is the probability that the treatment represented by X is better than the treatment represented by Y. In a trial with binary outcomes, X and Y may be the posterior probabilities of response on each treatment. In a trial with time-to-event outcomes, X and Y may be posterior probabilities of median survival time on two treatments.

People often have a little difficulty understanding what P(X > Y) means. What does it mean? If we take a sample from X and a random sample from Y, P(X >Y) is the probability that the former is larger than the latter. Most confusion around random inequalities comes from thinking of random variables as constants, not random quantities. Here are a couple examples.

First, suppose X and Y have normal distributions with standard deviation 1. If X has mean 4 and Y has mean 3, what is P(X > Y)? Some would say 1, because X is bigger than Y. But that’s not true. X has a larger mean than Y, but fairly often a sample from Y will be larger than a sample from X.  P(X > Y) = 0.76 in this case.

Next, suppose X and Y are identically distributed. Now what is P(X > Y)? I’ve heard people say zero because the two random variables are equal. But they’re not equal. Their distribution functions are equal but the two random variables are independent. P(X > y) = 1/2 by symmetry.

I believe there’s a psychological tendency to underestimate large inequality probabilities. (I’ve had several discussions with people who would not believe a reported inequality probability until they computed it themselves. These discussions are important when the decision whether to continue a clinical trial hinges on the result.) For example, suppose X and Y represent the probability of success in a trial in which there were 17 successes out of 30 on X and 12 successes out of 30 on Y. Using a beta distribution model, the density functions of X and Y are given below.

beta inequality graph

The density function for X is essentially the same as Y but shifted to the right. Clearly P(X > Y) is greater than 1/2. But how much greater than a half? You might think not too much since there’s a lot of mass in the overlap of the two densities. But P(X > Y) is a little more than 0.9.

The image above and the numerical results mentioned in this post were produced by the Inequality Calculator software.

Part II will discuss analytically evaluating random inequalities. Part III will discuss numerically evaluating random inequalities.

{ 5 comments }

Yesterday I gave a presentation on designing clinical trials using adaptive randomization software developed at M. D. Anderson Cancer Center. The heart of the presentation is summarized in the following diagram.

Diagram of three methods of tuning adaptively randomized trial designs

(A slightly larger and clearer version if the diagram is available here.)

Traditional randomized trials use equal randomization (ER). In a two-arm trial, each treatment is given with probability 1/2. Simple adaptive randomization (SAR) calculates the probability that a treatment is the better treatment given the data seen so far and randomizes to that treatment with that probability. For example, if it looks like there’s an 80% chance that Treatment B is better, patients will be randomized to Treatment B with probability 0.80. Myopic optimization (MO) gives each patient what appears to be the best treatment given the available data with no randomization.

Myopic optimization is ethically appealing, but has terrible statistical properties. Equal randomization has good statistical properties, but will put the same number of patients on each treatment, regardless of the evidence that one treatment is better. Simple adaptive randomization is a compromise position, retaining much of the power of equal randomization while also treating more patients on the better treatment on average.

The adaptive randomization software provides three ways of compromising between the operating characteristics ER and SAR.

  1. Begin the trial with a burn-in period of equal randomization followed by simple adaptive randomization.
  2. Use simple adaptive randomization, except if the randomization probability drops below a certain threshold, substitute that minimum value.
  3. Raise the simple adaptive randomization probability to a power between 0 and 1 to obtain a new randomization probability.

Each of these three approaches reduces to ER at one extreme and SAR at the other. In between the extremes, each produces a design with operating characteristics somewhere between those of ER and SAR.

In the first approach, if the burn-in period is the entire trial, you simply have an ER trial. If there is no burn-in period, you have an SAR trial. In between you could have a burn-in period equal to some percentage of the total trial between 0 and 100%. A burn-in period of 20% is typical.

In the second approach, you could specify the minimum randomization probability as 0.5, negating the adaptive randomization and yielding ER. At the other extreme, you could set the minimum randomization probability to 0, yielding SAR. In between you could specify some non-zero randomization probability such as 0.10.

In the third approach, a power of zero yields ER. A power of 1 yields SAR. Unlike the other two approaches, this approach could yield designs approaching MO by using powers larger than 1. This is the most general approach since it can produce a continuum of designs with characteristics ranging from ER to MO. For more on this approach, see Understanding the exponential tuning parameter in adaptively randomized trials.

So with three methods to choose from, which one do you use? I did some simulations to address this question. I expected that all three methods would perform about the same. However, this is not what I found. To read more, see Comparing methods of tuning adaptive randomized trials.

{ 1 comment }