Universal confidence interval

Here’s a way to find a 95% confidence interval for any parameter θ.

  • With probability 0.95, return the real line.
  • With probability 0.05, return the empty set.

Clearly 95% of the time this procedure will return an interval that contains θ.

This example shows the difference between a confidence interval and a credible interval.

A 95% credible interval is an interval such that the probability is 95% that the parameter is in the interval. That’s what people think a 95% confidence interval is, but it’s not.

Suppose I give you a confidence interval using the procedure above. The probability that θ is in the interval is 1 if I return the real line and 0 if I return the empty set. In either case, the interval that I give you tells you absolutely nothing about θ.

But if I give you a 95% credible interval (a, b), then given the model and the data that went into it, the probability is 95% that θ is in the interval (a, b).

Confidence intervals are more useful in practice than in theory because they often approximately correspond to a credible interval under a reasonable model.

Credible intervals depend on your modeling assumptions. So do confidence intervals.

Confidence interval widths

Suppose you do N trials of something that can succeed or fail. After your experiment you want to present a point estimate and a confidence interval. Or if you’re a Bayesian, you want to present a posterior mean and a credible interval. The numerical results hardly differ, though the two interpretations differ.

If you got half successes, you will report a confidence interval centered around 0.5. The more unbalanced your results were, the smaller your confidence interval will be. That is, the confidence interval will be smallest if you had no successes and widest if you had half successes.

What can we say about how the width of your confidence varies as a function of your point estimate p? That question is the subject of this post [1].

Frequentist confidence interval width

We’ll start with an example, where N = 1,000,000. We let our number of observed successes n vary from 0 to 500,000 and plot the width of a 95% confidence interval as a function of n on a log-log plot. (By symmetry we only need to look to up to N/2. If you have more than half successes, reverse your definitions of success and failure.)

The result is almost a perfectly straight line, with the exception of a tiny curve at the end. This says the log of the confidence interval is a linear function of the log of n, or in other words, the dependence of confidence interval width on n follows a power law.

Bayesian credible interval width

The plot above measures frequentist confidence intervals, using Wilson score with continuity correction. We can repeat our experiment, putting on our Bayesian hat, and compute credible intervals using a Jeffreys prior, that is, a beta(0.5, 0.5) prior.

The results are indistinguishable. The confidence interval widths differ after the sixth decimal place.

In this example, there’s very little quantitative difference between a frequentist confidence interval and a Bayesian credible interval. The difference is primarily one of interpretation. The Bayesian interpretation makes sense and the frequentist interpretation doesn’t [2].

Power law

If the logs of two things are related linearly [3], then the things themselves are related by a power law. So if confidence/credible interval width varies as a power law with the point estimate p, what is that power law?

The plots above suggest that to a good approximation, if we let w be the credible interval length,

log w = m log p + b

for some slope m and intercept b.

We can estimate the slope by choosing p = 10-1 and p = 10-5. This gives us m = 0.4925 and b = -5.6116. There are theoretical reasons to expect that m should be 0.5, so we’ll round 0.4925 up to 0.5 both for convenience and for theory.

So

log w = 0.5 log p – 5.6116.

and so

w = 0.00366 √p

Let’s see how well this works by comparing it to the exact interval length. (I’m using Jeffreys’ credible interval, not that it matters.)

The power law approximation works well when the estimated proportion p is small, say less than 0.2. For larger p the normal approximation works well.

We could have guessed that the confidence interval width was proportional to √p(1-p) based on the normal approximation/central limit theorem. But the normal approximation is not uniformly good over the range of all ps.

Related posts

[1] Standard notation would put a little hat on top of the p but HTML doesn’t make this possible without inserting images into text. I hope you don’t mind if I take my hat off.

[2] This quip is a little snarky, but it’s true. When I would ask graduate students to describe what a confidence interval is, they would basically give me the definition of a credible interval. Most people who calculate frequentist confidence intervals think they’re calculating Bayesian credible intervals, though they don’t realize it. The definition of confidence interval is unnatural and convoluted. But confidence intervals work better in practice than in theory because they approximate credible intervals.

L. J. Savage once said

“The only use I know for a confidence interval is to have confidence in it.”

In other words, the value of a confidence interval is that you can often get away with interpreting it as a credible interval.

[3] Unfortunately the word “linear” is used in two inconsistent ways. Technically we should say “affine” when describing a function y = mx + b, and I wish we would. But that word isn’t commonly known. People call this relation linear because the graph is a line. Which makes sense, but it’s inconsistent with the use of “linear” as in “linear algebra.”

 

Scaled Beta2 distribution

I recently ran across a new probability distribution called the “Scaled Beta2” or “SBeta2” for short in [1].

For positive argument x and for positive parameters p, q, and b, its density is

\frac{\Gamma(p+q)}{b\,\Gamma(p) \, \Gamma(q)}\, \frac{\left(\frac{x}{b}\right)^{p-1}}{\left(\left(\frac{x}{b}\right) + 1 \right)^{p+q}}

This is a heavy-tailed distribution. For large x, the probability density is O(xq-1), the same as a Student-t distribution with q degrees of freedom.

The authors in [1] point out several ways this distribution can be useful in application. It can approximate a number of commonly-used prior distributions while offering advantages in terms of robustness and analytical tractability.

Related posts

[1] The Scaled Beta2 Distribution as a Robust Prior for Scales. María-Eglée Pérez, Luis Raúl Pericchi, Isabel Cristina Ramírez. Available here.

Predictive probability for large populations

Suppose you want to estimate the number of patients who respond to some treatment in a large population of size N and what you have is data on a small sample of size n.

The usual way of going about this calculates the proportion of responses in the small sample, then pretends that this is the precisely known proportion, and applies it to the population as a whole. This approach underestimates the uncertainty in the final estimate because it ignores the uncertainty in the initial estimate.

A more sophisticated approach creates a distribution representing what is known about the proportion of successes based on the small sample and any prior information. Then it estimates the posterior predictive probability of outcomes in the larger population based on this distribution. For more background on predictive probability, see my post on uncertainty in a probability.

This post will look at ways to compute predictive probabilities using the asymptotic approximation presented in the previous post.

Suppose your estimate on the probability of success, based on your small sample and prior information, has a Beta(α, β) distribution. Then the predictive probability of s successes and f failures is

 {s + f \choose s} \frac{B(s + \alpha, f + \beta)}{B(\alpha, \beta)}

The beta functions in the numerator and denominator are hard to understand and hard to compute, so we would like to replace them with something easier to understand and easier to compute.

We begin by expanding all the terms involving s and f in terms of gamma functions.

\frac{1}{B(\alpha,\beta)} \frac{\Gamma(s+f+1)}{\Gamma(s+1)\, \Gamma(f+1)} \frac{\Gamma(s+\alpha)\, \Gamma(f+\beta)}{\Gamma(s+f+\alpha+\beta)}

By rearranging the terms we can see three examples of the kind of gamma function ratios estimated in the previous post.

\frac{1}{B(\alpha,\beta)} \frac{\Gamma(s+\alpha)}{\Gamma(s+1)} \frac{\Gamma(f+\beta)}{\Gamma(f+1)} \frac{\Gamma(s+f+1)}{\Gamma(s+f+\alpha+\beta)}

Now we can apply the simplest approximation from the previous post to get the following.

\frac{s^{\alpha-1} f^{\beta-1} (s +f)^{1 - \alpha-\beta-1}}{B(\alpha, \beta)}

We can make this approximation more symmetric by expanding Beta(α, β) in terms of factorials:

 \frac{(\alpha + \beta-1)!}{(\alpha-1)!\, (\beta-1)!} \,\,\frac{s^{\alpha-1} f^{\beta-1}}{(s+f)^{\alpha+\beta-1}}

We saw in the previous post that we can make our asymptotic approximation much more accurate by shifting the arguments a bit, and the following expression, while a little more complicated, should be more accurate.

\frac{(\alpha + \beta-1)!}{(\alpha-1)!\, (\beta-1)!} \,\, \frac{ \left(s + \tfrac{\alpha}{2}\right)^{\alpha-1} \left(f + \tfrac{\beta}{2}\right)^{\beta-1} }{ \left(s +f + \tfrac{\alpha+\beta}{2}\right)^{\alpha+\beta-1} }

This should be accurate for sufficiently large s and f, but I haven’t tested it numerically and wouldn’t recommend using it without doing some testing first.

Automatic data reweighting

Suppose you are designing an autonomous system that will gather data and adapt its behavior to that data.

At first you face the so-called cold-start problem. You don’t have any data when you first turn the system on, and yet the system needs to do something before it has accumulated data. So you prime the pump by having the system act at first on what you believe to be true prior to seeing any data.

Now you face a problem. You initially let the system operate on assumptions rather than data out of necessity, but you’d like to go by data rather than assumptions once you have enough data. Not just some data, but enough data. Once you have a single data point, you have some data, but you can hardly expect a system to act reasonably based on one datum.

Rather than abruptly moving from prior assumptions to empirical data, you’d like the system to gradually transition from reliance on the former to reliance on the latter, weaning the system off initial assumptions as it becomes more reliant on new data.

The delicate part is how to manage this transition. How often should you adjust the relative weight of prior assumptions and empirical data? And how should you determine what weights to use? Should you set the weight given to the prior assumptions to zero at some point, or should you let the weight asymptotically approach zero?

Fortunately, there is a general theory of how to design such systems. It’s called Bayesian statistics. The design issues raised above are all handled automatically by Bayes theorem. You start with a prior distribution summarizing what you believe to be true before collecting new data. Then as new data arrive, the magic of Bayes theorem adjusts this distribution, putting more emphasis on the empirical data and less on the prior. The effect of the prior gradually fades away as more data become available.

Related posts

Dose finding != dose escalation

You’ll often hear Phase I dose-finding trials referred to as dose escalation studies. This is because simple dose-finding methods can only explore in one direction: they can only escalate.

Three-plus-three rule

The most common dose finding method is the 3+3 rule. There are countless variations on this theme, but the basic idea is that you give a dose of an experimental drug to three people. If all three are OK, you go up a dose next time. If two out of three are OK, you give that dose again. If only one out of three is OK, you stop [1].

Deterministic thinking

The 3+3 algorithm implicitly assumes deterministic thinking, at least in part. The assumption is that if three out of three patients respond well, we know the dose is safe [2].

If you increase the dose level and the next three patients experience adverse events, you stop the trial. Why? Because you know that the new dose is dangerous, and you know the previous dose was safe. You can only escalate because you assume you have complete knowledge based on three samples.

But if we treat three patients at a particular dose level and none have an adverse reaction we do not know for certain that the dose level is safe, though we may have sufficient confidence in its safety to try the next dose level. Similarly, if we treat three patients at a dose and all have an adverse reaction, we do not know for certain that the dose is toxic.

Bayesian dose-finding

A Bayesian dose-finding method estimates toxicity probabilities given the data available. It might decide at one point that a dose appears safe, then reverse its decision later based on more data. Similarly, it may reverse an initial assessment that a dose is unsafe.

A dose-finding method based on posterior probabilities of toxicity is not strictly a dose escalation method because it can explore in two directions. It may decide that the next dose level to explore is higher or lower than the current level.

Starting at the lowest dose

In Phase I studies of chemotherapeutics, you conventionally start at the lowest dose. This makes sense. These are toxic agents, and you naturally want to start at a dose you have reason to believe isn’t too toxic. (NB: I say “too toxic” because chemotherapy is toxic. You hope that it’s toxic to a tumor without being too toxic for the patient host.)

But on closer inspection maybe you shouldn’t start at the lowest dose. Suppose you want to test 100 mg, 200 mg, and 300 mg of some agent. Then 100 mg is the lowest dose, and it’s ethical to start at 100 mg. Now what if we add a dose of 50 mg to the possibilities? Did the 100 mg dose suddenly become unethical as a starting dose?

If you have reason to believe that 100 mg is a tolerable dose, why not start with that dose, even if you add a lower dose in case you’re wrong? This makes sense if you think of dose-finding, but not if you think only in terms of dose escalation. If you can only escalate, then it’s impossible to ever give a dose below the starting dose.

More clinical trial posts

[1] I have heard, but I haven’t been able to confirm, that the 3+3 method has its origin in a method proposed by John Tukey during WWII for testing bombs. When testing a mechanical system, like a bomb, there is much less uncertainty than when testing a drug in a human. In a mechanical setting, you may have a lot more confidence from three samples than you would in a medical setting.

[2] How do you explain the situation where one out of three has an adverse reaction? Is the dose safe or not? Here you naturally switch to probabilistic thinking because deterministic thinking leads to a contradiction.

 

Multi-arm adaptively randomized clinical trials

This post will look at adaptively randomized trial designs. In particular, we want to focus on multi-arm trials, i.e. trials of more than two treatments. The aim is to drop the less effective treatments quickly so the trial can focus on determining which of the better treatments is best.

We’ll briefly review our approach to adaptive randomization but not go into much detail. For a more thorough introduction, see, for example, this report.

Why adaptive randomization

Adaptive randomization designs allow the randomization probabilities to change in response to accumulated outcome data so that more subjects are assigned to (what appear to be) more effective treatments. They also allow for continuous monitoring, so one can stop a trial early if we’re sufficiently confident that we’ve found the best treatment.

Adapting randomization probabilities

Of course we don’t know what treatments are more effective or else we wouldn’t be running a clinical trial. We have an idea based on the data seen so far at any point in the trial, and that assessment may change as more data become available.

We could simply put each patient on what appears to be the best arm (the “play-the-winner” strategy), but this would forgo the benefits of randomization. Instead we compromise, continuing to randomize, but increasing the randomization probability for what appears to be the best treatment at the time a subject enters the trial.

Continuous monitoring

By monitoring the performance of each treatment arm, we can drop a poorly performing arm and assign subjects to the better treatment arms. This is particularly important for multi-arm trials. We want to weed out the poor treatments quickly so we can focus on the more promising treatments.

Continuous monitoring opens the possibility of stopping trials early if there is a clear winner. If all treatments perform similarly, more patients will be used. The maximum number of patients will be enrolled only if necessary.

Multi-arm trials

Randomizing an equal number of patients to each of several treatment arms would require a lot of subjects. A multi-arm adaptive trail turns into a two-arm trial once the other arms are dropped. We’ll present simulation results below that demonstrate this.

Running a big trial with several treatment arms could be more cost effective than running several smaller trials because there is a certain fixed cost associated with running any trial, no matter how small: protocol review, IRB approval, etc.

There has been some skepticism whether two-arm adaptively randomized trials live up to their hype. Trial design is a multi-objective optimization problem and it’s easy to claim victory by doing better by one criteria while doing worse by another. In my opinion, adaptive randomization is more promising for multi-arm trials than two-arm trials.

In my experience, multi-arm trials benefit more from early stopping than from adapting randomization probabilities. That is, one may treat more patients effectively by randomizing equally but dropping poorly performing treatments. Instead of reducing the probability of assigning patients to a poor treatment arm, continue to randomize equally so you can more quickly gather enough evidence to drop the arm.

I initially thought that gradually decreasing the randomization probability of a poorly performing arm would be better than keeping the randomization probability equal until it suddenly drops to zero. But experience suggests this intuition was wrong.

Simulation study

I designed a 4-arm trial with a uniform prior on the probability of response on each arm. The maximum accrual was set to 400 patients.

An arm is suspended if the posterior probability of it being best drops below 0.05. (Note: A suspended arm is not necessarily closed. It may become active again in response to more data.)

Subjects are randomized equally to all available arms. If only one arm is available, the trial stops. Each trial was simulated 1,000 times.

In the first scenario, I assume the true probabilities of successful response on the treatment arms are 0.3, 0.4, 0.5, and 0.6 respectively. The treatment arm with 30% response was dropped early in 99.5% of the simulations, and on average only 12.8 patients were assigned to this treatment.

|-----+----------+----------------+----------|
| Arm | Response | Pr(early stop) | Patients |
|-----+----------+----------------+----------|
|   1 |      0.3 |          0.995 | 12.8     |
|   2 |      0.4 |          0.968 | 25.7     |
|   3 |      0.5 |          0.754 | 60.8     |
|   4 |      0.6 |          0.086 | 93.9     |
|-----+----------+----------------+----------|

An average of 193.2 patients were used out of the maximum accrual of 400. Note that 80% of the subjects were allocated to the two best treatments.

Here are the results for the second scenario. Note that in this scenario there are two bad treatments and two good treatments. As we’d hope, the two bad treatments are dropped early and the trial concentrates on deciding the better of the two good treatment.

|-----+----------+----------------+----------|
| Arm | Response | Pr(early stop) | Patients |
|-----+----------+----------------+----------|
|   1 |     0.35 |          0.999 |     11.7 |
|   2 |     0.45 |          0.975 |     22.5 |
|   3 |     0.60 |          0.502 |     85.6 |
|   4 |     0.65 |          0.142 |    111.0 |
|-----+----------+----------------+----------|

An average of 230.8 patients were used out of the maximum accrual of 400. Now 85% of patients were assigned to the two best treatments. More patients were used in this scenario because the two best treatments were harder to tell apart.

Related posts

The cold start problem

How do you operate a data-driven application before you have any data? This is known as the cold start problem.

We faced this problem all the time when I designed clinical trials at MD Anderson Cancer Center. We uses Bayesian methods to design adaptive clinical trial designs, such as clinical trials for determining chemotherapy dose levels. Each patient’s treatment assignment would be informed by data from all patients treated previously.

But what about the first patient in a trial? You’ve got to treat a first patient, and treat them as well as you know how. They’re struggling with cancer, so it matters a great deal what treatment they are assigned. So you treat them according to expert opinion. What else could you do?

Thanks to the magic of Bayes theorem, you don’t have to have an ad hoc rule that says “Treat the first patient this way, then turn on the Bayesian machine to determine how to treat the next patient.” No, you use Bayes theorem from beginning to end. There’s no need to handle the first patient differently because expert opinion is already there, captured in the form of prior distributions (and the structure of the probability model).

Each patient is treated according to all information available at the time. At first all available information is prior information. After you have data on one patient, most of the information you have is still prior information, but Bayes’ theorem updates this prior information with your lone observation. As more data becomes available, the Bayesian machine incorporates it all, automatically shifting weight away from the prior and toward the data.

The cold start problem for business applications is easier than the cold start problem for clinical trials. First of all, most business applications don’t have the potential to cost people their lives. Second, business applications typically have fewer competing criteria to balance.

What if you’re not sure where to draw your prior information? Bayes can handle that too. You can use Bayesian model selection or Bayesian model averaging to determine which source (or weighting of sources) best fits the new data as it comes in.

Once you’ve decided to use a Bayesian approach, there’s still plenty of work to do, but the Bayesian approach provides scaffolding for that work, a framework for moving forward.

Six sigma events

I saw on Twitter this afternoon a paraphrase of a quote from Nassim Taleb to the effect that if you see a six-sigma event, that’s evidence that it wasn’t really a six-sigma event.

What does that mean? Six sigma means six standard deviations away from the mean of a probability distribution, sigma (σ) being the common notation for a standard deviation. Moreover, the underlying distribution is implicitly a normal (Gaussian) distribution; people don’t commonly talk about “six sigma” in the context of other distributions [1]. Here’s a table to indicate the odds against a k-sigma event for various k.

        |-------+-----------------|
        | Sigma | Odds            |
        |-------+-----------------|
        |     1 | 2 : 1           |
        |     2 | 21 : 1          |
        |     3 | 370 : 1         |
        |     4 | 16,000 : 1      |
        |     5 | 1,700,000 : 1   |
        |     6 | 500,000,000 : 1 |
        |-------+-----------------|

If you see something that according to your assumptions should happen twice in a billion tries, maybe you’ve seen something extraordinarily rare, or maybe your assumptions were wrong. Taleb’s comment suggests the latter is more likely.

Bayes rule and Bayes factors

You could formalize this with Bayes rule. For example, suppose you’re 99% sure the thing you’re looking at has a normal distribution with variance 1, but you’re willing to concede there’s a 1% chance that what you’re looking at has a fat-tailed distribution, say a Student t distribution with 10 degrees of freedom, rescaled to also have variance 1.

normal distribution vs t with 10 dof

It’s hard to tell the two distributions apart, especially in the tails. But although both are small in the tails, the normal is relatively much smaller.

Now suppose you’ve seen an observation greater than 6. The Bayes factor in favor of the t distribution hypothesis is 272. This means that even though before seeing any data you thought the odds were 99 to 1 in favor of the data coming from a normal distribution, after seeing such a large observation you would put the odds at 272 to 1 in favor of the t distribution.

If you allow a small possibility that your assumption of a normal distribution is wrong (see Cromwell’s rule) then seeing an extreme event will radically change your mind. You don’t have to think the fat-tailed distribution is equally likely, just a possibility. If you did think a priori that both possibilities were equally likely, the posterior odds for the t distribution would be 27,000 to 1.

In this example we’re comparing the normal distribution to a very specific and somewhat arbitrary alternative. Our alternative was just an example. You could have picked a wide variety of alternatives that would have given a qualitatively similar result, reversing your a priori confidence in a normal model.

By the way, a t distribution with 10 degrees of freedom is not a very fat-tailed distribution. It has fatter tails than a normal for sure, but not nearly as fat as a Cauchy, which corresponds to a t with only one degree of freedom. If we had used a distribution with a heavier tail, the posterior odds in favor of that distribution would have been higher.

***

[1] A six-sigma event isn’t that rare unless your probability distribution is normal. By Markov’s inequality, the probability is less than 1/36 for any distribution. The rarity of six-sigma events comes from the assumption of a normal distribution more than from the number of sigmas per se.

Hypothesis testing vs estimation

I was looking at my daughter’s statistics homework recently, and there were a pair of questions about testing the level of lead in drinking water. One question concerned testing whether the water was safe, and the other concerned testing whether the water was unsafe.

There’s something bizarre, even embarrassing, about this. You want to do two things: estimate the amount of lead, and decide what to do in response. But instead of simply doing just that, you do this arcane dance of choosing two hypotheses, one natural and one arbitrary, and treating the two asymmetrically, depending on which one you call the null and which you call the alternative. This asymmetry is the reason you make a distinction between testing whether the water is safe and testing whether it is unsafe.

It’s a weird tangle of estimation and decision making. The decision-making rules implicit in the procedure are not at all transparent. And even though you are testing the level of lead, you’re doing so indirectly.

The Bayesian approach to the problem is much easier to understand. You estimate the probability distribution for the concentration of lead based on all available information. You can plot this distribution and show it to civil engineers, politicians, or anybody else who needs to make a decision. Non-statisticians are much more likely to understand such a plot than the nuances of null and alternative hypotheses, significance, power, and whether you’re testing for safety versus testing for non-safety. (Statisticians are more likely to understand estimation as well.)

In the homework problems, the allowable level of lead was 15 ppm. After obtaining the posterior distribution on the concentration of lead, you could simply estimate the probability that the concentration is above 15 ppm. But you could also calculate the probability that the concentration lies in any other range you’re interested in.

Classical statistics does not allow such probability calculations. Even a confidence interval, something that looks like a probability statement about the concentration of lead, is actually a probability statement about the statistical process being used and not a probability statement about lead concentration per se.