Posts tagged as:

Probability and Statistics

The first post in this series introduced random inequalities. The second post discussed random inequalities can could be computed in closed form. This post considers random inequalities that must be evaluated numerically.

The simplest and most obvious approach to computing P(X > Y) is to generate random samples from X and Y and count how often the sample from X is bigger than the sample from Y. However, this approach is only accurate on average. In any particular sample, the error may be large. Even if you are willing to tolerate, for example, a 5% chance of being outside your error tolerance, the required number of random draws may be large. The more accuracy you need, the worse it gets since your accuracy only improves as n-1/2 where n is the number of samples. For numerical integration methods, the error decreases as a much larger negative power of n depending on the method. In the timing study near the end of this technical report, numerical integration was 2,875 times faster than simulation for determining beta inequality probabilities to just two decimal places. For greater desired accuracy, the advantage of numerical integration would increase.

Why is efficiency important? For some applications it may not be, but often these random inequalities are evaluated in the inner loop of a simulation. A simulation study that takes hours to run may be spending most of its time evaluating random inequalities.

As derived in the previous post, the integral to evaluate is given by

\begin{eqnarray*} P(X \geq Y) &=& \int \!\int _{[x > y]} f_X(x) f_Y(y) \, dy \, dx \\ &=& \int_{-\infty}^\infty \! \int_{-\infty}^x f_X(x) f_Y(y) \, dy \, dx \\ &=& \int_{-\infty}^\infty f_X(x) F_Y(x) \, dx \end{eqnarray*}

The technical report Numerical computation of stochastic inequality probabilities gives some general considerations for numerically evaluating random inequalities, but each specific distribution family must be considered individually since each may present unique numerical challenges. The report gives some specific techniques for beta and Weibull distributions.  The report Numerical evaluation of gamma inequalities gives some techniques for gamma distributions.

Numerical integration can be complicated by singularities in integrand for extreme values of distribution parameters. For example, if X ~ beta(a, b) the integral for computing P(X > Y) has a singularity if either a or b are less than 1. For general advice on numerically integrating singular integrands, see “The care and treatment of singularities” in Numerical Methods that Work by Forman S. Acton.

Once you come up with a clever numerical algorithm for evaluating a random inequality, you need to have a way to test the results.

One way to test numerical algorithms for random inequalities is to compare simulation results to integration results. This will find blatant errors, but it may not be as effective in uncovering accuracy problems. It helps to use random parameters for testing.

Another way to test numerical algorithms in this context is to compute both P(X > Y) and P(Y > X). For continuous random variables, these two values should add to 1. Of course such a condition is not sufficient to guarantee accuracy, but it is a simple and effective test. The Inequality Calculator software reports both P(X > Y) and P(Y > X) partly for the convenience of the user but also as a form of quality control: if the two probabilities do not add up to approximately 1, you know something has gone wrong. Along these lines, it’s useful to verify that P(X > Y) = 1/2 if X and Y are identically distributed.

Finally, integrands may have closed-form solutions for special parameters. For example, Exact calculation of beta inequalities gives closed-form results for many special cases of the parameters. As long as the algorithm being tested does not depend on these special values, these special values provide valuable test cases.

{ 0 comments }

My previous post introduced random inequalities and their application to Bayesian clinical trials. This post will discuss how to evaluate random inequalities analytically. The next post in the series will discuss numerical evaluation when analytical evaluation is not possible.

For independent random variables X and Y, how would you compute P(X>Y), the probability that a sample from X will be larger than a sample from Y? Let fX be the probability density function (PDF) of X and let FX be the cumulative distribution function (CDF) of X. Define fY and FY similarly. Then the probability P(X > Y) is the integral of fX(x) fY(y) over the part of the x-y plane below the diagonal line x = y.

\begin{eqnarray*} P(X \geq Y) &=& \int \!\int _{[x > y]} f_X(x) f_Y(y) \, dy \, dx \\ &=& \int_{-\infty}^\infty \! \int_{-\infty}^x f_X(x) f_Y(y) \, dy \, dx \\ &=& \int_{-\infty}^\infty f_X(x) F_Y(x) \, dx \end{eqnarray*}

This result makes intuitive sense: fX(x) is the density for x and FY(x)  is the probability that Y is less than x. Sometimes this integral can be evaluated analytically, though in general it must be evaluated numerically. The technical report Numerical computation of stochastic inequality probabilities explains how P(X > Y) can be computed in closed form for several common distribution families as well as how to evaluate inequalities involving other distributions numerically.

Exponential: If X and Y are exponential random variables with mean μX and μY respectively, then P(X > Y) = μX/(μX + μY).

Normal: If X and Y are normal random variables with mean and standard deviation (μX, σX) and (μY, σY) respectively, then P(X > Y) = Φ((μX - μY)/√(σX2 + σY2)) where Φ is the CDF of a standard normal distribution.

Gamma:  If X and Y are gamma random variables with shape and scale (αX, βX) and (αY, βY) respectively, then P(X > Y) = IxX/(βX + βY)) where Ix is the incomplete beta function with parameters αY and αX, i.e. the CDF of a beta distribution with parameters αY and αX.

The inequality P(X > Y) where X and Y are beta random variables comes up very often in applications. This inequality cannot be computed in closed form in general, though there are closed-form solutions for special values of the beta parameters. If X ~ beta(a, b) and Y ~ beta(c, d), the probability P(X > Y) can be evaluated in closed form if (1) one of the parameters a, b, c, or d is an integer, (2) a + b + c + d = 1, or (3) a + b = c + d = 1. These last two cases can be combined with a recurrence relation to compute other probabilities. See Exact calculation of beta inequalities for more details.

Sometimes you need to calculate P(X > max(Y, Z)) for three independent random variables. This comes up, for example, when computing adaptive randomization probabilities for a three-arm clinical trial. For a time-to-event trial as implemented here, the random variables have a gamma distribution. See Numerical evaluation of gamma inequalities for analytical as well as numerical results for computing P(X > max(Y, Z)) in that case.

The next post in this series will discuss how to evaluate random inequalities numerically when closed-form integration is not possible.

Update: See Part IV of this series for results with the Cauchy distribution.

{ 0 comments }

Random inequalities I: introduction

by John on July 26, 2008

Many Bayesian clinical trial methods have at their core a random inequality. Some examples from M. D. Anderson: adaptive randomization, binary safety monitoring, time-to-event safety monitoring. These method depends critically on evaluating P(X > Y) where X and Y are independent random variables. Roughly speaking, P(X > Y) is the probability that the treatment represented by X is better than the treatment represented by Y. In a trial with binary outcomes, X and Y may be the posterior probabilities of response on each treatment. In a trial with time-to-event outcomes, X and Y may be posterior probabilities of median survival time on two treatments.

People often have a little difficulty understanding what P(X > Y) means. What does it mean? If we take a sample from X and a random sample from Y, P(X >Y) is the probability that the former is larger than the latter. Most confusion around random inequalities comes from thinking of random variables as constants, not random quantities. Here are a couple examples.

First, suppose X and Y have normal distributions with standard deviation 1. If X has mean 4 and Y has mean 3, what is P(X > Y)? Some would say 1, because X is bigger than Y. But that’s not true. X has a larger mean than Y, but fairly often a sample from Y will be larger than a sample from X.  P(X > Y) = 0.76 in this case.

Next, suppose X and Y are identically distributed. Now what is P(X > Y)? I’ve heard people say zero because the two random variables are equal. But they’re not equal. Their distribution functions are equal but the two random variables are independent. P(X > y) = 1/2 by symmetry.

I believe there’s a psychological tendency to underestimate large inequality probabilities. (I’ve had several discussions with people who would not believe a reported inequality probability until they computed it themselves. These discussions are important when the decision whether to continue a clinical trial hinges on the result.) For example, suppose X and Y represent the probability of success in a trial in which there were 17 successes out of 30 on X and 12 successes out of 30 on Y. Using a beta distribution model, the density functions of X and Y are given below.

beta inequality graph

The density function for X is essentially the same as Y but shifted to the right. Clearly P(X > Y) is greater than 1/2. But how much greater than a half? You might think not too much since there’s a lot of mass in the overlap of the two densities. But P(X > Y) is a little more than 0.9.

The image above and the numerical results mentioned in this post were produced by the Inequality Calculator software.

Part II will discuss analytically evaluating random inequalities. Part III will discuss numerically evaluating random inequalities.

{ 2 comments }

Yesterday I gave a presentation on designing clinical trials using adaptive randomization software developed at M. D. Anderson Cancer Center. The heart of the presentation is summarized in the following diagram.

Diagram of three methods of tuning adaptively randomized trial designs

(A slightly larger and clearer version if the diagram is available here.)

Traditional randomized trials use equal randomization (ER). In a two-arm trial, each treatment is given with probability 1/2. Simple adaptive randomization (SAR) calculates the probability that a treatment is the better treatment given the data seen so far and randomizes to that treatment with that probability. For example, if it looks like there’s an 80% chance that Treatment B is better, patients will be randomized to Treatment B with probability 0.80. Myopic optimization (MO) gives each patient what appears to be the best treatment given the available data with no randomization.

Myopic optimization is ethically appealing, but has terrible statistical properties. Equal randomization has good statistical properties, but will put the same number of patients on each treatment, regardless of the evidence that one treatment is better. Simple adaptive randomization is a compromise position, retaining much of the power of equal randomization while also treating more patients on the better treatment on average.

The adaptive randomization software provides three ways of compromising between the operating characteristics ER and SAR.

  1. Begin the trial with a burn-in period of equal randomization followed by simple randomization.
  2. Use simple randomization, except if the randomization probability drops below a certain threshold, substitute that minimum value.
  3. Raise the simple randomization probability to a power between 0 and 1 to obtain a new randomization probability.

Each of these three approaches reduces to ER at one extreme and SAR at the other. In between the extremes, each produces a design with operating characteristics somewhere between those of ER and SAR.

In the first approach, if the burn-in period is the entire trial, you simply have an ER trial. If there is no burn-in period, you have an SAR trial. In between you could have a burn-in period equal to some percentage of the total trial between 0 and 100%. A burn-in period of 20% is typical.

In the second approach, you could specify the minimum randomization probability as 0.5, negating the adaptive randomization and yielding ER. At the other extreme, you could set the minimum randomization probability to 0, yielding SAR. In between you could specify some non-zero randomization probability such as 0.10.

In the third approach, a power of zero yields ER. A power of 1 yields SAR. Unlike the other two approaches, this approach could yield designs approaching MO by using powers larger than 1. This is the most general approach since it can produce a continuum of designs with characteristics ranging from ER to MO. For more on this approach, see Understanding the exponential tuning parameter in adaptively randomized trials.

So with three methods to choose from, which one do you use? I did some simulations to address this question. I expected that all three methods would perform about the same. However, this is not what I found. To read more, see Comparing methods of tuning adaptive randomized trials.

{ 0 comments }

Why heights are not normally distributed

by John on July 20, 2008

In my previous post, I speculated on why heights are normally distributed, that is, why their statistical distribution is very nearly Gaussian. In this post I want to point out where it breaks down. I’ll look closely at an example from Elementary Statistics by Mario Triola.

At the beginning of the chapter, we noted that the United States Army requires that women’s heights be between 58 and 80 inches. Find the percentage of women satisfying that requirement. Again assume that women have heights that are normally distributed with a mean of 63.6 inches and a standard deviation of 2.5 inches.

The book gives a solution of 98.7%. That’s probably a fairly realistic result, though maybe not to three significant figures. My quibble is with one of the details along the way to the solution, not the final solution itself.

A height of 80 inches is 6.56 standard deviations away from the mean. The probability of a normal random random variable taking on a value that far away from its mean is between 2 and 3 out of 100 billion. Since there are about 7 billion people on our planet, and less than half of these are adult women, this says it would be unlikely to ever find a woman 80 inches (6′ 8″) tall. But there are many women that tall or taller. The world record is 91 inches (7′ 7″), or about 11 standard deviations from the mean. If heights really were normally distributed, the probability of such a height would be 1.9 x 10-28 or about 2 chances in 10,000,000,000,000,000,000,000,000,000. The fit is even worse in the lower tail of the distribution. The world’s shortest woman is 25.5 inches tall, 15 standard deviations below the mean.

The normal distribution describes heights remarkably well near the mean, even a couple standard deviations on either side of the mean. But in the extremes, such as six standard deviations out, the model doesn’t fit well. The absolute error is small: the normal model predicts that women 80 inches tall or taller are uncommon, and indeed they are. But they are not nearly as uncommon as the model suggests. The relative error in the model when predicting extreme values is enormous.

The normal model often doesn’t fit well in the extremes. It often underestimates the probability of rare events. The Black Swan gives numerous examples of rare events that were not as rare as a normal distribution would predict. What might account for this poor fit?

Well, why should we expect a normal distribution to fit well in the first place? Because of the central limit theorem. This theorem says roughly that if you average a large number of independent random variables, the result has an approximately normal distribution. But there are many ways the assumptions of this theorem could fail to hold: the random variables might not be independent, they might not be identically distributed, they might have thick tails, etc. And even when the assumptions of the central limit do apply, the theorem only guarantees that the absolute error in the normal approximation goes to zero. It says nothing about the relative error. That may be why the normal model accurately predicts what percentage of women are eligible to serve in the US Army but does not accurately predict how many women are over 6′ 8″ tall.

{ 0 comments }

Why heights are normally distributed

by John on July 20, 2008

The canonical example of the normal distribution given in textbooks is human heights. Measure the heights of a large sample of adult men and the numbers will follow a normal (Gaussian) distribution. The heights of women also follow a normal distribution. What textbooks never discuss is why heights should be normally distributed.

Why should heights be normally distributed? If height were a simple genetic characteristic, there would be two possibilities: short and tall, like Mendel’s peas that were either wrinkled or smooth but never semi-wrinkled. But height is not a simple characteristic. There are numerous genetic and environmental factors that influence height. When there are many independent factors that contribute to some phenomena, the end result may follow a Gaussian distribution due to the central limit theorem.

The normal distribution is a remarkably good model of heights for some purposes. It may be more interesting to look at where the model breaks down. See my next post, why heights are not normally distributed.

Update: See Distribution of adult heights

{ 0 comments }

Normal approximation errors

by John on July 18, 2008

Many well-known probability distributions converge to the normal distribution as some parameter or other increases. In a sense this is not very interesting: All roads lead to Rome. But though destinations are the same, the paths to the destination are varied and more interesting.

I’ve posted notes on how the error in the normal approximation varies for the beta, gamma, and Student t distributions.

The animation below shows the error in the normal approximation to the gamma distribution as the shape parameter grows from 3 to 23. See the gamma notes for more details.

 

Animation of error in normal approximation to gamma as shape parameter increases

(If the image above is not animated in your browser, visit the gamma notes page where the image should display correctly in all browsers.)

{ 0 comments }

I posted some notes this evening on working with probability distributions in Mathematica and R/S-PLUS.

I much prefer Mathematica’s syntax. The first time I had to read some R code I ran across a statement something like runif(1, 3, 4). I thought it was some sort of conditional executation statement: run something if some condition holds. No, the code generates a random value uniformly from the interval (3, 4). The corresponding Mathematica syntax is Random[ UniformDistribution[3,4] ].

Another example. The statement pnorm(x, m, s) in R corresponds to PDF[ NormalDistribution[m, s], x ] in Mathematica. Both evaluate the PDF of a normal random variable with mean m and standard deviation s at the point x.

It’s a matter of taste. Some people prefer terse notation, especially for things they use frequently. I’d rather type more and remember less.

{ 0 comments }

Wine, Beer, and Statistics

by John on June 27, 2008

William Gosset discovered the t-distribution while working for the Guinness brewing company. Because his employer prevented employees from publishing papers, Gosset published his research under the pseudonym Student. That’s why his distribution is often called Student’s t-distribution.

This story is fairly well know. It often appears in the footnotes of statistics textbooks. However, I don’t think many people realize why it’s not surprising that fundamental statistical research should come from a brewery, and why we don’t hear of statistical research coming out of wineries.

Beer makers pride themselves on consistency while wine makers pride themselves on variety. That’s why you’ll never hear beer fans talk about a “good year” the way wine connoisseurs do. Because they value consistency, beer makers invest more in extensive statistical quality control than wine makers do.

{ 2 comments }

Two definitions of expectation

by John on June 27, 2008

In an introductory probability class, the expected value of a random variable X is defined as

E(X) = \int_{-\infty}^\infty x\, f_X(x) \,dx

where fXis the probability density function of X. I’ll call this the analytical definition.

In a more advanced class the expected value of X is defined as

E(X) = \int_\Omega X \,dP

where (Ω, P) is a probability space. I’ll call this the measure-theoretic definition. It’s not obvious that these two definitions are equivalent. They may even seem contradictory unless you look closely: they’re integrating different functions over different spaces.

If for some odd reason you learned the measure-theoretic definition first, you could see the analytical definition as a theorem. But if, like most people, you learn the analytical definition first, the measure-theoretic version is quite mysterious. When you take an advanced course and look at the details previously swept under the rug, probability looks like an entirely different subject, unrelated to your introductory course. The definition of expectation is just one concept among many that takes some work to resolve.

I’ve written a couple pages of notes that bridge the gap between the two definitions of expectation and show that they are equivalent.

{ 1 comment }

Probability approximations

by John on June 22, 2008

When I took my first probability course, it seemed like there were an infinite number of approximation theorems to learn, all mysterious. Looking back, there were probably only two or three, and they don’t need to be mysterious.

For example, under the right circumstances you can approximate a Binomial(n, p) well with a Normal(np, np(1-p)). While the relationship between the parameters in these two distributions is obvious to the initiated, it’s not at all obvious to a beginner. It seems much clearer to say that a Binomial can be approximated by a Normal with the same mean and variance. After all, a distribution that doesn’t get the mean and variance correct doesn’t sound like a very promising approximation.

Taking it a step further, a good teacher could guide a class to discover this approximation themselves. This would take more time than simply stating the result and working an example or two, but the difference in understanding would be immense. And if you’re not going to take the time to aim for understanding, what’s the point in covering approximation theorems at all? They’re not used that often for computation anymore. In my opinion, the only reason to go over them is the insight they provide.

{ 0 comments }

Today I talked to a doctor about the design of a randomized clinical trial that would use a Bayesian monitoring rule. The probability of response on each arm would be modeled as a binomial with a beta prior. Simple conjugate model. The historical response rate in this disease is only 5%, and so the doctor had chosen a beta(0.1, 1.9) prior so that the prior mean matched the historical response rate.

For beta distributions, the sum of the two parameters is called the effective sample size. There is a simple and natural explanation for why a beta(a, b) distribution is said to contain as much information as a+b data observations. By this criterion, the beta(0.1, 1.9) distribution is not very informative: it only has as much influence as two observations. However, viewed in another light, a beta(0.1, 1.9) distribution is highly informative.

This trial was designed to stop when the posterior probability is more than  0.999 that one treatment is more effective than the other. That’s an unusually high standard of evidence for stopping a trial — a cutoff of 0.99 or smaller would be much more common — and yet a trial could stop after only six patients. If X is the probability of response on one arm and Y is the probability of response on the other, after three failures on the first treatment and three successes on the other, Pr(Y > X) > 0.999.

The explanation for how the trial could stop so early is that the prior distribution is, in an odd sense, highly informative. The trial starts with a strong assumption that each treatment is ineffective. This assumption is somewhat justified by of experience, and yet a beta(0.1, 1.9) distribution doesn’t fully capture the investigator’s prior belief.

(Once at least one response has been observed, the beta(0.1, 1.9) prior becomes essentially uninformative. But until then, in this context, the prior is informative.)

A problem with a beta prior is that there is no way to specify the mean at 0.05 without also placing a large proportion of the probability mass below 0.05. The beta prior places too little probability on better outcomes that might reasonably happen. I imagine a more diffuse prior with mode 0.05 rather than mean 0.05 would better describe the prior beliefs regarding the treatments.

The beta prior is convenient because Bayes’ theorem takes a very simple form in this case: starting from a beta(a, b) prior and observing s successes and f failures, the posterior distribution is beta(a+s, b+f).  But a prior less convenient algebraically could be more robust and better adept at representing prior information.

A more basic observation is that “informative” and “uninformative” depend on context. This is part of what motivated Jeffreys to look for prior distributions that were equally (un)informative under a set of transformations. But Jeffreys’ approach isn’t the final answer. As far as I know, there’s no universally acceptable resolution to this dilemma.

{ 1 comment }

Robust priors

by John on June 8, 2008

Yesterday I posted a working paper version of an article I’ve been working on with Jairo Fúquene and Luis Pericchi: A Case for Robust Bayesian priors with Applications to Binary Clinical Trials.

Bayesian analysis begins with a prior distribution, a function summarizing what is believed about an experiment before any data are collected. The prior is updated as data become available and becomes the posterior distribution, a function summarizing what is currently believed in light of the data collected so far. As more data are collected, the relative influence of the prior decreases and the influence of the data increases. Whether a prior is robust depends on the rate at which the influence of the prior decreases.

There are essentially three approaches to how the influence of the prior on the posterior should vary as a function of the data.

  1. Robustness with respect to the prior. When the data and the prior disagree, give more weight to the data.
  2. Conjugate priors. The influence of the prior is independent of the extent to which it agrees with the data.
  3. Robustness with respect to the data. When the data and the prior disagree, give more weight to the prior.

When I say “give more weight to the data” or “give more weight to the prior,” I’m not talking about making ad hoc exceptions to Bayes theorem. The weight given to one or the other falls out of the usual application of Bayes theorem. Roughly speaking, robustness has to do with the relative thickness of the tails of the prior and the likelihood. A model with thicker tails on the prior will be robust with respect to the prior, and a model with thicker tails on the likelihood will be robust with respect to the data.

Each of the three approaches above are appropriate in different circumstances. When priors come from well-understood physical principles, it may make sense to use a model that is robust with respect to the data, i.e. to suppress outliers. When priors are based on vague beliefs, it may make more sense to be robust with respect to the prior. Between these extremes, particularly when a large amount of data is available, conjugate priors may be appropriate.

When the data and the prior are in rough agreement, the contribution of a robust prior to the posterior is comparable to the contribution that a conjugate prior would have had. (And so using robust proper priors leads to greater variance reduction than using improper priors.) But as the level of agreement decreases, the contribution of a robust prior to the posterior also decreases.

In the paper, we show that with a binomial likelihood, the influence of a conjugate prior grows without bound as the prior mean goes to infinity. However, with a Student-t prior, the influence of the prior is bounded as the prior mean increases. For a Cauchy prior, the influence of the prior is bounded as the location parameter goes to infinity.

It’s easy to confuse a robust prior and a vague conjugate prior. Our paper shows how in a certain sense, even an “informative” Cauchy distribution is less informative than a “non-informative” conjugate prior.

{ 1 comment }

I’m teaching part of a basic medical statistics class this summer. It’s been about a decade since I’ve taught basic probability and statistics and I now have different ideas about what is important. For example, I now think it’s more important that a beginning class understand the law of small numbers than the law of large numbers.

One reason for my change of heart is that over the intervening years I’ve talked with people who have had a class like the one I’m teaching now and I have some idea what they got out of it. They might summarize their course as follows.

First we did probability. You know, coin flips and poker hands. Then we did statistics. That’s where you look up these numbers in tables and if the number is small enough then what you’re trying to prove is true, and otherwise it’s false.

Too many people get through a course in probability and statistics without understanding what probability has to do with statistics. I think we’d be better off “covering” far less material but trying to insure that students really grok two or three big ideas by the time they leave.

{ 5 comments }

Barriers to good statistical software

by John on May 16, 2008

I attended a National Cancer Institute workshop yesterday entitled “Barriers to producing well-tested, user-friendly software for cutting-edge statistical methodology.” I was pleased that everyone there realized there is a huge difference between code created for personal use and reliable software that others would willingly use. Not all statisticians appreciate the magnitude of the difference.

I was also pleased that several people at the workshop were aware of the problem of irreproducible statistical analyses. Not everyone was aware how serious or how common the problem is, but those who were aware were adamant that something needs to be done about it, such as journals requiring authors to publish the code used to analyze their data.

{ 3 comments }