Parameters and percentiles

The doctor says 10% of patients respond within 30 days of treatment and 80% respond within 90 days of treatment. Now go turn that into a probability distribution. That’s a common task in Bayesian statistics, capturing expert opinion in a mathematical form to create a prior distribution.

Graph of gamma density with 10th percentile at 30 and 80th percentile at 90

Things would be easier if you could ask subject matter experts to express their opinions in statistical terms. You could ask “If you were to represent your belief as a gamma distribution, what would the shape and scale parameters be?” But that’s ridiculous. Even if they understood the question, it’s unlikely they’d give an accurate answer. It’s easier to think in terms of percentiles.

Asking for mean and variance is not much better than asking for shape and scale, especially for a non-symmetric distribution such as a survival curve. Anyone who knows what variance is probably thinks about it in terms of a normal distribution. Asking for mean and variance encourages someone to think about a symmetric distribution.

So once you have specified a couple of percentiles, as in the example this post started with, can you find parameters that meet these requirements? If you can’t meet both requirements, how close can you come to satisfying them? Does it depend on how far apart the percentiles are? The answers to these questions depend on the distribution family. Obviously you can’t satisfy two requirements with a one-parameter distribution in general. If you have two requirements and two parameters, at least it’s feasible that both can be satisfied.

If you have a random variable X whose distribution depends on two parameters, when can you find parameter values so that Prob(X ≤ x1) = p1 and Prob(X ≤ x2) = p2? For starters, if x1 is less than x2 then p1 must be less than p2. For example, the probability of a variable being less than 5 cannot be bigger than the probability of being less than 6. For some common distributions, the only requirement is that the x’s and p’s be in a consistent order.

For a location-scale family, such as the normal or Cauchy distributions, you can always find a location and scale parameter to satisfy two percentile conditions. In fact, there’s a simple expression for the parameters. The location parameter is given by

\frac{x_1 F^{-1}(p_2) - x_2 F^{-1}(p_1)}{F^{-1}(p_2) - F^{-1}(p_1)}

and the scale parameter is given by

\frac{x_2 - x_1}{F^{-1}(p_2) - F^{-1}(p_1)}

where F is the CDF of the member of the family with location 0 and scale 1.
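Here is a minimal Python sketch of those two formulas, using only the standard library. The function names `location_scale` and `cauchy_quantile` are made up for illustration, and the numbers simply reuse the percentiles from the start of the post; whether a normal or Cauchy distribution is a sensible model for response times is a separate question.

```python
import math
from statistics import NormalDist


def location_scale(x1, p1, x2, p2, quantile):
    """Solve for (location, scale) so that Prob(X <= x1) = p1 and Prob(X <= x2) = p2.

    `quantile` is the inverse CDF of the family member with
    location 0 and scale 1.
    """
    if not (x1 < x2 and 0 < p1 < p2 < 1):
        raise ValueError("need x1 < x2 and 0 < p1 < p2 < 1")
    q1, q2 = quantile(p1), quantile(p2)
    location = (x1 * q2 - x2 * q1) / (q2 - q1)
    scale = (x2 - x1) / (q2 - q1)
    return location, scale


def cauchy_quantile(p):
    """Inverse CDF of the standard Cauchy distribution."""
    return math.tan(math.pi * (p - 0.5))


# Normal distribution with 10th percentile at 30 and 80th percentile at 90.
mu, sigma = location_scale(30, 0.1, 90, 0.8, NormalDist().inv_cdf)
print(mu, sigma)
print(NormalDist(mu, sigma).cdf(30), NormalDist(mu, sigma).cdf(90))

# Same percentile conditions for a Cauchy distribution.
print(location_scale(30, 0.1, 90, 0.8, cauchy_quantile))
```

Passing the quantile function as an argument keeps the solver the same for every location-scale family; only the standard quantile function changes.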

The shape and scale parameters of a Weibull distribution can also be found in closed form. For a gamma distribution, parameters to satisfy the percentile requirements always exist. The parameters are easy to determine numerically but there is no simple expression for them.
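For the Weibull case, here is one way the closed form can be worked out. This is a sketch under an assumption the post does not state: the CDF is parameterized as F(x) = 1 - exp(-(x/scale)^shape). Taking logs twice turns each percentile condition into an equation that is linear in the shape and in log(scale). The function names are hypothetical.

```python
import math


def weibull_parameters(x1, p1, x2, p2):
    """Shape and scale of a Weibull with Prob(X <= x1) = p1 and Prob(X <= x2) = p2.

    Assumes the CDF is 1 - exp(-(x / scale) ** shape).
    """
    if not (0 < x1 < x2 and 0 < p1 < p2 < 1):
        raise ValueError("need 0 < x1 < x2 and 0 < p1 < p2 < 1")
    # -log(1 - p) = (x / scale) ** shape, so
    # log(-log(1 - p)) = shape * (log(x) - log(scale)).
    g1 = math.log(-math.log1p(-p1))
    g2 = math.log(-math.log1p(-p2))
    shape = (g2 - g1) / (math.log(x2) - math.log(x1))
    scale = x1 / (-math.log1p(-p1)) ** (1 / shape)
    return shape, scale


def weibull_cdf(x, shape, scale):
    """Weibull CDF in the parameterization assumed above."""
    return 1 - math.exp(-((x / scale) ** shape))


shape, scale = weibull_parameters(30, 0.1, 90, 0.8)
print(shape, scale)
print(weibull_cdf(30, shape, scale), weibull_cdf(90, shape, scale))
```

The gamma case has no analogous shortcut, so it needs a numerical root finder instead.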

For more details, see Determining distribution parameters from quantiles. See also the ParameterSolver software.

Update: I posted an article on CodeProject with Python code for computing the parameters described here.

Biostatistics software

The M. D. Anderson Cancer Center Department of Biostatistics has a software download site listing software developed by the department over many years.

The home page of the download site allows you to see all products sorted by date or by name. This page also allows search. A new page lets you see the software organized by tags.

Related: Biostatistics consultant

Bayesian clinical trials in one zip code

I recently ran across this quote from Mithat Gönen of Memorial Sloan-Kettering Cancer Center:

While there are certainly some at other centers, the bulk of applied Bayesian clinical trial design in this country is largely confined to a single zip code.

from “Bayesian clinical trials: no more excuses,” Clinical Trials 2009; 6; 203.

The zip code Gönen alludes to is 77030, the zip code of M. D. Anderson Cancer Center. I can’t say how much activity there is elsewhere, but certainly we design and conduct a lot of Bayesian clinical trials at MDACC.

Update: After over a decade working at MDACC, I left to start my own consulting business. If you’d like help with adaptive clinical trials please let me know.

More clinical trial posts

Four reasons to use Bayesian inference

The following is a direct quote from Anthony O’Hagan’s book Bayesian Inference. I’ve edited the quote only to enumerate the points.

Why should one use Bayesian inference, as opposed to classical inference? There are various answers. Broadly speaking, some of the arguments in favour of the Bayesian approach are that it is

  1. fundamentally sound,
  2. very flexible,
  3. produces clear and direct inferences,
  4. makes use of all available information.

I’ll elaborate briefly on each of O’Hagan’s points.

Bayesian inference has a solid philosophical foundation. It is consistent with certain axioms of rational inference. Non-Bayesian systems of inference, such as fuzzy logic, must violate one or more of these axioms; their conclusions are rationally satisfying to the extent that they approximate Bayesian inference.

Bayesian inference is at the same time rigid and flexible. It is rigid in the sense that all inference follows the same form: set up a likelihood and a prior, then calculate the posterior by conditioning on observed data via Bayes’ theorem. But this rigidity channels creativity into useful directions. It provides a template for setting up complex models when necessary.

Frequentist inferences are awkward to explain. For example, confidence intervals and p-values are tedious to define rigorously. Most consumers of confidence intervals and p-values do not know what they mean and implicitly assume Bayesian interpretations. The difference is not simply pedantic. Particularly with regard to p-values, the common understanding can be grossly inaccurate. By contrast, Bayesian counterparts are simple to define and interpret. Bayesian credible intervals are exactly what most people think confidence intervals are. And a Bayesian hypothesis test simply compares the probability of each hypothesis via Bayes factors.

Sometimes the necessity of specifying prior distributions is seen as a drawback to Bayesian inference. On the other hand, the ability to specify prior distributions means that more information can be incorporated in an inference. See Musicians, drunks, and Oliver Cromwell for a colorful illustration from Jim Berger on the need to incorporate prior information.

More posts on Bayesian statistics

Bayesian statistics is misnamed

I’m teaching an introduction to Bayesian statistics. My first thought was to start with Bayes’ theorem, as many introductions do. But this isn’t the right starting point. Bayes’ theorem is an indispensable tool for Bayesian statistics, but it is not the foundational principle. The foundational principle of Bayesian statistics is the decision to represent uncertainty by probabilities. Unknown parameters have probability distributions that represent the uncertainty in our knowledge of their values.

Once you decide to use probabilities to express parameter uncertainty, you inevitably run into the need for Bayes’ theorem to work with these probabilities. Bayes’ theorem is applied constantly in Bayesian statistics, and that is why the field takes its name from the theorem’s author, Reverend Thomas Bayes (1702-1761). But “Bayesian” doesn’t describe Bayesian statistics quite the same way that “frequentist” describes frequentist statistics. The term “frequentist” gets to the heart of how frequentist statistics interprets probability. But “Bayesian” refers to Bayes’ theorem, a computational tool for carrying out probability calculations in Bayesian statistics. If frequentist statistics were analogously named, it might be called “Bernoullian statistics” after Jacob Bernoulli’s law of large numbers.

The term “Bayesian” statistics might imply that frequentist statisticians dispute Bayes’ theorem. That is not the case. Bayes’ theorem is a simple mathematical result. What people dispute is the interpretation of the probabilities that Bayesians want to stick into Bayes’ theorem.

I don’t have a better name for Bayesian statistics. Even if I did, the name “Bayesian” is firmly established. It’s certainly easier to say “Bayesian statistics” than to say “that school of statistics that represents uncertainty in unknown parameters by probabilities,” even though the latter is accurate.

More Bayesian posts

Four pillars of Bayesian statistics

Anthony O’Hagan’s book Bayesian Inference lists four basic principles of Bayesian statistics at the end of the first chapter:

  1. Prior information. Bayesian statistics provides a systematic way to incorporate what is known about parameters before an experiment is conducted. As a colleague of mine says, if you’re going to measure the distance to the moon, you know not to pick up a yard stick. You always know something before you do an experiment.
  2. Subjective probability. Some Bayesians don’t agree with the subjective probability interpretation, but most do, in practice if not in theory. If you write down reasonable axioms for quantifying degrees of belief, you inevitably end up with Bayesian statistics.
  3. Self-consistency. Even critics of Bayesian statistics acknowledge that Bayesian statistics has a rigorous self-consistent foundation. As O’Hagan says in his book, the difficulties with Bayesian statistics are practical, not foundational, and the practical difficulties are being resolved.
  4. No adhockery. Bruno de Finetti coined the term “adhockery” to describe the profusion of frequentist methods. More on this below.

This year I’ve had the chance to teach a mathematical statistics class primarily focusing on frequentist methods. Teaching frequentist statistics has increased my appreciation for Bayesian statistics. In particular, I better understand the criticism of frequentist adhockery.

For example, consider point estimation. Frequentist statistics to some extent has standardized on minimum variance unbiased estimators as the gold standard. But why? And what do you do when such estimators don’t exist?

Why focus on unbiased estimators? Granted, lack of bias sounds like a good thing to have. All things being equal, it would be better to be unbiased than biased. But all things are not equal. Sometimes unbiased estimators are ridiculous. Why only consider biased vs. unbiased, a binary choice, rather than degree of bias, a continuous choice? Efficiency is also important, and someone may reasonably accept a small amount of bias in exchange for a large increase in efficiency.

Why minimize expected mean squared error? Efficiency in classical statistics is typically measured by expected mean squared error. But why not minimize some other measure of error? Why use an exponent of 2 and not 1, or 4, or 2.738? Or why limit yourself to power functions at all? The theory is simplest for squared error, and while this is a reasonable choice in many applications, it is still an arbitrary choice.
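To see that the exponent really does change the answer, here is a small sketch with made-up data. It searches a grid for the single number that minimizes the total error |x - c|^k over a sample. With k = 2 the winner is the sample mean, and with k = 1 it is the sample median, so the "best" estimate depends on a choice someone had to make. The function names and the data are purely illustrative.

```python
import statistics


def total_loss(data, estimate, exponent):
    """Sum of |x - estimate| ** exponent over the sample."""
    return sum(abs(x - estimate) ** exponent for x in data)


def best_estimate(data, exponent, steps=10_000):
    """Grid search for the value minimizing total_loss over [min, max]."""
    lo, hi = min(data), max(data)
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(grid, key=lambda c: total_loss(data, c, exponent))


# Hypothetical sample with one large outlier.
data = [1.0, 2.0, 3.0, 4.0, 100.0]

best_abs = best_estimate(data, 1)  # lands near the median
best_sq = best_estimate(data, 2)   # lands near the mean

print(best_abs, statistics.median(data))
print(best_sq, statistics.mean(data))
```

Other exponents, or losses that are not powers at all, give yet other answers for the same data.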

How much emphasis should be given to robustness? Once you consider robustness, there are infinitely many ways to compromise between efficiency and robustness.

Many frequentists are asking the same questions and are investigating alternatives. But I believe these alternatives are exactly what de Finetti had in mind: there are an infinite number of ad hoc choices you can make. Bayesian methods are criticized because prior distributions are explicitly subjective. But there are myriad subjective choices that go into frequentist statistics as well, though these choices are often implicit.

There is a great deal of latitude in Bayesian statistics as well, but the latitude is confined to fit within a universal framework: specify a likelihood and prior distribution, then update the model with data to compute the posterior distribution. There are many ways to construct a likelihood (exactly as in frequentist statistics), many ways to specify a prior, and many ways to summarize the information contained in the posterior distribution. But the basic framework is fixed. (In fact, the framework is inevitable given certain common-sense rules of inference.)
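As a tiny illustration of that fixed framework, here is a sketch using one particular choice of likelihood and prior: a binomial likelihood with a beta prior, where the posterior is again a beta distribution. The prior parameters and the data are invented for the example, and the function names are hypothetical.

```python
def update_beta(prior_a, prior_b, successes, failures):
    """Posterior Beta(a, b) parameters after binomial data, given a Beta prior."""
    return prior_a + successes, prior_b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution, one way to summarize the posterior."""
    return a / (a + b)


# Prior: Beta(1, 1), i.e. uniform on the response probability.
# Data (made up): 7 responses and 3 non-responses.
post_a, post_b = update_beta(1, 1, successes=7, failures=3)

print(post_a, post_b)
print(beta_mean(post_a, post_b))
```

A different prior or a different posterior summary would change the numbers but not the steps: prior, likelihood, data, posterior.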

More statistical posts

Finding the shortest interval with given mass

Here’s an elegant little theorem applied in statistics but useful more generally. Suppose you have a density function f(x) with one hump. Suppose a and b are two points on opposite sides of the hump with f(a) = f(b). Then [a, b] is the shortest interval with its mass. That is, any other interval of length b − a will have less mass than the interval [a, b]. (Here the “mass” of an interval is just the integral of f(x) over that interval.)

Suppose we want to find the shortest interval that has a given mass k. Start by imagining a horizontal line sitting on top of the graph of f(x).

graph of gamma(4,1) pdf

Now lower this horizontal line so that it intersects the graph in two places.

graph of gamma(4,1) pdf cut at height 0.1

Draw vertical lines down from these two points of intersection to find their x-coordinates.

graph showing how to find the x coordinates of the two intersection points

In this example, the two x-coordinates are about 1.30 and 5.77. So the interval [1.30, 5.77] is the shortest interval with its mass. In other words, no other interval of length 4.47 can contain more mass than this interval does.

We can find the shortest interval of mass k by lowering this horizontal line until the interval it defines has mass k. The lower the horizontal line, the greater the mass. So for any given mass less than the total mass f(x) assigns, there is a unique height of the horizontal line that defines an interval with that mass.

This procedure could be used to find the shortest confidence interval or the shortest Bayesian credible interval. In that case the “mass” is probability, and the task is to find the shortest interval containing a specified probability. The theorem says that the shortest confidence interval or credible interval has equal probability density at each end of the interval.
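The line-lowering procedure can be sketched in Python for the gamma(4, 1) density used in the graphs. This is only a sketch: it relies on the closed-form CDF that exists for integer shape parameters, it uses a simple hand-written bisection rather than a library root finder, and all the function names are made up.

```python
import math

MODE = 3.0  # the gamma(4, 1) density x^3 exp(-x) / 6 peaks at x = 3


def pdf(x):
    return x ** 3 * math.exp(-x) / 6.0


def cdf(x):
    # Closed form, valid because the shape parameter is the integer 4.
    return 1.0 - math.exp(-x) * (1 + x + x ** 2 / 2 + x ** 3 / 6)


def bisect(f, lo, hi, tol=1e-12):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    for _ in range(200):
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2


def interval_at_height(h):
    """The two points, one on each side of the mode, where the density equals h."""
    a = bisect(lambda x: pdf(x) - h, 0.0, MODE)
    b = bisect(lambda x: pdf(x) - h, MODE, 100.0)
    return a, b


def shortest_interval(mass):
    """Lower the horizontal line until the interval it cuts off has the given mass."""
    def excess(h):
        a, b = interval_at_height(h)
        return cdf(b) - cdf(a) - mass
    h = bisect(excess, 1e-12, pdf(MODE) * (1 - 1e-9))
    return interval_at_height(h)


print(interval_at_height(0.1))  # endpoints of the interval in the example above
a, b = shortest_interval(0.9)
print(a, b, cdf(b) - cdf(a), pdf(a), pdf(b))
```

The outer bisection works because the mass of the interval increases steadily as the height of the line decreases, which is the uniqueness statement in the paragraph above.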

A proof of this theorem is given in Statistical Inference, chapter 9. Technically, f(x) must be unimodal and positive with finite integral. A homework exercise in the same chapter outlines a simpler proof using the additional assumption that f(x) is continuous.

Related post: What is a confidence interval?