Posts tagged as:

Bayesian

Parameters and percentiles

by John on January 31, 2010

The doctor says 10% of patients respond within 30 days of treatment and 80% respond within 90 days of treatment. Now go turn that into a probability distribution. That’s a common task in Bayesian statistics, capturing expert opinion in a mathematical form to create a prior distribution.

Graph of gamma density with 10th percentile at 30 and 80th percentile at 90

Things would be easier if you could ask subject matter experts to express their opinions in statistical terms. You could ask “If you were to represent your belief as a gamma distribution, what would the shape and scale parameters be?” But that’s ridiculous. Even if they understood the question, it’s unlikely they’d give an accurate answer. It’s easier to think in terms of percentiles.

Asking for mean and variance is not much better than asking for shape and scale, especially for a non-symmetric distribution such as a survival curve. Anyone who knows what variance is probably thinks about it in terms of a normal distribution. Asking for mean and variance encourages someone to think about a symmetric distribution.

So once you have specified a couple percentiles, such as the example this post started with, can you find parameters that meet these requirements? If you can’t meet both requirements, how close can you come to satisfying them? Does it depend on how far apart the percentiles are? The answers to these questions depend on the distribution family. Obviously you can’t satisfy two requirements with a one-parameter distribution in general. If you have two requirements and two parameters, at least it’s feasible that both can be satisfied.

If you have a random variable X whose distribution depends on two parameters, when can you find parameter values so that Prob(X ≤ x1) = p1 and Prob(X ≤ x2) = p2? For starters, if x1 is less than x2 then p1 must be less than p2. For example, the probability of a variable being less than 5 cannot be bigger than the probability of being less than 6. For some common distributions, the only requirement is that the x’s and p’s be in a consistent order.

For a location-scale family, such as the normal or Cauchy distributions, you can always find a location and scale parameter to satisfy two percentile conditions. In fact, there’s a simple expression for the parameters. The location parameter is given by

\frac{x_1 F^{-1}(p_2) - x_2 F^{-1}(p_1)}{F^{-1}(p_2) - F^{-1}(p_1)}

and the scale parameter is given by

\frac{x_2 - x_1}{F^{-1}(p_2) - F^{-1}(p_1)}

where F is the CDF of the member of the family with location 0 and scale 1, and F^{-1} is its inverse (the quantile function).
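As a minimal sketch of the two formulas above, the following Python uses only the standard library. The function names `location_scale_from_quantiles` and `cauchy_inv_cdf` are made up for illustration, and the numbers are the doctor's percentiles from the start of the post, used here only to exercise the formulas (a normal or Cauchy prior is not being recommended for a response time).

```python
import math
from statistics import NormalDist


def location_scale_from_quantiles(x1, p1, x2, p2, inv_cdf):
    """Solve Prob(X <= x1) = p1 and Prob(X <= x2) = p2 for a location-scale family.

    inv_cdf is the quantile function of the standard member of the family
    (location 0, scale 1).
    """
    if not (x1 < x2 and 0 < p1 < p2 < 1):
        raise ValueError("need x1 < x2 and 0 < p1 < p2 < 1")
    q1, q2 = inv_cdf(p1), inv_cdf(p2)
    scale = (x2 - x1) / (q2 - q1)
    location = (x1 * q2 - x2 * q1) / (q2 - q1)
    return location, scale


def cauchy_inv_cdf(p):
    """Quantile function of the standard Cauchy distribution."""
    return math.tan(math.pi * (p - 0.5))


# 10th percentile at 30 and 80th percentile at 90
mu, sigma = location_scale_from_quantiles(30, 0.1, 90, 0.8, NormalDist().inv_cdf)
loc_c, scale_c = location_scale_from_quantiles(30, 0.1, 90, 0.8, cauchy_inv_cdf)

print("normal:", mu, sigma)
print("Cauchy:", loc_c, scale_c)
```

Passing the standard quantile function as an argument keeps the solver independent of the family, which mirrors the fact that the formulas depend on the family only through F^{-1}.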

The shape and scale parameters of a Weibull distribution can also be found in closed form. For a gamma distribution, parameters to satisfy the percentile requirements always exist. The parameters are easy to determine numerically but there is no simple expression for them.
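The Weibull closed form can be sketched as follows. This assumes the parameterization with CDF 1 − exp(−(x/scale)^shape), which may differ from the one used in the linked write-up; the function names are hypothetical. Taking logarithms twice turns the two percentile conditions into two linear equations in the shape and the log of the scale.

```python
import math


def weibull_from_quantiles(x1, p1, x2, p2):
    """Shape and scale of a Weibull with Prob(X <= x1) = p1, Prob(X <= x2) = p2.

    Assumes the CDF 1 - exp(-(x/scale)**shape).
    """
    if not (0 < x1 < x2 and 0 < p1 < p2 < 1):
        raise ValueError("need 0 < x1 < x2 and 0 < p1 < p2 < 1")
    # log(-log(1 - p)) = shape * (log(x) - log(scale)) is linear in log(x)
    c1 = math.log(-math.log(1 - p1))
    c2 = math.log(-math.log(1 - p2))
    shape = (c2 - c1) / (math.log(x2) - math.log(x1))
    scale = x1 / (-math.log(1 - p1)) ** (1 / shape)
    return shape, scale


def weibull_cdf(x, shape, scale):
    return 1 - math.exp(-((x / scale) ** shape))


# The example from the start of the post: 10% by day 30, 80% by day 90
shape, scale = weibull_from_quantiles(30, 0.1, 90, 0.8)
print(shape, scale)
print(weibull_cdf(30, shape, scale), weibull_cdf(90, shape, scale))
```

The gamma case has no such shortcut because its CDF cannot be inverted in closed form, so it needs a numerical root finder.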

For more details, see Determining distribution parameters from quantiles. See also the ParameterSolver software.

Update: I posted an article on CodeProject with Python code for computing the parameters described here.

Related posts:

Biostatistics software
Diagram of distribution relationships
How to calculate percentiles in memory-bound applications

{ 1 comment }

Biostatistics software

by John on January 13, 2010

The M. D. Anderson Cancer Center Department of Biostatistics has a software download site listing software developed by the department over many years.

The home page of the download site allows you to see all products sorted by date or by name. This page also allows search. A new page lets you see the software organized by tags.

{ 1 comment }

A case for robust Bayesian priors

by John on November 30, 2009

A paper I wrote with Jairo Fúquene and Luis Pericchi is now available online.

A Case for Robust Bayesian Priors with Applications to Clinical Trials
Jairo Fúquene, John Cook, and Luis Pericchi
Bayesian Analysis (2009) 4, Number 4, pp. 817–846.

{ 0 comments }

Bayesian clinical trials in one zip code

by John on October 27, 2009

I recently ran across this quote from Mithat Gönen of Memorial Sloan-Kettering Cancer Center:

While there are certainly some at other centers, the bulk of applied Bayesian clinical trial design in this country is largely confined to a single zip code.

from “Bayesian clinical trials: no more excuses,” Clinical Trials 2009; 6; 203.

The zip code Gönen alludes to is 77030, the zip code of M. D. Anderson Cancer Center. I can’t say how much activity there is elsewhere, but certainly we design and conduct a lot of Bayesian clinical trials at MDACC.

Related posts:

Cartoon guide to cancer research
Stopping trials of ineffective drugs sooner
Three ways of tuning an adaptively randomized clinical trial
Population drift

{ 1 comment }

R package for robust priors

by John on May 11, 2009

Jairo Fuquene has released an R package on CRAN to accompany our paper

A Case for Robust Bayesian priors with Applications to Binary Clinical Trials
Jairo A. Fuquene P., John D. Cook, Luis Raul Pericchi

{ 2 comments }

Classical statistics in a nutshell

by John on May 4, 2009

Here’s another quote from Anthony O’Hagan’s book Bayesian Inference.

All classical inference statements … are probability statements about x given θ, phrased so as to appear to be probability statements about θ.

Emphasis in the original.

Related posts:

Four reasons to use Bayesian inference
Four pillars of Bayesian statistics

{ 0 comments }

Four reasons to use Bayesian inference

by John on April 28, 2009

The following is a direct quote from Anthony O’Hagan’s book Bayesian Inference. I’ve edited the quote only to enumerate the points.

Why should one use Bayesian inference, as opposed to classical inference? There are various answers. Broadly speaking, some of the arguments in favour of the Bayesian approach are that it is

  1. fundamentally sound,
  2. very flexible,
  3. produces clear and direct inferences,
  4. makes use of all available information.

I’ll elaborate briefly on each of O’Hagan’s points.

Bayesian inference has a solid philosophical foundation. It is consistent with certain axioms of rational inference. Non-Bayesian systems of inference, such as fuzzy logic, must violate one or more of these axioms; their conclusions are rationally satisfying to the extent that they approximate Bayesian inference.

Bayesian inference is at the same time rigid and flexible. It is rigid in the sense that all inference follows the same form: set up a likelihood and a prior, then calculate the posterior by conditioning on observed data via Bayes’ theorem. But this rigidity channels creativity into useful directions. It provides a template for setting up complex models when necessary.
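To make the template concrete, here is a small illustrative sketch that is not from O’Hagan’s book. It computes a posterior on a grid by multiplying a prior by a likelihood and normalizing. The data (7 successes, 3 failures), the uniform prior, and the function name `posterior_on_grid` are all assumptions chosen for illustration.

```python
def posterior_on_grid(prior, likelihood, grid):
    """Bayes' theorem on a grid: posterior weight is prior times likelihood, normalized."""
    unnormalized = [prior(t) * likelihood(t) for t in grid]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical data: 7 responses in 10 patients
successes, failures = 7, 3

# Midpoints of 1000 equal cells covering (0, 1)
grid = [(i + 0.5) / 1000 for i in range(1000)]


def prior(theta):
    return 1.0  # uniform prior, i.e. Beta(1, 1)


def likelihood(theta):
    return theta**successes * (1 - theta) ** failures  # binomial, up to a constant


weights = posterior_on_grid(prior, likelihood, grid)
posterior_mean = sum(t * w for t, w in zip(grid, weights))

# With a Beta(1, 1) prior the exact posterior is Beta(1 + 7, 1 + 3)
exact_mean = (1 + successes) / (2 + successes + failures)
print(posterior_mean, exact_mean)
```

Swapping in a different prior or likelihood changes only the two small functions; the updating step stays the same, which is the sense in which the form is rigid.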

Frequentist inferences are awkward to explain. For example, confidence intervals and p-values are tedious to define rigorously. Most consumers of confidence intervals and p-values do not know what they mean and implicitly assume Bayesian interpretations. The difference is not simply pedantic. Particularly with regard to p-values, the common understanding can be grossly inaccurate. By contrast, Bayesian counterparts are simple to define and interpret. Bayesian credible intervals are exactly what most people think confidence intervals are. And a Bayesian hypothesis test simply compares the probability of each hypothesis via Bayes factors.

Sometimes the necessity of specifying prior distributions is seen as a drawback to Bayesian inference. On the other hand, the ability to specify prior distributions means that more information can be incorporated in an inference. See Musicians, drunks, and Oliver Cromwell for a colorful illustration from Jim Berger on the need to incorporate prior information.

Related posts:

Four pillars of Bayesian statistics
Bayesian statistics is misnamed
What is a confidence interval?
Why most published research results are false

{ 3 comments }

Bayesian statistics is misnamed

by John on April 20, 2009

I’m teaching an introduction to Bayesian statistics. My first thought was to start with Bayes’ theorem, as many introductions do. But this isn’t the right starting point. Bayes’ theorem is an indispensable tool for Bayesian statistics, but it is not the foundational principle. The foundational principle of Bayesian statistics is the decision to represent uncertainty by probabilities. Unknown parameters have probability distributions that represent the uncertainty in our knowledge of their values.

Once you decide to use probabilities to express parameter uncertainty, you inevitably run into the need for Bayes’ theorem to work with these probabilities. Bayes’ theorem is applied constantly in Bayesian statistics, and that is why the field takes its name from the theorem’s author, Reverend Thomas Bayes (1702-1761). But “Bayesian” doesn’t describe Bayesian statistics quite the same way that “frequentist” describes frequentist statistics. The term “frequentist” gets to the heart of how frequentist statistics interprets probability. But “Bayesian” refers to Bayes’ theorem, a computational tool for carrying out probability calculations in Bayesian statistics. If frequentist statistics were analogously named, it might be called “Bernoullian statistics” after Jacob Bernoulli’s law of large numbers.

The term “Bayesian” statistics might imply that frequentist statisticians dispute Bayes’ theorem. That is not the case. Bayes’ theorem is a simple mathematical result. What people dispute is the interpretation of the probabilities that Bayesians want to stick into Bayes’ theorem.

I don’t have a better name for Bayesian statistics. Even if I did, the name “Bayesian” is firmly established. It’s certainly easier to say “Bayesian statistics” than to say “that school of statistics that represents uncertainty in unknown parameters by probabilities,” even though the latter is accurate.

Related posts:

Four pillars of Bayesian statistics
What a probability means
Plausible reasoning
The probability that Shakespeare wrote a play

{ 5 comments }

Four pillars of Bayesian statistics

by John on April 7, 2009

Anthony O’Hagan’s book Bayesian Inference lists four basic principles of Bayesian statistics at the end of the first chapter:

  1. Prior information. Bayesian statistics provides a systematic way to incorporate what is known about parameters before an experiment is conducted. As a colleague of mine says, if you’re going to measure the distance to the moon, you know not to pick up a yardstick. You always know something before you do an experiment.
  2. Subjective probability. Some Bayesians don’t agree with the subjective probability interpretation, but most do, in practice if not in theory. If you write down reasonable axioms for quantifying degrees of belief, you inevitably end up with Bayesian statistics.
  3. Self-consistency. Even critics of Bayesian statistics acknowledge that Bayesian statistics has a rigorous self-consistent foundation. As O’Hagan says in his book, the difficulties with Bayesian statistics are practical, not foundational, and the practical difficulties are being resolved.
  4. No adhockery. Bruno de Finetti coined the term “adhockery” to describe the profusion of frequentist methods. More on this below.

This year I’ve had the chance to teach a mathematical statistics class primarily focusing on frequentist methods. Teaching frequentist statistics has increased my appreciation for Bayesian statistics. In particular, I better understand the criticism of frequentist adhockery.

For example, consider point estimation. Frequentist statistics to some extent has standardized on minimum variance unbiased estimators as the gold standard. But why? And what do you do when such estimators don’t exist?

Why focus on unbiased estimators? Granted, lack of bias sounds like a good thing to have. All things being equal, it would be better to be unbiased than biased. But all things are not equal. Sometimes unbiased estimators are ridiculous. Why consider only biased vs. unbiased, a binary choice, rather than degree of bias, a continuous choice? Efficiency is also important, and someone may reasonably accept a small amount of bias in exchange for a large increase in efficiency.
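A standard textbook illustration of that trade-off, added here as a sketch rather than taken from the post, is estimating the variance of normal data. If S is the sum of squared deviations from the sample mean, then S/σ² has a chi-squared distribution with n − 1 degrees of freedom, so the bias and mean squared error of S/c can be written down exactly for any divisor c. The function name below is hypothetical.

```python
def variance_estimator_bias_and_mse(n, divisor, sigma2=1.0):
    """Exact bias and mean squared error of S / divisor for normal samples of size n.

    S is the sum of squared deviations from the sample mean, so that
    S / sigma2 is chi-squared with n - 1 degrees of freedom
    (mean n - 1, variance 2(n - 1)).
    """
    mean = sigma2 * (n - 1) / divisor
    variance = 2 * (n - 1) * sigma2**2 / divisor**2
    bias = mean - sigma2
    return bias, variance + bias**2


n = 10
results = {}
for divisor in (n - 1, n, n + 1):
    results[divisor] = variance_estimator_bias_and_mse(n, divisor)
    bias, mse = results[divisor]
    print(f"divide by {divisor}: bias = {bias:+.4f}, MSE = {mse:.4f}")
```

Dividing by n − 1 gives the unbiased estimator, yet dividing by n + 1 has smaller mean squared error under these assumptions: a little bias buys a reduction in variance.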

Why minimize mean squared error? Efficiency in classical statistics is typically measured by mean squared error, the expected squared error of an estimator. But why not minimize some other measure of error? Why use an exponent of 2 and not 1, or 4, or 2.738? Or why limit yourself to power functions at all? The theory is simplest for squared error, and while this is a reasonable choice in many applications, it is still an arbitrary choice.

How much emphasis should be given to robustness? Once you consider robustness, there are infinitely many ways to compromise between efficiency and robustness.

Many frequentists are asking the same questions and are investigating alternatives. But I believe these alternatives are exactly what de Finetti had in mind: there are an infinite number of ad hoc choices you can make. Bayesian methods are criticized because prior distributions are explicitly subjective. But there are myriad subjective choices that go into frequentist statistics as well, though these choices are often implicit.

There is a great deal of latitude in Bayesian statistics as well, but the latitude is confined to fit within a universal framework: specify a likelihood and prior distribution, then update the model with data to compute the posterior distribution. There are many ways to construct a likelihood (exactly as in frequentist statistics), many ways to specify a prior, and many ways to summarize the information contained in the posterior distribution. But the basic framework is fixed. (In fact, the framework is inevitable given certain common-sense rules of inference.)

Related posts:

Probability and information
What is a confidence interval?

{ 8 comments }

Finding the shortest interval with given mass

by John on February 23, 2009

Here’s an elegant little theorem applied in statistics but useful more generally. Suppose you have a density function f(x) with one hump. Suppose a and b are two points on opposite sides of the hump with f(a) = f(b). Then [a, b] is the shortest interval with its mass. That is, any other interval of length b-a will have less mass than the interval [a, b]. (Here the “mass” of an interval is just the integral of f(x) over that interval.)

Suppose we want to find the shortest interval that has a given mass k. Start by imagining a horizontal line sitting on top of the graph of f(x).

graph of gamma(4,1) pdf

Now lower this horizontal line so that it intersects the graph in two places.

graph of gamma(4,1) pdf cut at height 0.1

Draw vertical lines down from these two points of intersection to find their x-coordinates.

graph showing how to find the x coordinates of the two intersection points

In this example, the two x-coordinates are about 1.30 and 5.77. So the interval [1.30, 5.77] is the shortest interval with its mass. In other words, no other interval of length 4.47 can contain more mass than this interval does.

We can find the shortest interval of mass k by lowering this horizontal line until the interval it defines has mass k. The lower the horizontal line, the greater the mass. So for any given mass less than the total mass f(x) assigns, there is a unique height of the horizontal line that defines an interval with that mass.
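The lowering-the-line procedure can be sketched in Python for the gamma(4, 1) density used in the graphs. This is an illustrative implementation, not code from the post: it relies on the closed-form CDF that holds for integer shape, and on plain bisection, both chosen to keep the example free of dependencies. All function names are hypothetical.

```python
import math

MODE = 3.0  # the gamma(4, 1) density peaks at x = 3


def pdf(x):
    return x**3 * math.exp(-x) / 6.0


def cdf(x):
    # Closed form available because the shape parameter is an integer
    return 1.0 - math.exp(-x) * (1.0 + x + x**2 / 2.0 + x**3 / 6.0)


def bisect(g, lo, hi, tol=1e-12):
    """Root of g on [lo, hi], assuming g(lo) and g(hi) have opposite signs."""
    g_lo = g(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        g_mid = g(mid)
        if (g_mid > 0) == (g_lo > 0):
            lo, g_lo = mid, g_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def interval_at_height(h):
    """Where the horizontal line at height h crosses the density."""
    a = bisect(lambda x: pdf(x) - h, 0.0, MODE)    # left of the hump
    b = bisect(lambda x: pdf(x) - h, MODE, 100.0)  # right of the hump
    return a, b


def shortest_interval(k):
    """Lower the line until the interval it cuts off has mass k (0 < k < 1)."""
    def excess_mass(h):
        a, b = interval_at_height(h)
        return cdf(b) - cdf(a) - k

    # Stay just below the peak so the line always crosses the density twice
    h = bisect(excess_mass, 1e-12, pdf(MODE) * (1 - 1e-9))
    return interval_at_height(h)


print(interval_at_height(0.1))  # compare with the endpoints read off the graph
a, b = shortest_interval(0.9)
print(a, b, cdf(b) - cdf(a))
```

The outer bisection works because the mass is a decreasing function of the height of the line, as noted above.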

This procedure could be used to find the shortest confidence interval or the shortest Bayesian credible interval. In that case the “mass” is probability, and the task is to find the shortest interval containing a specified probability. The theorem says that the shortest confidence interval or credible interval has equal probability density at each end of the interval.

A proof of this theorem is given in Statistical Inference, chapter 9. Technically, f(x) must be unimodal and positive with finite integral. A homework exercise in the same chapter outlines a simpler proof using the additional assumption that f(x) is continuous.

Related post: What is a confidence interval?

{ 0 comments }

What is a confidence interval?

by John on February 3, 2009

At first glance, a confidence interval is simple. If we say [3, 4] is a 95% confidence interval for a parameter θ, then there’s a 95% chance that θ is between 3 and 4. That explanation is not correct, but it works better in practice than in theory.

If you’re a Bayesian, the explanation above is correct if you change the terminology from “confidence” interval to “credible” interval. But if you’re a frequentist, you can’t make probability statements about parameters.

Confidence intervals take some delicate explanation. I took a look at Andrew Gelman and Deborah Nolan’s book Teaching Statistics: A Bag of Tricks to see what they had to say about teaching confidence intervals. They begin their section on the topic by saying “Confidence intervals are complicated …” That made me feel better. Some folks with more experience teaching statistics also find this challenging to teach. And according to The Lady Tasting Tea, confidence intervals were controversial when they were first introduced.

From a frequentist perspective, confidence intervals are random, parameters are not, exactly the opposite of what everyone naturally thinks. You can’t talk about the probability that θ is in an interval because θ isn’t random. But in that case, what good is a confidence interval? As L. J. Savage once said,

The only use I know for a confidence interval is to have confidence in it.
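The frequentist reading, in which the interval is random and θ is fixed, can be illustrated with a small simulation. This is a sketch under simple assumptions that are not from the post: normal data with known standard deviation, a made-up true value θ = 3.5, and a hypothetical function name `coverage`.

```python
import random
from statistics import NormalDist


def coverage(theta=3.5, sigma=1.0, n=25, trials=2000, level=0.95, seed=0):
    """Fraction of simulated confidence intervals that contain the fixed theta."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / n**0.5
    hits = 0
    for _ in range(trials):
        # A new sample gives a new interval; theta never changes
        xbar = sum(rng.gauss(theta, sigma) for _ in range(n)) / n
        if xbar - half_width <= theta <= xbar + half_width:
            hits += 1
    return hits / trials


print(coverage())
```

The 95% describes the long-run fraction of intervals that capture θ across repeated experiments. Once a particular interval such as [3, 4] has been computed, the frequentist framework makes no probability statement about that one interval.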

In practice, people don’t go too wrong using the popular but technically incorrect notion of a confidence interval. Frequentist confidence intervals often approximate Bayesian credible intervals; the frequentist approach is more useful in practice than in theory.

It’s interesting to see a sort of détente between frequentist and Bayesian statisticians. Some frequentists say that the Bayesian interpretation of statistics is nonsense, but the methods these crazy Bayesians come up with often have good frequentist properties. And some Bayesians say that frequentist methods, such as confidence intervals, are useful because they can come up with results that often approximate Bayesian results.

{ 0 comments }

Hardly anyone cares about statistics directly. People more often care about decisions they need to make with the help of statistics. This suggests that the statistics and decision-making process should be explicitly integrated. The name for this integrated approach is “decision theory.” Problems in decision theory are set up with the goal of maximizing “utility,” the benefit you expect to get from a decision. Equivalently, problems are set up to minimize expected cost. Cost may be a literal monetary cost, but it could be some other measure of something you want to avoid.

I was at a conference this morning where David Draper gave an excellent talk entitled Bayesian Decision Theory in Biostatistics: the Utility of Utility. Draper presented an example of selecting variables for a statistical model. But instead of just selecting the most important variables in a purely statistical sense, he factored in the cost of collecting each variable. So if two variables make nearly equal contributions to a model, for example, the procedure would give preference to the variable that is cheaper to collect. In short, Draper recommended a cost-benefit analysis rather than the typical (statistical) benefit-only analysis. Very reasonable.

Why don’t people always take this approach? One reason is that it’s hard to assign utilities to outcomes. Dollar costs are often easy to account for, but it can be much harder to assign values to benefits. For example, you have to ask “Benefit for whom?” In a medical context, do you want to maximize the benefit to patients? Doctors? Insurance companies? Tax payers? Regulators? Statisticians? If you want to maximize some combination of these factors, how do you weight the interests of the various parties?

Assigning utilities is hard work, and you can never make everyone happy. No matter how good a job you do, someone will criticize you. Nearly everyone agrees in the abstract that considering utilities is the way to go, but in practice it is hardly ever done. Anyone who proposes a way to quantify utility is immediately shot down by people who have a better idea. The net result is that rather than using a reasonable but imperfect idea of utility, no utility is used at all. Or rather no explicit definition of utility is used. There is usually some implicit idea of utility, chosen for mathematical convenience, and that one wins by default. In general, people much prefer to leave utilities implicit.

In the Q&A after his talk, Draper said something to the effect that the status quo persists for a very good reason: thinking is hard work, and it opens you up to criticism.

{ 2 comments }

The probability that Shakespeare wrote a play

by John on December 18, 2008

Some people object to asking about the probability that Shakespeare wrote this or that play. One objection is that someone has already written the play, either Shakespeare or someone else. If Shakespeare wrote it, then the probability is one that he did. Otherwise the probability is zero. By this reasoning, no one can make probability statements about anything that has already happened. Another objection is that probability only applies to random processes. We cannot apply probability to questions about document authorship because documents are not random.

I just ran across a blog post by Ted Dunning that weighs in on this question. He writes

The statement “It cannot be probability …” is essentially a tautology. It should read, “We cannot use the word probability to describe our state of knowledge because we have implicitly accepted the assumption that probability cannot be used to describe our state of knowledge”.

He goes on to explain that if we think about statements of knowledge in terms of probabilities, we get a consistent system, so we might as well reason as if it’s OK to use probability theory.

The uncertainty about the authorship of the play does not exist in history — an omniscient historian would know who wrote it. Nor does it exist in nature — the play was not created by a random process. The uncertainty is in our heads. We don’t know who wrote it. But if we use numbers to represent our uncertainty, and we agree to certain common-sense axioms about how this should be done, we inevitably get probability theory.

As E. T. Jaynes once put it, “probabilities do not describe reality — only our information about reality.”

{ 1 comment }

Three math diagrams

by John on October 24, 2008

Here are three pages containing diagrams that each summarize several theorems.

The distribution relationships page summarizes around 40 relationships between around 20 statistical distributions.

The modes of convergence page has three diagrams that explain when one kind of convergence implies another: almost everywhere, almost uniform, Lp, and convergence in measure.

The page conjugate prior relationships is similar to the probability distribution page, but concentrates on relationships in Bayesian statistics.

What are some of your favorite diagrams? Do you have any suggestions for diagrams that you’d like to see someone make?

{ 1 comment }

Diagram of conjugate prior relationships

by John on October 8, 2008

Here is a diagram to summarize some well-known conjugate prior relationships.

See conjugate prior relationships for details regarding distributions and posterior parameters.

{ 0 comments }