Posts tagged as:

Bayesian

The universal solvent of statistics

by John on February 1, 2012

Andrew Gelman just posted an interesting article on the philosophy of Bayesian statistics. Here’s my favorite passage.

This reminds me of a standard question that Don Rubin … asks in virtually any situation: “What would you do if you had all the data?” For me, that “what would you do” question is one of the universal solvents of statistics.

Emphasis added.

I had not heard Don Rubin’s question before, but I think I’ll be asking it often. It reminds me of Alice’s famous dialog with the Cheshire Cat:

“Would you tell me, please, which way I ought to go from here?”

“That depends a good deal on where you want to get to,” said the Cat.

“I don’t much care where–” said Alice.

“Then it doesn’t matter which way you go,” said the Cat.

Cheshire Cat

Related post: Irrelevant uncertainty

{ 6 comments }

Six analysis and probability diagrams

by John on January 17, 2012

Here are a few diagrams I’ve created that summarize relationships in analysis and probability. Click on a thumbnail image to go to a page with the full image and explanatory text.

Special functions

Gamma and related functions

Probability distributions

Conjugate priors

Convergence theorems

Bessel functions

{ 3 comments }

Interpreting statistics

by John on January 13, 2012

From Matt Briggs:

I challenge you to find me in any published statistical analysis, outside of an introductory textbook, a confidence interval given the correct interpretation. If you can find even one instance where the [frequentist] confidence interval is not interpreted as a [Bayesian] credible interval, then I will eat your hat.

Most statistical analysis is carried out by people who do not interpret their results correctly. They carry out frequentist procedures and then give the results a Bayesian interpretation. This is not simply a violation of an academic taboo. It means that people generally underestimate the uncertainty in their conclusions.

Related posts:

Most published research results are false
Classical statistics in a nutshell

{ 16 comments }

Bad logic, but good statistics

by John on November 28, 2011

Ad hominem arguments are bad logic, but good (Bayesian) statistics. A statement isn’t necessarily false because it comes from an unreliable source, though it is more likely to be false.

Some people are much more likely to know what they’re talking about than others, depending on context. You’re more likely to get good medical advice from a doctor than from an accountant, though the former may be wrong and the latter may be right. (Actors are not likely to know what they’re talking about when giving advice regarding anything but acting, though that doesn’t stop them.)

Ad hominem guesses are a reasonable way to construct a prior, but the prior needs to be updated with data. Given no other data, the doctor is more likely to know medicine than the accountant is. Assuming a priori that both are equally likely to be correct may be “fair,” but it’s not reasonable. However, as you gather data on the accuracy of each, you could change your mind. The posterior distribution could persuade you that you’ve been talking to a quack doctor or an accountant who is unusually knowledgeable of medicine.

Related post: Musicians, drunks, and Oliver Cromwell

{ 5 comments }

Leading digits of factorials

by John on October 19, 2011

Suppose you take factorials of a lot of numbers and look at the leading digit of each result. You could argue that there’s no apparent reason that any digit would be more common than any other, so you’d expect each of the digits 1 through 9 would come up 1/9 of the time. Sounds plausible, but it’s wrong.

The leading digits of factorials follow Benford’s law as described in the previous post. In fact, factorials follow Benford’s law even better than physical constants do. Here’s a graph of the leading digits of the factorials of 1 through 500.

In the remainder of this post, I’ll explain why Benford’s law should apply to factorials, make an aside on statistics, and point out an interesting feature of the Python code used to generate the chart above.

Why Benford’s law applies

Here’s a hand-waving explanation. One way to justify Benford’s law is to say that physical constants are uniformly distributed, but on a logarithmic scale. The same is true for factorials, and it’s easier to see why.

The leading digits of the logarithms depend on on their logarithms in base 10. The gamma function extends the factorial function and it is log-convex. The logarithm of the gamma function is fairly flat (see plot here), and so the leading digits of the log-gamma function applied to integers are uniformly distributed on a logarithmic scale.  (I’ve mixed logs base 10 and natural logs here, but that doesn’t matter. All logarithms are the same up to a multiplicative constant. So if a plot is nearly linear on a log10 scale, it’s nearly linear on a natural log scale.)

Update: Graham gives a link in the comments below to a paper proving that factorials satisfy Benford’s law exactly in the limit.

Uniform on what scale?

This example brings up an important principle in statistics. Some say that if you don’t have a reason to assume anything else, use a uniform distribution. For example, some say that a uniform prior is the ideal uninformative prior for Bayesian statistics. But you have to ask “Uniform on what scale?” It turns out that the leading digits of physical constants and factorials are indeed uniformly distributed, but on a logarithmic scale.

Python integers and floating point

I used nearly the same code to produce the chart above as I used in its counterpart in the previous post. However, one thing had to change: I couldn’t compute the leading digits of the factorials the same way. Python has extended precision integers, so I can compute 500! factorial without overflowing. Using floating point numbers, I could only go up to 170!. But when I used my previous code to find the leading digit, it first tried to apply log10 to an integer larger than the largest representable floating point number and failed. Converting numbers such as 500! to floating point numbers will overflow. (See Floating point numbers are a leaky abstraction.)

The solution was to find the leading digit using only integer operations.

def leading_digit_int(n):
    while n > 9:
        n = n/10
    return n

This code works fine for numbers like 500! or even larger.

Related posts:

Benford’s law and SciPy
Physical constants and factorials
Anatomy of a floating point number

{ 13 comments }

A Bayesian view of Amazon Resellers

by John on September 27, 2011

I was buying a used book through Amazon this evening. Three resellers offered the book at essentially the same price. Here were their ratings:

  • 94% positive out of 85,193 reviews
  • 98% positive out of 20,785 reviews
  • 99% positive out of 840 reviews

Which reseller is likely to give the best service? Before you assume it’s the seller with the highest percentage of positive reviews, consider the following simpler scenario.

Suppose one reseller has 90 positive reviews out of 100. The other reseller has two reviews, both positive. You could say one has 90% approval and the other has 100% approval, so the one with 100% approval is better. But this doesn’t take into consideration that there’s much more data on one than the other. You can have some confidence that 90% of the first reseller’s customers are satisfied. You don’t really know about the other because you have only two data points.

A Bayesian view of the problem naturally incorporates the amount of data as well as its average. Let θA be the probability of a customer being satisfied with company A’s service. Let θB be the corresponding probability for company B. Suppose before we see any reviews we think all ratings are equally likely. That is, we start with a uniform prior distribution θA and θB. A uniform distribution is the same as a beta(1, 1) distribution.

After observing 90 positive reviews and 10 negative reviews, our posterior estimate on θA has a beta(91, 11) distribution. After observing 2 positive reviews, our posterior estimate on θB has a beta(3, 1) distribution. The probability that a sample from θA is bigger than a sample from θB is 0.713. That is, there’s a good chance you’d get better service from the reseller with the lower average approval rating.

beta(91,11) versus beta(3,1)

Now back to our original question. Which of the three resellers is most likely to satisfy a customer?

Assume a uniform prior on θX, θY, and θZ, the probabilities of good service for each reseller. The posterior distributions on these variables have distributions beta(80082, 5113), beta(20370, 417), and beta(833, 9).

These beta distributions have such large parameters that we can approximate them by normal distributions with the same mean and variance. (A beta(a, b) random variable has mean a/(a+b) and variance ab/((a+b)2(a+b+1)).) The variable with the most variance, θZ, has standard deviation 0.003. The other variables have even smaller standard deviation. So the three distributions are highly concentrated at their mean values with practically non-overlapping support. And so a sample from θX or θY is unlikely to be higher than a sample from θZ.

In general, going by averages alone works when you have a lot of customer reviews. But when you have a small number of reviews, going by averages alone could be misleading.

Thanks to Charles McCreary for suggesting the xkcd comic.

Related links:

Inequality Calculator
Calculating random inequalities
Exact calculation of beta inequalities

{ 28 comments }

Big data and humility

by John on September 22, 2011

One of the challenges with big data is to properly estimate your uncertainty. Often “big data” means a huge amount of data that isn’t exactly what you want.

As an example, suppose you have data on how a drug acts in monkeys and you want to infer how the drug acts in humans. There are two sources of uncertainty:

  1. How well do we really know the effects in monkeys?
  2. How well do these results translate to humans?

The former can be quantified, and so we focus on that, but the latter may be more important. There’s a strong temptation to believe that big data regarding one situation tells us more than it does about an analogous situation.

I’ve seen people reason as follows. We don’t really know how results translate from monkeys to humans (or from one chemical to a related chemical, from one market to an analogous market, etc.). We have a moderate amount of data on monkeys and we’ll decimate it and use that as if it were human data, say in order to come up with a prior distribution.

Down-weighting by a fixed ratio, such as 10 to 1, is misleading. If you had 10x as much data on monkeys, would you as much about effects in humans as if the original smaller data set were collected on people? What if you suddenly had “big data” involving every monkey on the planet. More data on monkeys drives down your uncertainty about monkeys, but does nothing to lower your uncertainty regarding how monkey results translate to humans.

At some point, more data about analogous cases reaches diminishing return and you can’t go further without data about what you really want to know. Collecting more and more data about how a drug works in adults won’t help you learn how it works in children. At some point, you need to treat children. Terabytes of analogous data may not be as valuable as kilobytes of highly relevant data.

Related posts:

Big data is not enough
Does gaining weight make you taller?

{ 10 comments }

Bayes isn’t magic

by John on September 6, 2011

If a study is completely infeasible using traditional statistical methods, Bayesian methods are probably not going to rescue it. Bayesian methods can’t squeeze blood out of a turnip.

The Bayesian approach to statistics has real advantages, but sometimes these advantages are oversold. Bayesian statistics is still statistics, not magic.

{ 7 comments }

Markov chains don’t converge

by John on August 10, 2011

I often hear people often say they’re using a burn-in period in MCMC to run a Markov chain until it converges. But Markov chains don’t converge, at least not the Markov chains that are useful in MCMC. These Markov chains wander around forever exploring the domain they’re sampling from. Any point that makes a “bad” starting point for MCMC is a point you might reach by burn-in.

Not only that, Markov chains can’t remember how they got where they are. That’s their defining property. So if your burn-in period ends at a point x, the chain will perform exactly as if you had simply started at x.

When someone says a Markov chain has converged, they mean that the chain has entered a high-probability region. I’ll explain in a moment why that’s desirable. But the belief/hope is that a burn-in period will put a Markov chain in a high-probability region. And it probably will, but there are a couple reasons why this isn’t necessarily the best thing to do.

  1. Burn-in may be ineffective. You could use use optimization to be certain that you’re starting in such a region. Burn-in offers no such assurance. See Burn-in and other MCMC folklore.
  2. Burn-in may be inefficient. Casting aside worries that burn-in may not do what you want, it can be an inefficient way to find a high-probability region. MCMC isn’t designed to optimize a density function but rather to sample from it.

Why use burn-in? MCMC practitioners often don’t know how to do optimization, and in any case the corresponding optimization problem may be difficult. Also, if you’ve got the MCMC code in hand, it’s convenient to use it to find a starting point as well as for sampling.

So why does it matter whether you start your Markov chain in a high-probability region? In the limit, it doesn’t matter. But since you’re averaging some function of some finite number of samples, your average will be a better approximation if you start at a typical point in the density you’re sampling. If you start at a low probability location, your average may be more biased.

Samples from Markov chains don’t converge, but averages of functions applied to these samples may converge. When someone says a Markov chain has converged, they mean they’re at a point where the average of a finite number of function applications will be a better approximation of the thing they want to compute than if they’d started at a low probability point.

It’s not just a matter of imprecise language when people say a Markov chain has converged. It sometimes betrays a misunderstanding of how Markov chains work.

{ 13 comments }

Teaching Bayesian stats backward

by John on April 20, 2011

Most presentations of Bayesian statistics I’ve seen start with elementary examples of Bayes’ Theorem. And most of these use the canonical example of testing for rare diseases. But the connection between these examples and Bayesian statistics is not obvious at first. Maybe this isn’t the best approach.

What if we begin with the end in mind? Bayesian calculations produce posterior probability distributions on parameters. An effective way to teach Bayesian statistics might be to start there. Suppose we had probability distributions on our parameters. Never mind where they came from. Never mind classical objections that say you can’t do this. What if you could? If you had such distributions, what could you do with them?

For starters, point estimation and interval estimation become trivial. You could, for example, use the distribution mean as a point estimate and the area between two quantiles as an interval estimate. The distributions tell you far more than  point estimates or interval estimates could; these estimates are simply summaries of the information contained in the distributions.

It makes logical sense to start with Bayes’ Theorem since that’s the tool used to construct posterior distributions. But I think it makes pedagogical sense to start with the posterior distribution and work backward to how one would come up with such a thing.

Bayesian statistics is so named because Bayes’ Theorem is essential to its calculations. But that’s a little like classical statistics Central Limitist statistics because it relies heavily on the Central Limit Theorem.

The key idea of Bayesian statistics is to represent all uncertainty by probability distributions. That idea can be obscured by an early emphasis on calculations.

Related posts:

Interview with David Spiegelhalter
Occam’s razor and Bayes’ theorem
Four reasons to use Bayesian inference

{ 11 comments }

Significance testing and Congress

by John on April 14, 2011

The US Supreme Court’s criticism of significance testing has been in the news lately. Here’s a criticism of significance testing involving the US Congress. Consider the following syllogism.

  1. If a person is an American, he is not a member of Congress.
  2. This person is a member of Congress.
  3. Therefore he is not American.

The initial premise is false, but the reasoning is correct if we assume the initial premise is true.

The premise that Americans are never members of Congress is clearly false. But it’s almost true! The probability of an American being a member of Congress is quite small, about 535/309,000,000. So what happens if we try to salvage the syllogism above by inserting “probably” in the initial premise and conclusion?

  1. If a person is an American, he is probably not a member of Congress.
  2. This person is a member of Congress.
  3. Therefore he is probably not American.

What went wrong? The probability is backward. We want to know the probability that someone is American given he is a member of Congress, not the probability he is a member of Congress given he is American.

Science continually uses flawed reasoning analogous to the example above. We start with a “null hypothesis,” a hypothesis we seek to disprove. If our data are highly unlikely assuming this hypothesis, we reject that hypothesis.

  1. If the null hypothesis is correct, then these data are highly unlikely.
  2. These data have occurred.
  3. Therefore, the null hypothesis is highly unlikely.

Again the probability is backward. We want to know the probability of the hypothesis given the data, not the probability of the data given the hypothesis.

We can’t reject a null hypothesis just because we’ve seen data that are rare under this hypothesis. Maybe our data are even more rare under the alternative. It is rare for an American to be in Congress, but it is even more rare for someone who is not American to be in the US Congress!

I found this illustration in The Earth is Round (p < 0.05) by Jacob Cohen (1994). Cohen in turn credits Pollard and Richardson (1987) in his references.

Related posts:

How insignificant is significance testing?
Five criticisms of significance testing
Most published research results are false
Classical statistics in a nutshell

{ 8 comments }

Like Laplace, only more so

by John on February 17, 2011

The Laplace distribution is pointy in the middle and fat in the tails relative to the normal distribution.This post is about a probability distribution that is more pointy in the middle and fatter in the tails.

[click to continue...]

{ 8 comments }

The end of hard-edged science?

by John on February 14, 2011

Bradley Efron says that science is moving away from things like predicting sunrise times and toward predicting things like the weather. The trend is away from studying precisely predictable systems, what Efron calls “hard-edged science,” and toward studying systems “where predictability is tempered by a heavy dose of randomness.”

Hard-edged science still dominates public perceptions, but the attention of modern scientists has swung heavily toward rainfall-like subjects, the kind where random behavior plays a major role. … Deterministic Newtonian science is majestic, and the basis of modern science too, but a few hundred years of it pretty much exhausted nature’s storehouse of precisely predictable events. Subjects like biology, medicine, and economics require a more flexible scientific world view, the kind we statisticians are trained to understand.

Certainly there is increased interest in systems containing “a heavy dose of randomness” but can we really say that we have “pretty much exhausted nature’s storehouse of precisely predictable effects”?

Source: Modern Science and the Bayesian-Frequentist Controversy

Related posts:

Scientific results fading over time
Occam’s razor and Bayes’ theorem
The law of medium numbers

{ 11 comments }

Interview with David Spiegelhalter

by John on February 2, 2011

Samuel Hansen interviews David Spiegelhalter on his mathematical podcast Strongly Connected Components. From the show notes:

On today’s episode of Strongly Connected Components Samuel Hansen called up the Winton Professor for the Public Understanding of Risk, as well as Senior Scientist in the MRC Biostatistics Unit, David Spiegelhalter. They discussed the true meaning of risk, the importance of the Bayesian Method, how to get a lot of citations, and even a bit about the bookies.

{ 1 comment }

A couple preprints

by John on January 20, 2011

Here are a couple new preprints.

Block-adaptive randomization.
A proposed method for limiting the size of runs in a response-adaptive clinical trial.

Skeptical and optimistic robust priors for clinical trials.
Joint work with Jairo Fúquene and Luis Pericchi from University of Puerto Rico.

{ 1 comment }