Like Laplace, only more so

Posted on 17 February 2011 by John

The Laplace distribution is pointy in the middle and fat in the tails relative to the normal distribution.This post is about a probability distribution that is more pointy in the middle and fatter in the tails.

Here are pictures of the normal and Laplace (a.k.a. double exponential) distributions.

Normal:

Laplace:

The normal density is proportional to exp(− x²/2) and the Laplace distribution is proportional to exp(-|x|). Near the origin, the normal density looks like 1 − x²/2 and the Laplace density looks like 1 − |x|. And as x gets large, the normal density goes to zero much faster than the Laplace.

Now let’s look at the distribution with density

f(x) = log(1 + 1/x²)

I don’t know a name for this. I asked on Cross Validated whether there was a name for this distribution and no knew of one. The density is related to the bounds on a density presented in this paper. Here’s a plot.

The density is unbounded near the origin, blowing up like −2 log( |x| ) as x approaches 0, and so is more pointed than the Laplace density. As x becomes large, log(1 + x⁻²) is asymptotically x⁻² so the distribution has the same tail behavior as a Cauchy distribution, much heavier tailed than the Laplace density.

Here’s a plot of this new density and the Laplace density together to make the contrast more clear.

As William Huber pointed out in his answer on Cross Validated, this density has a closed-form CDF:

F(x) = 1/2 + (arctan(x) − x log( sin( arctan(x) ) ))/π

The paper mentioned above used a similar density as a Bayesian prior distribution in situations where many observations were expected to be small, though large values were expected as well.

The end of hard-edged science?

Posted on 14 February 2011 by John

Bradley Efron says that science is moving away from things like predicting sunrise times and toward predicting things like the weather. The trend is away from studying precisely predictable systems, what Efron calls “hard-edged science,” and toward studying systems “where predictability is tempered by a heavy dose of randomness.”

Hard-edged science still dominates public perceptions, but the attention of modern scientists has swung heavily toward rainfall-like subjects, the kind where random behavior plays a major role. … Deterministic Newtonian science is majestic, and the basis of modern science too, but a few hundred years of it pretty much exhausted nature’s storehouse of precisely predictable events. Subjects like biology, medicine, and economics require a more flexible scientific world view, the kind we statisticians are trained to understand.

Certainly there is increased interest in systems containing “a heavy dose of randomness” but can we really say that we have “pretty much exhausted nature’s storehouse of precisely predictable effects”?

Source: Modern Science and the Bayesian-Frequentist Controversy

Interview with David Spiegelhalter

Posted on 2 February 2011 by John

Samuel Hansen interviews David Spiegelhalter on his mathematical podcast Strongly Connected Components. [Link no longer available.] From the show notes:

On today’s episode of Strongly Connected Components Samuel Hansen called up the Winton Professor for the Public Understanding of Risk, as well as Senior Scientist in the MRC Biostatistics Unit, David Spiegelhalter. They discussed the true meaning of risk, the importance of the Bayesian Method, how to get a lot of citations, and even a bit about the bookies.

A couple preprints

Posted on 20 January 2011 by John

Here are a couple new preprints.

Block-adaptive randomization.
A proposed method for limiting the size of runs in a response-adaptive clinical trial.

Skeptical and optimistic robust priors for clinical trials.
Joint work with Jairo Fúquene and Luis Pericchi from University of Puerto Rico.

Occam’s razor and Bayes’ theorem

Posted on 12 January 2011 by John

Occam’s razor says that if two models fit equally well, the simpler model is likely to be a better description of reality. Why should that be?

A paper by Jim Berger suggests a Bayesian justification of Occam’s razor: simpler hypotheses have higher posterior probabilities when they fit well.

A simple model makes sharper predictions than a more complex model. For example, consider fitting a linear model and a cubic model. The cubic model is more general and fits more data. The linear model is more restrictive and hence easier to falsify. But when the linear and cubic models both fit, Bayes’ theorem “rewards” the linear model for making a bolder prediction. See Berger’s paper for a details and examples.

From the conclusion of the paper:

Ockham’s razor, far from being merely an ad hoc principle, can under many practical situations in science be justified as a consequence of Bayesian inference. Bayesian analysis can shed new light on what the notion of “simplest” hypothesis consistent with the data actually means.

Bayesian methods at the end

Posted on 16 December 2010 by John

I was looking at the preface of an old statistic book and read this:

The Bayesian techniques occur at the end of each chapter; therefore they can be omitted if time does not permit their inclusion.

This approach is typical. Many textbooks present frequentist statistics with a little Bayesian statistics at the end of each section or at the end of the book.

There are a couple ways to look at that. One is simply that Bayesian methods are optional. They must not be that important or they’d get more space. The author even recommends dropping them if pressed for time.

Another way to look at this is that Bayesian statistics must be simpler than frequentist statistics since the Bayesian approach to each task requires fewer pages.

Probability that a number is prime

Posted on 6 October 2010 by John

The fastest ways to test whether a number is prime have some small probability of being wrong. Said another way, it’s easier to tell whether a number is “probably” prime than to tell with certainty that it’s prime. This post looks briefly at algorithms for primality testing then focuses on what exactly it means to say a number is “probably prime.”

The simplest probabilistic method for testing primes is based on Fermat’s Little Theorem. This theorem says that if p is prime then a^p-1 has a remainder of 1 when divided by p. If the remainder when dividing a^n-1 by n is anything other than 1, we know we do not have a prime. If the remainder is 1, we don’t necessarily have a prime, but the chances that the number is prime have gone up. (At this point, some statisticians are cringing at my wording. I’ll get to these objections in a minute.)

If I repeat the test for several different values of a, the probability that my number is prime keeps going up. In principle I could keep testing with new values of a until I’m sufficiently satisfied that my number is prime. That’s not quite true, however, since there are some numbers known as Carmichael numbers that will slip through the test for every value of a. The smallest Carmichael number is 561. But these numbers are sufficiently rare that you are extraordinarily unlikely to run into one by accident when testing large numbers for primality.

Primality testing involves some very interesting mathematics. I may blog about that sometime, but in this post I want to look at what it means to say a number is “probably prime.” This issue illustrates the differences between two schools of statistical interpretation.

A strict frequentist statistician would argue as follows. Suppose you have a large number N, and you run it through a probabilistic primality test. If it fails, the probability of it being prime is 0 because the number is certainly composite. If it passes, the probability that it is prime is either 0 or 1, but we just don’t know which. If it is prime, the probability that it is prime is 1. If it’s not prime, the probability is 0. It’s not very helpful to ask for the probability of a specific number being prime because there’s nothing random going on.

However, just before you write off the statistician as being worthless, he gives a further explanation. He explains that you can indeed talk about the probability of a number being prime if you rephrase your question. Suppose you pick a number at random from some range. Then you can speak of the probability that you chose a prime number. The number’s properties are not random but your choice is random. And further you can ask about the conditional probability that your selection will be prime given that it passes a primality test.

Perhaps you don’t find the argument above satisfying. You still want to say “I’ve got a number here and it passed the primality tests. Surely I can say it’s probably prime.” A Bayesian statistician may agree with you. There are varieties of Bayesians, but one school of thought says that probabilities represent uncertainty. Randomness is one cause of uncertainty, but not the only one or even the most important one. E. T. Jaynes was a major proponent of this view. He argued that much of what people say about randomness is nonsense and that we should focus on information instead. As he once said “probabilities do not describe reality, only our information about reality.”

Of course a particular number is either prime or not, but our knowledge of whether it is prime is uncertain. So the probability refers to our mental state, not to the number. (In a sense this isn’t too different from the frequentist saying the probability applies to our process of choosing numbers, not the number we drew.)

One criticism of Bayesian statistics is that Bayesian analysis begins with a prior distribution. So our Bayesian statistician would ask “What is your probability that the number will be prime before you select it?” A sensible answer would be to use the Prime Number Theorem to come up with an initial guess. For example, if you’re picking a number with 200 digits, the theorem says that the proportion of 200-digit numbers that are prime is roughly 1/(200 log 10) or about 1/460.

But what if someone chooses a prior that isn’t as clever? They’ll get a different probability that a particular number is prime! But that’s to be expected. If a probability represents uncertainty, some people will have more uncertainty than others because people know different things. What if someone secretly created your 200-digit number by multiplying two 100-digit numbers? His prior probability that the product is prime will be zero because he has no uncertainty.

Leaving aside primality for a moment, suppose I flip a fair coin. Before I flip the coin, a frequentist and a Bayesian agree that the probability that it will come up heads is 1/2. Now suppose the coin has fallen to the floor and only I can see the outcome. My probability of heads is now either 0 or 1. The Bayesian would say that his probability is still 1/2 because his information hasn’t changed. A strict frequentist would say it is now improper to assign a probability to the outcome. The coin’s status has changed from random to fixed but unknown, and one may not use probabilities to quantify uncertain knowledge.

Going back to primality testing, recall we first said 1/460 was a good prior distribution on the probability of a 200-digit number being prime. But we could improve this by saying the prior probability is 0 for even numbers and 1/230 for odd numbers since we can easily rule out half of the 200 digit numbers. Our prior probability has become more refined because it now incorporates more knowledge, namely the fact that big even numbers are not prime. But why stop there?

We could rule out multiples of 3 as well and put all the positive probability on numbers not divisible by 2 or 3. We could continue, and at each step our prior distribution would change. Some people are disturbed by the thought of the prior changing like this, but it’s perfectly consonant with the idea of a prior distribution expressing prior knowledge. The more work you do up front, the less uncertainty you have.

We could even continue this process, in theory, until we know the primality status of every 200-digit number and there’s no uncertainty left. We’d be done before we got around to conducting our probabilistic primality test. In practice we’d never finish if we tried to test every 200 digit number for primality. There would be nowhere to store the results even if we could compute them. There are fewer than 10^80 atoms in the universe, so if we stored one result per atom, we could only store a table of primes up to about 80 digits.

Setting aside statistical fine points, suppose we are now quite certain that our number is prime. Say we report that the probability of our number not being prime is less than 1 chance in 10⁵⁰. Is that really believable? For such a tiny probability, isn’t it more likely that there was a bug in our computer program that produced the probability? Or if we assume the software is perfect, we have to admit that computer hardware has errors. These errors are rare, but are they as rare as one chance in 10⁵⁰? Or for that matter, what is the probability that radiation may have caused a bit to flip somewhere in the hardware as we were computing?

Suppose we want to know the primality of a number because we’re going to use the number in an encryption algorithm. Then our ultimate concern is whether the encryption keeps our data secure. If the probability of the primality algorithm failing is orders of magnitude less than the probability of other security threats, we should focus our attention elsewhere.

Update: See How likely is it that a probable prime is prime?.

Technical report for robust prior example

Posted on 2 September 2010 by John

On Monday I wrote a post giving an illustration of robust priors. I’ve written a technical report that gives the proofs behind the statements in that post.

Asymptotic results for Normal-Cauchy model

Robust prior illustration

Posted on 30 August 2010 by John

In Bayesian statistics, prior distributions “get out of the way” as data accumulate. But some prior distributions get out of the way faster than others. The influence of robust priors decays faster when the data conflict with the prior.

Consider a single sample y from a Normal(θ, σ²) distribution. We consider two priors on θ, a conjugate Normal(0, τ²) prior and a robust Cauchy(0, 1) prior. We will look at the posterior mean of θ under each prior as y increases.

With the normal prior, the posterior mean value of θ is

y τ²/( τ² + σ²).

That is, the posterior mean is always a fixed fraction of y. If the prior variance τ² is large relative to the variance σ² of the sampling distribution, the fraction will be closer to 1, but it will always be less than 1.

With the Cauchy prior, the posterior mean is

y − O(1/y)

as y increases. (See these notes if you’re unfamiliar with “big-O” notation.) So the larger y becomes, the closer the posterior mean of θ comes to the value of the data y.

In the graph, the green line on bottom plots the posterior mean of θ with a normal prior as a function of y . The blue line on top is y. The red line in the middle is posterior mean of θ with a Cauchy prior. Note how the red line starts out close to the green line. That is, for small values of y, the posterior mean is nearly the same under the normal and Cauchy priors. But as y increases, red line approaches the blue line. The Cauchy prior has less influence as y increases.

In this graph σ = τ = 1. The results would be qualitatively the same for any values of σ and θ. If τ were larger relative to σ, the bottom line would be steeper, but the middle curve would still asymptotically approach the top line.

You can also show that with multiple samples, the posterior mean of θ converges more quickly to the empirical mean of the data when using a Cauchy prior than when using a normal prior if the mean is sufficiently large.

Update: See Asymptotic results for Normal-Cauchy model for proofs of the claims in this post.

Related post: Robust priors

Estimating the chances of something that hasn’t happened yet

Posted on 30 March 2010 by John

Suppose you’re proofreading a book. If you’ve read 20 pages and found 7 typos, you might reasonably estimate that the chances of a page having a typo are 7/20. But what if you’ve read 20 pages and found no typos. Are you willing to conclude that the chances of a page having a typo are 0/20, i.e. the book has absolutely no typos?

To take another example, suppose you are testing children for perfect pitch. You’ve tested 100 children so far and haven’t found any with perfect pitch. Do you conclude that children don’t have perfect pitch? You know that some do because you’ve heard of instances before. Your data suggest perfect pitch in children is at least rare. But how rare?

The rule of three gives a quick and dirty way to estimate these kinds of probabilities. It says that if you’ve tested N cases and haven’t found what you’re looking for, a reasonable estimate is that the probability is less than 3/N. So in our proofreading example, if you haven’t found any typos in 20 pages, you could estimate that the probability of a page having a typo is less than 15%. In the perfect pitch example, you could conclude that fewer than 3% of children have perfect pitch.

Note that the rule of three says that your probability estimate goes down in proportion to the number of cases you’ve studied. If you’d read 200 pages without finding a typo, your estimate would drop from 15% to 1.5%. But it doesn’t suddenly drop to zero. I imagine most people would harbor a suspicion that that there may be typos even though they haven’t seen any in the first few pages. But at some point they might say “I’ve read so many pages without finding any errors, there must not be any.” The situation is a little different with the perfect pitch example, however, because you may know before you start that the probability cannot be zero.

If the sight of math makes you squeamish, you might want to stop reading now. Just remember that if you haven’t seen something happen in N observations, a good estimate is that the chances of it happening are less than 3/N.

What makes the rule of three work? Suppose the probability of what you’re looking for is p. If we want a 95% confidence interval, we want to find the largest p so that the probability of no successes out of n trials is 0.05, i.e. we want to solve (1 − p)ⁿ = 0.05 for p. Taking logs of both sides, n log(1 − p) = log(0.05) ≈ -3. Since log(1 − p) is approximately −p for small values of p, we have p ≈ 3/n.

The derivation above gives the frequentist perspective. I’ll now give the Bayesian derivation of the same result. Then you can say “p is probably less than 3/N” in clear conscience since Bayesians are allowed to make such statements.

Suppose you start with a uniform prior on p. The posterior distribution on p after having seen 0 successes and N failures has a beta(1, N+1) distribution. If you calculate the posterior probability of p being less than 3/N you get an expression that approaches 1 − exp(−3) as N gets large, and 1 − exp(−3) ≈ 0.95.

Bayesian

Like Laplace, only more so

Related posts

The end of hard-edged science?

More science posts

Interview with David Spiegelhalter

A couple preprints

Occam’s razor and Bayes’ theorem

Related links

Bayesian methods at the end

Related posts

Probability that a number is prime

Related posts

Technical report for robust prior example

Robust prior illustration

Estimating the chances of something that hasn’t happened yet

More probability articles