From the category archives:

Statistics

Technical report for robust prior example

by John on September 2, 2010

On Monday I wrote a post giving an illustration of robust priors. I’ve written a technical report that gives the proofs behind the statements in that post.

Asymptotic results for Normal-Cauchy model

{ 0 comments }

Robust prior illustration

by John on August 30, 2010

In Bayesian statistics, prior distributions “get out of the way” as data accumulate. But some prior distributions get out of the way faster than others. The influence of robust priors decays faster when the data conflict with the prior.

Consider a single sample y from a Normal(θ, σ2) distribution. We consider two priors on θ, a conjugate Normal(0, τ2) prior and a robust Cauchy(0, 1) prior. We will look at the posterior mean of θ under each prior as y increases.

With the normal prior, the posterior mean value of θ is

y τ2/( τ2 + σ2).

That is, the posterior mean is always a fixed fraction of y. If the prior variance τ2 is large relative to the variance σ2 of the sampling distribution, the fraction will be closer to 1, but it will always be less than 1.

With the Cauchy prior, the posterior mean is

y – O(1/y)

as y increases. (See these notes if you’re unfamiliar with “big-O” notation.) So the larger y becomes, the closer the posterior mean of θ comes to the value of the data y.

In the graph, the green line on bottom plots the posterior mean of θ with a normal prior as a function of y . The blue line on top is y. The red line in the middle is posterior mean of θ with a Cauchy prior. Note how the red line starts out close to the green line. That is, for small values of y, the posterior mean is nearly the same under the normal and Cauchy priors. But as y increases, red line approaches the blue line. The Cauchy prior has less influence as y increases.

In this graph σ = τ = 1. The results would be qualitatively the same for any values of σ and θ. If τ were larger relative to σ, the bottom line would be steeper, but the middle curve would still asymptotically approach the top line.

You can also show that with multiple samples, the posterior mean of θ converges more quickly to the empirical mean of the data when using a Cauchy prior than when using a normal prior if the mean is sufficiently large.

Update: See Asymptotic results for Normal-Cauchy model for proofs of the claims in this post.

Related posts:

Robust priors

{ 1 comment }

What distribution does my data have?

by John on August 11, 2010

“Which distribution describes my data?” Variations on that question pop up regularly on various online forums. Sometimes the person asking the question is looking for a goodness of fit test but doesn’t know the jargon “goodness of fit.” But more often they have something else in mind. They’re thinking of some list of familiar, named distribution families — normal, gamma, Poisson, etc. — and want to know which distribution from this list best fits their data. So the real question is something like the following:

Which distribution from the well-known families of probability distributions fits my data best?

Statistics classes can give the impression that there is a short list of probability distribution families, say the list in the index of text book for that class, and that something from one of those families will always fit any data set. This impression starts to seem absurd when stated explicitly. It raises two questions.

  1. What exactly is the list of well-known distributions?
  2. Why should a distribution from this list fit your data?

As for the first question, there is some consensus as to what the well-known distributions are. The distribution families in this diagram would make a good start. But the question of which distributions are “well known” is a sociological question, not a mathematical one. There’s nothing intrinsic to a distribution that makes it well-known. For example, most statisticians would consider the Kumaraswamy distribution obscure and the beta distribution well-known, even though the two are analytically similar.

You could argue that the canonical set of distributions is somewhat natural by a chain of relations. The normal distribution is certainly natural due to the central limit theorem. The chi-squared distribution is natural because the square of a normal random variable has a chi-squared distribution. The F distribution is related to the ratio of chi-squared variables, so perhaps it ought to be included. And so on and so forth. But each link in the chain is a little weaker than the previous. Also, why this chain of relationships and not some other?

Alternatively, you could argue that the distributions that made the canon are there because they have been found useful in practice. And so they have.  But had people been interested in different problems, a somewhat different set of distributions would have been found useful.

Now on to the second question: Why should a famous distribution fit a particular data set?

Suppose a police artist asked a witness which U. S. president a criminal most closely resembled. The witness might respond

Well, she didn’t look much like any of them, but if I have to pick one, I’d pick John Adams.

The U. S. presidents form a convenient set of faces. You can find posters of their faces in many classrooms. The U. S. presidents are historically significant, but a police artist would do better to pick a different set of faces as a first pass in making a sketch.

I’m not saying it is unreasonable to want to fit a famous distribution to your data. Given two distributions that fit the data equally well, go with the more famous distribution. This is a sort of celebrity version of Occam’s razor. It’s convenient to use distributions that other people recognize. Famous distributions often have nice mathematical properties and widely available software implementations. But the list of famous distributions can form a Procrustean bed that we force our data to fit.

The extreme of Procrustean statistics is a list of well-known distributions with only one item: the normal distribution. Researchers often apply a normal distribution where it doesn’t fit at all. More dangerously, experienced statisticians can assume a normal distribution when the lack of fit isn’t obvious. If you implicitly assume a normal distribution, then any data point that doesn’t fit the distribution is an outlier. Throw out the outliers and the normal distribution fits well! Nassim Taleb calls the normal distribution the “Great Intellectual Fraud” in his book The Black Swan because people so often assume the distribution fits when it does not.

{ 23 comments }

Cosines and correlation

by John on June 17, 2010

Preview

This post will explain a connection between probability and geometry. Standard deviations for independent random variables add according to the Pythagorean theorem. Standard deviations for correlated random variables add like the law of cosines. This is because correlation is a cosine. Update: Here is a Spanish translation of this post.

Independent variables

First, let’s start with two independent random variables X and Y. Then the standard deviations of X and Y add like sides of a right triangle.

diagram

In the diagram above, “sd” stands for standard deviation, the square root of variance. The diagram is correct because the formula

Var(X+Y) = Var(X) + Var(Y)

is analogous to the Pythagorean theorem

c2 = a2 + b2.

Dependent variables

Next we drop the assumption of independence. If X and Y are correlated, the variance formula is analogous to the law of cosines.

diagram

The generalization of the previous variance formula to allow for dependent variables is

Var(X+Y) = Var(X) + Var(Y) + 2 Cov(X, Y).

Here Cov(X,Y) is the covariance of X and Y. The analogous law of cosines is

c2 = a2 + b2 – 2 a b cos(θ).

If we let a, b, and c be the standard deviations of X, Y, and X+Y respectively, then cos(θ) = -ρ where ρ is the correlation between X and Y defined by

ρ(X, Y) = Cov(X, Y) / sd(X) sd(Y).

When θ is π/2 (i.e. 90°) the random variables are independent. When θ is larger, the variables are positively correlated. When θ is smaller, the variables are negatively correlated. Said another way, as θ increases from 0 to π (i.e. 180°), the correlation increases from -1 to 1.

The analogy above is a little awkward, however, because of the minus sign. Let’s rephrase it in terms of the supplementary angle φ = π – θ. Slide the line representing the standard deviation of Y over to the left end of the horizontal line representing the standard deviation of X.

diagram

Now cos(φ) = ρ = correlation(X, Y).

When φ is small, the two line segments are pointing in nearly the same direction and the random variables are highly positively correlated. If φ is large, near π, the two line segments are pointing in nearly opposite directions and the random variables are highly negatively correlated.

Connection explained

Now let’s see the source of the connection between correlation and the law of cosines. Suppose X and Y have mean 0. Think of X and Y as members of an inner product space where the inner product <X, Y> is E(XY). Then

<X+Y, X+Y> = < X, X> + < Y, Y> + 2<X, Y >.

In an inner product space,

<X, Y > = || X || || Y || cos φ

where the norm || X || of a vector is the square root of the vector’s inner product with itself. The above equation defines the angle φ between two vectors. You could justify this definition by seeing that it agrees with ordinary plane geometry in the plane containing the three vectors X, Y, and X+Y.

Related links

Math and statistics links

{ 8 comments }

Generating Poisson random values

by John on June 14, 2010

The most direct way of generating random samples from a Poisson distribution is efficient for some parameters and inefficient for others. Wikipedia attributes the following algorithm to Donald Knuth:

    init:
         Let L ← exp(−λ), k ← 0 and p ← 1.
    do:
         k ← k + 1.
         Generate uniform random number u in [0,1] and let p ← p × u.
    while p > L.
    return k − 1.

Here’s the idea behind the algorithm. The time between arrivals in a Poisson process is exponentially distributed. Count how many arrivals there are in an interval by simulating the times between arrivals and adding them up until the time sum spills over the interval.

The problem with this approach is that the expected run time of the loop is proportional to the parameter λ. This algorithm is commonly used despite its terrible performance for large arguments.

Below is an algorithm that has expected run time independent of the argument λ. The algorithm is fairly simple, though it takes a moderate amount of theoretical justification to explain. It goes back to 1979 and so may not the most efficient algorithm available. It is definitely not the most efficient algorithm if you need to generate a large number of samples using the same parameter λ. The paper describing the algorithm recommends using Knuth’s algorithm for values of λ less than 30 and this algorithm for larger values of λ.

c = 0.767 - 3.36/lambda
beta = PI/sqrt(3.0*lambda)
alpha = beta*lambda
k = log(c) - lambda - log(beta)

forever
{
	u = random()
	x = (alpha - log((1.0 - u)/u))/beta
	n = floor(x + 0.5)
	if (n < 0)
		continue
	v = random()
	y = alpha - beta*x
	lhs = y + log(v/(1.0 + exp(y))^2)
	rhs = k + n*log(lambda) - log(n!)
	if (lhs <= rhs)
		return n
}

This is an accept-reject method based on a normal approximation. The performance improves as λ increases because the quality of the normal approximation improves. (Actually, it uses a logistic approximation to a normal approximation!) Theoretically it could be caught in an infinite loop, as with all accept-reject methods. However, the expected number of iterations is bounded. The continue statement means that if n is negative, the algorithm goes back to the top of the loop. The random() function is assumed to return a uniform random variable between 0 and 1.

A possible obstacle to turning the pseudocode above into actual code is computing log(n!). Naively computing n! and taking the log of the result will overflow for moderately large values of n. If you have a function available that computes the log of the gamma function, such as lgamma in C’s math.h, then evaluate that function at n+1. Otherwise, see How to compute log factorial.

The algorithm above is “method PA” from “The Computer Generation of Poisson Random Variables” by A. C. Atkinson, Journal of the Royal Statistical Society Series C (Applied Statistics) Vol. 28, No. 1. (1979), pages 29-35. Method PA is on page 32.

Related posts:

C# random number generation code
How to test a random number generator

{ 3 comments }

The logistic distribution looks very much like a normal distribution. Here’s a plot of the density for a logistic distribution.

This suggests we could approximate a logistic distribution by a normal distribution. (Or the other way around: sometimes it would be handy to approximate a normal distribution by a logistic. Always invert.)

But which normal distribution approximates a logistic distribution? That is, how should we pick the variance of the normal distribution?

The logistic distribution is most easily described by its distribution function (CDF):

F(x) = exp(x) / (1 + exp(x)).

To find the density (PDF), just differentiate the expression above. You can add a location and scale parameter, but we’ll keep it simple and assume the location (mean) is 0 and the scale is 1.  Since the logistic distribution is symmetric about 0, we set the mean of the normal to 0 so it is also symmetric about 0. But how should we pick the scale (standard deviation) σ for the normal?

The most natural thing to do, or at least the easiest thing to do, is to match moments: pick the variance of the normal so that both distributions have the same variance. This says we should use a normal with variance σ2 = π2/3 or σ = π/√3 . How well does that work? The following graph gives the answer. The logistic density is given by the solid blue line and the normal density is given by the dashed orange line.

Not bad, but we could do better. We could search for the value of σ that minimizes the difference between the two densities. The minimum occurs around σ = 1.6. Here is the revised graph using that value.

The maximum difference is almost three times smaller when we use σ = 1.6 rather than σ = π/√ 3 ≈ 1.8.

What if we want to minimize the difference between the distribution (CDF) functions rather than the density (PDF) functions? It turns out we end up at about the same spot: set σ to approximately 1.6. The two optimization problems don’t have exactly the same solution, but the two solutions are close.

The maximum difference between the distribution function of a logistic and the distribution of a normal with σ = 1.6 is about 0.017. If we used moment matching and set σ = π/√3, the maximum difference would be about 0.022. So moment matching does a better job of approximating the CDFs than approximating the PDFs. But we don’t need to decide between the two criteria since setting σ = 1.6 approximately minimizes both measures of the approximation.

Related posts:

Cosine approximation to the normal density
Always invert

{ 7 comments }

Statistical autopsy

by John on May 13, 2010

From R. A. Fisher, 1938:

To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.

{ 3 comments }

Here’s a simple approximation to the normal distribution I just ran across. The density function is

f(x) = (1 + cos(x))/2π

over the interval (-π, π). The plot below graphs this density with a solid blue line. For comparison, the density of a normal distribution with the same variance is plotted with a dashed orange line.

PDF plots

The approximation is good enough to use for teaching. Students may benefit from doing an exercise twice, once with this approximation and then again with the normal distribution. Having an approximation they can integrate in closed form may help take some of the mystery out of the normal distribution.

The approximation may have practical uses. The agreement between the PDFs isn’t great. However, the agreement between the CDFs (which is more important) is surprisingly good. The maximum difference between the two CDFs is only 0.018. (The differences between the PDFs oscillate, and so their integrals, the CDFs, are closer together.)

I ran across this approximation here. It goes back to the 1961 paper “A cosine approximation to the normal distribution” by D. H. Raab and E. H. Green, Psychometrika, Volume 26, pages 447-450.

Update 1: See the paper referenced in the first comment. It gives a much more accurate approximation using a logistic function. The cosine approximation is a little simpler and may be better for teaching. However, the logistic approximation has infinite support. That could be an advantage since students might be distracted by the finite support of the cosine approximation.

The logistic approximation for the standard normal CDF is

F(x) = 1/(1 + exp(-0.07056 x3 – 1.5976 x))

and has a maximum error of 0.00014 at x = ± 3.16.

Update 2:

How might you use this approximation the other way around, approximating a cosine by a normal density? See Always invert.

Related posts:

Rolling dice for normal samples
Sums of uniform random variables

{ 7 comments }

Did automobile accidents decrease?

by John on April 22, 2010

Here’s a homework problem from a class I taught:

… In past years, the average number of accidents per year was 15, and this year it was 10. Is it justified to claim that the accident rate has dropped?

The naive answer is of course the rate has dropped. Ten is less than fifteen. This reminds me of a joke attributed to Abraham Lincoln:

Q: If you call a tail a leg, how many legs does a dog have?

A: Four. Calling a tail a leg doesn’t make it one.

But the homework problem isn’t asking whether ten is less than fifteen. Part of the purpose of the exercise is to state the problem well. The real question is whether there is good evidence that the fundamental causes of automobile accidents has changed for the better or whether there is a fair chance that a random fluctuation caused the decrease. It turns out the latter is the case given the model (Poisson) that the question suggests using.

Think of this example next time you hear politicians say that some measure improved during their administration: economic growth, employment, crime rates, etc. The basic question is whether in fact the measure changed. The next question is whether the change was more likely a coincidence or a genuine improvement. And if there was a real improvement, ask whether the politician deserves any credit.

(The homework exercise came from Statistical Inference, problem 8.2.)

{ 7 comments }

Routine statistics

by John on April 21, 2010

From David Cox:

There are no routine statistical questions, only questionable statistical routines.

{ 0 comments }

Suppose you’re proofreading a book. If you’ve read 20 pages and found 7 typos, you might reasonably estimate that the chances of a page having a typo are 7/20. But what if you’ve read 20 pages and found no typos. Are you willing to conclude that the chances of a page having a typo are 0/20, i.e. the book has absolutely no typos?

To take another example, suppose you are testing children for perfect pitch. You’ve tested 100 children so far and haven’t found any with perfect pitch. Do you conclude that children don’t have perfect pitch? You know that some do because you’ve heard of instances before. But your data suggest perfect pitch in children is at least rare. But how rare?

The rule of three gives a quick and dirty way to estimate these kinds of probabilities. It says that if you’ve tested N cases and haven’t found what you’re looking for, a reasonable estimate is that the probability is less than 3/N. So in our proofreading example, if you haven’t found any typos in 20 pages, you could estimate that the probability of a page having a typo is less than 15%. In the perfect pitch example, you could conclude that fewer than 3% of children have perfect pitch.

Note that the rule of three says that your probability estimate goes down in proportion to the number of cases you’ve studied. If you’d read 200 pages without finding a typo, your estimate would drop from 15% to 1.5%. But it doesn’t suddenly drop to zero. I imagine most people would harbor a suspicion that that there may be typos even though they haven’t seen any in the first few pages. But at some point they might say “I’ve read so many pages without finding any errors, there must not be any.” The situation is a little different with the perfect pitch example, however, because you may know before you start that the probability cannot be zero.

If the sight of math makes you squeamish, you might want to stop reading now. Just remember that if you haven’t seen something happen in N observations, a good estimate is that the chances of it happening are less than 3/N.

What makes the rule of three work? Suppose the probability of what you’re looking for is p. If we want a 95% confidence interval, we want to find the largest p so that the probability of no successes out of n trials is 0.05, i.e. we want to solve (1-p)n = 0.05 for p. Taking logs of both sides, n log(1-p) = log(0.05) ≈ -3. Since log(1-p) is approximately -p for small values of p, we have p ≈ 3/n.

The derivation above gives the frequentist perspective. I’ll now give the Bayesian derivation of the same result. Then you can say “p is probably less than 3/N” in clear conscience since Bayesians are allowed to make such statements.

Suppose you start with a uniform prior on p. The posterior distribution on p after having seen 0 successes and N failures has a beta(1, N+1) distribution. If you calculate the posterior probability of p being less than 3/N you get an expression that approaches 1 – exp(-3) as N gets large, and 1 – exp(-3) ≈ 0.95.

Update: Italian translation of this post.

Related posts:

What is a confidence interval?
Probability mistake can give a good approximation
Four reasons to use Bayesian inference

{ 14 comments }

Does gaining weight make you taller?

by John on March 12, 2010

In his autobiography, The Pleasures of Statistics, Frederick Mosteller gives an amusing example of why observational studies are no substitute for doing experiments.

We are all familiar with the idea that we can estimate height in male adults from their weight. … But not one of us believes that adding 20 pounds by eating and minimizing exercise will add an inch to our height.

The problem is not simply that the direction of causality backward, it’s that we cannot use a static description to predict what will happen if we change something.

Although regression situations may give one the illusion of finding out what would happen if we changed something, in the absence of an experiment they offer merely offer guesses.

He summarizes his point by quoting George Box:

To find out what happens to a system when you interfere with it, you have to interfere with it (and not just passively observe it).

Remember this next time you hear claims such as every dollar spent on X saves so many dollars spent on Y. Or every minute spent exercising increases your life expectancy by so many minutes. Or every time you do some activity you increase or decrease your risk of cancer by so much. First of all, these kinds of statements are linear extrapolations on situations that are not linear. Second, they may be observations that do not describe what will happen when you change something. They may be no more true than the idea that gaining weight makes you taller.

Here’s an example of how observation and intervention differ. Lottery winners often go bankrupt within a couple years of receiving their prize. If you suddenly make someone a millionaire, they’re not a typical millionaire.

Related posts:

Numerator-only data
Randomized trials of parachute use

{ 4 comments }

Numerator-only data

by John on March 11, 2010

I learned a useful new phrase today: numerator-only data. This is data without anything to compare it to, no denominator. I ran across the term in Frederick Mosteller’s autobiography. He illustrates the problem with the following old joke.

“Why do the white horses eat more than the black horses?”
“Don’t know. Why?”
“Because we have ten times as many white horses and black horses.”

Numerator-only data is data that leaves you asking “compared to what?” If I tell you the NASDAQ stock index closed at 2368 today, is that good or bad? The number by itself means nothing. Is that up or down compared to last week? Last year? If I tell you, for example, that the record high value was 5047, that gives you a denominator to compare it to.

{ 6 comments }

Does lightning prefer metal or wood?

by John on March 5, 2010

The video below features a demonstration that lightning is as likely to strike wood as metal.

I want to focus on one line from the video. After showing simulated lightning strikes that hit a wooden rod five times and a copper rod five times, the narrator says

It’s five all, proof that metal does not attract lightning.

No, such an experiment would prove no such thing. I imagine the researchers conducted a much larger experiment and selected a representative sample. And I’m willing to accept their conclusion that metal does not attract lightning. But I would not accept such a conclusion from an experiment with 10 samples. What the experiment proves is that, under their experimental conditions, lightning will sometimes strike wood even a metal rod is nearby.

I have two complementary criticisms of this made-for-video science.

  1. The results could easily happen if their conclusion were not true.
  2. The results could easily not have happened if there conclusion were true.

Suppose in reality, lightning will not always strike the metal rod, but will prefer the metal. Suppose in the long run, lightning will strike the metal rod 60% of the time. It would not be unusual in that case to do an experiment with 10 strikes and find that half or more of the strikes hit wood.

Now suppose the researchers are exactly correct. In the long run, lightning has no preference for one rod or the other. What would viewers have thought if they showed a clip of 10 strikes, of which 6 hit metal and 4 hit wood? Many would have howled in protest. If lightning really had no preference for metal, the result should have been an even split, right? This is an example of the Law of Small Numbers. People underestimate the variability of small samples.

If the probability of lightning striking each rod is 50%, then in a sequence of experiments each containing 10 strikes, most will not have an exact 5-5 split. If you flip 10 fair coins, the most likely outcome is a 5-5 split, but this will happen only about 1/4 of the time. It’s more likely that you’ll get near a 5-5 split, sometimes with more heads and sometimes with more tails.

The exact 5-5 split in the video is good showmanship, but it’s misleading science.

Related posts:

Law of small numbers
Example of the law of small numbers
Law of medium numbers

{ 2 comments }

p-values are inconsistent

by John on March 3, 2010

If there’s evidence that an animal is a bear, you’d think there’s even more evidence that it’s a mammal. It turns out that p-values fail this common sense criterion as a measure of evidence.

I just ran across a paper of Mark Schervish1 that contains a criticism of p-values I had not seen before. p-values are commonly used as measures of evidence despite the protests of many statisticians. It seems reasonable that a measure of evidence would have the following property. If a hypothesis H implies another hypothesis H’, then evidence in favor of H’ should be at least as great as evidence in favor of H.

Here’s one of the examples from Schervish’s paper. Suppose data come from a normal distribution with variance 1 and unknown mean μ. Let H be the hypothesis that μ is contained in the interval (-0.5, 0.5). Let H’ be the hypothesis that μ is contained in the interval (-0.82, 0.52). Then suppose you observe x = 2.18. The p-value for H is 0.0502 and the p-value for H’ is 0.0498. This says there is more evidence to support the hypothesis H that μ is in the smaller interval than there is to support the hypothesis H’ that μ is in the larger interval. If we adopt α = 0.05 as the cutoff for significance, we would reject the hypothesis that -0.82 < μ < 0.52 but accept the hypothesis that -0.5 < μ < 0.5. We’re willing to accept that we’ve found a bear, but doubtful that we’ve found a mammal.

1 Mark J. Schervish. “P values: What They Are and What They Are Not.” The American Statistician, August 1996, Vol. 50, No. 3.

Update: I added the details of the p-value calculation here.

Related posts:

How loud is the evidence
The cult of significance testing
Most published research results are false

{ 20 comments }