From the category archives:

Statistics

Elements of Statistical Learning

by John on October 14, 2009

The authors of the classic The Elements of Statistical Learning have made their book available for download as a PDF.


The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.

Thanks to Gregor Gorjanc for the tip.

{ 3 comments }

Achievement is not normal

by John on September 29, 2009

Angela Duckworth gave a 90-second talk entitled Why Achievement Isn’t Normal.

She’s using the term “normal” in the sense of the normal (Gaussian) distribution, the bell curve. With normally distributed attributes, such as height, most people are near the middle and very few are far from the middle. Also, the distribution is symmetric: as many people are likely to be above the middle as below.

Achievement is not like that in many fields. The highest achievers achieve far more than average. The best programmers may be 100 times more productive than average programmers. The wealthiest people have orders of magnitude more wealth than average. Best selling authors far outsell average authors.

Angela Duckworth says achievement is not normal, it’s log-normal. The log-normal distribution is skewed to the right. It has a long tail, meaning that values far from the mean are fairly common. The idea of using a long-tailed distribution makes sense, but I don’t understand the justification for the log-normal distribution in particular given in the video. This is not to disparage the speaker. No one can give a detailed derivation of a statistical distribution in 90 seconds. I’ll give a plausibility argument below. If you’re not interested in the math, just scroll down to the graph at the bottom.

The factors that contribute to achievement are often multiplicative. That is, advantages multiply rather than add. If your first book is a success, more people will give your second book a chance. Your readership doesn’t simply add, as if each book were written by a different person. Instead, your audience compounds. Web sites with more inbound links get a higher search engine rank. More people find these sites because of their ranking, and so more people link to them, and the ranking goes up. Skills like communication and organization don’t just contribute additively as they would on a report card; they are multipliers that amplify your effectiveness in other areas.

The log-normal distribution has two parameters: μ and σ. These look like the mean and standard deviation parameters, but they are not the mean and standard deviation of the log-normal. If X is a log-normal(μ, σ) random variable, then log(X) has a normal(μ, σ) distribution. The parameters μ and σ are not the mean and standard deviation of X but of log(X).
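A quick simulation makes the distinction concrete. The sketch below uses only the Python standard library; the values μ = 1 and σ = 0.5 are chosen for illustration and are not from the discussion above. It compares the sample mean and standard deviation of X and of log(X) with the standard closed-form moments of the log-normal.

```python
import math
import random
import statistics

def lognormal_mean(mu, sigma):
    """Mean of X when log(X) ~ normal(mu, sigma)."""
    return math.exp(mu + sigma**2 / 2)

def lognormal_sd(mu, sigma):
    """Standard deviation of X when log(X) ~ normal(mu, sigma)."""
    return math.sqrt((math.exp(sigma**2) - 1) * math.exp(2 * mu + sigma**2))

# Illustrative parameter values (an assumption, not taken from the post).
mu, sigma = 1.0, 0.5
rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [rng.lognormvariate(mu, sigma) for _ in range(200_000)]
logs = [math.log(x) for x in xs]

print("mean, sd of log(X):", statistics.fmean(logs), statistics.stdev(logs))
print("mean, sd of X:     ", statistics.fmean(xs), statistics.stdev(xs))
print("formula for X:     ", lognormal_mean(mu, sigma), lognormal_sd(mu, sigma))
```

The first line of output should land close to (μ, σ), while the second should land close to the third and well away from (μ, σ).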

The product of two independent log-normal random variables is log-normal because the sum of two independent normal random variables is normal. So if the contributions to achievement are multiplicative, log-normal distributions will be convenient to model achievement.

I said earlier that log-normal distributions are skewed. I’ve got something of a circular argument if I start with the assumption that the factors that contribute to achievement are skewed and then conclude that achievement is skewed. But log-normal distributions have varying degrees of skewness. When σ is small, the distribution is approximately normal. So you could start with individual factors that have a nearly normal distribution, modeled by a log-normal distribution. Then you can show that as you multiply these together, you get a distribution more skewed than its inputs.

Suppose you have n independent random variables that each have a log-normal(1, σ) distribution. Their product will have a log-normal(n, √n σ) distribution. As n increases, the distribution of the product becomes more skewed. Here is an example. The following graph shows the density of a log-normal(1, 0.2) distribution.

plot of log-normal(1, 0.2) density

Here is the distribution of the product of nine independent copies of the above distribution, a log-normal(9, 0.6) distribution.

plot of log-normal(9, 0.6) density

So even though the original distribution is nearly symmetric and concentrated near the middle, the product of nine independent copies has a long tail to the right.
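The growth in skewness can also be put in numbers. The sketch below is not part of the original argument; it uses the standard formula for the skewness of a log-normal distribution, (exp(σ²) + 2)·√(exp(σ²) − 1), which depends only on σ. The function names are made up for illustration.

```python
import math

def product_params(n, mu, sigma):
    """Parameters of the product of n independent log-normal(mu, sigma) variables."""
    return n * mu, math.sqrt(n) * sigma

def lognormal_skewness(sigma):
    """Skewness of a log-normal distribution; it depends only on sigma."""
    w = math.exp(sigma**2)
    return (w + 2) * math.sqrt(w - 1)

mu_n, sigma_n = product_params(9, 1.0, 0.2)
print(mu_n, sigma_n)                 # parameters of the product of nine copies
print(lognormal_skewness(0.2))       # skewness of a single factor
print(lognormal_skewness(sigma_n))   # skewness of the product
```

A normal distribution has skewness 0, so the single factor is mildly skewed and the product is several times more so.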

Related posts:

Small advantages show up in the extremes
Variation in male and female Olympic performance: Part 1, Part 2
Evaluate people at their best or at their worst?

{ 5 comments }

The negative binomial distribution is interesting because it illustrates a common progression of statistical thinking. My aim here is to tell a story, not to give details; the details are available here. The following gives a progression of three perspectives.

First view: Counting

The origin of the negative binomial is very concrete. It is unfortunate that the name makes the distribution seem more abstract than it is. (What could possibly be negative about a binomial distribution? Sounds like abstract nonsense.)

Suppose you have decided to practice basketball free throws. You’ve decided to practice until you have made 20 free throws. If your probability of making a single free throw is p, how many shots will you have to attempt before you make your goal of 20 successes? Obviously you’ll need at least 20 attempts, but you might need a lot more. What is the expected number of attempts you would need? What’s the probability that you’ll need more than 50 attempts? These questions could be answered by using a negative binomial distribution. A negative binomial probability distribution with parameters r and p gives the probabilities of various numbers of failures before the rth success when each attempt has probability of success p.
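Here is a minimal sketch of how those two questions could be answered with the counting form of the distribution. The shooting percentage p = 0.5 is an assumption for the sake of the example, and the function names are hypothetical.

```python
from math import comb

def nb_pmf(k, r, p):
    """P(exactly k failures before the r-th success), for integer r."""
    return comb(k + r - 1, k) * p**r * (1 - p)**k

def expected_attempts(r, p):
    """Expected total attempts: r successes plus r(1 - p)/p expected failures."""
    return r + r * (1 - p) / p

def prob_more_than(attempts, r, p):
    """P(more than `attempts` shots are needed to reach r successes)."""
    max_failures = attempts - r
    return 1 - sum(nb_pmf(k, r, p) for k in range(max_failures + 1))

r, p = 20, 0.5   # p = 0.5 is an assumed shooting percentage
print(expected_attempts(r, p))
print(prob_more_than(50, r, p))
```

Needing more than 50 attempts is the same event as making fewer than 20 of the first 50 shots, which gives an independent check on the second number using the ordinary binomial distribution.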

Second view: Cheap generalization

After writing down the probability mass function for the negative binomial distribution as described above, somebody noticed that the number r didn’t necessarily have to be an integer. The distribution was motivated by integer values of r, counting the number of failures before the rth success, but the resulting formula makes sense even when r is not an integer. It doesn’t make sense to wait for 2.87 successes; you can’t interpret the formula as counting events unless r is an integer, but the formula is still mathematically valid.

The probability mass function involves a binomial coefficient. These coefficients were first developed for integer arguments but later extended to real and even complex arguments. See these notes for definitions and these notes for how to calculate the general coefficients. The probability mass function can be written most compactly when one of the binomial coefficients has a negative argument. See page two of these notes for an explanation. There’s no intuitive explanation of the negative argument. It’s just a consequence of some algebra.
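One common way to evaluate the generalized coefficient is through the gamma function, replacing C(k + r − 1, k) with Γ(k + r) / (k! Γ(r)). The sketch below takes that route; it is one possible implementation, not necessarily the one in the linked notes, and the values r = 2.87 and p = 0.4 are only for illustration.

```python
from math import comb, exp, lgamma, log

def nb_pmf_general(k, r, p):
    """Negative binomial pmf at k = 0, 1, 2, ... for any real r > 0 and 0 < p < 1.

    The coefficient Gamma(k + r) / (k! Gamma(r)) replaces C(k + r - 1, k).
    Working with logarithms avoids overflow for large k.
    """
    log_coeff = lgamma(k + r) - lgamma(k + 1) - lgamma(r)
    return exp(log_coeff + r * log(p) + k * log(1 - p))

# For integer r the general formula matches the counting formula.
print(nb_pmf_general(3, 5, 0.4), comb(3 + 5 - 1, 3) * 0.4**5 * 0.6**3)

# For non-integer r the probabilities still sum to 1 (up to rounding).
print(sum(nb_pmf_general(k, 2.87, 0.4) for k in range(2000)))
```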

What’s the point in using non-integer values of r? Just because we can? No, there are practical reasons, and that leads to our third view.

Third view: Modeling overdispersion

Next we take the distribution above and forget where it came from. It was motivated by counting successes and failures, but now we forget about that and imagine the distribution falling from the sky in its general form described above. What properties does it have?

The negative binomial distribution turns out to have a very useful property. It can be seen as a generalization of the Poisson distribution. (See this distribution chart. Click on the dashed arrow between the negative binomial and Poisson boxes.)

The Poisson is the simplest distribution for modeling count data. It is in some sense a very natural distribution and it has nice theoretical properties. However, the Poisson distribution has one severe limitation: its variance is equal to its mean. There is no way to increase the variance without increasing the mean. Unfortunately, in many data sets the variance is larger than the mean. That’s where the negative binomial comes in. When modeling count data, first try the simplest thing that might work, the Poisson. If that doesn’t work, try the next simplest thing, negative binomial.

When viewing the negative binomial this way, a generalization of the Poisson, it helps to use a new parameterization. The parameters r and p are no longer directly important. For example, if we have empirical data with mean 20.1 and variance 34.7, we would naturally be interested in the negative binomial distribution with this mean and variance. We would like a parametrization that reflects more directly the mean and variance and one that makes the connection with the Poisson more transparent. That is indeed possible, and is described in these notes.
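For the example above, r and p can be recovered from the mean and variance by the method of moments. The sketch assumes the failures-before-the-rth-success form, whose mean is r(1 − p)/p and variance is r(1 − p)/p²; the parameterization in the linked notes may be arranged differently.

```python
def nb_from_moments(mean, variance):
    """Solve mean = r(1-p)/p and variance = r(1-p)/p**2 for (r, p)."""
    if variance <= mean:
        raise ValueError("negative binomial requires variance > mean")
    p = mean / variance
    r = mean**2 / (variance - mean)
    return r, p

def nb_moments(r, p):
    """Mean and variance of the number of failures before the r-th success."""
    return r * (1 - p) / p, r * (1 - p) / p**2

r, p = nb_from_moments(20.1, 34.7)
print(r, p)               # r is not an integer here
print(nb_moments(r, p))   # recovers the mean and variance we started from
```

The fitted r is not an integer, which is exactly why the non-integer generalization is needed. As the variance shrinks toward the mean, r grows without bound, which is the sense in which the Poisson is a limiting case.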

Update: Here’s a new post giving a fourth view of the negative binomial distribution — a continuous mixture of Poisson distributions. This view explains why the negative binomial is related to the Poisson and yet has greater variance.

Related links:

Notes on the negative binomial distribution
General binomial coefficients
Diagram of distribution relationships
Upper and lower bounds on binomial coefficients

{ 4 comments }

Make up your own rules of probability

by John on September 18, 2009

Keith Baggerly and Kevin Coombes just wrote a paper about the analysis errors they commonly see in bioinformatics articles. From the abstract:

One theme that emerges is that the most common errors are simple (e.g. row or column offsets); conversely, it is our experience that the most simple errors are common.

The full title of the article by Keith Baggerly and Kevin Coombes is “Deriving chemosensitivity from cell lines: forensic bioinformatics and reproducible research in high-throughput biology.” The article will appear in the next issue of Annals of Applied Statistics and is available here. The key phrase in the title is forensic bioinformatics: reverse engineering statistical analysis of bioinformatics data. The authors give five case studies of data analyses that cannot be reproduced and infer what analysis actually was carried out.

One of the more egregious errors came from the creative application of probability. One paper uses innovative probability results such as

P(ABCD) = P(A) + P(B) + P(C) + P(D) – P(A) P(B) P(C) P(D)

and

P(AB) = max( P(A), P(B) ).

Baggerly and Coombes were remarkably understated in their criticism: “None of these rules are standard.” In less diplomatic language, the rules are wrong.
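A two-coin example is enough to see that the second rule fails. The sketch reads P(AB) as the probability that A and B both occur, which is an assumption about the notation; under that reading, P(AB) can never exceed the smaller of P(A) and P(B), let alone equal the larger in general.

```python
from fractions import Fraction
from itertools import product

# Sample space: two fair coin flips, all four outcomes equally likely.
outcomes = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

A = lambda o: o[0] == "H"           # first flip is heads
B = lambda o: o[1] == "H"           # second flip is heads
both = lambda o: A(o) and B(o)      # both flips are heads

print(prob(both))                   # 1/4
print(max(prob(A), prob(B)))        # 1/2
```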

To be fair, Baggerly and Coombes point out

These rules are not explicitly stated in the methods; we inferred them either from formulae embedded in Excel files … or from exploratory data analysis …

So, the authors didn’t state false theorems; they just used them. And nobody would have noticed if Baggerly and Coombes had not tried to reproduce their results.

Related posts:

Irreproducible analysis
Highlights from Reproducible Ideas
Reproducible Ideas blog winding down

{ 6 comments }

Smoking

by John on September 11, 2009

Seth Godin has a blog post this morning in which he says

Smoking a pack a day for twenty years is a great way to be sure you’ll die early.

The point of his post was not the dangers of smoking. His point was that “What we do in the long run, over time, drip by drip” matters more than what we do sporadically, and I certainly agree. But I disagree with Seth’s comment on smoking.

Smoking certainly cuts your life short on average. But smoking is like playing Russian roulette: Most of the time, you’re OK. Most smokers do not get lung cancer. Smoking does not ensure that you’ll die early. And that may be why smokers ignore warnings. They can point to plenty of fellow smokers who were not killed by smoking. For example, if I wanted to smoke I could point out that my parents smoked and did not die of smoking-related causes. (Another smoker in my family, however, did die of lung cancer.)

People are most strongly motivated by consequences that are immediate and certain. Given a choice between the certain pleasure of enjoying a cigarette now versus a risk of lung cancer years from now, smokers choose the former.

It’s not very effective to tell someone, especially someone young, that if they smoke they will get lung cancer. For one thing, it’s not true: they probably will not get lung cancer. But they do increase their chances of cancer, and even more so their chances of emphysema, heart disease, etc. Still, those are probabilities of future events. Teenagers may be more motivated by the thought of their fingernails turning yellow or their clothes stinking.

Update: I want to be clear that I’m not defending smoking. I couldn’t wait to move out of the smoke-filled house I grew up in. Nor am I trying to down-play the health risks of smoking. The harmful effects are extraordinarily well established. As Fletcher Knebel said back in 1961, smoking is the leading cause of statistics. Half a century later we’re still spending money on studies to confirm what we already know.

Related posts:

Cartoon guide to cancer research
Nearly everyone is above average

{ 8 comments }

The IOT test

by John on August 31, 2009

In his book Flaw of Averages, Sam Savage describes the IOT test of statistical significance.

Joe Berkson, a statistician at the Mayo Clinic, developed his own criterion, which he termed the IOT Test, or Inter Ocular Trauma Test, requiring a graph that hit you between the eyes.

{ 1 comment }

Physical explanation of median

by John on August 12, 2009

This post from Statpics shows how to visualize the median of a set of numbers via a necklace draped over a pulley.

{ 0 comments }

R Q&A

by John on July 23, 2009

There is an organized effort to promote the StackOverflow site for questions and answers around the R programming language. It’s working: the amount of R activity on StackOverflow has greatly increased lately.

If you’re familiar with StackOverflow but not R, you might want to take a look at the R Project web site and these notes about the R language.

If you’re familiar with R but not StackOverflow, allow me to introduce you. StackOverflow is a web site for questions and answers related to programming. The site is open to all programming languages and environments, but it’s pretty strict about sticking to programming questions. (StackOverflow has two sister sites for other computing questions: ServerFault for system administration and IT issues, and Superuser for almost anything else related to computing.)

I’d like to see the R community take advantage of StackOverflow’s platform. According to Metcalfe’s law, the value of a network is proportional to the square of the number of users in the network. As more people go to StackOverflow for R Q&A, everyone gets better information and faster responses.

Related posts:

Five kinds of subscripts in R
The R book I wish someone would write
Civic duty on StackOverflow

{ 4 comments }

Incredibly simple approximation

by John on June 29, 2009

Suppose you need to find the slope of a line through a set of data. You can get a surprisingly good approximation by simply fitting a line to the first and last points. This is known as “Bancroft’s rule.” It seems too good to be true. Of course just fitting a line to just two points is not as good as using all the data, but unless you have a fairly large amount of data, it’s not too much worse either. It’s good enough for a quick estimate.

Just how good is this estimate compared to using all the data? We’ll look at the technical details of an example below.

Suppose you have a regression model y = α + βx + ε where ε is random noise. Suppose ε is normally distributed with mean 0 and variance σ². Let b be the least squares estimator of β. The variance in b is

\frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}

Now suppose we have observations y_i corresponding to x_i = 0, 1, 2, …, 2n. The average value of x is n, and the denominator in the expression for the variance of the slope estimator is 2(1² + 2² + 3² + … + n²) = (2n³ + 3n² + n)/3. If we just use the data at x = 0 and x = 2n, the denominator is (0 – n)² + (2n – n)² = 2n².

If we divide the estimator variance based on Bancroft’s rule by the estimator variance using all the data, the σ² terms cancel and we are left with n/3 + 1/2 + 1/(6n). So Bancroft’s rule increases the variance in the estimate for the slope by a factor of roughly n/3 compared to using all the data. Thus it widens the confidence interval by a factor of roughly the square root of n/3. So if n = 12, that is, if you had 25 data points, the confidence interval would be about twice as wide. Said another way, the estimate based on all the data is only twice as good as the estimate based on just the first and last points.
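The ratio can be checked exactly by computing the two denominators directly. This is a small sketch with made-up function names, using rational arithmetic so that no rounding enters the comparison.

```python
from fractions import Fraction

def sum_sq_dev(xs):
    """Sum of squared deviations of xs from their mean (the variance denominator)."""
    mean = Fraction(sum(xs), len(xs))
    return sum((x - mean) ** 2 for x in xs)

def variance_ratio(n):
    """Var(two-point slope) / Var(least squares slope) for x = 0, 1, ..., 2n."""
    all_points = range(2 * n + 1)
    end_points = [0, 2 * n]
    return sum_sq_dev(all_points) / sum_sq_dev(end_points)

# Columns: n, number of data points, variance ratio, square root of the ratio.
for n in (3, 12, 50):
    ratio = variance_ratio(n)
    print(n, 2 * n + 1, float(ratio), float(ratio) ** 0.5)
```

The last column is the factor by which the confidence interval for the slope widens when only the first and last points are used.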

Related posts:

Probability mistake can give a good approximation
Rolling dice for normal samples
Approximate problems and approximate solutions
Comparing two ways to fit a line to data

{ 2 comments }

John Tukey’s median of medians

by John on June 23, 2009

Yesterday I got an email from Jestin Abraham asking a question about Tukey’s “median of medians” paper from 1978. (The full title is “The Ninther, a Technique for Low-Effort Robust (Resistant) Location in Large Samples.”) Jestin thought I might be familiar with the paper since I’ve written about Tukey several times, but I’d never heard of it.

Tukey’s “ninther” or “median of medians” procedure is quite simple. Understanding the problem he was trying to solve is a little more difficult.

Suppose you are given nine data points: y1, y2, …, y9. Let yA be the median of the first three samples, yB the median of the next three samples, and yC the median of the last three samples. The “ninther” of the data set is the median of yA, yB, and yC, hence the “median of medians.” If the data were sorted, the ninther would simply be the median, but in general it will not be.

For example, suppose your data are 3, 1, 4, 4, 5, 9, 9, 8, 2.  Then

yA = median( 3, 1, 4 ) = 3
yB = median( 4, 5, 9 ) = 5
yC = median( 9, 8, 2 ) = 8

and so the ninther is median( 3, 5, 8 ) = 5. The median is 4.
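The procedure is short enough to write out in full. The following is my own sketch rather than code from Tukey’s paper; `median3` uses only order comparisons, two or three per call.

```python
from statistics import median

def median3(a, b, c):
    """Median of three values using only order comparisons."""
    if a > b:
        a, b = b, a          # now a <= b
    if b > c:
        b = c                # c is below the larger of the pair
        if a > b:
            b = a            # the median is the larger of a and c
    return b

def ninther(ys):
    """Tukey's ninther: the median of the medians of three consecutive triples."""
    if len(ys) != 9:
        raise ValueError("the ninther is defined for exactly nine points")
    return median3(median3(*ys[0:3]), median3(*ys[3:6]), median3(*ys[6:9]))

data = [3, 1, 4, 4, 5, 9, 9, 8, 2]
print(ninther(data))   # 5
print(median(data))    # 4
```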

That’s Tukey’s solution, so what was his problem? First of all, he’s trying to find an estimate for the central value of a large data set. Assume the data come from a symmetric distribution so that the mean equals the median. He’s looking for a robust estimator of the mean, an estimator resistant to the influence of outliers. That’s why he’s using an estimator that is more like the median than the mean.

Why not just use the median? Computing the sample median requires storing all data points and then sorting them to pick the middle value. Tukey wants to do his computation in one pass without storing the data. Also, he wants to do as few comparisons and as few arithmetic operations as possible. His ninther procedure uses no arithmetic operations and only order comparisons. He shows that it uses only about 1.1 comparisons per data point on average and 1.33 comparisons per data point in the worst case.

How well does Tukey’s ninther perform? He shows that if the data come from a normal distribution, the ninther has about 55% efficiency relative to the sample mean. That is, the variances of his estimates are a little less than twice the variances of estimates using the sample mean. But the purpose of robust statistics is efficient estimation in case the data do not come from a normal distribution but from a distribution with thicker tails. The relative efficiency of the ninther improves when data do come from distributions with thicker tails.

Where do large data sets come in? So far we’ve only talked about analyzing data sets with nine points. Tukey’s idea was to use the ninther in conjunction with the median. For some large number M, you could estimate the central value of 9M data points by applying the ninther to groups of 9 points and take the median of the M ninthers. This still requires computing the median of M points, but the memory requirement has been reduced by a factor of 9. Also, the sorting time has been reduced by more than a factor of 9 since sorting n points takes time proportional to n log n.
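A sketch of this scheme, under a couple of simplifying assumptions of my own: the data arrive as an iterable, and a trailing block of fewer than nine points is simply dropped. Only the M ninthers are ever held in memory.

```python
import random
from itertools import islice
from statistics import median

def median3(a, b, c):
    """Median of three values."""
    return sorted((a, b, c))[1]

def ninther(block):
    """Median of the medians of the three consecutive triples in a block of 9."""
    return median3(*(median3(*block[i:i + 3]) for i in (0, 3, 6)))

def median_of_ninthers(stream):
    """Estimate the center of a stream, storing one value per nine points read.

    A trailing partial block of fewer than nine points is ignored, a
    simplification made for this sketch. At least nine points are required.
    """
    it = iter(stream)
    ninthers = []
    while True:
        block = list(islice(it, 9))
        if len(block) < 9:
            break
        ninthers.append(ninther(block))
    return median(ninthers)

# Simulated data: 9 * 1001 draws from a normal distribution centered at 10.
rng = random.Random(1)
sample = (rng.gauss(10, 2) for _ in range(9 * 1001))
print(median_of_ninthers(sample))
```

Here `median3` sorts its three arguments for brevity; a version that counts comparisons carefully would use explicit branching instead.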

For even larger data sets, Tukey recommended breaking the data into sets of 81 points and computing the ninther of the ninthers. Then 81M data points could be processed by storing and sorting M values.

Tukey gave M = 1,000,000 as an example of what he called an “impractically large data set.” I suppose finding the median of 81 million data points was impractical in 1978, though it’s a trivial problem today. Perhaps Tukey’s ninther is still useful for an embedded device with extremely limited resources that must process enormous amounts of data.

Other posts on robust statistics:

Canonical examples from robust statistics
Efficiency of median versus mean

Other posts on John Tukey:

Approximate problems and approximate solutions
John Tukey and Aristotle
Tukey tallying
When discoveries stay discovered

{ 4 comments }

Statistical lexicon

by John on May 23, 2009

See Andrew Gelman’s post Handy statistical lexicon for a list of useful aphorisms. Here’s one of my favorites.

Pinch-Hitter Syndrome: People whose job it is to do just one thing are not always so good at that one thing.

{ 0 comments }

R package for robust priors

by John on May 11, 2009

Jairo Fuquene has released an R package on CRAN to accompany our paper

A Case for Robust Bayesian priors with Applications to Binary Clinical Trials
Jairo A. Fuquene P., John D. Cook, Luis Raul Pericchi

{ 2 comments }

Classical statistics in a nutshell

by John on May 4, 2009

Here’s another quote from Anthony O’Hagan’s book Bayesian Inference.

All classical inference statements … are probability statements about x given θ, phrased so as to appear to be probability statements about θ.

Emphasis in the original.

Related posts:

Four reasons to use Bayesian inference
Four pillars of Bayesian statistics

{ 0 comments }

Four reasons to use Bayesian inference

by John on April 28, 2009

The following is a direct quote from Anthony O’Hagan’s book Bayesian Inference. I’ve edited the quote only to enumerate the points.

Why should one use Bayesian inference, as opposed to classical inference? There are various answers. Broadly speaking, some of the arguments in favour of the Bayesian approach are that it is

  1. fundamentally sound,
  2. very flexible,
  3. produces clear and direct inferences,
  4. makes use of all available information.

I’ll elaborate briefly on each of O’Hagan’s points.

Bayesian inference has a solid philosophical foundation. It is consistent with certain axioms of rational inference. Non-Bayesian systems of inference, such as fuzzy logic, must violate one or more of these axioms; their conclusions are rationally satisfying to the extent that they approximate Bayesian inference.

Bayesian inference is at the same time rigid and flexible. It is rigid in the sense that all inference follows the same form: set up a likelihood and a prior, then calculate the posterior by conditioning on observed data via Bayes’ theorem. But this rigidity channels creativity into useful directions. It provides a template for setting up complex models when necessary.
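The template is easy to see in a toy example. The sketch below is not from O’Hagan’s book; it assumes hypothetical data of 7 successes in 10 trials, a uniform prior on the success probability θ, and a simple grid approximation in place of an exact calculation.

```python
def grid_posterior(prior, likelihood, grid):
    """Posterior probabilities on a grid: prior times likelihood, normalized."""
    weights = [prior(t) * likelihood(t) for t in grid]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical data: 7 successes in 10 Bernoulli trials.
successes, trials = 7, 10
grid = [(i + 0.5) / 1000 for i in range(1000)]    # midpoints in (0, 1)

prior = lambda t: 1.0                             # uniform prior on theta
likelihood = lambda t: t**successes * (1 - t)**(trials - successes)

posterior = grid_posterior(prior, likelihood, grid)
post_mean = sum(t * w for t, w in zip(grid, posterior))
prob_above_half = sum(w for t, w in zip(grid, posterior) if t > 0.5)
print(post_mean, prob_above_half)
```

Swapping in a different prior or likelihood changes two lines and nothing else, which is the rigidity-as-template point. The second number printed is a direct probability statement about θ given the data.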

Frequentist inferences are awkward to explain. For example, confidence intervals and p-values are tedious to define rigorously. Most consumers of confidence intervals and p-values do not know what they mean and implicitly assume Bayesian interpretations. The difference is not simply pedantic. Particularly with regard to p-values, the common understanding can be grossly inaccurate. By contrast, Bayesian counterparts are simple to define and interpret. Bayesian credible intervals are exactly what most people think confidence intervals are. And a Bayesian hypothesis test simply compares the probability of each hypothesis via Bayes factors.

Sometimes the necessity of specifying prior distributions is seen as a drawback to Bayesian inference. On the other hand, the ability to specify prior distributions means that more information can be incorporated in an inference. See Musicians, drunks, and Oliver Cromwell for a colorful illustration from Jim Berger on the need to incorporate prior information.

Related posts:

Four pillars of Bayesian statistics
Bayesian statistics is misnamed
What is a confidence interval?
Why most published research results are false

{ 3 comments }

Bayesian statistics is misnamed

by John on April 20, 2009

I’m teaching an introduction to Bayesian statistics. My first thought was to start with Bayes’ theorem, as many introductions do. But this isn’t the right starting point. Bayes’ theorem is an indispensable tool for Bayesian statistics, but it is not the foundational principle. The foundational principle of Bayesian statistics is the decision to represent uncertainty by probabilities. Unknown parameters have probability distributions that represent the uncertainty in our knowledge of their values.

Once you decide to use probabilities to express parameter uncertainty, you inevitably run into the need for Bayes’ theorem to work with these probabilities. Bayes’ theorem is applied constantly in Bayesian statistics, and that is why the field takes its name from the theorem’s author, Reverend Thomas Bayes (1702-1761). But “Bayesian” doesn’t describe Bayesian statistics quite the same way that “frequentist” describes frequentist statistics. The term “frequentist” gets to the heart of how frequentist statistics interprets probability. But “Bayesian” refers to Bayes’ theorem, a computational tool for carrying out probability calculations in Bayesian statistics. If frequentist statistics were analogously named, it might be called “Bernoullian statistics” after Jacob Bernoulli’s law of large numbers.

The term “Bayesian” statistics might imply that frequentist statisticians dispute Bayes’ theorem. That is not the case. Bayes’ theorem is a simple mathematical result. What people dispute is the interpretation of the probabilities that Bayesians want to stick into Bayes’ theorem.

I don’t have a better name for Bayesian statistics. Even if I did, the name “Bayesian” is firmly established. It’s certainly easier to say “Bayesian statistics” than to say “that school of statistics that represents uncertainty in unknown parameters by probabilities,” even though the latter is accurate.

Related posts:

Four pillars of Bayesian statistics
What a probability means
Plausible reasoning
The probability that Shakespeare wrote a play

{ 5 comments }