From the category archives:

Statistics

Incredibly simple approximation

by John on June 29, 2009

Suppose you need to find the slope of a line through a set of data. You can get a surprisingly good approximation by simply fitting a line to the first and last points. This is known as “Bancroft’s rule.” It seems too good to be true. Of course just fitting a line to just two points is not as good as using all the data, but unless you have a fairly large amount of data, it’s not too much worse either. It’s good enough for a quick estimate.

Just how good is this estimate compared to using all the data? We’ll look at the technical details of an example below.

Suppose you have a regression model y = α + βx + ε where ε is random noise. Suppose ε is normally distributed with mean 0 and variance σ2. Let b be the least squares estimator of β. The variance in b is

\frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}

Now suppose we have observations yi corresponding to xi = 0, 1, 2, …, 2n. The average value of x is n, and the denominator in the expression for the variance of the slope estimator is 2(12 + 22 + 32 + … + n2) = (2n3 + 3n2 + n)/3. If we just use the data at x = 0 and x = 2n, the denominator is (0 - n)2 + (2n - n)2 = 2n2.

If we divide the estimator variance based on Bancroft’s rule by the estimator variance using all the data, the σ2 terms cancel and we are left with n/3 + 1/2 + 1/6n. So Bancroft’s rule increases the variance in the estimate for the slope by roughly n/3 compared to using all the data. Thus it increases the confidence interval by roughly the square root of n/3. So if you had 12 data points, the confidence interval would be about twice as wide. Said another way, the estimate based on all the data is only twice as good as the estimate based on just the first and last points.

Related posts:

Probability mistake can give a good approximation
Rolling dice for normal samples
Approximate problems and approximate solutions
Comparing two ways to fit a line to data

{ 1 comment }

John Tukey’s median of medians

by John on June 23, 2009

Yesterday I got an email from Jestin Abraham asking a question about Tukey’s “median of medians” paper from 1978. (The full title is “The Ninther, a Technique for Low-Effort Robust (Resistant) Location in Large Samples.”) Jestin thought I might be familiar with the paper since I’ve written about Tukey several times, but I’d never heard of it.

Tukey’s “ninther” or “median of medians” procedure is quite simple. Understanding the problem he was trying to solve is a little more difficult.

Suppose you are given nine data points: y1, y2, …, y9. Let yA be the median of the first three samples, yB the median of the next three samples, and yC the median of the last three samples. The “ninther” of the data set is the median of yA, yB, and yC, hence the “median of medians.” If the data were sorted, the ninther would simply be the median, but in general it will not be.

For example, suppose your data are 3, 1, 4, 4, 5, 9, 9, 8, 2.  Then

yA = median( 3, 1, 4 ) = 3
yB = median( 4, 5, 9 ) = 5
yC = median( 9, 8, 2 ) = 8

and so the ninther is median( 3, 5, 8 ) = 5. The median is 4.

That’s Tukey’s solution, so what was his problem? First of all, he’s trying to find an estimate for the central value of a large data set. Assume the data come from a symmetric distribution so that the mean equals the median. He’s looking for a robust estimator of the mean, an estimator resistant to the influence of outliers. That’s why he’s using an estimator that is more like the median than the mean.

Why not just use the median? Computing the sample median requires storing all data points and then sorting them to pick the middle value. Tukey wants to do his computation in one pass without storing the data. Also, he wants to do as few comparisons and as few arithmetic operations as possible. His ninther procedure uses no arithmetic operations and only order comparisons. He shows that it uses only about 1.1 comparisons per data point on average and 1.33 comparisons per data point in the worst case.

How well does Tukey’s ninther perform? He shows that if the data come from a normal distribution, the ninther has about 55% efficiency relative to the sample mean. That is, the variances of his estimates are a little less than twice the variances of estimates using the sample mean. But the purpose of robust statistics is efficient estimation in case the data do not come from a normal distribution but from a distribution with thicker tails. The relative efficiency of the ninther improves when data do come from distributions with thicker tails.

Where do large data sets come in? So far we’ve only talked about analyzing data sets with nine points. Tukey’s idea was to use the ninther in conjunction with the median. For some large number M, you could estimate the central value of 9M data points by applying the ninther to groups of 9 points and take the median of the M ninthers. This still requires computing the median of M points, but the memory requirement has been reduced by a factor of 9. Also, the sorting time has been reduced by more than a factor of 9 since sorting n points takes time proportional to n log n.

For even larger data sets, Tukey recommended breaking the data in to sets of 81 points and computing the ninther of the ninthers. Then 81M data points could be processed by storing and sorting M values.

Tukey gave M = 1,000,000 as an example of what he called an “impractically large data set.” I suppose finding the median of 81 million data points was impractical in 1978, though it’s a trivial problem today. Perhaps Tukey’s ninther is sill useful for an embedded device with extremely limited resources that must process enormous amounts of data.

Other posts on robust statistics:

Canonical examples from robust statistics
Efficiency of median versus mean

Other posts on John Tukey:

Approximate problems and approximate solutions
John Tukey and Aristotle
Tukey tallying
When discoveries stay discovered

{ 4 comments }

Statistical lexicon

by John on May 23, 2009

See Andrew Gelman’s post Handy statistical lexicon for a list of useful aphorisms. Here’s one of my favorites.

Pinch-Hitter Syndrome: People whose job it is to do just one thing are not always so good at that one thing.

{ 0 comments }

R package for robust priors

by John on May 11, 2009

Jairo Fuquene has released an R package on CRAN to accompany our paper

A Case for Robust Bayesian priors with Applications to Binary Clinical Trials
Jairo A. Fuquene P., John D. Cook, Luis Raul Pericchi

{ 2 comments }

Classical statistics in a nutshell

by John on May 4, 2009

Here’s another quote from Anthony O’Hagan’s book Bayesian Inference.

All classical inference statements … are probability statements about x given θ, phrased so as to appear to be probability statements about θ.

Emphasis in the original.

Related posts:

Four reasons to use Bayesian inference
Four pillars of Bayesian statistics

{ 0 comments }

Four reasons to use Bayesian inference

by John on April 28, 2009

The following is a direct quote from Anthony O’Hagan’s book Bayesian Inference. I’ve edited the quote only to enumerate the points.

Why should one use Bayesian inference, as opposed to classical inference? There are various answers. Broadly speaking, some of the arguments in favour of the the Bayesian approach are that it is

  1. fundamentally sound,
  2. very flexible,
  3. produces clear and direct inferences,
  4. makes use of all available information.

I’ll elaborate briefly on each of O’Hagan’s points.

Bayesian inference has a solid philosophical foundation. It is consistent with certain axioms of rational inference. Non-Bayesian systems of inference, such as fuzzy logic, must violate one or more of these axioms; their conclusions are rationally satisfying to the extent that they approximate Bayesian inference.

Bayesian inference is at the same time rigid and flexible. It is rigid in the sense that all inference follows the same form: set up a likelihood and a prior, then calculate the posterior by conditioning on observed data via Bayes theorem. But this rigidity channels creativity into useful directions. It provides a template for setting up complex models when necessary.

Frequentist inferences are awkward to explain. For example, confidence intervals and p-values are tedious to define rigorously. Most consumers of confidence intervals and p-values do not know what they mean and implicitly assume Bayesian interpretations. The difference is not simply pedantic. Particularly with regard to p-values, the common understanding can be grossly inaccurate. By contrast, Bayesian counterparts are simple to define and interpret. Bayesian credible intervals are exactly what most people think confidence intervals are. And a Bayesian hypotheses test simply compares the probability of each hypothesis via Bayes factors.

Sometimes the necessity of specifying prior distributions is seen as a drawback to Bayesian inference. On the other hand, the ability to specify prior distributions means that more information can be incorporated in an inference. See Musicians, drunks, and Oliver Cromwell for a colorful illustration from Jim Berger on the need to incorporate prior information.

Related posts:

Four pillars of Bayesian statistics
Bayesian statistics is misnamed
What is a confidence interval?
Why most published research results are false

{ 3 comments }

Bayesian statistics is misnamed

by John on April 20, 2009

I’m teaching an introduction to Bayesian statistics. My first thought was to start with Bayes theorem, as many introductions do. But this isn’t the right starting point. Bayes’ theorem is an indispensable tool for Bayesian statistics, but it is not the foundational principle. The foundational principle of Bayesian statistics is the decision to represent uncertainty by probabilities. Unknown parameters have probability distributions that represent the uncertainty in our knowledge of their values.

Once you decide to use probabilities to express parameter uncertainty, you inevitably run into the need for Bayes theorem to work with these probabilities. Bayes theorem is applied constantly in Bayesian statistics, and that is why the field takes its name from the theorem’s author, Reverend Thomas Bayes (1702-1761). But “Bayesian” doesn’t describe Bayesian statistics quite the same way that “Frequentist” described frequentist statistics. The term “frequentist” gets to the heart of how frequentist statistics interprets probability. But “Bayesian” refers to a Bayes theorem, a computational tool for carrying out probability calculations in Bayesian statistics. If frequentist statistics were analogously named, it might be called “Bernoullian statistics” after Jacob Bernoulli’s law of large numbers.

The term “Bayesian” statistics might imply that frequentist statisticians dispute Bayes’ theorem. That is not the case. Bayes’ theorem is a simple mathematical result. What people dispute is the interpretation of the probabilites that Bayesians want to stick into Bayes’ theorem.

I don’t have a better name for Bayesian statistics. Even if I did, the name “Bayesian” is firmly established. It’s certainly easier to say “Bayesian statistics” than to say “that school of statistics that represents uncertainty in unknown parameters by probabilities,” even though the latter is accurate.

Related posts:

Four pillars of Bayesian statistics
What a probability means
Plausible reasoning
The probability that Shakespeare wrote a play

{ 5 comments }

Four pillars of Bayesian statistics

by John on April 7, 2009

Anthony O’Hagan’s book Bayesian Inference lists four basic principles of Bayesian statistics at the end of the first chapter:

  1. Prior information. Bayesian statistics provides a systematic way to incorporate what is known about parameters before an experiment is conducted. As a colleague of mine says, if you’re going to measure the distance to the moon, you know not to pick up a yard stick. You always know something before you do an experiment.
  2. Subjective probability. Some Bayesians don’t agree with the subjective probability interpretation, but most do, in practice if not in theory. If you write down reasonable axioms for quantifying degrees of belief, you inevitably end up with Bayesian statistics.
  3. Self-consistency. Even critics of Bayesian statistics acknowledge that Bayesian statistics has a rigorous self-consistent foundation. As O’Hagan says in his book, the difficulties with Bayesian statistics are practical, not foundational, and the practical difficulties are being resolved.
  4. No adhockery. Bruno de Finetti coined the term “adhockery” to describe the profusion of frequentist methods. More on this below.

This year I’ve had the chance to teach a mathematical statistics class primarily focusing on frequentist methods. Teaching frequentist statistics has increased my appreciation for Bayesian statistics. In particular, I better understand the criticism of frequentist adhockery.

For example, consider point estimation. Frequentist statistics to some extent has standardized on minimum variance unbiased estimators as the gold standard. But why? And what do you do when such estimators don’t exist?

Why focus on unbiased estimators? Granted, lack of bias sounds like a good thing to have. All things being equal, it would be better to be unbiased than biased. But all things are not equal. Sometimes unbiased estimators are ridiculous. Why only consider biased vs. unbiased rather, a binary choice, rather than degree of bias, a continuous choice? Efficiency is also important, and someone may reasonably accept a small amount of bias in exchange for a large increase in efficiency.

Why minimize expected mean squared error? Efficiency in classical statistics is typically measured by expected mean squared error. But why not minimize some other measure of error? Why use an exponent of 2 and not 1, or 4, or 2.738? Or why limit yourself to power functions at all? The theory is simplest for squared error, and while this is a reasonable choice in many applications, it is still an arbitrary choice.

How much emphasis should be given to robustness? Once you consider robustness, there are infinitely many ways to compromise between efficiency and robustness.

Many frequentists are asking the same questions and are investigating alternatives. But I believe these alternatives are exactly what de Finetti had in mind: there are an infinite number of ad hoc choices you can make. Bayesian methods are criticized because prior distributions are explicitly subjective. But there are myriad subjective choices that go into frequentist statistics as well, though these choices are often implicit.

There is a great deal of latitude in Bayesian statistics as well, but the latitude is confined to fit within a universal framework: specify a likelihood and prior distribution, then update the model with data to compute the posterior distribution. There are many ways to construct a likelihood (exactly as in frequentist statistics), many ways to specify a prior, and many ways to summarize the information contained in the posterior distribution. But the basic framework is fixed. (In fact, the framework is inevitable given certain common-sense rules of inference.)

Related posts:

Probability and information
What is a confidence interval?

{ 8 comments }

Canonical examples from robust statistics

by John on March 11, 2009

The following definition of robust statistics comes from P. J. Huber’s book Robust Statistics.

… any statistical procedure should possess the following desirable features:

  1. It should have a reasonably good (optimal or nearly optimal) efficiency at the assumed model.
  2. It should be robust in the sense that small deviations from the model assumptions should impair the performance only slightly. …
  3. Somewhat larger deviations from the model should not cause a catastrophe.

Classical statistics focuses on the first of Huber’s points, producing methods that are optimal subject to some criteria. This post looks at the canonical examples used to illustrate Huber’s second and third points.

[click to continue...]

{ 0 comments }

Two common ways to estimate the center of a set of data are the sample mean and the sample median. The sample mean is sometimes more efficient, but the sample median is always more robust. (I’m going to cut to the chase first, then go back and define basic terms like “median” and “robust” below.)

When the data come from distributions with thick tails, the sample median is more efficient. When the data come from distributions with a thin tail, like the normal, the sample mean is more efficient. The Student-t distribution illustrates both since it goes from having thick tails to having thinner tails as the degrees of freedom, denoted ν, increase.

When ν = 1, the Student-t is a Cauchy distribution and the sample mean wanders around without converging to anything, though the sample median behaves well. As ν increases, the Student-t becomes more like the normal and the relative efficiency of the sample median decreases.

Here is a plot of the asymptotic relative efficiency (ARE) of the median compared to the mean for samples from a Student-t distribution as a function of the degrees of freedom ν. The vertical axis is ARE and the horizontal axis is ν.

The curve crosses the top horizontal line at 4.67879. For values of ν less than that cutoff, the median is more efficient. For larger values of ν, the mean is more efficient. As ν gets larger, the relative efficiency of the median approaches the corresponding relative efficiency for the normal, 2/π = 0.63662, indicated by the bottom horizontal line.

Backing up

The sample mean is just the average of the sample values. The median is the middle value when the data are sorted.

Since data have random noise, statistics based on the data are also random. Statistics are generally less random than the data they’re computed from, but they’re still random. If you were to compute the mean, for example, many times, you’d get a different result each time. The estimates bounce around. But there are multiple ways of estimating the same thing, and some ways give estimates bounce around less than others. These are said to be more efficient. If your data come from a normal example, the sample median bounces around about 25% more than the sample mean. (The variance of the estimates is about 57% greater, so the standard deviations are about 25% greater.)

But what if you’re wrong? What if you think the data are coming from a normal distribution but they’re not. Maybe they’re coming from another distribution, say a Student-t. Or maybe they’re coming from a mixture of normals. Say 99% of the time the samples come from a normal distribution with one mean, but 1% of the time they come from a normal distribution with another mean. Now what happens? That is the question robustness is concerned with.

Say you have 100 data points, and one of them is replaced with ∞. What happens to the average? It becomes infinite. What happens to the median? Either not much or nothing at all depending on which data point was changed. The sample median is more robust than the mean because it is more resilient to this kind of change.

Asymptotic relative efficiency (ARE) is a way of measuring how much statistics bounce around as the amount of data increases. If I take n data points and look at √n times the difference between my estimator and the thing I’m estimating, often that becomes approximately normally distributed as n increases. If I do that for two different estimators, I can take the ratio of the variances of the normal distributions that this process produces for each. That’s the asymptotic relative efficiency.

Often efficiency and robustness are in tension and you have to decide how much efficiency you’re willing to trade off for how much robustness. ARE gives you a way of measuring the loss in efficiency if you’re right about the distribution of the data but choose a more robust, more cautious estimator. Of course if you’re significantly wrong about the distribution of the data (and often you are!) then you’re better off with the more robust estimator.

{ 1 comment }

Interesting post from Brendan O’Connor:

Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata

{ 2 comments }

StackOverflow reputation statistics

by John on March 2, 2009

What is the distribution of StackOverflow user reputation scores? Here’s a graph of the number of users with reputations between 0 and 100, 100 and 200, …, 900 and 1000. (The scores go out much further, but the curve looks flat compared to the peak unless you zoom in further.)

This graph was based on a snapshot of the user reputations one day last week. The largest group, 15,219 users, had reputation less than 100. There were 2,494 users with reputation between 100 and 200, etc. The number of users in a 100-point reputation range generally decreases as the reputation score increases. The majority of users have reputation less than 100, and yet the top percentile have reputations over 4,800 and the highest reputation was 38,700. This sort of extreme disparity suggests a power law distribution.

The test for whether the reputation scores follow a power law is to plot the logarithms of the number of people with each score and look for a straight line. And after an initial steep drop off, the logs of the counts do fall roughly on a straight line.

(The graph goes out to scores below 7,700. Beyond that point, there are a few empty 100-point ranges. I stopped at 7,700 to avoid taking logs of zeros.)

The average reputation was 364, though the average does not mean much with a power law distribution. Instead of a bell shape centered around the average, there is a long tail. The average is not the middle because there is no middle to a power law.

Update: As pointed out in the comments, I should have plotted with the log of the reputation score to test for a power law distribution. Whether or not there is a power law here, however, there is a long tail and there’s no useful “middle.”

Other posts about StackOverflow:

Voting patterns on StackOverflow
StackOverflow question statistics

Other posts about power laws:

Networks and power laws
Rate of regularizing English verbs
Metabolism and power laws

{ 5 comments }

The data may not contain the answer

by John on February 18, 2009

Mark Reid sent me a link to a couple quotes by John Tukey that I had not seen before. First,

To statisticians, hubris should mean the kind of pride that fosters and inflated idea of one’s powers and thereby keeps one from being more than marginally helpful to others. … The feeling of “Give me (or more likely even, give my assistant) the data, and I will tell you what the real answer is!” is one we must all fight against again and again, and yet again.

Also,

The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.

Here are some more posts about John Tukey:

Approximate problems and approximate solutions
Innovation IV
Tukey tallying
How to linearize data for regression

{ 2 comments }

Here’s an easy error to fall into in statistics. Suppose I have n samples from a normal(μ, σ2) distribution, say n = 16, and σ is unknown. What is the distribution of the average of the samples? A common mistake is to say Student-t: if σ is known, the sample mean has a normal distribution, otherwise it has a t distribution.

But that’s wrong. Your ignorance of σ does not change the distribution of the data. There’s no spooky quantum effect that changes the data based on your knowledge. A linear combination of independent normal random variables is another normal random variable, so the sample mean has a normal distribution, whether or not you know its variance. Your knowledge or ignorance of σ doesn’t change the distribution of the data; it changes what you’re likely to want to do regarding the data. When the variance is unknown, you use procedures involving the sample variance rather than the distribution variance. This doesn’t change the distribution of the data but it changes the distribution you construct (implicitly) in your analysis of the data.

{ 3 comments }

What is a confidence interval?

by John on February 3, 2009

At first glance, a confidence interval is simple. If we say [3, 4] is a 95% confidence interval for a parameter θ, then there’s a 95% chance that θ is between 3 and 4. That explanation is not correct, but it works better in practice than in theory.

If you’re a Bayesian, the explanation above is correct if you change the terminology from “confidence” interval to “credible” interval. But if you’re a frequentist, you can’t make probability statements about parameters.

Confidence intervals take some delicate explanation. I took a look at Andrew Gelman and Deborah Nolan’s book Teaching Statistics: a bag of tricks,  to see what they had to say about teaching confidence intervals. They begin their section on the topic by saying “Confidence intervals are complicated …” That made me feel better. Some folks with more experience teaching statistics also find this challenging to teach. And according to The Lady Testing Tea, confidence intervals were controversial when they were first introduced.

From a frequentist perspective, confidence intervals are random, parameters are not, exactly the opposite of what everyone naturally thinks. You can’t talk about the probability that θ is in an interval because θ isn’t random. But in that case, what good is a confidence interval? As L. J. Savage once said,

The only use I know for a confidence interval is to have confidence in it.

In practice, people don’t go too wrong using the popular but technically incorrect notion of a confidence interval. Frequentist confidence intervals often approximate Bayesian credibility intervals; the frequentist approach is more useful in practice than in theory.

It’s interesting to see a sort of détente between frequentist and Bayesian statisticians. Some frequentists say that the Bayesian interpretation of statistics is nonsense, but the methods these crazy Bayesians come up with often have good frequentist properties. And some Bayesians say that frequentist methods, such as confidence intervals, are useful because they can come up with results that often approximate Bayesian results.

{ 0 comments }