Protestant PCs, Catholic Macs

Here’s a post that’s sure to be controversial. Not only am I bringing up religion, I’m comparing Macs and PCs.

I heard someone refer to an article saying Macs are Catholic and PCs are Protestant. I haven’t seen the article in question, so I can only imagine what analogy it had in mind. I doubt that Catholics are more likely to buy Macs than Protestants are, though I suppose that’s possible. I imagine the analogy has to do with centralized authority. There is one vendor for Mac hardware and untold numbers of vendors for PC hardware. Also, Steve Jobs makes a more plausible Pope-figure than Bill Gates.

The analogy may also have to do with taste. Although I am a Protestant, I’m willing to concede that Catholic churches seem to have better taste in architecture and music. And although I am a PC user, I’ll admit that Apple generally has better taste than Microsoft.

Update: See link to an article by Umberto Eco that Sandra kindly provided in the comments.

Quantifying the error in the central limit theorem

When I was preparing for a statistics class I’m teaching now, I wrote up some notes on the error in the central limit theorem (CLT) for a few common distributions. Under mild assumptions, the CLT says that if you take any distribution and average enough samples from it, the result is nearly a normal (Gaussian) distribution. The more samples you take, the closer the average is to being normal. That means you can use a normal distribution to approximate the distribution of an average of other distributions. That raises a couple questions.

  1. What can you say about the approximation error in general?
  2. What can you say about the approximation error in important special cases?

In other words, if I take some large but finite number of samples, can I get a numerical bound on the difference between the distribution of my average and the normal distribution? And if I’m not averaging just any old distributions but well-known distributions (binomial, Poisson, gamma, etc.) can I do better than in general?
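Before turning to theory, here is a C++ simulation sketch of my own (the replication count, seed, and choice of uniform samples are arbitrary) that estimates the first quantity directly: the maximum difference between the CDF of a standardized average of n uniform(0, 1) samples and the standard normal CDF.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Standard normal CDF via the complementary error function.
double normal_cdf(double x)
{
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

int main()
{
    const int replications = 200000;
    const double mu = 0.5;                       // mean of uniform(0,1)
    const double sigma = 1.0 / std::sqrt(12.0);  // std. dev. of uniform(0,1)

    std::mt19937_64 gen(42);
    std::uniform_real_distribution<double> unif(0.0, 1.0);

    const int sizes[] = { 1, 4, 16, 64 };
    for (int n : sizes)
    {
        // Generate many standardized averages of n samples.
        std::vector<double> z(replications);
        for (int r = 0; r < replications; ++r)
        {
            double sum = 0.0;
            for (int i = 0; i < n; ++i) sum += unif(gen);
            z[r] = (sum / n - mu) / (sigma / std::sqrt(static_cast<double>(n)));
        }
        std::sort(z.begin(), z.end());

        // Kolmogorov-style distance between the empirical CDF and the normal CDF.
        double max_diff = 0.0;
        for (int r = 0; r < replications; ++r)
        {
            double phi = normal_cdf(z[r]);
            max_diff = std::max(max_diff, std::fabs(phi - (r + 1.0) / replications));
            max_diff = std::max(max_diff, std::fabs(phi - static_cast<double>(r) / replications));
        }
        std::printf("n = %3d  estimated max CDF error = %.4f\n", n, max_diff);
    }
    return 0;
}
```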

The Berry-Esséen theorem answers the first question. If the distributions you’re averaging have a finite third absolute central moment ρ, then the maximum error when averaging n samples is bounded by C ρ / (σ³ √n), where C is a constant less than 0.8 and σ is the standard deviation of the distributions.

There is a variation on the Berry-Esséen theorem that gives the error for a particular x rather than the maximum error. The error for a particular x is bounded by D ρ / ((1 + |x|³) σ³ √n). The constant D is known to be less than 31. This gives an improvement over the maximum error estimates when x is large. However, this may not be so useful. The absolute error in the CLT approximation is small for large x, but only because we’re approximating one small probability by another. The relative error in the approximation may be enormous for large x.
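To make the bound concrete, here is a short C++ sketch of my own that evaluates C ρ / (σ³ √n) for averages of uniform(0, 1) samples, taking C = 0.8 as quoted above. For the uniform distribution, σ² = 1/12 and ρ = E|X − 1/2|³ = 1/32.

```cpp
#include <cmath>
#include <cstdio>

int main()
{
    // Berry-Esseen bound on the maximum CDF error of a standardized
    // average of n samples: C * rho / (sigma^3 * sqrt(n)).
    const double C     = 0.8;                    // constant quoted above
    const double sigma = 1.0 / std::sqrt(12.0);  // std. dev. of uniform(0,1)
    const double rho   = 1.0 / 32.0;             // E|X - 1/2|^3 for uniform(0,1)

    for (int n = 10; n <= 100000; n *= 10)
    {
        double bound = C * rho / (std::pow(sigma, 3) * std::sqrt(static_cast<double>(n)));
        std::printf("n = %6d  bound = %.5f\n", n, bound);
    }
    return 0;
}
```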

I was primarily interested in the second question above, sharper error estimates for well-known distributions. I was surprised that I couldn’t find much written on the subject. There are some results along these lines, but apparently not many. According to one recent and rather large book on this subject, “no systematic studies along this direction seem to have been done.”

Here are a few pages I wrote about the errors in normal approximations, with more emphasis on numerical examples than on theoretical error bounds. Here are the notes by distribution family.

Also, here are notes on applying the Berry-Esséen theorem to the normal approximation to the Poisson.

Programmer hit by a bus

Programmers often express their (over)dependence on each other by asking what would happen if one of their colleagues were hit by a bus. For example, “Who is going to maintain this code if Mike gets hit by a bus?” or “That’s great that Joe can fix this, but what would we do if Joe were hit by a bus?” The figure of speech is always being hit by a bus, never any other tragedy. I suppose being hit by a bus is sufficiently unlikely that it’s safe to joke about.

I had never heard of a programmer literally being hit by a bus until this afternoon. Thomas Guest wrote about a colleague who was actually hit by a bus. Of course that makes the metaphor more serious.

There’s nothing funny about actually being hit by a bus. And I am relieved to hear that his colleague was able to return to work. But judging from the photo Thomas includes in his post, I was amused that on the other side of the Atlantic they apparently imagine folks being hit by double-decker buses:

[Photo: red double-decker bus]

When I’ve joked about coworkers being hit by a bus, I’ve always imagined something more like a Greyhound.

[Photo: Greyhound bus]

Theoretical explanation for numerical results

In an earlier post, I compared three methods of computing sample variance. All would give the same result in exact arithmetic, but on a computer they can give very different results. Here I will explain why the results were so bad using the sum of squares formula

s^2 = \frac{1}{n(n-1)}\left(n\sum_{i=1}^n x_i^2 -\left(\sum_{i=1}^n x_i\right)^2\right)

but were much better using the direct formula

s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i -\bar{x})^2

(The third method, Welford’s method, is preferable to both of the above methods, but its accuracy is more similar to the direct formula.)

In the numerical examples from the previous post, I computed the sample variance of xᵢ = N + uᵢ where N was a large number (10⁶, 10⁹, or 10¹²) and the uᵢ were one million (10⁶) samples drawn uniformly from the unit interval (0, 1). In exact arithmetic, adding N to a sample does not change its variance, but in floating point arithmetic it matters a great deal.

When N equaled 10⁹, the direct formula correctly reported 0.083226511 while the sum of squares method gave -37154.734 even though variance cannot be negative. Why was one method so good and the other so bad? Before we can answer that, we have to recall the cardinal rule of numerical computing:

If x and y agree to m significant figures, up to m significant figures can be lost in computing x-y.

We’ll use the rule as stated above, though it’s not exactly correct. Since arithmetic is carried out in binary, the rule more accurately says

If x and y agree to m bits, up to m bits of precision can be lost in computing x-y.
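Here is a tiny C++ illustration of the rule, with a value I made up for the example: x and y agree to about 10 significant figures, so only 6 or 7 of the roughly 16 figures a double carries survive the subtraction.

```cpp
#include <cstdio>

int main()
{
    // x and y agree in their first 10 significant figures.
    double x = 1.0e9 + 0.123456789012345;
    double y = 1.0e9;

    // The exact difference is 0.123456789012345, but only about the
    // first 7 figures of the computed result are correct.
    std::printf("x - y = %.15f\n", x - y);
    return 0;
}
```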

Now back to our problem. When we use the direct formula, we subtract the mean from each sample. The mean will be around N + 1/2 and each sample will be between N and N + 1. So a typical sample agrees with the mean to around 9 significant figures since N = 10⁹. A double precision floating point number contains about 15 significant figures, so we retain about 6 figures in each calculation. We’d expect our final result to be accurate to around 6 figures.

Now let’s look at the sum of squares formula. With this approach, we are computing the sample variance as the difference of

\frac{1}{n-1} \sum_{i=1}^n x_i^2

and

\frac{1}{n(n-1)}\left( \sum_{i=1}^n x_i\right)^2

These two expressions are both on the order of N², or 10¹⁸. The true sample variance is on the order of 10⁻¹, so we’re subtracting two numbers that if calculated exactly would agree to around 19 significant figures. But we only have 15 significant figures in a double and so it should not be surprising that our answer is rubbish.

Now let’s go back and look at the case where N = 10⁶. When we use the direct method, we expect to lose 6 significant figures in subtracting off the mean, and so we would retain 9 significant figures. When we use the sum of squares formula, the two sums that we subtract are on the order of 10¹², and so we’re trying to calculate a number on the order of 10⁻¹ as the difference of two numbers on the order of 10¹². We would expect in this case to lose 13 significant figures in the subtraction. Since we start with 15, that means we’d expect about 2 significant figures in our answer. And that’s exactly what happened.
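The analysis above can be made tangible with a short C++ sketch. This is my illustration, not the code from the original experiments, and the generator and seed are arbitrary. It prints the two terms of the sum of squares formula for N = 10⁹, their difference, and the result of the direct formula.

```cpp
#include <cstdio>
#include <random>
#include <vector>

int main()
{
    const int n = 1000000;
    const double N = 1.0e9; // shift added to each sample

    std::mt19937_64 gen(42);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    std::vector<double> x(n);
    for (int i = 0; i < n; ++i)
        x[i] = N + unif(gen);

    // Sum of squares formula: the variance is a difference of two terms,
    // each on the order of 1e18, while the true variance is about 0.083.
    double sum = 0.0, sumsq = 0.0;
    for (int i = 0; i < n; ++i) { sum += x[i]; sumsq += x[i] * x[i]; }
    double term1 = sumsq / (n - 1.0);
    double term2 = sum * sum / (n * (n - 1.0));
    std::printf("term1 = %.6e\nterm2 = %.6e\ndifference = %g\n",
                term1, term2, term1 - term2);

    // Direct formula: subtract the mean first, then square the differences.
    double mean = sum / n;
    double ss = 0.0;
    for (int i = 0; i < n; ++i)
        ss += (x[i] - mean) * (x[i] - mean);
    std::printf("direct formula = %.9f\n", ss / (n - 1.0));

    return 0;
}
```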

Update: See related posts on simple regression coefficients and Pearson’s correlation coefficient.

Wine and politics

Yesterday I ran across an interview with Tyler Colman, a.k.a. Dr. Vino. He discussed his book Wine Politics, subtitled How Governments, Environmentalists and Mobsters Influence the Wines We Drink. The book grew out of Colman’s PhD thesis in political economy.

One of the things I found interesting in the interview was his comparison of regulation in the US and Europe. The common perception is that Europe is more heavily regulated than the US. At least with respect to wine, that’s not entirely true. Wine production is more heavily regulated in Europe, but wine distribution is more heavily regulated in the US. Here’s a quote to that effect from the interview.

It’s easier to ship a bottle of wine from Bordeaux to Berlin than it is from Napa to Nashville.

I was surprised how difficult it can be to ship wine within the US. On the other hand, the US places practically no restrictions on how wine is produced. In Europe, especially in France, wine production is incredibly regulated.

When I was in Bordeaux a couple years ago I had a chance to meet a vintner who produces a “garage wine.” He explained that, for example, government regulations only allow you to make wine from grapes grown on your own property; you cannot swap a bushel of grapes with a neighboring vineyard. He complained that restrictions on irrigation make it hard to compete with foreign wineries that do not face such restrictions.

See also Wine, Beer, and Statistics.

Comparing three methods of computing standard deviation

On Tuesday I posted a note about computing sample variance or sample standard deviation. I was recommending a method first proposed by B. P. Welford in 1962 and described by Donald Knuth in his magnum opus.

A comment on that post suggested that computing the variance directly from the definition

s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i -\bar{x})^2

might be just as accurate, though it requires two passes through the data. My initial reaction was that this would not be as accurate, but when I thought about it more I wasn’t so sure. So I did an experiment.

I compared three methods of computing sample variance: Welford’s method, the method I blogged about; what I’ll call the sum of squares method, using the formula below

s^2 = \frac{1}{n(n-1)}\left(n\sum_{i=1}^n x_i^2 -\left(\sum_{i=1}^n x_i\right)^2\right)

and what I’ll call the direct method, which first computes the sample mean and then sums the squares of the differences between the samples and the mean.
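For concreteness, here is a sketch of how the three methods might be implemented in C++. The function names and the seeded test driver are mine; this is not the code used in the experiments described below.

```cpp
#include <cstdio>
#include <random>
#include <vector>

// Sum of squares method: accumulate sum(x) and sum(x^2) in one pass.
double variance_sum_of_squares(const std::vector<double>& x)
{
    double sum = 0.0, sumsq = 0.0;
    for (double xi : x) { sum += xi; sumsq += xi * xi; }
    double n = static_cast<double>(x.size());
    return (n * sumsq - sum * sum) / (n * (n - 1.0));
}

// Direct method: two passes, subtracting the mean before squaring.
double variance_direct(const std::vector<double>& x)
{
    double sum = 0.0;
    for (double xi : x) sum += xi;
    double mean = sum / x.size();
    double ss = 0.0;
    for (double xi : x) ss += (xi - mean) * (xi - mean);
    return ss / (x.size() - 1.0);
}

// Welford's method: one pass, updating a running mean and sum of squared deviations.
double variance_welford(const std::vector<double>& x)
{
    double m = 0.0, s = 0.0;
    long long k = 0;
    for (double xi : x)
    {
        ++k;
        double m_old = m;
        m += (xi - m_old) / k;          // update running mean
        s += (xi - m_old) * (xi - m);   // update running sum of squared deviations
    }
    return s / (k - 1.0);
}

int main()
{
    const int n = 1000000;
    const double shift = 1.0e9; // makes the mean large relative to the variance

    std::mt19937_64 gen(42);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    std::vector<double> x(n);
    for (int i = 0; i < n; ++i)
        x[i] = shift + unif(gen);

    std::printf("sum of squares: %g\n",  variance_sum_of_squares(x));
    std::printf("direct:         %.9f\n", variance_direct(x));
    std::printf("Welford:        %.9f\n", variance_welford(x));
    return 0;
}
```

Welford’s method updates a running mean and a running sum of squared deviations as each sample arrives, which is what makes it a one-pass algorithm.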

I generated 10⁶ (1,000,000) samples from a uniform(0, 1) distribution and added 10⁹ (1,000,000,000) to each sample. I did this so the variance in the samples would be small relative to the mean. This is the case that causes accuracy problems. Welford’s method and the direct method were correct to 8 significant figures: 0.083226511. The sum of squares method gave the impossible result -37154.734.

Next I repeated the experiment, shifting the samples by 10¹² instead of 10⁹. This time Welford’s method gave 0.08326, correct to three significant figures. The direct method gave 0.3286, no correct figures. The sum of squares method gave 14233226812595.9.

Based on the above data, it looks like Welford’s method and the direct method are both very good, though Welford’s method does better in extreme circumstances. But the sum of squares method is terrible.

Next I looked at an easier case, shifting the uniform samples by 10⁶. This time Welford’s method was correct to 9 figures and the direct method correct to 13 figures. The sum of squares method was only correct to 2 figures.

I repeated the three cases above with normal(0, 1) samples and got similar results.

In summary, the sum of squares method is bad but the other two methods are quite good. The comment that prompted this post was correct.

The advantage of Welford’s method is that it requires only one pass. In my experiments, I generated each sample, updated Welford’s variables, and threw the sample away. But to use the direct method, I had to store the 1,000,000 random samples before computing the variance.

Update: See this followup post for a theoretical explanation for why the sum of squares method is so bad while the other methods are good. Also, the difficulties in computing sample variance show up in computing regression coefficients and in computing Pearson’s correlation coefficient.

Computing the inverse of the normal CDF

Someone asked me this week for C++ code to compute the inverse of the normal (Gaussian) distribution function. The code I usually use isn’t convenient to give away because it’s part of a large library, so I wrote a stand-alone function using an approximation out of Abramowitz and Stegun (A&S). There are a couple things A&S takes for granted, so I decided to write up the code in the spirit of a literate program to explain the details. The code is compact and portable. It isn’t as fast as possible nor as accurate as possible, but it’s good enough for many purposes.

A literate program to compute the inverse of the normal CDF
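For readers who just want something to paste in, here is a compact sketch along the same lines. It uses the rational approximation in A&S formula 26.2.23, which has absolute error of roughly 4.5 × 10⁻⁴; the function names are mine and the code is not necessarily identical to the literate program above.

```cpp
#include <cmath>
#include <cstdio>

// Rational approximation for the tail quantile, A&S formula 26.2.23:
// for 0 < p <= 0.5, the value x with upper tail probability p is
// approximately t - (c0 + c1 t + c2 t^2) / (1 + d1 t + d2 t^2 + d3 t^3)
// where t = sqrt(-2 log p). Absolute error is below about 4.5e-4.
static double rational_approximation(double t)
{
    const double c[] = { 2.515517, 0.802853, 0.010328 };
    const double d[] = { 1.432788, 0.189269, 0.001308 };
    return t - ((c[2]*t + c[1])*t + c[0]) /
               (((d[2]*t + d[1])*t + d[0])*t + 1.0);
}

// Inverse of the standard normal CDF. Requires 0 < p < 1;
// this sketch omits argument checking.
double normal_cdf_inverse(double p)
{
    if (p < 0.5)
        return -rational_approximation(std::sqrt(-2.0 * std::log(p)));
    else
        return rational_approximation(std::sqrt(-2.0 * std::log(1.0 - p)));
}

int main()
{
    // Spot check: values should be within about 4.5e-4 of the exact quantiles.
    const double ps[] = { 0.025, 0.5, 0.975 };
    for (double p : ps)
        std::printf("p = %.3f  quantile = %+.4f\n", p, normal_cdf_inverse(p));
    return 0;
}
```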

R programming coming from other languages

The R programming language is fairly easy to learn once you get in the right mindset, but until you get in that mindset the language is bewildering. If you come to R from any popular programming language, you’re in for a number of surprises. I recently revised some notes I first posted about a year ago to help programmers coming to R from other languages.