On Tuesday I posted a note about computing sample variance or sample standard deviation. I was recommending a method first proposed by B. P. Welford in 1962 and described by Donald Knuth in his magnum opus.

A comment on that post suggested that computing the variance directly from the definition

might be just as accurate, though it requires two passes through the data. My initial reaction was that this would not be as accurate, but when I thought about it more I wasn’t so sure. So I did an experiment.

I compared three methods of computing sample variance: Welford’s method, the method I blogged about; what I’ll call the sum of squares method, using the formula below

and what I’ll call the direct method, the method that first computes the sample mean then sums the squares of the samples minus the mean.

I generated 10^{6} (1,000,000) samples from a uniform(0, 1) distribution and added 10^{9} (1,000,000,000) to each sample. I did this so the variance in the samples would be small relative to the mean. This is the case that causes accuracy problems. Welford’s method and the direct method were correct to 8 significant figures: 0.083226511. The sum of squares method gave the impossible result -37154.734.

Next I repeated the experiment shifting the samples by 10^{12} instead of 10^{9}. This time Welford’s method gave 0.08326, correct to three significant figures. The direct method gave 0.3286, no correct figures. The sum of squares method gave 14233226812595.9.

Based on the above data, it looks like Welford’s method and the direct method are both very good, though Welford’s method does better in extreme circumstances. But the sum of squares method is terrible.

Next I looked at an easier case, shifting the uniform samples by 10^{6}. This time Welford’s method was correct to 9 figures and the direct method correct to 13 figures. The sum of squares method was only correct to 2 figures.

I repeated the three cases above with normal(0, 1) samples and got similar results.

In summary, the sum of squares method is bad but the other two methods are quite good. The comment that prompted this post was correct.

The advantage of Welford’s method is that it requires only one pass. In my experiments, I generated each sample, updated Welford’s variables, and threw the sample away. But to use the direct method, I had to store the 1,000,000 random samples before computing the variance.

**Update**: See this followup post for a theoretical explanation for why the sum of squares method is so bad while the other methods are good. Also, the difficulties in computing sample variance show up in computing regression coefficients and in computing Pearson’s correlation coefficient.

Welford’s method corresponds closely to a similar calculation for Pearson’s correlation coefficient that is accomplished in a single pass.

http://en.wikipedia.org/wiki/Correlation

Single pass computations that are numerically stable become important when thousands of combinations of overdetermined systems must be correlated and decomposed. The computational overhead of two pass algorithms becomes prohibitive with such problems.

Thanks for pointing out your tests on these methods.

I’ll second that. Nice work!

Another thought: In your example here you are adding a large constant to demonstrate the inaccuracy of the sum of squares method. Isn’t there also an issue here of the accuracy of the internal binary representation (mantissa) of the number? By adding 10^12 to a value between 0 and 1, you are increasing the mantissa by 12 digits (something else in binary), and likely causing the less significant ones to be truncated in the internal representation. This could also cause different results for the variance because the data may be inaccurately represented in the first place. (Did that make any sense? Forgive me but I’ve been away from this sort of thing for a long time).

This seems to be another reason why Welford’s method is superior.

I understand what you’re saying. The way I think about is is that the y’s are raw data. They happen to have been generated by this N + uniform process, but you could ignore that. Given the y’s, to the precision we have them, go find their sample variance.

John,

This has to be one of the coolest algorithms ever!!! I need to work out the math, but it seems that it should also be possible to keep a running mean and variance while removing data. Wouldn’t that allow to do real-time elimination of outliers, and give a more accurate running average of the data stream?

I’m thinking of stuff like a bicycle’s speed’o’meter, or a heart rate monitor: along your ride or run the sensor moves, or something odd happens, and suddenly the readings drop to zero, or double… Maybe hardware is better now, but that used to happen very often when I did triathlons, some five years ago. Even if it ony lasts for a short while, it may mess your average readings big time, which may be the thing you wanted to base your day’s training on.

I’ll try to elaborate more on the idea, but meanwhile cast the question to ye, the statisticians: is there any fundamental theoretical flaw in the above idea?

How did you compute the sample mean in the direct method? sum x[i]/n, (sum x[i])/n, welford’s method, or some iterative offset? Since the direct method outperformed welford’s for small offsets, I wouldn’t be entirely surprised to find out it was because the mean was inaccurate at large offsets.

Hi Jaime

Yes you are correct, you can implement statistical analysers that have no memory requirements apart from a small number of state variables. Great for time series like stock market. Totally cool. This whole field leads into something called “optimal estimation theory”. The most famous algorithm is prhaps the Kalman Filter, you want to rapidly predict the best path of a missile in real time as accurately and rapidly as possible, so you don’t under-correct or over correct the mean. I recall a back issue of Byte magazine discussing this way back in the 80’s implemented on an HP programmable calculator. I wrote a version in BASIC but you dont need to be a code monkey, a spreadsheet will do it.

Keith

The Welford (Knuth) method is very nice and can be used to compute a running variance of a time series. A different formula appears in S. Ross,

Simulation(p. 116 of the 2nd edition, Eqns 7.6 and 7.7). At least I think it is different: I haven’t determined been able to derive Ross’s formula from Welford’s.The interesting thing about the Ross formula is that you can use it to create a

vectorizedcomputation that can compute the running variance of a time series in a single operation (no loops) in languages such as R, MATLAB, and SAS/IML. Ross’s formula and the vectorized computation are both presented and discussed in a blog post:“Compute a running mean and variance.”

Hello,

thank you for that very interesting post, I’m not surprised it is now referenced by a lot of other articles on the web.

I’ve played around with tests similar to yours and I realized that Welford’s algorithm behave differently (and less well) when the samples are sorted.

For instance with 10^6 samples generated from a uniform(0,1) distribution, plus a shift of 10^9, I get 0.08318 with the direct two-pass algorithm and 0.09143 with Welford’s (the correct value being 0.08317).

If I shift the same samples by 10^10, I get 0.084 (direct) and 0.303 (Welford’s), still with sorted samples, so Welford’s one is starting to get really wrong here.

I don’t really know why, and I haven’t spent time trying to understand Welford’s algorithm, but I thought it might be interesting to share.

Arthur

Apache Commons Math’s Storeless Variance calculation might need an update.

(http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/descriptive/moment/Variance.html)

Here is an extension of Welford’s Algorithm for parallel computation of moments:

https://prod-ng.sandia.gov/techlib-noauth/access-control.cgi/2008/086212.pdf (basically, they formulate the relevant recursive formulas for two sets, instead of a set and a singleton).