How to compute standard deviation accurately

The most convenient way to compute sample variance by hand may not work in a program. Sample variance is given by

\sigma^2 = \frac{1}{n(n-1)}\left(n \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2\right)

If you compute the two summations and then carry out the subtraction above, you might be OK. Or you might have a large loss of precision: when the values are large compared to their spread, the two terms being subtracted are nearly equal, and the subtraction cancels most of the significant digits. You might even get a negative result, though in theory the quantity above cannot be negative. If you want the standard deviation rather than the variance, you may then be in for an unpleasant surprise when you take the square root.
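To see the problem concretely, here is a hypothetical demonstration in single precision (the data values and function name are mine, not from the original post). The true sample variance of the four values below is 30, but the one-pass formula loses it in the cancellation:

    #include <cstdio>
    #include <vector>

    // Textbook one-pass formula: (n*sum(x^2) - (sum x)^2) / (n(n-1)), all in
    // single precision. With values near 1e6, both terms are about 1.6e13,
    // where the spacing between adjacent floats is about 1e6 -- far larger
    // than the true difference of 360.
    float textbook_variance(const std::vector<float>& x)
    {
        float sum = 0.0f, sumsq = 0.0f;
        for (float xi : x) {
            sum   += xi;
            sumsq += xi * xi;
        }
        float n = static_cast<float>(x.size());
        return (n * sumsq - sum * sum) / (n * (n - 1.0f));
    }

    int main()
    {
        // The true sample variance of these values is 30.
        std::vector<float> x = {1e6f + 4.0f, 1e6f + 7.0f, 1e6f + 13.0f, 1e6f + 16.0f};
        std::printf("variance = %g\n", textbook_variance(x)); // wildly wrong, possibly negative
    }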

There is a simple but non-obvious way to compute sample variance that has excellent numerical properties. The algorithm was first published back in 1962 but is not as well known as it should be. Here are some notes explaining the algorithm, along with C++ code implementing it.

Accurately computing running variance

The algorithm has the added advantage that it keeps a running account of the mean and variance as data are entered sequentially.
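For reference, here is a minimal sketch of that running algorithm (the 1962 method, usually attributed to Welford); the class and member names below are my own, not the code from the linked notes:

    #include <cmath>
    #include <cstdio>

    class RunningStats {
    public:
        void push(double x)
        {
            ++n_;
            double delta = x - mean_;
            mean_ += delta / n_;           // update the running mean
            m2_   += delta * (x - mean_);  // accumulate squared deviations stably
        }
        long long count() const { return n_; }
        double mean() const { return mean_; }
        double variance() const { return n_ > 1 ? m2_ / (n_ - 1) : 0.0; } // sample variance
        double standard_deviation() const { return std::sqrt(variance()); }
    private:
        long long n_ = 0;
        double mean_ = 0.0;
        double m2_ = 0.0; // running sum of squared deviations from the current mean
    };

    int main()
    {
        RunningStats rs;
        for (double x : {1e6 + 4.0, 1e6 + 7.0, 1e6 + 13.0, 1e6 + 16.0})
            rs.push(x);
        std::printf("mean = %g, variance = %g\n", rs.mean(), rs.variance()); // variance = 30
    }

Each value is processed once and then discarded, which is what makes the algorithm suitable for streaming data.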


3 thoughts on “How to compute standard deviation accurately”

  1. Interesting. I can see how the sum-of-squared values could get large and be subject to substantial round-off error, but would it be more accurate than the other typical formula? (i.e., http://www.une.edu.au/WebStat/unit_materials/c4_descriptive_statistics/image14.gif)

    This formula is less efficient, requiring two passes through the data to get the variance, but it keeps a running sum just as Knuth’s algorithm does. My intuition is that it should be just as accurate. (A sketch of this two-pass formula appears after the comments.)

  2. Your intuition is correct. I didn’t agree when I first read your post, but I did some experiments, and the formula you mention is just as accurate except in extreme circumstances. I wrote a follow-up post about the results.

  3. I read somewhere that the original formula you quote after “Sample variance is given by” was recommended for use on rotary mechanical calculators of the type used by R. A. Fisher and others in the 1920s. Stats books of the time recommended it, and even when digital computers using floating point came along in the 60s, the old formula kept getting reprinted in the textbooks.
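As a footnote to the first comment above, here is an illustrative sketch of the two-pass formula it refers to, assuming the standard definition: compute the mean on the first pass, then sum the squared deviations from it on the second. Unlike the running algorithm, it needs all the data in memory at once.

    #include <vector>

    // Two-pass sample variance; assumes x has at least two elements.
    double two_pass_variance(const std::vector<double>& x)
    {
        double mean = 0.0;
        for (double xi : x) mean += xi;
        mean /= static_cast<double>(x.size());

        double ss = 0.0; // sum of squared deviations from the mean
        for (double xi : x) {
            double d = xi - mean;
            ss += d * d;
        }
        return ss / static_cast<double>(x.size() - 1);
    }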
