Compute skewness and kurtosis in one pass

If you compute the standard deviation of a data set by directly implementing the definition, you’ll need to pass through the data twice: once to find the mean, then a second time to accumulate the squared differences from the mean. But there is an equivalent algorithm that requires only one pass and that is more accurate than the most direct method. You can find the code for implementing it here.

You can also compute the higher sample moments in one pass. I’ve extended the previous code to compute skewness and kurtosis in one pass as well.

The new code also lets you split your data, say to process it in parallel on different threads, and then combine the statistics, in the spirit of map-reduce.

Lastly, I’ve posted analogous code for simple linear regression.

7 thoughts on “Computing skewness and kurtosis in one pass”

Rein Halbersma

11 June 2013 at 07:00

Boost.Accumulators does this for a whole bunch of other statistics.

Rein Halbersma

11 June 2013 at 07:19

BTW, your C++ code has quite a bit of room for performance improvements. The RunningStats and RunningRegression objects are 40 and 96 bytes each, and their operator+ takes both parameters by-value, rather than by-reference-to-const. Furthermore, the canonical way to overload operators is to define operator+ in terms of operator+=, instead of the other way around. This will save at least one 40 or 96 byte assignment per observation. Such can be the benefits of using libraries :-)

Ed Davies

11 June 2013 at 11:55

Why does RunningStat store both m_oldM and m_newM when, after initialization, they’re always the same between calls? Ditto m_old/newS?

Krzysztof Sakrejda

11 June 2013 at 14:29

My new strategy for optimizing code is going to be to post it online as bait for C++ gurus on break.

Steven H. Noble

11 June 2013 at 16:43

So beyond streaming there are also distributed formulas for moments. This is how the moments group works for algebird (see: https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/MomentsGroup.scala)

James

6 November 2015 at 05:27

Comparing the results of the RunningStats skewness() and Kurtosis() functions on the http://www.johndcook.com/blog/skewness_kurtosis/ page gave different results to results in Excel, and Python’s Pandas DataFrame results.

They use corrected sample rather than population formulas, so you could extend the code

double RunningStats::SkewnessCS() const
{
// Calculate the population skewness:
double g = sqrt(double(n)) * M3/ pow(M2, 1.5);
// Return the corrected sample skewness:
return sqrt( double(n*n-1))*g / double(n-2);
}

double RunningStats::KurtosisCS() const
{
// Calculate the population excess kurtosis
double g = double(n)*M4 / (M2*M2) – 3.0;
// Return the corrected sample excess kurtosis
return double(n-1) / double( (n-2)*(n-3) ) * ( double(n+1)*g + 6 );
}

Caio Hamamura

29 January 2021 at 17:08

Thank you James for correcting the metrics to sample instead of population!

Comments are closed.