Computing skewness and kurtosis in one pass

If you compute the standard deviation of a data set by directly implementing the definition, you’ll need to pass through the data twice: once to find the mean, then a second time to accumulate the squared differences from the mean. But there is an equivalent algorithm that requires only one pass and that is more accurate than the most direct method. You can find the code for implementing it here.

You can also compute the higher sample moments in one pass. I’ve extended the previous code to compute skewness and kurtosis in one pass as well.

The new code also lets you split your data, say to process it in parallel on different threads, and then combine the statistics, in the spirit of map-reduce.

Lastly, I’ve posted analogous code for simple linear regression.

7 thoughts on “Computing skewness and kurtosis in one pass

  1. BTW, your C++ code has quite a bit of room for performance improvements. The RunningStats and RunningRegression objects are 40 and 96 bytes each, and their operator+ takes both parameters by-value, rather than by-reference-to-const. Furthermore, the canonical way to overload operators is to define operator+ in terms of operator+=, instead of the other way around. This will save at least one 40 or 96 byte assignment per observation. Such can be the benefits of using libraries :-)

  2. Why does RunningStat store both m_oldM and m_newM when, after initialization, they’re always the same between calls? Ditto m_old/newS?

  3. Comparing the results of the RunningStats skewness() and Kurtosis() functions on the http://www.johndcook.com/blog/skewness_kurtosis/ page gave different results to results in Excel, and Python’s Pandas DataFrame results.

    They use corrected sample rather than population formulas, so you could extend the code

    double RunningStats::SkewnessCS() const
    {
    // Calculate the population skewness:
    double g = sqrt(double(n)) * M3/ pow(M2, 1.5);
    // Return the corrected sample skewness:
    return sqrt( double(n*n-1))*g / double(n-2);
    }

    double RunningStats::KurtosisCS() const
    {
    // Calculate the population excess kurtosis
    double g = double(n)*M4 / (M2*M2) – 3.0;
    // Return the corrected sample excess kurtosis
    return double(n-1) / double( (n-2)*(n-3) ) * ( double(n+1)*g + 6 );
    }

Comments are closed.