Iterative calculation of lies, er, stats

A few weeks ago I wrote about my adventures in cross-platform floating point consistency in Avida. The main routine responsible for the deviation at the time was a statistics output function that calculated the skewness and kurtosis of various properties. My first thought was that perhaps the compilers were reordering operations differently, leading to rounding variations. Ultimately I worked around the problem (see how), but in the process I took a good long look at the skewness and kurtosis methods.

The class in question, cDoubleSum, is intended to be a lightweight iterative/running statistics utility. Values can be added to the collection as they are observed, and the statistical properties, such as variance, skewness, and kurtosis, are updated accordingly. The implementation does this by keeping running tallies of the sum of values, the sum of squared values, the sum of cubed values, and the sum of fourth powers. These sums are then used to calculate the desired statistics. The Add method looks like this:

void Add(double value, double weight = 1.0)
{
  double w_val = value * weight;
  n += weight;
  s1 += w_val;
  s2 += w_val * w_val;
  s3 += w_val * w_val * w_val;
  s4 += w_val * w_val * w_val * w_val;
}

The skewness and kurtosis methods are implemented as follows:

double cDoubleSum::Skewness() const
{
  return (n > 2.0) ?
    (n * s3 - (3.0 * s2) * s1 + ((2.0 * s1) * s1) * s1 / n) /
      ((n - 1.0) * (n - 2.0))
    : INF_ERR;
}

double cDoubleSum::Kurtosis() const
{
  return (n > 3.0) ?
    (n + 1.0) * (n * s4 - (4.0 * s3) * s1 + (((6.0 * s2) * s1) * s1) /
      n - (((((3.0 * s1) * s1) * s1) / n) * s1) / n) /
      ((n - 1.0) * (n - 2.0) * (n - 3.0))
    : INF_ERR;
}

Since the code has existed for quite a few years, its exact lineage is unknown. I have done extensive searching to try to understand how these equations work and to find where they came from. When I asked around, it was suggested that the source might have been “Numerical Recipes in C”. However, the book only describes the basic two-pass method, which requires that all values exist in memory during calculation. Even worse, the equations used by cDoubleSum do not seem to calculate correct values.

After staring at the code and at many versions of the equations, I decided it would be better to go ahead and implement my own version of the class. The result was cRunningStats.

class cRunningStats
{
private:
  double m_n;  // count
  double m_m1; // mean
  double m_m2; // second moment about the mean
  double m_m3; // third moment about the mean
  double m_m4; // fourth moment about the mean

public:
  cRunningStats() : m_n(0.0), m_m1(0.0), m_m2(0.0), m_m3(0.0), m_m4(0.0) { ; }

  void Push(double x)
  {
    m_n += 1.0;
    double d = (x - m_m1);
    double d_n = d / m_n;
    double d_n2 = d_n * d_n;
    m_m4 += d * d_n2 * d_n * ((m_n - 1) * ((m_n * m_n) - 3 * m_n + 3)) +
            6 * d_n2 * m_m2 - 4 * d_n * m_m3;
    m_m3 += d * d_n2 * ((m_n - 1) * (m_n - 2)) - 3 * d_n * m_m2;
    m_m2 += d * d_n * (m_n - 1);
    m_m1 += d_n;
  }

  double Mean() { return m_m1; }
  double StdDeviation() { return sqrt(Variance()); }
  double StdError() { return (m_n > 1.0) ? sqrt(Variance() / m_n) : 0.0; }
  double Variance() { return (m_n > 1.0) ? (m_m2 / (m_n - 1.0)) : 0.0; }
  double Skewness() { return sqrt(m_n) * m_m3 / pow(m_m2, 1.5); }
  double Kurtosis() { return m_n * m_m4 / (m_m2 * m_m2); }
};

I have examined various primary sources, but the Wikipedia page "Algorithms for calculating variance" is a good starting place if you are interested in where the math comes from. The methods used should be numerically stable (i.e., not prone to severe loss of precision), reasonably high performance (only a single divide operation per update), and, best of all, return correct values. There are a number of good pages on the web that describe pieces of this problem, including some that provide details on how such equations can be used to parallelize variance calculation. Still, I thought it would be useful to have a nice, concise presentation in C++ code. Note that there are a few more methods, inline, and const qualifiers in the actual Avida implementation, but the math is the same.

A quick performance comparison has cRunningStats taking about 20% more time than cDoubleSum on a set of 100M random numbers (same random number seed). The correctness tradeoff is, of course, well worth the cost. However, if anyone knows how to make the cDoubleSum methods work correctly, I’d love to know the answer. As things stand now, I have simply removed the bad code.

License Note: Feel free to use cRunningStats code as public domain. An in source code attribution comment with a referral URL would be nice, but is not required. The full class is available in Avida, licensed under the GPL.