Summing an array of floating point numbers

Adding up an array of numbers is simple: just loop over the array, adding each element to an accumulator.

Usually that’s good enough, but sometimes you need more precision, either because your application need a high-precision answer, or because your problem is poorly conditioned and you need a moderate-precision answer [1].

There are many algorithms for summing an array, which may be surprising if you haven’t thought about this before. This post will look at Kahan’s algorithm, one of the first non-obvious algorithms for summing a list of floating point numbers.

We will sum a list of numbers a couple ways using C’s float type, a 32-bit integer. Some might object that this is artificial. Does anyone use floats any more? Didn’t moving from float to double get rid of all our precision problems? No and no.

In fact, the use of 32-bit float types is on the rise. Moving data in an out of memory is now the bottleneck, not arithmetic operations per se, and float lets you pack more data into a giving volume of memory. Not only that, there is growing interest in floating point types even smaller than 32-bit floats. See the following posts for examples:

And while sometimes you need less precision than 32 bits, sometimes you need more precision than 64 bits. See, for example, this post.

Now for some code. Here is the obvious way to sum an array in C:

    float naive(float x[], int N) {
        float s = 0.0;
        for (int i = 0; i < N; i++) {
            s += x[i];
        }
        return s;
    }

Kahan’s algorithm is not much more complicated, but it looks a little mysterious and provides more accuracy.

    float kahan(float x[], int N) {
        float s = x[0];
        float c = 0.0;
        for (int i = 1; i < N; i++) {
            float y = x[i] - c;
            float t = s + y;
            c = (t - s) - y;
            s = t;
        }
        return s;
    }

Now to compare the accuracy of these two methods, we’ll use a third method that uses a 64-bit accumulator. We will assume its value is accurate to at least 32 bits and use it as our exact result.

    double dsum(float x[], int N) {
        double s = 0.0;
        for (int i = 0; i < N; i++) {
            s += x[i];
        }
        return s;
    }

For our test problem, we’ll use the reciprocals of the first 100,000 positive integers rounded to 32 bits as our input. This problem was taken from [2].

    #include <stdio.h&gt:
    int main() {
        int N = 100000;
        float x[N];
        for (int i = 1; i <= N; i++)
            x[i-1] = 1.0/i;
  
        printf("%.15g\n", naive(x, N));
        printf("%.15g\n", kahan(x, N));  
        printf("%.15g\n", dsum(x, N));
    }

This produces the following output.

    12.0908508300781
    12.0901460647583
    12.0901461953972

The naive method is correct to 6 significant figures while Kahan’s method is correct to 9 significant figures. The error in the former is 5394 times larger. In this example, Kahan’s method is as accurate as possible using 32-bit numbers.

Note that in our example we were not exactly trying to find the Nth harmonic number, the sum of the reciprocals of the first N positive numbers. Instead, we were using these integer reciprocals rounded to 32 bits as our input. Think of them as just data. The data happened to be produced by computing reciprocals and rounding the result to 32-bit floating point accuracy.

But what if we did want to compute the 100,000th harmonic number? There are methods for computing harmonic numbers directly, as described here.

Related posts

[1] The condition number of a problem is basically a measure of how much floating point errors are multiplied in the process of computing a result. A well-conditioned problem has a small condition number, and a ill-conditioned problem has a large condition number. With ill-conditioned problems, you may have to use extra precision in your intermediate calculations to get a result that’s accurate to ordinary precision.

[2] Handbook of Floating-Point Arithmetic by Jen-Michel Muller et al.

6 thoughts on “Summing an array of floating point numbers

  1. I reformatted the code when I pasted it into the blog post. Maybe that’s when the return got lost. Anyway, thanks. I just fixed it.

  2. My first thought was to sort the data from smallest to largest and that gave 12.0901527404785, 12.0901460647583, and 12.0901461953972.

  3. XML character entity error in the test listing (include statement).

    More importantly, the naive method is accurate to five digits (not six), and Kahan’s method is accurate to seven (not nine). Which is reassuring, since floats have at most seven decimal digits of precision.

  4. How does kahan compare to dsum in terms of run time? Naively I would expect 4 operations on 16-bit numbers to take longer than 1 operation on 32-bit numbers, but perhaps the conversion itself is more expensive? Or does it depend on the specific language and/or hardware implementation?

  5. @Nathan: Kahan’s algorithm would normally be used to get more accuracy out of the available precision. I wrote my example as if only floats were available, then cheated to get a gold standard value to compare to. If you have doubles, you’d likely use dsum, and if that’s not accurate enough, you could use Kahan’s algorithm using doubles.

Leave a Reply

Your email address will not be published. Required fields are marked *