3 thoughts on “Five tips for floating point programming”

  1. Nice article! At first glance I thought you had missed the problem of summing floating point values of very different magnitude, but on re-reading I saw your example.

    I’ve run into that problem primarily in calculating series in which the terms can be small relative to the total. It is of course possible to lose precision that way, but I think more often it simply wastes CPU cycles. Either way it is best avoided.

    For example, if the terms are positive and strictly decreasing, there comes a point where the terms are so small that they fall below the precision of the total. At that point (and probably before then), simply adding each term to the total either loses precision or wastes time. You should either test the terms against a threshold to decide when to quit (to avoid unnecessary work) or accumulate runs of terms in sub-totals before adding them to the overall total. The latter strategy lets terms of similar (but relatively small) magnitude be summed without losing precision until their sub-total is large enough, relative to the overall total, to be added without loss. If you’re really concerned, this strategy can of course be nested.

    Likewise, the terms of a series often begin quite small, increase to some maximum, then decrease again. For example, think about terms which involve a Poisson distribution with a large mean. Often the terms at the beginning are small enough that they may be safely ignored, but this can be hard to determine at the outset. A better strategy is to start near the maximum of the terms and then sum outwards in both directions. Note that you don’t need to calculate the maximum exactly to get a good starting point.
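
The two ideas in this comment (sub-totals for small terms, and summing outward from the largest term until the terms drop below a threshold) can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names `add_in_blocks` and `poisson_expectation`, the `block_size` and `rel_tol` parameters, and the choice to stop once a Poisson weight falls below `rel_tol` times the weight at the starting point are all hypothetical and do not come from the comment.

```python
import math


def add_in_blocks(total, terms, block_size=256):
    """Add many small terms to a large running total via sub-totals.

    Hypothetical helper: each run of block_size terms is summed on its
    own, so the small terms combine with each other before they meet
    the much larger total.
    """
    subtotal = 0.0
    count = 0
    for term in terms:
        subtotal += term
        count += 1
        if count == block_size:
            total += subtotal
            subtotal, count = 0.0, 0
    return total + subtotal  # add whatever is left in the last block


def poisson_expectation(g, mean, rel_tol=1e-17):
    """Approximate E[g(K)] for K ~ Poisson(mean), summing outward.

    Assumes mean > 0 and that g is bounded or grows slowly, so that the
    Poisson weight alone is a reasonable stopping test.
    """
    mode = int(mean)  # near the largest term; it need not be exact
    # Compute the starting weight in log space to avoid overflow.
    log_p = mode * math.log(mean) - mean - math.lgamma(mode + 1)
    p_mode = math.exp(log_p)
    cutoff = rel_tol * p_mode

    # Upper wing: pmf(k + 1) = pmf(k) * mean / (k + 1).
    upper = 0.0
    p, k = p_mode, mode
    while p >= cutoff:
        k += 1
        p *= mean / k
        upper += p * g(k)

    # Lower wing: pmf(k - 1) = pmf(k) * k / mean.
    lower = 0.0
    p, k = p_mode, mode
    while k > 0 and p >= cutoff:
        p *= k / mean
        k -= 1
        lower += p * g(k)

    # Each wing is its own sub-total, combined only at the end.
    return p_mode * g(mode) + upper + lower
```

As a quick check of the first helper, adding `1.0` to `1e16` one term at a time leaves the total unchanged in double precision, whereas `add_in_blocks(1e16, [1.0] * 1024)` keeps all 1024 units. For the second, `poisson_expectation(lambda k: 1.0, 500.0)` should come out very close to 1.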

  2. Your example of precision in simple numerical operations is much more important than many realize. I think John Venier summed it up nicely in his post, as it’s a nice analogy to Taylor’s Theorem with Remainder.

    For instance, if you are using an iterative method to search for f(x) = 0, the way you handle the subtraction as you “bounce off” the precision limits of the floating point library or hardware may determine whether you converge to a solution or oscillate between two positions on the graph (see Numerical Analysis, James Ortega, ISBN 0-12-528560-4).
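
One way to picture the convergence point above is a root finder whose stopping test is tied to the floating point spacing itself rather than to a fixed tolerance. The sketch below uses bisection, which is an assumption made here for illustration (the comment does not name a method), and `bisect_root` and `max_iter` are hypothetical names. It stops as soon as the midpoint can no longer be distinguished from an endpoint, so it cannot keep bouncing between two adjacent values.

```python
def bisect_root(f, lo, hi, max_iter=200):
    """Find a root of f in [lo, hi], assuming f(lo) and f(hi) differ in sign."""
    f_lo, f_hi = f(lo), f(hi)
    if f_lo == 0.0:
        return lo
    if f_hi == 0.0:
        return hi
    if (f_lo > 0) == (f_hi > 0):
        raise ValueError("f(lo) and f(hi) must have opposite signs")

    for _ in range(max_iter):
        # lo + (hi - lo) / 2 stays inside [lo, hi] even after rounding.
        mid = lo + (hi - lo) / 2
        if mid == lo or mid == hi:
            break  # lo and hi are adjacent floats; no further progress is possible
        f_mid = f(mid)
        if f_mid == 0.0:
            return mid
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid  # root lies in the upper half
        else:
            hi = mid  # root lies in the lower half
    return lo + (hi - lo) / 2
```

A test such as `abs(hi - lo) < 1e-20` would never be satisfied for a root near 1, because neighbouring doubles there are about 2.2e-16 apart; checking whether the midpoint has collapsed onto an endpoint avoids having to pick such a tolerance.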
