Most programmers are at one extreme or another when it comes to floating point arithmetic. Some are blissfully ignorant that anything can go wrong, while others believe that danger lurks around every corner when using floating point.

The limitations of floating point arithmetic are something to be aware of, and ignoring these limitations can cause problems, like crashing airplanes. On the other hand, floating point arithmetic is usually far more reliable than the application it is being used in.

It’s well known that if you multiply two floating point numbers, *a* and *b*, then divide by *b*, you might not get exactly *a* back. Your result might differ from *a* by one part in ten quadrillion (10^16). To put this in perspective, suppose you have a rectangle approximately one kilometer on each side. Someone tells you the exact area and the exact length of one side. If you solve for the length of the missing side using standard (IEEE 754) floating point arithmetic, your result could be off by as much as the width of a helium atom.
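As a rough illustration of how small this effect is, here is a minimal Python sketch that multiplies by *b*, divides by *b*, and compares the result with *a*. The function name `roundtrip_stats`, the sampling range, and the seed are assumptions chosen for the example, not anything from the discussion above.

```python
import random

def roundtrip_stats(trials=10_000, seed=0):
    """Multiply a by b, divide by b, and compare the result with a."""
    rng = random.Random(seed)
    mismatches = 0
    worst = 0.0  # largest relative error seen so far
    for _ in range(trials):
        a = rng.uniform(1.0, 1000.0)
        b = rng.uniform(1.0, 1000.0)
        back = (a * b) / b
        if back != a:
            mismatches += 1
            worst = max(worst, abs(back - a) / a)
    return mismatches, worst

mismatches, worst = roundtrip_stats()
print(f"{mismatches} of 10000 round trips were inexact")
print(f"worst relative error: {worst:.2e}")
```

Whatever fraction of round trips comes back inexact, the worst relative error should stay at or below about 2.2e-16, which is the spacing of double precision numbers just above 1.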

Most of the problems attributed to “round off error” are actually approximation error. As Nick Trefethen put it,

If rounding errors vanished, 90% of numerical analysis would remain.

Still, despite the extreme precision of floating point, in some contexts rounding error is a practical problem. You have to learn in what context floating point precision matters and how to deal with its limitations. **This is no different than anything else in computing**.

For example, most of the time you can imagine that your computer has an unlimited amount of memory, though sometimes you can’t. It’s common for a computer to have enough memory to hold the entire text of Wikipedia (currently around 12 gigabytes). This is usually far more memory than you need, and yet for some problems it’s not enough.


The main complication with floating point is that comparing for equality becomes more complicated. You can’t rely on the “==” operator, you shouldn’t expect the number zero to pop up, and often you need to put some thought into what you’re really intending to do when you check for “equality.” Where did these numbers come from? How big is a negligible error?

I’d say the main complication is addition and subtraction. If two numbers agree to N bits, you can lose N bits of precision in taking their difference.
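A small Python sketch of that loss, assuming ordinary IEEE 754 doubles. The numbers 1 + x and 1 agree in roughly their leading 26 bits when x is 1e-8, so their difference keeps only about 8 of the usual 16 significant decimal digits. The variable names are just for illustration.

```python
x = 1e-8
big = 1.0 + x                # rounded to the nearest double near 1.0
diff = big - 1.0             # this subtraction is exact ...
rel_err = abs(diff - x) / x  # ... but it exposes the rounding in the line above
print(diff)
print(rel_err)
```

The subtraction itself introduces no new error. It only promotes the rounding error already present in `big` from the 16th digit to around the 8th.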

I have no confirmation of this, but Alex Stepanov mentioned that von Neumann was against the use of floating point (I guess he advocated manual, algorithmically intelligent scaling of fixed point arithmetic). I am slowly converting to this position: floating point is such a messy and arbitrary representation that for many problems it is easier to fall back to dividing the problem into different scales. Unfortunately hardware (in its arbitrariness) encourages the opposite. There is no easy solution. Floating point is an illusion of an easy solution until it doesn’t work, and then you have to use dark magic (the evidence is that most implementations of numerical algorithms contain magic constants at some point, like “epsilon” or “verysmall” or “tol”).

I’d say there are two parts to this. One is just knowing that floating point numbers are inexact. 0.1 + 0.2 does not give exactly 0.3. The main issue here is that you can’t use direct equality comparison, as Michael Abrahams noted.
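For concreteness, here is the 0.1 + 0.2 case in Python, along with one common alternative to direct comparison. `math.isclose` is in the standard library; using its default relative tolerance here is an assumption for the example, and the right tolerance depends on where the numbers came from.

```python
import math

total = 0.1 + 0.2
print(total == 0.3)              # False
print(math.isclose(total, 0.3))  # True, using the default rel_tol of 1e-09
```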

The second is that you can get loss of significance, which can be dramatic in the worst cases. This is where the fear comes in. Most people don’t have a full understanding of how or when this happens (it’s generally in subtraction, or equivalently, in addition of terms with opposite signs). And understanding how to detect it and how to properly restructure a computation to avoid it can often be a research topic on its own.
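One textbook instance of restructuring, sketched in Python: computing 1 - cos(x) for small x. The rewrite uses the identity 1 - cos(x) = 2 sin²(x/2), which contains no subtraction of nearly equal values. The function names are hypothetical, and this is only an illustration of the idea, not a general recipe.

```python
import math

def one_minus_cos_naive(x):
    # cos(x) rounds to 1.0 for tiny x, so the subtraction cancels everything.
    return 1.0 - math.cos(x)

def one_minus_cos_stable(x):
    # Same quantity via 1 - cos(x) = 2 * sin(x/2)**2, with no cancellation.
    s = math.sin(0.5 * x)
    return 2.0 * s * s

x = 1e-9
print(one_minus_cos_naive(x))
print(one_minus_cos_stable(x))
```

For x = 1e-9 the true value is very close to x²/2 = 5e-19. The naive form loses every significant digit, while the rewritten form keeps nearly all of them.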

So there are good reasons to fear the floating point, but it’s not *just* because of its inexactness, as so many people seem to think. In fact, most issues with cancellation go away if you use arbitrary precision floating point numbers, which are still inexact (these are less performant and more difficult to implement, though, so it’s not a free ride).

By the way, there are cases in floating point where you can indeed rely on the == operator. If the results stay within 32-bit integer range (nothing overflows), you can do addition, subtraction, and multiplication of 32-bit integers using 64-bit floating point numbers and get the same answer you would have gotten with integer arithmetic.
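A quick Python check of this claim, as a sketch. The helper name `float_matches_int` and the operand limit are assumptions for the example; the limit is chosen so that every product still fits in a signed 32-bit integer.

```python
import random

def float_matches_int(a, b):
    """Compare +, -, * done in doubles against exact integer arithmetic."""
    fa, fb = float(a), float(b)
    return (fa + fb == a + b) and (fa - fb == a - b) and (fa * fb == a * b)

rng = random.Random(0)
limit = 46_340  # 46340**2 < 2**31, so products stay in signed 32-bit range
all_exact = all(
    float_matches_int(rng.randint(-limit, limit), rng.randint(-limit, limit))
    for _ in range(10_000)
)
print(all_exact)  # True

# Exactness ends once an integer needs more than 53 bits:
print(float(2**53 + 1) == 2**53)  # True, because the +1 is rounded away
```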

I suggest adding a link to what I’d consider the definitive guide, “What Every Computer Scientist Should Know About Floating-Point Arithmetic”:

https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

I wonder why there isn’t much more research on alternative models like interval arithmetic or the several models of RCF. They may be slower and harder to implement in hardware, but I’m not sure whether that is just due to a lack of effort put into optimizing them.

John, I think that your advice “You have to learn in what context floating point precision matters and how to deal with its limitations. This is no different than anything else in computing.” is great advice, even if it is very general.

I would love to read real-life production code where the application of, say, Kahan summation avoids the crash of an airplane.
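This is not production code, but for readers who have not seen it, here is a minimal Python sketch of Kahan (compensated) summation next to a plain loop. The function names and the test data are assumptions for illustration. `math.fsum` from the standard library serves as the accurate reference.

```python
import math

def naive_sum(values):
    total = 0.0
    for v in values:
        total += v
    return total

def kahan_sum(values):
    total = 0.0
    comp = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = v - comp            # apply the correction to the next term
        t = total + y           # low-order bits of y may be lost here
        comp = (t - total) - y  # recover what was just lost
        total = t
    return total

values = [0.1] * 1_000_000
exact = math.fsum(values)
print(abs(naive_sum(values) - exact))
print(abs(kahan_sum(values) - exact))
```

The plain loop is written out by hand because the built-in `sum` may itself use compensated summation in recent Python versions.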

To me, the advice in one of the comments above, to read “What Every Computer Scientist Should Know About Floating-Point Arithmetic”, is useless… I have tried to read it with no luck, and I am afraid its audience is someone who already knows *a lot* about floating point computation.

Instead, I have found this book more useful:

Overton, Michael L. Numerical Computing with IEEE Floating Point Arithmetic. SIAM, 2001.

To conclude, some very concise advice from Bjarne Stroustrup (4.5 Floating-Point Types [dcl.float]) to C++ programmers:

The exact meaning of single-, double-, and extended-precision is implementation-defined.

Choosing the right precision for a problem where the choice matters requires significant understanding of floating-point computation.

If you don’t have that understanding, get advice, take the time to learn, or use double and hope for the best.

Others have already pointed to arithmetic comparison and loss of precision as two primary areas of difficulty for many programmers. I’d add a third: understanding the inherently logarithmic representation of numbers, particularly when mixed with linear representations.

This shows up when programmers understand the variance of arithmetic results and the difficulty of comparing them, but then reach for a fixed epsilon value as a solution. _That_ mistake is one of trying to use a fixed value in an inherently logarithmic space.
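A small Python sketch of how a fixed epsilon goes wrong at both ends of the scale. The constant `EPS` and the helper `fixed_eps_equal` are hypothetical names, and the relative comparison uses the standard library’s `math.isclose` with its default tolerance.

```python
import math

EPS = 1e-9  # a fixed absolute tolerance, chosen arbitrarily for illustration

def fixed_eps_equal(a, b):
    return abs(a - b) < EPS

big_a = 1e20
big_b = big_a + 16384.0        # the very next double after 1e20
tiny_a, tiny_b = 1e-12, 2e-12  # these differ by a factor of two

print(fixed_eps_equal(big_a, big_b), math.isclose(big_a, big_b))      # False True
print(fixed_eps_equal(tiny_a, tiny_b), math.isclose(tiny_a, tiny_b))  # True False
```

The fixed tolerance rejects two adjacent doubles near 1e20 and accepts two tiny values that differ by 100%. A relative tolerance scales with the magnitude of the operands, though it needs separate care for values near zero.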

Not of primary concern, but often secondary: programmers need to lose their fear of infinite values and of division by zero. I’ve worked on many problems where infinite values collapse computation down to an elegant solution and eliminate a weedy mess of predicate logic. One example is computing boolean operations on axis-aligned bounding boxes (multi-dimensional ranges).
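Here is one way the bounding box idea might look in Python, as a sketch rather than a description of any particular codebase. The `Box` type and the function names are assumptions. The empty box is represented with lower bounds of +inf and upper bounds of -inf, so union and intersection need no special cases.

```python
import math
from typing import NamedTuple, Tuple

class Box(NamedTuple):
    lo: Tuple[float, ...]  # lower corner, one coordinate per axis
    hi: Tuple[float, ...]  # upper corner, one coordinate per axis

def empty(dim):
    # Inverted infinite bounds: identity for hull, absorbing for intersect.
    return Box((math.inf,) * dim, (-math.inf,) * dim)

def everything(dim):
    # Whole space: identity for intersect.
    return Box((-math.inf,) * dim, (math.inf,) * dim)

def intersect(a, b):
    return Box(tuple(map(max, a.lo, b.lo)), tuple(map(min, a.hi, b.hi)))

def hull(a, b):
    # Smallest box containing both a and b.
    return Box(tuple(map(min, a.lo, b.lo)), tuple(map(max, a.hi, b.hi)))

def is_empty(box):
    return any(l > h for l, h in zip(box.lo, box.hi))

a = Box((0.0, 0.0), (2.0, 1.0))
b = Box((1.0, 0.5), (3.0, 4.0))
print(intersect(a, b))
print(hull(empty(2), a) == a)
print(is_empty(intersect(a, empty(2))))
```

Without infinities, each of these functions would need explicit branches for "no box yet" and "unbounded" cases.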