Joel Spolsky coined the term leaky abstraction for programming concepts that usually shield you from messy details but sometimes break down. A perfect abstraction is a black box that you never have to open. A leaky abstraction is a black box that you have to open up occasionally.

Floating point numbers, the computer representations of real numbers, are leaky abstractions. They work remarkably well: you can usually pretend that a floating point type is a mathematical real number. But sometimes you can’t. The abstraction leaks, though not very often.

Most explanations I’ve heard for the limitations of machine numbers are pedantic. “There are only a finite number of floating point numbers so they can’t represent real numbers well.” That’s not much help. It doesn’t explain why floating point numbers actually *do* represent real numbers sufficiently well for most applications, and it doesn’t suggest where the abstraction might leak.

A standard floating point number has roughly 16 decimal places of precision and a maximum value on the order of 10^{308}, a 1 followed by 308 zeros. (According to IEEE standard 754, the typical floating point implementation.)

Sixteen decimal places is a lot. Hardly any measured quantity is known to anywhere near that much precision. For example, the constant in Newton’s Law of Gravity is only known to six significant figures. The charge of an electron is known to 11 significant figures, much more precision than Newton’s gravitational constant, but still less than a floating point number. **So when are 16 figures not enough**? One problem area is subtraction. The other elementary operations — addition, multiplication, division — are very accurate. As long as you don’t overflow or underflow, these operations often produce results that are correct to the last bit. But subtraction can be anywhere from exact to completely inaccurate. If two numbers agree to n figures, you can lose up to n figures of precision in their subtraction. This problem can show up unexpectedly in the middle of other calculations. For an example, see this post on calculating standard deviation. See also computing derivatives, the second example in Five Tips for Floating Point Programming.

What about overflow or underflow? **When do you need numbers bigger than 10 ^{308}**? Often you don’t. But in probability calculations, for example, you need them all the time unless you’re clever. It’s common in probability to compute a medium-sized number that is the product of an astronomically large number and an infinitesimally small number. The final result fits into a computer just fine, but the intermediate numbers might not due to overflow or underflow. For example, the maximum floating point number on most computers is somewhere between 170 factorial and 171 factorial. Such large factorials often appear in applications, often in ratios with other large factorials. See Avoiding Overflow, Underflow, and Loss of Precision for tricks on how to work with factorials that would overflow if computed directly.

Often you can afford to be blissfully ignorant of the details of floating point arithmetic, but sometimes you cannot. A great place to learn more is David Goldberg’s paper What Every Computer Scientist Should Know About Floating-Point Arithmetic.

**Update**: See follow-up post, Anatomy of a floating point number.

Nice post. This is a huge problem in sums too, especially when the addends are different by many orders of magnitude.

Don’t forget a Bayesian classifier — it involves multiplying large numbers of small probabilities, then normalizing them. After getting strange results, and switching to BigDecimal, I learned that the usual solution is to do all calculations in log space.

Did anybody else notice the error in Figure D-1 of David Goldberg’s paper (http://docs.sun.com/source/806-3568/ncg_goldberg.html#1374) ? The “0” should be at the left of the number line, not at the hash mark corresponding to 1/2.

Not to mention the problems that arise because we think in decimal but compute in binary. Does 0.1 + 0.2 = 0.3? No, not in binary floats, it doesn’t.

This also ties in with your post a while back on accurately calculation the variance.

David Eisner: I updated the links to point to an updated version of Goldberg’s paper. Thanks for pointing out the error.

I disagree with your idea that floating-point is supposed to represent “real numbers”. They represent rational numbers.

If you want to approximate reals with rationals, that’s your choice, but you should at least understand the difference, and why you can get into trouble thereby.

Jack: Every floating point number does represent a rational number, so I could see your point. But I’d still say that floating point numbers are trying to approximate real numbers. Arithmetic operations are designed to be approximately correct in the analytical sense of the real line, not exactly correct in the algebraic sense of a quotient field. I think fixed-point numbers are a better model for rational numbers.

I’m surprised to not see the “move to the logspace” basic trick… which is to compute exp(sum log x_i) in place of prod(x_i). Works great for probabilities – they tend to be from some distribution and then central limit works for you…

Also normalizing small (or large) numbers to sum up to, say, 1.0 is better accomplished by first subtracting the max of all the logarithms.

Tomas: I agree that the “move to the logspace” trick is very important. I talk about it here in the context of computing binomial probabilities.

And of course Kevin did talk about it first, my oversight.

Tom Ochs wrote a series of articles about exactly this sort of thing in “Computer Language” magazine some years ago.

When your errors are multiplicative, and you have choice in the order in which to perform operations, another “trick” is to arrange the computation so that the greatest norm operators are applied last. Here’s an example http://yaroslavvb.blogspot.com/2010/09/order-matters.html

Also, operations with floating point numbers are not pure functional on different machines/compilers.

There’s a whole discipline — numerical computation — about calculating with floating point numbers efficiently and maximizing both accuracy and precision. A lot of the hard work is done with pencil and paper analysis before any code is written. The techniques discussed above are just a few of the things done in numerical analysis.

Many links to sun.com are now useless…including yours to What Every Computer Scientist Should Know About Floating-Point Arithmetic (corrected).

Thanks! I’ve updated the link with the Oracle one.

Hi John, I disagree with your statement “… problem area is subtraction… addition, multiplication, division — are very accurate.”

Subtraction and addition are equivalent (we can “flip the sign” of operands) a – b a + (-b). To illustrate, in decimal, not normalised, 6 digits precision:

3.14159 minus 0.00159265 is 3.14000 while

3.14000 plus 0.00159265 is 3.14159.

Both operations lose 0.00000265 and would appear (to my intuition) in general to have similar leak-potential.

Do I have a point, or have I missed yours?

When I say subtraction is problematic and addition is not, I have in mind the elementary ideas of these operations. For example, here I’d call 7 – 5 or 7 + (-5) a subtraction.

Another way to say it is that adding numbers of the same sign does not cause a loss of precision, but adding numbers of opposite sign may.

“Hardly any measured quantity is known to anywhere near that much precision.”

The obvious trouble spots, then, are non-measured quantities. Not every number comes from a ruler.

It’s not hard at all to state a math problem in a few words which can easily be solved by a high school student, but not by a computer which relies on 64-bit arithmetic.

I know its an old topic but it still hasn’t a general solution yet.

Physical constants aren’t know to 16 decimal places and their value are oftenly not exactly representable in floating point (even 64 bit large ones)

It would seem that since, as you say, physical constants aren’t known to double precision that double precision should be enough for everything. But you often need more precision in intermediate results to produce output with roughly the same precision as input. Sometimes double precision in intermediate results is not enough to produce single precision results.