The previous post looked at an example of strange floating point behavior taking from book End of Error. This post looks at another example.

This example, by Siegfried Rump, asks us to evaluate

333.75 *y*^{6} + *x*^{2} (11 *x*^{2} *y*^{2} − *y*^{6} − 121 *y*^{4} − 2) + 5.5 *y*^{8} + *x*/(2*y*)

at *x* = 77617 and *y* = 33096.

Here we evaluate Rump’s example in single, double, and quadruple precision.

#include <iostream> #include <quadmath.h> using namespace std; template <typename T> T f(T x, T y) { T x2 = x*x; T y2 = y*y; T y4 = y2*y2; T y6 = y2*y4; T y8 = y2*y6; return 333.75*y6 + x2*(11*x2*y2 - y6 - 121*y4 - 2) + 5.5*y8 + x/(2*y); } int main() { int x = 77617; int y = 33096; cout << f<float>(x, y) << endl; cout << f<double>(x, y) << endl; cout << (double) f<__float128>(x, y) << endl; }

This gives three answers,

-1.85901e+030 -1.18059e+021 1.1726

none of which is remotely accurate. The exact answer is −54767/66192 = −0.827…

Python gives the same result as C++ with double precision, which isn’t surprising since Python floating point numbers are C doubles under the hood.

Where does the problem come from? The usual suspect: subtracting nearly equal numbers. Let’s split Rump’s expression into two parts.

*s* = 333.75 *y*^{6} + *x*^{2}(11*x*^{2}*y*^{2} − *y*^{6} − 121*y*^{4} − 2)

t = 5.5*y*^{8} + *x*/(2*y*)

and look at them separately. We get

s = -7917111340668961361101134701524942850.00 t = 7917111340668961361101134701524942849.1726...

The values −*s* and *t* agree to 36 figures, but quadruple precision has only 34 figures [1]. Subtracting these two numbers results in catastrophic loss of precision.

Rump went to some effort to conceal what’s going on, and the example is contrived to require just a little more than quad precision. However, his example illustrates things that come up in practice. For one, the values of *x* and *y* are not that big, and so we could be mislead into thinking the terms in the polynomial are not that big. For another, it’s not clear that you’re subtracting nearly equal numbers. I’ve been caught off guard by both of these in real applications.

Floating point numbers are a leaky abstraction. They can trip you up if you don’t know what you’re doing. But the same can be said of everything in computing. We can often ignore finite precision just as we can often ignore finite memory, for example. The proper attitude toward floating point numbers is somewhere between blissful ignorance and unnecessary fear.

## Similar posts

[1] A quad precision number has 128 bits: 1 sign bit, 15 for exponents, and 112 for the fractional part. Because the leading zero is implicit, this gives a quad 113 bits of precision. Since log_{10}(2^{113}) = 34.01…, this means a quad has 34 decimals of precision.

I love these posts. Keep them coming.

Hmm. That expression is simple enough that, if you had an arbitrary precision integer library*, you could calculate the result exactly. (You’d need to multiply by 100 to make the coefficients integers and do something sensible with the x/2y term.)

Of course, that lacks generality something fierce, but generality wasn’t the question. And the intermediate results wouldn’t be all that big, 30 digits or so.

*: Memory has it that certain premodern LISP implementations (e.g. MACLISP) did have infinite precision integer arithmetic, so I’d think that one of the zillions of modern languages would too…

Hi John,

There is a tool called Herbie that can be used to detect when things may go awry with floating point expressions. Have you try to see if some of these examples with it?

https://herbie.uwplse.org/

There’s an easy way to do this in Smalltalk : just use fractions to keep the necessary precision and end up with the exact result! My blog post: https://endormitoire.wordpress.com/2019/11/24/smalltalk-et-les-nombres/

Cool example, which tricks one to think double or even long double precision is enough, but it isn’t!

So, I tried this example using multiple-precision GNU MP library:

mpf_class mp_x(77617);

mpf_class mp_y(33096);

std::cout << f(mp_x, mp_y) << std::endl;

And it outputs the correct answer: -0.827396060.

Thanks for your post!