Floating point oddity

Here is an example fiendishly designed to point out what could go wrong with floating point arithmetic, even with high precision.

I found the following example by Jean-Michel Muller in John Gustafson’s book End of Error. The task is to evaluate the following function at 15, 16, 17, and 9999.

\begin{align*} e(x) &= \frac{\exp(x) - 1}{x} \\ q(x) &= \left| x - \sqrt{x^2 + 1}\right| - \frac{1}{x + \sqrt{x^2 + 1}} \\ h(x) &= e(q(x)^2) \\ \end{align*}

Here e(0) is defined by continuity to be 1. That is, we define e(0) to be the limit of e(x) as x approaches 0. That limit exists, and it equals 1, so we define e(0) to be 1.

If you directly implement the functions above in C++, you will get 0 as the output, whether you use single, double, or even quadruple precision as the following code shows. However, the correct answer in each case is 1.

#include <iostream>
#include <math.h>
#include <quadmath.h>

using namespace std;

template <typename T> T e(T x) {
  return x == 0. ? 1. : (exp(x) - 1.)/x;

template <typename T> T q(T x) {
  return fabs(x - sqrt(x*x + 1.)) - 1./(x + sqrt(x*x + 1.));

template <typename T> T h(T x) {
  return e( q(x)*q(x) );

int main() {

  int x[] = {15, 16, 17, 9999};

  for (int i = 0; i < 4; i++) {
      cout << h(     float(x[i]) ) << endl;
      cout << h(    double(x[i]) ) << endl;
      cout << h(__float128(x[i]) ) << endl;

A little algebra shows that the function q(x) would return 0 in exact arithmetic, but not in floating point arithmetic. It returns an extremely small but non-zero number, and the numerator of (exp(x) - 1.)/x evaluates to 0.

If q(x) returned exactly zero, h(x) would correctly return 1. Interestingly, if q(x) were a little less accurate, returning a little larger value when it should return 0, h would be more accurate, returning a value close to 1.

I tried replacing exp(x) - 1 with expm1(x). This made no difference. (More on expm1 here.)

Incidentally, bc -l gives the right result, no matter how small you set scale.

define ee(x) { if (x == 0) return 1 else return (e(x) - 1)/x }
define abs(x) { if (x > 0) return x else return -x }
define q(x) { return abs(x - sqrt(x^2 + 1)) - 1/(x + sqrt(x^2 + 1)) }
define h(x) { return ee( q(x)^2 ) }

Update: See the next post for another example, this time evaluating a polynomial.

Related posts

12 thoughts on “Floating point oddity

  1. Is exp() the problem? It’s going to take and return a double regardless of how wide of a float you give it and how wide of a float you interpret the result to be. On your system, does using expl() or an equivalent that uses quad-precision instead of extended-precision give the correct result?

  2. This post has been posted to Hacker News under the title “Floating Point C vs C++.” The reason is that initially I thought there was a difference between my C and C++ implementations, but that turned out to be an error on my part. After fixing the error, I changed the post title. But the post hit HN while I was editing. Sorry for any confusion.

  3. I tried your code on my setup and it produced 0 as you claimed, except that
    for whatever peculiar reason I got a “1” for the float case when x=9999.
    Also, when I tried expm1 replacement instead of exp it fixed the problem for all cases. Maybe it is something about your setup?

  4. My specific results may be particular to my setup, but the problem gives odd results on a wide variety of platforms according to Gustafson, and Muller before him.

    Gustafson says that William Kahan, contributor to and critic of the IEEE standard, popularized Muller’s example.

  5. Cleaning up the actual equations as you would in real life helps quite a bit. I don’t know the purpose of the post, but it is clear how to fix it.

  6. Hmmm. With Julia

    julia> using Quadmath
    [ Info: Precompiling Quadmath [be4d8f0f-7fa4-5f49-b795-2f01399ab2dd]

    julia> e(x) =(exp(x)-1)/x
    e (generic function with 1 method)

    julia> q(x)=abs(x – sqrt(x^2 -1)) – 1/(x + sqrt(x^2 + 1))
    q (generic function with 1 method)

    julia> h(x)=e(q(x)^2)
    h (generic function with 1 method)

    julia> h(15)

    julia> h(16)

    julia> h(17)

    julia> h(9999)

    julia> x=Float128(9999)

    julia> h(x)

    Basically, at certain arguments, your functions suffer from a catastrophic loss of precision. I recall worrying about this in grad school computing forward/backward difference numerical derivatives. Generally, when you subtract nearly equal quantities, such as exp(x)-1 near x->0, the whole q(x) function, you should expect that you are not going to have much precision left.

    This isn’t so much that floating point is “bad”, and arbitrary precision isn’t good either. Its more about understanding how floating point works, and why it really isn’t a 1 to 1 mapping of the reals, rationals, or other categories of numbers.

    I wrote something like this a long time ago ( https://scalability.org/2007/11/why-am-i-surprised-that-people-found-this-surprising/ ) about a very simple summation. One might point out that the sum I chose in this case was not convergent, and one needs to be especially careful when playing with non-convergent or conditionally convergent series numerically. I explored the same thing later with my Riemann Zeta Function tests (https://github.com/joelandman/rzf ).

  7. I’m not sure bc is doing what you’re supposing it’s doing: in my case it returns 1, but the ee() function used directly gives an error: e() does not exist. I wonder if a 0 or 1 is coming into play due to error, not due to calculation.

  8. @Ed: If the function e() doesn’t exist, you launched bc without the -l flag. You almost always want the -l flag, so it’s annoying that it’s not the default.

  9. @Alex: One lesson from the example is that it pays to clean up equations if you can, and that this can be more beneficial than extra precision. Another lesson is that in real applications, you might run into something analogous where you can’t make the problem go away with a little algebra.

  10. For whatever it’s worth, I tried the equivalent code in Matlab (for single, double anyway) and I get a slightly different answer (all 0’s on output) than the C++ result I got ( 1 on float, 0 in other cases).

    Using expm1 still fixed the problem for me, though (all 1’s on output).

  11. This reminds me of the bad-old-days of the late ’80’s, trying to fit high-level math into a slow (2 MHz) 8-bit embedded processor, and trying to generate results in real-time.

    On one project I had a scientist insist that 32-bit float was required for a given algorithm. I had to do days of work to show my hand-crafted fixed-point solution not only had a 10x performance gain, but did so with better precision (less precision loss).

    Turned out that the floating point precision on his PDP-11 was worse than the IEEE library I had for the 8-bit processor, meaning I was actually working against myself to have my fixed-point approach beat IEEE 32-bit floating point. I should have used his accuracy data for his PDP runs against that my 8-bit fixed-point runs.

  12. It’s somewhat surprising that switching to expm1 doesn’t fix this. For a small value of x, the e function (with exp – 1 replaced by expm1) seems like it should give something very close to 1 since expm1(x) = x + x^2/2… is very close to x, and dividing by x doesn’t lose any precision.

Leave a Reply

Your email address will not be published. Required fields are marked *