Anatomy of a floating point number

In my previous post, I explained that floating point numbers are a leaky abstraction. Often you can pretend that they are mathematical real numbers, but sometimes you cannot. This post peels back the abstraction and explains exactly what a floating point number is. (Technically, this post describes an IEEE 754 double precision floating point number, by far the most common kind of floating point number in practice.)

A floating point number has 64 bits that encode a number of the form ± p × 2^e. The first bit encodes the sign, 0 for positive numbers and 1 for negative numbers. The next 11 bits encode the exponent e, and the last 52 bits encode the significand p. The encoding of the exponent and the significand requires some explanation.

The exponent is stored with a bias of 1023. That is, positive and negative exponents are all stored in a single positive number by storing e + 1023 rather than storing e directly. Eleven bits can represent integers from 0 up to 2047. Subtracting the bias, this corresponds to values of e from -1023 to +1024. Define emin = -1022 and emax = +1023. The values emin – 1 and emax + 1 are reserved for special use. More on that below.

Floating point numbers are typically stored in normalized form. In base 10, a number is in normalized scientific notation if the significand is ≥ 1 and < 10. For example, 3.14 × 10^2 is in normalized form, but 0.314 × 10^3 and 31.4 × 10^2 are not. In general, a number in base β is in normalized form if it is of the form p × β^e where 1 ≤ p < β. This says that for binary, i.e. β = 2, the first bit of the significand of a normalized number is always 1. Since this bit never changes, it doesn’t need to be stored. Therefore we can express 53 bits of precision in 52 bits of storage. Instead of storing the significand directly, we store f, the fractional part, where the significand is of the form 1.f.
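
As a rough sketch, the sign, biased exponent, and fraction described above can be pulled out of a double in C along these lines. The names `double_fields`, `decode`, and `rebuild_normalized` are made up for illustration, and the code assumes that `double` is a 64-bit IEEE 754 type stored with the same byte order as `uint64_t`.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three bit fields of a 64-bit double. */
typedef struct {
    unsigned sign;        /* 1 bit: 0 for positive, 1 for negative */
    unsigned stored_exp;  /* 11 bits: the exponent plus the bias, e + 1023 */
    uint64_t fraction;    /* 52 bits: the f in 1.f */
} double_fields;

/* Split a double into its sign, stored exponent, and fraction fields. */
double_fields decode(double x)
{
    uint64_t bits;
    double_fields d;

    assert(sizeof bits == sizeof x);      /* assumption: 64-bit double */
    memcpy(&bits, &x, sizeof bits);       /* copy the raw bit pattern */

    d.sign = (unsigned)(bits >> 63);
    d.stored_exp = (unsigned)((bits >> 52) & 0x7FF);
    d.fraction = bits & ((UINT64_C(1) << 52) - 1);
    return d;
}

/* Rebuild the value as +/- 1.f x 2^(stored_exp - 1023).
   Only meaningful for normalized numbers (stored_exp from 1 to 2046). */
double rebuild_normalized(double_fields d)
{
    double significand = 1.0 + ldexp((double)d.fraction, -52);  /* 1.f */
    double magnitude = ldexp(significand, (int)d.stored_exp - 1023);
    return d.sign ? -magnitude : magnitude;
}
```

For 1.0, `decode` should report sign 0, stored exponent 1023 (so e = 0), and fraction 0. Feeding the fields of a normalized number back through `rebuild_normalized` should return the original value.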

The scheme above does not explain how to store 0. It’s impossible to specify values of f and e so that 1.f × 2^e = 0. The floating point format makes an exception to the rules stated above. When e = emin – 1 and f = 0, the bits are interpreted as 0. When e = emin – 1 and f ≠ 0, the result is a denormalized number. The bits are interpreted as 0.f × 2^emin. In short, the special exponent reserved below emin is used to represent 0 and denormalized floating point numbers.

The special exponent reserved above emax is used to represent ∞ and NaN. If e = emax + 1 and f = 0, the bits are interpreted as ∞. But if e = emax + 1 and f ≠ 0, the bits are interpreted as a NaN or “not a number.” See IEEE floating point exceptions for more information about ∞ and NaN.
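
The rules for the two reserved exponents can be summarized in a small classifier. This is only a sketch: the enum `value_kind` and the function `classify` are names invented here, and the code again assumes a 64-bit IEEE 754 `double`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical labels for the cases described in the text. */
typedef enum {
    KIND_ZERO,
    KIND_DENORMAL,
    KIND_NORMAL,
    KIND_INFINITY,
    KIND_NAN
} value_kind;

/* Classify a double by looking only at its exponent and fraction bits. */
value_kind classify(double x)
{
    uint64_t bits;

    assert(sizeof bits == sizeof x);      /* assumption: 64-bit double */
    memcpy(&bits, &x, sizeof bits);

    unsigned stored_exp = (unsigned)((bits >> 52) & 0x7FF);
    uint64_t fraction = bits & ((UINT64_C(1) << 52) - 1);

    if (stored_exp == 0)                  /* e = emin - 1 */
        return fraction == 0 ? KIND_ZERO : KIND_DENORMAL;
    if (stored_exp == 0x7FF)              /* e = emax + 1 */
        return fraction == 0 ? KIND_INFINITY : KIND_NAN;
    return KIND_NORMAL;
}
```

The sign bit is ignored here, so both +0 and -0 come back as `KIND_ZERO`, and both +∞ and -∞ as `KIND_INFINITY`.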

Since the largest exponent is 1023 and the largest significand is 1.f where f has 52 ones, the largest floating point number is 2^1023 (2 – 2^-52) = 2^1024 – 2^971 ≈ 2^1024 ≈ 1.8 × 10^308. In C, this constant is DBL_MAX, defined in <float.h>.
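
A quick way to check this arithmetic is to build the number from the formula and compare it with DBL_MAX. The function names below are hypothetical, and the sketch assumes a C99 compiler (for the hexadecimal floating point literal) and IEEE 754 doubles.

```c
#include <float.h>
#include <math.h>

/* Largest finite double from the formula in the text:
   significand 2 - 2^-52 (that is, 1.f with f all ones) times 2^1023. */
double largest_from_formula(void)
{
    double significand = 2.0 - 0x1p-52;   /* 0x1p-52 is 2^-52 */
    return ldexp(significand, 1023);      /* scale by 2^1023, exactly */
}

/* Returns 1 if the formula reproduces DBL_MAX exactly, 0 otherwise. */
int formula_matches_dbl_max(void)
{
    return largest_from_formula() == DBL_MAX;
}
```

Both steps are exact in double precision, so `formula_matches_dbl_max()` is expected to return 1 on an IEEE 754 platform.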

Since the smallest exponent is -1022, the smallest positive normalized number is 1.0 × 2^-1022 ≈ 2.2 × 10^-308. In C, this is defined as DBL_MIN. However, it is not the smallest positive number representable as a floating point number, only the smallest normalized floating point number. Smaller numbers can be expressed in denormalized form, albeit at a loss of significance. The smallest denormalized positive number occurs when f has 51 0s followed by a single 1. This corresponds to 2^-52 × 2^-1022 = 2^-1074 ≈ 4.9 × 10^-324. Attempts to represent any smaller number must underflow to zero.
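
The same kind of check works at the small end of the range. The function names here are made up for illustration, and the sketch assumes IEEE 754 doubles with denormals enabled and the default round-to-nearest mode.

```c
#include <float.h>
#include <math.h>

/* Smallest positive normalized double: 1.0 x 2^-1022. */
double smallest_normalized(void)
{
    return ldexp(1.0, -1022);
}

/* Smallest positive denormalized double: 2^-52 x 2^-1022 = 2^-1074. */
double smallest_denormalized(void)
{
    return ldexp(0x1p-52, -1022);
}

/* Half of the smallest denormal is not representable, so the division
   underflows to zero. */
double half_of_smallest_denormalized(void)
{
    volatile double tiny = smallest_denormalized();  /* volatile: divide at run time */
    return tiny / 2.0;
}
```

On such a platform `smallest_normalized()` should equal DBL_MIN, and `half_of_smallest_denormalized()` should return 0. C11 also exposes the smallest positive denormal as DBL_TRUE_MIN in <float.h>.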

C gives the name DBL_EPSILON to the difference between 1 and the next larger representable number, often described loosely as the smallest positive number ε such that 1 + ε ≠ 1 to machine precision. Since the significand has 52 fraction bits, it’s clear that DBL_EPSILON = 2^-52 ≈ 2.2 × 10^-16. That is why we say a floating point number has between 15 and 16 significant (decimal) figures.
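
A small experiment makes this concrete. The helper `changes_one` is a hypothetical name, and the sketch assumes IEEE 754 doubles with the default round-to-nearest mode.

```c
#include <float.h>

/* Returns 1 if adding eps to 1.0 gives a result different from 1.0. */
int changes_one(double eps)
{
    volatile double sum = 1.0 + eps;  /* volatile: store the sum as a double */
    return sum != 1.0;
}
```

Under those assumptions `changes_one(DBL_EPSILON)` should return 1, while `changes_one(DBL_EPSILON / 2)` should return 0, because 1 + 2^-53 lies exactly halfway between two representable numbers and the tie rounds back to 1.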

For more details see What Every Computer Scientist Should Know About Floating-Point Arithmetic.

First post in this series: Floating point numbers are a leaky abstraction

14 thoughts on “Anatomy of a floating point number”

  1. Karl, thanks for pointing out the mistake. I updated the post after reading your correction.

  2. bool IsNumber(double x)
    {
    // This looks like it should always be true,
    // but it’s false if x is a NaN.
    return (x == x);
    }
    as mentioned in “IEEE floating-point exceptions in C++”.
    This does not work, at least with the VC++ 6.0 compiler.

  3. Manjunath, I’m surprised to hear that. I’ve tested that function on several versions of Visual C++, though I can’t recall whether I tested VC6. I’ve tested it on several versions of gcc as well.

  4. John, in gcc it’s working fine, but in VC6 it’s not working as expected.
    Here is my x:
    x = numeric_limits<double>::quiet_NaN();

  5. In my test code, I made my own NaNs manually rather than using quiet_NaN(). For example, I’d have code like

    x = -1.0; z = sqrt(x); assert(!IsNumber(z));

    or

    x = 1.0; y = 0.0; x /= y; z = y*x; assert(!IsNumber(z));

    I also have some more elaborate code to generate a quiet NaN. Maybe VC6 has a bug in quiet_NaN? I wouldn’t think so, but that seems more plausible than a violation of IEEE arithmetic.

  6. Of course, I did a similar thing to test this function by generating a NaN, and in VC6 it failed. Then I tried a quiet NaN to verify it.

  7. In C++, the standard way to deal with these constants is via the parameterized numeric_limits class:

    #include <limits>

    double x = std::numeric_limits<double>::max();
    double y = std::numeric_limits<double>::epsilon();
    double nan = std::numeric_limits<double>::quiet_NaN();
    [etc.]

    This has the advantage of working for float, double, long double, and whatever is in vogue tomorrow… It also works for the integer types (well, not NaN, but you know what I mean). It also works for typedefs where you neither know nor care what the underlying type is:

    typedef long double real_t;

    real_t x = std::numeric_limits<real_t>::max();

    The <limits> header is part of the C++98 standard, so this also has the advantage of being perfectly portable across 21st-century compilers.

  8. Thanks for the more detailed article on floating point; they’re hard to find. I was wondering, since my math is a little weak: how do you get the expression 2^1023(2 – 2^-52) from 1.f * 2^1023, where f is a string of 52 1s? I assume the 2^1023 remains untouched, so the question then is how do you get (2 – 2^-52) from 1.f? I tried finding a common denominator, so (2^52 + 2^51 + … + 2^0) / 2^52, but that didn’t seem to get me anywhere.
    Thanks.

  9. Be great to add a diagram showing the bit structure of a float. The first time I learned about them there was a diagram and I found it incredibly clear in one glance. :-)

  10. Hi Jack,

    I am trying to explain why 1.f = (2-2^(-52))

    1.f = 2⁰+2^(-1) + 2^(-2) + … + 2^(-52)
    so,
    2^(52) (1.f) = 2^(52) + 2^(51) + … + 2⁰ = 2^(53) – 1 (this is the key)

    And finally,

    1.f= (2^(53) – 1) / 2^(52) = 2 – 2^(-52)

    Best regards,
    Javier
