Anatomy of a floating point number

In my previous post, I explained that floating point numbers are a leaky abstraction. Often you can pretend that they are mathematical real numbers, but sometimes you cannot. This post peels back the abstraction and explains exactly what a floating point number is. (Technically, this post describes an IEEE 754 double precision floating point number, by far the most common kind of floating point number in practice.)

A floating point number has 64 bits that encode a number of the form ± p × 2^e. The first bit encodes the sign, 0 for positive numbers and 1 for negative numbers. The next 11 bits encode the exponent e, and the last 52 bits encode the significand p. The encodings of the exponent and the significand require some explanation.
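The layout above can be sketched in C++ by copying a double's bits into a 64-bit integer and masking out the three fields. This is only a minimal sketch: it assumes the platform's double is an IEEE 754 64-bit value, and the names Fields, split_fields, and check_split_fields are made up for illustration. The exponent field it returns is the raw stored value, not e itself.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Assumption: double is a 64-bit IEEE 754 value on this platform.
static_assert(std::numeric_limits<double>::is_iec559 && sizeof(double) == 8,
              "this sketch assumes 64-bit IEEE 754 doubles");

// Hypothetical helper type: the three raw bit fields of a double.
struct Fields {
    std::uint64_t sign;      // 1 bit: 0 for positive, 1 for negative
    std::uint64_t exponent;  // 11 bits, exactly as stored
    std::uint64_t fraction;  // 52 bits
};

// Hypothetical helper: split a double into its raw fields.
Fields split_fields(double x) {
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);  // copy the bit pattern into an integer
    const std::uint64_t fraction_mask = (std::uint64_t{1} << 52) - 1;
    return Fields{bits >> 63, (bits >> 52) & 0x7FF, bits & fraction_mask};
}

// Small self-check: 1.5 is positive and has only the top fraction bit set.
void check_split_fields() {
    const Fields f = split_fields(1.5);
    assert(f.sign == 0);
    assert(f.fraction == (std::uint64_t{1} << 51));
}
```

Using memcpy rather than a pointer cast is a deliberate choice here, since it avoids aliasing problems when reinterpreting the bits.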

The exponent is stored with a bias of 1023. That is, positive and negative exponents are all stored in a single nonnegative number by storing e + 1023 rather than storing e directly. Eleven bits can represent integers from 0 up to 2047. Subtracting the bias, this corresponds to values of e from −1023 to +1024. Define e_min = −1022 and e_max = +1023. The values e_min − 1 and e_max + 1 are reserved for special use. More on that below.
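To make the bias concrete, here is a small sketch that reads the stored exponent field and subtracts 1023. It assumes 64-bit IEEE 754 doubles; the names kBias, stored_exponent, unbiased_exponent, and check_bias are illustrative only.

```cpp
#include <cassert>
#include <cfloat>
#include <cstdint>
#include <cstring>
#include <limits>

// Assumption: double is a 64-bit IEEE 754 value.
constexpr int kBias = 1023;

// Hypothetical helper: the raw 11-bit exponent field, in the range 0..2047.
int stored_exponent(double x) {
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return static_cast<int>((bits >> 52) & 0x7FF);
}

// Hypothetical helper: the stored field with the bias removed.
int unbiased_exponent(double x) { return stored_exponent(x) - kBias; }

// Self-check against the range described in the text.
void check_bias() {
    assert(unbiased_exponent(1.0) == 0);
    assert(unbiased_exponent(0.5) == -1);
    assert(unbiased_exponent(DBL_MIN) == -1022);  // e_min
    assert(unbiased_exponent(DBL_MAX) == 1023);   // e_max
    // The two reserved values sit just outside [e_min, e_max].
    assert(unbiased_exponent(0.0) == -1023);
    assert(unbiased_exponent(std::numeric_limits<double>::infinity()) == 1024);
}
```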

Floating point numbers are typically stored in normalized form. In base 10, a number is in normalized scientific notation if the significand is ≥ 1 and < 10. For example, 3.14 × 10² is in normalized form, but 0.314 × 10³ and 31.4 × 10² are not. In general, a number in base β is in normalized form if it is of the form p × β^e where 1 ≤ p < β. This says that for binary, i.e. β = 2, the first bit of the significand of a normalized number is always 1. Since this bit never changes, it doesn’t need to be stored. Therefore we can express 53 bits of precision in 52 bits of storage. Instead of storing the significand directly, we store f, the fractional part, where the significand is of the form 1.f.
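As a sanity check on the 1.f convention, the following sketch rebuilds a normalized double from its stored fields by restoring the implicit leading 1. It assumes 64-bit IEEE 754 doubles and is only meant for normalized inputs; rebuild_normalized and check_rebuild are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: recompute x as (+/-) 1.f * 2^e from its bit fields.
// Only valid for normalized x (not zero, denormalized, infinite, or NaN).
double rebuild_normalized(double x) {
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    const int e = static_cast<int>((bits >> 52) & 0x7FF) - 1023;
    const std::uint64_t f = bits & ((std::uint64_t{1} << 52) - 1);
    // The significand 1.f equals 1 + f / 2^52; the leading 1 is never stored.
    const double significand = 1.0 + std::ldexp(static_cast<double>(f), -52);
    const double magnitude = std::ldexp(significand, e);
    return (bits >> 63) ? -magnitude : magnitude;
}

// Self-check: the reconstruction reproduces the input exactly.
void check_rebuild() {
    assert(rebuild_normalized(3.14) == 3.14);
    assert(rebuild_normalized(-0.1) == -0.1);
}
```

Every step is exact in double arithmetic, because 1 + f / 2^52 needs at most 53 significant bits and ldexp only adjusts the exponent.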

The scheme above does not explain how to store 0. It’s impossible to specify values of f and e so that 1.f × 2^e = 0. The floating point format makes an exception to the rules stated above. When e = e_min − 1 and f = 0, the bits are interpreted as 0. When e = e_min − 1 and f ≠ 0, the result is a denormalized number. The bits are interpreted as 0.f × 2^e_min. In short, the special exponent reserved below e_min is used to represent 0 and denormalized floating point numbers.
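The rule for the low reserved exponent can be sketched as follows. An exponent field of all zeros means the value is 0.f × 2^−1022, which equals f × 2^−1074 when f is read as a 52-bit integer. The sketch assumes 64-bit IEEE 754 doubles with denormalized numbers enabled (no flush-to-zero mode); from_bits, decode_low_reserved, and check_low_reserved are illustrative names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper: reinterpret a 64-bit pattern as a double.
double from_bits(std::uint64_t bits) {
    double x;
    std::memcpy(&x, &bits, sizeof x);
    return x;
}

// Hypothetical helper: the value encoded when the exponent field is 0
// and the sign bit is 0: 0.f * 2^-1022 = fraction * 2^-1074.
double decode_low_reserved(std::uint64_t fraction) {
    return std::ldexp(static_cast<double>(fraction), -1074);
}

// Self-check against the hardware's own interpretation of the bits.
void check_low_reserved() {
    const std::uint64_t all_ones = (std::uint64_t{1} << 52) - 1;
    assert(from_bits(0) == 0.0);                              // f = 0 gives zero
    assert(decode_low_reserved(1) == from_bits(1));           // smallest denormal
    assert(decode_low_reserved(all_ones) == from_bits(all_ones));
    assert(std::fpclassify(from_bits(1)) == FP_SUBNORMAL);
}
```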

The special exponent reserved above e_max is used to represent ∞ and NaN. If e = e_max + 1 and f = 0, the bits are interpreted as ∞, either +∞ or −∞ depending on the sign bit. But if e = e_max + 1 and f ≠ 0, the bits are interpreted as a NaN or “not a number.” See IEEE floating point exceptions for more information about ∞ and NaN.
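The same kind of bit test distinguishes ∞ from NaN. The sketch below assumes 64-bit IEEE 754 doubles and a compiler that is not run in a mode that assumes NaNs never occur; is_infinity_bits, is_nan_bits, and check_high_reserved are hypothetical names, and the results are compared with the standard std::isinf and std::isnan.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper: exponent field all ones (e = e_max + 1) and f = 0.
bool is_infinity_bits(double x) {
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    const std::uint64_t fraction = bits & ((std::uint64_t{1} << 52) - 1);
    return ((bits >> 52) & 0x7FF) == 0x7FF && fraction == 0;
}

// Hypothetical helper: exponent field all ones and f != 0.
bool is_nan_bits(double x) {
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    const std::uint64_t fraction = bits & ((std::uint64_t{1} << 52) - 1);
    return ((bits >> 52) & 0x7FF) == 0x7FF && fraction != 0;
}

// Self-check against the standard library's classification.
void check_high_reserved() {
    const double inf = std::numeric_limits<double>::infinity();
    const double nan = std::numeric_limits<double>::quiet_NaN();
    assert(is_infinity_bits(inf) && std::isinf(inf));
    assert(is_infinity_bits(-inf));  // the sign bit selects -infinity
    assert(is_nan_bits(nan) && std::isnan(nan));
    assert(!is_infinity_bits(1.0) && !is_nan_bits(1.0));
}
```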

Since the largest exponent is 1023 and the largest significand is 1.f where f has 52 ones, the largest floating point number is

2¹⁰²³(2 − 2⁻⁵²) = 2¹⁰²⁴ − 2⁹⁷¹ ≈ 2¹⁰²⁴ ≈ 1.8 × 10³⁰⁸.

In C, this constant is named DBL_MAX and is defined in <float.h>.
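The formula can be checked directly, as a rough sketch assuming 64-bit IEEE 754 doubles. The function names largest_finite and check_largest_finite are made up for this example; std::ldexp(x, n) computes x × 2^n.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>
#include <limits>

// Hypothetical helper: build 2^1023 * (2 - 2^-52) from its two factors.
double largest_finite() {
    // 2 - 2^-52 is the significand 1.f with all 52 fraction bits set.
    const double significand = 2.0 - std::ldexp(1.0, -52);
    return std::ldexp(significand, 1023);
}

// Self-check: the result matches the constants from C and C++.
void check_largest_finite() {
    assert(largest_finite() == DBL_MAX);
    assert(largest_finite() == std::numeric_limits<double>::max());
}
```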

Since the smallest exponent is −1022, the smallest positive normalized number is 1.0 × 2⁻¹⁰²² ≈ 2.2 × 10⁻³⁰⁸. In C, this is defined as DBL_MIN. However, it is not the smallest positive number representable as a floating point number, only the smallest normalized floating point number. Smaller numbers can be expressed in denormalized form, albeit at a loss of significance. The smallest denormalized positive number occurs when f has 51 0’s followed by a single 1. This corresponds to 2⁻⁵² × 2⁻¹⁰²² = 2⁻¹⁰⁷⁴ ≈ 4.9 × 10⁻³²⁴. No smaller positive number can be represented; attempts to compute one underflow, typically to zero.
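These two lower limits can be reproduced with std::ldexp, again only as a sketch. It assumes 64-bit IEEE 754 doubles, the default round-to-nearest mode, denormalized numbers enabled, and no extended-precision intermediates; smallest_normal, smallest_denormal, and check_smallest are hypothetical names.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>
#include <limits>

// Hypothetical helpers: the two lower limits described in the text.
double smallest_normal() { return std::ldexp(1.0, -1022); }
double smallest_denormal() { return std::ldexp(1.0, -1074); }

// Self-check against the library constants.
void check_smallest() {
    assert(smallest_normal() == DBL_MIN);
    assert(smallest_denormal() == std::numeric_limits<double>::denorm_min());
    assert(smallest_denormal() > 0.0);
    // Halving the smallest denormal underflows to zero
    // (assumption: default round-to-nearest, ties-to-even).
    const double half = smallest_denormal() / 2;
    assert(half == 0.0);
}
```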

C gives the name DBL_EPSILON to the gap between 1 and the next larger floating point number, commonly described as the smallest positive number ε such that 1 + ε ≠ 1 to machine precision. Since the significand has 52 stored bits after the leading 1, DBL_EPSILON = 2⁻⁵² ≈ 2.2 × 10⁻¹⁶. That is why we say a floating point number has between 15 and 16 significant (decimal) figures.
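A short sketch of what DBL_EPSILON means in practice, assuming 64-bit IEEE 754 doubles, the default rounding mode, and no extended-precision intermediates. The names gap_above_one and check_epsilon are illustrative; std::nextafter(1.0, 2.0) returns the next double after 1 in the direction of 2.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>

// Hypothetical helper: distance from 1.0 to the next representable double.
double gap_above_one() { return std::nextafter(1.0, 2.0) - 1.0; }

// Self-check: the gap is 2^-52, and half of it is lost when added to 1.
void check_epsilon() {
    assert(gap_above_one() == DBL_EPSILON);
    assert(DBL_EPSILON == std::ldexp(1.0, -52));
    const double bumped = 1.0 + DBL_EPSILON;
    const double unchanged = 1.0 + DBL_EPSILON / 2;  // rounds back to 1.0
    assert(bumped != 1.0);
    assert(unchanged == 1.0);
}
```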

For more details see What Every Computer Scientist Should Know About Floating-Point Arithmetic.

First post in this series: Floating point numbers are a leaky abstraction

14 thoughts on “Anatomy of a floating point number”

Karl Ove Hufthammer

7 April 2009 at 09:21

Thanks for the nice summary of floating point numbers. BTW, I think you made a small mistake in the third paragraph: 2043 should be 2047.

John

7 April 2009 at 09:39

Karl, thanks for pointing out the mistake. I updated the post after reading your correction.

Manjunath

22 June 2009 at 23:54

bool IsNumber(double x)
{
    // This looks like it should always be true,
    // but it’s false if x is a NaN.
    return (x == x);
}
This function is from “IEEE floating-point exceptions in C++”.
It does not work, at least with the VC++ 6.0 compiler.

John

23 June 2009 at 01:23

Manjunath, I’m surprised to hear that. I’ve tested that function on several versions of Visual C++, though I can’t recall whether I tested VC6. I’ve tested it on several versions of gcc as well.

Manjunath

23 June 2009 at 01:32

John, in gcc it’s working fine, but in VC6 it’s not working as expected.
Here is my x:
x = numeric_limits<double>::quiet_NaN();

John

23 June 2009 at 01:44

In my test code, I made my own NaNs manually rather than using quiet_NaN(). For example, I’d have code like

x = -1.0; z = sqrt(x); assert(!IsNumber(z));

x = 1.0; y = 0.0; x /= y; z = y*x; assert(!IsNumber(z));

I also have some more elaborate code to generate a quiet NaN. Maybe VC6 has a bug in quiet_NaN? I wouldn’t think so, but that seems more plausible than a violation of IEEE arithmetic.

Manjunath

23 June 2009 at 02:10

Of course, I did a similar thing to test this function by generating a NaN, and in VC6 it failed. Then I tried a quiet NaN to verify it.

Nemo

13 October 2010 at 14:33

In C++, the standard way to deal with these constants is via the parameterized numeric_limits class:

#include <limits>

double x = std::numeric_limits<double>::max();
double y = std::numeric_limits<double>::epsilon();
double nan = std::numeric_limits<double>::quiet_NaN();
[etc.]

This has the advantage of working for float, double, long double, and whatever is in vogue tomorrow… It also works for the integer types (well, not NaN, but you know what I mean). It also works for typedefs where you neither know nor care what the underlying type is:

typedef long double real_t;

real_t x = std::numeric_limits<real_t>::max();

The <limits> header is part of the C++98 standard, so this also has the advantage of being perfectly portable across 21st-century compilers.

Antonio

7 July 2011 at 10:41

IEEE should hurry up with that floating point interval arithmetic standard http://en.wikipedia.org/wiki/Interval_arithmetic#IEEE_Interval_Standard_.E2.80.93_P1788

Jack

2 January 2012 at 09:16

Thanks for the more detailed article on floating point; they’re hard to find. I was wondering, my math is a little weak. How do you get the expression 2^1023(2 – 2^-52) from 1.f * 2^1023, where f is a string of 52 1’s. I assume the 2^1023 remains untouched, so the question then is how do you get (2 – 2^-52) from 1.f? I tried finding a common denominator, so (2^52 + 2^51 + … 2^0) / 2^52, but didn’t seem to get me anywhere.
Thanks.

Barnaby

21 February 2012 at 11:56

Be great to add a diagram showing the bit structure of a float. The first time I learned about them there was a diagram and I found it incredibly clear in one glance. :-)

Javier

14 July 2012 at 10:31

Hi Jack,

I am trying to explain why 1.f = (2-2^(-52))

1.f = 2⁰+2^(-1) + 2^(-2) + … + 2^(-52)
so,
2^(52) (1.f) = 2^(52) + 2^(51) + … + 2⁰ = 2^(53) – 1 (this is the key)

And finally,

1.f= (2^(53) – 1) / 2^(52) = 2 – 2^(-52)

Best regards,
Javier

Nathan Fisher

5 April 2015 at 17:43

Hi John,

Just as an FYI the sun page you referenced is now located here;
http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

Cheers,
Nathan

John

6 April 2015 at 10:35

Thanks. I updated the link in the post.

That link has changed several times since I found it.
