When is round-trip floating point radix conversion exact?

Suppose you store a floating point number in memory, print it out in human-readable base 10, and read it back in. When can the original number be recovered exactly?

D. W. Matula answered this question more generally in 1968 [1].

Suppose we start with base β with p places of precision and convert to base γ with q places of precision, rounding to nearest, then convert back to the original base β. Matula’s theorem says that if there are no positive integers i and j such that

βi = γj

then a necessary and sufficient condition for the round-trip to be exact (assuming no overflow or underflow) is that

γq-1 > βp.

In the case of floating point numbers (type double in C) we have β = 2 and p = 53. (See Anatomy of a floating point number.) We’re printing to base γ = 10. No positive power of 10 is also a power of 2, so Matula’s condition on the two bases holds.

If we print out q = 17 decimal places, then

1016 > 253

and so round-trip conversion will be exact if both conversions round to nearest. If q is any smaller, some round-trip conversions will not be exact.

You can also verify that for a single precision floating point number (p = 24 bits precision) you need q = 9 decimal digits, and for a quad precision number (p = 113 bits precision) you need q = 36 decimal digits [2].

Looking back at Matula’s theorem, clearly we need

γq ≥ βp.

Why? Because the right side is the number of base β fractions and the left side is the number of base γ fractions. You can’t have a one-to-one map from a larger space into a smaller space. So the inequality above is necessary, but not sufficient. However, it’s almost sufficient. We just need one more base γ figure, i.e. we Matula tells us

γq-1 > βp

is sufficient. In terms of base 2 and base 10, we need at least 16 decimals to represent 53 bits. The surprising thing is that one more decimal is enough to guarantee that round-trip conversions are exact. It’s not obvious a priori that any finite number of extra decimals is always enough, but in fact just one more is enough; there’s no “table maker’s dilemma” here.

Here’s an example to show the extra decimal is necessary. Suppose p = 5. There are more 2-digit numbers than 5-bit numbers, but if we only use two digits then round-trip radix conversion will not always be exact. For example, the number 17/16 written in binary is 1.0001two, and has five significant bits. The decimal equivalent is 1.0625ten, which rounded to two significant digits is 1.1ten. But the nearest binary number to 1.1ten with 5 significant bits is 1.0010two = 1.125ten. In short, rounding to nearest gives

1.0001two -> 1.1ten -> 1.0010two

and so we don’t end up back where we started.

More floating point posts

[1] D. W. Matula. In-and-out conversions. Communications of the ACM, 11(1):47–50. January 1968. Cited in Handbook of Floating-point Arithmetic by Jean-Mihel Muller et al.

[2] The number of bits allocated for the fractional part of a floating point number is 1 less than the precision: the leading figure is always 1, so IEEE formats save one bit by not storing the leading bit, leaving it implicit. So, for example, a C double has 53 bits precision, but 52 bits of the 64 bits in a double are allocated to storing the fraction.

4 thoughts on “When is round-trip floating point radix conversion exact?

  1. Severin Pappadeux

    “You can also verify that for a single precision floating point number (p = 24 bits precision) you need q = 9 decimal digits” Really? 2^24 is about 16 millions, so q = 8 decimal digits should be enough.

  2. @Severin: It is true that 8 decimal digits is enough to have more than 2^24 possible numbers. That is necessary but not sufficient. With 8 decimals it would be possible to create some one-to-one map back and forth, but that map would not correspond to radix conversion with round to nearest. For exact round-trip radix conversion you need one more decimal.

  3. The necessary condition for a one-to-one map (i.e., that distinct floating-point values are guaranteed to map to distinct decimal representations) can be refined to a characterizing– i.e., equivalent– stricter condition that γ^(q-1) ≥ β^(p)-1. This is in another paper by Matula published later that same year:

    Matula, David W., The Base Conversion Theorem, Proceedings of the American Mathematical Society, 19(3) June 1968, p. 716-723

    The good news is that in the particular common case where we are converting base 2 to base 10, this condition is equivalent to the inequality in your post, with the sole exception of single-bit conversions (p=q=1).

Leave a Reply

Your email address will not be published. Required fields are marked *