Suppose you store a floating point number in memory, print it out in human-readable base 10, and read it back in. When can the original number be recovered exactly?

D. W. Matula answered this question more generally in 1968 [1].

Suppose we start with base β with *p* places of precision and convert to base γ with *q* places of precision, rounding to nearest, then convert back to the original base β. Matula’s theorem says that if there are no positive integers *i* and *j* such that

β^{i} = γ^{j}

then a necessary and sufficient condition for the round-trip to be exact (assuming no overflow or underflow) is that

γ^{q-1} > β^{p}.

In the case of floating point numbers (type `double`

in C) we have β = 2 and p = 53. (See Anatomy of a floating point number.) We’re printing to base γ = 10. No positive power of 10 is also a power of 2, so Matula’s condition on the two bases holds.

If we print out *q* = 17 decimal places, then

10^{16} > 2^{53}

and so round-trip conversion will be exact if both conversions round to nearest. If *q* is any smaller, some round-trip conversions will not be exact.

You can also verify that for a single precision floating point number (*p* = 24 bits precision) you need *q* = 9 decimal digits, and for a quad precision number (*p* = 113 bits precision) you need *q* = 36 decimal digits [2].

Looking back at Matula’s theorem, clearly we need

γ^{q} ≥ β^{p}.

Why? Because the right side is the number of base β fractions and the left side is the number of base γ fractions. You can’t have a one-to-one map from a larger space into a smaller space. So the inequality above is necessary, but not sufficient. However, it’s almost sufficient. We just need one more base γ figure, i.e. we Matula tells us

γ^{q-1} > β^{p}

is sufficient. In terms of base 2 and base 10, we need at least 16 decimals to represent 53 bits. The surprising thing is that one more decimal is enough to guarantee that round-trip conversions are exact. It’s not obvious a priori that any finite number of extra decimals is always enough, but in fact just one more is enough; there’s no “table maker’s dilemma” here.

Here’s an example to show the extra decimal is necessary. Suppose *p* = 5. There are more 2-digit numbers than 5-bit numbers, but if we only use two digits then round-trip radix conversion will not always be exact. For example, the number 17/16 written in binary is 1.0001_{two}, and has five significant bits. The decimal equivalent is 1.0625_{ten}, which rounded to two significant digits is 1.1_{ten}. But the nearest binary number to 1.1_{ten} with 5 significant bits is 1.0010_{two} = 1.125_{ten}. In short, rounding to nearest gives

1.0001_{two} -> 1.1_{ten} -> 1.0010_{two}

and so we don’t end up back where we started.

## More floating point posts

[1] D. W. Matula. In-and-out conversions. Communications of the ACM, 11(1):47–50. January 1968. Cited in Handbook of Floating-point Arithmetic by Jean-Mihel Muller et al.

[2] The number of bits allocated for the fractional part of a floating point number is 1 less than the precision: the leading figure is always 1, so IEEE formats save one bit by not storing the leading bit, leaving it implicit. So, for example, a C `double`

has 53 bits precision, but 52 bits of the 64 bits in a `double`

are are allocated to storing the fraction.

“You can also verify that for a single precision floating point number (p = 24 bits precision) you need q = 9 decimal digits” Really? 2^24 is about 16 millions, so q = 8 decimal digits should be enough.

@Severin: It is true that 8 decimal digits is enough to have more than 2^24 possible numbers. That is necessary but not sufficient. With 8 decimals it would be possible to create

someone-to-one map back and forth, but that map would not correspond to radix conversion with round to nearest. For exact round-trip radix conversion you need one more decimal.The necessary condition for a one-to-one map (i.e., that distinct floating-point values are guaranteed to map to distinct decimal representations) can be refined to a characterizing– i.e., equivalent– stricter condition that γ^(q-1) ≥ β^(p)-1. This is in another paper by Matula published later that same year:

Matula, David W., The Base Conversion Theorem, Proceedings of the American Mathematical Society, 19(3) June 1968, p. 716-723

The good news is that in the particular common case where we are converting base 2 to base 10, this condition is equivalent to the inequality in your post, with the sole exception of single-bit conversions (p=q=1).

@John, you’re right, I’ll take it back. Tested all 32bit floats via roundtrip float->char->float conversion. Code is here: https://github.com/Kri-Ol/from_chars-to_chars-conversion-test