Suppose you want to compute the natural logarithms of every floating point number, correctly truncated to a floating point result. Here by floating point number we mean an IEEE standard 64-bit float, what C calls a double
. Which logarithm is hardest to compute?
We’ll get to the hardest logarithm shortly, but we’ll first start with a warm up problem. Suppose you needed to compute the first digit of a number z. You know that z equals 2 + ε where ε is either 10−100 or −10−100. If ε is positive, you should return 2. If ε is negative, you should return 1. You know z to very high precision a priori, but to return the first digit correctly, you need to compute z to 100 decimal places.
In this post we’re truncating floating point numbers, i.e. rounding down, but similar considerations apply when rounding to nearest. For example, if you wanted to compute 0.5 + ε rounded to the nearest integer.
In general, it may not be possible to know how accurately you need to compute a value in order to round it correctly. William Kahan called this the table maker’s dilemma.
Worst case for logarithm
Lefèvre and Muller [1] found that the floating point number whose logarithm is hardest to compute is
x = 1.0110001010101000100001100001001101100010100110110110 × 2678
where the fractional part is written in binary. The log of x, again written in binary, is
111010110.0100011110011110101110100111110010010111000100000000000000000000000000000000000000000000000000000000000000000111 …
The first 53 significant bits, all the bits a floating point number can represent [2], are followed by 65 zeros. We have to compute the log with a precision of more than 118 bits in order to know whether the last bit of the fraction part of the float representation should be a 0 or a 1. Usually you don’t need nearly that much precision, but this is the worst case.
Mathematica code
Here is Mathematica code to verify the result above. I don’t know how to give Mathematica floating point numbers in binary, but I know how to give it integers, so I’ll multiply the fraction part by 252 and take 52 from the exponent.
n = FromDigits[ "10110001010101000100001100001001101100010100110110110", 2] x = n 2^(678 - 52) BaseForm[N[Log[x], 40], 2]
This computes the log of x to 40 significant decimal figures, between 132 and 133 bits, and gives the result above.
Related posts
A few days ago I wrote about another example by Jean-Michel Muller: A floating point oddity.
William Kahan also came up recently in my post on summing an array of floating point numbers.
***
[1] Vincent Lefèvre, Jean-Michel Muller. Worst Cases for Correct Rounding of the Elementary Functions in Double Precision. November 2000. Research Report 2000-35. Available here.
[2] There are 52 bits in the faction of an IEEE double. The first bit of a fraction is 1, unless a number is subnormal, so the first bit is left implicit, squeezing out one extra bit of precision. That is, storing 52 bits gives us 53 bits of precision.
Numbers in different bases can be entered in Mathematica using ^^ notation. The ` specifies the precision to use.
In[1]:= x=2^^1.0110001010101000100001100001001101100010100110110110`132*2^678
BaseForm[N[Log[x],40],2]
Out[1]= 1.737429606443346566788426276849475875520*10^204
Out[2]//BaseForm= Subscript[1.11010110010001111001111010111010011111001001011100010000000000000000000000000000000000000000000000000000000000000000011100110100100, 2]*2^(8)
Here is the number in standard floating point notation: 1.7374296064433466e+204 (or in hex notation: 0x1.62a88613629b6p+678).