Extended floating point precision in R and C

The GNU MPFR library is a C library for extended precision floating point calculations. The name stands for Multiple Precision Floating-point Reliable. The library has an R wrapper Rmpfr that is more convenient for interactive use. There are also wrappers for other languages.

It takes a long time to install MPFR and its prerequisite GMP, and so I expected it to take a long time to install Rmpfr. But the R library installs quickly, even on a system that doesn’t have MPFR or GMP installed. (I installed GMP and MPFR from source on Linux, but installed Rmpfr on Windows. Presumably the Windows R package included pre-compiled binaries.)

I’ll start by describing the high-level R interface, then go into the C API.


You can call the functions in Rmpfr with ordinary numbers. For example, you could calculate ζ(3), the Riemann zeta function evaluated at 3.

    > zeta(3)
    1 'mpfr' number of precision  128   bits
    [1] 1.202056903159594285399738161511449990768

The default precision is 128 bits, and a numeric argument is interpreted as a 128-bit MPFR object. R doesn’t have a built-in zeta function, so the only available zeta is the one from Rmpfr. If you ask for the cosine of 3, you’ll get ordinary precision.

    > cos(3)
    [1] -0.9899925

But if you explicitly pass cosine a 128-bit MPFR representation of the number 3 you will get cos(3) to 128-bit precision.

    > cos(mpfr(3, 128))                            
    1 'mpfr' number of precision  128   bits       
    [1] -0.9899924966004454572715727947312613023926

Of course you don’t have to use only 128 bits. For example, you could find π to 100 decimal places by multiplying the arctangent of 1 by 4.

    > 100*log(10)/log(2) # number of bits needed for 100 decimals                                               
    [1] 332.1928     
    >  4*atan(mpfr(1,333))                                                                                      
    1 'mpfr' number of precision  333   bits                                                                    
    [1] 3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706807 

MPFR C library

The following C code shows how to compute cos(3) to 128-bit precision and 4 atan(1) to 333-bit precision as above.

    #include <stdio.h>
    #include <gmp.h>
    #include <mpfr.h>
    int main (void)
    {
        // All functions require a rounding mode.
        // This mode specifies round-to-nearest
        mpfr_rnd_t rnd = MPFR_RNDN;
        mpfr_t x, y;
        // allocate uninitialized memory for x and y as 128-bit numbers
        mpfr_init2(x, 128);
        mpfr_init2(y, 128);
        // Set x to the C double number 3
        mpfr_set_d(x, 3, rnd);
        // Set y to the cosine of x
        mpfr_cos(y, x, rnd);
        // Print y to standard out in base 10
        printf("y = ");
        mpfr_out_str(stdout, 10, 0, y, rnd);
        putchar('\n');
        // Compute pi as 4*atan(1).
        // Change the precision of x and y to 333 bits.
        // (Calling mpfr_init2 again would leak memory.)
        mpfr_set_prec(x, 333);
        mpfr_set_prec(y, 333);
        mpfr_set_d(x, 1.0, rnd);
        mpfr_atan(y, x, rnd);
        // Multiply y by 4 and store the result back in y
        mpfr_mul_d(y, y, 4, rnd);
        printf("y = ");
        mpfr_out_str(stdout, 10, 0, y, rnd);
        putchar('\n');
        // Release memory
        mpfr_clear(x);
        mpfr_clear(y);
        return 0;
    }

If this code is saved in the file hello_mpfr.c then you can compile it with

    gcc hello_mpfr.c -lmpfr -lgmp

One line above deserves a little more explanation. The second and third arguments to mpfr_out_str are the base b and number of figures n to print.

We chose b=10 but you could specify any base value 2 ≤ b ≤ 62.

If n were set to 100 then the output would contain 100 significant figures. When n=0, MPFR will determine the number of digits to output, enough digits that the string representation could be read back in exactly. To understand how many digits that is, see Matula’s theorem in the previous post.

When is round-trip floating point radix conversion exact?

Suppose you store a floating point number in memory, print it out in human-readable base 10, and read it back in. When can the original number be recovered exactly?

D. W. Matula answered this question more generally in 1968 [1].

Suppose we start with base β with p places of precision and convert to base γ with q places of precision, rounding to nearest, then convert back to the original base β. Matula’s theorem says that if there are no positive integers i and j such that

\beta^i = \gamma^j

then a necessary and sufficient condition for the round-trip to be exact (assuming no overflow or underflow) is that

\gamma^{q-1} > \beta^p.

In the case of floating point numbers (type double in C) we have β = 2 and p = 53. (See Anatomy of a floating point number.) We’re printing to base γ = 10. No positive power of 10 is also a power of 2, so Matula’s condition on the two bases holds.

If we print out q = 17 decimal places, then

10^{16} > 2^{53}

and so round-trip conversion will be exact if both conversions round to nearest. If q is any smaller, some round-trip conversions will not be exact.
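As a quick check, here is a short Python sketch (Python’s float is an IEEE 754 double) showing a number that fails to round-trip through 16 significant digits but survives 17:

```python
def roundtrip(x, q):
    # Convert x to a decimal string with q significant digits and back
    return float(f"{x:.{q - 1}e}") == x

x = 1 + 2**-52  # the smallest double greater than 1

print(roundtrip(x, 16))  # False: 16 digits read back as 1.0
print(roundtrip(x, 17))  # True: 17 digits recover x exactly
```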

You can also verify that for a single precision floating point number (p = 24 bits precision) you need q = 9 decimal digits, and for a quad precision number (p = 113 bits precision) you need q = 36 decimal digits [2].
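These values can be checked with a few lines of Python: the required q is the smallest integer with 10^(q−1) > 2^p, which works out to ⌊p log₁₀ 2⌋ + 2 since p log₁₀ 2 is never an integer.

```python
from math import log10, floor

def digits_needed(p):
    # Smallest q with 10**(q - 1) > 2**p
    return floor(p * log10(2)) + 2

for p in (24, 53, 113):
    print(p, digits_needed(p))  # 9, 17, and 36 digits respectively
```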

Looking back at Matula’s theorem, clearly we need

\gamma^q \geq \beta^p.

Why? Because the right side is the number of base β fractions and the left side is the number of base γ fractions. You can’t have a one-to-one map from a larger space into a smaller space. So the inequality above is necessary, but not sufficient. However, it’s almost sufficient. We just need one more base γ figure: Matula tells us

\gamma^{q-1} > \beta^p

is sufficient. In terms of base 2 and base 10, we need at least 16 decimals to represent 53 bits. The surprising thing is that one more decimal is enough to guarantee that round-trip conversions are exact. It’s not obvious a priori that any finite number of extra decimals is always enough, but in fact just one more is enough; there’s no “table maker’s dilemma” here.

Here’s an example to show the extra decimal is necessary. Suppose p = 5. There are more 2-digit numbers than 5-bit numbers, but if we only use two digits then round-trip radix conversion will not always be exact. For example, the number 17/16 written in binary is 1.0001₂, and has five significant bits. The decimal equivalent is 1.0625₁₀, which rounded to two significant digits is 1.1₁₀. But the nearest binary number to 1.1₁₀ with 5 significant bits is 1.0010₂ = 1.125₁₀. In short, rounding to nearest gives

1.0001₂ → 1.1₁₀ → 1.0010₂

and so we don’t end up back where we started.
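The round trip above can be traced in a few lines of Python, using exact rational arithmetic to stand in for 5-bit binary and 2-digit decimal rounding:

```python
from fractions import Fraction

x = Fraction(17, 16)   # 1.0001 in binary: five significant bits
decimal = round(x, 1)  # nearest two-digit decimal: 1.1
# Nearest multiple of 1/16, i.e. back to five significant bits
binary = Fraction(round(decimal * 16), 16)

print(decimal, binary)  # 11/10 and 9/8: we end at 1.125, not 1.0625
```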

More floating point posts

[1] D. W. Matula. In-and-out conversions. Communications of the ACM, 11(1):47–50. January 1968. Cited in Handbook of Floating-point Arithmetic by Jean-Michel Muller et al.

[2] The number of bits allocated for the fractional part of a floating point number is 1 less than the precision: the leading figure is always 1, so IEEE formats save one bit by not storing the leading bit, leaving it implicit. So, for example, a C double has 53 bits precision, but 52 bits of the 64 bits in a double are allocated to storing the fraction.

MDS codes

A maximum distance separable code, or MDS code, is a way of encoding data so that the distance between code words is as large as possible for a given data capacity. This post will explain what that means and give examples of MDS codes.


A linear block code takes a sequence of k symbols and encodes it as a sequence of n symbols. These symbols come from an alphabet of size q. For binary codes, q = 2. But for non-trivial MDS codes, q > 2. More on that below.

The purpose of these codes is to increase the ability to detect and correct transmission errors while not adding more overhead than necessary. Clearly n must be bigger than k, but the overhead n-k has to pay for itself in terms of the error detection and correction capability it provides.

The ability of a code to detect and correct errors is measured by d, the minimum distance between code words. A code has separation distance d if every pair of code words differs in at least d positions. Such a code can detect up to d − 1 errors per block and can correct ⌊(d-1)/2⌋ errors.


The following example is not an MDS code but it illustrates the notation above.

The extended Golay code used to send back photos from the Voyager missions has q = 2 and [n, k, d] = [24, 12, 8]. That is, data is divided into segments of 12 bits and encoded as 24 bits in such a way that all code blocks differ in at least 8 positions. This allows up to 7 bit flips per block to be detected, and up to 3 bit flips per block to be corrected.

(If 4 bits were corrupted, the result could be equidistant between two valid code words, so the error could be detected but not corrected with certainty.)

Separation bound

There is a theorem that says that for any linear code

k + d ≤ n + 1.

This is known as the Singleton bound. MDS codes are optimal with respect to this bound. That is,

k + d = n + 1.

So MDS codes are optimal with respect to the Singleton bound, analogous to how perfect codes are optimal with respect to the Hamming bound. There is a classification theorem that says perfect codes are either Hamming codes or trivial, with one exception. There is something similar for MDS codes.


MDS codes are essentially either Reed-Solomon codes or trivial. This classification is not as precise as the analogous classification of perfect codes. There are variations on Reed-Solomon codes that are also MDS codes. As far as I know, this accounts for all the known MDS codes. I don’t know that any others have been found, or that anyone has proved that there are no more.

Trivial MDS codes

What are these trivial codes? They are the codes with 0 or 1 added symbols, and the duals of these codes. (The dual of an MDS code is always an MDS code.)

If you do no encoding, i.e. take k symbols and encode them as k symbols, then d = 1 because different code words may differ in only one symbol. In this case n = k and so k + d = n + 1, i.e. the Singleton bound is met exactly.

You could take k data symbols and add a checksum. If q = 2 this would be a parity bit. For a larger alphabet of symbols, it could be the sum of the k data symbols mod q. Then if two messages differ in 1 symbol, they also differ in the added checksum symbol, so d = 2. We have n = k + 1 and so again k + d = n + 1.
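As a sanity check, here’s a short Python sketch, a toy example with q = 3 and k = 3 rather than any standard code, that verifies by brute force that the checksum code has minimum distance 2 and meets the Singleton bound:

```python
from itertools import product

q, k = 3, 3  # alphabet {0, 1, 2}, messages of length 3

def encode(msg):
    # Append a checksum symbol: the sum of the data symbols mod q
    return msg + (sum(msg) % q,)

words = [encode(m) for m in product(range(q), repeat=k)]

def dist(u, v):
    # Hamming distance: number of positions where u and v differ
    return sum(s != t for s, t in zip(u, v))

d = min(dist(u, v) for u in words for v in words if u != v)
n = k + 1
print(d, k + d == n + 1)  # d = 2 and k + d = n + 1
```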

The dual of the code that does no encoding is the code that transmits no information! It has a single code word of length n, so k = 0. You could say, vacuously, that d = n + 1, since there are no two distinct code words to compare. And again k + d = n + 1.

The dual of the checksum code is the code that repeats a single data symbol n times. Then d = n because different code words differ in all n positions. We have k = 1 since there is only one information symbol per block, and so k + d = n + 1.

Reed Solomon codes

So the stars of the MDS show are the Reed-Solomon codes. I haven’t said how to construct these codes because that deserves a post of its own. Maybe another time. For now I’ll just say a little about how they are used in applications.

As mentioned above, the Voyager probes used a Golay code to send back images. However, after leaving Saturn the onboard software was updated to use Reed-Solomon encoding. Reed-Solomon codes are used in more down-to-earth applications such as DVDs and cloud data storage.

Reed-Solomon codes are optimal block codes in terms of the Singleton bound, but block codes are not optimal in terms of Shannon’s theorem. LDPC (low density parity check) codes come closer to the Shannon limit, but some forms of LDPC encoding use Reed-Solomon codes as a component. So in addition to their direct use, Reed-Solomon codes have found use as building blocks for other encoding schemes.

Computing the area of a thin triangle

Heron’s formula computes the area of a triangle given the length of each side.

A = \sqrt{s(s-a)(s-b)(s-c)}

where

s = \frac{a + b + c}{2}

If you have a very thin triangle, one where two of the sides approximately equal s and the third side is much shorter, a direct implementation of Heron’s formula may not be accurate. The cardinal rule of numerical programming is to avoid subtracting nearly equal numbers, and that’s exactly what Heron’s formula does if s is approximately equal to two of the sides, say a and b.

William Kahan’s formula is algebraically equivalent to Heron’s formula, but is more accurate in floating point arithmetic. His procedure is to first sort the sides in decreasing order, then compute

A = \frac{1}{4} \sqrt{(a + (b + c))(c - (a - b))(c + (a - b))(a + (b - c))}

You can find this method, for example, in Nick Higham’s book Accuracy and Stability of Numerical Algorithms.

The algebraically redundant parentheses in the expression above are not numerically redundant. As we’ll demonstrate below, the method is less accurate without them.

Optimizing compilers respect the parentheses: the results are the same when the code below is compiled with gcc with no optimization (-O0) and with aggressive optimization (-O3). The same is true of Visual Studio in Debug and Release mode.

C code demo

First, here is a straightforward implementation of Heron’s formula.

    #include <math.h>
    #include <stdio.h>

    float heron1(float a, float b, float c) {
        float s = 0.5 * (a + b + c);
        return sqrt(s*(s - a)*(s - b)*(s - c));
    }

And here’s an implementation of Kahan’s version.

    void swap(float* a, float* b) {
        float t = *b;
        *b = *a;
        *a = t;
    }

    float heron2(float a, float b, float c) {
        // Sort a, b, c into descending order
        if (a < b) swap(&a, &b);
        if (a < c) swap(&a, &c);
        if (b < c) swap(&b, &c);

        float p = (a + (b + c))*(c - (a - b))*(c + (a - b))*(a + (b - c));
        return 0.25*sqrt(p);
    }

Finally, here’s an incorrect implementation of Kahan’s method, with “unnecessary” parentheses removed.

    float heron3(float a, float b, float c) {
        // Sort a, b, c into descending order
        if (a < b) swap(&a, &b);
        if (a < c) swap(&a, &c);
        if (b < c) swap(&b, &c);

        float p = (a + b + c)*(c - (a - b))*(c + a - b)*(a + b - c);
        return 0.25*sqrt(p);
    }

Now we call all three methods.

    int main() {
        float a = 100000.1, b = 100000.2, c = 0.3;
        printf("%0.8g\n", heron1(a, b, c));
        printf("%0.8g\n", heron2(a, b, c));
        printf("%0.8g\n", heron3(a, b, c));
        return 0;
    }

And for a gold standard, here is an implementation in bc with 40 decimal place precision.

    scale = 40

    a = 100000.1
    b = 100000.2
    c = 0.3

    s = 0.5*(a + b + c)
    sqrt(s*(s - a)*(s - b)*(s - c))

Here are the outputs of the various methods, in order of increasing accuracy.

    Heron: 14363.129
    Naive: 14059.268
    Kahan: 14114.293
    bc:    14142.157

Here “naive” means the incorrect implementation of Kahan’s method, heron3 above. The bc result had many more decimals but was rounded to the same precision as the C results.
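If you’d rather not compile C, the comparison can be sketched in Python with NumPy’s float32 standing in for C’s float. The results may differ slightly from the C output, since the C code promotes some intermediates to double, but the ranking is the same.

```python
import numpy as np

def heron(a, b, c):
    # Straightforward Heron's formula in single precision
    s = np.float32(0.5) * (a + b + c)
    return np.sqrt(s * (s - a) * (s - b) * (s - c))

def kahan(a, b, c):
    # Kahan's rearrangement: sort sides in decreasing order first
    a, b, c = sorted([a, b, c], reverse=True)
    p = (a + (b + c)) * (c - (a - b)) * (c + (a - b)) * (a + (b - c))
    return np.float32(0.25) * np.sqrt(p)

a, b, c = map(np.float32, (100000.1, 100000.2, 0.3))
exact = 14142.157  # high-precision value from bc
print(heron(a, b, c), kahan(a, b, c))  # Kahan is much closer to 14142.157
```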

Related post: How to compute the area of a polygon

Computing parity of a binary word

The previous post mentioned adding a parity bit to a string of bits as a way of detecting errors. The parity of a binary word is 1 if the word contains an odd number of 1s and 0 if it contains an even number of ones.

Codes like the Hamming codes in the previous post can have multiple parity bits. In addition to the parity of a word, you might want to also look at the parity of a bit mask AND’d with the word. For example, the Hamming(7, 4) code presented at the end of that post has three parity bits. For a four-bit integer x, the first parity bit is the parity bit of

    x & 0b1011

the second is the parity bit of

    x & 0b1101

and the third is the parity of

    x & 0b1110

These three bit masks are the last three columns of the generator matrix for the Hamming(7, 4) code:

    1 0 0 0 1 1 1
    0 1 0 0 0 1 1
    0 0 1 0 1 0 1
    0 0 0 1 1 1 0

More generally, any linear operation on a vector of bits is given by multiplication by a binary matrix, where arithmetic is carried out mod 2. Matrix products can be defined in terms of inner products, and the inner product of words x and y is given by the parity of x&y.
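For example, here is a small Python sketch of the inner-product-as-parity idea, applied to the three masks above (the message value 0b1010 is just an arbitrary example):

```python
def parity(x):
    # 1 if x has an odd number of 1 bits, 0 otherwise
    return bin(x).count("1") % 2

def inner(x, y):
    # Inner product mod 2 of bit vectors packed into integers
    return parity(x & y)

msg = 0b1010  # an example 4-bit message
bits = [inner(msg, m) for m in (0b1011, 0b1101, 0b1110)]
print(bits)  # [0, 1, 0]
```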

Given that parity is important to compute, how would you compute it?

If you have a popcount function, you could read the last bit of the popcount. Since popcount counts the number of ones, the parity of x is 1 if popcount(x) is odd and 0 otherwise.

gcc extensions

In the earlier post on popcount, we mentioned three functions that gcc provides to compute popcount:

  • __builtin_popcount
  • __builtin_popcountl
  • __builtin_popcountll

These three functions return popcount for an unsigned int, an unsigned long, and an unsigned long long respectively.

In each case the function will call a function provided by the target processor if it is available, and will run its own code otherwise.

There are three extensions for computing parity that are completely analogous:

  • __builtin_parity
  • __builtin_parityl
  • __builtin_parityll

Stand-alone code

If you want your own code for computing parity, Hacker’s Delight gives the following for a 32-bit integer x.

    y = x ^ (x >> 1);
    y = y ^ (y >> 2);
    y = y ^ (y >> 4);
    y = y ^ (y >> 8);
    y = y ^ (y >> 16);

The last bit of y is the parity of x.
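Here is the same shift-and-XOR folding sketched in Python, checked against a straightforward bit count:

```python
def parity(x):
    # Fold the 32 bits of x down with XOR; the last bit is the parity
    y = x ^ (x >> 1)
    y ^= y >> 2
    y ^= y >> 4
    y ^= y >> 8
    y ^= y >> 16
    return y & 1

ok = all(parity(x) == bin(x).count("1") % 2 for x in range(1 << 12))
print(ok)  # True
```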

More bit twiddling posts

Popcount: counting 1’s in a bit stream

Sometimes you need to count the number of 1’s in a stream of bits. The most direct application would be summarizing yes/no data packed into bits. It’s also useful in writing efficient, low-level bit twiddling code. But there are less direct applications as well. For example, three weeks ago this came up in a post I wrote about Pascal’s triangle.

The number of odd integers in the nth row of Pascal’s triangle equals 2^b where b is the number of 1’s in the binary representation of n.

The function that takes a bit stream and returns the number of 1’s is commonly called popcount, short for population count.

Formula using floor

Here’s an interesting formula for popcount taken from Hacker’s Delight:

\mbox{popcount}(x) = x - \sum_{n=1}^\infty \left\lfloor \frac{x}{2^n} \right\rfloor

The sum is actually finite since after a certain point all the terms are zero. For example, if x = 13 (1101 in binary), then the right side is

13 – 6 – 3 – 1

which of course is 3.
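The formula translates directly into Python, where each term ⌊x/2ⁿ⌋ is a right shift:

```python
def popcount_floor(x):
    # popcount(x) = x - sum of floor(x / 2**n) over n >= 1
    total, term = x, x >> 1
    while term:
        total -= term
        term >>= 1
    return total

print(popcount_floor(13))  # 13 - 6 - 3 - 1 = 3
```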

Computing with gcc extensions

The gcc compiler has a function __builtin_popcount that takes an unsigned integer x and returns the number of bits in x set to 1. If the target platform has an instruction for computing popcount then the compiler will generate code to call this instruction. Otherwise it uses library code.

For example, the following code prints 6, 1, and 2.

    #include <stdio.h>

    int main() {
        for (unsigned int x = 63; x < 66; x++)
            printf("%d\n", __builtin_popcount(x));
        return 0;
    }
There are also functions __builtin_popcountl and __builtin_popcountll for unsigned long and unsigned long long arguments.

Computing in C

If you want to use your own C code to compute popcount rather than relying on compiler extensions, here are a couple possibilities taken from Hacker’s Delight. First, one that only requires unsigned ints.

    int pop1(unsigned x) {
        x -= ((x >> 1) & 0x55555555);
        x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
        x = (x + (x >> 4)) & 0x0F0F0F0F;
        x += (x >> 8);
        x += (x >> 16);    
        return x & 0x0000003F;
    }
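Since the masks and shifts above never overflow, the same steps can be cross-checked in Python, where integers are unbounded:

```python
def pop1(x):
    # Divide-and-conquer bit counting, mirroring the C version, for 32-bit x
    x -= (x >> 1) & 0x55555555
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333)
    x = (x + (x >> 4)) & 0x0F0F0F0F
    x += x >> 8
    x += x >> 16
    return x & 0x3F

print(pop1(0xFFFFFFFF))  # 32
```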

And here is a more elegant version that uses an unsigned long long.

    int pop2(unsigned x) {
        // Note: only valid for x < 2^15; larger arguments
        // overflow the 64-bit intermediate.
        unsigned long long y;
        y = x * 0x0002000400080010ULL;
        y = y & 0x1111111111111111ULL;
        y = y * 0x1111111111111111ULL;
        y = y >> 60;
        return y;
    }

Runge-Kutta methods and Butcher tableau

If you know one numerical method for solving ordinary differential equations, it’s probably Euler’s method. If you know two methods, the second is probably 4th order Runge-Kutta. It’s standard in classes on differential equations or numerical analysis to present Euler’s method as a conceptually simple but inefficient introduction, then to present Runge-Kutta as a complicated but efficient alternative.

Runge-Kutta methods are a huge family of numerical methods with a wide variety of trade-offs: efficiency, accuracy, stability, etc. Euler’s method is a member of the Runge-Kutta family as are countless other variations. You could devote a career to studying Runge-Kutta methods, and some people have.

Beneath the complexity and variety, all Runge-Kutta methods have a common form that can be summarized by a matrix and two vectors. For explicit Runge-Kutta methods (ERK) the matrix is triangular, and for implicit Runge-Kutta methods (IRK) the matrix is full.

This summary of an RK method is known as a Butcher tableau, named after J. C. Butcher who classified RK methods.

“The” Runge-Kutta method

For example, let’s start with what students often take to be “the” Runge-Kutta method. This method approximates solutions to a differential equation of the form

y' = f(t, y)

by

y_{n+1} = y_n + \frac{h}{6}\left( k_{n1} + 2k_{n2} + 2k_{n3} + k_{n4}\right)

where

\begin{align*} k_{n1} &= f(t_n, y_n) \\ k_{n2} &= f(t_n + 0.5h, y_n + 0.5hk_{n1}) \\ k_{n3} &= f(t_n + 0.5h, y_n + 0.5hk_{n2}) \\ k_{n4} &= f(t_n + h, y_n + hk_{n3}) \end{align*}

The Butcher tableau for this ERK method is

\begin{array} {c|cccc} 0\\ 1/2 & 1/2\\ 1/2 &0 &1/2 \\ 1& 0& 0& 1\\ \hline & 1/6 & 1/3 & 1/3 &1/6 \end{array}

The numbers along the left side are the coefficients of h in the first argument of f.

The numbers along the bottom are the coefficients of the ks in the expression for the value of y at the next step.

The numbers in the middle of the array are the coefficients of the ks in the second argument of f. Because this is an explicit method, each k only depends on the previous ks, and so the table of coefficients has a triangular form.

Runge-Kutta 3/8 rule

The method above is the most common 4th order ERK rule, but there is another known as the 3/8 rule. It is a little less efficient and a little more accurate. A step of this rule is given by

y_{n+1} = y_n + \frac{h}{8}\left( k_{n1} + 3k_{n2} + 3k_{n3} + k_{n4}\right)

where

\begin{align*} k_{n1} &= f(t_n, y_n) \\ k_{n2} &= f(t_n + \frac{h}{3}, y_n + \frac{h}{3}k_{n1}) \\ k_{n3} &= f(t_n +\frac{2h}{3}, y_n -\frac{h}{3}k_{n1} + hk_{n2}) \\ k_{n4} &= f(t_n + h, y_n + h k_{n1} - h k_{n2} + hk_{n3}) \end{align*}

This method is summarized in the following Butcher tableau.

\begin{array} {c|cccc} 0\\ 1/3 & 1/3\\ 2/3 & -1/3 &1 \\ 1& 1& -1 & 1\\ \hline & 1/8 & 3/8 & 3/8 &1/8 \end{array}

This example makes it a little easier to see what’s going on since none of the coefficients in the triangular array are zero. Full detail is given in the section below.

General Explicit Runge-Kutta

The most general form of an ERK rule with s stages is

y_{n+1} = y_n + h \sum_{i=1}^s b_i k_{ni}

where

k_{ni} = f\left(t_n + c_i h, y_n + h \sum_{j=1}^{i-1} a_{ij} k_{nj}\right)

and the Butcher tableau is

\begin{array} {c|ccccc} 0\\ c_2 & a_{21}\\ c_3 & a_{31} & a_{32} \\ \vdots & \vdots & & \ddots \\ c_s& a_{s1}& a_{s2} & \cdots & a_{s,s-1}\\ \hline & b_1 & b_2 & \cdots & b_{s-1} & b_s \end{array}
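To make the notation concrete, here is a Python sketch of a generic ERK step driven by a Butcher tableau. It uses the classic RK4 coefficients from the first tableau above to solve y′ = y with y(0) = 1, whose exact value at t = 1 is e.

```python
import numpy as np

def erk_step(f, t, y, h, A, b, c):
    # One explicit Runge-Kutta step defined by the Butcher tableau (A, b, c)
    s = len(b)
    k = np.zeros(s)
    for i in range(s):
        # Each k depends only on the previous ks: A is strictly lower triangular
        k[i] = f(t + c[i]*h, y + h*np.dot(A[i, :i], k[:i]))
    return y + h*np.dot(b, k)

# Classic RK4 tableau
A = np.array([[0, 0, 0, 0], [1/2, 0, 0, 0], [0, 1/2, 0, 0], [0, 0, 1, 0]])
b = np.array([1/6, 1/3, 1/3, 1/6])
c = np.array([0, 1/2, 1/2, 1])

y, h = 1.0, 0.1
for n in range(10):  # integrate y' = y from t = 0 to t = 1
    y = erk_step(lambda t, y: y, n*h, y, h, A, b, c)
print(y)  # close to e = 2.71828...
```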

General implicit Runge-Kutta

With explicit (ERK) methods, each k depends only on its predecessors. With implicit (IRK) methods each k potentially depends on each of the others. The matrix in the tableau is full, not triangular, and one must solve for the ks. Now

k_{ni} = f\left(t_n + c_i h, y_n + h \sum_{j=1}^s a_{ij} k_{nj}\right)

with the sum going all the way up to s, and the Butcher tableau is

\begin{array} {c|ccccc} c_1 & a_{11} & a_{12} & \cdots & a_{1s} \\ c_2 & a_{21} & a_{22} & \cdots & a_{2s} \\ \vdots & \vdots & & \ddots & \vdots \\ c_s& a_{s1}& a_{s2} & \cdots & a_{s,s}\\ \hline & b_1 & b_2 & \cdots & b_{s} \end{array}

Implicit methods are more complicated to implement, and require more computation for a given step size. However, they are more stable for stiff differential equations and may allow larger steps. Implicit methods are less efficient when they’re not needed, and more efficient when they are needed.

Back to Euler’s method

I said at the top of the post that Euler’s method was a special case of Runge-Kutta. The Butcher tableau for the explicit (forward) Euler method is simply

 \begin{array} {c|c} 0 & 0\\ \hline & 1\end{array}

and the tableau for the implicit (backward) Euler method is just

\begin{array} {c|c} 1 & 1\\ \hline & 1\end{array}

In this post I say more about these two methods and compare their stability.

More on differential equations

TestU01 small crush test suite

In recent posts I’ve written about using RNG test suites on the output of the μRNG entropy extractor. This is probably the last post in the series. I’ve looked at NIST STS, PractRand, and DIEHARDER before. In this post I’ll be looking at TestU01.

TestU01 includes three batteries of tests: Small Crush, Crush, and Big Crush. The entropy extractor failed the smallest of the three, so I didn’t go on to the larger suites. Small Crush isn’t small; it used over 200 million 32-bit samples as input, about 0.84 GB of data. Crush uses two orders of magnitude more data, and Big Crush uses another order of magnitude more data than Crush.

SmallCrush consists of 10 tests:

  • smarsa_BirthdaySpacings
  • sknuth_Collision
  • sknuth_Gap
  • sknuth_SimpPoker
  • sknuth_CouponCollector
  • sknuth_MaxOft
  • svaria_WeightDistrib
  • smarsa_MatrixRank
  • sstring_HammingIndep
  • swalk_RandomWalk1

The test names begin with s, followed by a prefix indicating the origin of the test. For example, knuth refers to Donald Knuth’s tests in volume 2 of TAOCP and marsa refers to George Marsaglia. The remainder of the name is more descriptive, such as SimpPoker for Knuth’s simple poker test.

The output of the entropy extractor failed four of the tests, failure being defined as producing a p-value less than 10^-300. The other tests passed without issue, meaning they returned p-values in the range [0.001, 0.999].

Recall from earlier posts that the μRNG entropy extractor takes three possibly biased bit streams and produces an unbiased bit stream, provided each of the input streams has min-entropy of at least 1/3. I produced biased streams by taking the bitwise OR of two consecutive values, producing a stream with probability 0.75 of being a 1 and probability 0.25 of being a 0. The result passed all STS and DIEHARDER tests, but failed some PractRand and TestU01 SmallCrush tests. This is consistent with the generally held opinion that STS and DIEHARDER are relatively weak tests and PractRand and TestU01 are more rigorous tests.

I applied the entropy extractor to PCG without creating a biased stream, and the result passed PractRand and TestU01 SmallCrush. Presumably it would have passed STS and DIEHARDER as well. This confirms that the extractor does no harm to a high-quality stream of pseudorandom bits. It largely removes the bias from biased streams, enough to pass the easier two test suites but not enough to pass the two more demanding test suites.

Previous posts in this series

Testing entropy extractor with NIST STS

Around this time last year I wrote about the entropy extractor used in μRNG. It takes three biased random bit streams and returns an unbiased bit stream, provided each input stream has at least 1/3 of a bit of min-entropy per bit.

I’ve had in the back of my mind that I should go back and run the output of the extractor through a standard test suite. I’ve been doing more RNG testing lately, and so while the software is fresh on my mind I wanted to go back and test the entropy extractor.

To create a biased bit stream, I first created an unbiased bit stream using the PCG random number generator, then took the bitwise OR of consecutive samples. In C notation, I created two 32-bit unsigned integers u and v and used u|v. The resulting bits are 1’s three out of four times since a bit is 0 only if both corresponding bits are 0’s. The min-entropy of the resulting stream is

-log2 max(0.25, 0.75) = 0.415

which is larger than 1/3, so we should be able to apply the μRNG entropy extractor. To do this, we create three bit streams. Let a, b, and c be 8-bit bytes, each from a different stream. Then we combine these into

a×b + c

and use that as one byte of output. So it takes three bytes of input to produce one byte of output, which is to be expected since we’re starting with sources that may contain only 1/3 of a bit of min-entropy per bit.

The multiplication and addition above are carried out in the Galois field GF(2^8). This means that multiplication may be like nothing you expect, and addition is XOR, i.e. bitwise exclusive OR. The multiplication is the same as that used in AES encryption.
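The pieces above can be sketched in Python. This is only a sketch, not the μRNG implementation: the standard library’s generator stands in for PCG, and multiplication uses the AES reduction polynomial x^8 + x^4 + x^3 + x + 1, i.e. 0x11B.

```python
import random
from math import log2

random.seed(20191011)  # arbitrary fixed seed for reproducibility

def biased_word():
    # OR two 32-bit words: each bit is 1 with probability 3/4
    return random.getrandbits(32) | random.getrandbits(32)

# Empirical fraction of 1 bits and the corresponding min-entropy
words = [biased_word() for _ in range(10_000)]
ones = sum(bin(w).count("1") for w in words) / (32 * 10_000)
min_entropy = -log2(max(ones, 1 - ones))  # about 0.415, larger than 1/3

def gf256_mul(a, b):
    # Carry-less "peasant" multiplication in GF(2^8), reducing by 0x11B
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return p

def extract_byte(a, b, c):
    # One output byte of the extractor: a*b + c, addition being XOR
    return gf256_mul(a, b) ^ c

print(round(ones, 3), round(min_entropy, 3))
print(hex(gf256_mul(0x53, 0xCA)))  # 0x1: these bytes are inverses in the AES field
```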

NIST Statistical Test Suite

There are several test suites we could use—DIEHARDER, PractRand, TestU01, etc.—and I expect I’ll write more about those before long, but for this post we’ll focus on the NIST Statistical Test Suite or STS. (Update: the extractor fails hard on PractRand.) The STS suite includes the following 15 tests.

  • Frequency (monobit)
  • Frequency test within a block
  • Runs test
  • Test for the longest run of ones in a block
  • Binary matrix rank test
  • Discrete Fourier Transform (spectral) test
  • Non-overlapping template matching test
  • Overlapping template matching test
  • Maurer’s universal statistical test
  • Linear complexity test
  • Serial test
  • Approximate entropy test
  • Cumulative sums (cumsum) test
  • Random excursions test
  • Random excursions variant test

See NIST Special Publication 800-22: A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications.

When we run the output of PCG alone through STS, it passes with flying colors. If we run the output of PCG after OR’ing pairs of unsigned integers together, the test fails spectacularly, spewing underflow warnings. But if we take three streams of biased bits that would each fail and keep the extracted output, it passes all the tests, just as with the original output of PCG.

In each case I tested one million 32-bit unsigned integers. In the biased case I sampled three million integers.

Note that in our example we know the amount of bias because we deliberately created the bias. You might not know the bias, or equivalently the min-entropy, in general. As long as the min-entropy is greater than 1/3, the entropy extractor works.

More on RNG testing

Stiff differential equations

There is no precise definition of what it means for a differential equation to be stiff, but essentially it means that implicit methods will work much better than explicit methods. The first use of the term [1] defined stiff equations as

equations where certain implicit methods, in particular BDF, perform better, usually tremendously better, than explicit ones.

We’ll explain what it means for a method to be explicit or implicit and what BDF means. And we’ll give an example where the simplest implicit method performs much better than the simplest explicit method. Here’s a plot for a little foreshadowing.

stiff ODE solutions

Euler’s method

Suppose you have a first order differential equation

y'(x) = f(x, y)

with initial condition y(0) = c. The simplest numerical method for solving differential equations is Euler’s method. The idea is to replace the derivative y′ with a finite difference

y'(x) \approx \frac{y(x + h) - y(x)}{h}

and solve for y(x + h). You start out knowing y(0), then you solve for y(h), then start over and use y(h) to solve for y(2h), etc.

Explicit Euler method

There are two versions of Euler’s method. The explicit Euler method uses a forward difference to approximate the derivative and the implicit Euler method uses a backward difference.

Forward difference means that at a given point x, we approximate the derivative by moving ahead a step h

y'(x) \approx \frac{y(x + h) - y(x)}{h}

and evaluating the right hand side of the differential equation at the current values of x and y, i.e.

\frac{y(x + h) - y(x)}{h} = f(x, y)

and so

y(x + h) = y(x) + h f(x, y)

Implicit Euler method

The idea of the implicit Euler method is to approximate the derivative with a backward difference

y'(x) \approx \frac{y(x) - y(x- h)}{h}

which leads to

y(x) = y(x - h) + h f(x, y(x))

or equivalently

y(x + h) = y(x) + h f(x + h, y(x + h)).

The text quoted at the top of the post referred to BDF. That stands for backward difference formula and we have an example of that here. More generally the solution at a given point could depend on the solutions more than one time step back [2].

This is called an implicit method because the solution at the next time step,

y(x + h)

appears on both sides of the equation. That is, we’re not given the solution value explicitly but rather we have to solve for it. This could be complicated depending on the nature of the function f. In the example below it will be simple.

Example with Python

We’ll take as our example the differential equation

y' = -50(y - \cos(x))

with initial condition y(0) = 0.

The exact solution, written in Python, is

    from numpy import sin, cos, exp

    def soln(x):
        return (50/2501)*(sin(x) + 50*cos(x)) - (2500/2501)*exp(-50*x)

Here’s the plot again from the beginning of the post. Note that except at the very beginning, the difference between the implicit method approximation and exact solution is too small to see.


stiff ODE solutions

In the plot above, we divided the interval [0, 1] into 26 steps using 27 evenly spaced points.

    from numpy import linspace

    x = linspace(0, 1, 27)

and implemented the explicit and implicit Euler methods as follows.

    from numpy import zeros_like, cos

    def euler_explicit(x):
        y = zeros_like(x)
        h = x[1] - x[0]
        for i in range(1, len(x)):
            y[i] = y[i-1] - 50*h*(y[i-1] - cos(x[i]))
        return y

    def euler_implicit(x):
        y = zeros_like(x)
        h = x[1] - x[0]
        for i in range(1, len(x)):
            y[i] = (y[i-1] + 50*h*cos(x[i])) / (50*h + 1)
        return y
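Putting the pieces above together, we can confirm numerically that the implicit method is far more accurate over the whole interval (the snippet repeats the definitions so it runs on its own):

```python
import numpy as np

def soln(x):
    # Exact solution of y' = -50(y - cos x), y(0) = 0
    return (50/2501)*(np.sin(x) + 50*np.cos(x)) - (2500/2501)*np.exp(-50*x)

def euler_explicit(x):
    y = np.zeros_like(x)
    h = x[1] - x[0]
    for i in range(1, len(x)):
        y[i] = y[i-1] - 50*h*(y[i-1] - np.cos(x[i]))
    return y

def euler_implicit(x):
    y = np.zeros_like(x)
    h = x[1] - x[0]
    for i in range(1, len(x)):
        y[i] = (y[i-1] + 50*h*np.cos(x[i])) / (50*h + 1)
    return y

x = np.linspace(0, 1, 27)
err_explicit = np.abs(euler_explicit(x) - soln(x)).max()
err_implicit = np.abs(euler_implicit(x) - soln(x)).max()
print(err_explicit, err_implicit)  # the implicit error is far smaller
```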

Here’s a plot of the absolute error in both solution methods. The error for the implicit method, too small to see in the plot, is on the order of 0.001.

Error in numerical solutions

In the plots above, we use a step size h = 1/26. The explicit method multiplies the deviation from the exact solution by roughly 1 − 50h at each step, so the oscillations decay only if |1 − 50h| < 1, i.e. only if h < 1/25. With a step size h = 1/25 the oscillations in the explicit method solution do not decay.

stiff ODE solutions

And with a step size h = 1/24 the oscillations grow.

stiff ODE solutions

The explicit method can work well enough if we take the step size smaller, say h = 1/50. But smaller step sizes mean more work, and in some contexts it is not practical to use a step size so small that an explicit method will work adequately on a stiff ODE. And if we do take h = 1/50 for both methods, the error is about 10x larger for the explicit method.

By contrast, the implicit method does fairly well even with h as large as 1/5.

(The “exact” solution here means the analytical solution sampled at six points just as the numerical solution is available at six points. The actual solution is smooth; the sharp corner comes from sampling the function at too few points for it to look smooth.)

stiff ODE solutions

Related posts

[1] C. F. Curtiss and J. O. Hirschfelder (1952). Integration of stiff equations. Proceedings of the National Academy of Sciences. Vol 38, pp. 235–243.

[2] This may be confusing because we’re still evaluating y at x + h, and so you might think there’s nothing “backward” about it. The key is that we are approximating the derivative of y at points that are backward relative to where we are evaluating f(x, y). On the right side, we are evaluating y at an earlier point than we are evaluating f(x, y).