Predicting when an RNG will output a given value

A few days ago I wrote about how to pick the seed of a simple random number generator so that a desired output came n values later. The number n was fixed and we varied the seed. In this post, the seed will be fixed and we’ll solve for n. In other words, we ask when a pseudorandom sequence will produce a given value.

In principle you could just run the RNG until you get the output you’re looking for, but we’ll assume such a brute force approach is not feasible or at least not fast enough.

If a LCG (linear congruential generator) has seed z, multiplier a, and modulus m, then the nth output is an z reduced mod m. So our task is to solve

x = an z mod m

for n. If we forget for a moment that we’re working with integers mod m, we see that the solution is

n = loga (x / z)

We can actually make this work if we interpret division by z to mean multiplication by the inverse of z mod m and if we interpret the logarithm to be a discrete logarithm. For more on discrete logarithms and one algorithm for computing them, see this post.

In an earlier post I used  a = 742938285 and m = 231 – 1 = 2147483647. We set n = 100 and solved for z to make the 100th output equal to 20170816, the date of that post. It turned out that z = 1898888478.

Now let’s set the seed z = 1898888478 and ask when the LCG sequence will take on the value x = 20170816. Of course we know that n will turn out to be 100, but let’s pretend we don’t know that. We’ll write a little Python script to find n.

I expect there’s a simple way to compute modular inverses using SymPy, but I haven’t found it, so I used some code from StackOverflow.

The following Python code produces n = 100, as expected.

from sympy.ntheory import discrete_log

def egcd(a, b):
    if a == 0:
        return (b, 0, 1)
    else:
        g, y, x = egcd(b % a, a)
        return (g, x - (b // a) * y, y)

def modinv(a, m):
    g, x, y = egcd(a, m)
    if g != 1:
        raise Exception('modular inverse does not exist')
    else:
        return x % m

a = 742938285
z = 1898888478
m = 2**31 - 1
x = 20170816
zinv = modinv(z, m)

n = discrete_log(m, x*zinv, a)
print(n)

Reverse engineering the seed of a linear congruential generator

The previous post gave an example of manipulating the seed of a random number generator to produce a desired result. This post will do something similar for a different generator.

A couple times I’ve used the following LCG (linear congruential random number generator) in examples. An LCG starts with an initial value of z and updates z at each step by multiplying by a constant a and taking the remainder by m. The particular LCG I’ve used has a = 742938285 and m = 231 − 1 = 2147483647.

(I used these parameters because they have maximum range, i.e. every positive integer less than m appears eventually, and because it was one of the LCGs recommended in an article years ago. That is, it’s very good as far as 32-bit LCGs go, which isn’t very far. An earlier post shows how it quickly fails the PractRand test suite.)

Let’s pick the seed so that the 100th output of the generator is today’s date in ISO form: 20170816.

We need to solve

a100z = 20170816 mod 2147483647.

By reducing  a100 mod 2147483647 we  find we need to solve

160159497 z = 20170816 mod 2147483647

and the solution is z = 1898888478. (See How to solve linear congruences.)

The following Python code verifies that the solution works.

    a = 742938285
    z = 1898888478
    m = 2**31 - 1

    for _ in range(100):
        z = a*z % m
    print(z)

Update: In this post, I kept n = 100 fixed and solved for the seed to give a specified output n steps later. In a follow up post I do the opposite, fixing the seed and solving for n.

Manipulating a random number generator

With some random number generators, it’s possible to select the seed carefully to manipulate the output. Sometimes this is easy to do. Sometimes it’s hard but doable. Sometimes it’s theoretically possible but practically impossible.

In my recent correspondence with Melissa O’Neill, she gave me an example that seeds a random number generator so that the 9th and 10th outputs produce the ASCII code for my name.

Here’s the code. The function next is the xoroshiro128+ (XOR/rotate/shift/rotate) random number generator. The global array s contains the state of the random number generator. Its initial values are the seeds of the generator.

#include <cstdio>
#include <cstdint>

// xoroshiro generator taken from
// http://vigna.di.unimi.it/xorshift/xoroshiro128plus.c

uint64_t s[2];

static inline uint64_t rotl(const uint64_t x, int k) {
	return (x << k) | (x >> (64 - k));
}

uint64_t next(void) {
	const uint64_t s0 = s[0];
	uint64_t s1 = s[1];
	const uint64_t result = s0 + s1;

	s1 ^= s0;
	s[0] = rotl(s0, 55) ^ s1 ^ (s1 << 14); // a, b
	s[1] = rotl(s1, 36); // c

	return result;
}

int main() {
    freopen(NULL, "wb", stdout); 

    s[0] = 0x084f31240ed2ec3f;
    s[1] = 0x0aa0d69470975eb8;

    while (1) {
        uint64_t value = next();
        fwrite((void*) &value, sizeof(value), 1, stdout);
    }
}

Compile this code then look at a hex dump of the first few outputs. Here’s what you get:

cook@mac> gcc xoroshiro.cpp
cook@mac> ./a.out | hexdump -C | head
f7 4a 6a 7f b8 07 f0 12  f8 96 e1 af 29 08 e3 c8  |.Jj.........)...|
15 0e b0 38 01 ef b2 a7  bb e9 6f 4d 77 55 d7 a0  |...8......oMwU..|
49 08 f2 fc 0c b2 ea e8  48 c2 89 1b 31 3f d7 3d  |I.......H...1?.=|
11 eb bc 5e ee 98 b6 3b  d9 d1 cc 15 ae 00 fc 2f  |...^...;......./|
3e 20 4a 6f 68 6e 20 44  2e 20 43 6f 6f 6b 20 3c  |> John D. Cook <| 
d1 80 49 27 3e 25 c2 4b  2a e3 78 71 9c 9e f7 18  |..I'>%.K*.xq....|
0b bb 1f c6 1c 71 79 29  d6 45 81 47 3b 88 4f 42  |.....qy).E.G;.OB|
7c 1c 19 c4 22 58 51 2d  d7 74 69 ac 36 6f e0 3f  ||..."XQ-.ti.6o.?|
78 7f a4 14 1c bc a8 de  17 e3 f7 d8 0c de 2c ea  |x.............,.|
a2 37 83 f9 38 e4 14 77  e3 6a c8 52 d1 0c 29 01  |.7..8..w.j.R..).|

(I cut out the line numbers on the left side to make the output fit better horizontally on the page.)

Not only did one pair of seed values put my name in the output, another pair would work too. If you change the seed values to

s[0] = 0x820e8a6c1baf5b13;
s[1] = 0x5c51f1c4e2e64253;

you’ll also see my name in the output:

66 9d 95 fe 30 7c 60 de 7c 89 0c 6f cd d6 05 1e |f...0|`.|..o....|
2b e9 9c cc cd 3d c5 ec 3f e3 88 6c a6 cd 78 84 |+....=..?..l..x.|
20 12 ac f2 2b 3c 89 73 60 09 8d c3 85 68 9e eb | ...+<.s`....h..|
15 3e c1 0b 07 68 63 e5 73 a7 a8 f2 4b 8b dd d0 |.>...hc.s...K...|
3e 20 4a 6f 68 6e 20 44 2e 20 43 6f 6f 6b 20 3c |> John D. Cook <|
3f 71 cf d7 5f a6 ab cf 9c 81 93 d1 3d 4c 9e 41 |?q.._.......=L.A|
0d b5 48 9c aa fc 84 d8 c6 64 1d c4 91 79 b4 f8 |..H......d...y..|
0c ac 35 48 fd 38 01 dd cc a4 e4 43 c6 5b e8 c7 |..5H.8.....C.[..|
e1 2e 76 30 0f 9d 41 ff b0 99 84 8d c1 72 5a 91 |..v0..A......rZ.|
06 ea f2 8e d7 87 e5 2c 53 a9 0c a0 f4 cd d1 9b |.......,S.......|

Note that there are limits to how much you can manipulate the output of an RNG. In this case the seed was selected to produce two particular output values, but the rest of the sequence behaves as if it were random. (See Random is as random does.)

DIEHARDER random number generator test results for PCG and MWC

A few days ago I wrote about testing the PCG random number generator using the DIEHARDER test suite. In this post I’ll go into a little more background on this random number generator test suite. I’ll also show that like M. E. O’Neill’s PCG (“permuted congruential generator”), George Marsaglia’s MWC (“multiply with carry”) generator does quite well.

This is not to say that MWC is the best generator for every purpose, but any shortcomings of MWC are not apparent from DIEHARDER. The PCG family of generators, for example, is apparently superior to MWC, but you couldn’t necessarily conclude that from these tests.

Unless your application demands more of a random number generator than these tests do, MWC is probably adequate for your application. I wouldn’t recommend it for cryptography or for high-dimensional integration by darts, but it would be fine for many common applications.

DIEHARDER test suite

George Marsaglia developed the DIEHARD battery of tests in 1995. Physics professor Robert G. Brown later refined and extended Marsaglia’s original test suite to create the DIEHARDER suite. (The name of Marsaglia’s battery of tests was a pun on the Diehard car battery. Brown continued the pun tradition by naming his suite after Die Harder, the sequel to the movie Die Hard.) The DIEHARDER suite is open source. It is designed to be at least as rigorous as the original DIEHARD suite and in some cases more rigorous.

There are 31 distinct kinds of tests in the DIEHARDER suite, but some of these are run multiple times. In total, 114 tests are run.

Marsaglia’s MWC

The main strength of Marsaglia’s MWC algorithm is that it is very short. The heart of the code is only three lines:

    m_z = 36969 * (m_z & 65535) + (m_z >> 16);
    m_w = 18000 * (m_w & 65535) + (m_w >> 16);
    return (m_z << 16) + m_w;

The heart of PCG is also very short, but a bit more mysterious.

    rng->state = oldstate * 6364136223846793005ULL + (rng->inc | 1);
    uint32_t xorshifted = ((oldstate >> 18u) ^ oldstate) >> 27u;
    uint32_t rot = oldstate >> 59u;
    return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));

(These are the 64-bit state versions of MWC and PCG. Both have versions based on larger state.)

Because these generators require little code, they’d be relatively easy to step into with a debugger, compared to other RNGs such as Mersenne Twister that require more code and more state.

Test results

Out of the 114 DIEHARDER tests run on MWC, all but three returned a pass, and the rest returned a weak pass.

A few weak passes are to be expected. The difference between pass, weak pass, and failure is whether a p-value falls below a certain threshold. Theory says that ideally p-values would uniformly distributed, and so one would fall outside the region for a strong pass now and then.

Rather than counting strong and weak passes, let’s look at the p-values themselves. We’d expect these to be uniformly distributed. Let’s see if they are.

Here are the p-values reported by the DIEHARDER tests for MWC:

Histogram of p-values for MWC

Here are the corresponding values for PCG:

Histogram of p-values for PCG

Neither test has too many small p-values, no more than we’d expect. This is normally what we’re concerned about. Too many small p-values would indicate that the generated samples are showing behavior that would be rare for truly random input.

But both sets of test results have a surprising number of large p-values. Not sure what to make of that. I suspect it says more about the DIEHARDER test suite than the random number generators being tested.

Update: I went back to look at some results from Mersenne Twister to see if this pattern of large p-values persists there. It does, and in fact the p-values are even more biased toward the high end for Mersenne Twister.

Histogram of Mersenne Twister p-values

One thing I noticed is that the large p-values are disproportionately coming from some of the same tests each time. In particular, the repetitions of thests_serial test have an unexpectedly high number of large p-values for each generator.

If you’d like someone to run DIEHARDER or other RNG test suites, let’s talk. We have run these tests for other clients and we’d be glad to run them for you.

Testing the PCG random number generator

M. E. O’Neill’s PCG family of random number generators looks very promising. It appears to have excellent statistical properties, and it takes remarkably little code to implement. (PCG stands for Permuted Congruential Generator.)

The journal article announcing PCG gives the results of testing it with the TestU01 test suite. I wanted to try it out by testing it with the DIEHARDER test suite (Robert G. Brown’s extension of George Marsaglia’s DIEHARD test suite) and the NIST Statistical Test Suite. I used what the generator’s website calls the “minimal C implementation.”

The preprint of the journal article is dated 2015 but apparently hasn’t been published yet.

Update: See the very informative note by the author of PCG in the comments below.

For the NIST test suite, I generated 10,000,000 bits and divided them into 10 streams.

For the DIEHARDER test suite, I generated 800,000,000 unsigned 32-bit integers. (DIEHARDER requires a lot of random numbers as input.)

For both test suites I used the seed (state) 20170707105851 and sequence constant (inc) 42.

The PCG generator did well on all the NIST tests. For every test, at least 9 out of 10 streams passed. The test authors say you should expect at least 8 out of 10 streams to pass.

Here’s an excerpt from the results. You can find the full results here.

--------------------------------------------------------
 C1  C2  C3 ...  C10  P-VALUE  PROPORTION  STATISTICAL TEST
--------------------------------------------------------
  2   0   2        0  0.213309     10/10   Frequency
  0   0   1        3  0.534146     10/10   BlockFrequency
  3   0   0        0  0.350485     10/10   CumulativeSums
  1   1   0        2  0.350485     10/10   CumulativeSums
  0   2   2        1  0.911413     10/10   Runs
  0   0   1        1  0.534146     10/10   LongestRun
  0   1   2        0  0.739918     10/10   Rank
  0   4   0        0  0.122325     10/10   FFT
  1   0   0        1  0.000439     10/10   NonOverlappingTemplate
  ...              
  2   1   0        0  0.350485      9/10   NonOverlappingTemplate
  0   2   1        0  0.739918     10/10   OverlappingTemplate
  1   1   0        2  0.911413     10/10   Universal
  1   1   0        0  0.017912     10/10   ApproximateEntropy
  1   0   1        1     ----       3/4    RandomExcursions
  ...              
  0   0   0        1     ----       4/4    RandomExcursions
  2   0   0        0     ----       4/4    RandomExcursionsVariant
  ...              
  0   0   3        0     ----       4/4    RandomExcursionsVariant
  1   2   3        0  0.350485      9/10   Serial
  1   1   1        0  0.739918     10/10   Serial
  1   2   0        0  0.911413     10/10   LinearComplexity

...

The DIEHARDER suite has 31 kinds tests, some of which are run many times, making a total of 114 tests. Out of the 114 tests, two returned a weak pass for the PCG input and all the rest passed. A few weak passes are to be expected from running so many tests and so this isn’t a strike against the generator. In fact, it might be suspicious if no tests returned a weak pass.

Here’s an edited version of the results. The full results are here.

#=============================================================================#
        test_name   |ntup| tsamples |psamples|  p-value |Assessment
#=============================================================================#
   diehard_birthdays|   0|       100|     100|0.46682782|  PASSED
      diehard_operm5|   0|   1000000|     100|0.83602120|  PASSED
  diehard_rank_32x32|   0|     40000|     100|0.11092547|  PASSED
    diehard_rank_6x8|   0|    100000|     100|0.78938803|  PASSED
   diehard_bitstream|   0|   2097152|     100|0.81624396|  PASSED
        diehard_opso|   0|   2097152|     100|0.95589325|  PASSED
        diehard_oqso|   0|   2097152|     100|0.86171368|  PASSED
         diehard_dna|   0|   2097152|     100|0.24812341|  PASSED
diehard_count_1s_str|   0|    256000|     100|0.75417270|  PASSED
diehard_count_1s_byt|   0|    256000|     100|0.25725000|  PASSED
 diehard_parking_lot|   0|     12000|     100|0.59288414|  PASSED
    diehard_2dsphere|   2|      8000|     100|0.79652706|  PASSED
    diehard_3dsphere|   3|      4000|     100|0.14978100|  PASSED
     diehard_squeeze|   0|    100000|     100|0.35356584|  PASSED
        diehard_sums|   0|       100|     100|0.04522121|  PASSED
        diehard_runs|   0|    100000|     100|0.39739835|  PASSED
        diehard_runs|   0|    100000|     100|0.99128296|  PASSED
       diehard_craps|   0|    200000|     100|0.64934221|  PASSED
       diehard_craps|   0|    200000|     100|0.27352733|  PASSED
 marsaglia_tsang_gcd|   0|  10000000|     100|0.10570816|  PASSED
 marsaglia_tsang_gcd|   0|  10000000|     100|0.00267789|   WEAK
         sts_monobit|   1|    100000|     100|0.98166534|  PASSED
            sts_runs|   2|    100000|     100|0.05017630|  PASSED
          sts_serial|   1|    100000|     100|0.95153782|  PASSED
...
          sts_serial|  16|    100000|     100|0.59342390|  PASSED
         rgb_bitdist|   1|    100000|     100|0.50763759|  PASSED
...
         rgb_bitdist|  12|    100000|     100|0.98576422|  PASSED
rgb_minimum_distance|   2|     10000|    1000|0.23378443|  PASSED
...
rgb_minimum_distance|   5|     10000|    1000|0.13215367|  PASSED
    rgb_permutations|   2|    100000|     100|0.54142546|  PASSED
...
    rgb_permutations|   5|    100000|     100|0.96040216|  PASSED
      rgb_lagged_sum|   0|   1000000|     100|0.66587166|  PASSED
...
      rgb_lagged_sum|  31|   1000000|     100|0.00183752|   WEAK
      rgb_lagged_sum|  32|   1000000|     100|0.13582393|  PASSED
     rgb_kstest_test|   0|     10000|    1000|0.74708548|  PASSED
     dab_bytedistrib|   0|  51200000|       1|0.30789191|  PASSED
             dab_dct| 256|     50000|       1|0.89665788|  PASSED
        dab_filltree|  32|  15000000|       1|0.67278231|  PASSED
        dab_filltree|  32|  15000000|       1|0.35348003|  PASSED
       dab_filltree2|   0|   5000000|       1|0.18749029|  PASSED
       dab_filltree2|   1|   5000000|       1|0.92600020|  PASSED

RNG testing

If you’d like help with RNG testing, let us know. You can find our contact info here.

Random number generator seed mistakes

Long run or broken software?

I got a call one time to take a look at randomization software that wasn’t randomizing. My first thought was that the software was working as designed, and that the users were just seeing a long run. Long sequences of the same assignment are more likely than you think. You might argue, for example, that the chances of flipping five heads in a row would be (1/2)5 = 1/32, but that underestimates the probability because a run could start at any time. The chances that the first five flips are heads would indeed be 1/32. But the probability of seeing five heads in a row any time during a series of flips is much higher.

Most of the times that I’ve been asked to look at randomization software that “couldn’t be right,” the software was fine. But in this case, there wasn’t simply a long run of random results that happened to be equal. The results were truly constant. At least for some users. Some users would get different results from time to time, but others would get the same result every single time.

trick die

The problem turned out to be how the software set the seed in its random number generator. When the program started up it asked the user “Enter a number.” No indication of what kind of number or for what purpose. This number, unbeknownst to the user, was being used as the random number generator seed. Some users would enter the same number every time, and get the same randomization result, every time. Others would use more whimsy in selecting numbers and get varied output.

How do you seed a random number generator in a situation like this? A better solution would be to seed the generator with the current time, though that has drawbacks too. I write about that in another post.

Seeding many independent processes

A more subtle problem I’ve seen with random number generator seeding is spawning multiple processes that each generate random numbers. In a well-intentioned attempt to give each process a unique seed, the developers ended up virtually assuring that many of the processes would have exactly the same seed.

If you parallelize a huge simulation by spreading out multiple copies, but two of the processes use the same seed, then their results will be identical. Throwing out the redundant simulation would reduce your number of samples, but not noticing and keeping the redundant output would be worse because it would cause you to underestimate the amount of variation.

To avoid duplicate seeds, the developers used a random number generator to assign the RNG seeds for each process. Sounds reasonable. Randomly assigned RNG seeds give you even more random goodness. Except they don’t.

The developers had run into a variation on the famous birthday problem. In a room of 23 people, there’s a 50% chance that two people share the same birthday. And with 50 people, the chances go up to 97%. It’s not certain that two people will have the same birthday until you have 367 people in the room, but the chances approach 1 faster than you might think.

Applying the analog of the birthday problem to the RNG seeds explains why the project was launching processes with the same seed. Suppose you seed each process with an unsigned 16-bit integer. That means there are 65,536 possible seeds. Now suppose you launch 1,000 processes. With 65 times as many possible seeds as processes, surely every process should get its own seed, right? Not at all. There’s a 99.95% chance that two processes will have the same seed.

In this case it would have been better to seed each process with sequential seeds: give the first process seed 1, the second seed 2, etc. The seeds don’t have to be random; they just have to be unique. If you’re using a good random number generator, the outputs of 1,000 processes seeded with 1, 2, 3, …, 1000 will be independent.

 

Need help with randomization?

How to test a random number generator

Last year I wrote a chapter for O’Reilly’s book Beautiful Testing (ISBN 0596159811). The publisher gave each of us permission to post our chapters online, and so here is Chapter 10: How to test a random number generator.

Update: The chapter linked to above describes how to test transformations of a trusted uniform random number generator. For example, maybe your programming language provides a way to generate numbers between 0 and 1 uniformly, and you have written code to transform that into a normal distribution. That’s the most common case. Few people write their own core RNG; most bootstrap the core RNG into other kinds of RNG.

If you need to test a uniform random number generator, I can help with that.

C# random number generation code

This weekend Code Project posted an updated version of my Code Project article Simple Random Number Generation. The article comes with C# source code for generating random samples from the following distributions.

  • Cauchy
  • Chi square
  • Exponential
  • Inverse gamma
  • Laplace (double exponential)
  • Normal
  • Student t
  • Uniform
  • Weibull

After I submitted the revised article I realized I could have easily included a beta distribution generator. To generate a sample from a beta(a, b) distribution, generate a sample u from gamma(a, 1) and a sample v from gamma(b, 1) and return u/(u+v). (See why this works here.)

This isn’t the most efficient beta generator possible, especially for some parameters. But it’s not grossly inefficient either. Also, it’s very simple, and the code in that article emphasizes simplicity over efficiency.

The code doesn’t use advanced C# features; it could easily be translated to other languages.

Related links

Random number generator controversy

I submitted an article to Code Project yesterday, Simple Random Number Generation, describing a small C# class called SimpleRNG that uses George Marsaglia’s WMC algorithm. The article was posted around 5 PM (central US time) and comments started pouring in right away. I didn’t expect any feedback on a Friday afternoon or Saturday morning. But as I write this post, there have been 580 page views and 11 comments.

There have been three basic questions raised in the comments.

  1. Why not just use the random number generator that comes with .NET?
  2. Is this code suitable for cryptography?
  3. Is this code suitable for Monte Carlo applications?

Why not use the built-in generator? For many applications, the simplest thing would be to use the .NET random number generator. But there are instances where this might not be best. There are questions about the statistical quality of the .NET generator; I’ll get to that in a minute. The primary advantages I see to the SimpleRNG class are transparency and portability.

By transparency I mean that the internal state of the generator is simple and easy to access. When you’re trying to reproduce a result, say while debugging, it’s convenient to have full access to the internal state of the random generator. If you’re using your own generator, you can see everything. You can even temporarily change it: for debugging, it may be convenient to temporarily have the “random” generator return a very regular, predictable sequence.

By portability I do not necessarily mean moving the code between operating systems. The primary application I have in mind is moving the algorithm between languages. For example, in my work we often have prototype code written in R that needs to be rewritten in C++ for efficiency. If the code involves random number generation, the output of the prototype and the rewrite cannot be directly compared, only compared on average. Then you have to judge whether the differences are to be expected or whether they indicate a bug. But if both the R and the C++ code use the same RNG algorithm and the same seed, the results may be directly comparable. (They still may not be directly comparable due to other factors, but at least this way the results are often comparable.)

As for cryptography, no, SimpleRNG is not appropriate for cryptography.

As for Monte Carlo applications, not all Monte Carlo applications are created equal. Some applications do not require high quality random number generators. Or more accurately, different applications require different kinds of quality. Some random number generators break down when used for high-dimensional integration. I suspect SimpleRNG is appropriate for moderate dimensions. I use the Mersenne Twister generator for numerical integration. However, SimpleRNG is faster and much simpler; the MT generator has a very large internal state.

Someone commented on the CodeProject article that the random number generator in .NET is not appropriate for Monte Carlo simulation because it does not pass Marsaglia’s DIEHARD tests while SimpleRNG does. I don’t know what algorithm the .NET generator uses, so I can’t comment on its quality. Before I’d use it in statistical applications, I’d want to find out.