## Long run or broken software?

I got a call one time to take a look at randomization software that wasn’t randomizing. My first thought was that the software was working as designed, and that the users were just seeing a long run. Long sequences of the same assignment are more likely than you think. You might argue, for example, that the chances of flipping five heads in a row would be (1/2)^{5} = 1/32, but that underestimates the probability because a run could start at any time. The chances that the *first* five flips are heads would indeed be 1/32. But the probability of seeing five heads in a row any time during a series of flips is higher.

Most of the times that I’ve been asked to look at randomization software that “couldn’t be right,” the software was fine. But in this case, there wasn’t simply a long run of random results that happened to be equal. The results were truly constant. At least for some users. Some users would get different results from time to time, but others would get the same result every single time.

The problem turned out to be how the software set the seed in its random number generator. When the program started up it asked the user “Enter a number.” No indication of what kind of number or for what purpose. This number, unbeknownst to the user, was being used as the random number generator seed. Some users would enter the same number every time, and get the same randomization result, every time. Others would use more whimsy in selecting numbers and get varied output.

How do you seed a random number generator in a situation like this? A better solution would be to seed the generator with the current time, though that has drawbacks too. I write about that in another post.

## Seeding many independent processes

A more subtle problem I’ve seen with random number generator seeding is spawning multiple processes that each generate random numbers. In a well-intentioned attempt to give each process a unique seed, the developers ended up virtually assuring that many of the processes would have exactly the same seed.

If you parallelize a huge simulation by spreading out multiple copies, but two of the processes use the same seed, then their results will be identical. Throwing out the redundant simulation would reduce your number of samples, but not noticing and keeping the redundant output would be worse because it would cause you to underestimate the amount of variation.

To avoid duplicate seeds, the developers used a random number generator to assign the RNG seeds for each process. Sounds reasonable. Randomly assigned RNG seeds give you even more random goodness. Except they don’t.

The developers had run into a variation on the famous birthday problem. In a room of 23 people, there’s a 50% chance that two people share the same birthday. And with 50 people, the chances go up to 97%. It’s not *certain* that two people will have the same birthday until you have 367 people in the room, but the chances approach 1 faster than you might think.

Applying the analog of the birthday problem to the RNG seeds explains why the project was launching processes with the same seed. Suppose you seed each process with an unsigned 16-bit integer. That means there are 65,536 possible seeds. Now suppose you launch 1,000 processes. With 65 times as many possible seeds as processes, surely every process should get its own seed, right? Not at all. There’s a 99.95% chance that two processes will have the same seed.

In this case it would have been better to seed each process with *sequential* seeds: give the first process seed 1, the second seed 2, etc. The *seeds* don’t have to be random; they just have to be unique. If you’re using a good random number generator, the outputs of 1,000 processes seeded with 1, 2, 3, …, 1000 will be independent.

Better to seed with lava lamp or radiation.

In looking into this in the past, I came across the research of L’Ecuyer for parallel-stream PRNGs… e.g., http://www.iro.umontreal.ca/~lecuyer/myftp/papers/streams00.pdf and later work. Do you have any thoughts on this or on other things to look into for parallel PRNG streams for Monte Carlo analysis?

I’m with Andy – better to use radiation for your seeds.

Regarding seeding with 1, 2, 3, … my understanding is that small seeds are generally bad because they don’t have enough information. My experience is primarily with Mersenne Twister implementations. I don’t have any hard evidence 20 minutes after waking up on a Saturday morning but my recollection was the sequences generated with seeds containing mostly zero bits were inferior to those seeded with random 32 bit numbers.

There is also a problem with virtual servers.

If you start a server and suspend it after it has seeded the RNG, and always use that as the starting point for new instances, they will of course all have their RNG in the same state.

Vic, random seeds sound reasonable, but that’s unnecessary at best. And its a bad idea for the reasons given in the post: you’re likely to run into the birthday problem.

A high quality random number generator should produce uncorrelated output for

anytwo different seeds, including consecutive seeds, and as far as anyone knows, the Mersenne Twister does this.I tested this by creating a sequence of 10,000 random numbers, using seeds 1, 2, 3, … and then doing a Pearson correlation test. The p-values are uniformly distributed, just as they should be. You could see this by eye, and you can run a Kolmogorov-Smirnov goodness of fit test.

In this plot, the lower line, the blue dots, are p-values for vectors created with consecutive seeds. The upper line, the green dots, show the p-values for generating all the vectors consecutively without resetting the seed. The K-S test statistic for this picture is 0.074, with a p-value of 0.966. This means the K-S test believes the p-values are quite uniformly distributed.

This picture was created using only 10 vectors (45 correlations) so you could see individual dots. The small number of vectors explains why there are gaps in both lines. That’s to be expected. With more vectors, the gaps fill in as you’d expect, but this also means the plots become solid green and blue lines and so hard to understand. With more vectors, you can rely on the K-S test instead, and shows that the vectors created from consecutive seeds are independent, i.e. the p-values for the correlation tests are uniformly distributed.

Brad: Isn’t that what you’d want and expect? If you restart a virtual machine, you’d expect it to have it’s previous state. And firing up multiple copies of a machine with the same state should produce the same output.

Not if you are using it as a way to quickly restart a web server.

In Perl 5, hashes have a randomized element to them, if the attacker finds out what the random seed is they can construct a set of keys that would all have the exact same hashed value. Using that they construct a denial of service GET request. Your site goes down, then comes back up. They send the same GET request, it goes back down. If this is automated, and doesn’t kill the slow instances, you could have a further DoS from having to run hundreds or thousands of instances of the same server.

If you had fully restarted the instance, Perl would have used a different random number. At that point the attacker has to redo all the work to find the random seed, and construct a new list of keys, before he could attack you again.

This is why I recommend anyone who is in anyway exposing Perl to untrusted input to upgrade to at least 5.18 as it has additional randomness that helps reduce the amount of information it leaks. ( I’m only aware of the weakness in earlier versions second hand, but I trust the source. As people still use outdated versions, the exact details of the leak aren’t likely to be made public any time soon )

The 5.18 change also randomizes each hash separately, instead of all hashes together, which may help to make the above attack harder to do.

see http://perldoc.perl.org/perlsec.html#Algorithmic-Complexity-Attacks

the link at the bottom is broken try https://www.usenix.org/conference/12th-usenix-security-symposium/denial-service-algorithmic-complexity-attacks

There is a talk from C3 inspired by the first such change all the way back in 5.8.1 (2003), and how after almost ten years after the problem was first found, it was still in other languages.

see https://lwn.net/Articles/474912/ for that, There is a video, but I can’t find it right now.

Hi John!

Out of curiosity: I wonder, what are your thoughts on the following (Mersenne Twister seeding pitfalls): http://www.pcg-random.org/posts/cpp-seeding-surprises.html

I would recommend _not_ using subsequent seed values for multiple threads. This is not guaranteed to work (although it might), whereas sampling random seed values _without replacement_ should always work. I wrote a small post about this in RPubs: http://rpubs.com/Jouni_Helske/225931

Your random sampling without replacement could conceivably produce consecutive seeds, so I don’t see how the former could be guaranteed to work but not the latter.

Oh yeah, fair point, I completely forgot to think about that. Still, I’d argue that the effect of having few unlucky threads with consecutive seeds is better than all being consecutively seeded.

Of course from a practical perspective (when you actually do something with the random variables) using a latest Mersenne-Twister with consecutive seeds is probably good enough, and actually when I checked the same example using std::mt19937 in C++ I saw no autocorrelations at all. The implementation used in R seems to be the older one where it is known that there exists some problems with seeding.

But thank you for the blog post and reply, this got me thinking about the seeding more (/too?) carefully!