Long run or broken software?
I got a call one time to take a look at randomization software that wasn’t randomizing. My first thought was that the software was working as designed, and that the users were just seeing a long run. Long sequences of the same assignment are more likely than you think. You might argue, for example, that the chances of flipping five heads in a row would be (1/2)5 = 1/32, but that underestimates the probability because a run could start at any time. The chances that the first five flips are heads would indeed be 1/32. But the probability of seeing five heads in a row any time during a series of flips is higher.
Most of the times that I’ve been asked to look at randomization software that “couldn’t be right,” the software was fine. But in this case, there wasn’t simply a long run of random results that happened to be equal. The results were truly constant. At least for some users. Some users would get different results from time to time, but others would get the same result every single time.
The problem turned out to be how the software set the seed in its random number generator. When the program started up it asked the user “Enter a number.” No indication of what kind of number or for what purpose. This number, unbeknownst to the user, was being used as the random number generator seed. Some users would enter the same number every time, and get the same randomization result, every time. Others would use more whimsy in selecting numbers and get varied output.
How do you seed a random number generator in a situation like this? A better solution would be to seed the generator with the current time, though that has drawbacks too. I write about that in another post.
Seeding many independent processes
A more subtle problem I’ve seen with random number generator seeding is spawning multiple processes that each generate random numbers. In a well-intentioned attempt to give each process a unique seed, the developers ended up virtually assuring that many of the processes would have exactly the same seed.
If you parallelize a huge simulation by spreading out multiple copies, but two of the processes use the same seed, then their results will be identical. Throwing out the redundant simulation would reduce your number of samples, but not noticing and keeping the redundant output would be worse because it would cause you to underestimate the amount of variation.
To avoid duplicate seeds, the developers used a random number generator to assign the RNG seeds for each process. Sounds reasonable. Randomly assigned RNG seeds give you even more random goodness. Except they don’t.
The developers had run into a variation on the famous birthday problem. In a room of 23 people, there’s a 50% chance that two people share the same birthday. And with 50 people, the chances go up to 97%. It’s not certain that two people will have the same birthday until you have 367 people in the room, but the chances approach 1 faster than you might think.
Applying the analog of the birthday problem to the RNG seeds explains why the project was launching processes with the same seed. Suppose you seed each process with an unsigned 16-bit integer. That means there are 65,536 possible seeds. Now suppose you launch 1,000 processes. With 65 times as many possible seeds as processes, surely every process should get its own seed, right? Not at all. There’s a 99.95% chance that two processes will have the same seed.
In this case it would have been better to seed each process with sequential seeds: give the first process seed 1, the second seed 2, etc. The seeds don’t have to be random; they just have to be unique. If you’re using a good random number generator, the outputs of 1,000 processes seeded with 1, 2, 3, …, 1000 will be independent.