Mike Croucher asked the following question on his blog. Suppose you draw M sequences of random numbers of length N from a random number generator. What is the probability that they will overlap?
Assumes your random number generator is a cyclical list of p unique integers. Each draw picks a random starting point in the cycle and chooses N consecutive values. We are particularly interested in the case M = 104, N = 109, and p = 219937 – 1 (the period of the Mersenne Twister).
We will simplify the problem by over-estimating the probability of overlap. The over-estimated probability will still be extremely small.
Assume all the draws are independent. This greatly increases the probability of repetition because now there is some chance that there are repetitions within each draw. That wasn’t possible before, provided N < p.
The probability of no repetitions in n = MN draws is
We can under-estimate this probability by replacing every term in the product with the smallest term. (By under-estimating the probability of no repetitions, we’re over-estimating the probability of at least one repetition.)
When n = MN = 1013 and p = 219937 – 1, the last expression above is approximately 1 – 1026/219937. That says the (over-estimated) probability of repetition is 1026/219937 or about 2×10-5976, thousands of orders of magnitude smaller than the chances of, say, drawing a royal flush.
Whenever you have such mind-bogglingly small probabilities, some probability outside your model has to be more important than the probability computed with the model. For example, we could question the assumption that the starting points are random and independent. Or we could question whether the random number generator was written correctly, or that the compiler correctly compiled the program, or that the operating system correctly ran the program, etc.