Suppose you have a random number generator that returns numbers between 1 and N. The birthday problem asks how many random numbers would you have to output before there’s a 50-50 chance that you’ll repeat a number. The coupon collector problem asks how many numbers you expect to generate before you’ve seen all N numbers at least once.
These are examples of occupancy problems. The name comes from imagining N urns, randomly assigning balls to each, then asking how many urns are occupied.
Suppose people are walking into a room one at a time. The birthday problem asks at what point is there even odds that two people in the room have the same birthday. The coupon collector problem asks the expected number of people to enter the room before all birthdays are represented. But we could ask, for example, after 100 people enter the room, how likely it is that there are 70 different birthdays represented.
When we draw r random samples with replacement from a set of N items, let X(N, r) be the random variable that represents the number of distinct items in the sample. For example, suppose a hashing algorithm returns one of N possible hash codes. The number of distinct hash codes after hashing r documents would be X(N, r).
The probability mass function (pmf) for X(N, r) has been calculated, for example in [1], but it’s very complicated and you’re not likely to get much understanding of X(N, r) by looking at the expression for the pmf. The mean and variance of X(N, r) are somewhat complicated [2], but easier to work with than the pmf.
The mean of X(N, r) is
In the special case that N = r and N is large, the mean is approximately N(1 – 1/e). For example, suppose you had a deck of 52 cards. You draw one card, put it back in the deck, and shuffle the deck. Repeat this 52 times. You would get about 33 distinct cards on average.
The variance of X(N, r) is more complicated than the mean.
As with the mean, the case N = r with N large is simpler. In that case the variance is approximately N(1/e – 1/e²). In the example above, this works out to about 12. The standard deviation is about 3.5, and so you’d often see 33 ± 7 distinct cards.
Related posts
- Probability of secure hash collisions
- The coupon collector problem and π
- Serious applications of a party trick
[1] Paul G. Hoel, Sidney C. Port, and Charles J. Stone, Introduction to Probability TheoryX(N, r), Houghton Mifflin, Boston, 1971, pp. 43–45.
[2] Emily S. Murphree. Replacement Costs: The Inefficiencies of Sampling with Replacement. Mathematics Magazine, Vol. 78, No. 1 (Feb., 2005), pp. 51-57.