Suppose you have a standard deck of 52 cards. You pull out a card, put it back in the deck, shuffle, and pull out another card. How long would you expect to do this until you’ve seen every card?
Here’s a variation on the same problem. Suppose you’re a park ranger keeping data on tagged wolves in a national park. Each day you catch a wolf at random, collect some data, and release it. If there are 100 wolves in the park, how long until you’ve seen every wolf at least once?
We’ll solve this problem via simulation, then exactly.
Simulation
The Python code to solve the problem via simulation is straight forward. Here we’ll simulate our ranger catching and releasing one of 100 wolves. We’ll repeat our simulation five times to get an idea how much the results vary.
import numpy as np from numpy.random import default_rng rng = default_rng() num_runs = 5 N = 100 # number of unique items for run in range(num_runs): tally = np.zeros(N,dtype=int) trials = 0 while min(tally) == 0: tally[rng.integers(N)] += 1 trials += 1 print(trials)
When I ran this, I got 846, 842, 676, 398, and 420.
Exact solution
Suppose we have a desk of N cards, and we sample the deck with replacement M times. We can select the first card N ways. Ditto for the second, third, and all the rest of the cards. so there are NM possible sequences of card draws.
How many of these sequences include each card at least once? To answer the question, we look at Richard Stanley’s twelvefold way. We’re looking for the number of ways to allocate M balls into N urns such that each urn contains at least one ball. The answer is
N! S(M, N)
where S(M, N) is the Stirling number of the second kind with parameters M and N [1].
So the probability of having seen each card at least once after selecting M cards randomly with replacement is
N! S(M, N) / NM.
Computing Stirling numbers
We’ve reduced our problem to the problem of computing Stirling numbers (of the second kind), so how do we do that?
We can compute Stirling numbers in terms of binomial coefficients as follows.
Now we have a practical problem if we want to use this formula for larger values of n and k. If we work with floating point numbers, we’re likely to run into overflow. This is the perennial with probability calculations. You may need to compute a moderate-sized number as the ratio of two enormous numbers. Even though the final result is between 0 and 1, the intermediate results may be too large to store in a floating point number. And even if we don’t overflow, we may lose precision because we’re working with an alternating sum.
The usual way to deal with overflow is to work on a log scale. But that won’t work for Stirling numbers because we’re adding terms together, not multiplying them.
One way to solve our problem—not the most efficient way but a way that will work—is to work with integers. This is easy in Python because integers by default can be arbitrarily large.
There are two ways to compute binomial coefficients in Python: scipy.special.binom
and math.comb
. The former uses floating point operations and the latter uses integers, so we need to use the latter.
We can compute the numerator of our probability with
from math import comb def k_factorial_stirling(n, k): return sum((-1)**i * comb(k, i)*(k-i)**n for i in range(k+1))
Then our probability is
k_factorial_stirling(M, N) / N**M
If we compute the probability that our ranger has seen all 100 wolves after catching and releasing 500 times, we get 0.5116. So the median number of catch and release cycles is somewhere around 500, which is consistent with our simulation above.
Note that the numerator and denominator in our calculation are both on the order of 101000 while the largest floating point number is on the order of 10308, so we would have overflowed if we’d used binom
instead of comb
.
Related posts
[1] It’s unfortunate that these were called the “second kind” because they’re the kind that come up most often in application.
Image: “Fastening a Collar” by USFWS Mountain Prairie is licensed under CC BY 2.0 .
A related problem in probability is known as the coupon collector’s problem. If each box of cereal contains a coupon, and there are N varieties of coupon, and these are uniformly distributed over an endless supply of cereal boxes, what is the probability of getting a complete set after opening M boxes? What is the average number of boxes opened.
Herbert Wilf,
Some New Aspects of the Coupon Collector’s Problem,
SIAM Review,
Volume 48, Number 3, September 2006, pages 549-565.
Or since those coupons are not terribly common anymore: how many packs of Magic cards (or whatever game you prefer) do you have to buy before you get at least one of each card in the new expansion? :)
Mathematica provides a function for Stirling numbers (both kinds). For me, the second kind also comes up more often than the first in the area of performance modeling and analysis. This one liner confirms the Python code output for 500 samples:
In[] = 100! StirlingS2[500, 100] /100^500 // N
Out[]=0.511555
After about 1200 samples, there is a good chance that all 100 are seen (probability is 0.999422), so we can compute the CDF:
cdf = Table[{m, N[n! StirlingS2[m, n] /n^m /. {n -> 100}]}, {m, 100, 1200}];
prob = Prepend[Differences@cdf[[All, 2]], cdf[[All, 2]][[1]]]
The prob vector represents the probability that all 100 are seen with exactly 100 samples (highly unlikely), 101 samples, 102 samples, etc. This suggests that about 518 samples are needed on average to see all 100, which answers the related coupon problem.
The source code for this page contains the following:
content=”If you sample N times with replacement M times, how likely are you to have sampled every item at least once?”
I think the content text is wrong, but I’m not sure. Shouldn’t it read
“If you sample N things …”.
I noticed this when I pasted a link to this page in a Slack group.
As mentioned the coupon collector’s problem is the term I see others using for this. For the deck of cards with 52 cards the expected number is 236 from the wikipedia page at https://en.wikipedia.org/wiki/Coupon_collector%27s_problem
The approximate formula for the larger number answers is n log n + gamma*n + 1/2 where gamma is the Euler–Mascheroni constant (approx 0.5772156649) and the log is the natural log. Plug in 52 and you get 235.9799 or about the 236 from above. Plug in 2 (for when n=2) and you get 3.040726 when the exact answer is 3. For the 100 wolf example that would be an approximate answer of about 518.7. Note this is computing the expected value, not the median of many trials since unlucky trials can be a lot longer than the expected value while lucky trials are capped (you can never have fewer than 100 wolves caught, you can have unlucky trials that take you more than 900 or more than 1000 attempts to catch all the wolves).
My usecase to come to this was trying to figure out if you want some content to end up being cached in a multiple server situation when you have random loadbalancers to identical servers, if you can’t control your requests and they are going to be randomly loadbalanced how many early seed requests would you need to do to expect that all of the servers have loaded some new material (I.e., how many requests until each server would have been hit with at least one request) which is basically again this exact same problem.