Suppose you have a large number of buckets and an equal number of balls. You randomly pick a bucket to put each ball in one at a time. When you’re done, about what proportion of buckets will be empty?
One line of reasoning says that since you have as many balls as buckets, each bucket gets one ball on average, so nearly all the buckets get a ball.
Another line of reasoning says that randomness is clumpier than you think. Some buckets will have several balls. Maybe most of the balls will end up in buckets with more than one ball, and so nearly all the buckets will be empty.
Is either extreme correct or is the answer in the middle? Does the answer depend on the number n of buckets and balls? (If you look at the cases n = 1 and 2, obviously the answer depends on n. But how much does it depend on n if n is large?) Hint: There is a fairly simple solution.
What applications can you imagine for the result?
For N large, the number of balls per bucket should follow a Poisson distribution with lambda = 1 (equal number of balls and buckets). The fraction of buckets with K balls is then exp(-1) / K!, which is 0.368 for both K=0 and K=1, and 0.184 for K=2, so most balls are in buckets with zero or one balls.
Applications: clustering in fixed-size hash tables.
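Those Poisson(1) fractions are easy to check; here is a minimal Python sketch (standard library only, my own code rather than anything posted above):

import math

# fraction of buckets with k balls, for large n: e^(-1) / k!
for k in range(5):
    print(k, round(math.exp(-1) / math.factorial(k), 3))
# 0 0.368, 1 0.368, 2 0.184, 3 0.061, 4 0.015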
Obviously, after 1 bucket is filled, there’s a 1/n chance that the next ball will also land in that bucket, leaving at least 1 bucket empty, and so forth.
A quick Perl one-liner tells me that the fraction of empty buckets tends toward approximately 36.7%.
I’m not surprised that the fraction of empty buckets winds up being fairly large…
So yay clumpiness, I guess :) I’m not sure how I’d get to that fraction analytically, though.
@rhalbersma: thanks, makes sense!
Mike G: If I told you that the number you found empirically is 1/e, would that help you find the fraction analytically?
I would guess off the top of my head that roughly half the buckets would end up empty. I don’t have a math background, although I’m trying to remedy that (Yay Project Euler!) What should I read to further understand (or answer) this question? Probability fascinates me.
Tom, you should read http://www.amazon.com/Cartoon-Guide-Statistics-Larry-Gonick/dp/0062731025
Tom: You might look at books and videos by Mike Starbird. I don't know how much he has written or recorded about probability per se, but he's an outstanding teacher. (He was one of my professors in college. I've mentioned him a couple times before.)
Update: I just realized I didn’t answer your question directly. I was replying to what you could do for learning math in general. As for this particular problem, watch the comments on this post. There’s a simple derivation I’ll post if no one else does.
Actually, my previous remark “most balls are in buckets with zero or one balls” was incorrect. It’s more accurate to say “most buckets have zero or one balls”.
To compute how many balls are in buckets with zero or one balls, we need to weight each term of the Poisson distribution by its ball count. Doing it numerically in R:
y <- exp(-1) / factorial(0:9) # P(k balls in a bucket), k = 0..9
ny <- y * seq(from=0, length.out=10) # expected balls in buckets holding k balls
1/round(ny / sum(ny), 3)
and we get that about 1 in 2.7 balls sit alone in a bucket, 1 in 2.7 are in buckets with 2 balls, 1 in 5.4 are in buckets with 3 balls, and 1 in 16 are in buckets with 4 balls.
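The same fractions also follow directly from the formula: the fraction of balls in buckets holding exactly k balls is k * e^(-1)/k! = e^(-1)/(k-1)!. A quick Python cross-check (my own sketch, separate from the R above):

import math

# fraction of balls sitting in buckets that hold exactly k balls
for k in range(1, 5):
    frac = math.exp(-1) / math.factorial(k - 1)
    print(k, round(frac, 3), "-> about 1 in", round(1 / frac, 1))
# 1: 1 in 2.7, 2: 1 in 2.7, 3: 1 in 5.4, 4: 1 in 16.3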
I figured probability of a bucket being empty as:
((n - 1) / n) ** n
which comes out at 0.366..., matching other people's answers, but I'm not sure if it's the best derivation?
Douglas: Good. Now take the limit as n -> infinity.
@Douglas, I think your derivation is the most intuitive. The probability of a bucket being missed once is (n-1)/n, giving ((n-1)/n)^n as the probability of it being missed n times. As n -> infinity, this goes to 1/e.
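To see how quickly ((n-1)/n)^n settles toward 1/e, here is a tiny Python sketch:

import math

for n in (10, 100, 1000, 10**6):
    print(n, ((n - 1) / n) ** n)
print(1 / math.e)
# 0.3487 at n=10, 0.3660 at n=100, already 0.367879 at n=10^6, vs 1/e = 0.367879...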
I don’t know Poisson distributions. I made a tree diagram for 3 and for 4. I can see what the probabilities of the 2 extreme cases are, but not the middle cases. All in one: (1/n)^(n-1). One in each: (n-1)!/n^(n-1).
For 4, I got P(one empty)=36/64, P(3 in one, and 2 empty)=12/64, P(2 in each of 2, with 2 empty)=9/64, but I don’t see a pattern for those middle cases yet. If e is involved, I guess I won’t see a pattern easily.
For those who are interested, the definition of e is lim (1 + 1/n)^n as n -> inf. It's easy to see that (1 + 1/n)^n and (n/(n-1))^n are equivalent for large n. So the inverse, ((n-1)/n)^n, must equal 1/e.
As confirmation of the math I wrote a quick c# program (not that confirmation is necessarily needed as the math is proof enough, but it’s helpful to see it in action).
It shows the average fraction of empty buckets also tends toward about .36. I could imagine scenarios where this leaves about a third of resources underutilized, e.g., when load balancing read/write access to a hard drive.
private static void Main(string[] args)
{
    Random r = new Random();
    for (int n = 1; n < 100; n++)
    {
        Console.Write(string.Format("N={0} ", n));
        // get an 'avg' for N by trying it 10 times
        int runningtotal = 0;
        for (int attempt = 0; attempt < 10; attempt++)
        {
            int emptyBuckets = Fillbuckets(n, r);
            Console.Write(string.Format("{0} ", emptyBuckets));
            runningtotal += emptyBuckets;
        }
        Console.WriteLine(string.Format(" avg: {0}, %: {1}", runningtotal / 10.0, (runningtotal / 10.0) / n));
    }
    Console.WriteLine("Done.");
    Console.ReadLine();
}

public static int Fillbuckets(int n, Random r)
{
    int[] buckets = new int[n];
    for (int ball = 0; ball < n; ball++)
        buckets[r.Next(n)]++; // drop each ball into a uniformly random bucket
    return buckets.Count(x => x == 0); // empty buckets (requires using System.Linq;)
}
Well, I couldn't work out the math, but the empirics seem to agree with the analytics.
probEmpty = function(n)
{
buckets = 1:n
pick = sample(x=n,size=n,replace=T)
(n-length(unique(pick)))/n
}
> system.time({res =replicate(n=1000, probEmpty(100000))})
user  system elapsed
9.02 0.50 9.52
> plot(density(res))
> summary(res)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.3646 0.3671 0.3678 0.3678 0.3685 0.3719
I approached it by writing out all the possible distributions for n=3, showing which bucket each ball went in, like this:
1,1,1
1,1,2
1,1,3
1,2,1
…
It became clear then that with n=3, each bucket would be unused in 2/3 of 2/3 of 2/3 of the cases. Then you would multiply by 3 (for three buckets) and divide by 3 (to get proportion empty), which is just multiplying by 1, so you stop at the 2/3 of 2/3 of 2/3.
So for n=4, 3/4 of 3/4 of 3/4 of 3/4… and from that I got the formula others have found: ((n-1)/n)^n.
It seemed clear this was converging on something, but I didn’t know what until reading the comments, so I can definitely learn more by looking into it deeper.
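That enumeration can be done by brute force for small n; a short Python sketch over all n^n equally likely assignments:

from itertools import product

for n in (3, 4):
    total_empty = 0
    for assignment in product(range(n), repeat=n):  # bucket choice for each ball
        total_empty += n - len(set(assignment))     # buckets left empty in this outcome
    # average empty fraction over all n^n outcomes, vs the closed form
    print(n, total_empty / (n ** n * n), ((n - 1) / n) ** n)
# both columns agree: 8/27 = 0.2963 for n=3, 81/256 = 0.3164 for n=4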
Another application is Monte Carlo integration. Stratified sampling is designed to help solve exactly this problem of samples clumping together.
I can’t answer the “clumpiness” problem, my math skills aren’t that good.
But it sounds a lot like something useful to know when designing some types of hash tables. The size of the table is the number of buckets, and collisions are the buckets with more than one ball. Weighing the cost of managing collisions against the cost of resizing the table for a better distribution is a useful trade-off to know and compute quickly.
The probability of a given ball being in a particular bucket is 1/n, so the probability of it not being in that bucket is (1 - 1/n). So, over all n balls, (1 - 1/n)^n: the simple solution. Here's another:
Start with the line of reasoning “since you have as many balls as buckets, each bucket gets one ball on average, so nearly all the buckets get a ball”, and make successive corrections. Each bucket gets one ball, except that when two or more balls fall into the same bucket, empty buckets are left over.
Two given balls will fall into the same bucket with probability 1/n, and there are (nC2) ball pairs, so (nC2)(1/n) is the expected number of 2-way collisions. Each would create an empty bucket. But there is double counting for 3-way collisions: the collisions of ball1 and ball2, ball2 and ball3, and ball1 and ball3 are counted as 3 collisions but empty only two buckets.
Similarly, (nC3)(1/n)^2 is the expected number of 3-way collisions, but if we subtract them all, we've over-corrected by the expected number of 4-way collisions…
Keep going and you'll arrive at
n - (nC1) + (nC2)(1/n) - (nC3)(1/n)^2 + (nC4)(1/n)^3 - …
where the first two canceling terms are the “each bucket” and the “gets one ball on average”. Divide through by the n buckets to arrive at
1 - (nC1)(1/n) + (nC2)(1/n)^2 - (nC3)(1/n)^3 + (nC4)(1/n)^4 - …
You might recognize this as the binomial expansion of (1 - 1/n)^n, which puts you back at the simple solution. If not, note instead that the term (nCk)(1/n)^k ~ 1/k! for large n, so the series is approximately
1 - 1/1! + 1/2! - 1/3! + 1/4! - …
This might look familiar as x = -1 in
e^x = 1 + x + x^2/2! + x^3/3! + x^4/4! + …
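If you want to watch the partial sums of that alternating series close in on 1/e, a short Python sketch:

import math

total = 0.0
for k in range(10):
    total += (-1) ** k / math.factorial(k)
    print(k, total)
# the sums oscillate: 1, 0, 0.5, 0.333..., 0.375, ... converging to 1/e = 0.367879...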
n <- 1000
balls <- n # how many balls
ranBuckets <- sample(1:balls, size = balls, replace = TRUE) # pick all of the random buckets at once
# the slow way around to getting the balls into the buckets
#buckets <- numeric(n)
#for (i in 1:balls){
#  buckets[ranBuckets[i]] <- buckets[ranBuckets[i]] + 1 # put the ball in the bucket
#}
# the somewhat more convoluted fast way in R
buckets <- rle(sort(ranBuckets)) # run lengths give the ball count for every nonempty bucket
1 - length(buckets$values)/n # proportion of empty buckets
buckets <- sort(buckets$lengths, decreasing = TRUE)
buckets <- c(buckets, numeric(n - length(buckets))) # pad back to the original n buckets with counts of balls in them
barplot(buckets) # plot buckets sorted by number of balls
I apparently pasted over my comment above the code there. It’s R code to do the simulation.
I learned this in my probability class. It goes like this:
Prob of one bucket getting no balls => (1 - 1/n)^n ~ (e^(-1/n))^n = 1/e
So expected number of empty buckets => n/e
This is helpful in calculating how many servers you need for load balancing the traffic to your website :)
Many people have correctly stated that the number of balls in any given bucket has a binomial (roughly Poisson) distribution, and so the probability of a bucket being occupied is about 1 - 1/e. So the mean number of occupied buckets is n * (1 - 1/e).
The mean is only a small part of the story. What about the full distribution?
It can be shown that the event that one bucket is occupied is negatively correlated with the event that another bucket is occupied. This implies that the total number of occupied buckets is tightly concentrated around its mean.
In particular, Chernoff's bound governs the number of occupied buckets, just as if each bucket were independently occupied with probability 1 - 1/e. For example, it is exponentially unlikely that the number of occupied buckets differs from n * (1 - 1/e) by any constant fraction of n.
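As a rough empirical illustration of that concentration, here is a sketch of my own using numpy (empty_fraction is a made-up helper, not from any comment above):

import numpy as np

rng = np.random.default_rng(0)

def empty_fraction(n):
    # throw n balls into n buckets, return the fraction of buckets left empty
    picks = rng.integers(0, n, size=n)
    return 1 - len(np.unique(picks)) / n

for n in (100, 10_000, 1_000_000):
    trials = np.array([empty_fraction(n) for _ in range(100)])
    print(n, trials.mean().round(4), trials.std().round(4))
# the mean stays near 1/e while the spread shrinks roughly like 1/sqrt(n)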
The probability of a ball going into a particular bucket is 1/n, where n is the number of buckets (as well as the number of balls). So, the probability of a ball not going into a particular bucket is 1 - 1/n. Over all n balls, the proportion, p, of empty buckets is then:
p = (1 - 1/n)^n.
Now, let's see what happens when we let n increase beyond all bounds. Let's also use the well-known limit: e = lim (1 + 1/n)^n as n -> infinity.
Let's start by noting that 1 - 1/n = (n-1)/n = 1/(n/(n-1)). If we let m = n - 1, we have:
1 - 1/n = 1/((m+1)/m) = 1/(1 + 1/m). Therefore:
p = (1 - 1/n)^n = (1/(1 + 1/m))^(m+1) = 1/(1 + 1/m)^(m+1). Let's now consider the denominator, d:
d = (1 + 1/m)^(m+1) = ((1 + 1/m)^m)(1 + 1/m). Now if we let m -> infinity we have:
lim d = (lim (1 + 1/m)^m)(lim (1 + 1/m)) = (e)(1) = e. Therefore:
lim p = lim (1/d) = (lim 1)/(lim d) = 1/e, as m -> infinity.
Since m -> infinity as n -> infinity, we can conclude that as the number of buckets increases, the proportion of empty buckets approaches 1/e.
Mitzenmacher and “the power of two choices” — one of the most beautiful pieces of probability theory I have learned in a long time :-)
What about the expected maximum number of balls in a bucket? It's not clear to me how to find it from the probability P(k) = e^(-1)/k! that a given bucket has k balls. What is the formula for the cumulative distribution function?
Here’s my preemptive answer (as in, if I was working on this during a math competition, I’d stop here and run with it): http://i.imgur.com/SGRG2.png
As soon as I read “large number of buckets” I thought “Ok, let’s start with 2 and 3 buckets”.. and Pascal’s Triangle came right out! Sorry if I’m modelling it wrong, it was just a fun “aha” moment
Bleh, never mind
The distribution of the number of balls X in each bucket is multinomial: X ~ MN(n, 1/n * 1_n), where 1_n is a vector of ones of size n. Let X_i be the number of balls in the i-th bucket. Then E[X_i] = n * (1/n) = 1, Var[X_i] = n * (1/n) * (1 - 1/n) = 1 - 1/n, and Cov(X_i, X_j) = -n * (1/n) * (1/n) = -1/n.
So, as n grows, we first notice that the X_i become uncorrelated. Moreover, as a first approximation, X_i -> Poisson(1) in distribution (we can roughly view each X_i as a Binomial(n, 1/n)), so P(X_i = 0) = exp(-1). Note that, in general, if we had lambda times more balls than buckets, then X_i -> Poisson(lambda), and so P(X_i = 0) = exp(-lambda). Finally, in the limit the X_i are not only uncorrelated but behave as independent Poisson(1) variables.
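Those moments are easy to verify by simulation; a small numpy sketch (my own code, with an arbitrary n and trial count):

import numpy as np

rng = np.random.default_rng(1)
n, trials = 50, 100_000

# counts[t, i] = number of balls in bucket i on trial t
counts = np.stack([np.bincount(rng.integers(0, n, size=n), minlength=n)
                   for _ in range(trials)])
print(counts[:, 0].mean())                        # ~ E[X_i] = 1
print(counts[:, 0].var())                         # ~ 1 - 1/n = 0.98
print(np.cov(counts[:, 0], counts[:, 1])[0, 1])   # ~ -1/n = -0.02
print((counts[:, 0] == 0).mean())                 # ~ (1 - 1/n)^n, near exp(-1)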
After looking through the comments, and seeing everyone's mathematical approximations of how to compute the percentage of empty buckets, I wrote a small Python script, based on numpy's randint function, to pick the bucket for each ball.
import numpy as np

def RandBuckets(n, repeats=100):
    '''
    For a given number, n, of buckets and an equal number of balls, find
    the percentage of buckets that will contain 0 balls, if placed randomly.
    '''
    total = 0
    buckets = set(range(1, n + 1))
    for i in range(repeats):
        placed = set(np.random.randint(1, n + 1) for _ in range(n))
        diff = buckets.difference(placed)
        total += (float(len(diff)) / n) * 100.
    print("average percent empty = %.2f" % (total / repeats))
It was cool to see that no matter how many times I ran this, and what numbers I used for input, the percentage of empty buckets was around 36.8%.
I have a follow-on question, which I don’t know how to solve. For a given number of balls and buckets N, what is the expected maximum number of balls in a single bucket?
The application I’m interested in is memory bank conflicts in a parallel machine with shared memory. If I have 32 banks in my shared memory, and 32 threads running SIMD parallel, and the threads index the memory randomly, what is the expected degree of bank conflicts/collisions?
If I consider the problem similar to hashing, then it relates to the expected maximum number of items that will hash to a single location, and this document shows an upper bound on that number of O(log n / log log n) for hashing n items into n buckets. For n = 32, that works out to about 2.788 (using natural log). That's fine, but this programmer wrote a simulation and found that the maximum bank-conflict degree clusters between 3 and 4. Using the percentages he found by simulation as probabilities, I compute the expected value to be 3.529. It's not super far off, but the 2.788 is supposed to be an upper bound. Since the upper bound is given in big-O notation, I guess there's a constant factor left out. Is that constant factor enough to explain it? Is it possible to compute the constant factor for n = 32?
It would be interesting to reconcile these, and/or to find a closed form solution for the expected maximum bank conflict degree with 32 banks and 32 parallel threads. Thanks for an interesting discussion!
I would say (n/(n+1))^n, which converges to 1/e, approximately 0.367879.
I meant that the probability that any bucket would be empty is 1/e. The number of empty buckets is n/e, of course.
This is a variation on the Birthday problem: http://en.wikipedia.org/wiki/Birthday_problem#Cast_as_a_collision_problem
http://en.wikipedia.org/wiki/Birthday_problem#Collision_counting
@ Mark Harris:
Yes, you can compute the exact expected value since you can enumerate all of the possibilities. The number of possibilities explodes combinatorially, so a naive approach isn't feasible. But it is possible using a recursive approach. Look for a paper called “The exact distribution of the maximum, minimum and the range of Multinomial/Dirichlet and Multivariate Hypergeometric frequencies” by Charles J. Corrado.
I posted code over at the stackoverflow thread you linked, with output of probabilities and expected value (~3.532941). This is very close to the answer by simulation.
The big-O bound only tells you the asymptotic behavior up to a constant multiple as n grows large. It isn't meant to say anything about "small" n.
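For the concrete n = 32 case, a quick simulation sketch (numpy, my own code) gives a feel for the expected maximum before resorting to the exact recursion:

import numpy as np

rng = np.random.default_rng(2)
n, trials = 32, 100_000

picks = rng.integers(0, n, size=(trials, n))
maxima = np.array([np.bincount(row, minlength=n).max() for row in picks])
print(maxima.mean())  # ~ 3.53, consistent with the exact 3.532941 quoted above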
Here’s one way to solve it: Let X be the number of empty bins, and X_i be the indicator random variable that is 1 when the i-th bucket is empty and 0 otherwise. So, X = X_1 + … + X_n. Now, E(X) = E(X_1) + … + E(X_n) = ((n-1)/n)^n + … + ((n-1)/n)^n = n*((n-1)/n)^n.
In the limit as n -> infty, we expect about n/e bins to be empty.