Random sample overlap

Here’s a simple probability problem with an answer you may find surprising. (The statement of the problem and solution are simple. The proof is not as simple.)

Suppose you draw 1,000 serial numbers at random from a set of 1,000,000. Then you make another random sample of 1,000. How likely is it that no numbers will be the same on both lists?

To make the problem slightly more general, take two samples of size n from a population of size n² where n is large.

The probability of no overlap in the two samples is approximately 1/e. That is, in the limit as n approaches infinity, the probability converges to 1/e.

For n = 1,000 the approximation is good to three figures.

Proof

There are Binomial(n², n) ways to choose a sample of size n from a population of size n². After we’ve drawn the first sample, there are Binomial(n² – n, n) ways to draw a new sample that doesn’t overlap with the first one. The probability of such a sample is

\begin{align*} \left.\binom{n^2 - n}{n} \middle/ \binom{n^2}{n}\right. &= \frac{(n^2 - n)!}{(n^2 - 2n)!} \frac{(n^2 - n)!}{(n^2)!} \\ & = \frac{(n^2 -n)(n^2 - n -1) \cdots (n^2 - 2n+1)}{n^2 (n^2-1) \cdots (n^2 - n+1)} \\ &= \left( \frac{n^2 - n}{n^2} \right) \left( \frac{n^2 - n - 1}{n^2 -1} \right) \cdots \left( \frac{n^2 - 2n + 1}{n^2 - n+1} \right) \\ \end{align*}

The fractions on the last line are in decreasing order, and so their product is less than the first one raised to the nth power, and greater than the last one raised to the nth power.

By taking logs one can show that

\lim_{n\to\infty} \left( \frac{n^2 - n}{n^2} \right)^n = \lim_{n\to\infty} \left( \frac{n^2 - 2n + 1}{n^2 - n+1} \right)^n = \frac{1}{e}

Since upper and lower bounds on the probability converge to 1/e, the probability converges to 1/e.

Comments

This problem is reminiscent of the birthday problem, but it’s a little different because the samples are draw without replacement.

Incidentally, I didn’t make up this problem out of thin air. It came up naturally in the course of a consulting project this week.

5 thoughts on “Random sample overlap

  1. I tried it and you’re right. That’s cool! Now I’m a bit interested in the proof..
    Here’s the R code I used:

    n_no_over_laps <- 0.
    N_trials <- 100000
    S <- seq(1, 1000000)
    N <- 1000

    for(i in seq(1, N_trials)) {
    N1 <- sample(S, N, replace=T)
    N2 <- sample(S, N, replace=T)
    n_no_over_laps 0, 0, 1)
    }
    paste(1 / exp(1), n_no_over_laps / N_trials)

    # Theory, Experiment
    “0.367879441171442 0.36585”

  2. Nice result, worthy of a homework problem for my stats majors. The proof isn’t so bad once you recognize that it boils down to the limit of (1-1/x)^x as x gets large. Thanks!

  3. It’s a little more complicated since the sampling is done without replacement.

    Let n be the sample size and n^2 the population size. The probability of no overlap is

    ((n^2 – n)!)^2 / (n^2)! (n^2 – 2n)!

    You can write this out and find that it’s bounded above an below by expressions that converge to 1/e.

    Update: I’ve updated the post to include the proof sketched in this comment.

  4. Possibly simpler proof:
    Let k = sqrt(n). Choose one set of numbers first; it doesn’t matter what you choose since you always get k different numbers. Now choose the other set of k numbers. The probability of the (j+1)th number being distinct from the first set, assuming the 1st through jth are, is (n – k – j) / (n – j). This gets smaller as j increases, and so is bounded by (n – k) / n and (n – 2k) / (n – k); that is, by 1 – (1 / k) and 1 – (1 / (k – 1)).

    It is well known that (1 – (1 / k))^k converges to 1/e, and not too hard to show the same for the lower bound:
    (1 – (1 / (k – 1)))^k = (1 – (1 / (k – 1)))^(k – 1) * (1 – (1 / (k – 1))) -> 1/e * 1 = 1/e.

  5. I did not look at the other four responses before I posted.

    The problem was made easy because it said (presumably product) SERIAL numbers. I.e. all different. (However, in thinking about possible duplicates I believe that the analysis below still holds.)

    Thus for any given member of set A, the probability of a match in set B is 0.001, one in a thousand. Thus the probability of a non-match in set B is 0.999. With 1000 events of the same probability the chance of all non-matches is 0.999^1000. (‘^’ used for exponentiation because HTML follows, 0.9991000, sometimes does not work in comments on many platforms.)

    The chance that there will be no matches is 0.367695425.

    ps I am looking for someone to critique my clever pseudo-random number generator.

Comments are closed.