Consecutive coupon collector problem

Coupon collector problem

Suppose you have a bag of balls labeled 1 through 1,000. You draw balls one at a time and put them back after each draw. How many draws would you have to make before you’ve seen every ball at least once?

This is the coupon collector problem with N = 1000, and the expected number of draws is

N HN

where

HN = 1 + 1/2 + 1/3 + … + 1/N

is the Nth harmonic number.

As N increases, HN approaches log(N) + γ where γ = 0.577… is the Euler-Mascheroni constant, and so the expected time for the coupon collector problem is approximately

N (log(N) + γ).

Consecutive draws

Now suppose that instead of drawing single items, you draw blocks of consecutive items. For example, suppose the 1,000 balls are arranged in a circle. You pick a random starting point on the circle, then scoop up 10 consecutive balls, then put them back. Now how long would it take to see everything?

By choosing consecutive balls, you make it harder for a single ball to be a hold out. Filling in the holes becomes easier.

Bucketed problem

Now suppose the 1,000 balls are placed in 100 buckets and the buckets are arranged in a circle. Now instead of choosing 10 consecutive balls, you choose a bucket of 10 balls. Now you have a new coupon collector problem with N = 100.

This is like the problem above, except you are constraining your starting point to be a multiple of n.

Upper and lower bounds

I’ll use the word “scoop” to mean a selection of n balls at a time to avoid possible confusion over drawing individual balls or groups of balls.

If you scoop n balls at a time by making n independent draws, then you just have the original coupon collector problem, with the expected time divided by n.

If you scoop up n consecutively numbered balls each time, you reduce the expected time to see everything at least once. But your scoops can still overlap. For example, maybe you selected 13 through 22 on one draw, and 19 through 38 on the next.

In the bucketed problem, you reduce the expected time even further. Now your scoops will not partially overlap. (But they may entirely overlap, so it’s not clear that this reduces the total time.)

It would seem that we have sandwiched our problem between two other problems we have the solution to. The longest expected time would be if our scoop is made of n independent draws. Then the expected number of scoops is

N HN / n.

The shortest time is the bucketed problem in which the expected number of scoops is

(N/n) H(N/n).

It seems the problem of scooping n consecutive balls, with no constraint on the starting point, would have expected time somewhere between these two bounds. I say “it seems” because I haven’t proven anything here, just given plausibility arguments.

By the way, we can see how much bucketing reduces the expected time by using the log approximation above. With n independent draws each time, the expected number of scoops is roughly

(N/n) log(N)

whereas with the bucketed problem the expected number of scoops is roughly

(N/n) log(N/n).

Expected number of scoops

I searched a bit on this topic, and I found many problems with titles like “A variation on the coupon collector problem,” but none of the papers I found considered the variation I’ve written about here. If you work out the expected number of scoops, or find a paper where someone has worked this out, please let me know.

The continuous analog seems like an easier problem, and one that would provide a good approximation. Suppose you have a circle of circumference N and randomly place arcs of length n on the circle. What is the expected time until the circle is covered? I imagine this problem has been worked out many times and may even have a name.

Update: Thanks to Monte for posting links to the solution to the continuous problem in the comments below.

Simulation results

When N = 1000 and n = 10, the upper and lower bounds work out to 748 and 518.

When I simulated the consecutive coupon collector problem I got an average of 675 scoops, a little more than the average of the upper and lower bounds.

 

Hyperellipsoid surface area

Dimension 2

The equation for the perimeter of an ellipse is

p = 4aE(e^2)

where a is the semimajor axis, e is eccentricity, and E is a special function. The equation is simple, in the sense that it has few terms, but it is not elementary, because it depends on an advanced function, the complete elliptic integral of the second kind.

However, there is an approximation for the perimeter that is both simple and elementary:

p \approx 2 \pi \left(\frac{a^{3/2} + b^{3/2}}{2} \right )^{2/3}

Dimension 3

The generalization of an ellipse to three dimensions is an ellipsoid. The surface area of an ellipsoid is neither simple nor elementary. The surface area S is given by

\begin{align*} \varphi &= \arccos\left(\frac{c}{a} \right) \\ k^2 &= \frac{a^2\left(b^2 - c^2\right)}{b^2\left(a^2 - c^2\right)} \\ S &= 2\pi c^2 + \frac{2\pi ab}{\sin(\varphi)}\left(E(\varphi, k)\,\sin^2(\varphi) + F(\varphi, k)\,\cos^2(\varphi)\right) \end{align*}

where E is incomplete elliptic integral of the second kind and F is the incomplete elliptic integral of the first kind.

However, once again there is an approximation that is simple and elementary. The surface area approximately

S \approx 4\pi \left( \frac{(ab)^p + (bc)^p + (ac)^p}{3} \right )^{1/p}
where p = 1.6075.

Notice the similarities between the approximation for the perimeter of an ellipse and the approximation for the area of an ellipsoid. The former is the perimeter of a unit circle times a kind of mean of the axes. The latter is the area of a unit sphere times a kind of mean of the products of pairs of axes. The former uses a p-mean with p = 1.5 and the latter uses a p-mean with p = 1.6075. More on such means here.

Dimensions 4 and higher

The complexity of expressions for the surface area of an ellipsoid apparently increase with dimension. The expression get worse for hyperellipsoids, i.e. n-ellipsoids for n > 3. You can find such expressions in [1]. More of that in just a minute.

Conjecture

It is natural to conjecture, based on the approximations above, that the surface area of an n-ellipsoid is the area of a unit sphere in dimension n times the p-mean of all products of n-1 semiaxes for some value of p.

For example, the surface area of an ellipsoid in 4 dimensions might be approximately

S \approx 2\pi^2 \left( \frac{(abc)^p + (abd)^p + (acd)^p + (bcd)^p}{4} \right )^{1/p}

for some value of p.

Why this form? Permutations of the axes do not change the surface area, so we’d expect permutations not to effect the approximation either. (More here.) Also, we’d expect from dimensional analysis for the formula to involve products of n-1 terms since the result gives n-1 dimensional volume.

Surely I’m not the first to suggest this. However, I don’t know what work has been done along these lines.

Exact results

In [1] the author gives some very complicated but general expressions for the surface area of a hyperellipsoid. The simplest of his expression involves probability:

S = n \frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n+1}{2} \right)} \text{E}\left(\sqrt{q_1^2 X_1^2 + q_2^2 X_2^2 + \cdots + q_n^2 X_n^2 }\right)

where the Xs are independent normal random variables with mean 0 and variance 1/2.

At first it may look like this can be simplified. The sum of normal random variables is a normal random variable. But the squares of normal random variables are not normal, they’re gamma random variables. The sum of gamma random variables is a gamma random variable, but that’s only if the variables have the same scale parameter, and these do not unless all the semiaxes, the qs, are the same.

You could use the formula above as a way to approximate S via Monte Carlo simulation. You could also use asymptotic results from probability to get approximate formulas valid for large n.

 

[1] Igor Rivin. Surface area and other measures of ellipsoids. Advances in Applied Mathematics 39 (2007) 409–427

Justifiable sample size

One of the most common things a statistician is asked to do is compute a sample. There are well known formulas for this, so why isn’t calculating a sample size trivial?

As with most things in statistics, plugging numbers into a formula is not the hard part. The hard part is deciding what numbers to plug in, which in turn depends on understanding the context of the study. What are you trying to learn? What are the constraints? What do you know a priori?

In my experience, sample size calculation is very different in a scientific setting versus a legal setting.

In a scientific setting, sample size is often determined by budget. This isn’t done explicitly. There is a negotiation between the statistician and the researcher that starts out with talk of power and error rates, but the assumptions are adjusted until the sample size works out to something the researcher can afford.

In a legal setting, you can’t get away with statistical casuistry as easily because you have to defend your choices. (In theory researchers have to defend themselves too, but that’s a topic for another time.)

Opposing counsel or a judge may ask how you came up with the sample size you did. The difficulty here may be more expository than mathematical, i.e. the difficulty lies in explaining subtle concepts, not is carrying out calculations. A statistically defensible study design is no good unless there is someone there who can defend it.

One reason statistics is challenging to explain to laymen is that there are multiple levels of uncertainty. Suppose you want to determine the defect rate of some manufacturing process. You want to quantify the uncertainty in the quality of the output. But you also want to quantify your uncertainty about your estimate of uncertainty. For example, you may estimate the defect rate at 5%, but how sure are you that the defect rate is indeed 5%? How likely is it that the defect rate could be 10% or greater?

When there are multiple contexts of uncertainty, these contexts get confused. For example, variations on the following dialog come up repeatedly.

“Are you saying the quality rate is 95%?”

“No, I’m saying that I’m 95% confident of my estimate of the quality rate.”

Probability is subtle and there’s no getting around it.

Related posts

First time seeing a rare event

Suppose you’ve been monitoring a rare event for a long time, then you see your first occurrence on the Nth observation. Now what would you say about the event’s probability?

For example, suppose you’re wondering whether dogs ever have two tails. You observe thousands of dogs and never see two tails. But then you see a dog with two tails? Now what can you say about the probability of dogs having two tails? It’s certainly not zero.

We’ll first look at the case of 0 successes out of N trials then look at the case of 1 success out of N trials.

Zero successes

If you’re observing a binary event and you’ve seen no successes out of N trials your point estimate of the probability of your event is 0. You can’t have any confidence in the relative accuracy of your estimate: if the true probability is positive, no matter how small, then the relative error in your estimate is infinite.

But you can have a great deal of confidence in its absolute accuracy. When you’re looking for a binary event and you have not seen any instances in N trials for large N, then a 95% confidence interval for the event’s probability is approximately [0, 3/N]. This is the statistical rule of three. This is a robust estimate, one you could derive from either a frequentist or Bayesian perspective.

Note that the confidence interval [0, 3/N] is exceptionally narrow. When observing a moderate mix of successes and failures the width of the confidence interval is on the order of 1/√N, not 1/N.

First success

After seeing your first success, your point estimate jumps from 0 to 1/N, and infinite relative increase. What happens to your confidence interval?

If we use Jeffreys’ beta(1/2, 1/2) prior, then the posterior distribution after seeing 1 success and N − 1 failures is a beta(3/2, N − 1/2). Now an approximate 95% confidence interval is

[0.1/N, 4.7/N]

So compared to the case of seeing zero successes, seeing one success makes our confidence interval about 50% wider and shifts it to the left by 0.1/N.

So if you’ve seen 100,000 dogs and only 1 had two tails, you could estimate that a 95% confidence interval for the probability of a dog having two tails is

[10−6, 4.7 × 10−5].

If we run the exact numbers we get

[ 1.07 × 10−6, 4.67 ×10−5].

Related posts

Can the chi squared test detect fake primes?

Jeweler examining gemstones

This morning I wrote about Dan Piponi’s fake prime function. This evening I thought about it again and wondered whether the chi-squared test could tell the difference between the distribution of digits in real primes and fake primes.

If the distributions were obviously different, this would be apparent from looking at histograms. When distributions are more similar, visual inspection is not as reliable as a test like chi-squared. For small samples, we can be overly critical of plausible variations. For large samples, statistical tests can detect differences our eyes cannot.

When data fall into a number of buckets, with a moderate number of items expected to fall in each bucket, the chi squared test attempts to measure how the actual counts in each bucket compare to the expected counts.

This is a two-sided test. For example, suppose you expect 12% of items to fall in bucket A and 88% to fall in bucket B. Now suppose you test 1,000 items. It would be suspicious if only 50 items landed in bucket A since we’d expect around 120. On the other hand, getting exactly 120 items would be suspicious too. Getting exactly the expected number is unexpected!

Let’s look a the primes, genuine and fake, less than N = 200. We’ll take the distribution of digits in the list of primes as the expected values and compare to the distribution of the digits in the fake primes.

When I ran this experiment, I got a chi-squared value of 7.77. This is an unremarkable value for a sample from a chi-squared distribution with 9 degrees of freedom. (There are ten digits, but only nine degrees of freedom because if you rule out nine possibilities then digit is determined with certainty.)

The p-value in this case, the probability of seeing a value as large as the one we saw or larger, is 0.557.

Next I increased N to 1,000 and ran the experiment again. Now I got a chi-squared value of 19.08, with a corresponding p-value of 0.024. When I set N to 10,000 I got a chi-squared value of 18.19, with a corresponding p-value of 0.033.

When I used N = 100,000 I got a chi-squared value of 130.26, corresponding to a p-value of 10−23. Finally, when I used N = 1,000,000 I got a chi-squared value of 984.7 and a p-value of 3.4 × 10−206.

In short, the chi-squared test needs a fair amount of data to tell that fake primes are fake. The distribution of digits for samples of fake primes less than a thousand or so is plausibly the same as that of actual primes, as far as the test can distinguish. But the chi-squared values get implausibly large for fake primes up to  100,000.

Random points in a high-dimensional orthant

In high dimensions, randomly chosen vectors are very likely nearly orthogonal. I’ll unpack this a little bit then demonstrate it by simulation. Then I’ll look at what happens when we restrict our attention to points with positive coordinates.

***

The lengths of vectors don’t contribute to the angles between them, so we may as well focus our attention on spheres.

Suppose we pick two points on a familiar sphere in 3 dimensions, our home planet. By symmetry, we can assume the first vector runs from the center of the earth to the north pole. Otherwise we can imagine picking the first point, then rotating the earth so that the chosen point becomes the north pole.

What vectors are orthogonal (perpendicular) to the north pole? The vectors from the center to a point on the equator. The vectors that are nearly orthogonal to the north pole are the vectors that are near the equator.

On a high-dimensional sphere, nearly all the area is near the equator (more on that here). This means that if we choose two vectors, they are very likely to be nearly orthogonal.

Simulating points on sphere

I’ll illustrate this by choosing pairs of points at random on a unit sphere in 100 dimensions and computing their dot product. This gives the cosine similarity of the points, the cosine of the angle between them. I’ll repeat this process 1,000 times and look at the average cosine similarity.

    from scipy.stats import norm
    import numpy as np

    np.random.seed(20230809)

    numruns = 1000
    avg = 0
    dim = 100
    for _ in range(numruns):
        x = norm(0,1).rvs(size = dim)
        y = norm(0,1).rvs(size = dim)
        avg += np.dot(x, y)/(np.dot(x,x)*np.dot(y,y))**0.5
     print(avg/numruns)

Here we use the standard method of generating random points on a sphere: generate each component as a normal (Gaussian) random variable, then divide by the length. The code above combines the normalization with finding the dot product.

The code above prints 0.00043. The average dot product between vectors is very small, i.e. the vectors are nearly orthogonal.

Simulating points in orthant

Next let’s look at one orthant of the sphere. (“Orthant” is the generalization to higher dimensions of quadrant in two dimensions and octant in three dimensions.) We’ll generate random samples from the first orthant by first generating random samples from the whole sphere, then taking the absolute value to fold all the points from the other orthants into the first orthant.

The only necessary change to the code is to put abs() around the code generating the random points.

    x = abs(norm(0,1).rvs(size = dim))
    y = abs(norm(0,1).rvs(size = dim))

When we run the code again it prints 0.639. This is the average value of the cosine of the angle between points, so the average angle is about 0.877 radians or about 50°.

This isn’t quite the right thing to do. We should convert to angles first, then take the average. But that leads to essentially the same result.

If my pencil-and-paper calculations are correct, the exact expected dot product value approaches 2/π = 0.6366 as dimension goes to infinity, and the value of 0.639 above would be consistent with this.

More high-dimensional geometry posts

Variance of binned data

Suppose you have data that for some reason has been summarized into bins of width h. You don’t have the original data, only the number of counts in each bin.

You can’t exactly find the sample mean or sample variance of the data because you don’t actually have the data. But what’s the best you can do? A sensible approach would be to imagine that the contents of each bin represents samples that fell in the middle of the bin. For example, suppose your data were rounded down to the nearest integer. If you have 3 observations in the bin [5, 6] you could think of that as the value 5.5 repeated three times.

When we do this, we don’t recover the data set, but we create a pseudo data set that we can then find the sample mean and sample variance of.

This works well for estimating the mean, but it overestimates the variance. William Sheppard (1863–1936) recommended what is now known as Sheppard’s correction, subtracting h²/12 from the sample variance of the pseudo data. Richardson’s paper mentioned in the previous post fits Sheppard’s correction into the general framework of “the deferred approach to the limit.”

A more probabilistic justification of Sheppard’s correction is that h²/12 is the variance of a uniform random variable over an interval of width h. You could think of Sheppard’s correction as jittering the observations uniformly within each bin.

Let’s do a little simulation to demonstrate the accuracy of estimating the sample variance with and without Sheppard’s correction.

    from numpy import random, floor
    from scipy.stats import norm

    random.seed(20230731)

    numruns = 1000
    rawbias = 0
    sheppardbias = 0
    for run in range(numruns):
        y = norm(0, 3).rvs(size=1000)
        ypseudo = floor(y) + 0.5
        v = ypseudo.var(ddof=1)
        rawbias += abs(v - y.var(ddof=1))
        sheppardbias += abs(v - y.var(ddof=1) - 1/12)

    print(rawbias/numruns, sheppardbias/numruns)

This Python script computes the variance of the pseudodata, with and without the Sheppard correction, and compares this value to the sample variance of the actual data. This process is repeated one thousand times.

On average, the uncorrected variance of the pseudodata was off by 0.0875 compared to the variance of the actual data. With Sheppard’s correction this drops to an average of 0.0442, i.e. the calculation with Sheppard’s correction was about 2 times more accurate.

Every Japanese prefecture shrinking

It’s well known that the population of Japan has been decreasing for years, and so I was a little puzzled by a recent headline saying that Japan’s population has dropped in every one of its 47 prefectures. Although the national population is in decline, until now not all of the nation’s 47 prefectures dropped in population at the same time. Is this remarkable? Actually, yes.

This post will not be about Japan’s demographics in detail, but was motivated by the headline mentioned above.

Demographics is not random; it is the result of individual human choices. But for this post we’ll look at demographics as if it were random.

Suppose each prefecture had the same probability p of declining. Then we could infer that p is probably quite large for all prefectures to decline at the same time. If you flip a coin 47 times and it comes up heads each time, you don’t imagine it’s equally likely to come up heads or tails the next time. (More on that here.)

For a Bernoulli random variable, how large would p have to be in order for the probability of 47 successes out of 47 trials to have probability at least 1/2?

If p47 = 1/2, then p = (1/2)1/47 = 0.985.

If the value of p is different in each prefecture, and the product of all the p‘s is 1/2, then all the p must be at least 1/2, and nearly all much larger. If one of the p‘s were equal to 1/2, the rest would have to be 1.

Applying this understanding to the headline at the top of the post, we could say that the fact that every prefecture has gone down in population is strong evidence that population will continue to drop, as long as current conditions continue.

Related posts

A note on Zipf’s law

Very often when a number is large, and we don’t know or care exactly how large it is, we can model it as infinite. This may make no practical difference and can make calculations much easier. I give several examples of this in the article Infinite is easier than big.

When you run across a statement that something is infinite, you can often mentally replace “infinite” with “big, but so big that it doesn’t matter exactly how big it is.” Conversely, the term “finite” in applied mathematics means “limited in a way that matters.”

But when we say something does or does not matter, there’s some implicit context. It does or does not matter for some purpose. We can get into trouble when we’re unclear of that context or when we shift contexts without realizing it.

Zipf’s law

Zipf’s law came up a couple days ago in a post looking at rare Chinese characters. That post resolves a sort of paradox, that rare characters come up fairly often. Each of these rare characters is very unlikely to occur in a document, and yet collectively its likely that some of them will be necessary.

In some contexts we can consider the number of Chinese characters or English words to be infinite. If you want to learn English by first learning the meaning of every English word, you will never get around to using English. The number of words is infinite in the sense that you will never get to the end.

Zipf’s law says that the frequency of the nth word in a language is proportional to ns. This is of course an approximation, but a surprisingly accurate approximation for such a simple statement.

Infinite N?

The Zipf probability distribution depends on two parameters: the exponent s and the number of words N. Since the number of words in any spoken language is large, and its impossible to say exactly how large it is, can we assume N = ∞? This seems like exactly the what we were talking about at the top of this article.

Whether it is possible to model N as infinite depends on s. The value of s that models word frequency in many languages is close to 1. The plot below illustrates Zipf’s law for the text of Wikipedia in many different languages. In each case, the slope on the log-log plot is approximately 1, i.e. s ≈ 1.

When s = 1, we don’t have a probability distribution because the sum of 1/n from 1 to ∞ diverges. And so no, we cannot assume N = ∞.

Now you may object that all we have to do is set s to be slightly larger than 1. If s = 1.0000001, then the sum of ns converges. Problem solved. But not really.

When s = 1 the series diverges, but when s is slightly larger than 1 the sum is very large. Practically speaking this assigns too much probability to rare words.

If s = 2, for example, one could set N = ∞. The Zipf distribution with s = 2 may be useful for modeling some things, but not for modeling word frequency.

Zipf’s law applied to word frequency is an interesting example of a model that contains a large N, and although it doesn’t matter exactly how large N is, it matters that N is finite. In the earlier article, I used the estimate that Chinese has 50,000 characters. If I had estimated 60,000 characters the conclusion of the article would not have been too different. But assuming an infinite number of characters would result in substantially overestimating the frequency of rare words.

Related posts

Image above by Sergio Jimenez, CC BY-SA 4.0, via Wikimedia Commons

V-statistics

A few days ago I wrote about U-statistics, statistics which can be expressed as the average of a symmetric function over all combinations of elements of a set. V-statistics can be written as an average of over all products of elements of a set.

Let S be a statistical sample of size n and let h be a symmetric function of r elements. The average of h over all subsets of S with r elements is a U-statistic. The average of h over the Cartesian product of S with itself r times

\underbrace{S \times S \times \cdots \times S}_{n \text{ times}}

is a V-statistic.

As in the previous post, let h(x, y) = (xy)²/2. We can illustrate the V-statistic associated with h with Python code as before.

    import numpy as np
    from itertools import product

    def var(xs):
        n = len(xs)
        h = lambda x, y: (x - y)**2/2
        return sum(h(*c) for c in product(xs, repeat=2)) / n**2

    xs = np.array([2, 3, 5, 7, 11])
    print(np.var(xs))
    print(var(xs))

This time, however, we iterate over product rather than over combinations. Note also that at the bottom of the code we print

   np.var(xs)

rather than

   np.var(xs, ddof=1)

This means our code here is computing the population variance, not the sample variance. We could make this more explicit by supplying the default value of ddof.

   np.var(xs, ddof=0)

The point of V-statistics is not to calculate them as above, but that they could be calculated as above. Knowing that a statistic is an average of a symmetric function is theoretically advantageous, but computing a statistic this way would be inefficient.

U-statistics are averages of a function h over all subsamples of S of size r without replacement. V-statistics are averages of h over all subsamples of size r with replacement. The difference between sampling with or without replacement goes away as n increases, and so V-statistics have the same asymptotic properties as U-statistics.

Related posts