Mentally approximating factorials

Posted on 22 June 2023 by John

Suppose you’d like to have a very rough idea how large n! is for, say, n less than 100.

If you’re familiar with such things, your Pavlovian response to “factorial” and “approximation” is Stirling’s approximation. Although Stirling’s approximation is extremely useful—I probably use it every few weeks—it is not conducive to mental calculation.

The cut point

It’s useful to know that 24! ≈ 10²⁴ and 25! ≈ 10²⁵.

Said another way, the curves for n! and 10ⁿ cross approximately midway between 24 and 25. To the left of the crossing, n! < 10ⁿ and to the right of the crossing n! > 10ⁿ.

So, for example, if you hear someone refer to permutations of the English alphabet, you know the number of permutations is going to be north of 10²⁶.

Left of the cut

Suppose you want to estimate n! for n < 24. You know n! < 10ⁿ, but maybe you’d like to be a little more precise.

I’ll suppose you know n! for n up to 6. The approximation

log₁₀ n! ≈ n − 2

has an absolute error of less than 1.5 for n = 7, 8, 9, …, 23.

Right of cut

For n = 26, 27, 28, …, 100 the approximation

log₁₀ n! ≈ 7n/4 − 20

has an absolute error less than 3.

Note that calculating 7n/4 as n + n/2 + n/4 is probably easier than calculating (7n)/4.

Leading digits of primes

Posted on 22 June 2023 by John

How are the first digits of primes distributed? Do some digits appear as first digits of primes more often that others? How should we even frame the problem?

There are an infinite number of primes that begin with each digit, so the cardinalities of the sets of primes beginning with each digit are the same. But cardinality isn’t the right tool for the job.

The customary way to (try to) approach problems like this is to pose a question for integers up to N and then let N got to infinity. This is called natural density. But this limit doesn’t always converge, and I don’t believe it converges in our case.

Logarithmic density

An alternative is to compute logarithmic density. This measure exists in cases where natural density does not. And when natural density does exist, logarithmic density may give the same result [1].

The logarithmic density of a sequence A is defined by

$\lim_{N\to\infty}\frac{1}{\log N}\sum_{x \in A, \, x \leq N} \frac{1}{x}$

We want to know about the relative logarithmic density of primes that start with 1, for example. The relative density of a sequence A relative to a sequence B is defined by

$\lim_{N\to\infty} \frac{\sum_{x \in A, \, x \leq N} \frac{1}{x}} {\sum_{x \in B, \, x \leq N} \frac{1}{x}}$

We’d like to know about the logarithmic density of primes that start with 1, 2, 3, …, 9 relative to all primes. The exact value is known [2], and we will get to that shortly, but first let’s try to anticipate the result by empirical calculation.

We want to look at the first million primes, and we could do that by calling the function prime with the integers 1 through 1,000,000. But it is much more efficient to as SymPy to find the millionth prime and then find all primes up to that prime.

    from sympy import prime, primerange
    import numpy as np

    first_digit = lambda n: int(str(n)[0])

    density = np.zeros(10)
    N = 1_000_000
    for x in primerange(prime(N+1)):
        density[0] += 1/x
        density[first_digit(x)] += 1/x
    density /= density[0]

Here’s a histogram of the results.

Clearly the distribution is uneven, and in general more primes begin with small digits than large digits. But the distribution is somewhat irregular. That’s how things work with primes: they act something like random numbers, except when they don’t. This sort of amphibious nature, between regularity and irregularity, makes primes interesting.

The limits defining the logarithmic relative density do exist, though the fact that even a million terms doesn’t produce a smooth histogram suggests the convergence is not too rapid.

In the limit we get that the density of primes with first digit d is

$\log_{10}\left(1 + \frac{1}{d}\right)$

The densities we’d see in the limit are plotted as follows.

This is exactly the density given by Benford’s law. However, this is not exactly Benford’s law because we are using a different density, logarithmic relative density rather than natural density [3].

[1] The Davenport–Erdős theorem says that for some kinds of sequences, if the natural density exists, logarithmic density also exists and will give the same result.

[2] R. E. Whitney. Initial Digits for the Sequence of Primes. The American Mathematical Monthly, Vol. 79, No. 2 (Feb., 1972), pp. 150–152

[3] R. L. Duncan showed that the leading digits of integers satisfy the logarithmic density version of Benford’s law.

Matching delimiters and chiastic patterns

Posted on 21 June 2023 by John

When I first started programming I’d occasionally get an error because a delimiter wasn’t matched. Maybe I had a { without a matching }, for example. Or if I were writing Pascal, which I did back in the day, I might have had a BEGIN statement without a corresponding END statement.

At some point I saw someone type an opening delimiter, then the closing delimiter, then back up and insert the content in between. That seemed strange for about five seconds, then I realized it was brilliant and have adopted that practice ever since.

I’ve rarely seen a mismatched delimiter since then, but I ran into one this morning. In order to explain what happened, I’ll need to talk about LaTeX syntax.

LaTeX delimiters

LaTeX has two modes for math symbols: inline and display. Inline symbols appear in the middle of a sentence, while displayed symbols are set apart on their own line and centered. This is analogous to inline elements and block elements in HTML.

When I learned LaTeX, everyone used one dollar sign for inline mode and two dollar signs for display mode. So, for example, you would write $x$ for an x in the context of prose, and $$x = 5$$ to display an equation stating that x equals 5.

Now the preferred syntax is to use \[ to begin display mode and \] to end it. This has several advantages. In particular it makes it easier to debug mismatched delimiter errors since you can count the number of open and closed delimiters; you couldn’t tell without context whether $$ begins or ends a displayed equation.

It’s still common to use dollar signs for inline math mode, but there is an alternate notation that uses $ to begin and $ to end. This notation is not commonly used as far as I know. Escaped parentheses have all the theoretical advantages of escaped brackets, but in practice the scope of inline math content is very small and so it’s not a problem that the opening and closing delimiters are the same.

Org-mode

The only time I use $ and $ is when I’m not directly writing LaTeX but writing an org-mode file that generates LaTeX. I chose to write a client’s repoort in org-mode rather than directly in LaTeX because the document is very long, and so org-mode’s outlining is convenient. I can hide or expand parts of the tree by pressing a tab, for example. Also, the document has a few tables, and tables are easier to write in org-mode’s markdown than in LaTeX.

Unfortunately org-mode sometimes understands expressions like $x$ and sometimes it doesn’t. It would be safer to always write $x$ but the former works often enough that I tend to use it out of habit. The root of my problem this morning was I had used dollar sign delimiters in a way that confused org-mode.

All else being equal, it’s best when opening and closing delimiters are different. I might advise someone learning LaTeX today to use the $x$ notation, even though I continue to use $x$ .

Chiastic patterns

When I was trying to debug my unmatched delimiter problem, I searched for all \begin and \end statements in my (org-mode generated) LaTeX file by searching on the regular expression \\begin\|\\end. Here’s an example of what I got back (with indentation added).

    \begin{quote}
    \end{quote}
    \begin{table}[htbp]
        \begin{tabular}{ll}
        \end{tabular}
    \end{table}
    \begin{algorithm}
        \begin{algorithmic}[1]
            \begin{enumerate}
            \end{enumerate}
        \end{algorithmic}
    \end{algorithm}

Sometimes opening and closing delimiters are consecutive, such as the beginning and end of the quote. But sometimes you’ll find a chiasmus pattern, named for the Greek letter χ. Notice the table above has an ABBA pattern and the algorithm has an ABCCBA pattern.

This kind of pattern is extremely common in the Bible, though it’s easy for modern readers to miss it. One reason this pattern matters is that the context of a sentence is not only the sentences above and below, but also the parallel sentence in the chiastic pattern. Furthermore, the most important point in a chiasmus is often at the middle.

While chiastic patterns were common in ancient eastern prose, they are far less common in modern western prose. However, chiastic patters are common in programming, so common that no one ever comments on them.

You can see the chiastic pattern in the indentation of source code. The context of a delimiter is its matching delimiter, which may many lines away. And the most important line of code for optimization is the code in the innermost loop, at the focus of a chiasmus.

I became aware of chiastic patterns in the Bible by reading Paul Through Mediterranean Eyes. This book has many outline illustrations that look a lot like source code such as nested for-loops.

Equation of an ellipse or hyperbola through five points

Posted on 19 June 2023 by John

Yesterday I wrote about the equation of a circle through three points. This post will be similar, looking at the equation of an ellipse or hyperbola through five points. As with the earlier post, we can write down an elegant equation right away. Making the equation practical will take more work.

The equation of a general conic section through five points

$\begin{vmatrix} x^2 & y^2 & xy & x & y & 1 \\ x_1^2 & y_1^2 & x_1 y_1 & x_1 & y_1 & 1 \\ x_2^2 & y_2^2 & x_2 y_2 & x_2 & y_2 & 1 \\ x_3^2 & y_3^2 & x_3 y_3 & x_3 & y_3 & 1 \\ x_4^2 & y_4^2 & x_4 y_4 & x_4 & y_4 & 1 \\ x_5^2 & y_5^2 & x_5 y_5 & x_5 & y_5 & 1 \\ \end{vmatrix} = 0$

Direct but inefficient solution

The most direct (though not the most efficient) way to turn this into an equation

is to expand the determinant by minors of the first row.

$A = \begin{vmatrix} y_1^2 & x_1 y_1 & x_1 & y_1 & 1 \\ y_2^2 & x_2 y_2 & x_2 & y_2 & 1 \\ y_3^2 & x_3 y_3 & x_3 & y_3 & 1 \\ y_4^2 & x_4 y_4 & x_4 & y_4 & 1 \\ y_5^2 & x_5 y_5 & x_5 & y_5 & 1 \\ \end{vmatrix} ,\, B = - \begin{vmatrix} x_1^2 & x_1 y_1 & x_1 & y_1 & 1 \\ x_2^2 & x_2 y_2 & x_2 & y_2 & 1 \\ x_3^2 & x_3 y_3 & x_3 & y_3 & 1 \\ x_4^2 & x_4 y_4 & x_4 & y_4 & 1 \\ x_5^2 & x_5 y_5 & x_5 & y_5 & 1 \\ \end{vmatrix}$

$C = \begin{vmatrix} x_1^2 & y_1^2 & x_1 & y_1 & 1 \\ x_2^2 & y_2^2 & x_2 & y_2 & 1 \\ x_3^2 & y_3^2 & x_3 & y_3 & 1 \\ x_4^2 & y_4^2 & x_4 & y_4 & 1 \\ x_5^2 & y_5^2 & x_5 & y_5 & 1 \\ \end{vmatrix} ,\, D = - \begin{vmatrix} x_1^2 & y_1^2 & x_1 y_1 & y_1 & 1 \\ x_2^2 & y_2^2 & x_2 y_2 & y_2 & 1 \\ x_3^2 & y_3^2 & x_3 y_3 & y_3 & 1 \\ x_4^2 & y_4^2 & x_4 y_4 & y_4 & 1 \\ x_5^2 & y_5^2 & x_5 y_5 & y_5 & 1 \\ \end{vmatrix}$

$E = \begin{vmatrix} x_1^2 & y_1^2 & x_1 y_1 & x_1 & 1 \\ x_2^2 & y_2^2 & x_2 y_2 & x_2 & 1 \\ x_3^2 & y_3^2 & x_3 y_3 & x_3 & 1 \\ x_4^2 & y_4^2 & x_4 y_4 & x_4 & 1 \\ x_5^2 & y_5^2 & x_5 y_5 & x_5 & 1 \\ \end{vmatrix} ,\, F = - \begin{vmatrix} x_1^2 & y_1^2 & x_1 y_1 & x_1 & y_1\\ x_2^2 & y_2^2 & x_2 y_2 & x_2 & y_2\\ x_3^2 & y_3^2 & x_3 y_3 & x_3 & y_3\\ x_4^2 & y_4^2 & x_4 y_4 & x_4 & y_4\\ x_5^2 & y_5^2 & x_5 y_5 & x_5 & y_5\\ \end{vmatrix}$

Python example

The following code will generate five points and set up the system of equations Ms = b whose solution s will give our coefficients.

    import numpy as np

    np.random.seed(20230619)

    M = np.zeros((5,5))
    b = np.zeros(5)
    for i in range(5):
        x = np.random.random()
        y = np.random.random()    
        M[i, 0] = y**2
        M[i, 1] = x*y
        M[i, 2] = x
        M[i, 3] = y
        M[i, 4] = 1
        b[i] = -x*x

Now we solve for the coefficients.

    s = np.linalg.solve(M, b)
    A, B, C, D, E, F = 1, s[0], s[1], s[2], s[3], s[4]

Next we verify that the points we generated indeed lie on the conic section whose equation we just found.

    for i in range(5):
        x = M[i, 2]
        y = M[i, 3]
        assert( abs(A*x*x + B*y*y + C*x*y + D*x + E*y + F) < 1e-10 )

Ellipse or hyperbola?

It turns out that the five points generated above lie on a hyperbola.

If we had set the random number generator seed to 20230620 we would have gotten an ellipse.

When we generate points at random, as we did above, we will almost certainly get an ellipse or a hyperbola. Since we’re generating pseudorandom numbers, there is some extremely small chance the generated points would lie on a line or on a parabola. In theory, this almost never happens, in the technical sense of “almost never.”

If 4AB − C² is positive we have a an ellipse. If it is negative we have a hyperbola. If it is zero, we have a parabola.

A parabola, a triangle, and a circle

Posted on 18 June 2023 by John

Let P be a parabola. Draw tangents to P at three different points. These three lines intersect to form a triangle.

Theorem: The circumcircle of this triangle passes through the focus of P. [1]

In the image above, the dashed lines are tangents and the black dot is the focus of the parabola.

(See this post for an explanation of what focus and directrix mean.)

By the way, creating the image above required finding the circle through the three intersection points. See the previous post for how to find such a circle.

[1] Ross Honsberger. Episodes in Nineteenth and Twentieth Century Euclidean Geometry. Mathematical Association of America. 1995. Page 47.

Equation of a circle through three points

Posted on 18 June 2023 by John

Three given points on a circle with unknown center

A few months ago I wrote about several equations that come up in geometric calculations that all have the form of a determinant equal to 0, where one column of the determinant is all 1s.

The equation of a circle through three points (x₁, y₁), (x₂, y₂), and (x₃, y₃) has this form:

$\begin{vmatrix} x^2 + y^2 & x & y & 1 \\ x_1^2 + y_1^2 & x_1 & y_1 & 1 \\ x_2^2 + y_2^2 & x_2 & y_2 & 1 \\ x_3^2 + y_3^2 & x_3 & y_3 & 1 \\ \end{vmatrix} = 0$

This is elegant equation, but it’s not in a form that’s ready to program. This post will start with the equation above and conclude with code.

Let M_ij be the determinant of the minor of the (i, j) element of the matrix above. That is, M_ij is the determinant of the 3 × 3 matrix obtained by deleting the ith row and jth column. Then we can expand the determinant above by minors of the last column to get

$M_{11}(x^2 + y^2) - M_{12}x + M_{13}y - M_{14} = 0$

which we can rewrite as

$x^2 + y^2 -\frac{M_{12}}{M_{11}} x + \frac{M_{13}}{M_{11}} y - \frac{M_{14}}{M_{11}} = 0.$

A circle with center (x₀, y₀) and radius r₀ has equation

By equating the coefficients x and y in the two previous equations we have

$\begin{align*} x_0 &= \phantom{-}\frac{1}{2} \frac{M_{12}}{M_{11}} \\ y_0 &= -\frac{1}{2} \frac{M_{13}}{M_{11}} \\ \end{align*}$

Next we expand the determinant minors in more explicit form.

$\begin{align*} M_{11} &= \begin{vmatrix} x_1 & y_1 & 1 \\ x_2 & y_2 & 1 \\ x_3 & y_3 & 1 \\ \end{vmatrix} \\ &= (x_2y_3 - x_3y_2) - (x_1y_3 - x_3y_1) + (x_1y_2 - x_2y_1) \\ &= (x1y_2 + x_2y_3 + x_3y_1) - (x_2y_1 + x_3y_2 + x_1y_3) \end{align*}$

Having gone to the effort of finding a nice expression for M₁₁ we can reuse it to find M₁₂ by the substitution

$x_i \to s_i \equiv x_i^2 + y_i^2$

as follows:

$\begin{align*} M_{12} &= \begin{vmatrix} x_1^2 + y_1^2 & y_1 & 1 \\ x_2^2 + y_2^2 & y_2 & 1 \\ x_3^2 + y_3^2 & y_3 & 1 \\ \end{vmatrix} = \begin{vmatrix} s_1^2 & y_1 & 1 \\ s_2^2 & y_2 & 1 \\ s_3^2 & y_3 & 1 \\ \end{vmatrix} \\ &= (s_1y_2 + s_2y_3 + s_3y_1) - (s_2y_1 + s_3y_2 + s_1y_3) \end{align*}$

And we can find M₁₃ from M₁₂ by reversing the role of the xs and ys.

$\begin{align*} M_{13} &= \begin{vmatrix} x_1^2 + y_1^2 & x_1 & 1 \\ x_2^2 + y_2^2 & x_2 & 1 \\ x_3^2 + y_3^2 & x_3 & 1 \\ \end{vmatrix} = \begin{vmatrix} s_1^2 & x_1 & 1 \\ s_2^2 & x_2 & 1 \\ s_3^2 & x_3 & 1 \\ \end{vmatrix} \\ &= (s_1x_2 + s_2x_3 + s_3x_1) - (s_2x_1 + s_3x_2 + s_1x_3) \end{align*}$

Now we have concrete expressions for M₁₁, M₁₂, and M₁₃, and these give us concrete expressions for x₀ and y₀.

To find the radius of the circle we simply find the distance from the center (x₀, y₀) to any of the points on the circle, say (x₁, y₁).

The following Python code incorporates the derivation above.

    def circle_thru_pts(x1, y1, x2, y2, x3, y3):
        s1 = x1**2 + y1**2
        s2 = x2**2 + y2**2
        s3 = x3**2 + y3**2
        M11 = x1*y2 + x2*y3 + x3*y1 - (x2*y1 + x3*y2 + x1*y3)
        M12 = s1*y2 + s2*y3 + s3*y1 - (s2*y1 + s3*y2 + s1*y3)
        M13 = s1*x2 + s2*x3 + s3*x1 - (s2*x1 + s3*x2 + s1*x3)
        x0 =  0.5*M12/M11
        y0 = -0.5*M13/M11
        r0 = ((x1 - x0)**2 + (y1 - y0)**2)**0.5
        return (x0, y0, r0)

The code doesn’t use any loops or arrays, so it ought to be easy to port to any programming language.

For some inputs the value of M₁₁ can be (nearly) zero, in which case the three points (nearly) lie on a line. In practice M₁₁ can be nonzero and yet zero for practical purposes and so that’s something to watch out for when using the code.

The determinant at the top of the post is zero when the points are colinear. There’s a way to make rigorous the notion that in this case the points line on a circle centered at infinity.

Update: Equation of an ellipse or hyperbola through five points

Set of orbits with the same average distance to sun

Posted on 17 June 2023 by John

Suppose a planet is in an elliptical orbit around the sun with semimajor axis a and semiminor axis b. Then the average distance of the planet to the sun over time equals

a(1 + e²/2)

where the eccentricity e satisfies

e² = 1 − b²/a².

You can find a proof of this statement in [1].

This post will look at the set of all orbits with a fixed average distance r to the sun. Without loss of generality we can choose our units so that r = 1.

Clearly one possibility is to set a = b = 1 so the orbit is a circle. The distance is constantly 1, so the average is 1.

We can also maintain a distance of 1 by reducing a but increasing the eccentricity e. The possible orbits of average distance 1 satisfy

a(1 + e²/2) = 1

with 0 < b ≤ a ≤ 1. A little algebra shows that

b = √(3a² – 2a),

and that 2/3 < a ≤ 1. As a approaches 2/3, b approaches 0.

Let’s put the center of our coordinate system at the sun and assume the other focus of the elliptical orbits is somewhere along the positive x-axis. When e is 0 we have a unit circle orbit. As e approaches 1, the orbits approach a horizontal line with the sun on one end.

[1] Sherman K. Stein. “Mean Distance” in Kepler’s Third Law. Mathematics Magazine, Vol. 50, No. 3 (May, 1977), pp. 160–162

Pushing numerical integration software to its limits

Posted on 12 June 2023 by John

The previous post discussed the functions

$f_n(x) = \sum_{k=1}^n \frac{|\sin( k\pi x )|}{k}$

as test cases for plotting. This post will look at using the same functions as test cases for integration. As you can see from the plot of f₃₀(x) below, the function is continuous, but the derivative of the function has a lot of discontinuities.

The integrals of Steinerberger’s functions can be computed easily in closed form. Each term in the sum integrates to 2/πk and so

$\int_0^1 f_n(x)\, dx = \frac{2}{\pi} \sum_{k=1}^n \frac{1}{k} = 2 H_n / \pi$

where H_n is the nth harmonic number.

Symbolic integration

When I tried to get Mathematica to compute the integral above for general n it got stuck.

When I tried the specific case of 10 with

    Integrate[f[x, 10], {x, 0, 1}]

it churned for a while, then returned a complicated expression involving nested radicals:

    (1/(79380 \[Pi]))(465003 + 400 Sqrt[2 (48425 - 13051 Sqrt[5])] -
    35600 Sqrt[2 (5 - Sqrt[5])] + 42800 Sqrt[2 (5 + Sqrt[5])] -
    400 Sqrt[2 (48425 + 13051 Sqrt[5])])

Mathematica could not simplify the result to

    HarmonicNumber[10] 2/Pi]

but it could confirm that the two expressions are numerically equal within the limitations of floating point precision.

Assuming the two expressions for the integral are equal, it would be interesting to see why this is so. I don’t see off hand where the radicals above come from.

Numerical integration

When I asked Mathematica to compute the integral above numerically with

    NIntegrate[f[x, 10], {x, 0, 1}]

it produced an answer which correct to four decimal places and a warning

NIntegrate failed to converge to prescribed accuracy after 9 recursive bisections in x near {x} = {0.667953}. NIntegrate obtained 1.8646445043323603` and 2.908590085997311`*^-6 for the integral and error estimates.

Even though NIntegrate didn’t achieve full precision, it warned that this was the case. It produced a decent answer and a fairly good estimate of its error, about half of the actual error.

I tried numerically integrating the same function with Python.

    from scipy.integrate import quad
    quad(lambda x: f(x, 10), 0, 1)

Python’s quad function did about the same as Mathematica’s NIntegrate. It produced the following warning:

IntegrationWarning: The maximum number of subdivisions (50) has been achieved. If increasing the limit yields no improvement it is advised to analyze the integrand in order to determine the difficulties. If the position of a local difficulty can be determined (singularity, discontinuity) one will probably gain from splitting up the interval and calling the integrator on the subranges. Perhaps a special-purpose integrator should be used. (1.8646395747766926, 0.0024098799348653954)

This error message is more helpful, and the error estimate 0.0024 is an upper bound on the error.

Increasing the number of subdivisions by a factor of 10 with

    quad(lambda x: f(x, 10), 0, 1, limit=500)

eliminated the warning and made the integration much more accurate.

Plotting a function with a lot of local minima

Posted on 12 June 2023 by John

Stefan Steinerberger defines “an amusing sequence of functions” in [1] by

$f_n(x) = \sum_{k=1}^n \frac{|\sin( k\pi x )|}{k}$

Here’s a plot of f₃₀(x):

As you can see, f_n(x) has a lot of local minima, and the number of local minima increases rapidly with n.

Here’s a naive attempt to produce the plot above using Python.

    from numpy import sin, pi, linspace
    import matplotlib.pyplot as plt

    def f(x, n):
        return sum( abs(sin(k*pi*x))/k  for k in range(1, n+1) )

    x = linspace(0, 1)
    plt.show()

This produces the following.

The code above doesn’t specify how finely to divide the interval [0, 1], and by default linspace will produce 50 evenly spaced points. In the example above we need a lot more points. But how many? This is not a simple question.

You could try more points, adding a third argument to linspace, say 1000. That will produce something that looks more like the first plot, but is the first plot right?

The first plot was produced by the Mathematica statements

    f[x_, n_] := Sum[Abs[Sin[Pi k x]/k], {k, 1, n}]
    Plot[f[x, 30], {x, 0, 1}]

Mathematica adaptively determines how many plotting points to use. It tries to detect when a function is changing rapidly in some region and allocate more points there. The function that is the subject of this post makes a good test case for such automated refinement.

Often Mathematica does a good enough job automatically, but sometimes it needs help. The Plot function takes additional parameters like PlotPoints and MaxRecursion that you can use to tell Mathematica to work harder.

It would be an interesting exercise to make a high quality plot of f_n for some moderately large n, making sure you’ve captured all the local minima.

[1] Stefan Steinerberger. An Amusing Sequence of Functions. Mathematics Magazine, Vol. 91, No. 4 (October 2018), pp. 262–266

Collecting a large number of coupons

Posted on 9 June 2023 by John

This post is an addendum to the recent post Reviewing a thousand things. We’re going to look again at the coupon collector problem, randomly sampling a set of N things with replacement until we’ve collected one of everything.

As noted before, for large N the expected number of draws before you’ve seen everything at least once is approximately

N( log(N) + γ )

where γ is Euler’s constant.

A stronger result using a Chernoff bound says that for any constant c, the probability that the number of draws exceeds

N( log(N) + c )

approaches

1 − exp( − exp(−c))

as N increases. See [1].

I said in the earlier post that if you were reviewing a thousand things by sampling flash cards with replacement, there’s a 96% chance that after drawing 10,000 cards you would have seen everything. We could arrive at this result using the theorem above.

If N = 1,000 and we want N( log(N) + c ) to equal 10,000, we set c = 3.092. Then we find

1 − exp( − exp(−c) ) = 0.0444

which is consistent with saying there’s a 96% chance of having seen everything, 95.56% if you want to be more precise.

[1] Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.

Month: June 2023

Mentally approximating factorials

The cut point

Left of the cut

Right of cut

Related posts

Leading digits of primes

Logarithmic density

Related posts

Matching delimiters and chiastic patterns

LaTeX delimiters

Org-mode

Chiastic patterns

Equation of an ellipse or hyperbola through five points

Direct but inefficient solution

More efficient solution

Python example

Ellipse or hyperbola?

Related posts

A parabola, a triangle, and a circle

Equation of a circle through three points

Set of orbits with the same average distance to sun

Related posts

Pushing numerical integration software to its limits

Symbolic integration

Numerical integration

Related posts

Plotting a function with a lot of local minima

Collecting a large number of coupons