MCMC and the coupon collector problem

Bob Carpenter wrote today about how Markov chains cannot thoroughly cover high-dimensional spaces, and that they do not need to. That’s kinda the point of Monte Carlo integration. If you want systematic coverage, you can/must sample systematically, and that’s impractical in high dimensions.

Bob gives the example that if you want to get one integration point in each quadrant of a 20-dimensional space, you need a million samples. (220 to be precise.) But you may be able to get sufficiently accurate results with a lot less than a million samples.

If you wanted to be certain to have one sample in each quadrant, you could sample at (±1, ±1, ±1, …, ±1). But if for some weird reason you wanted to sample randomly and hit each quadrant, you have a coupon collector problem. The expected number of samples to hit each of N cells with uniform [1] random sampling with replacement is

N( log(N) + γ )

where γ is Euler’s constant. So if N = 220, the expected number of draws would be over 15 million.

Related posts

[1] We’re assuming each cell is sampled independently with equal probability each time. Markov chain Monte Carlo is more complicated than that because the draws are not independent.

The Borwein integrals

The Borwein integrals introduced in [1] are a famous example of how proof-by-example can go wrong.

Define sinc(x) as sin(x)/x. Then the following equations hold.

 \begin{align*} \int_0^\infty \text{sinc}(x) \,dx &= \frac{\pi}{2} \\ \int_0^\infty \text{sinc}(x) \, \text{sinc}\left(\frac{x}{3}\right) \,dx &= \frac{\pi}{2} \\ \int_0^\infty \text{sinc}(x)\, \text{sinc}\left(\frac{x}{3}\right) \,\text{sinc}\left(\frac{x}{5}\right) \,dx &= \frac{\pi}{2} \\ \vdots &\phantom{=} \\ \int_0^\infty \text{sinc}(x) \, \text{sinc}\left(\frac{x}{3}\right) \cdots \text{sinc}\left(\frac{x}{13}\right) \,dx &= \frac{\pi}{2} \\ \end{align*}


\int_0^\infty \text{sinc}(x) \, \text{sinc}\left(\frac{x}{3}\right) \cdots \text{sinc}\left(\frac{x}{15}\right) \,dx = \frac{\pi}{2} - \delta

where δ ≈ 2.3 × 10−11.

This is where many presentations end, concluding with the moral that a pattern can hold for a while and then stop. But I’d like to go just a little further.


B(n) = \int_0^\infty \prod_{k=0}^{n} \text{sinc}\left(\frac{x}{2k+1}\right) \, dx.

Then B(n) = π/2 for n = 1, 2, 3, …, 6 but not for n = 7, though it almost holds for n = 7. What happens for larger values of n?

The Borwein brothers proved that B(n) is a monotone function of n, and the limit as n → ∞ exists. In fact the limit is approximately π/2 − 0.0000352.

So while it would be wrong to conclude that B(n) = π/2 based on calculations for n ≤ 6, this conjecture would be approximately correct, never off by more than 0.0000352.

[1] David Borwein and Jonathan Borwein. Some Remarkable Properties of Sinc and Related Integrals. The Ramanujan Journal, 3, 73–89, 2001.

Integrals involving secants and tangents

As a student, I often made the mistake of thinking that if I knew a more powerful theorem, I didn’t need to learn a less powerful theorem. The reason this is a mistake is that the more powerful theorem may be better by one obvious criterion but not be better by other less-obvious criteria.

The most powerful solution technique is not always the best, not because there’s anything immoral about using an over-powered technique, but because the most powerful technique in theory might not be that useful in practice.

The most powerful technique for integrating rational functions of trig functions is Weierstrass’ t = tan(x/2) trick which I wrote about a few days ago. In principle, Weierstrass’ trick reduces integrating any rational combination of trig functions to integrating a rational function of x. And in principle, partial factions reduce any such integral to closed form.

But that which is doable in principle may not be doable in practice. If it is doable, this might not be the easiest approach.

If P is a polynomial in two variables, then [1] proves that

\int P(\sec x, \tan x) \, \sec x\, dx = F(u) - G(v) + c \log u + C


\begin{align*} u &= \sec x + \tan x \\ v &= \sec x - \tan x \\ \end{align*}

If your integral has the form above, this approach will probably be far simpler than using Weierstrass’ substitution.

The key to proving the theorem above is to use the facts

\begin{align*} \sec x &= \frac{u+v}{2} \\ \tan x &= \frac{u-v}{2} \\ uv &= 1 \end{align*}

You can find a detailed explanation in [1], but essentially the equations above are enough to not only prove the theorem but to compute F and G.

The paper gives an example, integrating

\int 16 \sec^5 x\, dx

in less than half a printed page. Computing the integral with Weierstrass’ trick would require integrating

\int 32 \frac{(1 + t^2)^4}{(1 - t^2)^5} \, dt

which must be possible, though it seems daunting.

Incidentally, the paper also shows how to compute integrals of the form

\int P(\sec x, \tan x) \, dx

removing the requirement of a factor of sec x outside the polynomial.

Related posts

[1] Jonathan P. McCammond. Integrating Polynomials in Secant and Tangent. The American Mathematical Monthly, Nov., 1999, Vol. 106, No. 9, pp. 856–858

The world’s sneakiest substitution

Something that seems like an isolated trick may turn out to be much more important. This is the case with a change of variables discovered by Karl Weierstrass.

Every calculus student learns a handful of standard techniques: u-substitutions, partial fractions, integration by parts, and trig substitutions. Then there is one more technique that is like something off a secret menu, something know only to the cognoscenti: Weierstrass’ t = tan(x/2) trick [1]. Michael Spivak called this “the world’s sneakiest substitution.”

The world's sneakiest substitution is undoubtedly t = tan(x/2), x = 2 arctan t, dx = 2 dt / (1 + t^2)

This was on page 325 of the first edition of Spivak’s Calculus, page 360 of the second edition.

This innocent but apparently unmotivated change of variables has the following magical consequences:

  • sin(x) = 2t/(1 + t2)
  • cos(x) = (1 – t2) / (1 + t2)
  • dx = 2 dt/(1 + t2).

This means that the trick can convert an integral containing any rational combination of trig functions into one involving only rational functions, which then can (in principle) be integrated in closed form using partial fractions. It’s the key that unlocks all trig integrals.

However, this is not as practical as it sounds. You may get a high-degree rational function, and while in theory the rational function can be integrated using partial fractions, the decomposition may be tedious if not impossible. Still, it’s interesting that a single trick reduces one large class of integration problems to another.

Now let’s leave integration behind. The equations above say that as t varies, the functions 2t/(1 + t2) and (1 – t2) / (1 + t2) take on all the values that sin(x) and cos(x) do. This means, for example, that you can draw a circle using graphics hardware that does not support sines and cosines. In theory the range of t would have to be infinite, but in practice the range would only have to be large enough to fill in the right pixels.

If we can parameterize a circle with rational functions, what else can we parameterize this way? It turns out, for example, that elliptic curves do not have rational parameterizations. This question is its own niche of algebraic geometry. I don’t know the history here, but it’s plausible some of this math was motivated by wondering what else could be done along the lines of Weierstrass’ trick.

[1] When I’ve mentioned this trick before, some have told me this was a routine part of their course. That was not my experience, either as a student or as an instructor. As a freshman I learned the trick from a friend who was the best of our cohort at integration, someone who would have done well at an integration bee if we’d had one. I felt I’d been let in on a secret.

Numerical integral with a singularity

Richard Hamming [1] gives this nice example of an integral with a mild singularity:

\int_0^1 \log(\sin(x))\, dx

The integrand approaches −∞ as x approaches 0 and yet the integral is finite. If we try into numerically evaluate this integral, we will either get inaccurate results or we will have to go to a lot of work.

This post will show that with a clever reformulation of the problem we use simple methods to get accurate results, or use sophisticated methods with fewer function evaluations.

As I wrote about years ago, a common technique for dealing with singularities is to add and subtract a function with the same asymptotic behavior that can be integrated by hand. Hamming does a slight variation on this, multiplying and dividing by x inside the logarithm.

\begin{align*} \int_0^1 \log(\sin(x))\, dx &= \int_0^1 \log(x)\, dx + \int_0^1 \log\left(\frac{\sin(x)}{x}\right) \, dx \\ &=  -1 + \int_0^1 \log\left(\frac{\sin(x)}{x}\right) \, dx \end{align*}

Here we integrated the singular part, log(x), and we are left with numerically integrating a well-behaved function, one that is smooth and bounded on the interval of integration. Because sin(x) ≈ x for small x, we can define sin(x)/x to be 1 at 0 and have a smooth function.

We’ll use NumPy’s sinc function to handle sin(x)/x properly near 0. There are two conventions for defining the sinc function, either as sin(x)/x or as sin(πx)/πx. NumPy uses the latter convention, so we define our own sinc function as follows.

import numpy as np
def mysinc(x): return np.sinc(x/np.pi)

Trapezoid rule

Let’s use the simplest numerical method, the trapezoid rule.

We run into a problem immediately: if we chop up the interval [0, 1] in a way that puts an integration point at 0, our resulting integral will be infinite. We could replace 0 with some ε > 0, but if we do, we should try a few different values of ε to see whether our choice of ε greatly impacts our integral.

for ε in [1e-6, 1e-7, 1e-8]:
    xs = np.linspace(ε, 1, 100)
    integral = sum( np.log(np.sin(x)) for x in xs ) / 100

This gives us


suggesting that the integral does indeed depend on ε. We’ll see soon that our integral evaluates to around −1.05. So our results are not accurate, and the smaller ε is, the worse our answer is. [2]

Now let’s evaluate the integral as Hamming suggests. We’ll use a varying number of integration points so the difference between our integral estimates will give us some idea whether we’re converging.

for N in [25, 50, 100]:
    xs = np.linspace(0, 1, N)
    integral = sum( np.log(mysinc(xs)) )/N
    integral -= 1 # subtract the integral of log(x)

This gives us


The consistency between the result suggests the integral is around −1.057.

Adaptive integration

We said at the top of this article that Hamming’s reformulation would either let us get accurate results from simple methods or let us get by with fewer function evaluations using a sophisticated method. Now we’ll demonstrate the latter, using the adaptive integration algorithm quad from SciPy.

from scipy.integrate import quad

# Integrate log(sin(x)) from 0 to 1
integral, error = quad(lambda x: np.log(np.sin(x)), 0, 1)
print(integral, error) 

# Integrate log(x sin(x)/x) = log(x) + log(mysinc(x)) from 0 to 1
integral, error = quad(lambda x: np.log(mysinc(x)), 0, 1)
integral -= 1
print(integral, error)

This prints

-1.0567202059915843 1.7763568394002505e-15
-1.056720205991585 6.297207865333937e-16

suggesting that both approaches are working. Both estimate their error to be near machine precision. But as we’ll see, the direct approach uses about 10 times as many function evaluations.

Let’s ask quad to return more details by adding full_output = 1 as an argument.

integral, error, info = quad(lambda x: np.log(np.sin(x)), 0, 1, full_output=1)
print(integral, error, info["neval"])

integral, error, info = quad(lambda x: np.log(mysinc(x)), 0, 1, full_output=1)
integral -= 1
print(integral, error, info["neval"])

This shows that the first integration used 231 function evaluations, but the second used only 21.

The difference in efficiency doesn’t matter when you’re evaluating an integral one time, but if you were repeatedly evaluating similar integrals inside a loop, subtracting off the singularity could make your problem run 10 times faster. Simulations involving Bayesian statistics can have such integrations in the inner loop, and so making an integration an order of magnitude faster could make the overall simulation an order of magnitude faster, reducing CPU-days to CPU-hours.

Related posts

[1] Richard Hamming, Numerical Methods for Scientists and Engineers. Second edition. Dover.

[2] We can get better results if we let ε and 1/N go to zero at the same rate. The following code produces mediocre results, but better results than above.

for N in [10, 100, 1000]:
    xs = np.linspace(1/N, 1, N)
    integral = sum( np.log(np.sin(x)) for x in xs ) / N

This prints


which at least seems to be getting closer to the correct result, but has bad accuracy for so many function evaluations.

Regular solids and Monte Carlo integration

Monte Carlo integration is not as simple in practice as it is often introduced. A homework problem might as you to integrate a function of two variables by selecting random points from a cube and counting how many of the points fall below the graph of the function. This would indeed give you an estimate of the volume bounded by the surface and hence the value of the integral.

But Monte Carlo integration is most often used in high dimensions, and the region of interest takes up a tiny proportion of a bounding box. In practice you’d rarely sample uniformly from a high-dimensional box. This post will look at sampling points on a (possibly high-dimensional) sphere.

The rate of convergence of Monte Carlo integration depends on the variance of the samples, and so people look for ways to reduce variance. Antipodal sampling is one such approach. The idea is that a function on a sphere is likely to take on larger values on one side of the sphere and smaller values on the other. So for every point x where the function is sampled, it is also sampled at the diametrically opposite point −x on the assumption/hope that the values of the function at the two points are negatively correlated.

Antipodal sampling is a first step in the direction of a hybrid of regular and random sampling, sampling by random choices of regularly spaced points, such as antipodal points. When this works well, you get a sort of synergy, an integration method that converges faster than either purely systematic or purely random sampling.

If a little is good, then more is better, right? Not necessarily, but maybe, so it’s worth exploring. If I remember correctly, Alan Genz explored this. Instead of just taking antipodal points, you could sample at the points of a regular solid, like a tetrahedron. Randomly select and initial point, create a tetrahedron on the sphere with this as one of the vertices, and sample your integrand at each of the vertices. Or you could think of having a tetrahedron fixed in space and randomly rotating the sphere so that the sphere remains in contact with the vertices of the tetrahedron.

If you’re going to sample at the vertices of a regular solid, you’d like to know what regular solids are possible. In three dimensions, there are five: tetrahedron, hexahedron (cube), octahedron, dodecahedon, and icosahedron. Only the first three of these generalize to dimensions 5 and higher, so you only have three choices in high dimension if you want to sample at the vertices of a regular solid.

Here’s more about the cross polytope, the generalization of the octahedron.

If you want more regularly-spaced points on a sphere than regular solids will permit, you could compromise and use points whose spacing is approximately regular, such as the Fibonacci lattice. You could randomly rotate your Fibonacci lattice to create a randomized quasi-Monte Carlo (RQMC) method.

You have a knob you can turn determining the amount of regularity and the amount of randomness. At one extreme is purely random sampling. As you turn the knob you go from antipodes to tetrahedra and up to cross polytopes. Then there’s a warning line, but you can keep going with things like the Fibonacci lattice, introducing some distortion, sorta like turning the volume of a speaker up past the point of distortion.

In my experience, I’ve gotten the best results near the random sampling end of the spectrum. Antipodal sampling sped up convergence, but other methods not so much. But your mileage may vary; the results depend on the kind of function you’re integrating.

Quadrature rules and an impossibility theorem

Many numerical integration formulas over a finite interval have the form

 \int_a^b f(x)\, dx = \sum_{i=1}^n w_i\, f(x_i) + C f^{(k)}(\xi)

That is, the integral on the left can be approximated by evaluating the integrand f at particular nodes and taking the weighted sum, and the error is some multiple of a derivative of f evaluated at a point in the interval [a, b].

This post will give several examples, showing how they all fit into to framework above, then discuss the impossibility of extending this framework to infinite integrals.

Simpson’s 3/8 rule

Simpson’s rule says

\int_{x_0}^{x_3} f(x) \, dx = \frac{3h}{8} \Big(f(x_0) + 3f(x_1) + 3f(x_2) + f(x_3) \Big)- \frac{3f^{(4)}(\xi) h^5}{80}

In this case the x‘s are evenly spaced a distance of h apart. The weights are 3h/8, 9h/8, 9h/8, and 3h/8.

We have n = 4, k = 4, and C = −3h5/80.

We don’t know a priori what the value of ξ will be, only that there will be some value of ξ between x0 and x3 that makes the equation hold. If we have an explicit bound on the 4th derivative of f then we have an explicit bound on the error in approximating the integral by the weighted sum.

Bode’s rules

A sequence of quadrature rules go by the name “Bode’s rule.” Here is one example, also known as Weddle’s rule.

\begin{align*} \int_{x_0}^{x_6} f(x)\, dx = \frac{h}{140}&\Big(41 f(x_0) + 216f(x_1) + 27 f(x_2) +272 f(x_3) \\ &+ 27 f(x_4) + 216 f(x_5) + 41 f(x_6) \Big) \\ &-\frac{9 f^{(8)}(\xi) h^9}{1400} \end{align}

As with Simpson’s 3/8 rule, you could map the formula for Bode’s rule(s) to the template at the top of the post.

Gauss quadrature

Gauss’ formula says

\int_{-1}^1 f(x)\, dx = \sum_{i=1}^n w_i\, f(x_i) + C f^{(2n)}(\xi)

for some ξ in [−1, 1]. Here the limits of integration are fixed, though you could use a change of variables to integrals over other finite integrals into this form.

Unlike Bode’s rule and Simpson’s rule, the x‘s are not evenly spaced but are the zeros of Pn, the Legendre polynomial of degree n. The weights are related to the x‘s and the derivative Pn evaluated at the x‘s. The constant C is a complicated function of n but is independent of f.

Note that the error term involves the (2n)th derivative of f. This explains why Gaussian integration can be more accurate than other methods using the same number of function evaluations. The non-uniform spacing of the integration nodes enables higher-order error terms.

Non-existence theorem

Although many integration rules over a finite interval have the form

 \int_a^b f(x)\, dx = \sum_{i=1}^n w_i\, f(x_i) + C f^{(k)}(\xi)

Davis and Rabinowitz [1] proved that there cannot be an integration rule of the form

\int_0^\infty f(x)\, dx = \sum_{i=1}^n w_i\, f(x_i) + C f^{(k)}(\xi)

The proof, given in [1], takes only about one page. The entire article is a little more than a page, and about half the article is preamble to the proof.

Related posts

[1] P. J. Davis and P. Rabinowitz. On the Nonexistence of Simplex Integration Rules for Infinite Integrals. Mathematics of Computation, Vol. 26, No. 119 (July, 1972), pp. 687–688

Pushing numerical integration software to its limits

The previous post discussed the functions

f_n(x) = \sum_{k=1}^n \frac{|\sin( k\pi x )|}{k}

as test cases for plotting. This post will look at using the same functions as test cases for integration. As you can see from the plot of f30(x) below, the function is continuous, but the derivative of the function has a lot of discontinuities.


The integrals of Steinerberger’s functions can be computed easily in closed form. Each term in the sum integrates to 2/πk and so

\int_0^1 f_n(x)\, dx = \frac{2}{\pi} \sum_{k=1}^n \frac{1}{k} = 2 H_n / \pi

where Hn is the nth harmonic number.

Symbolic integration

When I tried to get Mathematica to compute the integral above for general n it got stuck.

When I tried the specific case of 10 with

    Integrate[f[x, 10], {x, 0, 1}]

it churned for a while, then returned a complicated expression involving nested radicals:

    (1/(79380 \[Pi]))(465003 + 400 Sqrt[2 (48425 - 13051 Sqrt[5])] -
    35600 Sqrt[2 (5 - Sqrt[5])] + 42800 Sqrt[2 (5 + Sqrt[5])] -
    400 Sqrt[2 (48425 + 13051 Sqrt[5])])

Mathematica could not simplify the result to

    HarmonicNumber[10] 2/Pi]

but it could confirm that the two expressions are numerically equal within the limitations of floating point precision.

Assuming the two expressions for the integral are equal, it would be interesting to see why this is so. I don’t see off hand where the radicals above come from.

Numerical integration

When I asked Mathematica to compute the integral above numerically with

    NIntegrate[f[x, 10], {x, 0, 1}]

it produced an answer which correct to four decimal places and a warning

NIntegrate failed to converge to prescribed accuracy after 9 recursive bisections in x near {x} = {0.667953}. NIntegrate obtained 1.8646445043323603` and 2.908590085997311`*^-6 for the integral and error estimates.

Even though NIntegrate didn’t achieve full precision, it warned that this was the case. It produced a decent answer and a fairly good estimate of its error, about half of the actual error.

I tried numerically integrating the same function with Python.

    from scipy.integrate import quad
    quad(lambda x: f(x, 10), 0, 1)

Python’s quad function did about the same as Mathematica’s NIntegrate. It produced the following warning:

IntegrationWarning: The maximum number of subdivisions (50) has been achieved. If increasing the limit yields no improvement it is advised to analyze the integrand in order to determine the difficulties. If the position of a local difficulty can be determined (singularity, discontinuity) one will probably gain from splitting up the interval and calling the integrator on the subranges. Perhaps a special-purpose integrator should be used. (1.8646395747766926, 0.0024098799348653954)

This error message is more helpful, and the error estimate 0.0024 is an upper bound on the error.

Increasing the number of subdivisions by a factor of 10 with

    quad(lambda x: f(x, 10), 0, 1, limit=500)

eliminated the warning and made the integration much more accurate.

Related posts

Expected distance between points in a cube

random samples from a cube

Suppose you randomly sample points from a unit cube. How far apart are pairs of points on average?

My intuition would say that the expected distance either has a simple closed form or no closed form at all. To my surprise, the result is somewhere in between: a complicated closed form.

Computing the expected value by simulation is straight forward. I’ll make the code a little less straight forward but more interesting and more compact by leveraging a couple features of NumPy.

    import numpy as np

    dimension = 3
    reps = 10000

    cube1 = np.random.random((reps, dimension))
    cube2 = np.random.random((reps, dimension))
    distances = np.linalg.norm(cube1 - cube2, axis=1)

This gives a value of 0.6629. This value is probably correct to two decimal places since the Central Limit Theorem says our error should be on the order of one over the square root of the number of reps.

In fact, the expected value given here is

\frac{4}{105} + \frac{17}{105}\sqrt{2} - \frac{2}{35}\sqrt{3} + \frac{1}{5}\log(1 + \sqrt{2}) + \frac{2}{5}\log(2 + \sqrt{3}) - \frac{1}{15}\pi

which evaluates numerically to 0.661707….

The expression for the expected value is simpler in lower dimensions; it’s just 1/3 in one dimension. In higher dimensions there is no closed form in terms of elementary functions and integers.

Incidentally, the image above was made using the following Mathematica code.

    pts = RandomPoint[Cube[], 10^3];
    Graphics3D[{PointSize[Small], Point[pts], Opacity[0.3]}, Boxed -> True]

Golden integration

Let φ be the golden ratio. The fractional parts of nφ bounce around in the unit interval in a sort of random way. Technically, the sequence is quasi-random.

Quasi-random sequences are like random sequences but better in the sense that they explore a space more efficiently than random sequences. For this reason, Monte Carlo integration (“integration by darts“) can often be made more efficient by replacing random sequences with quasi-random sequence. This post will illustrate this efficiency advantage in one dimension using the fractional parts of nφ.

Here are functions that will generate our integration points.

    from numpy import random, zeros

    def golden_source(n):
        phi = (1 + 5**0.5)/2
        return (phi*n)%1

    def random_source(N):
        return random.random()

We will pass both of these generators as arguments to the following function which saves a running average of function evaluates at the generated points.

    def integrator(f, source, N):
        runs = zeros(N)
        runs[0] = f(source(0))
        for n in range(1, N):
            runs[n] = runs[n-1] + (f(source(n)) - runs[n-1])/n
        return runs

We’ll use as our example integrand f(x) = x² (1 − x)³. The integral of this function over the unit interval is 1/60.

    def f(x):
        return x**2 * (1-x)**3
    exact = 1/60

Now we call our integrator.

    N = 1000
    golden_run = integrator(f, golden_source, N)
    random_run = integrator(f, random_source, N)

Now we plot the difference between each run and the exact value of the integral. Both methods start out with wild fluctuations. We leave out the first 10 elements in order to make the error easier to see.

    import matplotlib.pyplot as plt

    k = 10
    x = range(N)
    plt.plot(x[k:], golden_run[k:] - exact)
    plt.plot(x[k:], random_run[k:] - exact)

This produces the following plot.

The integration error using φn − ⌊φn⌋ is so close to zero that it’s hard to observe its progress. So we plot it again, this time taking the absolute value of the integration error and plotting on a log scale.

    plt.plot(x[k:], abs(golden_run[k:] - exact))
    plt.plot(x[k:], abs(random_run[k:] - exact))

This produces a more informative plot.

The integration error for the golden sequence is at least an order of magnitude smaller, and often a few orders of magnitude smaller.

The function we’ve integrated has a couple features that make integration using quasi-random sequences (QMC, quasi-Monte Carlo) more efficient. First, it’s smooth. If the integrand is jagged, QMC has no advantage over MC. Second, our integrand could be extended smoothly to a periodic function, i.e. f(0) = f(1) and f′(0) = f′(1). This makes QMC integration even more efficient.

Related posts