Introduction to MCMC

Markov Chain Monte Carlo (MCMC) is a technique for getting your work done when Monte Carlo won’t work.

The problem is finding the expected value of f(X) where X is some random variable. If you can draw independent samples xi from X, the solution is simple:

E[ f(X) ] \approx \frac{1}{n} \sum_{i=1}^n f(x_i)

When it’s possible to draw these independent samples, the sum above is well understood. It’s easy to estimate the error after n samples, or to turn it the other way around, estimate what size n you need so that the error is probably below a desired threshold.

MCMC is a way of making the approximation above work even though it’s not practical to draw independent random samples from X. In Bayesian statistics, X is the posterior distribution of your parameters and you want to find the expected value of some function of these parameters. The problem is that this posterior distribution is typically complicated, high-dimensional, and unique to your problem (unless you have a simple conjugate model). You don’t know how to draw independent samples from X, but there are standard ways to construct a Markov chain whose samples make the approximation above work. The samples are not independent but in the limit the set of samples has the same distribution as X.

MCMC is either simple or mysterious, depending on your perspective. It’s simple in that writing code that should work, eventually. Writing efficient code is another matter. And above all, knowing when your answer is good enough is tricky. If the samples were independent, the Central Limit Theorem would tell you how the sum should behave. But since the samples are dependent, all bets are off. Almost all we know is that the average on the right side converges to the expectation on the left side. Aside from toy problems there’s very little theory to tell you when your average is sufficiently close to what you want to compute.

Because there’s not much theory, there’s a lot of superstition and folklore. As with other areas, there’s truth in MCMC superstition and folklore, but also some error and nonsense.

In some ways MCMC is truly marvelous. It’s disappointing that there isn’t more solid theory around convergence, but it lets you take a stab at problems that would be utterly intractable otherwise. In theory you can’t know when it’s time to stop, and in practice you can fool yourself into thinking you’ve seen convergence when you haven’t, such as when you have a bimodal distribution and you’ve only been sampling from one mode. But it’s also common in practice to have some confidence a calculation has converged because the results are qualitatively correct: this value should be approximately this other value, this should be less than that, etc.

Bayesian statistics is older than Frequentist statistics, but it didn’t take off until statisticians discovered MCMC in the 1980s, after physicists discovered it in the 1950s. Before that time, people might dismiss Bayesian statistics as being interesting but impossible to compute. But thanks to Markov Chain Monte Carlo, and Moore’s law, many problems are numerically tractable that were not before.

Since MCMC gives a universal approach for solving certain kinds of problems, some people equate the algorithm with the problem. That is, they see MCMC as the solution rather than an algorithm to compute the solution. They forget what they were ultimately after. It’s sometimes possible to compute posterior probabilities more quickly and more accurately by other means, such as numerical integration.

Click to find out more about consulting for statistical computing


High-dimensional integration

Numerically integrating functions of many variables almost always requires Monte Carlo integration or some variation. Numerical analysis textbooks often emphasize one-dimensional integration and, almost as a footnote, say that you can use a product scheme to bootstrap one-dimensional methods to higher dimensions. True, but impractical.

Traditional numerical integration routines work well in low dimensions. (“Low” depends on context, but let’s say less than 10 dimensions.) When you have the choice of a well-understood, deterministic method, and it works well, by all means use it. I’ve seen people who are so used to problems requiring Monte Carlo methods that they use these methods on low-dimensional problems where they’re not the most efficient (or robust) alternative.

Monte Carlo

Except for very special integrands, high dimensional integration requires either Monte Carlo integration or some variation such as quasi-Monte Carlo methods.

As I wrote about here, naive Monte Carlo integration isn’t practical for high dimensions. It’s unlikely that any of your integration points will fall in the region of interest without some sort of importance sampler. Constructing a good importance sampler requires some understanding of your particular problem. (Sorry. No one-size-fits-all black-box methods are available.)

Quasi-Monte Carlo (QMC)

Quasi-Monte Carlo methods (QMC) are interesting because they rely on something like a random sequence, but “more random” in a sense, i.e. more effective at exploring a high-dimensional space. These methods work well when an integrand has low effective dimension. The effective dimension is low when nearly all the action takes place on a lower dimensional manifold. For example, a function may ostensibly depend on 10,000 variables, while for practical purposes 12 of these are by far the most important. There are some papers by Art Owen that say, roughly, that Quasi-Monte Carlo methods work well if and only if the effective dimension is much smaller than the nominal dimension. (QMC methods are especially popular in financial applications.)

While QMC methods can be more efficient than Monte Carlo methods, it’s hard to get bounds on the integration error with these methods. Hybrid approaches, such as randomized QMC can perform remarkably well and give theoretical error bounds.

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) is a common variation on Monte Carlo integration that uses dependent random samples. In many applications, such as Bayesian statistics, you’d like to be able to create independent random samples from some probability distribution, but this is not practical. MCMC is a compromise. It produces a Markov chain of dependent samples, and asymptotically these samples have the desired distribution. The dependent nature of the samples makes it difficult to estimate error and to determine how well the integration estimates using the Markov chain have converged.

(Some people speak of the Markov chain itself converging. It doesn’t. The estimates using the Markov chain may converge. Speaking about the chain converging may just be using informal language, or it may betray a lack of understanding.)

There is little theory regarding the convergence of estimates from MCMC, aside from toy problems. An estimate can appear to have converged, while an important region of the integration problem remains unexplored. Despite its drawbacks, MCMC is the only game in town for many applications.

Click to learn more about numerical integration consulting

Integration by Darts

Monte Carlo integration has been called “Integration by Darts,” a clever pun on “integration by parts.” I ran across the phrase looking at some slides by Brian Hayes, but apparently it’s been around a while. The explanation that Monte Carlo is “integration by darts” is fine as a 0th order explanation, but it can be misleading.

Introductory courses explain Monte Carlo integration as follows.

  1. Plot the function you want to integrate.
  2. Draw a box that contains the graph.
  3. Throw darts (random points) at the box.
  4. Count the proportion of darts that land between the graph and the horizontal axis.
  5. Estimate the area under the graph by multiplying the area of the box by the proportion above.

In principle this is correct, but this is far from how Monte Carlo integration is usually done in practice.

For one thing, Monte Carlo integration is seldom used to integrate functions of one variable. Instead, it is mostly used on functions of many variables, maybe hundreds or thousands of variables. This is because more efficient methods exist for low-dimensional integrals, but very high dimensional integrals can usually only be computed using Monte Carlo or some variation like quasi-Monte Carlo.

If you draw a box around your integrand, especially in high dimensions, it may be that nearly all your darts fall outside the region you’re interested in. For example, suppose you throw a billion darts and none land inside the volume determined by your integration problem. Then the point estimate for your integral is 0. Assuming the true value of the integral is positive, the relative error in your estimate is 100%. You’ll need a lot more than a billion darts to get an accurate estimate. But is this example realistic? Absolutely. Nearly all the volume of a high-dimensional cube is in the “corners” and so putting a box around your integrand is naive. (I’ll elaborate on this below. [1])

So how do you implement Monte Carlo integration in practice? The next step up in sophistication is to use “importance sampling.” [2] Conceptually you’re still throwing darts at a box, but not with a uniform distribution. You find a probability distribution that approximately matches your integrand, and throw darts according to that distribution. The better the fit, the more efficient the importance sampler. You could think of naive importance sampling as using a uniform distribution as the importance sampler. It’s usually not hard to find an importance sampler much better than that. The importance sampler is so named because it concentrates more samples in the important regions.

Importance sampling isn’t the last word in Monte Carlo integration, but it’s a huge improvement over naive Monte Carlo.

Click to learn more about numerical integration consulting


[1] So what does it mean to say most of the volume of a high-dimensional cube is in the corners? Suppose you have an n-dimensional cube that runs from -1 to 1 in each dimension and you have a ball of radius 1 inside the cube. To make the example a little simpler, assume n is even, n = 2k. Then the volume of the cube is 4k and the volume of the sphere is πk / k!. If k = 1 (n = 2) then the sphere (circle in this case) takes up π/4 of the volume (area), about 79%. But when k = 100 (n = 200), the ball takes up 3.46×10-169 of the volume of the cube. You could never generate enough random samples from the cube to ever hope to land a single point inside the ball.

[2] In a nutshell, importance sampling replaces the problem of integrating f(x) with that of integrating (f(x) / g(x)) g(x) where g(x) is the importance sampler, a probability density. Then the integral of (f(x) / g(x)) g(x) is the expected value of (f(X) / g(X)) where X is a random variable with density given by the importance sampler. It’s often a good idea to use an importance sampler with slightly heavier tails than the original integrand.

Integration trick

Here’s a clever example from Paul Nahin’s new book Inside Interesting Integrals. Suppose you want to evaluate

\int_{-1}^1 \frac{\cos(x)}{\exp(1/x) + 1}\,dx

Since the range of integration is symmetric around zero, you might think to see whether the integrand is an odd function, in which case the integral would be zero. (More on such symmetry tricks here.) Unfortunately, the integrand is not odd, so that trick doesn’t work directly. However, it does help indirectly.

You can split any function f(x) into its even and odd parts.

f_e(x) = \frac{f(x) + f(-x)}{2} \\ f_o(x) = \frac{f(x) - f(-x)}{2}

The integral of a function over a symmetric interval is the integral of its even part because its odd part integrates to zero. The even part of the integrand above works out to be simply cos(x)/2 and so the integral evaluates to sin(1).

Click to learn more about numerical integration consulting


How Lp spaces nest

A function f(x) is said to be in Lp if the integral of |f(x)|p is finite. So if you know f is in Lp for some value of p, can conclude that f is also in Lq for some q? In general, no. But for two important special cases, yes. The special cases have to do with the domain of f.

If f is a function on a finite measure space and p > q, then every function in Lp is in Lq. In particular probability spaces are finite measure spaces, and so if a random variable’s pth moment exists, the qth moment exists.

If the domain of f is the integers with counting measure, f is a sequence and the “integral” of |f(x)|p is simply a sum. Now the relationship between Lp spaces is reversed: p > q, every sequence in Lq is in Lp.

There are two reasons a function could fail to be in Lp: singularities and fat tails. In a finite measure space, there are no tails; you only have to worry about singularities. On the integers, there are no singularities; you only have to worry about tails.

Now consider functions on the whole real line. Membership in Lp says nothing about membership in Lq because now you can have singularities and fat tails.

Avoiding underflow in Bayesian computations

Here’s a common problem that arises in Bayesian computation. Everything works just fine until you have more data than you’ve seen before. Then suddenly you start getting infinite, NaN, or otherwise strange results. This post explains what might be wrong and how to fix it.

A posterior density is (proportional to) a likelihood function times a prior distribution. The likelihood function is a product. The number of data points is the number of terms in the product. If these numbers are less than 1, and you multiply enough of them together, the result will be too small to represent in a floating point number and your calculation will underflow to zero. Then subsequent operations with this number, such as dividing it by another number that has also underflowed to zero, may produce an infinite or NaN result.

The instinctive reaction regarding underflow or overflow is to use logs. And often that works. If you wanted to know where the maximum of the posterior density occurs, you could find the maximum of the logarithm of the posterior density. But in Bayesian computations you often need to integrate the posterior density times some functions. Now you cannot just work with logs because logs and integrals don’t commute.

One way around the problem is to multiply the integrand by a constant so large that there is no danger of underflow. Multiplying by constants does commute with integration.

So suppose your integrand is on the order of 10^-400, far below the smallest representable number. Do you need extended precision arithmetic? No, you just need to understand your problem.

If you multiply your integrand by 10^400 before integrating, then your integrand is roughly 1 in magnitude. Then do you integration, and remember that the result is 10^400 times the actual value.

You could take the following steps.

  1. Find m, the maximum of the log of the integrand.
  2. Let I be the integral of exp( log of the integrand – m ).
  3. Keep track that your actual integral is exp(m) I, or that its log is m + log I.

Note that you can be extremely sloppy in this process. You don’t need an accurate estimate of the maximum of the integrand per se. If you’re within a few dozen orders of magnitude, for example, that could be sufficient to carry out your integration without underflow.

One way to estimate the maximum is to use a frequentist estimator of your parameters as an approximate MLE, and assume that this is approximately where your posterior density takes on its maximum value. This might actually be very accurate, but it doesn’t need to be.

Note also that it’s OK if some of the evaluations of your integrand underflow to zero. You just don’t want the entire integral to underflow to zero. Places where the integrand is many orders of magnitude less than its maximum don’t contribute to the integration anyway. (It’s important that your integration pays attention to the region where the integrand is largest. A naive integration could entirely miss the most important region and completely underestimate the integral. But that’s a matter for another blog post.)

Click to find out more about consulting for statistical computing


Fractional integration

Define the integration operator I by

(I,f)(x) = int_a^x f(t), dt

so I f is an antiderivative of f.

Define the second antiderivative I2 by applying I to f twice:

(I_2, f)(x) = int_a^x int_a^t f(t_1), dt_1, dt

It turns out that

(I_2, f)(x) = int_a^x (x - t) , f(t), dt

To see this, notice that both expressions for I2 are equal when x = a, and they have the same derivative, so they must be equal everywhere.

Now define I3 by applying I to I2 and so forth defining In for all positive integers n. Then one can show that the nth antiderivative is given by

(I_n, f)(x) = frac{1}{Gamma(n)} int_a^x (x - t)^{n-1} , f(t), dt

The right side of this equation makes sense for non-integer values of n, and so we can take it as a definition of fractional integrals when n is not an integer.

Related post: Fractional derivatives

For daily posts on analysis, follow @AnalysisFact on Twitter.

AnalysisFact twitter icon

Numerical integration trick

Suppose you want to evaluate the following integral:

int_{30}^infty \exp\left( -frac{100}{x+6} \right) , frac{1}{x^3} , \dx

We’d like to do a change of variables to make the range of integration finite, and we’d like the transformed integral to be easy to evaluate numerically.

The change of variables t = 1/x2 transforms the range of integration from (30, ∞) to (0, 1/900). I’ll explain where this choice came from and why it results in a nice function to integrate.

The integrand is essentially 1/x3 for large x because the exponential term approaches 1. If the integrand were exactly 1/x3 then the change of variables t = 1/x2 would make the integrand constant. So we’d expect this change of variables to give us an integrand that behaves well. In particular, the integral should be bounded at 0.

(If you’re not careful, a change of variables may just swap one kind of improper integral for another. That is, you can make the domain of integration finite, but at the cost of introducing a singularity in the transformed integrand. However, we don’t have that problem here because our change of variables was just right.)

In general, if you need to integrate a function that behaves like 1/xn as x goes to infinity, try the change of variables t = 1/xn-1.

Click to learn more about numerical integration consulting


Related post: The care and treatment of singularities

SciPy integration misunderstanding

Today I needed to compute an integral similar to this:

int_{1000}^infty frac{dx}{100, x^3}

I used the following SciPy code to compute the integral:

from scipy.integrate import quad

def f(x):
    return 0.01*x**-3

integral, error = quad(f, 1000, sp.inf, epsrel = 1e-6)
print integral, error

My intention was to compute the integral to 6 significant figures. (epsrel is a shortened form of epsilon relative, i.e. relative error.) To my surprise, the estimated error was larger than the value of the integral. Specifically, the integral was computed as 5.15 × 10-9 and the error estimate was 9.07 × 10-9.

What went wrong? The integration routine quad lets you specify either a desired bound on your absolute error (epsabs) or a desired bound on your relative error (epsrel). I assumed that since I specified the relative error, the integration would stop when the relative error requirement was met. But that’s not how it works.

The quad function has default values for both epsabs and epsrel.

def quad(... epsabs=1.49e-8, epsrel=1.49e-8, ...):

I thought that since I did not specify an absolute error bound, the bound was not effective, or equivalently, that the absolute error target was 0. But no! It was as if I’d set the absolute error bound to 1.49 × 10-8. Because my integral is small (the exact value is 5 × 10-9) the absolute error requirement is satisfied before the relative error requirement and so the integration stops too soon.

The solution is to specify an absolute error target of zero. This condition cannot be satisfied, and so the relative error target will determine when the integration stops.

integral, error = quad(f, 1000, sp.inf, epsrel = 1e-6, epsabs = 0)

This correctly computes the integral as 5 × 10-9 and estimates the integration error as 4 ×10-18.

It makes some sense for quad to specify non-zero default values for both absolute and relative error, though I imagine most users expect small relative error rather than small absolute error, so perhaps the latter could be set to 0 by default.

Care and treatment of singularities

My favorite numerical analysis book is Numerical Methods that Work. In the hardcover version of the book, the title was impressed in the cover and colored in silver letters. Before the last word of the title, there was another word impressed but not colored in. You had to look at the cover carefully to see that it actually says “Numerical methods that usually work.”

First published in 1970, the book’s specific references to computing power are amusingly dated. But book’s advice is timeless.

The chapter entitled “The care and treatment of singularities” gives several approaches for integrating functions that have a singularity. This post will elaborate on the simplest example from that chapter.

Suppose you want to compute the following integral.

int_0^1 frac{dx}{sqrt{sin(x)}}

The integrand is infinite at 0. There are a couple common hacks to get around this problem, and neither works well. One is simply to re-define the integrand so that it has some finite value at 0. That keeps the integration program from complaining about a division by zero, but it doesn’t help evaluate the integral accurately.

Another hack is to change the lower limit of integration from 0 to something a little larger. OK, but how much larger? And even if you replace 0 with something small, you still have the problem that your integrand is nearly singular near 0. Simple integration routines assume integrands are polynomial-like, and this integrand is not polynomial-like near zero.

The way to compute this integral is to subtract off the singularity. Since sin(x) is asymptotically equal to x as x goes to 0, √sin(x) is asymptotically √x. So if we subtract 1/√x, we’re left with a bounded integrand, and one that is considerably more polynomial-like than the one we started with. We then integrate by hand what we subtracted off. That is, we replace our original integral with the following pair of integrals.

int_0^1 left( frac{1}{sqrt{sin(x)}} - frac{1}{sqrt{x}} right) ,dx + int_0^1 frac{dx}{sqrt{x}}

Compute the first numerically and the second analytically. Simpson’s rule with intervals of size 1/64 gives 0.03480535 for the first integral, which is correct to seven decimal places. The second integral is simply 2, so the total is 2.03480535.

Now lets look back at how our two hacks would have worked. Suppose we define our integrand f(x) to be 0 at x = 0. Then applying Simpson’s rule with step size 1/64 would give an integral of 3.8775, which is completely wrong. Bringing the step size down to 2^-20 makes things worse: the integral would then be 4.036. What if instead of defining the integrand to be 0 at 0, we define it to be some large value, say 10^6? In that case Simpson’s rule with step size 1/64 returns 5212.2 as the integral. Lowering the step size to 2^-20 improves the integral to 4.351, but still every figure is wrong since the integral is approximately 2.

What if we’d replaced 0 with 0.000001 and used Simpson’s rule on the original integral? That way we’d avoid the problem of having to define the integrand at 0. Then we’d have two sources of error. First the error from not integrating from 0 to 0.000001. That error is 0.002, which is perhaps larger than you’d expect: the change to the integral is 2000x larger than the change to the lower limit of integration. But this problem is insignificant compared to the next.

The more serious problem is applying Simpson’s rule to the integral even if we avoid 0 as the lower limit of integration. We still have a vertical asymptote, even if our limits of integration avoid the very singularity, and no polynomial ever had a vertical asymptote. With such decidedly non-polynomial behavior, we’d expect Simpsons rule and its ilk to perform poorly.

Indeed that’s what happens. With step size 1/64, Simpson’s rule gives a value of 9.086 for the integral, which is completely wrong since we know the integral is roughly 2.0348. Even with step size 2^-20, i.e. over a million function evaluations, Simpson’s rule gives 4.0328 for the integral.

Incidentally, if we had simply approximated our integrand 1/√sin(x) with 1/√x, we would have estimated the integral as 2.0 and been correct to two significant figures, while Simpson’s rule applied naively gets no correct figures. On the other hand, Simpson’s rule applied cleverly (i.e. by subtracting off the singularity) gives eight correct figures with little effort. As I concluded here, a crude method cleverly applied beats a clever technique crudely applied.

Click to learn more about numerical integration consulting


Related posts:

Saved by symmetry

When I solve a problem by appealing to symmetry, students’ jaws drop. They look at me as if I’d pulled a rabbit out of a hat.

I used think of these tricks as common knowledge, but now I think they’re common knowledge in some circles (e.g. physics) and not as common in others. These tricks are simple, but not as many people as I’d thought have been trained to spot opportunities to apply them.

Here’s an example.

Pop quiz 1: Evaluate the following intimidating integral.

int_{-infty}^infty x log(1 + x^2) e^{-x^2}, dx

Solution: Zero, by symmetry, because the integrand is odd.

The integrand is an odd function (i.e. f(-x) = –f(x)), and the integrand of an odd function over a symmetric interval is zero. This is because the region below the x-axis is symmetric to the region above the x-axis as the following graph shows.

plot of x log(1 + x^2) exp(-x^2)

The example above is not artificial: similar calculations come up constantly in applications. For example, the Fourier series of an odd function contains only sine terms, no cosine terms. Why? The integrals to compute the Fourier coefficients for the cosine terms all involve odd functions integrated over an interval symmetric about zero.

Another common application of symmetry is evaluating derivatives. The derivative of an odd function is an even function and vice versa.

Pop quiz 2: What is the coefficient of x5 in the Taylor series of cos(1 + x2)?

Solution: Zero, by symmetry, because cos(1 + x2) is an even function.

Odd functions of x have only odd powers of x in their Taylor series and even functions have only even powers of x in their Taylor series. Why? Because the coefficients come from derivatives evaluated at zero.

If f(x) is an odd function, all of its even derivatives are odd functions. These derivatives are zero at x = 0, and so all the coefficients of even powers of x are zero. A similar argument shows that even functions have only even powers of x in their Taylor series.

Symmetry tricks are obvious in hindsight. The hard part is learning to recognize when they apply. Symmetries are harder to recognize, but also more valuable, in complex situations. The key is to think about the problem you’re trying to solve before you dive into heads-down calculation.

Click to learn more about numerical integration consulting


Related posts:

Three surprises with the trapezoid rule

The trapezoid rule is a very simple method for estimating integrals. The idea is to approximate the area under a curve by a bunch of thin trapezoids and add up the areas of the trapezoids as suggested in the image below.

This is an old idea, probably older than the formal definition of an integral. In general it gives a crude estimation of the integral. If the width of the trapezoids is h, the error in using the trapezoid rule is roughly proportional to h2. It’s easier to do better. For example, Simpson’s rule is a minor variation on the trapezoid rule that has error proportional to h5.

So if the trapezoid rule is old and inaccurate, why does anyone care about it? Here are the surprises.

  1. You can still get a publication out of the trapezoid rule! In 1994, a doctor published a paper reinventing the trapezoid rule. Not only did the editors not recognize this ancient algorithm, the paper has been cited many times since it was published. (Update: more about the trapezoid paper here.)
  2. Although the trapezoid rule is inefficient in general, it can be shockingly efficient for periodic functions.
  3. The trapezoid rule can also be shockingly efficient for analytic functions that go to zero quickly, so called double exponential functions.

The last two observations are more widely applicable than you might think at first. What if you want to integrate something that isn’t periodic and isn’t a double exponential function? You may be able to do a change of variables that makes your integrand have one of these special forms. The article Fast Numerical Integration explains an integration method based on double exponential functions and includes C++ source code.

The potential efficiency of the trapezoid rule illustrates a general principle: a crude method cleverly applied can beat a clever technique crudely applied. The simplest numerical integration technique, one commonly taught in freshman calculus, can be extraordinarily efficient when applied with skill to the right problem. Conversely, a more sophisticated integration technique such as Gauss quadrature can fail miserably when naively applied.

Click to learn more about numerical integration consulting


Related posts:

Optical illusion, mathematical illusion

Someone sent me a link to an optical illusion while I was working on a math problem. The two things turned out to be related.

In the image below, what look like blues spiral and green spirals are actually exactly the same color. The spiral that looks blue is actually green inside, but the magenta stripes across it make the green look blue. I know you don’t believe me; I didn’t believe it either. See this blog post for an explanation, including a magnified close-up of the image. Or open it in an image editor and use a color selector to see for yourself.

My math problem was also a case of two things that look different even though they are not. Maybe you can think back to a time as a student when you knew your answer was correct even though it didn’t match the answer in the back of the book. The two answers were equivalent but written differently. In an algebra class you might answer 5 / √ 3 when the book has 5 √ 3 / 3. In trig class you might answer 1 – cos2x when the book has sin2x. In a differential equations class, equivalent answers may look very different since arbitrary constants can obfuscate differences.

In my particular problem, I was looking at weights for Gauss-Hermite integration. I was trying to reconcile two different expressions for the weights, one in some software I’d written years ago and one given in A&S. I thought I’d found a bug, at least in my comments if not in my software. My confusion was analogous to not recognizing a trig identity.  I wish I could say that the optical illusion link made me think that the two expressions may be the same and they just look different because of a mathematical illusion. That would make a good story. Instead, I discovered the equivalence of the two expressions by brute force, having Mathematica print out the values so I could compare them. Only later did I see the analogy between my problem and the optical illusion.

In case you’re interested in the details, my problem boiled down to the equivalence between Hn+1(xi)2 and 4n2Hn-1(xi)2 where Hn(x) is the nth Hermite polynomial and xi is the ith root of Hn. Here’s why these are the same. The Hermite polynomials satisfy a recurrence relation Hn+1(x) = 2x Hn(x) – 2n Hn-1(x) for all x. Since Hn(xi) = 0, Hn+1(xi) = -2nHn-1(xi). Now square both sides.

Related post: Orthogonal polynomials

Quasi-random sequences in art and integration

Sometimes when people say they want random points, that’s not what they really want. Random points clump more than most people expect. Quasi-random sequences are not random in any mathematical sense, but they might match popular expectations of randomness better than the real thing, especially for aesthetic applications. If by “random” someone means “scattered around without a clear pattern and not clumped together” then quasi-random sequences might do the trick.

Here are the first 50 points in a quasi-random sequence of points in the unit square.

quasi-random points

By contrast, here are 50 points in a unit square whose coordinates are uniform random samples.

random points

The truly random points clump together. Notice the cluster of three points in the top right corner. There are few other instances of pairs of points being very close together. Also, there are fairly large areas that don’t contain any random points. The quasi-random points by contrast are better spread out. They have a self-avoiding property that keeps them from clustering, and they fill the space more efficiently.

Quasi-random sequences could be confused with pseudo-random sequences. They’re not at all the same thing. Pseudo-random sequences are computer-generated sequences that in many ways behave as if they were truly random, even though they were produced by deterministic algorithms. For many practical purposes, including sensitive statistical tests, pseudo-random sequences are simply random sequences. (The “truly” random points above were technically “pseudo-random” points.)

The quasi-random points above were part of a Sobol sequence, a common quasi-random sequence. Other quasi-random sequences include the Halton sequence and the Hammersley sequence. Mathematically, these sequences are defined has having low-discrepancy. Roughly speaking, this means the “discrepancy” between the number of points that actually fall in a volume and the number of points you’d expect to fall in the same volume is small. See the Wikipedia article on quasi-random sequences for more mathematical details.

Besides being aesthetically useful, quasi-random sequences are useful in applied mathematics. Because these sequences explore a space more efficiently than random sequences, quasi-random sequences sometimes lead to more efficient high-dimensional integration algorithms than Monte Carlo integration. Quasi-Monte Carlo integration, i.e. integration based on quasi-random sequences rather than random sequences, is popular in financial applications. Art Owen has written insightful papers on Quasi-Monte Carlo integration (QMC). He has classified which integration problems can be efficiently computed via QMC methods and which cannot. In a nutshell, QMC works well when the effective dimension of a problem is significantly lower than the actual dimension. For example, a financial model might ostensibly depend on 1000 variables, but 50 of those variables contribute far more to the integrand than all the other variables. The integrand might essentially be a function of only 50 variables. In that case, QMC will work well. Note that it is not necessary to identify these 50 variables or do any change of variables. QMC just magically takes advantage of the situation.

One disadvantage of QMC integration is that it doesn’t naturally lead to an estimate of its own accuracy, unlike Monte Carlo integration. Several hybrid approaches have been proposed to combine QMC integration and Monte Carlo integration to get the efficiency of the former and the error estimates of the latter. For example, one could randomly jitter the quasi-random points or randomly permute their components. Some of these results are in Art Owen’s papers.

To read more about quasi-random sequences, see the book Random Number Generation and Quasi-Monte Carlo Methods.

Click to learn more about help with randomization


Numerical integration article posted

This weekend CodeProject posted an article I wrote entitled Fast numerical integration.

The algorithm in the article, introduced in the 1970’s by Masatake Mori and Hidetosi Takahasi, is indeed fast. It integrates analytical functions over bounded intervals with the most accuracy for a fixed number of integration points.

The CodeProject article includes source code and is mostly about the code. See this page for mathematical details.