High-dimensional integration

Numerically integrating functions of many variables almost always requires Monte Carlo integration or some variation. Numerical analysis textbooks often emphasize one-dimensional integration and, almost as a footnote, say that you can use a product scheme to bootstrap one-dimensional methods to higher dimensions. True, but impractical.

Traditional numerical integration routines work well in low dimensions. (“Low” depends on context, but let’s say less than 10 dimensions.) When you have the choice of a well-understood, deterministic method, and it works well, by all means use it. I’ve seen people who are so used to problems requiring Monte Carlo methods that they use these methods on low-dimensional problems where they’re not the most efficient (or robust) alternative.

Monte Carlo

Except for very special integrands, high dimensional integration requires either Monte Carlo integration or some variation such as quasi-Monte Carlo methods.

As I wrote about here, naive Monte Carlo integration isn’t practical for high dimensions. It’s unlikely that any of your integration points will fall in the region of interest without some sort of importance sampler. Constructing a good importance sampler requires some understanding of your particular problem. (Sorry. No one-size-fits-all black-box methods are available.)

Quasi-Monte Carlo (QMC)

Quasi-Monte Carlo methods (QMC) are interesting because they rely on something like a random sequence, but “more random” in a sense, i.e. more effective at exploring a high-dimensional space. These methods work well when an integrand has low effective dimension. The effective dimension is low when nearly all the action takes place on a lower dimensional manifold. For example, a function may ostensibly depend on 10,000 variables, while for practical purposes 12 of these are by far the most important. There are some papers by Art Owen that say, roughly, that Quasi-Monte Carlo methods work well if and only if the effective dimension is much smaller than the nominal dimension. (QMC methods are especially popular in financial applications.)

While QMC methods can be more efficient than Monte Carlo methods, it’s hard to get bounds on the integration error with these methods. Hybrid approaches, such as randomized QMC can perform remarkably well and give theoretical error bounds.

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) is a common variation on Monte Carlo integration that uses dependent random samples. In many applications, such as Bayesian statistics, you’d like to be able to create independent random samples from some probability distribution, but this is not practical. MCMC is a compromise. It produces a Markov chain of dependent samples, and asymptotically these samples have the desired distribution. The dependent nature of the samples makes it difficult to estimate error and to determine how well the integration estimates using the Markov chain have converged.

(Some people speak of the Markov chain itself converging. It doesn’t. The estimates using the Markov chain may converge. Speaking about the chain converging may just be using informal language, or it may betray a lack of understanding.)

There is little theory regarding the convergence of estimates from MCMC, aside from toy problems. An estimate can appear to have converged, while an important region of the integration problem remains unexplored. Despite its drawbacks, MCMC is the only game in town for many applications.


Numerical integration consulting


Random inequalities and Edgeworth approximation

I’ve written a lot about random inequalities. That’s because computers spend a lot of time computing random inequalities in the inner loop of simulations. I’m looking for ways to speed this up.

Here’s my latest idea: Approximating random inequalities with Edgeworth expansions

An Edgeworth expansion is like a Fourier series, except you use derivatives of the normal density as your basis rather than sine functions. Sometimes the full Edgeworth expansion does not converge and yet the first few terms make a good approximation. The tech report explicitly considers Edgeworth approximations with just two terms, but demonstrates the integration tricks necessary to use more terms. The result is computed in closed form, no numerical integration required, and so may be much faster than other approaches.

One advantage of the Edgeworth approach is that it only depends on the moments of the distributions in the inequality. This means it provides an approximation that’s waiting to be used on new families of distributions. But because it’s not specific to a distribution family, its performance in a particular case needs to be explored. In the case of beta distributions, for example, even a single-term approximation does pretty well.

More blog posts on random inequalities

Beta inequalities in R

Someone asked me yesterday for R code to compute the probability P(X > Y + δ) where X and Y are independent beta random variables. I’m posting the solution here in case it benefits anyone else.

For an example of why you might want to compute this probability, see A Bayesian view of Amazon resellers.

Let X be a Beta(a, b) random variable and Y be a Beta(c, d) random variable. Denote PDFs by f and CDFs by F. Then the probability we need is

P(X > Y + \delta) &=& \int_\delta^1 \int_0^{x-\delta} f_X(x) \, f_Y(y)\, dy,dx \ &=& \int_\delta^1 f_X(x)\, F_Y(x-\delta) \, dx

If you just need to compute this probability a few times, here is a desktop application to compute random inequalities.

But if you need to do this computation repeated inside R code, you could use the following.

beta.ineq <- function(a, b, c, d, delta)
    integrand <- function(x) { dbeta(x, a, b)*pbeta(x-delta, c, d) }
    integrate(integrand, delta, 1, rel.tol=1e-4)$value

The code is as good or as bad as R’s integrate function. It’s probably accurate enough as long as none of the parameters a, b, c, or d are near zero. When one or more of these parameters is small, the integral is harder to compute numerically.

There is no error checking in the code above. A more robust version would verify that all parameters are positive and that delta is less than 1.

Here’s the solution to the corresponding problem for gamma random variables, provided delta is zero: A support one-liner.

And here is a series of blog posts on random inequalities.

Avoiding underflow in Bayesian computations

Here’s a common problem that arises in Bayesian computation. Everything works just fine until you have more data than you’ve seen before. Then suddenly you start getting infinite, NaN, or otherwise strange results. This post explains what might be wrong and how to fix it.

A posterior density is (proportional to) a likelihood function times a prior distribution. The likelihood function is a product. The number of data points is the number of terms in the product. If these numbers are less than 1, and you multiply enough of them together, the result will be too small to represent in a floating point number and your calculation will underflow to zero. Then subsequent operations with this number, such as dividing it by another number that has also underflowed to zero, may produce an infinite or NaN result.

The instinctive reaction regarding underflow or overflow is to use logs. And often that works. If you wanted to know where the maximum of the posterior density occurs, you could find the maximum of the logarithm of the posterior density. But in Bayesian computations you often need to integrate the posterior density times some functions. Now you cannot just work with logs because logs and integrals don’t commute.

One way around the problem is to multiply the integrand by a constant so large that there is no danger of underflow. Multiplying by constants does commute with integration.

So suppose your integrand is on the order of 10^−400, far below the smallest representable number. Do you need extended precision arithmetic? No, you just need to understand your problem.

If you multiply your integrand by 10^400 before integrating, then your integrand is roughly 1 in magnitude. Then do you integration, and remember that the result is 10^400 times the actual value.

You could take the following steps.

  1. Find m, the maximum of the log of the integrand.
  2. Let I be the integral of exp( log of the integrand − m ).
  3. Keep track that your actual integral is exp(m) I, or that its log is m + log I.

Note that you can be extremely sloppy in this process. You don’t need an accurate estimate of the maximum of the integrand per se. If you’re within a few dozen orders of magnitude, for example, that could be sufficient to carry out your integration without underflow.

One way to estimate the maximum is to use a frequentist estimator of your parameters as an approximate MLE, and assume that this is approximately where your posterior density takes on its maximum value. This might actually be very accurate, but it doesn’t need to be.

Note also that it’s OK if some of the evaluations of your integrand underflow to zero. You just don’t want the entire integral to underflow to zero. Places where the integrand is many orders of magnitude less than its maximum don’t contribute to the integration anyway. (It’s important that your integration pays attention to the region where the integrand is largest. A naive integration could entirely miss the most important region and completely underestimate the integral. But that’s a matter for another blog post.)

How to compute log factorial

Factorials grow very quickly, so quickly the results can easily exceed the limits of computer arithmetic. For example, 30! is too big to store in a typical integer and 200! is too big to store (even approximately) in a standard floating point number. It’s usually more convenient to work with the logarithms of factorials rather than factorials themselves.

So how might you compute log( n! )? You don’t want to compute n! first and then take logs because you’ll overflow for moderately large arguments. You want to compute the log factorial directly.

Since n! = 1 × 2 × 3 × … × n,

log(n!) = log(1) + log(2) + log(3) + … + log(n).

That gives a way to compute log(n!), but it’s slow for large arguments: the run time is proportional to the size of n.

If you only need to compute log(n!) for n within a moderate range, you could just tabulate the values. Calculate log(n!) for n = 1, 2, 3, …, N by any means, no matter how slow, and save the results in an array. Then at runtime, just look up the result.

But suppose you want to be able to compute log(n!) for any value of n such that log(n!) won’t overflow. That requires a very large value of n. Since log(n!) is on the order of n log (n) for large n, log(n!) won’t overflow even for some very large values of n. You’ll run out of memory to store your table before you’ll run out of values of n such that log(n!) doesn’t overflow.

So now the problem becomes how to evaluate log(n!) for large values of n. Say we tabulate log(n!) for n up to some size and then use a formula to calculate log(n!) for larger arguments. I’m going to switch notation now and work with the gamma function Γ(x) because most references state their results in terms of the gamma function rather than in terms of factorial. It’s easy to move between the two since Γ(n + 1) = n!.

Stirling’s approximation says that

log Γ(x) ≈ (x − 1/2) log(x) − x  + (1/2) log(2 π)

with an error on the order of 1/12x. So if n were around 1000, the approximation would be good to about four decimal places. What if we wanted more accuracy but not a bigger table? Stirling’s approximation above is part of an infinite series of approximations, an asymptotic series for log Γ(x):

log Γ(x) ≈ (x − 1/2) log(x) − x + (1/2) log(2 π) + 1/(12 x) − 1/(360 x3) + 1/(1260 x5) – …

This raises a couple questions.

  1. What is the form of a general term in the series?
  2. How accurate is the series when we truncate it at some point?

The answer to the first question is the term with x2m−1 in the denominator has coefficient B2m / (2m(2m − 1)) where the B‘s are Bernoulli numbers. Perhaps that’s not very satisfying, but that’s what it is.

Now to the question of accuracy. If you’ve never used asymptotic series before, you might be tempted to use one like you’d use a Taylor series: the more terms the better. But asymptotic series don’t work that way. Typically the error improves at first as you take more terms, reaches a minimum, then increases as you take more terms. Another difference is that while Taylor series approximations improve as arguments get smaller, asymptotic approximations improve as arguments get larger. That’s convenient for us since we’re looking for an approximation for n so large that it’s beyond our table of saved values.

For this particular series, the absolute value of the error in truncating the series is less than the absolute value of the first term that was left out.  Suppose we make a look-up table for the values 1 through 256. If we truncate the series after 1/(12 x), the error will be less than 1/(360 x3). If x > 256, log(x!) > 1419 and the error in the asymptotic approximation is less than 1/(360 × 224) = 1.65 × 10−10. Since the number we’re computing has at least four digits and the result is good to 10 decimal places, we should have at least 14 significant figures, near the limits of floating point precision. (For more details, see Anatomy of a floating point number.)

In summary, one way to compute log factorial is to pre-compute log(n!) for n = 1, 2, 3, … 256 and store the results in an array. For values of n ≤ 256, look up the result from the table. For n > 256, return

(x − 1/2) log(x) − x + (1/2) log(2 π) + 1/(12 x)

with x = n + 1. This has been coded up here.

You could include the 1/(360 x3) term or higher terms from the asymptotic series and use a smaller table. This would use less memory but would require more computation for arguments outside the range of the table.

Related posts

Generating Poisson random values

The most direct way of generating random samples from a Poisson distribution is efficient for some parameters and inefficient for others. Wikipedia attributes the following algorithm to Donald Knuth:

         Let L ← exp(−λ), k ← 0 and p ← 1.
         k ← k + 1.
         Generate uniform random number u in [0,1] and let p ← p × u.
    while p > L.
    return k − 1.

Here’s the idea behind the algorithm. The time between arrivals in a Poisson process is exponentially distributed. Count how many arrivals there are in an interval by simulating the times between arrivals and adding them up until the time sum spills over the interval.

The problem with this approach is that the expected run time of the loop is proportional to the parameter λ. This algorithm is commonly used despite its terrible performance for large arguments.

Below is an algorithm that has expected run time independent of the argument λ. The algorithm is fairly simple, though it takes a moderate amount of theoretical justification to explain. It goes back to 1979 and so may not the most efficient algorithm available. It is definitely not the most efficient algorithm if you need to generate a large number of samples using the same parameter λ. The paper describing the algorithm recommends using Knuth’s algorithm for values of λ less than 30 and this algorithm for larger values of λ.

c = 0.767 - 3.36/lambda
beta = PI/sqrt(3.0*lambda)
alpha = beta*lambda
k = log(c) - lambda - log(beta)

	u = random()
	x = (alpha - log((1.0 - u)/u))/beta
	n = floor(x + 0.5)
	if (n < 0)
	v = random()
	y = alpha - beta*x
	lhs = y + log(v/(1.0 + exp(y))^2)
	rhs = k + n*log(lambda) - log(n!)
	if (lhs <= rhs)
		return n

This is an accept-reject method based on a normal approximation. The performance improves as λ increases because the quality of the normal approximation improves. (Actually, it uses a logistic approximation to a normal approximation!) Theoretically it could be caught in an infinite loop, as with all accept-reject methods. However, the expected number of iterations is bounded. The continue statement means that if n is negative, the algorithm goes back to the top of the loop. The random() function is assumed to return a uniform random variable between 0 and 1.

A possible obstacle to turning the pseudocode above into actual code is computing log(n!). Naively computing n! and taking the log of the result will overflow for moderately large values of n. If you have a function available that computes the log of the gamma function, such as lgamma in C’s math.h, then evaluate that function at n+1. Otherwise, see How to compute log factorial.

Update: See Stand-alone code for numerical computing for an implementation of the pseudocode here.

The algorithm above is “method PA” from “The Computer Generation of Poisson Random Variables” by A. C. Atkinson, Journal of the Royal Statistical Society Series C (Applied Statistics) Vol. 28, No. 1. (1979), pages 29-35. Method PA is on page 32.

Related posts

Random inequalities IX: new tech report

Just posted: Exact calculation of inequality probabilities. This report summarizes previous results for calculating P(X > Y) where X and Y are random variables.

Previous posts on random inequalities:

Random inequalities VIII: folded normals

Someone who ran into my previous posts on random inequalities asked me how to compute random inequalities for folded normals. (A folded normal random variable is the absolute value of a normal random variable.) So the question is how to compute

Pr(|X| > |Y|)

where X and Y are normally distributed. Here’s my reply as a short tech report: Inequality probabilities for folded normal random variables.

Previous posts in this series:

Analytical results
Numerical results
Cauchy distributions
Beta distributions
Gamma distributions
Three or more random variables

Random inequalities VII: three or more variables

The previous posts in this series have looked at P(X > Y), the probability that a sample from a random variable X is greater than a sample from an independent random variable Y. In applications, X and Y have different distributions but come from the same distribution family.

Sometimes applications require computing P(X > max(Y, Z)). For example, an adaptively randomized trial of three treatments may be designed to assign a treatment with probability equal to the probability that that treatment has the best response. In a trial with a binary outcome, the variables X, Y, and Z may be beta random variables representing the probability of response. In a trial with a time-to-event outcome, the variables might be gamma random variables representing survival time.

Sometimes we’re interested in the opposite inequality, P(X < min(Y,Z)). This would be the case if we thought in terms of failures rather than responses, or wanted to minimize the time to a desirable event rather than maximizing the time to an undesirable event.

The maximum and minimum inequalities are related by the following equation:

P(X < min(Y, Z)) = P(X > max(Y, Z)) + 1 − P(X > Y) − P(X > Z).

These inequalities are used for safety monitoring rules as well as to determine randomization probabilities. In a trial seeking to maximize responses, a treatment arm X might be dropped if P(X > max(Y,Z)) becomes too small.

In principle one could design an adaptively randomized trial with n treatment arms for any n ≥ 2 based on

P(X1 > max(X2, …, Xn)).

In practice, the most common value of n by far is 2. Sometimes n is 3. I’m not familiar with an adaptively randomized trial with more than three arms. I’ve heard of an adaptively randomized trial that was designed with five arms, but I don’t believe the trial ran.

Computing P(X1 > max(X2, …, Xn)) by numerical integration becomes more difficult as n increases. For large n, simulation may be more efficient than integration. Computing P(X1 > max(X2, …, Xn)) for gamma random variables with n=3 was unacceptably slow in a previous version of our adaptive randomization software. The search for a faster algorithm lead to this paper: Numerical Evaluation of Gamma Inequalities.

Previous posts on random inequalities:

Random inequalities VI: gamma distributions

This post looks at computing P(X > Y) where X and Y are gamma random variables. These inequalities are central to the Thall-Wooten method of monitoring single-arm clinical trials with time-to-event outcomes. They also are central to adaptively randomized clinical trials with time-to-event outcomes.

When X and Y are gamma random variables P(X > Y) can be computed in terms of the incomplete beta function. Suppose X has shape αX and scale βX and Y has shape αY and scale βY. Then βXY/(βX Y+ βYX) has a beta(αY, αX) distribution. (This result is well-known in the special case of the scale parameters both equal to 1. I wrote up the more general result here but I don’t imagine I was the first to stumble on the generalization.) It follows that

P(X < Y) = P(B < βX/(βX+ βY))

where B is a beta(αY, αX) random variable.

For more details, see Numerical evaluation of gamma inequalities.

Previous posts on random inequalities: