# Lobatto integration

A basic idea in numerical integration is that if a method integrates polynomials exactly, it should do well on polynomial-like functions . The higher the degree of polynomial it integrates exactly, the more accurate we expect it will be on functions that behave like polynomials.

The best known example of this is Gaussian quadrature. However, this post shows why for some applications you might want to use Lobatto quadrature instead. Lobatto quadrature is similar in spirit to Gaussian quadrature, but allocates points to solve a slightly different optimization problem.

If you need to integrate a function over an interval, it matters a great deal where you choose to evaluate the function. The optimal choice, in term of what polynomials you can integrate exactly, is to use Gaussian quadrature. By evaluating the integrand at n optimally chosen points you can integrate polynomials of degree 2n – 1 or less exactly.

So if Gaussian integration is optimal, why would you want to do anything else? The Gaussian integration points are all interior to the interval of integration. In some applications, you have to evaluate your integrand at the end points, and so you want to optimize subject to a constraint: how can you best allocate your integration points subject to the constraint that two of your integration points are the two ends of the interval? The solution is Lobatto quadrature, also called Gauss-Lobatto quadrature.

By evaluating the integrand at n points, two of which are the end points, Lobatto’s method exactly integrates polynomials of degree 2n-3.

Suppose for whatever reason you already know the value of the integrand evaluated at the end points. You’ve got m more function evaluations to spend. If you use those m points for Gaussian quadrature, you can exactly integrate polynomials of degree

2m – 1

or less. But if you use Lobatto quadrature, your m interior evaluations plus your two known values at the end points give you a total of m+2 function evaluations, and so can integrate polynomials of degree

2(m + 2) – 3 = 2m + 1

or less exactly, two degrees higher than if we had used Gaussian quadrature.

Next, suppose you only know the value of the function you’re integrating at one end point. Say you’ve already integrate f(x) over the interval [1, 2] using Lobatto quadrature, and now you want to integrate over [2, 3]. You already know the value f(2) from your previous integration.

Suppose you have m new function evaluations you can afford, and you don’t know f(3). If you use Lobatto quadrature, f(3) has to come out of your function evaluation budget, so you can afford m – 1 interior integration points. You then know the value of f(x) at m+1 points: f(2) came for free, you evaluated f(x) at m – 1 interior points and at x = 3. This lets you exactly integrate polynomials of degree

2(m + 1) – 3 = 2m – 1

or less, the same as Gaussian quadrature. But if you then need to integrate over [3, 4], knowing f(3) gives you a head start on the next integration, and so on.

## Weights and integration points

You can look up the weights and integration points for Gaussian quadrature and Lobatto quadrature in, for example, Abramowitz and Stegun.

There is a nice symmetry between the two integration methods: Gaussian quadrature uses integration points based on the zeros of Legendre polynomials, and weights that depend on the derivatives of these polynomials. Lobatto quadrature is the other way around: integration points are given by the zeros of derivatives of Legendre polynomials, and the weights involve the Legendre polynomials themselves.

## Python example

Here we’ll implement the Gauss and Lobatto rules of order five. Most of the code is data on integration points and weights.

    gauss_points = [
-0.906179845938664,
-0.538469310105683,
0.0,
+0.538469310105683,
+0.906179845938664
]

gauss_weights = [
0.23692688505618908,
0.47862867049936647,
0.5688888888888889,
0.47862867049936647,
0.23692688505618908
]

lobatto_points = [
-1.0,
-0.6546536707079771,
0,
+0.6546536707079771,
+1.0
]

lobatto_weights = [
0.1,
0.5444444444444444,
0.7111111111111111,
0.5444444444444444,
0.1
]


The integration points and weights are symmetrical, so you could make the code more compact at the expense of making it a little more complicated. Putting the + in front of positive integration points is a little unconventional, but it emphasizes the symmetry by making the positive and negative weights align vertically.

Here’s our integration code:

    def integrate(f, xs, ws):
return sum(f(xs[i])*ws[i] for i in range(5))


where we pass it the function to integrate and either Gauss data or Lobatto data.

The following verifies that with 5 integration points, Gauss should be able to exactly integrate a 9th order polynomial, and Lobatto should be able to integrate a 7th order polynomial.

    print( integrate(lambda x: x**9 + 1,
gauss_points, gauss_weights) )

print( integrate(lambda x: x**7 + 1,
lobatto_points, lobatto_weights) )


Both print 2.0 as expected. The integral of an odd function over [-1, 1] is zero, and the integral of 1 over the same interval is 2.

Now let’s use both to integrate cosine over [-1, 1].

    print( integrate(cos, gauss_points, gauss_weights) )
print( integrate(cos, lobatto_points, lobatto_weights) )


The exact integral is 2 sin(1). Here are the results.

    Exact:   1.682941969
Gauss:   1.682941970
Lobatto: 1.682942320


So Gauss is correct to 8, almost 9, decimal places, and Lobatto is correct to 5 decimal places.

Next, let’s hard code a 3rd order Gauss rule for comparison.

    def gauss3(f):
k = 0.6**0.5
s = f(-k)*5 + f(0)*8 + f(k)*5
return s/9


We can verify that it integrates fifth order polynomials exactly:

    print(gauss3(lambda x: x**5 + 1))

and we can use it to integrate cosine:

    print(gauss3(cos))

This returns 1.68300, a little less accurate then the Lobatto rule above, illustrating that typically Lobatto will be more accurate than Gauss with the same number of function evaluations interior to the interval of integration.

## More on numerical integration

 Polynomials can’t have horizontal asymptotes, for example, and so we should not be surprised that a method that integrates high order polynomials exactly could still do poorly on, say, a normal probability density.

# Integral approximation trick

Here’s a simple integration approximation that works remarkably well in some contexts.

Suppose you have an integrand that looks roughly like a normal density, something with a single peak that drops off fairly quickly on either side of the peak. The majority of integrals that arise in basic applications of probability and statistics fit this description.

You can approximate the value of the integral as the area of a box. The height of the box is the value of the integrand at its peak. The width of the box is the FWHM: full width at half maximum, i.e. the distance between two points on where the integrand take on half its maximum value.

This post will include several examples to illustrate how well, or how poorly, the technique works.

## Gaussian example

First let’s look at exp(-x²), the normal density function apart for some scaling factors, integrated over the whole real line. The maximum of the function is 1, occurring at x = 0. So half the maximum is 1/2, and exp(-x²) = 1/2 at x = ± √ log 2. The FWHM is thus 2 √ log 2, and since our maximum is 1, the FWHM is the integral approximation. This gives 1.665, whereas the exact value is √π = 1.772.

## Polynomial example

Next, let’s look at integrating 1/(1 + x2n) for positive integers n. In every case, the maximum is 1. The integrand takes on half its maximum value at x = ± 1, so the FWHM is always 2. So we would approximate the integral as 2 in every case, independent of n.

When n = 1, the correct value of the integral is π and so our approximation of 2 is pretty bad. But that’s not surprising. We said above that our integrand needs to drop off fairly quickly, but that’s not the case here. We have a fat tailed distribution, the Cauchy, and so it’s not surprising that our approximation is poor.

However, as n increases, our integral drops off more quickly, and the approximation improves. When n = 2, the exact value of the integral is 2.221. When n = 3, the integral is 2.094. When n = 10 the integral is 2.008. And in the limit as n goes to infinity, the integral is 2 and our approximation is exact.

## Logistic example

The integrands in the examples above were all symmetric about the origin. That may give the misleading impression that our approximation only applies to symmetric integrals. The examples also all had a maximum of 1, which could give the impression that the integral approximation is just the FWHM, not the FWHM times the maximum. We’ll correct both of these impressions in the next example.

Our task is to approximate the integral which comes up in Bayesian logistic regression with an improper prior.

The integrand is symmetric about 0 when a = b. But otherwise the integrand is not symmetric, and the peak occurs at log(a/b) rather than at 0 and the value at that peak is We’ll use the following Python code to help with our calculations.

    from scipy import log, exp
from scipy.optimize import brentq

def logistic_max(a, b):
return a**a * b** b /(a+b)**(a+b)

def fwhm(a, b):
hm = 0.5*logistic_max(a, b)
f = lambda x: a*x - (a+b)*log(1 + exp(x)) - log(hm)
left = brentq(f, -2, 0)
right = brentq(f, 0, 2)
return right - left

def approx(a, b):
return logistic_max(a, b)*fwhm(a, b)


And now for a couple examples. When a = 3 and b = 2, we get an approximate value of 0.0761, compared to an exact value of 0.0833, a relative error of 8.6%.

When a  = 10 and b = 12, we get an approximate value of 2.647 × 10-7, compared to an exact value of 2.835 × 10-7, a relative error of 6.6%.

# Integration by long division

Since integration is the inverse of differentiation, you can think of integration as “dividing” by d.

J. P. Ballantine  shows that you can formally divide by d and get the correct integral. For example, he arrives at using long division! J. P. Ballantine. Integration by Long Division. The American Mathematical Monthly, Vol. 58, No. 2 (Feb., 1951), pp. 104-105

# Currying in calculus, PDEs, programming, and categories Currying is a simple but useful idea. (It’s named after logician Haskell Curry  and has nothing to do with spicy cuisine.) If you have a function of two variables, you can think of it as a function of one variable that returns a function of one variable. So starting with a function f(x, y), we can think of this as a function that takes a number x and returns a function f(x, -) of y. The dash is a placeholder as in this recent post.

## Calculus: Fubini’s theorem

If you’ve taken calculus then you saw this in the context of Fubini’s theorem: To integrate the function of two variables f(x, y), you can temporarily fix y and integrate the remaining function of x. This gives you a number, the value of an integral, for each y, so it’s a function of y. Integrate that function, and you have the value of the original function of two variables.

The first time you see this you may think it’s a definition, but it’s not. You can define the integral on the left directly, and it will equal the result of the two nested integrations on the right. Or at least the two sides will often be equal. The conditions on Fubini’s theorem tell you exactly when the two sides are equal.

## PDEs: Evolution equations

A more sophisticated version of the same trick occurs in partial differential equations. If you have an evolution equation, a PDE for a function on one time variable and several space variables, you can think of it as an ODE via currying. For each time value t, you get a function of the spatial variables. So you can think of your solution as a path in a space of functions. The spatial derivatives specify an operator on that space of functions.

(I’m glossing over details here because spelling everything out would take a lot of writing, and might obscure the big idea, which relevant for this post. If you’d like the full story, you can see, for example, my graduate advisor’s book. It was out of print when I studied it, but now it’s a cheap Dover paperback.)

In the Haskell programming language (also named after Haskell Curry) you get currying for free. In fact, there’s no other way to express a function of two variables. For example, suppose you want to implement the function f(xy) = x² + y.

    Prelude> f x y = x**2 + y

Then Haskell thinks of this as a function of one variable (i.e. x), that returns a function of one variable (i.e. f(x, -)) which itself returns a number (i.e. f(x, y)). You can see this by asking the REPL for the type of f:

    Prelude> :info f
f :: Floating a => a -> a -> a


Technically, Haskell, just like lambda calculus, only has functions of one variable. You could create a product datatype consisting of a pair of variables and have your function take that as an argument, but it’s still a function on one variable, though that variable is not atomic.

## Category theory

The way you’d formalize currying in category theory is to say that the following is a natural isomorphism: For more on what Hom means, see this post.

## Related posts

 In concordance with Stigler’s law of eponymy, currying was not introduced by Curry but Gottlob Frege. It was then developed by Moses Schönfinkel and developed further by Haskell Curry.

# Quantifying normal approximation accuracy

Probability is full of theorems that say that probability density approximates another as some parameter becomes large. All the dashed lines in the diagram below indicate a relationship like this. You can find details of what everything in the diagram means here.

How can you quantify these approximations? One way is to use Kullback-Leibler divergence.  In this post I’ll illustrate this for the normal approximation to the beta and gamma distributions.

The Kullback-Leibler divergence between two random variables X and Y is defined as We will compute this integral numerically in the code below to create graphs of how K-L divergence varies with parameters.

Here are the imports we’ll need.

    from scipy.integrate import quad
from scipy.stats import beta, gamma, norm
from scipy import inf
import matplotlib.pyplot as plt


## Beta distribution

As the shape parameters of a beta distribution become large, the probability distribution becomes approximately normal (Gaussian).

Here is code that will take the two shape parameters of a beta distribution, construct a normal approximation by moment matching, and compute the quality of the approximation as measured by Kullback-Leibler divergence.

    def KL_beta_norm(a, b):
b = beta(a, b)
n = norm(b.mean(), b.std())
f = lambda x: -b.pdf(x)*(n.logpdf(x) - b.logpdf(x))
integral, error = quad(f, 0, 1)
return integral


And here we make our plot.

    x = [2**i for i in range(11)]
y = [KL_beta_norm(t, 2*t) for t in x]
plt.plot(x, y)
plt.xscale("log")
plt.yscale("log")
plt.xlabel("a")
plt.ylabel("KL divergence")
plt.title("Comparing beta(a, 2a) and its normal approximation")
plt.savefig("beta_KL.svg")
plt.close()


The result is nearly linear on a log-log scale. I made the b parameter twice the a parameter to show that you don’t need symmetry. When you do have symmetry, i.e a = b, the approximation quality is better and the graph is even straighter.

## Gamma distribution

As the shape parameter of a gamma distribution increases, the probability density becomes more and more like that of a normal distribution. We can quantify this with the following code.

    def KL_gamma_norm(shape):
g = gamma(shape)
n = norm(g.mean(), g.std())
f = lambda x: -g.pdf(x)*(n.logpdf(x) - g.logpdf(x))
mode = max(0, shape-1)
integral1, error1 = quad(f, 0, mode)
integral2, error2 = quad(f, mode, inf)
return integral1 + integral2


The integration code is a little more complicated this time. For small shape parameters, code analogous to that for the beta distribution will work just fine. But for larger parameters, the integration fails. The numerical integration routine needs a little help. The largest contribution to the integral is located near the mode of the gamma distribution. For large shape parameters, the integration routine misses this peak and grossly underestimates the integral. We break the integral into two pieces at the mode of the gamma distribution so the integration routine can’t miss it. This is a small example of why numerical integration cannot be completely automated. You have to know something about what you’re integrating.

(The quad() function has a parameter points to let you tell it about points of interest like this, but it doesn’t work when one of the limits of integration is infinite.)

The plotting code is essentially the same as that for the beta distribution. As before, the plot is linear on a log-log scale. You could do a similar analysis on the other approximations in the distribution relationship diagram above.

# Scaling and Monte Carlo integration methods

Here’s an apparent paradox. You’ll hear that Monte Carlo methods are independent of dimension, and that they scale poorly with dimension. How can both statements be true?

The most obvious way to compute multiple integrals is to use product methods, analogous to the way you learn to compute multiple integrals by hand. Unfortunately the amount of work necessary with this approach grows exponentially with the number of variables, and so the approach is impractical beyond modest dimensions.

The idea behind Monte Carlo integration, or integration by darts, is to estimate the area under a (high dimensional) surface by taking random points and counting how many fall below the surface. More details here. The error goes down like the reciprocal of the square root of the number of points. The error estimate has nothing to do with the dimension per se. In that sense Monte Carlo integration is indeed independent of dimension, and for that reason is often used for numerical integration in dimensions too high to use product methods.

Suppose you’re trying to estimate what portion of a (high dimensional) box some region takes up, and you know a priori that the proportion is in the ballpark of 1/2. You could refine this to a more accurate estimate with Monte Carlo, throwing darts at the box and seeing how many land in the region of interest.

But in practice, the regions we want to estimate are small relative to any box we’d put around them. For example, suppose you want to estimate the volume of a unit ball in n dimensions by throwing darts at the unit cube in n dimensions. When n = 10, only about 1 dart in 400 will land in the ball. When n = 100, only one dart out of 1070 will land inside the ball. (See more here.) It’s very likely you’d never have a dart land inside the ball, no matter how much computing power you used.

If no darts land inside the region of interest, then you would estimate the volume of the region of interest to be zero. Is this a problem? Yes and no. The volume is very small, and the absolute error in estimating a small number to be zero is small. But the relative is 100%.

(For an analogous problem, see this discussion about the approximation sin(θ) = θ for small angles. It’s not just saying that one small number is approximately equal to another small number: all small numbers are approximately equal in absolute error! The small angle approximation has small relative error.)

If you want a small relative error in Monte Carlo integration, and you usually do, then you need to come up with a procedure that puts a larger proportion of integration points in the region of interest. One such technique is importance sampling, so called because it puts more samples in the important region. The closer the importance sampler fits the region of interest, the better the results will be. But you may not know enough a priori to create an efficient importance sampler.

So while the absolute accuracy of Monte Carlo integration does not depend on dimension, the problems you want to solve with Monte Carlo methods typically get harder with dimension.

# Integrating polynomials over a sphere or ball

Spheres and balls are examples of common words that take on a technical meaning in math, as I wrote about here. Recall the unit sphere in n dimensions is the set of points with distance 1 from the origin. The unit ball is the set of points of distance less than or equal to 1 from the origin. The sphere is the surface of the ball.

Integrating a polynomial in several variables over a ball or sphere is easy. For example, take the polynomial xy² + 5x²z² in three variables. The integral of the first term, xy², is zero. If any variable in a term has an odd exponent, then the integral of that term is zero by symmetry. The integral over half of the sphere (ball) will cancel out the integral over the opposite half of the sphere (ball). So we only need to be concerned with terms like 5x²z².

Now in n dimensions, suppose the exponents of x1, x2, …, xn are a1, a2, …, an respectively. If any of the a‘s are odd, the integral over the sphere or ball will be zero, so we assume all the a‘s are even. In that case the integral over the unit sphere is simply where is the multivariate beta function and for each i we define bi = (ai + 1)/2. When n = 2 then B is the (ordinary) beta function.

Note that the integral over the unit sphere doesn’t depend on the dimension of the sphere.

The integral over the unit ball is which is proportional to the integral over the sphere, where the proportionality constant depends on the sum of the exponents (the original exponents, the a‘s, not the b‘s) and the dimension n.

Note that if we integrate the constant polynomial 1 over the unit sphere, we get the surface area of the unit sphere, and if we integrate it over the unit ball, we get the volume of the unit ball.

You can find a derivation for the integral results above in . The proof is basically Liouville’s trick for integrating the normal distribution density, but backward. Instead of going from rectangular to polar coordinates, you introduce a normal density and go from polar to rectangular coordinates.

 Gerald B. Folland, How to Integrate a Polynomial over a Sphere. The American Mathematical Monthly, Vol. 108, No. 5 (May, 2001), pp. 446-448.

# Hermite polynomials, expected values, and integration

In a previous post, I alluded to using Hermite polynomials in conjunction with higher-order Laplace approximation. In this post I’ll expand on what that means.

Hermite polynomials are orthogonal polynomials over the real line with respect to the weight given by the standard normal distribution. (There are two conventions for defining Hermite polynomials, what Wikipedia calls the physicist convention and the probabilist convention. We’re using the latter here.)

The first few Hermite polynomials are 1, x, x2 – 1, x3 – 3x, … . You can find the rest of the Hermite polynomials via the recurrence relation

Hn+1(x) = x Hnn Hn-1(x).

You could think of the Hermite polynomials as the right basis to use when working with normal probability distributions. Writing a polynomial as a linear combination of Hermite polyn0mials is a change of basis that makes integration easy: Here [k even] is the function that returns 1 if k is even and 0 otherwise. This notation was introduced by Kenneth Iverson in the APL programming language and has become moderately common in mathematics.

# Laplace approximation of an integral from Bayesian logistic regression

Define and This integral comes up in Bayesian logistic regression with a uniform (improper) prior. We will use this integral to illustrate a simple case of Laplace approximation.

The idea of Laplace approximation is to approximate the integral of a Gaussian-like function by the integral of a (scaled) Gaussian with the same mode and same curvature at the mode. That’s the most common, second order Laplace approximation. More generally, Laplace’s approximation depends on truncating a Taylor series, and truncating after the 2nd order term gives the approximation described above.

Let x0 be the place where f takes on its maximum and let g = log f. Note that g also takes on its maximum at x0 and so its first derivative is zero there. Then the (second order) Laplace approximation has the following derivation. We can show that and It follows that x0 = log (m/n) and the Laplace approximation for I(m, n) reduces to .

Now let’s compare the Laplace approximation to the exact value of the integral for a few values of m and n. We would expect the Laplace approximation to be more accurate when the integrand is shaped more like the Gaussian density. This happens when m and n are large and when they are more nearly equal. (The function f is a likelihood, and m+n is the number of data points in the likelihood. More data means the likelihood is more nearly normal. Also, m = n means f is symmetric, which improves the normal approximation.)

We’ll look at (m, n) = (2, 1), (20, 10), (200, 100), and (15, 15). Here’s the plot of f(x, 2, 1) with its normal approximation, scaled vertically so that the heights match. Even for small arguments, the Gaussian approximation fits well on the left side, but not so much on the right. For large arguments it would be hard to see the difference between the two curves.

Here are the results of the Laplace approximation and exact integration.

|-----+-----+---------------+---------------+--------------|
|   m |   n |        approx |         exact |    rel error |
|-----+-----+---------------+---------------+--------------|
|   2 |   1 |    0.45481187 |           0.5 | -0.090376260 |
|  20 |  10 |  4.9442206e-9 | 4.99250870e-9 | -0.009672111 |
|  15 |  15 | 8.5243139e-10 | 8.5956334e-10 | -0.008297178 |
| 200 | 100 | 3.6037801e-84 | 3.6072854e-84 | -0.000971728 |
|-----+-----+---------------+---------------+--------------|


The integrals were computed exactly using Mathematica; the results are rational numbers. 

Note that when m and n increase by a factor of 10, the relative error goes down by about a factor of 10. This is in keeping with theory that say the error should be O(1/(m+n)). Also, note that the error for (15, 15) is a little lower than that for (20, 10). The errors are qualitatively just what we expected.

***

 I initially used Mathematica to compute the integrals, but then realized that’s not necessary. The change of variables u = exp(x) / (1 + exp(x)) shows that

I(ab) = B(ab+1) + B(a+1, b)

where B is the beta function. Incidentally, this shows that if a and b are integers then  I(ab) is rational.

# Making a problem easier by making it harder

In the oral exam for my PhD, my advisor asked me a question about a differential equation. I don’t recall the question, but I remember the interaction that followed.

I was stuck, and my advisor countered by saying “Let me ask you a harder question.” I was still stuck, and so he said “Let me ask you an even harder question.” Then I got it.

By “harder” he meant “more general.” He started with a concrete problem, then made it progressively more abstract until I recognized it. His follow-up questions were logically harder but psychologically easier.

This incident came to mind when I ran across an example in Lawrence Evans’ control theory course notes. He uses the example to illustrate what he calls an example of mathematical wisdom:

It is sometimes easier to solve a problem by embedding it within a larger class of problems and then solving the larger class all at once.

The problem is to evaluate the integral of the sinc function: He does so by introducing the more general problem of evaluating the function which reduces to the sinc integral when α = 0.

We can find the derivative of I(α) by differentiating under the integral sign and integrating by parts twice. Therefore As α goes to infinity, I(α) goes to zero, and so C = π/2 and I(0) = π/2.

Incidentally, note that instead of computing an integral in order to solve a differential equation as one often does, we introduced a differential equation in order to compute an integral.