Painless project management

This evening, after I got off a phone call discussing a project with a colleague, I thought “Huh, I guess you could call that project management.”

I worked as a project manager earlier in my career, but what I’m doing now feels completely different and much more pleasant. Strip away the bureaucracy and politics, and project management just feels like getting work done.

I have several people working for me now, but project management doesn’t take much time. I still spend most of my time working on my own projects.

Quantifying normal approximation accuracy

Probability is full of theorems that say that probability density approximates another as some parameter becomes large. All the dashed lines in the diagram below indicate a relationship like this.


You can find details of what everything in the diagram means here.

How can you quantify these approximations? One way is to use Kullback-Leibler divergence.  In this post I’ll illustrate this for the normal approximation to the beta and gamma distributions.

The Kullback-Leibler divergence between two random variables X and Y is defined as

KL(X || Y) = -\int f_X(x) \log \frac{f_Y(x)}{f_X(x)} \, dx

We will compute this integral numerically in the code below to create graphs of how K-L divergence varies with parameters.

Here are the imports we’ll need.

    from scipy.integrate import quad
    from scipy.stats import beta, gamma, norm
    from scipy import inf
    import matplotlib.pyplot as plt

Beta distribution

As the shape parameters of a beta distribution become large, the probability distribution becomes approximately normal (Gaussian).

Here is code that will take the two shape parameters of a beta distribution, construct a normal approximation by moment matching, and compute the quality of the approximation as measured by Kullback-Leibler divergence.

    def KL_beta_norm(a, b):
        b = beta(a, b)
        n = norm(b.mean(), b.std())
        f = lambda x: -b.pdf(x)*(n.logpdf(x) - b.logpdf(x))    
        integral, error = quad(f, 0, 1)
        return integral

And here we make our plot.

    x = [2**i for i in range(11)]
    y = [KL_beta_norm(t, 2*t) for t in x]
    plt.plot(x, y)
    plt.ylabel("KL divergence")
    plt.title("Comparing beta(a, 2a) and its normal approximation")

The result is nearly linear on a log-log scale.

Kullback Leibler divergence from normal approximation to beta

I made the b parameter twice the a parameter to show that you don’t need symmetry. When you do have symmetry, i.e a = b, the approximation quality is better and the graph is even straighter.

Gamma distribution

As the shape parameter of a gamma distribution increases, the probability density becomes more and more like that of a normal distribution. We can quantify this with the following code.

    def KL_gamma_norm(shape):
        g = gamma(shape)
        n = norm(g.mean(), g.std())
        f = lambda x: -g.pdf(x)*(n.logpdf(x) - g.logpdf(x))
        mode = max(0, shape-1)
        integral1, error1 = quad(f, 0, mode)
        integral2, error2 = quad(f, mode, inf)    
        return integral1 + integral2

The integration code is a little more complicated this time. For small shape parameters, code analogous to that for the beta distribution will work just fine. But for larger parameters, the integration fails. The numerical integration routine needs a little help. The largest contribution to the integral is located near the mode of the gamma distribution. For large shape parameters, the integration routine misses this peak and grossly underestimates the integral. We break the integral into two pieces at the mode of the gamma distribution so the integration routine can’t miss it. This is a small example of why numerical integration cannot be completely automated. You have to know something about what you’re integrating.

(The quad() function has a parameter points to let you tell it about points of interest like this, but it doesn’t work when one of the limits of integration is infinite.)

The plotting code is essentially the same as that for the beta distribution. As before, the plot is linear on a log-log scale.

Kullback Leibler divergence for normal approximation to gamma

You could do a similar analysis on the other approximations in the distribution relationship diagram above.

Related posts

Ordinary Potential Polynomials

I’ve run into potential polynomials a lot lately. Most sources I’ve seen are either unclear on how they are defined or use a notation that doesn’t sit well in my brain. Also, they’ll either give an implicit or explicit definition but not both. Both formulations are important. The implicit formulation suggests how potential polynomials arise in practice. The explicit formulation shows how you’d actually compute them when you need to.

The potential polynomials Ar,n are homogeneous polynomials of degree r in n variables. They are usually defined implicitly by

\left( 1 + \sum_{n=1}^\infty x_n z^n\right )^r = \sum_{n=0}^\infty A_{r, n}(x_1, x_2, \ldots, x_n)z^n

The can also be defined explicitly in terms of ordinary Bell polynomials Bn,k by

A_{r, n}(x_1, x_2, \ldots, x_n) = \sum_{k=1}^n {r \choose k} B_{n, k}(x_1, x_2, \ldots, x_{n-k+1})

Does anyone know why they’re called “potential” polynomials? Is there some analogy with a physical potential? [1]

By the way, potential polynomials are called “ordinary” in the same sense that Bell polynomials and generating functions are called ordinary: to contrast with exponential forms that insert factorial scaling factors. See the previous post for details.

* * *

[1] When I was learning electricity and magnetism, I was confused by the phrase “potential difference.” Did that mean that there isn’t a difference yet, but there was the potential for a difference? No, it means a difference in electrical potential. There’s still a sense in which “potential” carries with it the idea of possibility. A rock sitting on top of a mountain has a lot of gravitational potential, energy that could be released if the rock started falling.

Potential polynomials could sound like functions that aren’t polynomials yet but have the potential to be, maybe if they just tried harder.

Bell polynomials: partial, ordinary, and exponential

Bell polynomials come up naturally in a variety of contexts: combinatorics, analysis, statistics, etc. Unfortunately, the variations on Bell polynomials are confusingly named.

Analogy with differential equations

There are Bell polynomials of one variable and Bell polynomials of several variables. The latter are sometimes called “partial” Bell polynomials. In differential equations, “ordinary” means univariate (ODEs) and “partial” means multivariate (PDEs). So if “partial” Bell polynomials are the multivariate form, you might assume that “ordinary” Bell polynomials are the univariate form. Unfortunately that’s not the case.

It seems that the “partial” in the context of Bell polynomials was taken from differential equations, but the term “ordinary” was not. Ordinary and partial are not opposites in the context of Bell polynomials. A Bell polynomial can be ordinary and partial!

Analogy with generating functions

“Ordinary” in this context is the opposite of “exponential,” not the opposite of “partial.” The analogy is with generating functions, not differential equations. The ordinary generating function of a sequence multiplies the nth term by xn and sums. The exponential generating function of a sequence multiplies the nth term by xn/n! and sums. For example, the ordinary generating function for the Fibonacci numbers is

F(x) = \sum_{n=0}^\infty F_n x^n = \frac{x}{1 - x - x^2}

while the exponential generating function is

f(x) = \sum_{n=0}^\infty F_n \frac{x^n}{n!} = \frac{\exp(r_+x) - \exp(r_-x)}{\sqrt{5}}


 r_{\pm} = \frac{1 \pm \sqrt{5}}{2}

The definitions of exponential and ordinary Bell polynomials are given below, but the difference between the two that I wish to point out is that the former divides the kth polynomial argument by k! while the latter does not. They also differ by a scaling factor. The exponential form of Bn,k has a factor of n! where the ordinary form has a factor of k!.

“Ordinary” as a technical term

Based on the colloquial meaning of “ordinary” you might assume that it is the default. And indeed that is the case with generating functions. Without further qualification, generating function means ordinary generating function. You’ll primarily see explicit references to ordinary generating functions in a context where it’s necessary to distinguish them from some other kind of generating function. Usually the word “ordinary” will be written in parentheses to emphasize that an otherwise implicit assumption is being made explicit. In short, “ordinary” and “customary” correspond in the context of generating functions.

But in the context of Bell polynomials, it seems that the exponential form is more customary. At least in some sources, an unqualified reference to Bell polynomials refers to the exponential variety. That’s the case in SymPy where the function bell() implements exponential Bell polynomials and there is no function to compute ordinary Bell polynomials.


Now for the actual definitions. We can make our definitions more concise if we first define multi-index notation. A multi-index

\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n)

is a set of n non-negative integers. The norm of a multi-index is defined by

|\alpha| = \alpha_1 + \alpha_2 + \cdots + \alpha_n

and the factorial of a multi-index is defined by

\alpha! = \prod_{k=1}^n \alpha_k!

If x = (x1, x2, …, xn) is an n-tuple of real numbers then x raised to the power of a multi-index α is defined by

x^\alpha = \prod_{k=1}^n x_k^{\alpha_k}

The ordinary Bell polynomial Bn,k is

B_{n,k}(x) = \sum \frac{k!}{\alpha!} x^\alpha

where x is a (nk +1)-tuple and α ranges over all multi-indices subject to two constraints: the norm of α is k and

\sum_{i=1}^{n-k+1} i \alpha_i = n

The exponential Bell polynomials can then be defined in terms of the ordinary bell polynomials by

B^\mathrm{exp}_{n,k}(x_1, x_2, \ldots, x_{n-k+1}) = \frac{n!}{k!} B^\mathrm{ord}_{n,k}\left( \frac{x_1}{1!}, \frac{x_2}{2!}, \cdots, \frac{x_{n-k+1}}{(n-k+1)!} \right )

Related posts

Ways to connect

If you visit this blog once in a while, here are a few ways to hear from me more regularly.


You can subscribe to the blog via RSS or email.

I often use SVG images because they look great on a variety of devices, but most email clients won’t display that format. If you subscribe by email, there’s always a link to open the article in a browser and see all the images.

I also have a monthly newsletter. It highlights the most popular posts of the month and usually includes a few words about what I’ve been up to.


I have 17 Twitter accounts that post daily on some topic and one personal account.

Twitter icons

The three most popular are CompSciFact, AlgebraFact, and ProbFact. These accounts are a little broader than the names imply. For example, if I run across something that doesn’t fall neatly into another mathematical category, I’ll post it to AlgebraFact. Miscellaneous applied math tends to end up on AnalysisFact.

You can find a list of all accounts and their descriptions here.

I don’t keep up with replies to my topical accounts, but I usually look at replies to my personal account. If you want to be sure I get your message, please call or send me email.

Occasionally people ask whether there’s a team of people behind my Twitter accounts. Nope. Just me. I delegate other parts of my business, but not Twitter. I schedule most of the technical account tweets in advance, but I write them. My personal account is mostly spontaneous.

Contact info

Here’s my contact info.

contact info

My phone number isn’t on there. It’s 832.422.8646. If you’d like, you can import my contact info as a vCard or use the QR code below.

Apple design, squircles, and curvature

A “squircle” is a sort of compromise between a square and circle, but one that differs from a square with rounded corners. It’s a shape you’ll see, for example, in some of Apple’s designs.


A straight line has zero curvature, and a circle with radius r has curvature 1/r. So in a rounded square the curvature jumps from 0 to 1/r where the flat side meets the circular corner. But in the figure above there’s no abrupt change in curvature but instead a smooth transition. More on that in just a bit.

To get a smoother transition from the corners to the flat sides, designers use what mathematicians would call Lp balls, curves satisfying

|x|^p + |y|^p = 1

in rectangular coordinates or

r = \frac{1}{\left(|\cos \theta|^p + |\sin \theta|^p \right)^{1/p}}in polar coordinates.

When p = 2 we get a circle. When p = 2.5, we get square version of the superellipse. As p increases the corners get sharper, pushing out toward the corner of a square. Some sources define the squircle to be the case p = 4 but others say p = 5. The image at the top of the post uses p = 4. [*]

To show how the curvature changes, let’s plot the curvature on top of the squircle. The inner curve is the squircle itself, and radial distance to the outer curve is proportional to the curvature.

Here’s the plot for p = 4.

curvature p = 4

And here’s the plot for p = 5.

curvature p = 5

If we were to make the same plot for a rounded square, the curvature would be zero over the flat sides and jump to some constant value over the circular caps. We can see above that the curvature is largest over the corners but continuously approaches zero toward the middle of the sides.

Related: Swedish Superellipse

[*] Use whatever value of p you like. The larger p is, the closer the figure becomes to a square. The closer p is to 2, the closer the figure is to a circle.

You can even use smaller values of p. The case p = 1 gives you a diamond. If p is less than 1, the sides bend inward. The plot below shows the case p = 0.5.

Lp ball p = 0.5

As p gets smaller, the sides pull in more. As p goes to zero, the figure looks more and more like a plus sign.

Liouville-Green approximation

Suppose you have a differential equation of the form

\frac{d^2y}{dx^2} = f(x)y

If the function f(x) is constant, the differential equation is easy to solve in closed form. But when it is not constant, it is very unlikely that closed form solutions exist. But there may be useful closed form approximate solutions.

There is no linear term in the equation above, but there is a standard change of variables to eliminate a linear term and put any second order linear equation with variable coefficients in the form above.

The Liouville-Green approximate solutions are linear combinations of the functions

f^{-1/4}(x)\exp\left(\pm\int f^{1/2}(x)\, dx \right )

The constant of integration doesn’t matter. It would only change the solution by a constant multiple, so any indefinite integral will do.

To try out the Liouville-Green approximation, let’s look at the equation

\frac{d^2y}{dx^2} = x^2y

Here f(x) = x² and so the LG approximation solutions are

A\, \frac{\exp\left(x^2/2 \right )}{ \sqrt{x} } + B\, \frac{\exp\left(-x^2/2 \right )}{ \sqrt{x} }

Let’s see how these solutions behave compare to a numerical solution of the equation. To make things more definite, we need initial conditions. Let’s say AB = 1, which corresponds to initial conditions y(1) = 2.25525 and y‘(1) = -0.0854354.

Here’s a plot of the L-G approximation and a numerical solution.

Plot of Liouville-Green approximation and numerical solution

Here’s the Mathematica code that was used to create the plot above.

s = NDSolve[
       y''[x] == x^2 y[x], 
       y[1] == 2.25525, 
       y'[1] == -0.0854354
    {x, 1, 3}
g[x_] = (Exp[-x^2/2] + Exp[x^2/2])/Sqrt[x]
    {Evaluate[y[x] /. s], g[x]}, 
    {x, 1, 3}, 
    PlotLabels -> {"Numerical", "LG"}

For more on Liouville-Green approximations, see Olver’s classic book Asymptotics and Special Functions.

Related postMechanical vibrations

Iterating between theory and code

Yesterday I said on Twitter “Time to see whether practice agrees with theory, moving from LaTeX to Python. Wish me luck.” I got a lot of responses to that because it describes the experience of a lot of people. Someone asked if I’d blog about this. The content is confidential, but I’ll talk about the process. It’s a common pattern.

I’m writing code based on a theoretical result in a journal article.

First the program gets absurd results. Theory pinpoints a bug in the code.

Then the code confirms my suspicions of an error in the paper.

The code also uncovers, not an error per se, but an important missing detail in the paper.

Then code and theory differ by about 1%. Uncertain whether this is theoretical approximation error or a subtle bug.

Then try the code for different arguments, ones which theory predicts will have less approximation error, and now code and theory differ by 0.1%.

Then I have more confidence in (my understanding of) the theory and in my code.

Eloquent mathematical writing

Sir Michael Atiyah recommends Hermann Weyl’s book The Classical Groups for its clarity and beautiful prose. From my interview with Atiyah:

Hermann Weyl is my great model. He used to write beautiful literature. Reading it was a joy because he put a lot of thought into it. Hermann Weyl wrote a book called The Classical Groups, very famous book, and at the same time a book appeared by Chevalley called Lie Groups. Chevalley’s book was compact, all the theorems are there, very precise, Bourbaki-style definitions. If you want to learn about it, it’s all there. Hermann Weyl’s book was the other way around. It is discursive, expansive, it’s a bigger book, it tells you much more than need to know. It’s the kind of book you read ten times. Chevalley’s book you read once. …

In the introduction he explains that he’s German, he’s writing in a foreign language, he apologizes saying he is writing in a language that was “not sung by the gods at my cradle.” You can’t get any more poetic than that. I’m a great admirer of his style, of his exposition and his mathematics. They go together.

Here’s the portion of the preface that Atiyah is quoting, where Weyl apologies eloquently for his lack of eloquence in writing English.

The gods have imposed upon my writing the yoke of a foreign tongue that was not sung at my cradle.

“Was dies heissen will, weiss jeder,
Der im Traum pferdlos geritten,”

I am tempted to say with Gottfried Keller. Nobody is more aware than myself of the attendant loss of vigor, ease and lucidity of expression.

John Cook interviewing Michael Atiyah at HLF 2013

Photographer Bernhard Kreutzer came by while I was interviewing Atiyah at the first Heidelberg Laureate Forum in 2013.

Focus on the most important terms

Consider the following Taylor series for sin(θ/7)

\sin\left(\frac{\theta}{7} \right ) = \frac{\theta}{7} - \frac{\theta^3}{2058} + \frac{\theta^5}{2016840} + {\cal O}(\theta^7)

and the following two functions based on the series, one takes only the first non-zero term

    def short_series(x):
        return 0.14285714*x

and a second that three non-zero terms.

    def long_series(x):
        return 0.1425714*x - 4.85908649e-04*x**3 + 4.9582515e-07*x**5

Which is more accurate? Let’s make a couple plots plot to see.

First here are the results on the linear scale.

Absolute error, linear scale

Note that the short series is far more accurate than the long series! The differences are more dramatic on the log scale. There you can see that you get more correct significant figures from the short series as the angle approaches zero.

Approximation error, log scale

What’s going on? Shouldn’t you get more accuracy from a longer Taylor series approximation?

Yes, but there’s an error in our code. The leading coefficient in long_series is wrong in the 4th decimal place. That small error in the most important term outweighs the benefit of adding more terms to the series. The simpler code, implemented correctly, is better than the more complicated code with a small error.

The moral of the story is to focus on the leading term. Until it’s right, the rest of the terms don’t matter.

Update: Based on some discussion after publishing this post, I think my comment about short_series being simpler misdirected attention from my main point. Here’s a variation on the same argument based on two equally complex functions.

    def typo1(x):
        return 0.1425714*x - 4.85908649e-04*x**3 + 4.9582515e-07*x**5

    def typo2(x):
        return 0.14285714*x - 5.85908649e-04*x**3 + 4.9582515e-07*x**5

Both functions have one typo, typo1 in the first non-zero coefficient and typo2 in the second non-zero coefficient. These two functions behave almost exactly like long_series and short_series respectively. A typo in the leading coefficient has a much bigger impact on accuracy than a typo in one of the other coefficients.

Related post: Life lessons from differential equations

Scaling and Monte Carlo integration methods

Here’s an apparent paradox. You’ll hear that Monte Carlo methods are independent of dimension, and that they scale poorly with dimension. How can both statements be true?

The most obvious way to compute multiple integrals is to use product methods, analogous to the way you learn to compute multiple integrals by hand. Unfortunately the amount of work necessary with this approach grows exponentially with the number of variables, and so the approach is impractical beyond modest dimensions.

The idea behind Monte Carlo integration, or integration by darts, is to estimate the area under a (high dimensional) surface by taking random points and counting how many fall below the surface. More details here. The error goes down like the reciprocal of the square root of the number of points. The error estimate has nothing to do with the dimension per se. In that sense Monte Carlo integration is indeed independent of dimension, and for that reason is often used for numerical integration in dimensions too high to use product methods.

Suppose you’re trying to estimate what portion of a (high dimensional) box some region takes up, and you know a priori that the proportion is in the ballpark of 1/2. You could refine this to a more accurate estimate with Monte Carlo, throwing darts at the box and seeing how many land in the region of interest.

But in practice, the regions we want to estimate are small relative to any box we’d put around them. For example, suppose you want to estimate the volume of a unit ball in n dimensions by throwing darts at the unit cube in n dimensions. When n = 10, only about 1 dart in 400 will land in the ball. When n = 100, only one dart out of 1070 will land inside the ball. (See more here.) It’s very likely you’d never have a dart land inside the ball, no matter how much computing power you used.

If no darts land inside the region of interest, then you would estimate the volume of the region of interest to be zero. Is this a problem? Yes and no. The volume is very small, and the absolute error in estimating a small number to be zero is small. But the relative is 100%.

(For an analogous problem, see this discussion about the approximation sin(θ) = θ for small angles. It’s not just saying that one small number is approximately equal to another small number: all small numbers are approximately equal in absolute error! The small angle approximation has small relative error.)

If you want a small relative error in Monte Carlo integration, and you usually do, then you need to come up with a procedure that puts a larger proportion of integration points in the region of interest. One such technique is importance sampling, so called because it puts more samples in the important region. The closer the importance sampler fits the region of interest, the better the results will be. But you may not know enough a priori to create an efficient importance sampler.

So while the absolute accuracy of Monte Carlo integration does not depend on dimension, the problems you want to solve with Monte Carlo methods typically get harder with dimension.

Integrating polynomials over a sphere or ball

Spheres and balls are examples of common words that take on a technical meaning in math, as I wrote about here. Recall the the unit sphere in n dimensions is the set of points with distance 1 from the origin. The unit ball is the set of points of distance less than or equal to 1 from the origin. The sphere is the surface of the ball.

Integrating a polynomial in several variables over a ball or sphere is easy. For example, take the polynomial xy² + 5x²z² in three variables. The integral of the first term, xy², is zero. If any variable in a term has an odd exponent, then the integral of that term is zero by symmetry. The integral over half of the sphere (ball) will cancel out the integral over the opposite half of the sphere (ball). So we only need to be concerned with terms like 5x²z².

Now in n dimensions, suppose the exponents of x1, x2, …, xn are a1, a2, …, an respectively. If any of the a‘s are odd, the integral over the sphere or ball will be zero, so we assume all the a‘s are even. In that case the integral over the unit sphere is simply

2 B(b_1, b_2, \ldots, b_n)


B(b_1, b_2, \ldots, b_n) = \frac{\Gamma(b_1) \Gamma(b_2) \cdots \Gamma(b_n)}{ \Gamma(b_1 + b_2 + \cdots + b_n)}

is the multivariate beta function and for each i we define bi = (ai + 1)/2. When n = 2 then B is the (ordinary) beta function.

Note that the integral over the unit sphere doesn’t depend on the dimension of the sphere.

The integral over the unit ball is

\frac{2 B(b_1, b_2, \ldots, b_n)}{ a_1 + a_2 + \cdots + a_n + n}

which is proportional to the integral over the sphere, where the proportionality constant depends on the sum of the exponents (the original exponents, the a‘s, not the b‘s) and the dimension n.

Note that if we integrate the constant polynomial 1 over the unit sphere, we get the surface area of the unit sphere, and if we integrate it over the unit ball, we get the volume of the unit ball.

You can find a derivation for the integral results above in [1]. The proof is basically Liouville’s trick for integrating the normal distribution density, but backward. Instead of going from rectangular to polar coordinates, you introduce a normal density and go from polar to rectangular coordinates.

[1] Gerald B. Folland, How to Integrate a Polynomial over a Sphere. The American Mathematical Monthly, Vol. 108, No. 5 (May, 2001), pp. 446-448.

Bounding the 3rd moment by the 4th moment

For a random variable X, the kth moment of X is the expected value of Xk.

For any random variable X with 0 mean, or negative mean, there’s an inequality that bounds the 3rd moment, m3 in terms of the 4th moment, m4:

m_3 \leq \left(\frac{4}{27} \right )^{1/4} m_4^{3/4}

The following example shows that this bound is the best possible.


u = (1 – √ 3)/√ 2
v = (1 + √ 3)/√ 2
p = (3 + √ 3) / 6

and suppose Xu with probability p and Xv with probability q = 1 – p. Then X has mean 0, and you can show that exact equality holds in the inequality above.

Here’s some Mathematica code to verify this claim.

    u = (1 - Sqrt[3])/Sqrt[2]
    v = (1 + Sqrt[3])/Sqrt[2]
    p = (Sqrt[3] + 1)/(2 Sqrt[3])
    q = 1 - p
    m3 = p u^3 + q v^3
    m4 = p u^4 + q v^4
    Simplify[ (4/27)^(1/4) m4^(3/4) / m3 ]

As hoped, the code returns 1.

Reference: Iosif Pinelis, Relations Between the First Four Moments, American Mathematical Monthly, Vol 122, No 5, pp 479-481.

Putting a brace under something in LaTeX

Here’s a useful LaTeX command that I learned about recently: \underbrace.

It does what it sounds like it does. It puts a brace under its argument.

I used this a few days ago in the post on the new prime record when I wanted to show that the record prime is written in hexadecimal as a 1 followed by a long string of Fs.

1\underbrace{\mbox{FFF \ldots FFF}}_\mbox{{\normalsize 9,308.229 F's}}

The code that produced is is

1\underbrace{\mbox{FFF \ldots FFF}}_\mbox{{\normalsize 9,308.229 F's}}

The sizing is a little confusing. Without \normalsize the text under the brace would be as large as the text above.