Here’s a close-up of the graph from a photo of my copy of A&S.
This was my reply.
It looks like a complex function of a complex variable. I assume the height is the magnitude and the markings on the graph are the phase. That would make it an elliptic function because it’s periodic in two directions.
It has one pole and one zero in each period. I think elliptic functions are determined, up to a constant, by their periods, zeros, and poles, so it should be possible to identify the function.
In fact, I expect it’s the Weierstrass p function. More properly, the Weierstrass ℘ function, sometimes called Weierstass’ elliptic function. (Some readers will have a font installed that will properly render ℘ and some not. More on the symbol ℘ here.)
The other day I ran across a blog post by Brian Hayes that linked to an article by David Cantrell on how to compute the inverse of the gamma function. Cantrell gives an approximation in terms of the Lambert W function.
In this post we’ll write a little Python code to kick the tires on Cantrell’s approximation. The post also illustrates how to do some common tasks using SciPy and matplotlib.
Here are the imports we’ll need.
import matplotlib.pyplot as plt
from scipy import pi, e, sqrt, log, linspace
from scipy.special import lambertw, gamma, psi
from scipy.optimize import root
First of all, the gamma function has a local minimum k somewhere between 1 and 2, and so it only makes sense to speak of its inverse to the left or right of this point. Gamma is strictly increasing for real values larger than k.
To find k we look for where the derivative of gamma is zero. It’s more common to work with the derivative of the logarithm of the gamma function than the derivative of the gamma function itself. That works just as well because gamma has a minimum where its log has a minimum. The derivative of the log of the gamma function is called ψ and is implemented in SciPy as scipy.special.psi. We use the function scipy.optimize.root to find where ψ is zero.
The root function returns more information than just the root we’re after. The root(s) are returned in the arrayx, and in our case there’s only one root, so we take the first element of the array:
How well goes this algorithm work? For starters, we’ll see how well it does when we do a round trip, following the exact gamma with the approximate inverse.
x = linspace(5, 30, 100)
This produces the following plot:
We get a straight line, as we should, so next we do a more demanding test. We’ll look at the absolute error in the approximate inverse. We’ll use a log scale on the x-axis since gamma values get large quickly.
y = gamma(x)
plt.plot(y, x- AIG(y))
This shows the approximation error is small, and gets smaller as its argument increases.
Cantrell’s algorithm is based on an asymptotic approximation, so it’s not surprising that it improves for large arguments.
The Mittag-Leffler function is a generalization of the exponential function. Since k!= Γ(k + 1), we can write the exponential function’s power series as
and we can generalize this to the Mittag=Leffler function
which reduces to the exponential function when α = β = 1. There are a few other values of α and β for which the Mittag-Leffler function reduces to more familiar functions. For example,
where erfc(x) is the complementary error function.
Mittag-Leffler was one person, not two. When I first saw the Mittag-Leffler theorem in complex analysis, I assumed it was named after two people, Mittag and Leffler. But the theorem and the function discussed here are named after one man, the Swedish mathematician Magnus Gustaf (Gösta) Mittag-Leffler (1846–1927).
The function that Mr. Mittag-Leffler originally introduced did not have a β parameter; that generalization came later. The function Eα is Eα, 1.
Mittag-Leffler probability distributions
Just as you can make a couple probability distributions out of the exponential function, you can make a couple probability distributions out of the Mittag-Leffler function.
Continuous Mittag-Leffler distribution
The exponential function exp(-x) is positive over [0, ∞) and integrates to 1, so we can define a probability distribution whose density (PDF) function is f(x) = exp(-x) and whose distribution function (CDF) is F(x) = 1 – exp(-x). The Mittag-Leffler distribution has CDF is 1 – Eα(-xα) and so reduces to the exponential distribution when α = 1. For 0 < α < 1, the Mittag-Leffler distribution is a heavy-tailed generalization of the exponential. 
Discrete Mittag-Leffler distribution
The Poisson distribution comes from taking the power series for exp(λ), normalizing it to 1, and using the kth term as the probability mass for k. That is,
The analogous discrete Mittag-Leffler distribution  has probability mass function
Fractional differential equations
In addition to probability and statistics, the the Mittag-Leffler function comes up in fractional calculus. It plays a role analogous to that of the exponential distribution in classical calculus. Just as the solution to the simply differential equation
is exp(ax), for 0 < μ < 1, the soltuion to the fractional differential equation
is axμ-1Eμ, μ(axμ). Note that this reduces to exp(ax) when μ = 1. 
 Gwo Dong Lin. Journal of Statistical Planning and Inference 74 (1998) 1–9, On the Mittag–Leffler distributions
 Subrata Chakraborty, S. H. Ong. Mittag-Leffler function distribution: A new generalization of hyper-Poisson distribution. arXiv:1411.0980v1
 Keith Oldham, Jan Myland, Jerome Spanier. An Atlas of Functions. Springer.
where C and S are the Fresnel functions, sometimes called the Fresnel cosine integral and Fresnel sine integral. Here’s a plot of the spiral.
Both Fresnel functions approach ½ as t → ∞ and so the curve slowly spirals toward (½, ½) in the first quadrant. And by symmetry, because both functions are odd, the curve spirals toward (-½, -½) in the third quadrant.
Here’s the Python code used to make the plot.
from scipy.special import fresnel
from scipy import linspace
import matplotlib.pyplot as plt
t = linspace(-7, 7, 1000)
y, x = fresnel(t)
The SciPy function fresnel returns both Fresnel functions at the same time. It returns them in the order (S, C) so the code reverses the order of these to match the Cornu curve.
One interesting feature of Cornu’s spiral is that its curvature increases linearly with time. This is easy to verify: because of the fundamental theorem of calculus, the Fresnel functions reduce to sines and cosines when you take derivatives, and you can show that the curvature at time t equals πt.
How fast does the curve spiral toward (½, ½)? Since the curvature at time t is πt, that says that at time t the curve is instantaneously bending like a circle of radius 1/πt. So the radius of the spiral is decreasing like 1/πt.
Cornu’s spiral was actually discovered by Euler. Cornu was an engineer who independently discovered the curve much later. Perhaps because Cornu used the curve in applications, his name is more commonly associated with the curve. At least I’ve more often seen it named after Cornu. This is an example of Stigler’s law that things are usually not named after the first person to discover them.
In an earlier post we proved that if you modulate a cosine carrier by a sine signal you get a signal whose sideband amplitudes are given by Bessel functions. Specifically:
When β = 0, we have the unmodulated carrier, cos(2π fct), on both sides. When β is positive but small, J0(β) is near 1, and so the frequency component corresponding to the carrier is only slightly diminished. Also, the sideband amplitudes, the values of Jn(β) for n ≠ 0, are small and decay rapidly as |n| increases. As β increases, the amplitude of the carrier component decreases, the sideband amplitudes increase, and the sidebands decay more slowly.
We can be much more precise: the energy in the modulated signal is the same as the energy in the unmodulated signal. As β increases, more enery transfers to the sidebands, but the total energy stays the same. This conservation of energy result applies to more complex signals than just pure sine waves, but it’s easier to demonstrate in the case of a simple signal.
To prove the energy stays constant, we show that the sum of the squares of the coefficients of the cosine components is the same for the modulated and unmodulated signal.The unmodulated signal is just cos(2π fct), and so the only coefficient is 1. That means we have to prove
This is a well-known result. For example, it is equation 9.1.76 in Abramowitz and Stegun. We’ll show how to prove it from first principles. We’ll actually prove a more general result, the Newmann-Schläffi addition formula, then show our result follows easily from that.
This means that when you expand the expression on the left as a power series in t, whatever is multiplied by tn is Jn(z) by definition. (There are other ways of defining the Bessel functions, but this way leads quickly to what we want to prove.)
We begin by factoring the Bessel generating function applied to z + w.
Next we expand both sides as power series.
and look at the terms involving tn on both sides. On the left this is Jn(z + w). On the right, we multiply two power series. We will get a term containing tn whenever we multiply terms tj and tk where j and k sum to n.
The equation above is the Newmann-Schläffi addition formula.
Sum of squared coefficients
To prove that the sum of the squared sideband coefficients is 1, we apply the addition formula with n = 0, z = β, and w = -β.
This proves what we were after:
We used a couple facts in the last step that we haven’t discussed. The first was that J0(0) = 1. This follows from the generating function by setting z to 0 and taking the limit as t → 0. The second was that J–m(-β) = Jm(β). You can also see this from the generating function since negating z has the same effect as swapping t and 1/t.
Frequency modulation combines a signal with a carrier wave by changing (modulating) the carrier wave’s frequency.
Starting with a cosine carrier wave with frequency fc Hz and adding a signal with amplitude β and frequency fm Hz results in the combination
The factor β is known as the modulation index.
We’d like to understand this signal in terms of cosines without any frequency modulation. It turns out the result is a set of cosines weighted by Bessel functions of β.
We will prove the equation above, but first we’ll discuss what it means for the amplitudes of the cosine components.
For small values of β, Bessel functions decay quickly, which means the first cosine component will be dominant. For larger values of β, the Bessel function values increase to a maximum then decay like one over the square root of the index. To see this we compare the coefficients for modulation index β = 0.5 and β = 5.0.
First, β = 0.5:
and now for β = 5.0:
For fixed β and large n we have
and so the sideband amplitudes eventually decay very quickly.
Update: See this post for what the equation above says about energy moving from the carrier to sidebands.
To prove the equation above, we need three basic trig identities
and a three Bessel function identities
The Bessel function identities above can be found in Abramowitz and Stegun as equations 9.1.42, 9.1.43, and 9.1.5.
And now the proof. We start with
and apply the sum identity for cosines to get
Now let’s take the first term
and apply one of our Bessel identities to expand it to
which can be simplified to
where the sum runs over all even integers, positive and negative.
Now we do the same with the second half of the cosine sum. We expand
which simplifies to
where again the sum is over all (odd this time) integers. Combining the two halves gives our result
What’s the connection between the hypergeometric distributions, hypergeometric functions, and hypergeometric series?
The hypergeometric distribution is a probability distribution with parameters N, M, and n. Suppose you have an urn containing N balls, M red and the rest, N – M blue and you select n balls at a time. The hypergeometric distribution gives the probability of selecting k red balls.
The probability generating function for a discrete distribution is the series formed by summing over the probability of an outcome k and xk. So the probability generating function for a hypergeometric distribution is given by
The summation is over all integers, but the terms are only non-zero for k between 0 and M inclusive. (This may be more general than the definition of binomial coefficients you’ve seen before. If so, see these notes on the general definition of binomial coefficients.)
It turns out that f is a hypergeometric function of x because it is can be written as a hypergeometric series. (Strictly speaking, f is a constant multiple of a hypergeometric function. More on that in a moment.)
A hypergeometric function is defined by a pattern in its power series coefficients. The hypergeometric function F(a, b; c; x) has a the power series
where (n)k is the kth rising power of n. It’s a sort of opposite of factorial. Start with n and multiply consecutive increasing integers for k terms. (n)0 is an empty product, so it is 1. (n)1 = n, (n)2 = n(n+1), etc.
If the ratio of the k+1st term to the kth term in a power series is a polynomial in k, then the series is a (multiple of) a hypergeometric series, and you can read the parameters of the hypergeometric series off the polynomial. This ratio for our probability generating function works out to be
and so the corresponding hypergeometric function is F(-M, –n; N – M – n + 1; x). The constant term of a hypergeometric function is always 1, so evaluating our probability generating function at 0 tells us what the constant is multiplying F(-M, –n; N – M – n + 1; x). Now
The hypergeometric series above gives the original hypergeometric function as defined by Gauss, and may be the most common form in application. But the definition has been extended to have any number of rising powers in the numerator and denominator of the coefficients. The classical hypergeometric function of Gauss is denoted 2F1 because it has two falling powers on top and one on bottom. In general, the hypergeometric function pFq has p rising powers in the denominator and q rising powers in the denominator.
The CDF of a hypergeometric distribution turns out to be a more general hypergeometric function:
where a = 1, b = k+1-M, c = k+1-n, d = k+2, and e = N+k+2-M–n.
The functions dilogarithm, trilogarithm, and more generally polylogarithm are meant to be generalizations of the logarithm. I first came across the dilogarithm in college when I was evaluating some integral with Mathematica, and they’ve paid a visit occasionally ever since.
Unfortunately polylogarithms are defined in several slightly different and incompatible ways. I’ll start by following An Atlas of Functions and then mention differences in A&S, SciPy, and Mathematica.
According to Atlas,
Polylogarithms are themselves special cases of Lerch’s function. Also known as Jonquière’s functions (Ernest Jean Philippe Fauque de Jonquières, 1820–1901, French naval officer and mathematician), they appear in the Feynman diagrams of particle physics.
The idea is to introduce an extra parameter ν in the power series for natural log:
When ν = 1 we get the ordinary logarithm, i.e. polyln1 = log. Then polyln2 is the dilogarithm diln, and polyln3 is the trilogarithm triln.
One advantage of the definition in Atlas is that the logarithm is a special case of the polylogarithm. Other conventions don’t have this property.
The venerable A&S defines dilogarithms in a way that’s equivalent to the negative of the definition above and does not define polylogarithms of any other order. SciPy’s special function library follows A&S. SciPy uses the name spence for the dilogarithm for reasons we’ll get to shortly.
Mathematica has the function PolyLog[ν, x] that evaluates to
So polylnν above corresponds to -PolyLog[ν, -x] in Mathematica. Matlab’s polylog is the same as Mathematica’s PolyLog.
Relation to other functions
Spence’s integral is the function of x given by the integral
and equals diln(x). Note that the SciPy function spence returns the negative of the integral above.
The Lerch function mentioned above is named for Mathias Lerch (1860–1922) and is defined by the integral
The connection with polylogarithms is easier to see from the series expansion:
The connection with polylogarithms is then
Note that the Lerch function also generalizes the Hurwitz zeta function, which in turn generalizes the Riemann zeta function. When x = 1, the Lerch function reduces to ζ(ν, u).
Suppose you have software to compute one hypergeometric function. How many more hypergeometric functions can you compute from it?
Hypergeometric functions satisfy a lot of identities, so you can bootstrap one such function into many more. That’s one reason they’re so useful in applications. For this post, I want to focus on just three formulas, the so-called linear transformation formulas that relate one hypergeometric function to another. (There are more linear formulas that relate one hypergeometric function to a combination of two others. I may consider those in a future post, but not today.)
The classical hypergeometric function has three parameters. The linear transformations discussed above relate the function with parameters (a, b, c) to those with parameters
(c – a, c – b, c)
(a, c – b, c)
(b, c – a, c)
Here are the linear transformations in detail:
How many more?
The three transformations above are the ones listed in Handbook of Mathematical Functions by Abramowitz and Stegun. How many more transformations can we create by combining them? To answer this, we represent each of the transformations with a matrix. The transformations correspond to multiplying the matrix by the column vector (a, b, c).
The question of how many transformations we can come up with can be recast as asking what is the order of the group generated by A, B, and C. (I’m jumping ahead a little by presuming these matrices generate a group. Conceivably the don’t have inverses, but in fact they do. The inverses of A, B, and C are A, B, and AC respectively.)
A little experimentation reveals that there are eight transformations: I, A, B, C, D = AB, E = AC, F = BC, and G = CB. So in addition to the identity I, the do-nothing transformation, we have found four more: D, E, F, and G. These last four correspond to taking (a, b, c) to
(c – a, b, c)
(a, c – b, c)
(b, a, c)
(c – b, c – a, c)
This means that if we have software to compute the hypergeometric function with one fixed set of parameters, we can bootstrap that into computing the hypergeometric function with up to seven more sets of parameters. (Some of the seven new combinations could be repetitive if, for example, a = b.)
Identifying the group
We have uncovered a group of order 8, and there are only 5 groups that size. We should be able to find a familiar group isomorphic to our group of transformations.
The five groups of order 8 are Z8 (cyclic group), Z4×Z2 (direct product), E8 (elementary Abelian group), D8 (dihedral group), and the quaternion group. The first three are Abelian, and our group is not, so our group must be isomorphic to either the quaternion group or the dihedral group. The quaternion group has only 1 element of order 2, and the dihedral group has 5 elements of order 2. Our group has five elements of order 2 (A, B, D, F, G) and so it must be isomorphic to the dihedral group of order 8, the rotations and reflections of a square. (Some authors call this D8, because the group has eight elements. Others call it D4, because it is the group based on a 4-sided figure.)
When I interviewed Daniel Spielman at this year’s Heidelberg Laureate Forum, we began our conversation by looking for common mathematical ground. The first thing that came up was orthogonal polynomials. (If you’re wondering what it means for two polynomials to be orthogonal, see here.)
JC: Orthogonal polynomials are kind of a lost art, a topic that was common knowledge among mathematicians maybe 50 or 100 years ago and now they’re obscure.
DS: The first course I taught I spent a few lectures on orthogonal polynomials because they kept coming up as the solutions to problems in different areas that I cared about. Chebyshev polynomials come up in understanding solving systems of linear equations, such as if you want to understand how the conjugate gradient method behaves. The analysis of error correcting codes and sphere packing has a lot of orthogonal polynomials in it. They came up in a course in multi-linear algebra I had in grad school. And they come up in matching polynomials of graphs, which is something people don’t study much anymore. … They’re coming back. They come up a lot in random matrix theory. … There are certain things that come up again and again and again so you got to know what they are.
The Airy functions Ai(x) and Bi(x) are independent solutions to the differential equation
For negative x they act something like sin(x) and cos(x). For positive x they act something like exp(x) and exp(-x). This isn’t surprising if you look at the differential equation. If you replace x with a negative constant, you get sines and cosines, and if you replace it with a positive constant, you get positive and negative exponentials.
The Airy functions can be related to Bessel functions as follows:
Here J is a “Bessel function of the first kind” and I is a “modified Bessel function of the first kind.” Also
To verify the equations above, and to show how to compute these functions in Python, here’s some code.
The SciPy function airy computes both functions, and their first derivatives, at once. I assume that’s because it doesn’t take much longer to compute all four functions than to compute one. The code for Ai2 and Bi2 below uses np.where instead of if … else so that it can operate on NumPy vectors all at once. You can plot Ai and Ai2 and see that the two curves lie on top of each other. The same holds for Bi and Bi2.
from scipy.special import airy, jv, iv
from numpy import sqrt, where
(ai, ai_prime, bi, bi_prime) = airy(x)
(ai, ai_prime, bi, bi_prime) = airy(x)
third = 1.0/3.0
hatx = 2*third*(abs(x))**1.5
return where(x > 0,
third*sqrt( x)*(iv(-third, hatx) - iv(third, hatx)),
third*sqrt(-x)*(jv(-third, hatx) + jv(third, hatx)))
third = 1.0/3.0
hatx = 2*third*(abs(x))**1.5
return where(x > 0,
sqrt( x/3.0)*(iv(-third, hatx) + iv(third, hatx)),
sqrt(-x/3.0)*(jv(-third, hatx) - jv(third, hatx)))
There is a problem with Ai2 and Bi2: they return nan at 0. A more careful implementation would avoid this problem, but that’s not necessary since these functions are only for illustration. In practice, you’d simply use airy and it does the right thing at 0.
The famous Fibonacci numbers are defined by the initial conditions
F0 = 0, F1 = 1
and the recurrence relation
Fn = Fn-1 + Fn-2
for n > 1.
The Fibonacci polynomials are defined similarly. The have the same initial conditions
F0(x) = 0, F1(x) = 1
but a slightly different recurrence relation
Fn(x) = xFn-1(x) + Fn-2(x)
for n > 1.
Several families of orthogonal polynomials satisfy a similar recurrence relationship
Fn(x) = p(x) Fn-1(x) + c(n) Fn-2(x).
The table below gives the values of p(x) and c(n) for a few families of polynomials.
2 – 2n
There are two common definitions of the Hermite polynomials, sometimes called the physicist convention and the probabilist convention. The table above gives the physicist convention. For the probabilist definition, change the 2’s to 1’s in the last row and leave the 1 alone.
(If you haven’t heard of orthogonal polynomials, you might be wondering “orthogonal to what?”. These polynomials are orthogonal in the sense of an inner product formed by multiplying two polynomials and a weight, then integrating. Orthogonal polynomials differ by the the weight and the limits of integration. For example, the (physicist) Hermite polynomials are orthogonal with weight function exp(-x2) and integrating over the entire real line. The probabilist Hermite polynomials are very similar, but use the standard normal distribution density exp(-x2/2)/√(2π) as the weight instead of exp(-x2).)