# Mittag-Leffler function and probability distribution

The Mittag-Leffler function is a generalization of the exponential function. Since k!= Γ(k + 1), we can write the exponential function’s power series as

and we can generalize this to the Mittag=Leffler function

which reduces to the exponential function when α = β = 1. There are a few other values of α and β for which the Mittag-Leffler function reduces to more familiar functions. For example,

and

where erfc(x) is the complementary error function.

## History

Mittag-Leffler was one person, not two. When I first saw the Mittag-Leffler theorem in complex analysis, I assumed it was named after two people, Mittag and Leffler. But the theorem and the function discussed here are named after one man, the Swedish mathematician Magnus Gustaf (Gösta) Mittag-Leffler (1846–1927).

The function that Mr. Mittag-Leffler originally introduced did not have a β parameter; that generalization came later. The function Eα is Eα, 1.

## Mittag-Leffler probability distributions

Just as you can make a couple probability distributions out of the exponential function, you can make a couple probability distributions out of the Mittag-Leffler function.

### Continuous Mittag-Leffler distribution

The exponential function exp(-x) is positive over [0, ∞) and integrates to 1, so we can define a probability distribution whose density (PDF) function is f(x) = exp(-x) and whose distribution function (CDF) is F(x) = 1 – exp(-x). The Mittag-Leffler distribution has CDF is 1 – Eα(-xα) and so reduces to the exponential distribution when α = 1. For 0 < α < 1, the Mittag-Leffler distribution is a heavy-tailed generalization of the exponential. [1]

### Discrete Mittag-Leffler distribution

The Poisson distribution comes from taking the power series for exp(λ), normalizing it to 1, and using the kth term as the probability mass for k. That is,

The analogous discrete Mittag-Leffler distribution [2] has probability mass function

## Fractional differential equations

In addition to probability and statistics, the the Mittag-Leffler function comes up in fractional calculus. It plays a role analogous to that of the exponential distribution in classical calculus. Just as the solution to the simply differential equation

is exp(ax), for 0 < μ < 1, the soltuion to the fractional differential equation

is axμ-1 Eμ, μ(a xμ). Note that this reduces to exp(ax) when μ = 1. [3]

## References

[1] Gwo Dong Lin. Journal of Statistical Planning and Inference 74 (1998) 1–9, On the Mittag–Leffler distributions

[2] Subrata Chakraborty, S. H. Ong. Mittag-Leffler function distribution: A new generalization of hyper-Poisson distribution. arXiv:1411.0980v1

[3] Keith Oldham, Jan Myland, Jerome Spanier. An Atlas of Functions. Springer.

# Cornu’s spiral

Cornu’s spiral is the curve parameterized by

where C and S are the Fresnel functions, sometimes called the Fresnel cosine integral and Fresnel sine integral. Here’s a plot of the spiral.

Both Fresnel functions approach ½ as t → ∞ and so the curve slowly spirals toward (½, ½) in the first quadrant. And by symmetry, because both functions are odd, the curve spirals toward (-½, -½) in the third quadrant.

Here’s the Python code used to make the plot.

    from scipy.special import fresnel
from scipy import linspace
import matplotlib.pyplot as plt

t = linspace(-7, 7, 1000)
y, x = fresnel(t)

plt.plot(x, y)
plt.axes().set_aspect("equal")
plt.show()


The SciPy function fresnel returns both Fresnel functions at the same time. It returns them in the order (S, C) so the code reverses the order of these to match the Cornu curve.

One interesting feature of Cornu’s spiral is that its curvature increases linearly with time. This is easy to verify: because of the fundamental theorem of calculus, the Fresnel functions reduce to sines and cosines when you take derivatives, and you can show that the curvature at time t equals πt.

How fast does the curve spiral toward (½, ½)? Since the curvature at time t is πt, that says that at time t the curve is instantaneously bending like a circle of radius 1/πt. So the radius of the spiral is decreasing like 1/πt.

Cornu’s spiral was actually discovered by Euler. Cornu was an engineer who independently discovered the curve much later. Perhaps because Cornu used the curve in applications, his name is more commonly associated with the curve. At least I’ve more often seen it named after Cornu. This is an example of Stigler’s law that things are usually not named after the first person to discover them.

* * *

# Energy in frequency modulated signals

In an earlier post we proved that if you modulate a cosine carrier by a sine signal you get a signal whose sideband amplitudes are given by Bessel functions. Specifically:

When β = 0, we have the unmodulated carrier, cos(2π fct), on both sides. When β is positive but small, J0(β) is near 1, and so the frequency component corresponding to the carrier is only slightly diminished. Also, the sideband amplitudes, the values of Jn(β) for n ≠ 0, are small and decay rapidly as |n| increases. As β increases, the amplitude of the carrier component decreases, the sideband amplitudes increase, and the sidebands decay more slowly.

We can be much more precise: the energy in the modulated signal is the same as the energy in the unmodulated signal. As β increases, more enery transfers to the sidebands, but the total energy stays the same. This conservation of energy result applies to more complex signals than just pure sine waves, but it’s easier to demonstrate in the case of a simple signal.

## Proof

To prove the energy stays constant, we show that the sum of the squares of the coefficients of the cosine components is the same for the modulated and unmodulated signal.The unmodulated signal is just cos(2π fct), and so the only coefficient is 1. That means we have to prove

This is a well-known result. For example, it is equation 9.1.76 in Abramowitz and Stegun. We’ll show how to prove it from first principles. We’ll actually prove a more general result, the Newmann-Schläffi addition formula, then show our result follows easily from that.

Whittaker and Watson define the Bessel functions by their generating function:

This means that when you expand the expression on the left as a power series in t, whatever is multiplied by tn is Jn(z) by definition. (There are other ways of defining the Bessel functions, but this way leads quickly to what we want to prove.)

We begin by factoring the Bessel generating function applied to zw.

Next we expand both sides as power series.

and look at the terms involving tn on both sides. On the left this is Jn(zw). On the right, we multiply two power series. We will get a term containing tn whenever we multiply terms tj and tk where j and k sum to n.

The equation above is the Newmann-Schläffi addition formula.

### Sum of squared coefficients

To prove that the sum of the squared sideband coefficients is 1,  we apply the addition formula with n = 0, z = β, and w = -β.

This proves what we were after:

We used a couple facts in the last step that we haven’t discussed. The first was that J0(0) = 1. This follows from the generating function by setting z to 0 and taking the limit as t → 0. The second was that Jm(-β) = Jm(β). You can also see this from the generating function since negating z has the same effect as swapping t and 1/t.

# Analyzing an FM signal

Frequency modulation combines a signal with a carrier wave by changing (modulating) the carrier wave’s frequency.

Starting with a cosine carrier wave with frequency fc Hz and adding a signal with amplitude β and frequency fm Hz results in the combination

The factor β is known as the modulation index.

We’d like to understand this signal in terms of cosines without any frequency modulation. It turns out the result is a set of cosines weighted by Bessel functions of β.

## Component amplitudes

We will prove the equation above, but first we’ll discuss what it means for the amplitudes of the cosine components.

For small values of β, Bessel functions decay quickly, which means the first cosine component will be dominant. For larger values of β, the Bessel function values increase to a maximum then decay like one over the square root of the index. To see this we compare the coefficients for modulation index β = 0.5 and β = 5.0.

First, β = 0.5:

and now for β = 5.0:

For fixed β and large n we have

and so the sideband amplitudes eventually decay very quickly.

Update: See this post for what the equation above says about energy moving from the carrier to sidebands.

## Proof

To prove the equation above, we need three basic trig identities

and a three Bessel function identities

The Bessel function identities above can be found in Abramowitz and Stegun as equations 9.1.42, 9.1.43, and 9.1.5.

and apply the sum identity for cosines to get

Now let’s take the first term

and apply one of our Bessel identities to expand it to

which can be simplified to

where the sum runs over all even integers, positive and negative.

Now we do the same with the second half of the cosine sum. We expand

to

which simplifies to

where again the sum is over all (odd this time) integers. Combining the two halves gives our result

Related post: Visualizing Bessel functions

# Connection between hypergeometric distribution and series

What’s the connection between the hypergeometric distributions, hypergeometric functions, and hypergeometric series?

The hypergeometric distribution is a probability distribution with parameters NM, and n. Suppose you have an urn containing N balls, M red and the rest, N – M blue and you select n balls at a time. The hypergeometric distribution gives the probability of selecting k red balls.

The probability generating function for a discrete distribution is the series formed by summing over the probability of an outcome k and xk. So the probability generating function for a hypergeometric distribution is given by

The summation is over all integers, but the terms are only non-zero for k between 0 and M inclusive. (This may be more general than the definition of binomial coefficients you’ve seen before. If so, see these notes on the general definition of binomial coefficients.)

It turns out that f is a hypergeometric function of x because it is can be written as a hypergeometric series. (Strictly speaking,  f is a constant multiple of a hypergeometric function. More on that in a moment.)

A hypergeometric function is defined by a pattern in its power series coefficients. The hypergeometric function F(a, bcx) has a the power series

where (n)k is the kth rising power of n. It’s a sort of opposite of factorial. Start with n and multiply consecutive increasing integers for k terms. (n)0 is an empty product, so it is 1. (n)1 = n, (n)2 = n(n+1), etc.

If the ratio of the k+1st term to the kth term in a power series is a polynomial in k, then the series is a (multiple of) a hypergeometric series, and you can read the parameters of the hypergeometric series off the polynomial. This ratio for our probability generating function works out to be

and so the corresponding hypergeometric function is F(-M, –nN – M – n + 1; x). The constant term of a hypergeometric function is always 1, so evaluating our probability generating function at 0 tells us what the constant is multiplying F(-M, –nN – M – n + 1; x). Now

and so

The hypergeometric series above gives the original hypergeometric function as defined by Gauss, and may be the most common form in application. But the definition has been extended to have any number of rising powers in the numerator and denominator of the coefficients. The classical hypergeometric function of Gauss is denoted 2F1 because it has two falling powers on top and one on bottom. In general, the hypergeometric function pFq has p rising powers in the denominator and q rising powers in the denominator.

The CDF of a hypergeometric distribution turns out to be a more general hypergeometric function:

where a = 1, bk+1-M, ck+1-n, d = k+2, and eN+k+2-Mn.

Thanks to Jan Galkowski for suggesting this topic via a comment on an earlier post, Hypergeometric bootstrapping.

* * *

# Dilogarithm, polylogarithm, and related functions

The functions dilogarithm, trilogarithm, and more generally polylogarithm are meant to be generalizations of the logarithm. I first came across the dilogarithm in college when I was evaluating some integral with Mathematica, and they’ve paid a visit occasionally ever since.

Unfortunately polylogarithms are defined in several slightly different and incompatible ways. I’ll start by following An Atlas of Functions and then mention differences in A&S, SciPy, and Mathematica.

## Polylogarithms

According to Atlas,

Polylogarithms are themselves special cases of Lerch’s function. Also known as Jonquière’s functions (Ernest Jean Philippe Fauque de Jonquières, 1820–1901, French naval officer and mathematician), they appear in the Feynman diagrams of particle physics.

The idea is to introduce an extra parameter ν in the power series for natural log:

When ν = 1 we get the ordinary logarithm, i.e. polyln1 = log. Then polyln2 is the dilogarithm diln, and polyln3 is the trilogarithm triln.

One advantage of the definition in Atlas is that the logarithm is a special case of the polylogarithm. Other conventions don’t have this property.

## Other conventions

The venerable A&S defines dilogarithms in a way that’s equivalent to the negative of the definition above and does not define polylogarithms of any other order. SciPy’s special function library follows A&S. SciPy uses the name spence for the dilogarithm for reasons we’ll get to shortly.

Mathematica has the function PolyLog[ν, x] that evaluates to

So polylnν above corresponds to -PolyLog[ν, -x] in Mathematica. Matlab’s polylog is the same as Mathematica’s PolyLog.

## Relation to other functions

Spence’s integral is the function of x given by the integral

and equals diln(x). Note that the SciPy function spence returns the negative of the integral above.

The Lerch function mentioned above is named for Mathias Lerch (1860–1922) and is defined by the integral

The connection with polylogarithms is easier to see from the series expansion:

The connection with polylogarithms is then

Note that the Lerch function also generalizes the Hurwitz zeta function, which in turn generalizes the Riemann zeta function. When x = 1, the Lerch function reduces to ζ(ν, u).

Related: Applied complex analysis

# Hypergeometric bootstrapping: implement one, get seven free

Suppose you have software to compute one hypergeometric function. How many more hypergeometric functions can you compute from it?

Hypergeometric functions satisfy a lot of identities, so you can bootstrap one such function into many more. That’s one reason they’re so useful in applications. For this post, I want to focus on just three formulas, the so-called linear transformation formulas that relate one hypergeometric function to another. (There are more linear formulas that relate one hypergeometric function to a combination of two others. I may consider those in a future post, but not today.)

## Linear transformations

The classical hypergeometric function has three parameters. The linear transformations discussed above relate the function with parameters (abc) to those with parameters

• (c – ac – bc)
• (ac – bc)
• (bc – ac)

Here are the linear transformations in detail:

## How many more?

The three transformations above are the ones listed in Handbook of Mathematical Functions by Abramowitz and Stegun. How many more transformations can we create by combining them? To answer this, we represent each of the transformations with a matrix. The transformations correspond to multiplying the matrix by the column vector  (abc).

The question of how many transformations we can come up with can be recast as asking what is the order of the group generated by AB, and C. (I’m jumping ahead a little by presuming these matrices generate a group. Conceivably the don’t have inverses, but in fact they do. The inverses of AB, and C are AB, and AC respectively.)

A little experimentation reveals that there are eight transformations: I, ABCDABEACFBC, and GCB. So in addition to the identity I, the do-nothing transformation, we have found four more: DEF, and G. These last four correspond to taking  (abc) to

• (c – abc)
• (ac – bc)
• (bac)
• (c – bc – ac)

This means that if we have software to compute the hypergeometric function with one fixed set of parameters, we can bootstrap that into computing the hypergeometric function with up to seven more sets of parameters. (Some of the seven new combinations could be repetitive if, for example, ab.)

## Identifying the group

We have uncovered a group of order 8, and there are only 5 groups that size. We should be able to find a familiar group isomorphic to our group of transformations.

The five groups of order 8 are Z8 (cyclic group), Z4×Z2 (direct product), E8 (elementary Abelian group), D8 (dihedral group), and the quaternion group. The first three are Abelian, and our group is not, so our group must be isomorphic to either the quaternion group or the dihedral group. The quaternion group has only 1 element of order 2, and the dihedral group has 5 elements of order 2. Our group has five elements of order 2 (ABDFG) and so it must be isomorphic to the dihedral group of order 8, the rotations and reflections of a square. (Some authors call this D8, because the group has eight elements. Others call it D4, because it is the group based on a 4-sided figure.)

Related: Applied complex analysis

# Uses for orthogonal polynomials

When I interviewed Daniel Spielman at this year’s Heidelberg Laureate Forum, we began our conversation by looking for common mathematical ground. The first thing that came up was orthogonal polynomials. (If you’re wondering what it means for two polynomials to be orthogonal, see here.)

JC: Orthogonal polynomials are kind of a lost art, a topic that was common knowledge among mathematicians maybe 50 or 100 years ago and now they’re obscure.

DS: The first course I taught I spent a few lectures on orthogonal polynomials because they kept coming up as the solutions to problems in different areas that I cared about. Chebyshev polynomials come up in understanding solving systems of linear equations, such as if you want to understand how the conjugate gradient method behaves. The analysis of error correcting codes and sphere packing has a lot of orthogonal polynomials in it. They came up in a course in multi-linear algebra I had in grad school. And they come up in matching polynomials of graphs, which is something people don’t study much anymore. … They’re coming back. They come up a lot in random matrix theory. … There are certain things that come up again and again and again so you got to know what they are.

* * *

More from my interview with Daniel Spielman:

# Multiple zeta

The Riemann zeta function, introduced by Leonard Euler, is defined by

where the sum is over all positive integers n.

Euler also introduced a multivariate generalization of the zeta function

where the sum is over all decreasing k-tuples of positive integers. This generalized zeta function satisfies the following beautiful identity:

The multivariate zeta function and identities such as the one above are important in number theory and are the subject of open conjectures.

Source

# Relating Airy and Bessel functions

The Airy functions Ai(x) and Bi(x) are independent solutions to the differential equation

For negative x they act something like sin(x) and cos(x). For positive x they act something like exp(x) and exp(-x). This isn’t surprising if you look at the differential equation. If you replace x with a negative constant, you get sines and cosines, and if you replace it with a positive constant, you get positive and negative exponentials.

The Airy functions can be related to Bessel functions as follows:

and

Here J is a “Bessel function of the first kind” and I is a “modified Bessel function of the first kind.” Also

To verify the equations above, and to show how to compute these functions in Python, here’s some code.

The SciPy function airy computes both functions, and their first derivatives, at once. I assume that’s because it doesn’t take much longer to compute all four functions than to compute one. The code for Ai2 and Bi2 below uses np.where instead of if ... else so that it can operate on NumPy vectors all at once. You can plot Ai and Ai2 and see that the two curves lie on top of each other. The same holds for Bi and Bi2.

from scipy.special import airy, jv, iv
from numpy import sqrt, where

def Ai(x):
(ai, ai_prime, bi, bi_prime) = airy(x)
return ai

def Bi(x):
(ai, ai_prime, bi, bi_prime) = airy(x)
return bi

def Ai2(x):
third = 1.0/3.0
hatx = 2*third*(abs(x))**1.5
return where(x > 0,
third*sqrt( x)*(iv(-third, hatx) - iv(third, hatx)),
third*sqrt(-x)*(jv(-third, hatx) + jv(third, hatx)))

def Bi2(x):
third = 1.0/3.0
hatx = 2*third*(abs(x))**1.5
return where(x > 0,
sqrt( x/3.0)*(iv(-third, hatx) + iv(third, hatx)),
sqrt(-x/3.0)*(jv(-third, hatx) - jv(third, hatx)))


There is a problem with Ai2 and Bi2: they return nan at 0. A more careful implementation would avoid this problem, but that’s not necessary since these functions are only for illustration. In practice, you’d simply use airy and it does the right thing at 0.

# Fibonacci numbers and orthogonal polynomials

The famous Fibonacci numbers are defined by the initial conditions

F0 = 0,  F1 = 1

and the recurrence relation

Fn = Fn-1 + Fn-2

for n > 1.

The Fibonacci polynomials are defined similarly. The have the same initial conditions

F0(x) = 0, F1(x) = 1

but a slightly different recurrence relation

Fn(x) = x Fn-1(x) + Fn-2(x)

for n > 1.

Several families of orthogonal polynomials satisfy a similar recurrence relationship

Fn(x) = p(x) Fn-1(x) + c(n) Fn-2(x).

The table below gives the values of p(x) and c(n) for a few families of polynomials.

 Family p(x) c(n) P0 P1(x) Fibonacci x 1 0 1 Chebyshev T 2x -1 1 x Chebyshev U 2x -1 1 2x Hermite 2x 2 – 2n 1 2x

There are two common definitions of the Hermite polynomials, sometimes called the physicist convention and the probabilist convention. The table above gives the physicist convention. For the probabilist definition, change the 2’s to 1’s in the last row and leave the 1 alone.

(If you haven’t heard of orthogonal polynomials, you might be wondering “orthogonal to what?”. These polynomials are orthogonal in the sense of an inner product formed by multiplying two polynomials and a weight, then integrating. Orthogonal polynomials differ by the the weight and the limits of integration. For example, the (physicist) Hermite polynomials are orthogonal with weight function exp(-x2) and integrating over the entire real line. The probabilist Hermite polynomials are very similar, but use the standard normal distribution density exp(-x2/2)/√(2π) as the weight instead of exp(-x2).)

* * *

# Giant components and the Lambert W-function

The Lambert W-function is the function w(z) implicitly defined by w exp(w) = z. When I first saw this, I thought that I’d seen something like this come up many times and that it would be really handy to know that this function has a name and has many software implementations1. Since then, however, I’ve hardly ever run into the W-function in application. But here’s an application I ran into recently.

A “giant component” in a random network is a component whose size grows in proportion to the size of the network. For a Poisson random network, let c = p(n – 1) be the expected number of vertices connected to any given vertex. If c > 1 then as n goes to infinity, there will be a giant component with probability 1. S, the proportion of nodes inside the a giant component, satisfies the equation

S = 1 – exp(-cS).

S plays a role similar to that of w in the definition of the W-function, which suggests the W-function might come in handy. And in fact, this book gives this solution for S:

S = 1 + w( –c exp(-c) )/c.

There are a couple issues. First, it’s not obvious that solution is correct. Second, we need to be more careful about how we define w(z) when z is negative.

Given some value of c, let S = 1 + w( –c exp(-c) )/c. Then

w( –c exp(-c) ) = –c(1 – S).

Apply the function f(x) = x exp( x ) to both sides of the equation above. By the definition of the W-function, we have

c exp(-c) = –c(1 – S) exp( –c(1 – S) ).

From there it easily follows that

S = 1 – exp(-cS).

Now let’s look more carefully at the definition of the W-function. For positive real z, w exp(w) = z has a unique real solution defining w. But in the calculations above, we have evaluated w at –c exp(-c), which is negative since c > 1. For negative arguments, we have to pick a branch of w. Which branch guarantees that S = 1 – exp(-cS)? Any branch2. No matter which solution to w exp(w) = z we pick, the resulting S satisfies the giant component equation. However, the wrong branch might result in a meaningless solution. Since S is a proportion, S should be between 0 and 1.

Let’s go back and say that for our purposes, we will take w(z) to mean the real number no less than -1 satisfying w exp(w) = z. Then for c > 1, the solution S = 1 + w( –c exp(-c) )/c is between 0 and 1. You can show that S = 1 – exp(-cS) has a unique solution for 0 < S < 1, so any other branch would not give a solution in this interval.

1 For example, in Python the Lambert W-function is scipy.special.lambertw. The function takes two arguments: z and an integer denoting the branch. By default, the branch argument is set to 0, giving the branch discussed in this post.

2 For -1/e < z < 0, there are two real branches of w(z), but there are also infinitely many complex branches.

# Addition formulas for Bessel functions

When trying to understand a complex formula, it helps to first ask what is being related before asking how they are related.

This post will look at addition theorems for Bessel functions. They related the values of Bessel functions at two points to the values at a third point.

Let a, b, and c be lengths of the sides of a triangle and let θ be the angle between the sides of length a and b. The triangle could be degenerate, i.e. θ could be π. The addition theorems here give the value of Bessel functions at c in terms of values of Bessel functions evaluated at b and c.

The first addition theorem says that

J0(c) = ∑ Jm(a) Jm(b) exp(imθ)

where the sum is over all integer values of m.

This says that the value of the Bessel function J0 at c is determined by a sort of inner product of values of Bessel functions of all integer orders at a and b. It’s not exactly an inner product because it is not positive definite. (Sometimes you’ll see the formula above written with a factor λ multiplying a, b, and c. This is because you can scale every side of a triangle by the same amount and not change the angles.)

Define the vectors x, y, and z to be the values of all Bessel functions evaluated at a, b, and c respectively. That is, for integer k, xk = Jk(a) and similar for y and z. Also, define the vector w by wk = exp(ikθ). Then the first addition theorem says

z0 = ∑ xm ym wm.

This is a little unsatisfying because it relates the value of one particular Bessel function at c to the values of all Bessel functions at a and b. We’d like to relate all Bessel function values at c to the values at a and b. That is, we’d like to relate the whole vector z to the vectors x and y.

Define the vector v by vk = exp(inψ) where ψ is the angle between the sides of length b and c. Then the formula we’re looking for is

zn = v-nxn+m ym wm.