Computing SVD and pseudoinverse

In a nutshell, given the singular decomposition of a matrix A,

A = U \Sigma V^*

the Moore-Penrose pseudoinverse is given by

A^+ = V \Sigma^+ U^*.

This post will explain what the terms above mean, and how to compute them in Python and in Matheamtica.

Singular Value Decomposition (SVD)

The singular value decomposition of a matrix is a sort of change of coordinates that makes the matrix simple, a generalization of diagonalization.

Matrix diagonalization

If a square matrix A is diagonalizable, then there is a matrix P such that

A = P^{-1} D P

where the matrix D is diagonal. You could think of P as a change of coordinates that makes the action of A as simple as possible. The elements on the diagonal of D are the eigenvalues of A and the columns of P are the corresponding eigenvectors.

Unfortunately not all matrices can be diagonalized. Singular value decomposition is a way to do something like diagonalization for any matrix, even non-square matrices.

Generalization to SVD

Singular value decomposition generalizes diagonalization. The matrix Σ in SVD is analogous to D in diagonalization. Σ is diagonal, though it may not be square. The matrices on either side of Σ are analogous to the matrix P in diagonalization, though now there are two different matrices, and they are not necessarily inverses of each other. The matrices U and V are square, but not necessarily of the same dimension.

The elements along the diagonal of Σ are not necessarily eigenvalues but singular values, which are a generalization of eigenvalues. Similarly the columns of U and V are not necessarily eigenvectors but left singular vectors and right singular vectors respectively.

The star superscript indicates conjugate transpose. If a matrix has all real components, then the conjugate transpose is just the transpose. But if the matrix has complex entries, you take the conjugate and transpose each entry.

The matrices U and V are unitary. A matrix M is unitary if its inverse is its conjugate transpose, i.e. M* M = MM* = I.

Pseudoinverse and SVD

The (Moore-Penrose) pseudoinverse of a matrix generalizes the notion of an inverse, somewhat like the way SVD generalized diagonalization. Not every matrix has an inverse, but every matrix has a pseudoinverse, even non-square matrices.

Computing the pseudoinverse from the SVD is simple.


A = U \Sigma V^*


A^+ = V \Sigma^+ U^*

where Σ+ is formed from Σ by taking the reciprocal of all the non-zero elements, leaving all the zeros alone, and making the matrix the right shape: if Σ is an m by n matrix, then Σ+ must be an n by m matrix.

We’ll give examples below in Mathematica and Python.

Computing SVD in Mathematica

Let’s start with the matrix A below.

A = \begin{bmatrix} 2 & -1 & 0 \\ 4 & 3 & -2 \end{bmatrix}

We can find the SVD of A with the following Mathematica commands.

    A = {{2, -1, 0}, {4, 3, -2}}
    {U, S, V} = SingularValueDecomposition[A]

From this we learn that the singular value decomposition of A is

\begin{bmatrix} \frac{1}{\sqrt{26}} & -\frac{5}{\sqrt{26}} \\ \frac{5}{\sqrt{26}} & \frac{1}{\sqrt{26}} \\ \end{bmatrix} \begin{bmatrix} \sqrt{30} & 0 & 0 \\ 0 & 2 & 0 \end{bmatrix} \begin{bmatrix} \frac{11}{\sqrt{195}} & \frac{7}{\sqrt{195}} & -\sqrt{\frac{5}{39}} \\ -\frac{3}{\sqrt{26}} & 2 \sqrt{\frac{2}{13}} & -\frac{1}{\sqrt{26}} \\ \frac{1}{\sqrt{30}} & \sqrt{\frac{2}{15}} & \sqrt{\frac{5}{6}} \end{bmatrix}

Note that the last matrix is not V but the transpose of V. Mathematica returns V itself, not its transpose.

If we multiply the matrices back together we can verify that we get A back.

    U . S. Transpose[V]

This returns

    {{2, -1, 0}, {4, 3, -2}}

as we’d expect.

Computing pseudoinverse in Mathematica

The Mathematica command for computing the pseudoinverse is simply PseudoInverse. (The best thing about Mathematica is it’s consistent, predictable naming.)


This returns

    {{19/60, 1/12}, {-(11/30), 1/6}, {1/12, -(1/12)}}

And we can confirm that computing the pseudoinverse via the SVD

    Sp = {{1/Sqrt[30], 0}, {0, 1/2}, {0, 0}}
    V . Sp . Transpose[U]

gives the same result.

Computing SVD in Python

Next we compute the singular value decomposition in Python (NumPy).

    >>> a = np.matrix([[2, -1, 0],[4,3,-2]])
    >>> u, s, vt = np.linalg.svd(a, full_matrices=True)

Note that np.linalg.svd returns the transpose of V, not the V in the definition of singular value decomposition.

Also, the object s is not the diagonal matrix Σ but a vector containing only the diagonal elements, i.e. just the singular values. This can save a lot of space if the matrix is large. The NumPy method svd has other efficiency-related options that I won’t go into here.

We can verify that the SVD is correct by turning s back into a matrix and multiply the components together.

    >>> ss = np.matrix([[s[0], 0, 0], [0, s[1], 0]])
    >>> u*ss*vt

This returns the matrix A, within floating point accuracy. Since Python is doing floating point computations, not symbolic calculation like Mathematica, the zero in A turns into -3.8e-16.

Note that the singular value decompositions as computed by Mathematica and Python differ in a few signs here and there; the SVD is not unique.

Computing pseudoinverse in Python

The pseudoinverse can be computed in NumPy with np.linalg.pinv.

    >>> np.linalg.pinv(a)
    matrix([[ 0.31666667,  0.08333333],
            [-0.36666667,  0.16666667],
            [ 0.08333333, -0.08333333]])

This returns the same result as Mathematica above, up to floating point precision.

Related posts

Obesity index: Measuring the fatness of probability distribution tails

A probability distribution is called “fat tailed” if its probability density goes to zero slowly. Slowly relative to what? That is often implicit and left up to context, but generally speaking the exponential distribution is the dividing line. Probability densities that decay faster than the exponential distribution are called “thin” or “light,” and densities that decay slower are called “thick”, “heavy,” or “fat,” or more technically “subexponential.” The distinction is important because fat-tailed distributions tend to defy our intuition.

One surprising property of heavy-tailed (subexponential) distributions is the single big jump principle. Roughly speaking, most of the contribution to the sum of several heavy-tailed random variables comes from the largest of the samples. To be more specific, let “several” = 4 for reasons that’ll be apparent soon, though the result is true for any n. As x goes to infinity, the probability that

X1 + X2 + X3 + X4

is larger than x is asymptotically equal the probability that

max(X1, X2, X3, X4)

is larger than x.

The idea behind the obesity index [1] is turn the theorem above around, making it an empirical measure of how thick a distribution’s tail is. If you draw four samples from a random variable and sort them, the obesity index is the probability that that the sum of the max and min, X1 + X4, is greater than the sum of the middle samples, X2 + X3.

The obesity index could be defined for any distribution, but it only measures what the name implies for right-tailed distributions. For any symmetric distribution, the obesity index is 1/2. A Cauchy distribution is heavy-tailed, but it has two equally heavy tails, and so its obesity index is the same as the normal distribution, which has two light tails.

Note that location and scale parameters have no effect on the obesity index; shifting and scaling effect all the X values the same, so it doesn’t change the probability that X1 + X4 is greater than X2 + X3.

To get an idea of the obesity index in action, we’ll look at the normal, exponential, and Cauchy distributions, since these are the canonical examples of thin, medium, and thick tailed distributions. But for reasons explained above, we’ll actually look at the folded normal and folded Cauchy distributions, i.e. we’ll take their absolute values to create right-tailed distributions.

To calculate the obesity index exactly you’d need to do analytical calculations with order statistics. We’ll simulate the obesity index because that’s easier. It’s also more in the spirit of calculating the obesity index from data.

    from scipy.stats import norm, expon, cauchy

    def simulate_obesity(dist, N):
        data = abs(dist.rvs(size=(N,4)))
        count = 0
        for row in range(N):
            X = sorted(data[row])
            if X[0] + X[3] > X[1] + X[2]:
                count += 1
        return count/N

    for dist in [norm, expon, cauchy]:
        print( simulate_obesity(dist, 10000) )

When I ran the Python code above, I got


This ranks the three distributions in the anticipated order of tail thickness.

Note that the code above takes the absolute value of the random samples. This lets us pass in ordinary (unfolded) versions of the normal and Cauchy distributions, and its redundant for any distribution like the exponential that’s already positive-valued.

[I found out after writing this blog post that SciPy now has foldnorm and foldcauchy, but they don’t seem to work like I expect.]

Let’s try it on a few more distributions. Lognormal is between exponential and Cauchy in thickness. A Pareto distribution with parameter b goes to zero like x-1-b and so we expect a Pareto distribution to have a smaller obesity index than Cauchy when b is greater than 1, and a larger index when b is less than one. Once again the simulation results are what we’d expect.

The code

    for dist in [lognorm, pareto(2), pareto(0.5)]:
        print( simulate_obesity(dist, 10000) )



By this measure, lognormal is just a little heavier than exponential. Pareto(2) comes in lighter than Cauchy, but not by much, and Pareto(0.5) comes in heavier.

Since the obesity index is a probability, it will always return a value between 0 and 1. Maybe it would be easier to interpret if we did something like take the logit transform of the index to spread the values out more. Then the distinctions between Pareto distributions of different orders, for example, might match intuition better.

[1] Roger M. Cooke et al. Fat-Tailed Distributions: Data, Diagnostics and Dependence. Wiley, 2014.

Curvature and automatic differentiation

Curvature is tedious to calculate by hand because it involves calculating first and second order derivatives. Of course other applications require derivatives too, but curvature is the example we’ll look at in this post.

Computing derivatives

It would be nice to write programs that only explicitly implement the original function and let software take care of finding the derivatives.

Numerical differentiation

Finite difference approximations for derivatives are nothing new. For example, Euler (1707–1783) used finite differences to numerically solve differential equations. But numerical differentiation can be inaccurate or unstable, especially for higher order derivatives.

Symbolic differentiation

Symbolic differentiation is another approach, having the computer manipulate expressions much as a person would do by hand. It works well for many problems, though it scales poorly for large problems. It also requires functions to be presented in traditional mathematical form, not in the form of source code.

Automatic differentiation

Automatic differentiation is a third way. Like numerical differentiation, it works with floating point numbers, not symbolic expressions. But unlike numerical differentiation, the result does not have approximation error.

As someone put it, automatic differentiation applies the chain rule to floating point numbers rather than to symbolic expressions.

Python implementation

I’ll use the Python library autograd to compute curvature and illustrate automatic differentiation. autograd is not the most powerful automatic differentiation library for Python, but it is the simplest I’ve seen.

We will compute the curvature of a logistic curve.

y = \frac{1}{1 + e^{-x}}

The curvature of the graph of a function is given by

\kappa(x) = \frac{|y''|}{(1 + y'^2)^{3/2}}

Here’s Python code using autograd to compute the curvature.

    import autograd.numpy as np
    from autograd import grad

    def f(x):
        return 1/(1 + np.exp(-x))

    f1 = grad(f)  # 1st derivative of f
    f2 = grad(f1) # 2nd derivative of f

    def curvature(x):
        return abs(f2(x))*(1 + f1(x)**2)**-1.5

Curvature plots

The graph is relatively flat in the middle and at the far ends. In between, the graph bends creating two regions of higher curvature.

    import matplotlib.pyplot as plt

    x = np.linspace(-5, 5, 300)
    plt.plot(x, f(x))
    plt.title("Logisitic curve")

Now let’s look at the curvature.

    y = [curvature(t) for t in x]
    plt.plot(x, y)

curvature for logistic curve

As we should expect, the curvature is small at the ends and in the middle, with local maxima in between.

We can also look at the signed curvature, the same expression as curvature but without the absolute value.

k(x) = \frac{y''}{(1 + y'^2)^{3/2}}

We plot this with the following code.

    def signed_curvature(x):
        return f2(x)*(1 + f1(x)**2)**-1.5

    y = [signed_curvature(t) for t in x]
    plt.plot(x, y)
    plt.title("Signed curvature")

The result looks more like a sine wave.

The positive values mean the curve is bending counterclockwise, and the negative values mean the curve is bending clockwise.

Related post: Squircles and curvature

Quantifying normal approximation accuracy

Probability is full of theorems that say that probability density approximates another as some parameter becomes large. All the dashed lines in the diagram below indicate a relationship like this.


You can find details of what everything in the diagram means here.

How can you quantify these approximations? One way is to use Kullback-Leibler divergence.  In this post I’ll illustrate this for the normal approximation to the beta and gamma distributions.

The Kullback-Leibler divergence between two random variables X and Y is defined as

KL(X || Y) = -\int f_X(x) \log \frac{f_Y(x)}{f_X(x)} \, dx

We will compute this integral numerically in the code below to create graphs of how K-L divergence varies with parameters.

Here are the imports we’ll need.

    from scipy.integrate import quad
    from scipy.stats import beta, gamma, norm
    from scipy import inf
    import matplotlib.pyplot as plt

Beta distribution

As the shape parameters of a beta distribution become large, the probability distribution becomes approximately normal (Gaussian).

Here is code that will take the two shape parameters of a beta distribution, construct a normal approximation by moment matching, and compute the quality of the approximation as measured by Kullback-Leibler divergence.

    def KL_beta_norm(a, b):
        b = beta(a, b)
        n = norm(b.mean(), b.std())
        f = lambda x: -b.pdf(x)*(n.logpdf(x) - b.logpdf(x))    
        integral, error = quad(f, 0, 1)
        return integral

And here we make our plot.

    x = [2**i for i in range(11)]
    y = [KL_beta_norm(t, 2*t) for t in x]
    plt.plot(x, y)
    plt.ylabel("KL divergence")
    plt.title("Comparing beta(a, 2a) and its normal approximation")

The result is nearly linear on a log-log scale.

Kullback Leibler divergence from normal approximation to beta

I made the b parameter twice the a parameter to show that you don’t need symmetry. When you do have symmetry, i.e a = b, the approximation quality is better and the graph is even straighter.

Gamma distribution

As the shape parameter of a gamma distribution increases, the probability density becomes more and more like that of a normal distribution. We can quantify this with the following code.

    def KL_gamma_norm(shape):
        g = gamma(shape)
        n = norm(g.mean(), g.std())
        f = lambda x: -g.pdf(x)*(n.logpdf(x) - g.logpdf(x))
        mode = max(0, shape-1)
        integral1, error1 = quad(f, 0, mode)
        integral2, error2 = quad(f, mode, inf)    
        return integral1 + integral2

The integration code is a little more complicated this time. For small shape parameters, code analogous to that for the beta distribution will work just fine. But for larger parameters, the integration fails. The numerical integration routine needs a little help. The largest contribution to the integral is located near the mode of the gamma distribution. For large shape parameters, the integration routine misses this peak and grossly underestimates the integral. We break the integral into two pieces at the mode of the gamma distribution so the integration routine can’t miss it. This is a small example of why numerical integration cannot be completely automated. You have to know something about what you’re integrating.

(The quad() function has a parameter points to let you tell it about points of interest like this, but it doesn’t work when one of the limits of integration is infinite.)

The plotting code is essentially the same as that for the beta distribution. As before, the plot is linear on a log-log scale.

Kullback Leibler divergence for normal approximation to gamma

You could do a similar analysis on the other approximations in the distribution relationship diagram above.

Related posts

Distribution of matches between two shuffled decks

Take two desks of cards and shuffle them. They can be standard 52-card decks, though the number of cards in the decks doesn’t matter as long as they’re the same and the decks are fairly large.

Now count the number of times the two desks match, i.e. how many times the same card is in the same position in both desks. The number of matches is random, and its distribution is approximately Poisson with mean 1. Let’s do a simulation and see how close the results come to the predicted outcome.

Here’s the Python code:

import numpy as np
from scipy.stats import poisson
import matplotlib.pyplot as plt

def count_zeros(x):
    return len(x[x==0])

num_reps = 10000
deck_size = 52

matches = np.zeros(deck_size+1, dtype=int)

# Simulation
for _ in range(num_reps):

    # Shuffle two decks
    a = np.random.permutation(deck_size)
    b = np.random.permutation(deck_size)

    # Count how often they match
    num_matches = count_zeros(a-b)
    matches[ num_matches ] += 1

# Cut off outputs too small to see
cutoff = 8

# Matches predicted by a Poisson distribution with mean 1.
predicted = [num_reps*poisson(1).pmf(i) for i in range(cutoff)]

# Plot results
x = np.arange(cutoff)
w = 0.3 # bar width, matches[0:cutoff], w), predicted, w)
plt.legend(["actual", "predicted"])

And here’s the output based on 10,000 simulations:

Plot showing good fit between predicted and actual

About 1/3 of the time, you get no matches, another 1/3 of the time you get one match, and the rest of the time you get more. More precisely, according to the Poisson model zero matches and one match are both have probability 1/e.

Average fraction round up

Pick a large number n. Divide n by each of the positive integers up to n and round the results up to the nearest integer. On average, how far do you round up?

Or in terms of probability, what is the expected distance between a fraction n/r, where n is large and fixed and r is chosen randomly between 1 and n, and the nearest larger integer?

In symbols, the question above is asking for the approximate value of

\frac{1}{n} \sum_{r = 1}^n \left( \left\lceil \frac{n}{r} \right\rceil - \frac{n}{r}\right )

for large n, i.e. in the limit as n goes to ∞. Here ⌈x⌉ denotes the ceiling of x, the smallest integer greater than or equal to x.

Let’s plot this as a function of n and see what it looks like. Here’s the Python code.

    import matplotlib.pyplot as plt
    from numpy import ceil, arange

    def f(n):
        return sum( [ceil(n/r) - n/r for r in range(1, n)] )/n

    x = arange(1, 100)
    y = [f(n) for n in x]
    plt.plot(x, y)

And here’s the result.

It appears the graph may be converging to some value, and in fact it is. Charles de la Vallée Poussin proved in 1898 that the limiting value is the Euler–Mascheroni constant γ = 0.5772…. This constant is the limiting difference between the nth harmonic number and log n, i.e.

\gamma = \lim_{n\to\infty} \left(\sum_{r = 1}^n \frac{1}{r} - \log n \right)

We can add a horizontal line to our plot to see how well the graph seems to match γ. To do this we need to import the constant euler_gamma from numpy and add the

    plt.axhline(y=euler_gamma, linestyle=":")

after the plot command. When we do, this is what we see.

It looks like the plot is converging to a value slightly less than γ. Apparently the convergence is very slow. When we go out to 10,000 the plot is closer to being centered around γ but still maybe below γ more than above.

If we evaluate our function at n = 1,000,000, we get 0.577258… while γ = 0.577215….

At n = 10,000,000 we get 0.577218…. So taking 100 times as many terms in our sum gives us one extra correct decimal place, as we’d expect of a random processes since convergence usually goes like 1/√n.

Equation for the Eiffel Tower

Robert Banks’s book Towing Icebergs, Falling Dominoes, and Other Adventures in Applied Mathematics describes the Eiffel Tower’s shape as approximately the logarithmic curve

y = -y_* \log\left(\frac{x}{x_0} \right )

where y* and x0 are chosen to match the tower’s dimensions.

Here’s a plot of the curve:

Eiffel tower curve plot

And here’s the code that produced the plot:

from numpy import log, exp, linspace, vectorize
import matplotlib.pyplot as plt

# Taken from "Towing Icebergs, Falling Dominoes,
# and Other Adventures in Applied Mathematics"
# by Robert B. Banks

# Constants given in Banks in feet. Convert to meters.
feet_to_meter = 0.0254*12
ystar  = 201*feet_to_meter
x0     = 207*feet_to_meter
height = 984*feet_to_meter

# Solve for where to cut off curve to match height of the tower.
# - ystar log xmin/x0 = height
xmin = x0 * exp(-height/ystar)

def f(x):
    if -xmin < x < xmin:
        return height
        return -ystar*log(abs(x/x0))
curve = vectorize(f)
x = linspace(-x0, x0, 400)

plt.plot(x, curve(x))
plt.xlim(-2*x0, 2*x0)
plt.title("Eiffel Tower")


Related post: When length equals area
The St. Louis arch is approximately a catenary, i.e. a hyperbolic cosine.

Quantifying information gain in beta-binomial Bayesian model

The beta-binomial model is the “hello world” example of Bayesian statistics. I would call it a toy model, except it is actually useful. It’s not nearly as complicated as most models used in application, but it illustrates the basics of Bayesian inference. Because it’s a conjugate model, the calculations work out trivially.

For more on the beta-binomial model itself, see A Bayesian view of Amazon Resellers and Functional Folds and Conjugate Models.

I mentioned in a recent post that the Kullback-Leibler divergence from the prior distribution to the posterior distribution is a measure of how much information was gained.

Here’s a little Python code for computing this. Enter the a and b parameters of the prior and the posterior to compute how much information was gained.

    from scipy.integrate import quad
    from scipy.stats import beta as beta
    from scipy import log2

    def infogain(post_a, post_b, prior_a, prior_b):

        p = beta(post_a, post_b).pdf
        q = beta(prior_a, prior_b).pdf

        (info, error) = quad(lambda x: p(x) * log2(p(x) / q(x)), 0, 1)
        return info

This code works well for medium-sized inputs. It has problems with large inputs because the generic integration routine quad needs some help when the beta distributions become more concentrated.

You can see that surprising input carries more information. For example, suppose your prior is beta(3, 7). This distribution has a mean of 0.3 and so your expecting more failures than successes. With such a prior, a success changes your mind more than a failure does. You can quantify this by running these two calculations.

    print( infogain(4, 7, 3, 7) )
    print( infogain(3, 8, 3, 7) )

The first line shows that a success would change your information by 0.1563 bits, while the second shows that a failure would change it by 0.0297 bits.

Why is Kullback-Leibler divergence not a distance?

The Kullback-Leibler divergence between two probability distributions is a measure of how different the two distributions are. It is sometimes called a distance, but it’s not a distance in the usual sense because it’s not symmetric. At first this asymmetry may seem like a bug, but it’s a feature. We’ll explain why it’s useful to measure the difference between two probability distributions in an asymmetric way.

The Kullback-Leibler divergence between two random variables X and Y is defined as

KL(X || Y) = -\int f_X(x) \log \frac{f_Y(x)}{f_X(x)} \, dx

This is pronounced/interpreted several ways:

  • The divergence from Y to X
  • The relative entropy of X with respect to Y
  • How well Y approximates X
  • The information gain going from the prior Y to the posterior X
  • The average surprise in seeing Y when you expected X

A theorem of Gibbs proves that K-L divergence is non-negative. It’s clearly zero if X and Y have the same distribution.

The K-L divergence of two random variables is an expected value, and so it matters which distribution you’re taking the expectation with respect to. That’s why it’s asymmetric.

As an example, consider the probability densities below, one exponential and one gamma with a shape parameter of 2.

exponential and gamma PDFs

The two densities differ mostly on the left end. The exponential distribution believes this region is likely while the gamma does not. This means that an expectation with respect to the exponential distribution will weigh things in this region more heavily. In an information-theoretic sense, an exponential is a better approximation to a gamma than the other way around.

Here’s some Python code to compute the divergences.

    from scipy.integrate import quad
    from scipy.stats import expon, gamma
    from scipy import inf

    def KL(X, Y):
        f = lambda x: -X.pdf(x)*(Y.logpdf(x) - X.logpdf(x))
        return quad(f, 0, inf)

    e = expon
    g = gamma(a = 2)

    print( KL(e, g) )
    print( KL(g, e) )

This returns

    (0.5772156649008394, 1.3799968612282498e-08)
    (0.4227843350984687, 2.7366807708872898e-09)

The first element of each pair is the integral and the second is the error estimate. So apparently both integrals have been computed accurately, and the first is clearly larger. This backs up our expectation that it’s more surprising to see a gamma when expecting an exponential than vice versa.

Although K-L divergence is asymmetric in general, it can be symmetric. For example, suppose X and Y are normal random variables with the same variance but different means. Then it would be equally surprising to see either one when expecting the other. You can verify this in the code above by changing the KL function to integrate over the whole real line

    def KL(X, Y):
        f = lambda x: -X.pdf(x)*(Y.logpdf(x) - X.logpdf(x))
        return quad(f, -inf, inf)

and trying an example.

n1 = norm(1, 1)
n2 = norm(2, 1)

print( KL(n1, n2) )
print( KL(n2, n1) )

This returns

(0.4999999999999981, 1.2012834963423225e-08)
(0.5000000000000001, 8.106890774205374e-09)

and so both integrals are equal to within the error in the numerical integration.

Fourier-Bessel series and Gibbs phenomena

Fourier-Bessel series are analogous to Fourier series. And like Fourier series, they converge pointwise near a discontinuity with the same kind of overshoot and undershoot known as the Gibbs phenomenon.

Fourier-Bessel series

Bessel functions come up naturally when working in polar coordinates, just as sines and cosines come up naturally when working in rectangular coordinates. You can think of Bessel functions as a sort of variation on sine waves. Or even more accurately, a variation on sinc functions, where sinc(z) = sin(z)/z. [1]

A Fourier series represents a function as a sum of sines and cosines of different frequencies. To make things a little simpler here, I’ll only consider Fourier sine series so I don’t have to repeatedly say “and cosine.”

f(z) = \sum_{n=1}^\infty c_n \sin(n \pi z)

A Fourier-Bessel function does something similar. It represents a function as a sum of rescaled versions of a particular Bessel function. We’ll use the Bessel J0 here, but you could pick some other Jν.

Fourier series scale the sine and cosine functions by π times integers, i.e. sin(πz), sin(2πz), sin(3πz), etc. Fourier-Bessel series scale by the zeros of the Bessel function: J01z),  J02z),  J03z), etc. where λn are the zeros of J0. This is analogous to scaling sin(πz) by its roots: π, 2π, 3π, etc. So a Fourier-Bessel series for a function f looks like

f(z) = \sum_{n=1}^\infty c_n J_0(\lambda_n z).

The coefficients cn for Fourier-Bessel series can be computed analogously to Fourier coefficients, but with a couple minor complications. First, the basis functions of a Fourier series are orthogonal over [0, 1] without any explicit weight, i.e. with weight 1. And second, the inner product of a basis function doesn’t depend on the frequency. In detail,

\int_0^1 \sin(m \pi z) \, \sin(n \pi z) \, dz = \frac{\delta_{mn}}{2}.

Here δmn equals 1 if m = n and 0 otherwise.

Fourier-Bessel basis functions are orthogonal with a weight z, and the inner product of a basis function with itself depends on the frequency. In detail

\int_0^1 J_0(\lambda_m z) \, J_0(\lambda_n z) \, z\, dz = \frac{\delta_{mn}}{2} J_1(\lambda_n).

So whereas the coefficients for a Fourier sine series are given by

c_n = 2 \int_0^1 f(z)\, \sin(n\pi z) \,dz

the coefficients for a Fourier-Bessel series are given by

c_n = \frac{ 2\int_0^1 f(z)\, J_0(\lambda_n z) \, dz}{ J_1(\lambda_n)^2 }.

Gibbs phenomenon

Fourier and Fourier-Bessel series are examples of orthogonal series, and so by construction they converge in the norm given by their associated inner product. That means that if SN is the Nth partial sum of a Fourier series

\lim_{N\to\infty} \int_0^1 \left( f(x) - S_N(x) \right)^2 \, dz = 0
and the analogous statement for a Fourier-Bessel series is

\lim_{N\to\infty} \int_0^1 \left( f(x) - S_N(x) \right)^2 \, z\, dz = 0.

In short, the series converge in a (weighted) L² norm. But how do the series converge pointwise? A lot of harmonic analysis is devoted to answering this question, what conditions on the function f guarantee what kind of behavior of the partial sums of the series expansion.

If we look at the Fourier series for a step function, the partial sums converge pointwise everywhere except at the step discontinuity. But the way they converge is interesting. You get a sort of “bat ear” phenomena where the partial sums overshoot the step function at the discontinuity. This is called the Gibbs phenomenon after Josiah Willard Gibbs who observed the effect in 1899. (Henry Wilbraham observed the same thing earlier, but Gibbs didn’t know that.)

The Gibbs phenomena is well known for Fourier series. It’s not as well known that the same phenomenon occurs for other orthogonal series, such as Fourier-Bessel series. I’ll give an example of Gibbs phenomenon for Fourier-Bessel series taken from [2] and give Python code to visualize it.

We take our function f(z) to be 1 on [0, 1/2] and 0 on (1/2, 1]. It works out that

c_n = \frac{1}{\lambda_n} \frac{J_1(\lambda_n / 2)}{ J_1(\lambda_n)^2 }.

Python code and plot

Here’s the plot with 100 terms. Notice how the partial sums overshoot the mark to the left of 1/2 and undershoot to the right of 1/2.

Plot showing Gibbs phenomena for Fourier-Bessel series

Here’s the same plot with 1,000 terms.

Gibbs phenomena for 1000 terms of Fourier-Bessel series

Here’s the Python code that produced the plot.

    import matplotlib.pyplot as plt
    from scipy.special import j0, j1, jn_zeros
    from scipy import linspace

    N = 100 # number of terms in series

    roots = jn_zeros(0, N)
    coeff = [j1(r/2) / (r*j1(r)**2) for r in roots]
    z = linspace(0, 1, 200)

    def partial_sum(z):
        return sum([coeff[i]*j0(roots[i]*z) for i in range(N)])

    plt.plot(z, partial_sum(z))
    plt.ylabel("{}th partial sum".format(N))


[1] To be precise, as z goes to infinity

J_nu(z) sim sqrt{frac{2}{pi z}} cosleft(z - frac{1}{2} nu pi - frac{pi}{4} right)

and so the Bessel functions are asymptotically proportional to sin(z – φ)/√z for some phase shift φ.

[2] The Gibbs’ phenomenon for Fourier-Bessel Series. Temple H. Fay and P. Kendrik Kloppers. International Journal of Mathematical Education in Science and Technology. 2003. vol. 323, no. 2, 199-217.

Approximate inverse of the gamma function

The other day I ran across a blog post by Brian Hayes that linked to an article by David Cantrell on how to compute the inverse of the gamma function. Cantrell gives an approximation in terms of the Lambert W function.

In this post we’ll write a little Python code to kick the tires on Cantrell’s approximation. The post also illustrates how to do some common tasks using SciPy and matplotlib.

Here are the imports we’ll need.

      import matplotlib.pyplot as plt
      from scipy import pi, e, sqrt, log, linspace
      from scipy.special import lambertw, gamma, psi
      from scipy.optimize import root

First of all, the gamma function has a local minimum k somewhere between 1 and 2, and so it only makes sense to speak of its inverse to the left or right of this point. Gamma is strictly increasing for real values larger than k.

To find k we look for where the derivative of gamma is zero. It’s more common to work with the derivative of the logarithm of the gamma function than the derivative of the gamma function itself. That works just as well because gamma has a minimum where its log has a minimum. The derivative of the log of the gamma function is called ψ and is implemented in SciPy as scipy.special.psi. We use the function scipy.optimize.root to find where ψ is zero.

The root function returns more information than just the root we’re after. The root(s) are returned in the arrayx, and in our case there’s only one root, so we take the first element of the array:

      k = root(psi, 1.46).x[0]

Now here is Cantrell’s algorithm:

      c = sqrt(2*pi)/e - gamma(k)
      def L(x):
          return log((x+c)/sqrt(2*pi))
      def W(x):
          return lambertw(x)
      def AIG(x):
          return L(x) / W( L(x) / e) + 0.5

Cantrell uses AIG for Approximate Inverse Gamma.

How well goes this algorithm work? For starters, we’ll see how well it does when we do a round trip, following the exact gamma with the approximate inverse.

      x = linspace(5, 30, 100)
      plt.plot(x, AIG(gamma(x)))

This produces the following plot:

We get a straight line, as we should, so next we do a more demanding test. We’ll look at the absolute error in the approximate inverse. We’ll use a log scale on the x-axis since gamma values get large quickly.

      y = gamma(x)
      plt.plot(y, x- AIG(y))

This shows the approximation error is small, and gets smaller as its argument increases.

Cantrell’s algorithm is based on an asymptotic approximation, so it’s not surprising that it improves for large arguments.

Click to learn more about numerical integration consulting


Related posts:

Random squares

In geometry, you’d say that if a square has side x, then it has area x2.

In calculus, you’d say more. First you’d say that if a square has side near x, then it has area near x2. That is, area is a continuous function of the length of a side. As the length of the side changes, there’s never an abrupt jump in area. Next you could be more specific and say that a small change Δx to a side of length x corresponds to approximately a change of 2x Δx in the area.

In probability, you ask what is the area of a square like if you pick the length of its side at random. If you pick the length of the side from a distribution with mean μ, does the distribution of the area have mean μ2? No, but if the probability distribution on side length is tightly concentrated around μ, then the distribution on area will be concentrated near μ2. And you can approximate just how near the area is to μ2 using the delta method, analogous to the calculus discussion above.

If the distribution on side lengths is not particularly concentrated, finding the distribution on the area is more interesting. It will depend on the specific distribution on side length, and the mean area might not be particularly close to the square of the mean side length. The function to compute area is trivial, and yet the question of what happens when you stick a random variable into that function is not trivial. Random variables behave as you might expect when you stick them into linear functions, but offer surprises when you stick them into nonlinear functions.

Suppose you pick the length of the side of a square uniformly from the interval [0, 1]. Then the average side is 1/2, and so you might expect the average area to be 1/4. But the expected area is actually 1/3. You could see this a couple ways, analytically and empirically.

First an analytical derivation. If X has a uniform [0, 1] distribution and ZX2, then the CDF of Z is

Prob(Zz) = Prob(X ≤ √z) = √ z.

and so the PDF for Z, the derivative of the CDF, is -1/2√z. From there you can compute the expected value by integrating z times the PDF.

You could check your calculations by seeing whether simulation gives you similar results. Here’s a little Python code to do that.

      from random import random
      N = 1000000
      print( sum([random()**2 for _ in range(N)] )/N )

When I run this, I get 0.33386, close to 1/3.

Now lets look at an exponential distribution on side length with mean 1. Then a calculation similar to the one above shows that the expected value of the product is 2. You can also check this with simulation. This time we’ll be a little fancier and let SciPy generate our random values for us.

      print( sum(expon.rvs(size=N)**2)/N )

When I ran this, I got 1.99934, close to the expected value of 2.

You’ll notice that in both examples, the expected value of the area is more than the square of the expected value of the side. This is not a coincidence but consequence of Jensen’s inequality. Squaring is a convex function, so the expected value of the square is larger than the square of the expected value for any random variable.

Distribution of numbers in Pascal’s triangle

This post explores a sentence from the book Single Digits:

Any number in Pascal’s triangle that is not in the outer two layers will appear at least three times, usually four.

Pascal’s triangle contains the binomial coefficients C(nr) where n ranges over non-negative numbers and r ranges from 0 to n. The outer layers are the elements with r equal to 0, 1, n-1, and n.

Pascal's triangle

We’ll write some Python code to explore how often the numbers up to 1,000,000 appear. How many rows of Pascal’s triangle should we compute? The smallest number on row n is C(n, 2). Now 1,000,000 is between C(1414, 2) and C(1415, 2) so we need row 1414. This means we need N = 1415 below because the row numbers start with 0.

I’d like to use a NumPy array for storing Pascal’s triangle. In the process of writing this code I realized that a NumPy array with dtype int doesn’t contain Python’s arbitrary-sized integers. This makes sense because NumPy is designed for efficient computation, and using a NumPy array to contain huge integers is unnatural. But I’d like to do it anyway, and the way to make it happen is to set dtype to object.

    import numpy as np
    from collections import Counter
    N = 1415 # Number of rows of Pascal's triangle to compute
    Pascal = np.zeros((N, N), dtype=object)
    Pascal[0, 0] = 1
    Pascal[1,0] = Pascal[1,1] = 1
    for n in range(2, N):
        for r in range(0, n+1):
            Pascal[n, r] = Pascal[n-1, r-1] + Pascal[n-1, r]
    c = Counter()
    for n in range(4, N):
        for r in range(2, n-1):
            p = Pascal[n, r]
            if p <= 1000000:
                c[p] += 1

When we run this code, we find that our counter contains 1732 elements. That is, of the numbers up to one million, only 1732 of them appear inside Pascal’s triangle when we disallow the outer two layers. (The second layer contains consecutive integers, so every positive integer appears in Pascal’s triangle. But most integers only appear in the second layer.)

When Single Digits speaks of “Any number in Pascal’s triangle that is not in the outer two layers” this cannot refer to numbers that are not in the outer two layers because every natural number appears in the outer two layers. Also, when it says the number “will appear at least three times, usually four” it is referring to the entire triangle, i.e. including the outer two layers. So another way to state the sentence quoted above is as follows.

Define the interior of Pascal’s triangle to be the triangle excluding the outer two layers. Every number n in the interior of Pascal’s triangle appears twice more outside of the interior, namely as C(n, 1) and C(nn-1). Most often n appears at least twice in the interior as well.

This means that any number you find in the interior of Pascal’s triangle, interior meaning not in the two outer layers, will appear at least three times in the full triangle, usually more.

Here are the statistics from our code above regarding just the interior of Pascal’s triangle.

  • One number, 3003, appears six times.
  • Six numbers appear four times: 120, 210, 1540, 7140, 11628, and 24310.
  • Ten numbers appear only once: 6, 20, 70, 252, 924, 3432, 12870,  48620, 184756, and 705432.
  • The large majority of numbers, 1715 out of 1732, appear twice.

How to create Green noise in Python

This is a follow-on to my previous post on green noise. Here we create green noise with Python by passing white noise through a Butterworth filter.

Green noise is in the middle of the audible spectrum (on the Bark scale), just where our hearing is most sensitive, analogous to the green light, the frequency where our eyes are most sensitive. See previous post for details, including an explanation of where the left and right cutoffs below come from.

Here’s the code:

from import write
from scipy.signal import buttord, butter, filtfilt
from scipy.stats import norm
from numpy import int16

def turn_green(signal, samp_rate):
    # start and stop of green noise range
    left = 1612 # Hz
    right = 2919 # Hz

    nyquist = (samp_rate/2)
    left_pass  = 1.1*left/nyquist
    left_stop  = 0.9*left/nyquist
    right_pass = 0.9*right/nyquist
    right_stop = 1.1*right/nyquist

    (N, Wn) = buttord(wp=[left_pass, right_pass],
                      ws=[left_stop, right_stop],
                      gpass=2, gstop=30, analog=0)
    (b, a) = butter(N, Wn, btype='band', analog=0, output='ba')
    return filtfilt(b, a, signal)

def to_integer(signal):
    # Take samples in [-1, 1] and scale to 16-bit integers,
    # values between -2^15 and 2^15 - 1.
    signal /= max(signal)
    return int16(signal*(2**15 - 1))

N = 48000 # samples per second

white_noise= norm.rvs(0, 1, 3*N) # three seconds of audio
green = turn_green(white_noise, N)
write("green_noise.wav", N, to_integer(green))

And here’s what it sounds like:

(download .wav file)

Let’s look at the spectrum to see whether it looks right. We’ll use one second of the signal so the x-axis coincides with frequency when we plot the FFT.

from scipy.fftpack import fft

one_sec = green[0:N]
plt.xlim((1500, 3000))

Here’s the output, concentrated between 1600 and 3000 Hz as expected:

spectral plot of green noise

Creating police siren sounds with frequency modulation

Yesterday I was looking into calculating fluctuation strength and playing around with some examples. Along the way I discovered how to create files that sound like police sirens. These are sounds with high fluctuation strength.

police car lights

The Python code below starts with a carrier wave at fc = 1500 Hz. Not surprisingly, this frequency is near where hearing is most sensitive. Then this signal is modulated with a signal with frequency fm. This frequency determines the frequency of the fluctuations.

The slower example produced by the code below sounds like a police siren. The faster example makes me think more of an ambulance or fire truck. Next time I hear an emergency vehicle I’ll pay more attention.

If you use a larger value of the modulation index β and a smaller value of the modulation frequency fm you can make a sound like someone tuning a radio, which is no coincidence.

Here are the output audio files in .wav format:



from import write
from numpy import arange, pi, sin, int16

def f(t, f_c, f_m, beta):
    # t    = time
    # f_c  = carrier frequency
    # f_m  = modulation frequency
    # beta = modulation index
    return sin(2*pi*f_c*t - beta*sin(2*f_m*pi*t))

def to_integer(signal):
    # Take samples in [-1, 1] and scale to 16-bit integers,
    # values between -2^15 and 2^15 - 1.
    return int16(signal*(2**15 - 1))

N = 48000 # samples per second
x = arange(3*N) # three seconds of audio

data = f(x/N, 1500, 2, 100)
write("slow.wav", N, to_integer(data))

data = f(x/N, 1500, 8, 100)
write("fast.wav", N, to_integer(data))

Related posts:

Click to learn more about consulting help with signal processing