# Computing SVD and pseudoinverse

In a nutshell, given the singular decomposition of a matrix A,

the Moore-Penrose pseudoinverse is given by

This post will explain what the terms above mean, and how to compute them in Python and in Mathematica.

## Singular Value Decomposition (SVD)

The singular value decomposition of a matrix is a sort of change of coordinates that makes the matrix simple, a generalization of diagonalization.

### Matrix diagonalization

If a square matrix A is diagonalizable, then there is a matrix P such that

where the matrix D is diagonal. You could think of P as a change of coordinates that makes the action of A as simple as possible. The elements on the diagonal of D are the eigenvalues of A and the columns of P are the corresponding eigenvectors.

Unfortunately not all matrices can be diagonalized. Singular value decomposition is a way to do something like diagonalization for any matrix, even non-square matrices.

### Generalization to SVD

Singular value decomposition generalizes diagonalization. The matrix Σ in SVD is analogous to D in diagonalization. Σ is diagonal, though it may not be square. The matrices on either side of Σ are analogous to the matrix P in diagonalization, though now there are two different matrices, and they are not necessarily inverses of each other. The matrices U and V are square, but not necessarily of the same dimension.

The elements along the diagonal of Σ are not necessarily eigenvalues but singular values, which are a generalization of eigenvalues. Similarly the columns of U and V are not necessarily eigenvectors but left singular vectors and right singular vectors respectively.

The star superscript indicates conjugate transpose. If a matrix has all real components, then the conjugate transpose is just the transpose. But if the matrix has complex entries, you take the conjugate and transpose each entry.

The matrices U and V are unitary. A matrix M is unitary if its inverse is its conjugate transpose, i.e. M* M = MM* = I.

## Pseudoinverse and SVD

The (Moore-Penrose) pseudoinverse of a matrix generalizes the notion of an inverse, somewhat like the way SVD generalized diagonalization. Not every matrix has an inverse, but every matrix has a pseudoinverse, even non-square matrices.

Computing the pseudoinverse from the SVD is simple.

If

then

where Σ+ is formed from Σ by taking the reciprocal of all the non-zero elements, leaving all the zeros alone, and making the matrix the right shape: if Σ is an m by n matrix, then Σ+ must be an n by m matrix.

We’ll give examples below in Mathematica and Python.

### Computing SVD in Mathematica

Let’s start with the matrix A below.

We can find the SVD of A with the following Mathematica commands.

    A = {{2, -1, 0}, {4, 3, -2}}
{U, S, V} = SingularValueDecomposition[A]


From this we learn that the singular value decomposition of A is

Note that the last matrix is not V but the transpose of V. Mathematica returns V itself, not its transpose.

If we multiply the matrices back together we can verify that we get A back.

    U . S. Transpose[V]


This returns

    {{2, -1, 0}, {4, 3, -2}}


as we’d expect.

### Computing pseudoinverse in Mathematica

The Mathematica command for computing the pseudoinverse is simply PseudoInverse. (The best thing about Mathematica is it’s consistent, predictable naming.)

    PseudoInverse[A]


This returns

    {{19/60, 1/12}, {-(11/30), 1/6}, {1/12, -(1/12)}}


And we can confirm that computing the pseudoinverse via the SVD

    Sp = {{1/Sqrt[30], 0}, {0, 1/2}, {0, 0}}
V . Sp . Transpose[U]


gives the same result.

### Computing SVD in Python

Next we compute the singular value decomposition in Python (NumPy).

    >>> a = np.matrix([[2, -1, 0],[4,3,-2]])
>>> u, s, vt = np.linalg.svd(a, full_matrices=True)


Note that np.linalg.svd returns the transpose of V, not the V in the definition of singular value decomposition.

Also, the object s is not the diagonal matrix Σ but a vector containing only the diagonal elements, i.e. just the singular values. This can save a lot of space if the matrix is large. The NumPy method svd has other efficiency-related options that I won’t go into here.

We can verify that the SVD is correct by turning s back into a matrix and multiply the components together.

    >>> ss = np.matrix([[s[0], 0, 0], [0, s[1], 0]])
>>> u*ss*vt


This returns the matrix A, within floating point accuracy. Since Python is doing floating point computations, not symbolic calculation like Mathematica, the zero in A turns into -3.8e-16.

Note that the singular value decompositions as computed by Mathematica and Python differ in a few signs here and there; the SVD is not unique.

### Computing pseudoinverse in Python

The pseudoinverse can be computed in NumPy with np.linalg.pinv.

    >>> np.linalg.pinv(a)
matrix([[ 0.31666667,  0.08333333],
[-0.36666667,  0.16666667],
[ 0.08333333, -0.08333333]])


This returns the same result as Mathematica above, up to floating point precision.

# Obesity index: Measuring the fatness of probability distribution tails

A probability distribution is called “fat tailed” if its probability density goes to zero slowly. Slowly relative to what? That is often implicit and left up to context, but generally speaking the exponential distribution is the dividing line. Probability densities that decay faster than the exponential distribution are called “thin” or “light,” and densities that decay slower are called “thick”, “heavy,” or “fat,” or more technically “subexponential.” The distinction is important because fat-tailed distributions tend to defy our intuition.

One surprising property of heavy-tailed (subexponential) distributions is the single big jump principle. Roughly speaking, most of the contribution to the sum of several heavy-tailed random variables comes from the largest of the samples. To be more specific, let “several” = 4 for reasons that’ll be apparent soon, though the result is true for any n. As x goes to infinity, the probability that

X1 + X2 + X3 + X4

is larger than a given x is asymptotically equal the probability that

max(X1, X2, X3, X4)

is larger than the same x.

The idea behind the obesity index [1] is to turn the theorem above around, making it an empirical measure of how thick a distribution’s tail is. If you draw four samples from a random variable and sort them, the obesity index is the probability that that the sum of the max and min, X1 + X4, is greater than the sum of the middle samples, X2 + X3.

The obesity index could be defined for any distribution, but it only measures what the name implies for right-tailed distributions. For any symmetric distribution, the obesity index is exactly 1/2. A Cauchy distribution is heavy-tailed, but it has two equally heavy tails, and so its obesity index is the same as the normal distribution, which has two light tails.

Note that location and scale parameters have no effect on the obesity index; shifting and scaling effect all the X values the same, so it doesn’t change the probability that X1 + X4 is greater than X2 + X3.

To get an idea of the obesity index in action, we’ll look at the normal, exponential, and Cauchy distributions, since these are the canonical examples of thin, medium, and thick tailed distributions. But for reasons explained above, we’ll actually look at the folded normal and folded Cauchy distributions, i.e. we’ll take their absolute values to create right-tailed distributions.

To calculate the obesity index exactly you’d need to do analytical calculations with order statistics. We’ll simulate the obesity index because that’s easier. It’s also more in the spirit of calculating the obesity index from data.

    from scipy.stats import norm, expon, cauchy

def simulate_obesity(dist, N):
data = abs(dist.rvs(size=(N,4)))
count = 0
for row in range(N):
X = sorted(data[row])
if X[0] + X[3] > X[1] + X[2]:
count += 1
return count/N

for dist in [norm, expon, cauchy]:
print( simulate_obesity(dist, 10000) )


When I ran the Python code above, I got

    0.6692
0.7519
0.8396


This ranks the three distributions in the anticipated order of tail thickness.

Note that the code above takes the absolute value of the random samples. This lets us pass in ordinary (unfolded) versions of the normal and Cauchy distributions, and its redundant for any distribution like the exponential that’s already positive-valued.

[I found out after writing this blog post that SciPy now has foldnorm and foldcauchy, but they don’t seem to work like I expect.]

Let’s try it on a few more distributions. Lognormal is between exponential and Cauchy in thickness. A Pareto distribution with parameter b goes to zero like x-1-b and so we expect a Pareto distribution to have a smaller obesity index than Cauchy when b is greater than 1, and a larger index when b is less than one. Once again the simulation results are what we’d expect.

The code

    for dist in [lognorm, pareto(2), pareto(0.5)]:
print( simulate_obesity(dist, 10000) )


returns

    0.7766
0.8242
0.9249


By this measure, lognormal is just a little heavier than exponential. Pareto(2) comes in lighter than Cauchy, but not by much, and Pareto(0.5) comes in heavier.

Since the obesity index is a probability, it will always return a value between 0 and 1. Maybe it would be easier to interpret if we did something like take the logit transform of the index to spread the values out more. Then the distinctions between Pareto distributions of different orders, for example, might match intuition better.

[1] Roger M. Cooke et al. Fat-Tailed Distributions: Data, Diagnostics and Dependence. Wiley, 2014.

# Curvature and automatic differentiation

Curvature is tedious to calculate by hand because it involves calculating first and second order derivatives. Of course other applications require derivatives too, but curvature is the example we’ll look at in this post.

## Computing derivatives

It would be nice to write programs that only explicitly implement the original function and let software take care of finding the derivatives.

### Numerical differentiation

Finite difference approximations for derivatives are nothing new. For example, Euler (1707–1783) used finite differences to numerically solve differential equations. But numerical differentiation can be inaccurate or unstable, especially for higher order derivatives.

### Symbolic differentiation

Symbolic differentiation is another approach, having the computer manipulate expressions much as a person would do by hand. It works well for many problems, though it scales poorly for large problems. It also requires functions to be presented in traditional mathematical form, not in the form of source code.

### Automatic differentiation

Automatic differentiation is a third way. Like numerical differentiation, it works with floating point numbers, not symbolic expressions. But unlike numerical differentiation, the result does not have approximation error.

As someone put it, automatic differentiation applies the chain rule to floating point numbers rather than to symbolic expressions.

## Python implementation

I’ll use the Python library autograd to compute curvature and illustrate automatic differentiation. autograd is not the most powerful automatic differentiation library for Python, but it is the simplest I’ve seen.

We will compute the curvature of a logistic curve.

The curvature of the graph of a function is given by

Here’s Python code using autograd to compute the curvature.

    import autograd.numpy as np
from autograd import grad

def f(x):
return 1/(1 + np.exp(-x))

f1 = grad(f)  # 1st derivative of f
f2 = grad(f1) # 2nd derivative of f

def curvature(x):
return abs(f2(x))*(1 + f1(x)**2)**-1.5


## Curvature plots

The graph is relatively flat in the middle and at the far ends. In between, the graph bends creating two regions of higher curvature.

    import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 300)
plt.plot(x, f(x))
plt.xlabel("$x$")
plt.ylabel("$y$")
plt.title("Logisitic curve")
plt.savefig("logistic_curve.svg")


Now let’s look at the curvature.

    y = [curvature(t) for t in x]
plt.plot(x, y)
plt.xlabel("$x$")
plt.ylabel(r"$\kappa(x)$")
plt.title("Curvature")
plt.savefig("plot_logistic_curvature.svg")


As we should expect, the curvature is small at the ends and in the middle, with local maxima in between.

We can also look at the signed curvature, the same expression as curvature but without the absolute value.

We plot this with the following code.

    def signed_curvature(x):
return f2(x)*(1 + f1(x)**2)**-1.5

y = [signed_curvature(t) for t in x]
plt.plot(x, y)
plt.xlabel("$x$")
plt.ylabel(r"$k(x)$")
plt.title("Signed curvature")
plt.savefig("graph_signed_curvature.svg")


The result looks more like a sine wave.

The positive values mean the curve is bending counterclockwise, and the negative values mean the curve is bending clockwise.

Related post: Squircles and curvature

# Quantifying normal approximation accuracy

Probability is full of theorems that say that probability density approximates another as some parameter becomes large. All the dashed lines in the diagram below indicate a relationship like this.

You can find details of what everything in the diagram means here.

How can you quantify these approximations? One way is to use Kullback-Leibler divergence.  In this post I’ll illustrate this for the normal approximation to the beta and gamma distributions.

The Kullback-Leibler divergence between two random variables X and Y is defined as

We will compute this integral numerically in the code below to create graphs of how K-L divergence varies with parameters.

Here are the imports we’ll need.

    from scipy.integrate import quad
from scipy.stats import beta, gamma, norm
from scipy import inf
import matplotlib.pyplot as plt


## Beta distribution

As the shape parameters of a beta distribution become large, the probability distribution becomes approximately normal (Gaussian).

Here is code that will take the two shape parameters of a beta distribution, construct a normal approximation by moment matching, and compute the quality of the approximation as measured by Kullback-Leibler divergence.

    def KL_beta_norm(a, b):
b = beta(a, b)
n = norm(b.mean(), b.std())
f = lambda x: -b.pdf(x)*(n.logpdf(x) - b.logpdf(x))
integral, error = quad(f, 0, 1)
return integral


And here we make our plot.

    x = [2**i for i in range(11)]
y = [KL_beta_norm(t, 2*t) for t in x]
plt.plot(x, y)
plt.xscale("log")
plt.yscale("log")
plt.xlabel("a")
plt.ylabel("KL divergence")
plt.title("Comparing beta(a, 2a) and its normal approximation")
plt.savefig("beta_KL.svg")
plt.close()


The result is nearly linear on a log-log scale.

I made the b parameter twice the a parameter to show that you don’t need symmetry. When you do have symmetry, i.e a = b, the approximation quality is better and the graph is even straighter.

## Gamma distribution

As the shape parameter of a gamma distribution increases, the probability density becomes more and more like that of a normal distribution. We can quantify this with the following code.

    def KL_gamma_norm(shape):
g = gamma(shape)
n = norm(g.mean(), g.std())
f = lambda x: -g.pdf(x)*(n.logpdf(x) - g.logpdf(x))
mode = max(0, shape-1)
integral1, error1 = quad(f, 0, mode)
integral2, error2 = quad(f, mode, inf)
return integral1 + integral2


The integration code is a little more complicated this time. For small shape parameters, code analogous to that for the beta distribution will work just fine. But for larger parameters, the integration fails. The numerical integration routine needs a little help. The largest contribution to the integral is located near the mode of the gamma distribution. For large shape parameters, the integration routine misses this peak and grossly underestimates the integral. We break the integral into two pieces at the mode of the gamma distribution so the integration routine can’t miss it. This is a small example of why numerical integration cannot be completely automated. You have to know something about what you’re integrating.

(The quad() function has a parameter points to let you tell it about points of interest like this, but it doesn’t work when one of the limits of integration is infinite.)

The plotting code is essentially the same as that for the beta distribution. As before, the plot is linear on a log-log scale.

You could do a similar analysis on the other approximations in the distribution relationship diagram above.

# Distribution of matches between two shuffled decks

Take two desks of cards and shuffle them. They can be standard 52-card decks, though the number of cards in the decks doesn’t matter as long as they’re the same and the decks are fairly large.

Now count the number of times the two desks match, i.e. how many times the same card is in the same position in both desks. The number of matches is random, and its distribution is approximately Poisson with mean 1. Let’s do a simulation and see how close the results come to the predicted outcome.

Here’s the Python code:

import numpy as np
from scipy.stats import poisson
import matplotlib.pyplot as plt

def count_zeros(x):
return len(x[x==0])

num_reps = 10000
deck_size = 52

matches = np.zeros(deck_size+1, dtype=int)

# Simulation
for _ in range(num_reps):

# Shuffle two decks
a = np.random.permutation(deck_size)
b = np.random.permutation(deck_size)

# Count how often they match
num_matches = count_zeros(a-b)
matches[ num_matches ] += 1

# Cut off outputs too small to see
cutoff = 8

# Matches predicted by a Poisson distribution with mean 1.
predicted = [num_reps*poisson(1).pmf(i) for i in range(cutoff)]

# Plot results
x = np.arange(cutoff)
w = 0.3 # bar width
plt.bar(x, matches[0:cutoff], w)
plt.bar(x+w, predicted, w)
plt.legend(["actual", "predicted"])
plt.xlabel("matches")
plt.ylabel("frequency")
plt.savefig("shuffled_desk_matches.svg")
plt.show()


And here’s the output based on 10,000 simulations:

About 1/3 of the time, you get no matches, another 1/3 of the time you get one match, and the rest of the time you get more. More precisely, according to the Poisson model zero matches and one match are both have probability 1/e.

# Average fraction round up

Pick a large number n. Divide n by each of the positive integers up to n and round the results up to the nearest integer. On average, how far do you round up?

Or in terms of probability, what is the expected distance between a fraction n/r, where n is large and fixed and r is chosen randomly between 1 and n, and the nearest larger integer?

In symbols, the question above is asking for the approximate value of

for large n, i.e. in the limit as n goes to ∞. Here ⌈x⌉ denotes the ceiling of x, the smallest integer greater than or equal to x.

Let’s plot this as a function of n and see what it looks like. Here’s the Python code.

    import matplotlib.pyplot as plt
from numpy import ceil, arange

def f(n):
return sum( ceil(n/r) - n/r for r in range(1, n) )/n

x = arange(1, 100)
y = [f(n) for n in x]
plt.plot(x, y)
plt.show()


And here’s the result.

It appears the graph may be converging to some value, and in fact it is. Charles de la Vallée Poussin proved in 1898 that the limiting value is the Euler–Mascheroni constant γ = 0.5772…. This constant is the limiting difference between the nth harmonic number and log n, i.e.

We can add a horizontal line to our plot to see how well the graph seems to match γ. To do this we need to import the constant euler_gamma from numpy and add the

    plt.axhline(y=euler_gamma, linestyle=":")

after the plot command. When we do, this is what we see.

It looks like the plot is converging to a value slightly less than γ. Apparently the convergence is very slow. When we go out to 10,000 the plot is closer to being centered around γ but still maybe below γ more than above.

If we evaluate our function at n = 1,000,000, we get 0.577258… while γ = 0.577215….

At n = 10,000,000 we get 0.577218…. So taking 100 times as many terms in our sum gives us one extra correct decimal place, as we’d expect of a random processes since convergence usually goes like 1/√n.

# Equation for the Eiffel Tower

Robert Banks’s book Towing Icebergs, Falling Dominoes, and Other Adventures in Applied Mathematics describes the Eiffel Tower’s shape as approximately the logarithmic curve

where y* and x0 are chosen to match the tower’s dimensions.

Here’s a plot of the curve:

And here’s the code that produced the plot:

from numpy import log, exp, linspace, vectorize
import matplotlib.pyplot as plt

# Taken from "Towing Icebergs, Falling Dominoes,
# and Other Adventures in Applied Mathematics"
# by Robert B. Banks

# Constants given in Banks in feet. Convert to meters.
feet_to_meter = 0.0254*12
ystar  = 201*feet_to_meter
x0     = 207*feet_to_meter
height = 984*feet_to_meter

# Solve for where to cut off curve to match height of the tower.
# - ystar log xmin/x0 = height
xmin = x0 * exp(-height/ystar)

def f(x):
if -xmin < x < xmin:
return height
else:
return -ystar*log(abs(x/x0))
curve = vectorize(f)

x = linspace(-x0, x0, 400)

plt.plot(x, curve(x))
plt.xlim(-2*x0, 2*x0)
plt.xlabel("Meters")
plt.ylabel("Meters")
plt.title("Eiffel Tower")

plt.axes().set_aspect(1)
plt.savefig("eiffel_tower.svg")


Related post: When length equals area
The St. Louis arch is approximately a catenary, i.e. a hyperbolic cosine.

# Quantifying information gain in beta-binomial Bayesian model

The beta-binomial model is the “hello world” example of Bayesian statistics. I would call it a toy model, except it is actually useful. It’s not nearly as complicated as most models used in application, but it illustrates the basics of Bayesian inference. Because it’s a conjugate model, the calculations work out trivially.

For more on the beta-binomial model itself, see A Bayesian view of Amazon Resellers and Functional Folds and Conjugate Models.

I mentioned in a recent post that the Kullback-Leibler divergence from the prior distribution to the posterior distribution is a measure of how much information was gained.

Here’s a little Python code for computing this. Enter the a and b parameters of the prior and the posterior to compute how much information was gained.

    from scipy.integrate import quad
from scipy.stats import beta as beta
from scipy import log2

def infogain(post_a, post_b, prior_a, prior_b):

p = beta(post_a, post_b).pdf
q = beta(prior_a, prior_b).pdf

(info, error) = quad(lambda x: p(x) * log2(p(x) / q(x)), 0, 1)
return info


This code works well for medium-sized inputs. It has problems with large inputs because the generic integration routine quad needs some help when the beta distributions become more concentrated.

You can see that surprising input carries more information. For example, suppose your prior is beta(3, 7). This distribution has a mean of 0.3 and so your expecting more failures than successes. With such a prior, a success changes your mind more than a failure does. You can quantify this by running these two calculations.

    print( infogain(4, 7, 3, 7) )
print( infogain(3, 8, 3, 7) )


The first line shows that a success would change your information by 0.1563 bits, while the second shows that a failure would change it by 0.0297 bits.

# Why is Kullback-Leibler divergence not a distance?

The Kullback-Leibler divergence between two probability distributions is a measure of how different the two distributions are. It is sometimes called a distance, but it’s not a distance in the usual sense because it’s not symmetric. At first this asymmetry may seem like a bug, but it’s a feature. We’ll explain why it’s useful to measure the difference between two probability distributions in an asymmetric way.

The Kullback-Leibler divergence between two random variables X and Y is defined as

This is pronounced/interpreted several ways:

• The divergence from Y to X
• The relative entropy of X with respect to Y
• How well Y approximates X
• The information gain going from the prior Y to the posterior X
• The average surprise in seeing Y when you expected X

A theorem of Gibbs proves that K-L divergence is non-negative. It’s clearly zero if X and Y have the same distribution.

The K-L divergence of two random variables is an expected value, and so it matters which distribution you’re taking the expectation with respect to. That’s why it’s asymmetric.

As an example, consider the probability densities below, one exponential and one gamma with a shape parameter of 2.

The two densities differ mostly on the left end. The exponential distribution believes this region is likely while the gamma does not. This means that an expectation with respect to the exponential distribution will weigh things in this region more heavily. In an information-theoretic sense, an exponential is a better approximation to a gamma than the other way around.

Here’s some Python code to compute the divergences.

    from scipy.integrate import quad
from scipy.stats import expon, gamma
from scipy import inf

def KL(X, Y):
f = lambda x: -X.pdf(x)*(Y.logpdf(x) - X.logpdf(x))
return quad(f, 0, inf)

e = expon
g = gamma(a = 2)

print( KL(e, g) )
print( KL(g, e) )


This returns

    (0.5772156649008394, 1.3799968612282498e-08)
(0.4227843350984687, 2.7366807708872898e-09)


The first element of each pair is the integral and the second is the error estimate. So apparently both integrals have been computed accurately, and the first is clearly larger. This backs up our expectation that it’s more surprising to see a gamma when expecting an exponential than vice versa.

Although K-L divergence is asymmetric in general, it can be symmetric. For example, suppose X and Y are normal random variables with the same variance but different means. Then it would be equally surprising to see either one when expecting the other. You can verify this in the code above by changing the KL function to integrate over the whole real line

    def KL(X, Y):
f = lambda x: -X.pdf(x)*(Y.logpdf(x) - X.logpdf(x))
return quad(f, -inf, inf)


and trying an example.

n1 = norm(1, 1)
n2 = norm(2, 1)

print( KL(n1, n2) )
print( KL(n2, n1) )


This returns

(0.4999999999999981, 1.2012834963423225e-08)
(0.5000000000000001, 8.106890774205374e-09)


and so both integrals are equal to within the error in the numerical integration.

# Fourier-Bessel series and Gibbs phenomena

Fourier-Bessel series are analogous to Fourier series. And like Fourier series, they converge pointwise near a discontinuity with the same kind of overshoot and undershoot known as the Gibbs phenomenon.

## Fourier-Bessel series

Bessel functions come up naturally when working in polar coordinates, just as sines and cosines come up naturally when working in rectangular coordinates. You can think of Bessel functions as a sort of variation on sine waves. Or even more accurately, a variation on sinc functions, where sinc(z) = sin(z)/z. [1]

A Fourier series represents a function as a sum of sines and cosines of different frequencies. To make things a little simpler here, I’ll only consider Fourier sine series so I don’t have to repeatedly say “and cosine.”

A Fourier-Bessel function does something similar. It represents a function as a sum of rescaled versions of a particular Bessel function. We’ll use the Bessel J0 here, but you could pick some other Jν.

Fourier series scale the sine and cosine functions by π times integers, i.e. sin(πz), sin(2πz), sin(3πz), etc. Fourier-Bessel series scale by the zeros of the Bessel function: J01z),  J02z),  J03z), etc. where λn are the zeros of J0. This is analogous to scaling sin(πz) by its roots: π, 2π, 3π, etc. So a Fourier-Bessel series for a function f looks like

The coefficients cn for Fourier-Bessel series can be computed analogously to Fourier coefficients, but with a couple minor complications. First, the basis functions of a Fourier series are orthogonal over [0, 1] without any explicit weight, i.e. with weight 1. And second, the inner product of a basis function doesn’t depend on the frequency. In detail,

Here δmn equals 1 if m = n and 0 otherwise.

Fourier-Bessel basis functions are orthogonal with a weight z, and the inner product of a basis function with itself depends on the frequency. In detail

So whereas the coefficients for a Fourier sine series are given by

the coefficients for a Fourier-Bessel series are given by

## Gibbs phenomenon

Fourier and Fourier-Bessel series are examples of orthogonal series, and so by construction they converge in the norm given by their associated inner product. That means that if SN is the Nth partial sum of a Fourier series

and the analogous statement for a Fourier-Bessel series is

In short, the series converge in a (weighted) L² norm. But how do the series converge pointwise? A lot of harmonic analysis is devoted to answering this question, what conditions on the function f guarantee what kind of behavior of the partial sums of the series expansion.

If we look at the Fourier series for a step function, the partial sums converge pointwise everywhere except at the step discontinuity. But the way they converge is interesting. You get a sort of “bat ear” phenomena where the partial sums overshoot the step function at the discontinuity. This is called the Gibbs phenomenon after Josiah Willard Gibbs who observed the effect in 1899. (Henry Wilbraham observed the same thing earlier, but Gibbs didn’t know that.)

The Gibbs phenomena is well known for Fourier series. It’s not as well known that the same phenomenon occurs for other orthogonal series, such as Fourier-Bessel series. I’ll give an example of Gibbs phenomenon for Fourier-Bessel series taken from [2] and give Python code to visualize it.

We take our function f(z) to be 1 on [0, 1/2] and 0 on (1/2, 1]. It works out that

## Python code and plot

Here’s the plot with 100 terms. Notice how the partial sums overshoot the mark to the left of 1/2 and undershoot to the right of 1/2.

Here’s the same plot with 1,000 terms.

Here’s the Python code that produced the plot.

    import matplotlib.pyplot as plt
from scipy.special import j0, j1, jn_zeros
from scipy import linspace

N = 100 # number of terms in series

roots = jn_zeros(0, N)
coeff = [j1(r/2) / (r*j1(r)**2) for r in roots]
z = linspace(0, 1, 200)

def partial_sum(z):
return sum( coeff[i]*j0(roots[i]*z) for i in range(N) )

plt.plot(z, partial_sum(z))
plt.xlabel("z")
plt.ylabel("{}th partial sum".format(N))
plt.show()


## Footnotes

[1] To be precise, as z goes to infinity

and so the Bessel functions are asymptotically proportional to sin(z – φ)/√z for some phase shift φ.

[2] The Gibbs’ phenomenon for Fourier-Bessel Series. Temple H. Fay and P. Kendrik Kloppers. International Journal of Mathematical Education in Science and Technology. 2003. vol. 323, no. 2, 199-217.