The Laplace limit

An earlier post discussed how to solve Kepler’s equation

ME − e sin(E)

using a sine series. You could also solve Kepler’s equation using a power series, which Lagrange did in 1771. Both approaches express E as a function of e and M, but from different perspectives. Bessel thought of his solution as a sum of sines in M, with coefficients that depend on e. Lagrange thought of his solution as a power series in e whose coefficients involve sines in M. You can rearrange the terms of either solution into the other.

The most interesting thing about the power series solution, in my opinion, is that it only converges for e less than roughly 2/3 while the sine series solution is valid for all e < 1. In astronomical terms, this means the power series solution works for the orbit of some planets but not others!

In our solar system, the planets all have eccentricity well below 2/3, but not all minor planets do. For example, the orbit of Eris has eccentricity 0.4407 but the orbit of Sedna has eccentricity 0.8549. And in other solar systems there are planets with eccentricity much greater than 2/3.

The Laplace limit

The radius of convergence for Lagrange’s power series solution is called the Laplace limit. Its value is eL = 0.6627…. There’s no obvious reason why there’s anything special about this value. There’s no astronomical reason for this value. It’s an artifact of the power series form of the solution.

If the series works for e = 0.66, you would reasonably think it works for e = 0.67, but that’s not the case. And if you’re observant, you might notice that although the series works for e = 0.66, it takes longer to converge than for smaller values of e; the rate of convergence is slowing down, warning you of danger ahead.

The exact value of eL is the unique real solution to the equation

x \exp\left(\sqrt{1 + x^2}\right) = 1 + \sqrt{1 + x^2}

There’s no obvious reason for this either. It has to do with finding the largest circle that can fit in a lens-shaped region of convergence. More on that here.

We can calculate eL with the following Python code.

from math import exp
from scipy.optimize import root_scalar

def f(x):
    t = (1 + x*x)**0.5
    return x*math.exp(t) - 1 - t

sol = root_scalar(f, bracket=[0, 1], method='brentq')
print(sol.root)

This prints 0.6627434193491817.

Series details

We can use the Lagrange inversion formula to find the series, just as Lagrange did two and a half centuries ago.

E = M+ \sum_{n=1}^{\infty} \frac{e^n}{n!} \frac{d^{\,n-1}}{dM^{\,n-1}} \left(\sin^n M\right)

The powers of sine can be expanded into the sum of sines of various frequencies and differentiated, leading to the equation

E = M+ \sum_{n=1}^{\infty} \frac{e^n}{2^{\,n-1}n!} \sum_{k=0}^{\lfloor n/2\rfloor} (-1)^k \binom{n}{k} (n-2k)^{n-1} \sin\!\big((n-2k)M\big)

 

A crank formula for π

I ran across a cranky formula for π based on physical constants here

\pi = \left( \frac{E}{\frac{1}{2}mc^{2}} \right)^{1/2} \left[ J_{\lambda} \cdot \lambda^{5} \left( e^{\frac{hc}{\lambda kT}} - 1 \right) \right]

and decided to play around with it.

The source describes λ as “wavelength (chosen in the microwave region)” and I thought perhaps you could chose a value of λ to make the equation work. But as a comment pointed out, the bracketed expression is simply 2hc², independent of λ, due to Planck’s blackbody law. That means we can simplify the expression above to

\pi = 2\sqrt{2} hc \sqrt{\frac{E}{m}}

Now the values of h and c are known. In fact, they’re now exactly known by definition: other SI units are defined in terms of h and c. The mass of an electron is known to 11 significant figures.

But E in the equation above is “Total energy of the universe.” I don’t even know what that means. Does it refer to the observable universe? Does it include dark energy? Does it include the energy equivalent of mass?

I asked a couple LLMs that the total energy of the universe might mean and what its value might be, and they said something like “Depends. It might be zero. It might be infinite. But if I had to say, I’d say around 1070 Joules.”

If we solve the equation above for E we get 2.8480347886530404 × 1019 Joules. I have no idea how to justify that.

The expression for π is not dimensionless. I suppose you could choose some nonstandard units that would make the equation work.

The source I linked to above cites Mathematical Cranks by Underwood Dudley, but I couldn’t find it in the book.

Related posts

From Kepler to Bessel

The previous post very briefly said that the integral representation for Bessel functions was motived by solving Kepler’s equation. This post will go into more detail.

Kepler’s equation

There are multiple ways to describe the position of a planet in an elliptical orbit around a star. For historical reasons, these descriptions have arcane names such as mean anomaly, true anomaly, and eccentric anomaly. This post explains how these three are related.

For this post, it is enough to say that often you know mean anomaly M and want to know eccentric anomaly E. These are related via Kepler’s equation

M = E - e \sin E
where e is the eccentricity of the orbit. You’d like to solve for E as a function of M and e, but there’s no elementary way to do that.

One way to solve Kepler’s equation is to take a guess at E and plug it into the right hand side of

E = M + e \sin E
to get a new E, and keep iterating until the two sides are closer together. I write more about this here.

Another approach to solving Kepler’s equation is to use Newton’s method. I write more about that here.

Still another approach is to expand E in a sine series and find the series coefficients. An advantage to this approach is that once you have the coefficients, you have an expression for E as a function of M, and you can plug in more values of M without having to solve Kepler’s equation for each value of M separately.

Sine series coefficients

Kepler’s equation is easy to solve at E = 0 and at E = π. In both cases, EM. So the function E − M is zero at both ends of [0, π], which suggests we try to expand E − M in a sine series

E - M = \sum_{n=1}^\infty a_n \sin nM

We then calculate the Fourier coefficients an as usual.

\begin{align*} a_n &= \frac{2}{\pi} \int_0^\pi (E-M) \sin(nM) \, dM \\ &= \frac{2}{n \pi} \int_0^\pi (E^\prime - 1) \cos(nM)\, dM \\ &= \frac{2}{n \pi} \int_0^\pi \cos(nM) E^\prime(M) \, dM \\ &= \frac{2}{n \pi} \int_0^\pi \cos\Big(nE - ne\sin(E)\Big) E^\prime(M) \, dM \\ &= \frac{2}{n} \left\{ \frac{1}{\pi} \int_0^\pi \cos\Big(nE - ne \sin(E)\Big)\, dE\right\} \\ &= \frac{2}{n} J_n(ne) \end{align*}

The second line uses integration by parts. The third line uses Kepler’s equation. The last line uses the definition of the Bessel functions Jn given in the previous post.

Mr. Bessel’s eponymous functions

Yesterday I wrote a post showing that the trapezoid rule evaluates the integral

\int_{-\pi}^\pi \cos(\sin(x) + x)\, dx

very efficiently. But how do we know what the exact integral is for comparison? If you ask Mathematica, it will tell you the integral equals −2π J1(1) where J1 is a Bessel function. This may seem like rabbit out of a hat, but it’s actually a simple calculation given the integral definition of Bessel functions:

J_n(z) = \frac{1}{\pi}\int_0^\pi \cos(n\theta - z\sin(\theta))\, d\theta

Since cosine is even, we can write our integral over [−π, π] as twice the integral over [0, π]. Then a change of variables turns this into the definition of Jn(z) with n = 1 and z = 1.

A deeper question is what have we accomplished by just giving a new name to essentially the same problem we started with. Another question is why in the world are Bessel functions defined as above.

As for what we’ve accomplished, we’ve related out integration problem to a very well-studied function. Bessel functions have been studied for two centuries and it’s easy to find software to evaluate them. Even the usually minimalist command line calculator bc has a function j(x, n) for evaluating Jn(x) for integer values of n. We could calculate our integral to 50 decimal places as follows.

~$ bc -l
>>> scale = 50
>>>  -8*a(1)*j(1,1)
-2.76491937476833705153256665538788207487495025542883

Note that bc doesn’t have a value of π built in, but a(x) evaluates the arctangent function, and π = 4 arctan(1).

There are multiple ways of defining Bessel functions. The three main ways would be in terms of their power series, in terms of the differential equations they solve, and in terms of their integral representation. Friedrich Bessel defined what we now call Bessel functions of the first kind, the Jn functions, in terms of their integral representations.

Why did Bessel care about these integrals? They came out of his calculations in celestial mechanics. One example of this is solving Kepler’s equation with Fourier series; the Fourier coefficients are given by Bessel functions. This is worked out in detail in the next post.

Bessel functions had occurred in applications before Mr. Bessel drew attention to them. The functions are named after him because he was the first to systematically study them.

Mathematics is developed inductively but taught deductively. It’s common for things to be kicked around for years before someone decides they deserve a name and systematic study. See this post on the central limit theorem for another example. The CLT is older than the Gaussian distribution, even older than Gauss.

Related posts

Integrating smooth periodic functions

Several posts lately have looked at the function

f(x) = cos(sin(x) + x).

This post will look at the function from a different angle. It’s a smooth function with period 2π. For reasons I wrote about here, this means that the trapezoid rule should find its integral very efficiently.

In general, the error in the trapezoid rule is on the order of 1/N² where N is the number of integration points. To be more specific, the error in integrating a function f over [a, b] with N points is bounded by

(baM / 12N²

where M is the maximum absolute value of the second derivative of f. So in our case we should expect the error to be less than 82.67/N². In fact we do much better than that. The error does not decrease quadratically, as it does in general, but exponentially.

With just 16 integration points, we’ve reached the limit of floating point representation.

Partitions over permutations

I was thinking more about the cosine approximation to the Gaussian

exp(−z²) ≈ (1 + cos(sin(z) + z))/2

that I wrote about last week. The two expressions above are close along the real axis but not along the imaginary axis.

If ziy, the right side grows much faster than the left, behaving like exp(exp(y)).

This led to me looking up the power series for the double exponential function exp(exp(y)). This is an interesting series because the coefficient of xn is

e Bn / n!

where Bn is the nth Bell number, which equals the number of ways to partition a set of n labeled items [1]. And of course n! is the number of ways to permute a set of n labeled items. So the nth coefficient in the power series for exp(exp(y)) is the ratio of the number of partitions to permutations for a set of n labeled things, multiplied by e.

The number of ways to partition a set of n things grows quickly as n increases, almost as quickly as the number of permutations, and so the series for the double exponential function converges very slowly.

Computing

SymPy has a function bell for computing Bell numbers, so you could compute the ratio of partitions to permutations as follows.

from sympy import bell, factorial
f = lambda n: bell(n)/factorial(n)

This returns a number of type sympy.core.numbers.Rational and so the result is exact. You can cast it to float for convenience.

Asymptotics

If we look at only the terms in the asymptotic series for log Bn and log n! that grow with n we have

log Bn ~ n log n − n log log n
log n! ~ n log n − ½ log n

and so

log( Bnn! ) ~ ½ log n − n log log n

There’s also an asymptotic series for log( Bnn! ) involving the Lambert W function:

log( Bnn! ) ~ n/r − 1 − n log r

where r = W(n).

Related posts

[1] It’s important that the items are labeled. Partition numbers are the number of partitions of an unlabeled set. Partition numbers are much smaller than Bell numbers.

It’s not just Taylor series

There is still active discussion on X about the approximation

exp(−x²) ≈ (1 + cos(sin(x) + x))/2

and some are saying this can just be explained by Taylor series: the series for the two sides differ for the first time at the x6 term, so that’s why you get a good approximation. As I wrote yesterday, that’s only part of it.

If it were just about Taylor series you could use

exp(−x²) ≈ 1 − x² + x4/2

which also has error O(x6). But this approximation is only good for fairly small x, say in [−0.5, 0.5], whereas the approximation at the top of the post is good over [−4, 4]. When x = 4, the error in the cosine approximation is 0.002579 but the error in the Taylor approximation is 113, five orders of magnitude larger.

If the accuracy of the cosine approximation were due to Taylor series, then we’d expect the function

exp(−x²) − (1 + cos(sin(x) + x))/2

to be small not just over the interval [−4, 4] but over a disk of radius 4 in the complex plane. But it’s not. When x = 4i the function is on the order of 1013.

Both the cosine approximation and the Taylor approximation are good for small disks, say of radius 0.5. They’re both bad for much larger disks, and in fact the cosine approximation is worse.

 

Spot checking polynomial identities

If a polynomial identity holds at a few random points, it’s very like true. We’ll make this statement more precise, but first let’s look at some applications.

You may want to test an identity that naturally presents itself as a statement that two polynomials are equal. Or you might use something like the binomial coefficient trick to reframe a problem that isn’t obviously an identity about polynomials. And with algebraic circuits, you can reformulate a wide range of computations as polynomial identities; this is widely used in zero-knowledge proofs.

The theorem alluded to at the top of the post is the Schwartz-Zippel lemma. It is formulated in terms of the probability of a non-zero polynomial P evaluating to zero. To prove that two polynomials Q1 and Q2 are equal, you can show that

P = Q1(x) − Q2(x) = 0.

Schwartz-Zippel lemma

Let F be a (typically large) finite field and let P be a non-zero polynomial in n variables

P(x1, x2, x3, …, xn)

of total degree d. If we choose the x‘s randomly from F then the probability that P evaluates to zero is no more than d/|F|. [1]

If the total degree d is small relative to the size of the field, then the probability of P evaluating to zero is small. As long as d is less than |F|, you can evaluate the polynomial k times to make

(d / |F|)k

as small as you’d like. If d isn’t too large, and F is large, like the integers mod p = 2255 − 19 used in cryptography, one polynomial evaluation might be enough to give convincing evidence that the polynomial is zero.

 

[1] The Schwartz-Zippel lemma in its full generality applies to polynomials over an integral domain R with variables drawn from S, a finite subset of R. Here we’re setting RSF.

Online (one-pass) algorithms

Canonical example

The sample variance of a set of numbers is defined in terms of the sum of the squared distances from each point to the mean.

s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i -\bar{x})^2

So it would seem that you first need to calculate the mean, then go back and compute the squared differences from the mean. And yet sample variance can be computed in one pass through the data.

You’ll find two equivalent equations in statistics books: the one described above and another based on the sum of the data points and the sum of the data points squared.

s^2 = \frac{1}{n(n-1)}\left(n\sum_{i=1}^n x_i^2 -\left(\sum_{i=1}^n x_i\right)^2\right)

While this equation is theoretically correct, it is numerically unstable. Code that directly implements this equation can return a negative value for a quantity that is theoretically positive. I’ve seen this happen with real data, causing a program to crash when taking the square root of the variance to get the standard deviation.

However, there is an algorithm that computes mean and variance in one pass that is accurate and numerically stable. This algorithm was developed by B. P. Welford in 1962. I discuss Welford’s algorithm and give code for implementing it here.

Online algorithms

Welford’s algorithm is known in computer science as an “online” algorithm. This term was coined well before the Internet. For example, see the paper [1] from 1965.

But of course now “online” means something else, and so the technical and colloquial uses of “online algorithm” have split. Technical literature uses the phrase to describe the kinds of algorithms in this post. Most people would take “online algorithm” to mean code that runs on a remote server. You may see “streaming algorithm” as a contemporary technical term, but I’d still search on “online algorithm” to find papers.

Computing higher moments online

Welford’s algorithm computes the first two moments, mean and variance, of a data set online. It is also possible to compute skewness and kurtosis online, as well as higher moments.

Online regression

Simple linear regression is closely related to calculating mean and variance, and it is possible to compute simple regression coefficients online. I have some old notes on this here.

This post was motivated by an email asking me about multiple regression. It is also possible to compute multiple regression coefficients online, but I haven’t done this. I found a couple references, [2] and [3], but I have not read them. There is a simple procedure for two predictor variables but I believe things get a little more complicated with three or more predictors, requiring a recursive least squares algorithm.

Related posts

The notion of online algorithms is closely related to the notion of a fold in functional programming. Here are several posts on computing things with folds.

[1] One-Tape, Off-Line Turing Machine Computations by F. C. Hennie. Information and Control. 8, 553-578 (1965). Available here. In this paper Hennie writes “In an on-line computation the input data are supplied to the machine, one symbol at a time, at a special input terminal. … In an off-line computation all of the input symbols are written on one of the machine’s tapes prior to the start of the computation.

[2] Arthur Albert and Robert W. Sittler, “A Method for Computing Least Squares Estimators that Keep Up with the Data,” Journal of the Society for Industrial and Applied Mathematics, Series A: Control, 3(3), 384–417, 1965. DOI: 10.1137/0303026.

[3] Petre Stoica and Per Ashgren. Exact initialization of the recursive least-squares algorithm. Int. J. Adapt. Control Signal Process. 2002; 16:219&ndashh;230.

Turning K-L divergence into a metric

Kullback-Leibler divergence

Kullback-Leibler divergence is defined for two random variables X and Y by

D_{KL}(X || Y) = -\int f_X(x) \log \frac{f_Y(x)}{f_X(x)} \, dx

K-L divergence is non-negative, and it’s zero if and only if X and Y have the same distribution. But it is not a metric, for reasons explained here. For one thing, it’s not symmetric.

Jeffreys divergence

We can fix the symmetry problem by defining

J(X, Y) = D_{KL}(X || Y)  + D_{KL}(Y || X)

The J above stands for Jeffreys, for Harold Jeffreys. J is called either the symmetrized K-L divergence or Jeffreys’ divergence. It’s still a divergence, not a distance.

A distance (metric) d has to have four properties:

  1. d(x, x) = 0
  2. d(xy) > 0 if xy
  3. d(xy) = d(yx)
  4. d(xz) ≤ d(xy) + d(yz)

K-L divergence satisfies the first two properties. Jeffreys’ divergence satisfies the first three, but not the last one, the triangle inequality.

To show that J doesn’t satisfy the triangle inequality, let X, Y, and Z be Bernoulli random variables with p equal to 0.1, 0.2, and 0.3 respectively. Then the following Python code shows that the divergence from X to Y, plus the divergence from Y to Z, is less than the divergence from X to Z. This would be like saying you could get from LA to NYC faster by having a layover in Denver rather than taking a direct flight.

from math import log

kl = lambda p, q: p*log(p/q) + (1-p)*log((1-p)/(1-q))
j  = lambda p, q: kl(p, q) + kl(q, p)

a = j(0.1, 0.2)
b = j(0.2, 0.3)
c = j(0.1, 0.3)
print(a + b, c)

This prints 0.135 and 0.270.

Jensen-Shannon distance

Jensen-Shannon distance turns K-L divergence into a metric as follows. First, define the random variable M to be the average of X and Y. Then average the K-L divergence from M to each of X and Y. This defines the Jensen-Shannon divergence. It’s still not a metric, but its square root is, which defines the Jensen-Shannon distance.

\begin{align*} M &= (X + Y)/2 \\ \text{JSD}(X, Y) &= \tfrac{1}{2}D_{KL}(X || M) + \tfrac{1}{2}D_{KL}(Y || M) \\ d_{JS}(X, Y) &= \sqrt{\text{JSD}(X, Y)} \end{align*}

The following code gives an example of Jensen-Shannon distance satisfying the triangle inequality.

def d(p, q):
    m = 0.5*(p + q)
    jsd = 0.5*kl(p, m) + 0.5*kl(q, m) 
    return jsd**0.5

a = d(0.1, 0.2)
b = d(0.2, 0.3)
c = d(0.1, 0.3)
print(a + b, c)

This prints 0.1817 and 0.1801. Now a layover makes the trip longer.