Online (one-pass) algorithms

Canonical example

The sample variance of a set of numbers is defined in terms of the sum of the squared distances from each point to the mean.

s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i -\bar{x})^2

So it would seem that you first need to calculate the mean, then go back and compute the squared differences from the mean. And yet sample variance can be computed in one pass through the data.

You’ll find two equivalent equations in statistics books: the one described above and another based on the sum of the data points and the sum of the data points squared.

s^2 = \frac{1}{n(n-1)}\left(n\sum_{i=1}^n x_i^2 -\left(\sum_{i=1}^n x_i\right)^2\right)

While this equation is theoretically correct, it is numerically unstable. Code that directly implements this equation can return a negative value for a quantity that is theoretically positive. I’ve seen this happen with real data, causing a program to crash when taking the square root of the variance to get the standard deviation.

However, there is an algorithm that computes mean and variance in one pass that is accurate and numerically stable. This algorithm was developed by B. P. Welford in 1962. I discuss Welford’s algorithm and give code for implementing it here.

Online algorithms

Welford’s algorithm is known in computer science as an “online” algorithm. This term was coined well before the Internet. For example, see the paper [1] from 1965.

But of course now “online” means something else, and so the technical and colloquial uses of “online algorithm” have split. Technical literature uses the phrase to describe the kinds of algorithms in this post. Most people would take “online algorithm” to mean code that runs on a remote server. You may see “streaming algorithm” as a contemporary technical term, but I’d still search on “online algorithm” to find papers.

Computing higher moments online

Welford’s algorithm computes the first two moments, mean and variance, of a data set online. It is also possible to compute skewness and kurtosis online, as well as higher moments.

Online regression

Simple linear regression is closely related to calculating mean and variance, and it is possible to compute simple regression coefficients online. I have some old notes on this here.

This post was motivated by an email asking me about multiple regression. It is also possible to compute multiple regression coefficients online, but I haven’t done this. I found a couple references, [2] and [3], but I have not read them. There is a simple procedure for two predictor variables but I believe things get a little more complicated with three or more predictors, requiring a recursive least squares algorithm.

Related posts

The notion of online algorithms is closely related to the notion of a fold in functional programming. Here are several posts on computing things with folds.

[1] One-Tape, Off-Line Turing Machine Computations by F. C. Hennie. Information and Control. 8, 553-578 (1965). Available here. In this paper Hennie writes “In an on-line computation the input data are supplied to the machine, one symbol at a time, at a special input terminal. … In an off-line computation all of the input symbols are written on one of the machine’s tapes prior to the start of the computation.

[2] Arthur Albert and Robert W. Sittler, “A Method for Computing Least Squares Estimators that Keep Up with the Data,” Journal of the Society for Industrial and Applied Mathematics, Series A: Control, 3(3), 384–417, 1965. DOI: 10.1137/0303026.

[3] Petre Stoica and Per Ashgren. Exact initialization of the recursive least-squares algorithm. Int. J. Adapt. Control Signal Process. 2002; 16:219&ndashh;230.

Turning K-L divergence into a metric

Kullback-Leibler divergence

Kullback-Leibler divergence is defined for two random variables X and Y by

D_{KL}(X || Y) = -\int f_X(x) \log \frac{f_Y(x)}{f_X(x)} \, dx

K-L divergence is non-negative, and it’s zero if and only if X and Y have the same distribution. But it is not a metric, for reasons explained here. For one thing, it’s not symmetric.

Jeffreys divergence

We can fix the symmetry problem by defining

J(X, Y) = D_{KL}(X || Y)  + D_{KL}(Y || X)

The J above stands for Jeffreys, for Harold Jeffreys. J is called either the symmetrized K-L divergence or Jeffreys’ divergence. It’s still a divergence, not a distance.

A distance (metric) d has to have four properties:

  1. d(x, x) = 0
  2. d(xy) > 0 if xy
  3. d(xy) = d(yx)
  4. d(xz) ≤ d(xy) + d(yz)

K-L divergence satisfies the first two properties. Jeffreys’ divergence satisfies the first three, but not the last one, the triangle inequality.

To show that J doesn’t satisfy the triangle inequality, let X, Y, and Z be Bernoulli random variables with p equal to 0.1, 0.2, and 0.3 respectively. Then the following Python code shows that the divergence from X to Y, plus the divergence from Y to Z, is less than the divergence from X to Z. This would be like saying you could get from LA to NYC faster by having a layover in Denver rather than taking a direct flight.

from math import log

kl = lambda p, q: p*log(p/q) + (1-p)*log((1-p)/(1-q))
j  = lambda p, q: kl(p, q) + kl(q, p)

a = j(0.1, 0.2)
b = j(0.2, 0.3)
c = j(0.1, 0.3)
print(a + b, c)

This prints 0.135 and 0.270.

Jensen-Shannon distance

Jensen-Shannon distance turns K-L divergence into a metric as follows. First, define the random variable M to be the average of X and Y. Then average the K-L divergence from M to each of X and Y. This defines the Jensen-Shannon divergence. It’s still not a metric, but its square root is, which defines the Jensen-Shannon distance.

\begin{align*} M &= (X + Y)/2 \\ \text{JSD}(X, Y) &= \tfrac{1}{2}D_{KL}(X || M) + \tfrac{1}{2}D_{KL}(Y || M) \\ d_{JS}(X, Y) &= \sqrt{\text{JSD}(X, Y)} \end{align*}

The following code gives an example of Jensen-Shannon distance satisfying the triangle inequality.

def d(p, q):
    m = 0.5*(p + q)
    jsd = 0.5*kl(p, m) + 0.5*kl(q, m) 
    return jsd**0.5

a = d(0.1, 0.2)
b = d(0.2, 0.3)
c = d(0.1, 0.3)
print(a + b, c)

This prints 0.1817 and 0.1801. Now a layover makes the trip longer.

The Meta logo and fitting Besace curves

I saw a post yesterday saying that the Meta logo is a Besace curve.

Meta logo

A Besace curve has the implicit form

(x^2 - by)^2 = a^2(x^2 - y^2)

and the parametric form

\begin{align*} x &= a\cos(t) - b \sin(t) \\ y &= -\sin(t) x\end{align*}

where t ranges over [0, 2π].

So given a Besace curve, such as the Meta logo, how do you find the parameters a and b to fit the curve?

We can rewrite the parametric expression for x as a sine with a phase shift (see notes here)

x = A \sin(t + \phi)

where

\begin{align*} A &= \sqrt{a^2 + b^2} \\ \phi &= -\arctan(a/b)\end{align*}

Also, we can rewrite the parametric expression for y as

\begin{align*} y &= A \sin(t) \sin(t + \phi) \\ &= \frac{A}{2} \left(\cos(\phi) - \cos(2t + \phi)\right) \\ \end{align*}

Now the extreme values of x and y are easier to see. The maximum value of x is A and the minimum value is −A. The maximum value of y is A(cos(φ) + 1)/2 and the minimum value is A(cos(φ) − 1)/2.

We can simplify the cosine of an artangent (see here) to find the height, i.e. the difference between the maximum and minimum y value, in terms of a and b.

\begin{align*} \cos(\phi) &= \cos(-\arctan(a/b)) \\ &= \frac{1}{\sqrt{1 + (a/b)^2}} \\ &= \frac{b}{\sqrt{a^2 + b^2}} \end{align*}

Then the height is given by

\begin{align*} h &= \frac{A}{2}(\cos(\phi) + 1) - \frac{A}{2}(\cos(\phi) - 1) \\ &= A \cos(\phi) + A \\ &= b + \sqrt{a^2 + b^2} \end{align*}

The width is given by

w = 2A = 2\sqrt{a^2 + b^2}

and so

b = h - w/2

and

a = \pm \sqrt{\frac{w^2}{4} - b^2}

Now the Meta logo is drawn with a thick line, and the line width isn’t constant. It’s a little fuzzy what the height and width of the middle of the curve are, but I estimated h = 120 and w = 200 from one image. This leads to b = 20 and a = 97.98.

The Mathematica code

ParametricPlot[{a Cos[t] + 
   b Sin[t], -Sin[t] ( a Cos[t] + b Sin[t])}, {t, 0, 2 Pi}, 
 PlotStyle -> Thickness[0.05]]

produces the following image.

Mathematica approximation of Meta logo

This is reminiscent of the Meta logo, but not a great match. I suspect the logo is not exactly a Besace curve. You could tinker with the a and b parameters and the aspect ratio to get a closer match. The logo may have been inspired by a Besace curve and then drawn by hand.

Calculating the expected range of normal samples

The previous post looked at the expected IQ range in a jury of 12. This post will look more generally at computing the expected range of n samples from a N(0, 1) random variable. This will give the expected range in units of σ, i.e. multiply the results by σ if your σ isn’t 1.

As mentioned in the previous post, the expected range is given by

d_n = 2n \int_{-\infty}^\infty \Phi(x)^{n-1} \, x\,\phi(x) \, dx

where φ and Φ are the PDF and CDF of a standard normal. The integral can be calculated in closed form for n ≤ 5, but in general it requires numerical integration [1].

The following Python code can compute dn.

from scipy.stats import norm
from scipy.integrate import quad
import numpy as np

def d(n):
    integrand = lambda x: x*norm.pdf(x)*norm.cdf(x)**(n-1)
    res, info = quad(integrand, -np.inf, np.inf)
    return 2*n*res

For large n we have the asymptotic approximation

d_n = 2 \Phi^{-1}\left( \frac{n \,–\, 0.375}{ n + 0.25} \right)

which we could implement in Python by

def approx(n):
    return 2*norm.ppf((n - 0.375)/(n + 0.25))

For very large n the asymptotic expression may be more accurate than the integral due to numerical integration error.

Here are a few example values.

|-----+-------|
|   n |   d_n |
|-----+-------|
|   2 | 1.128 |
|   3 | 1.693 |
|   5 | 2.326 |
|  10 | 3.078 |
|  12 | 3.258 |
|  23 | 3.858 |
|  50 | 4.498 |
| 100 | 5.015 |
|-----+-------|

[1] Order Statistics by H. A. David. John Wiley & Sons. 1970.

Expected IQ spread on a jury

There’s been some discussion online lately about how a large difference in IQ makes it difficult for two people to communicate. There have been studies that confirm this effect. The difficulty is not insurmountable, but it takes deliberate effort to overcome.

Someone dismissed this communication difficulty by pointing out that the expected difference in IQ between two individuals is around 17, suggesting that most communication is between people who differ by more than one standard deviation in IQ. But this calculation assumes people are chosen at random, which they usually are not. People tend to live around and work around others of similar intelligence.

However, a jury is a random sample. It’s not a perfect random sample. For one thing, it starts with a random sample of people who are registered to vote, or who have a drivers license, not all individuals. Furthermore, the pool of potential jurors is reduced to a jury through the process of voir dire, which is not random.

For this post I will make the simplifying assumption that a jury is a random sample from a population with normally distributed IQ with standard deviation σ = 15. The mean doesn’t matter here, but you could assume it’s 100 if you’d like.

By symmetry, the expected range of n samples from a normal random variable is twice the maximum. For n = 12 the range is about 3.26σ, which corresponds to nearly 50 IQ points.

This suggests there’s usually a big spread of IQ on a jury. Even if IQ doesn’t measure intelligence, it measures something, and that something varies a lot over 12 people chosen at random [1].

Related posts

[1] In case you’re interested in the technical details, the expected range of n samples from a standard normal random variable is given by

d_n = 2n \int_{-\infty}^\infty \Phi(x)^{n-1} \, x\,\phi(x) \, dx

where φ and Φ are the PDF and CDF of a standard normal. Multiply this by σ to get the range of a normal random variable with standard deviation σ. As for how to calculate dn, see the next post.

Hilbert transform as an infinite matrix

The previous post linked to a post I wrote a few years ago about the Hilbert transform and Fourier series. That post says that if the Fourier series of a function is

f(t) = \sum_{n=1}^\infty \left\{ a_n \sin(nt) + b_n\cos(nt) \right\}

then the Fourier series of its Hilbert transform is

f_H(x) = \sum_{n=1}^\infty \left\{ -b_n \sin(nx) + a_n\cos(nx) \right\}

When I looked back at that post I thought about how if you thought of the Fourier coefficients as elements of an infinite vector then the Hilbert transform can be represented as multiplying by an infinite block matrix.

\left[ \begin{array}{cc|cc|cc|c} 0 & -1 & 0 & 0 & 0 & 0 & \cdots \\ 1 & 0 & 0 & 0 & 0 & 0 & \cdots \\ \hline 0 & 0 & 0 & -1 & 0 & 0 & \cdots \\ 0 & 0 & 1 & 0 & 0 & 0 & \cdots \\ \hline 0 & 0 & 0 & 0 & 0 & -1 & \cdots \\ 0 & 0 & 0 & 0 & 1 & 0 & \cdots \\ \hline \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots \end{array} \right] \left[ \begin{array}{c} a_1 \\ b_1 \\ \hline a_2 \\ b_2 \\ \hline a_3 \\ b_3 \\ \hline \vdots \end{array} \right]

I rarely see infinite matrices except in older math books. Apparently they were more fashionable a few decades ago than they are now. I suppose the notation falls between two stools, too concrete for some tastes and not concrete enough for others. The former folks would prefer something like H and the latter would prefer the sum above.

Real and imaginary parts

The previous post announced some notes I wrote up based on an article by Henry Baker implementing functions of a complex variable in terms of functions of a real variable. That is, it finds functions u(x, y) and v(x, y) such that

f(x + iy) = u(x, y) + i v(x, y)

where xyu, and v are all real-valued. Not only that, but if f is an elementary function, so are u and v. Here “elementary” has a technical meaning, but essentially it means functions that you could evaluate on a scientific calculator. A couple of the functions might be unfamiliar, such as sgn and atan2, but there are no functions like the gamma function that are defined in terms of integrals.

One application of Baker’s equations would be to bootstrap a math library that doesn’t support complex numbers into one that does. But the equations could be useful in pure math when you’d like to have a convenient expression for the real or imaginary part of a function.

The real and imaginary parts of a complex analytic function are harmonic functions. So the functions on the right hand side of Baker’s equations satisfy Laplace’s equation:

uxx + uyy = 0

and

vxx + vyy = 0.

Furthermore, the functions u and v form harmonic conjugate pairs, meaning each is the Hilbert transform of the other.

Building complex functions out of real parts

A couple months ago I wrote about how to compute the sine and cosine of a complex number using only real functions of real variables using the equations

\begin{align*} \sin(x + iy) &= \sin x \cosh y + i \cos x \sinh y \\ \cos(x + iy) &= \cos x \cosh y - i \sin x \sinh y \\ \end{align*}

You can do something analogous for all the elementary functions, though some of the equations are quite a bit more complicated than the ones above. See the equations here.

The equations come from a paper by Henry G. Baker, cited in the linked page. I wrote up Baker’s equations in LaTeX, then used ChatGPT to generate Python code from the LaTeX to numerically verify the equations and my typesetting of them. This caught a few typos on my part.

The test code evaluated the equations at points from each quadrant. All matched NumPy, implying that Baker and NumPy use the same branch cuts on inverse functions.

***

This post is part of a thread that has gone on for a few days. Maybe it’s the last post in the thread; we’ll see.

It all started with a post on Markov’s equation

x² + y² + z² = 3 xyz

and an approximation to the equation that has a closed-form solution. That led to the identity

cosh( arccosh(a) + arccosh(b) ) = ab + √(a² − 1) √(b² − 1).

The approximation to Markov’s equation only needed the identity to be valid for real a and b greater than 1. But when I looked closer at the identity I found several complications with branch cuts. The identity doesn’t hold everywhere using the principle branch of the square root function. But if you define √(z² − 1) to have a branch cut along [−1, 1] then the equation holds everywhere in the complex plane. And that led to my writing up some notes on how to define all the elementary inverse functions in terms of log.

Someone reading these posts suggested I look at a paper that mentioned “couth” and “uncouth” function pairs, which led to this post and its warm up.

I find all this interesting because it’s an advanced perspective on a questions that are latent in an intro calculus class. What exactly do functions like arccos mean and why where they defined as they were? These are fairly deep and interesting questions that are swept under the rug, and swept there for good reason. A calculus class has to cover an enormous amount of material and there’s no time to dwell on fine points. Some of my favorite posts look back leisurely on things that go by in a blur when you’re a student.

Couth and uncouth function pairs

“You can’t always get what you want. But sometimes you get what you need.” — The Rolling Stones

Circular functions and hyperbolic functions aren’t invertible, but we invert them anyway. These functions map many points in the domain to each point in the range, and we invert them by mapping a point in the range back to some point in the domain. Often this works as expected, but sometimes it doesn’t.

In the previous post I said

You can relate each trig function “foo” with its hyperbolic counterpart “fooh” by applying one of the functions to iz then multiplying by a constant c that depends on foo: ci for sin and tan, c = 1 for cos and sec, and c = −i for csc and cot.

In symbols,

c foo(z) = fooh(iz).

Let’s suppose foo and fooh are invertible, ignoring any complications, and solve foo(z) = w for z. We get

i foo−1(w) = fooh−1(cw)

or using “arc” nomenclature for inverse functions

i arcfoo(w) = arcfooh(cw).

When the naive calculation above holds, except possibly at a finite number of points, we say the pair (foo, fooh) is couth. Otherwise we say the pair is uncouth. These term were coined by Robert Corless and his coauthors in their paper [1].

Whether the pair (foo, fooh) is couth depends not only on foo and fooh, but also on the details of how arcfoo and arcfooh are defined.

In Python’s NumPy library, the pairs (sin, sinh) and (tan, tanh) are couth, but the pair (cos, cosh) is uncouth.

Numpy doesn’t define the reciprocal functions sec, sech, csc, csch, cot, and coth. I used to find that annoying, but I’m beginning to think that was wise. These functions cause problems. For example, there may be two reasonable ways to define these functions, one of which forms a couth pair and one of which forms an uncouth pair.

For example, how should you define cot and coth? There would be no disagreement over the definition

cot = lambda x: 1/tan(x)

but there are at least two definitions of inverse coth that you’ll find in practice:

arccot = lambda z: 0.5*pi - arctan(z)
arccot = lambda z: arctan(1/z).

Both definitions have their advantages, but the former is uncouth while the latter is couth. You can verify that both definitions are the same at z = 1 but not at z = −1.

With the following definitions, the pairs (cos, cosh) and (sec, sech) are uncouth and the rest are couth.

from numpy import *

csc     = lambda x: 1/sin(x)
sec     = lambda x: 1/cos(x)
cot     = lambda x: 1/tan(x)
csch    = lambda x: 1/sinh(x)
sech    = lambda x: 1/cosh(x)
coth    = lambda x: 1/tanh(x)

arccot  = lambda z: arctan(1/z)
arcsec  = lambda z: arccos(1/z)
arccsc  = lambda z: arcsin(1/z)
arccoth = lambda z: arctanh(1/z)
arcsech = lambda z: arccosh(1/z)
arccsch = lambda z: arcsinh(1/z)

[1] “According to Abramowitz and Stegun” or arccoth needn’t be uncouth. Robert M. Corless et al. ACM SIGSAM Bulletin, Volume 34, Issue 2, pp 58 – 65 https://doi.org/10.1145/362001.362023

Circular and hyperbolic functions differ by rotations

The difference between a circular function and a hyperbolic function is a rotation or two.

For example, cosh(z) = cos(iz). You can read that as saying that to find the hyperbolic cosine of z, first you rotate z a quarter turn to the left (i.e. multiply by i) and then take the cosine.

For another example, sinh(z) = −i sin(iz). This says that you can calculate the hyperbolic sine of z by rotating z to the left, taking the sine, and then rotating to the right.

You can relate each trig function “foo” with its hyperbolic counterpart “fooh” by applying one of the functions to iz then multiplying by a constant c that depends on foo: ci for sin and tan, c = 1 for cos and sec, and c = −i for csc and cot.

Note that if the constant for foo is c, the constant for 1/foo is 1/c. So, for example, the constant for tan is i and the constant for cot is 1/i = −i.

We have four groups of equations, depending on whether the left side of the equation is foo(iz), fooh(iz), foo(z), or fooh(z).

This post was written as a warm-up for the next post on couth and uncouth function pairs.

foo(iz)

\begin{align*} \sin(iz) & = \phantom{-}i\sinh(z) \\ \cos(iz) & = \phantom{-i}\cosh(z) \\ \tan(iz) & = \phantom{-}i\tanh(z) \\ \csc(iz) & = -i\text{csch}(z) \\ \sec(iz) & = \phantom{-i}\text{sech}(z) \\ \cot(iz) & = -i\coth(z) \\ \end{align*}

fooh(iz)


\begin{align*} \sinh(iz) & = \phantom{-}i\sin(z) \\ \cosh(iz) & = \phantom{-i}\cos(z) \\ \tanh(iz) & = \phantom{-}i\tan(z) \\ \text{csch}(iz) & = -i\csc(z) \\ \text{sech}(iz) & = \phantom{-i}\sec(z) \\ \coth(iz) & = -i\cot(z) \\ \end{align*}

foo(z)

\begin{align*} \sin(z) & = -i\sinh(iz) \\ \cos(z) & = \phantom{-i}\cosh(iz) \\ \tan(z) & = -i\tanh(iz) \\ \csc(z) & = \phantom{-}i\text{csch}(iz) \\ \sec(z) & = \phantom{-i}\text{sech}(iz) \\ \cot(z) & = \phantom{-}i\coth(iz) \\ \end{align*}

fooh(z)

\begin{align*} \sinh(z) & = -i\sin(iz) \\ \cosh(z) & = \phantom{-i}\cos(iz) \\ \tanh(z) & = -i\tan(iz) \\ \text{csch}(z) & = \phantom{-}i\csc(iz) \\ \text{sech}(z) & = \phantom{-i}\sec(iz) \\ \coth(z) & = \phantom{-}i\cot(iz) \\ \end{align*}