Drawing with a compass on a globe

Take a compass and draw a circle on a globe. Then take the same compass, opened to the same width, and draw a circle on a flat piece of paper. Which circle has more area?

If the circle is small compared to the radius of the globe, then the two circles will be approximately equal because a small area on a globe is approximately flat.

To get an idea of what happens for larger circles, let’s draw a circle on the globe as large as possible, i.e. the equator. If the globe has radius r, then to draw the equator we need our compass to be opened to a width of √2 r, the distance from the north pole to the equator along a straight line cutting through the globe.

The area of a hemisphere is 2πr². If we take our compass and draw a circle of radius √2 r on a flat surface we also get an area of 2πr². And by continuity we should expect that if we draw a circle that is nearly as big as the equator then the corresponding circle on a flat surface should have approximately the same area.

Interesting. This says that our compass will draw a circle with the same area whether on a globe or on a flat surface, at least approximately, if the width of the compass is sufficiently small or sufficiently large. In fact, we get exactly the same area, regardless of how wide the compass is opened up. We haven’t proven this, only given a plausibility argument, but you can find a proof in [1].

Note that the width w of the compass is the radius of the circle drawn on a flat surface, but it is not the radius of the circle drawn on the globe. The width w is greater than the radius of the circle on the globe, but less than the distance along the sphere from the center of the circle to the circle itself. In the case of the equator, the radius of the circle is r, the width of the compass is √2 r, and the distance along the sphere from the north pole to the equator is πr/2.
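
Here’s a quick numerical check of the equal-area claim, a minimal Python sketch (using NumPy, assuming a unit sphere). The compass width is the chord from the pole to the edge of the spherical cap, and the cap area 2πr²(1 − cos θ) agrees with πw² exactly.

    import numpy as np

    r = 1.0                                    # sphere radius
    theta = np.linspace(0.01, np.pi, 200)      # polar angle from the pole to the edge of the cap
    w = 2 * r * np.sin(theta / 2)              # compass width = chord from the pole to the cap edge
    cap_area = 2 * np.pi * r**2 * (1 - np.cos(theta))   # area of the spherical cap
    flat_area = np.pi * w**2                   # area of a flat circle of radius w

    print(np.max(np.abs(cap_area - flat_area)))         # zero up to floating point error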


[1] Nick Lord. On an alternative formula for the area of a spherical cap. The Mathematical Gazette, Vol. 102, No. 554 (July 2018), pp. 314–316

The negative binomial distribution and Pascal’s triangle

The Poisson probability distribution gives a simple, elegant model for count data. You can even derive from certain assumptions that data must have a Poisson distribution. Unfortunately reality doesn’t often go along with those assumptions.

A Poisson random variable with mean λ also has variance λ. But it’s often the case that data that would seem to follow a Poisson distribution has a variance greater than its mean. This phenomenon is called over-dispersion: the dispersion (variance) is larger than a Poisson distribution assumption would allow.

One way to address over-dispersion is to use a negative binomial distribution. This distribution has two parameters, r and p, and has the following probability mass function (PMF).

P(X = x) = \binom{r + x - 1}{x} p^r(1-p)^x

As the parameter r goes to infinity, with the mean held fixed, the negative binomial distribution converges to a Poisson distribution. So you can think of the negative binomial distribution as a generalization of the Poisson distribution.

These notes go into the negative binomial distribution in some detail, including where its name comes from.

If the parameter r is a positive integer, then the binomial coefficients in the PMF for the negative binomial distribution lie on the rth diagonal of Pascal’s triangle.

Pascal's triangle

The case r = 1 corresponds to the first diagonal, the one consisting of all 1s. The case r = 2 corresponds to the second diagonal, consisting of consecutive integers. The case r = 3 corresponds to the third diagonal, the one consisting of triangular numbers. And so forth.
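
As a quick check, here’s a minimal Python sketch (using scipy.special.comb) that lists the coefficients C(r + x − 1, x) from the PMF for small r; each row is a diagonal of Pascal’s triangle.

    from scipy.special import comb

    # Coefficients C(r + x - 1, x) from the negative binomial PMF, for r = 1, 2, 3
    for r in range(1, 4):
        print(r, [int(comb(r + x - 1, x)) for x in range(8)])
    # r = 1: all 1s (first diagonal)
    # r = 2: 1, 2, 3, 4, ... (second diagonal)
    # r = 3: 1, 3, 6, 10, ... triangular numbers (third diagonal)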


Variance matters more than mean in the extremes

Suppose you have two normal random variables, X and Y, and that the variance of X is less than the variance of Y.

Let M be an equal mixture of X and Y. That is, to sample from M, you first choose X or Y with equal probability, then you draw a sample from whichever random variable you chose.

Now suppose you’ve observed an extreme value of M. Then it is more likely that the value came from Y. The means of X and Y don’t matter, other than determining the cutoff for what “extreme” means.

High-level math

To state things more precisely, there is some value t such that the posterior probability that a sample m from M came from Y, given that |m| > t, is greater than the posterior probability that m came from X.

Let’s just look at the right-hand tails, even though the principle applies to both tails. If X and Y have the same variance, but the mean of X is greater, then larger values of M are more likely to have come from X. Now suppose the variance of Y is larger. As you go further out in the right tail of M, the posterior probability of an extreme value having come from Y increases, and eventually it surpasses the posterior probability of the sample having come from X. If X has a larger mean than Y, that will delay the point at which the posterior probability of Y passes the posterior probability of X, but eventually variance matters more than mean.

Detailed math

Let’s give a name to the random variable that determines whether we choose X or Y. Let’s call it C for coin flip, and assume C takes on 0 and 1 each with probability 1/2. If C = 0 we sample from X and if C = 1 we sample from Y. We want to compute the probability P(C = 1 | M ≥ t).

Without loss of generality we can assume X has mean 0 and variance 1. (Otherwise transform X and Y by subtracting off the mean of X and then dividing by the standard deviation of X.) Denote the mean of Y by μ and the standard deviation by σ.

From Bayes’ theorem we have

\text{P}(C = 1 \mid M \geq t) = \frac{ \text{P}(Y \geq t) }{ \text{P}(X \geq t) + \text{P}(Y \geq t) } = \frac{\Phi^c\left(\frac{t-\mu}{\sigma}\right)}{\Phi^c(t) + \Phi^c\left(\frac{t-\mu}{\sigma}\right)}

where Φc(t) = P(Z ≥ t) for a standard normal random variable Z.

Similarly, to compute P(C = 1 | M ≤ t), just flip the direction of the inequality signs, replacing Φc(t) = P(Z ≥ t) with Φ(t) = P(Z ≤ t).

The calculation for P(C = 1 | |M| ≥ t) is similar.

Example

Suppose Y has mean −2 and variance 10. The blue curve shows that a large negative sample from M very likely comes from Y, and the orange curve shows that a large positive sample very likely comes from Y as well.

The dip in the orange curve shows the transition zone where X’s advantage from its larger mean gives way to Y’s advantage from its larger variance. This illustrates that the posterior probability of Y eventually increases, but not necessarily monotonically.

Here’s a plot showing the probability of a sample having come from Y depending on its absolute value.
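
Here’s a minimal Python sketch (using scipy.stats, with X standard normal and Y as above) that computes the posterior probabilities behind these plots.

    import numpy as np
    from scipy.stats import norm

    X = norm(0, 1)                    # X: mean 0, variance 1
    Y = norm(-2, np.sqrt(10))         # Y: mean -2, variance 10

    t = np.linspace(-10, 10, 401)
    p_right = Y.sf(t) / (X.sf(t) + Y.sf(t))       # P(C = 1 | M >= t)
    p_left  = Y.cdf(t) / (X.cdf(t) + Y.cdf(t))    # P(C = 1 | M <= t)

    print(p_right[-1], p_left[0])     # both approach 1 far out in the tails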


Ptolemy’s theorem

Draw a quadrilateral by picking four arbitrary points on a circle and connecting them cyclically.

inscribed quadrilateral

Now multiply the lengths of the pairs of opposite sides. In the diagram below this means multiplying the lengths of the two horizontal-ish blue sides and the two vertical-ish orange sides.

quadrilateral with opposite sides colored

Ptolemy’s theorem says that the sum of the two products described above equals the product of the diagonals.

inscribed quadrilateral with diagonals

To put it in colorful terms, the product of the blue sides plus the product of the orange sides equals the product of the green diagonals.

The converse of Ptolemy’s theorem also holds. If the relationship above holds for a quadrilateral, then the quadrilateral can be inscribed in a circle.

Note that if the quadrilateral in Ptolemy’s theorem is a rectangle, then the theorem reduces to the Pythagorean theorem.
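
Here’s a quick numerical check of the identity, a minimal Python sketch (using NumPy): pick four random points in cyclic order on the unit circle and compare the two quantities.

    import numpy as np

    rng = np.random.default_rng(0)
    angles = np.sort(rng.uniform(0, 2 * np.pi, 4))      # four points in cyclic order on the unit circle
    A, B, C, D = (np.array([np.cos(t), np.sin(t)]) for t in angles)

    def dist(P, Q):
        return np.linalg.norm(P - Q)

    sides = dist(A, B) * dist(C, D) + dist(B, C) * dist(D, A)   # sum of products of opposite sides
    diagonals = dist(A, C) * dist(B, D)                         # product of the diagonals

    print(sides, diagonals)       # equal up to floating point error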


Rule for converting trig identities into hyperbolic identities

There is a simple rule of thumb for converting between (circular) trig identities and hyperbolic trig identities known as Osborn’s rule: stick an h on the end of trig functions and flip signs wherever two sinh functions are multiplied together.

Examples

For example, the circular identity

sin(θ + φ) = sin(θ) cos(φ) + cos(θ) sin(φ)

becomes the hyperbolic identity

sinh(θ + φ) = sinh(θ) cosh(φ) + cosh(θ) sinh(φ)

but the identity

2 sin(θ) sin(φ) = cos(θ − φ) − cos(θ + φ)

becomes

2 sinh(θ) sinh(φ) = cosh(θ + φ) − cosh(θ − φ)

because there are two sinh terms.
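
Both conversions are easy to confirm with a computer algebra system. Here’s a minimal SymPy sketch; rewriting in exponentials reduces each difference to zero.

    from sympy import symbols, sinh, cosh, exp, simplify

    theta, phi = symbols('theta phi')

    # Addition formula: no sign flip, since no two sinh factors are multiplied together
    d1 = sinh(theta + phi) - (sinh(theta)*cosh(phi) + cosh(theta)*sinh(phi))

    # Product formula: the sign flips because of the sinh * sinh term
    d2 = 2*sinh(theta)*sinh(phi) - (cosh(theta + phi) - cosh(theta - phi))

    print(simplify(d1.rewrite(exp)), simplify(d2.rewrite(exp)))   # 0 0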

Derivation

Osborn’s rule isn’t deep. It’s a straight-forward application of Euler’s theorem:

exp(iθ) = cos(θ) + i sin(θ).

More specifically, Osborn’s rule follows from two corollaries of Euler’s theorem:

sin(iθ) = i sinh(θ)
cos(iθ) = cosh(θ)

Why bother?

The advantage of Osborn’s rule is that it saves time, and perhaps more importantly, it reduces the likelihood of making a mistake.

You could always derive any identity you need on the spot. All trig identities—circular or hyperbolic—are direct consequences of Euler’s theorem. But it saves time to work at a higher level of abstraction. And as I’ve often said in the context of more efficient computer usage, the advantage of doing things faster is not so much the time directly saved but the decreased probability of losing your train of thought.

Caveats

Osborn’s rule includes implicit occurrences of sinh, such as in tanh = sinh / cosh. So, for example, the circular identity

tan(2θ) = 2 tan(θ) / (1 − tan²(θ))

becomes

tanh(2θ) = 2 tanh(θ) / (1 + tanh²(θ))

because the tanh² term implicitly contains two sinh terms.

Original note

Osborn’s original note [1] from 1902 is so short that I include the entire text below:


[1] G. Osborn. Mnemonic for Hyperbolic Formulae. The Mathematical Gazette, Vol. 2, No. 34 (Jul., 1902), p. 189

Interpolation and the cotanc function

This weekend I wrote three posts related to interpolation:

The first post looks at reducing the size of mathematical tables by switching from linear to quadratic interpolation. The immediate application is obsolete, but the principles apply to contemporary problems.

The second post looks at alternatives to Lagrange interpolation that are much better suited to hand calculation. The final post is a tangent off the middle post.

Tau and sigma functions

In the process of writing the posts above, I looked at Chambers Six Figure Mathematical Tables from 1964. There I saw a couple curious functions I hadn’t run into before, functions the author called τ and σ.

τ(x) = x cot x
σ(x) = x csc x

So why introduce these two functions? The cotangent and cosecant functions have a singularity at 0, and so it’s difficult to tabulate and interpolate these functions. I touched on something similar in my recent post on interpolating the gamma function: because the function grows rapidly, linear interpolation gives bad results. Interpolating the log of the gamma function gives much better results.

Chambers tabulates τ(x) and σ(x) because these functions are easy to interpolate.

The cotanc function

I’ll refer to Chambers’ τ function as the cotanc function. This is a whimsical choice, not a name used anywhere else as far as I know. The reason for the name is as follows. The sinc function

sinc(x) = sin(x)/x

comes up frequently in signal processing, largely because it’s the Fourier transform of the indicator function of an interval. There are a few other functions that tack a c onto the end of a function to indicate it has been divided by x, such as the jinc function.

The function tan(x)/x is sometimes called the tanc function, though this name is far less common than sinc. The cotangent function is the reciprocal of the tangent function, so I’m calling the reciprocal of the tanc function the cotanc function. Maybe it should be called the cotank function just for fun. For category theorists this brings up images of a tank that fires backward.

Practicality of the cotanc function

As noted above, the cotangent function is ill-behaved near 0, but the cotanc function is very nicely behaved near 0. The cotanc function has singularities at non-zero multiples of π but multiplying by x removes the singularity at 0.

As noted here, interpolation error depends on the size of the derivatives of the function being interpolated. Since the cotanc function is flat and smooth, it has small derivatives and thus small interpolation error.
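
To illustrate the point, here’s a small Python sketch (my own example, not one of Chambers’ tables): linearly interpolating cot directly between two tabulated points near 0, versus interpolating x cot x and then dividing by x.

    import numpy as np

    def cotanc(x):
        x = np.asarray(x, dtype=float)
        return np.where(x == 0.0, 1.0, x / np.tan(x))    # x cot x, with the removable singularity filled in

    a, b, t = 0.1, 0.2, 0.15                 # two table entries and a point between them
    exact = 1 / np.tan(t)

    direct  = np.interp(t, [a, b], [1/np.tan(a), 1/np.tan(b)])     # interpolate cot itself
    via_tau = np.interp(t, [a, b], [cotanc(a), cotanc(b)]) / t     # interpolate x cot x, then divide by x

    print(abs(direct - exact), abs(via_tau - exact))   # roughly 0.83 versus 0.006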

Binomial coefficients with non-integer arguments

When n and r are positive integers, with n ≥ r, there is an intuitive interpretation of the binomial coefficient C(n, r), namely the number of ways to select r things from a set of n things. For this reason C(n, r) is usually pronounced “n choose r.”

But what might something like C(4.3, 2) mean? The number of ways to choose two giraffes out of a set of 4.3 giraffes?! There is no combinatorial interpretation for binomial coefficients like these, though they regularly come up in applications.

It is possible to define binomial coefficients when n and r are real or even complex numbers. These more general binomial coefficients are in this liminal zone of topics that come up regularly, but not so regularly that they’re widely known. I wrote an article about this a decade ago, and I’ve had numerous occasions to link to it ever since.
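
For example, SciPy’s binom function accepts non-integer arguments, computing the coefficient from gamma functions. A quick sketch:

    from scipy.special import binom, gamma

    # C(x, y) = Γ(x + 1) / (Γ(y + 1) Γ(x - y + 1)) extends binomial coefficients to real arguments
    print(binom(4.3, 2))                           # 7.095
    print(gamma(5.3) / (gamma(3) * gamma(3.3)))    # same value, straight from the definition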

The previous post implicitly includes an application of general binomial coefficients. The post alludes to coefficients that come up in Bessel’s interpolation formula but doesn’t explicitly say what they are. These coefficients Bk can be defined in terms of the Gaussian interpolation coefficients, which are in turn defined by binomial coefficients with non-integer arguments.

\begin{eqnarray*} G_{2n} &=& {p + n - 1 \choose 2n} \\ G_{2n+1} &=& {p + n \choose 2n + 1} \\ B_{2n} &=& \frac{1}{2}G_{2n} \\ B_{2n+1} &=& G_{2n+1} - \frac{1}{2} G_{2n} \end{eqnarray*}

Note that 0 < p < 1.

The coefficients in Everett’s interpolation formula can also be expressed simply in terms of the Gauss coefficients.

\begin{eqnarray*} E_{2n} &=& G_{2n} - G_{2n+1} \\ F_{2n} &=& G_{2n+1} \\ \end{eqnarray*}
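
Putting these formulas into code is straightforward once you have general binomial coefficients. The helper functions below are my own names, a sketch rather than a standard API, using scipy.special.binom for the non-integer binomial coefficients.

    from scipy.special import binom

    def gauss(k, p):
        # G_{2n} = C(p + n - 1, 2n),  G_{2n+1} = C(p + n, 2n + 1)
        n = k // 2
        return binom(p + n - 1, 2*n) if k % 2 == 0 else binom(p + n, 2*n + 1)

    def bessel(k, p):
        # B_{2n} = G_{2n} / 2,  B_{2n+1} = G_{2n+1} - G_{2n} / 2
        return gauss(k, p) / 2 if k % 2 == 0 else gauss(k, p) - gauss(k - 1, p) / 2

    p = 0.3
    print([gauss(k, p) for k in range(5)])
    print([bessel(k, p) for k in range(5)])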

Interpolating the gamma function

Suppose you wanted to approximate Γ(10.3). You know it’s somewhere between Γ(10) = 9! and Γ(11) = 10!, and linear interpolation would give you

Γ(10.3) ≈ 0.7 × 9! + 0.3 × 10! = 1342656.

But the exact value is closer to 716430.69, and so our estimate is 87% too high. Not a very good approximation.

Now let’s try again, applying linear interpolation to the log of the gamma function. Our approximation is

log Γ(10.3) ≈ 0.7 × log 9! + 0.3 × log 10! = 13.4926

while the actual value is 13.4820, an error of about 0.08%. If we take exponentials to get an approximation of Γ(10.3), not log Γ(10.3), the error is larger, about 1%, but still much better than 87% error.
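
Here’s a short Python check of these numbers, a sketch using scipy.special.gamma for the exact value.

    from math import exp, factorial, log
    from scipy.special import gamma

    exact = gamma(10.3)                                                  # about 716430.69
    direct  = 0.7 * factorial(9) + 0.3 * factorial(10)                   # interpolate Γ itself
    via_log = exp(0.7 * log(factorial(9)) + 0.3 * log(factorial(10)))    # interpolate log Γ, then exponentiate

    print(direct / exact - 1)      # about 0.87, i.e. 87% too high
    print(via_log / exact - 1)     # about 0.01, i.e. roughly 1% too high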

The gamma function grows very quickly, and so the log gamma function is usually easier to work with.

As a bonus, the Bohr–Mollerup theorem says that log gamma is a convex function. This tells us that not only does linear interpolation give an approximation, it gives us an upper bound.

The Bohr–Mollerup theorem essentially says that the gamma function is the only function that extends factorial from a function on the integers to a log-convex function on the real numbers. This isn’t quite true since it’s actually Γ(x + 1) that extends factorial. Showing the gamma function is unique is the hard part. In the preceding paragraph we used the easy direction of the theorem, saying that gamma is log-convex.


Too clever Monte Carlo

One way to find the volume of a sphere would be to imagine the sphere in a box, randomly select points in the box, and count how many of these points fall inside the sphere. In principle this would work in any dimension.

The problem with naive Monte Carlo

We could write a program to estimate the volume of a high-dimensional sphere this way. But there’s a problem: very few random samples will fall in the sphere. The ratio of the volume of a sphere to the volume of a box it fits in goes to zero as the dimension increases. We might take a large number of samples and none of them fall inside the sphere. In this case we’d estimate the volume as zero. This estimate would have small absolute error, but 100% relative error.

A more clever approach

So instead of actually writing a program to randomly sample a high dimensional cube, let’s imagine that we did. Instead of doing a big Monte Carlo study, we could be clever and use theory.

Let n be our dimension. We want to draw uniform random samples from [−1, 1]ⁿ and see whether they land inside the unit sphere. So we’d draw n random samples from [−1, 1] and see whether the sum of their squares is less than or equal to 1.

Let Xi be a uniform random variable on [−1, 1]. We want to know the probability that

X1² + X2² + X3² + … + Xn² ≤ 1.

This would be an ugly calculation, but since we’re primarily interested in the case of large n, we can approximate the sum using the central limit theorem (CLT). We can show, using the transformation theorem, that each Xi² has mean 1/3 and variance 4/45. The CLT says that the sum has approximately the distribution of a normal random variable with mean n/3 and variance 4n/45.

Too clever by half

The approach above turns out to be a bad idea, though it’s not obvious why.

The CLT does provide a good approximation of the sum above, near the mean. But we have a sum with mean n/3, with n large, and we’re asking for the probability that the sum is less than 1. In other words, we’re asking for a probability far out in the tail, where the relative error of the CLT approximation is large. More on this here.
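
Here’s a small Python sketch (using scipy.stats and scipy.special) comparing the CLT tail estimate to the exact ratio of the volume of the unit ball to the volume of the cube [−1, 1]ⁿ.

    import numpy as np
    from scipy.stats import norm
    from scipy.special import gammaln

    def clt_estimate(n):
        # P(X1^2 + ... + Xn^2 <= 1) under the normal approximation: mean n/3, variance 4n/45
        return norm.cdf((1 - n/3) / np.sqrt(4*n/45))

    def exact_ratio(n):
        # (volume of the unit n-ball) / (volume of [-1, 1]^n) = pi^(n/2) / (Gamma(n/2 + 1) 2^n)
        return np.exp(0.5 * n * np.log(np.pi) - gammaln(n/2 + 1) - n * np.log(2))

    for n in (5, 10, 20):
        print(n, clt_estimate(n), exact_ratio(n))
    # The relative error of the CLT estimate grows rapidly with the dimension n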

This post turned out not to be about what I thought it would be about. I thought this post would lead to an asymptotic approximation for the volume of an n-dimensional sphere. I would compare the approximation to the exact value and see how well it did. Except it did terribly. So instead, this post is a cautionary tale about remembering how convergence works in the CLT.


Evaluating a class of infinite sums in closed form

The other day I ran across the surprising identity

\sum_{n=1}^\infty \frac{n^3}{2^n} = 26

and wondered how many sums of this form can be evaluated in closed form like this. Quite a few it turns out.

Sums of the form

\sum_{n=1}^\infty \frac{n^k}{c^n}

evaluate to a rational number when k is a non-negative integer and c is a rational number with |c| > 1. Furthermore, there is an algorithm for finding the value of the sum.

The sums can be evaluated using the polylogarithm function Lis(z) defined as

\text{Li}_s(z) = \sum_{n=1}^\infty \frac{z^n}{n^s}

using the identity

\sum_{n=1}^\infty \frac{n^k}{c^n} = \text{Li}_{-k}\left(\frac{1}{c}\right)

We then need to have a way to evaluate Lis(z). This cannot be done in closed form in general, but it can be done when s is a negative integer as above. To evaluate Li−k(z) we need to know two things. First,

\text{Li}_1(z) = -\log(1-z)

and second,

\text{Li}_{s-1}(z) = z \frac{d}{dz} \text{Li}_s(z)

Now Li0(z) is a rational function of z, namely z/(1 − z). The derivative of a rational function is a rational function, and multiplying a rational function of z by z produces another rational function, so Lis(z) is a rational function of z whenever s is a non-positive integer.

Assuming the results cited above, we can prove the identity

\sum_{n=1}^\infty \frac{n^3}{2^n} = 26

stated at the top of the post. The sum equals Li−3(1/2), and

\text{Li}_{-3}(z) = \left(z \frac{d}{dz}\right)^3 \frac{z}{1-z} = \frac{z(1 + 4z + z^2)}{(1-z)^4}

The result comes from plugging in z = 1/2 and getting out 26.
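
The same derivation can be done mechanically. Here’s a minimal SymPy sketch that applies the operator z d/dz three times to z/(1 − z) and evaluates the result at 1/2.

    from sympy import symbols, simplify, factor, Rational

    z = symbols('z')
    expr = z / (1 - z)                     # Li_0(z)
    for _ in range(3):                     # apply z d/dz three times to get Li_{-3}(z)
        expr = simplify(z * expr.diff(z))

    print(factor(expr))                    # z(1 + 4z + z^2)/(1 - z)^4, up to how SymPy arranges signs
    print(expr.subs(z, Rational(1, 2)))    # 26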

When k and c are positive integers, the sum

\sum_{n=1}^\infty \frac{n^k}{c^n}

is not necessarily an integer, as it is when k = 3 and c = 2, but it is always rational. It looks like the sum is an integer if c = 2; I verified that the sum is an integer for c = 2 and k = 1 through 10 using the PolyLog function in Mathematica.
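
The same check is easy to do in Python with mpmath’s polylog function; here’s a sketch of the verification (not the original Mathematica code).

    from mpmath import mp, polylog, nint

    mp.dps = 30
    # The sum of n^k / 2^n equals Li_{-k}(1/2); check that it is an integer for k = 1 through 10
    for k in range(1, 11):
        s = polylog(-k, mp.mpf(1)/2)
        print(k, s, abs(s - nint(s)) < mp.mpf('1e-20'))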

Update: Here is a proof that the sum is an integer when c = 2. From a comment by Theophylline on Substack.

The sum is occasionally an integer for larger values of c. For example,

\sum_{n=1}^\infty \frac{n^4}{3^n} = 15

and

\sum_{n=1}^\infty \frac{n^8}{3^n} = 17295
