Magical learning

I asked two questions on twitter yesterday. The previous post summarized the results for a question about books that I asked from my personal Twitter account.

This post will summarize the results of a question I asked from @AnalysisFact.

Here are the results by category. Some are not strictly theorems but rather topics or conjectures.

Computer science

  • Curry-Howard correspondence
  • P=NP or not
  • Collatz Conjecture


  • Cohen’s forcing theorem
  • Godel’s Incompleteness theorem
  • Continuum hypothesis
  • Zorn’s lemma

Algebra and number theory

  • The ABC conjecture (theorem?)
  • Prime number theorem.
  • Riemann hypothesis
  • Fundamental theorem of finite abelian groups
  • Classification of finite simple groups
  • Fermat’s last theorem, the unpublished Fermat version

Topology and differential geometry

  • Thurston’s geometrization conjecture
  • Gauss Bonnet theorem
  • de Rham’s theorem
  • Grothendieck Riemann Roch theorem


  • Banach-Tarski theorem.
  • Stokes theorem
  • Carleson-Hunt theorem
  • The epsilon/delta definition of continuity
  • Universal approximation theorem

Differential equations

  • Navier–Stokes existence and smoothness postulate
  • The relativistic version of the Shrodinger equation
  • Atiyah-Singer index theorem

Mathematical physics

  • E = mc²
  • Noether’s theorem
  • Liouville’s theorem


  • Existence of general equilibrium prices
  • Farkas’ lemma
  • The graph minor theorem
  • Central limit theorem

Mischievous genie

A couple people picked up the fact that in folk stories, being granted a wish doesn’t usually turn out well.

M. Shah: uh oh. Is this one of those malicious genies that twists language used to make the wish so that you are granted some horrific wish?

Jumpy the Hat: You now understand every single thing about irrational numbers but it’s all wasted because you’re cursed to become NJ Wildberger and you don’t think they exist

M Shah: or you want to thoroughly understand some theorem about Weierstrass’s monster. But little did anyone know that Weierstrass actually did have a pet monster. And it ends up biting your head off because it doesn’t like other things that are continuous.

Low-rank matrix perturbations

Here are a couple of linear algebra identities that can be very useful, but aren’t that widely known, somewhere between common knowledge and arcane. Neither result assumes any matrix has low rank, but their most common application, at least in my experience, is in the context of something of low rank added to something of full rank.

Sylvester’s determinant theorem

The first is Sylvester’s determinant theorem. If A is a n by k matrix and B is a k by n matrix, then

\det(I_n + AB) = \det(I_k + BA)

where I is an identity matrix whose dimension is given by its subscript. The theorem is true for any k, but it’s especially useful when k is small because the matrices on the right side are of size k. If k = 1, the right side is the determinant of a 1 by 1 matrix, i.e. just a number!

Woodbury matrix inversion lemma

The second is known as the matrix inversion lemma or Woodbury’s matrix identity. It says

\left(A+UCV \right)^{-1} = A^{-1} - A^{-1}U \left(C^{-1}+VA^{-1}U \right)^{-1} VA^{-1}

which is a lot to take in at once. We’ll unpack it a little at a time.

First of all, the matrices have whatever properties are necessary for the equation to make sense: A and C are square, invertible matrices, though possibly of different sizes.

Suppose A is n by n. Then U necessarily has n rows. Let k be the number of columns in U. Then C is k by k and V is k by n.

To simplify things a little, let C be the identity.

\left(A+UV \right)^{-1} = A^{-1} - A^{-1}U \left(I+VA^{-1}U \right)^{-1} VA^{-1}

Think of k as small, maybe even 1. Then UV is a low-rank perturbation of A. Suppose you have computed the inverse of A, or know something about the inverse of A, and want to know how that inverse changes when you change A by adding a rank k matrix to it.

To simplify things further, assume A is the identity matrix. Now the matrix inversion lemma reduces to something similar to Sylvester’s theorem above.

\left(I+UV \right)^{-1} = I - U \left(I+VU \right)^{-1} V

To make things clearer, we’ll add subscripts on the identity matrices as above.

\left(I_n+UV \right)^{-1} = I_n - U \left(I_k+VU \right)^{-1} V

If k is small, then the matrix between U and V on the right side is small. If k = 1, it’s just a number, and its inverse is just it’s reciprocal.

Related posts

Almost prime generators and almost integers

Here are two apparently unrelated things you may have seen before. The first is an observation going back to Euler that the polynomial

n^2 - n + 41

produces a long sequence of primes. Namely, the values are prime for n = 1, 2, 3, …, 40.

The second is that the number

e^{\pi \sqrt{163}}

is extraordinarily close to an integer. This number is known as Ramanujan’s constant. It differs from its nearest integer by 3 parts in 1030. Ramanujan’s constant equals


There is a connection between these two facts: The polynomial

n^2 - n + k

returns primes for n = 1, 2, 3, …, k-1 primes if 4k – 1 is a Heegner number, and

e^{\pi \sqrt{d}}

is almost an integer if d is a (large) Heegner number.

Source: The Book of Numbers by Conway and Guy.

Heegner numbers

So what’s a Heegner number and how many are there? An integer d is a Heegner number if the ring generated by appending √-d to the integers has unique factorization. There are nine such numbers:

1, 2, 3, 7, 11, 19, 43, 67, 163.

There’s deeper stuff going on here than I understand—modular forms, the j-function, etc.—so this post won’t explain everything. There’s something unsatisfying about saying something is “almost” an integer without quantifying. There’s a way to be more precise, but we won’t go there. Instead, we’ll just play with the results.

Mathematica computation

First we look at the claim that n² – n + k produces primes for n = 1 through k – 1 if 4k – 1 is a Heegner number. The Heegner numbers of the form 4k + 1 are 2, 3, 5, 11, and 17. The following code shows that the claim is true for these values of k.

k = {2, 3, 5, 11, 17}
claim[x_] := AllTrue[
  Table[n^2 - n + x, {n, x - 1}], 
AllTrue[k, claim]

This returns True, so the claim is true.

As for exp(π √d) being close to an integer, this apparently only true for the last three Heegner numbers.

h = {1, 2, 3, 7, 11, 19, 43, 67, 163}
For[i = 1, i < 10, i++, 
        Exp[ Pi Sqrt[ h[[i]] ] ], 

(The function AccountingForm suppresses scientific notation, making it easier to see where the decimal portion of the number starts.)

Here are the results:


I manually edited the output to align the decimal points and truncate the decimal places beyond that needed to show that the last number is not an integer.

Partition numbers and Ramanujan’s approximation

The partition function p(n) counts the number of ways n unlabeled things can be partitioned into non-empty sets. (Contrast with Bell numbers that count partitions of labeled things.)

There’s no simple expression for p(n), but Ramanujan discovered a fairly simple asymptotic approximation:

p(n) \sim \frac{1}{4n\sqrt{3}} \exp(\pi \sqrt{2n/3})

How accurate is this approximation? Here’s a little Matheamtica code to see.

p[n_] := PartitionsP[n]
approx[n_] := Exp[ Pi Sqrt[2 n/3]] / (4 n Sqrt[3])
relativeError[n_] := Abs[p[n] - approx[n]]/p[n]
ListLinePlot[Table[relativeError[n], {n, 100}]]

So for values of n around 100, the approximation is off by about 5%.

Since it’s an asymptotic approximation, the relative error decreases (albeit slowly, apparently) as n increases. The relative error for n = 1,000 is about 1.4% and the relative error for n = 1,000,000 is about 0.044%.

Update: After John Baez commented on the oscillation in the relative error I decided to go back and look at it more carefully. Do the oscillations end or do they just become too small to see?

To answer this, let’s plot the difference in consecutive terms.

relerr[a_, b_] := Abs[a - b]/b
t = Table[relerr[p[n], approx[n]], {n, 300}];
ListLinePlot[Table[ t[[n + 1]] - t[[n]], {n, 60}]]

first differences of relative error

The plot crosses back and forth across the zero line, indicating that the relative error alternately increases and decreases, but only up to a point. Past n = 25 the plot stays below the zero line; the sign changes in the first differences stop.

But now we see that the first differences themselves alternate! We can investigate the alternation in first differences by plotting second differences [1].

    Table[ t[[n + 2]] - 2 t[[n + 1]] + t[[n]], 
    {n, 25, 120}]

first differences of relative error

So it appears that the second differences keep crossing the zero line for a lot longer, so far out that it’s hard to see. In fact the second differences become positive and stay positive after n = 120. But the second differences keep alternating, so you could look at third differences …

Related posts: Special numbers


[1] The code does a small algebraic simplification that might make some people scratch their heads. All it does is simplify

(tn+2tn+1) – (tn+1tn).

Mathematics of Deep Note

THX deepnote logo score

I just finished listening to the latest episode of Twenty Thousand Hertz, the story behind “Deep Note,” the THX logo sound.

There are a couple mathematical details of the sound that I’d like to explore here: random number generation, and especially Pythagorean tuning.

Random number generation

First is that part of the construction of the sound depended on a random number generator. The voices start in a random configuration and slowly reach the target D major chord at the end.

Apparently the random number generator was not seeded in a reproducible way. This was only mentioned toward the end of the show, and a teaser implies that they’ll go more into this in the next episode.

Pythagorean tuning

The other thing to mention is that the final chord is based on Pythagorean tuning, not the more familiar equal temperament.

The lowest note in the final chord is D1. (Here’s an explanation of musical pitch notation.) The other notes are D2, A2, D3, A3, D4, A4, D5, A5, D6, and F#6.


Octave frequencies are a ratio of 2:1, so if D1 is tuned to 36 Hz, then D2 is 72 Hz, D3 is 144 Hz, D4 is 288 Hz, D5 is 576 Hz, and D6 is 1152 Hz.


In Pythagorean tuning, fifths are in a ratio of 3:2. In equal temperament, a fifth is a ratio of 27/12 or 1.4983 [1], a little less than 3/2. So Pythagorean fifths are slightly bigger than equal temperament fifths. (I explain all this here.)

If D2 is 72 Hz, then A2 is 108 Hz. It follows that A3 would be 216 Hz, A4 would be 432 Hz (flatter than the famous A 440), and A5 would be 864 Hz.

Major thirds

The F#6 on top is the most interesting note. Pythagorean tuning is based on fifths being a ratio of 3:2, so how do you get the major third interval for the highest note? By going up by fifths 4 times from D4, i.e. D4 -> A4 -> E5 -> B5 -> F#6.

The frequency of F#6 would be 81/16 of the frequency of D4, or 1458 Hz.

The F#6 on top has a frequency 81/64 that of the D# below it. A Pythagorean major third is a ratio of 81/64 = 1.2656, whereas an equal temperament major third is f 24/12 or 1.2599 [2]. Pythagorean tuning makes more of a difference to thirds than it does to fifths.

A Pythagorean major third is sharper than a major third in equal temperament. Some describe Pythagorean major chords as brighter or sweeter than equal temperament chords. That the effect the composer was going for and why he chose Pythagorean tuning.


Then after specifying the exact pitches for each note, the composer actually changed the pitches of the highest voices a little to make the chord sound fuller. This makes the three voices on each of the highest notes sound like three voices, not just one voice. Also, the chord shimmers a little bit because the random effects from the beginning of Deep Note never completely stop, they are just diminished over time.

Related posts


[1] The exponent is 7/12 because a half step is 1/12 of an octave, and a fifth is 7 half steps.

[2] The exponent is 4/12 because a major third is 4 half steps.

Musical score above via THX Ltd on Twitter.

Stirling numbers, including negative arguments

Stirling numbers are something like binomial coefficients. They come in two varieties, imaginatively called the first kind and second kind. Unfortunately it is the second kind that are simpler to describe and that come up more often in applications, so we’ll start there.

Stirling numbers of the second kind

The Stirling number of the second kind

S_2(n,k) = \left\{ {n \atop k} \right\}

counts the number of ways to partition a set of n elements into k non-empty subsets. The notation on the left is easier to use inline, and the subscript reminds us that we’re talking about Stirling numbers of the second kind. The notation on the right suggests that we’re dealing with something analogous to binomial coefficients, and the curly braces suggest this thing might have something to do with counting sets.

Since the nth Bell number counts the total number of ways to partition a set of n elements into any number of non-empty subsets, we have

B_n = \sum_{k=1}^n \left\{ {n \atop k}\right\}

Another connection to Bell is that S2(n, k) is the sum of the coefficients in the partial exponential Bell polynomial Bn, k.

Stirling numbers of the first kind

The Stirling numbers of the first kind

S_1(n,k) = \left[ {n \atop k} \right]

count how many ways to partition a set into cycles rather than subsets. A cycle is a sort of ordered subset. The order of elements matters, but only a circular way. A cycle of size k is a way to place k items evenly around a circle, where two cycles are considered the same if you can rotate one into the other. So, for example, [1, 2, 3] and [2, 3, 1] represent the same cycle, but [1, 2, 3] and [1, 3, 2] represent different cycles.

Since a set with at least three elements can be arranged into multiple cycles, Stirling numbers of the first kind are greater than or equal to Stirling numbers of the second kind, given the same arguments.

Addition identities

We started out by saying Stirling numbers were like binomial coefficients, and here we show that they satisfy addition identities similar to binomial coefficients.

For binomial coefficients we have

{n \choose k} = {n - 1 \choose k} + {n-1 \choose k-1}.

To see this, imagine we start with the set of the numbers 1 through n. How many ways can we select a subset of k items? We have selections that exclude the number 1 and selections that include the number 1. These correspond to the two terms on the right side of the identity.

The analogous identities for Stirling numbers are

\left\{{n \atop k} \right\}= k \left\{ {n-1 \atop k} \right\} + \left\{ {n-1 \atop k-1} \right\}

\left[ {n \atop k} \right]= (n-1) \left[ {n-1 \atop k} \right] + \left[ {n-1 \atop k-1} \right]
The combinatorial proofs of these identities are similar to the argument above for binomial coefficients. If you want to partition the numbers 1 through n into k subsets (or cycles), either 1 is in a subset (cycle) by itself or not.

More general arguments

Everything above has implicitly assumed n and k were positive, or at least non-negative, numbers. Let’s look first at how we might handle zero arguments, then negative arguments.

It works out best if we define S1(0, k) and S2(0, k) to be 1 if k is 0 and zero otherwise. Also we define S1(n, 0) and S2(n, 0) to be 1 if n is 0 and zero otherwise. (See Concrete Mathematics.)

When either n or k is zero the combinatoric interpretation is strained. If we let either be negative, there is no combinatorial interpretation, though it can still be useful to consider negative arguments, much like one does with binomial coefficients.

We can take the addition theorems above, which are theorems for non-negative arguments, and treat them as definitions for negative arguments. When we do, we get the following beautiful dual relationship between Stirling numbers of the first and second kinds:

\left\{{n \atop k} \right\}= \left[ {-k \atop -n} \right]

Related posts

Tetrahedral numbers

Start with the sequence of positive integers:

1, 2, 3, 4, …

Now take partial sums, the nth term of the new series being the sum of the first n terms of the previous series. This gives us the triangular numbers, so called because they count the number of coins at each level of a triangular arrangement:

1, 3, 6, 10, …

If we repeat this again we get the tetrahedral numbers, the number of balls on each level of a tetrahedral stack of balls.

1, 4, 10, 20, …

We can repeat this process and general define Tn, d, the nth tetrahedral number of dimension d, recursively. We define Tn, 1 = n and for d > 1,

T(n, d) = \sum_{i=1}^n T(i, d-1)

This is just a formalization of the discussion above.

It turns out there’s a simple expression for tetrahedral number of all dimensions:

T(n, d) = \binom{n+d-1}{d}

Here’s a quick proof by induction. The theorem clearly holds when n = 1 or d = 1. Now assume it hold for all n < m and d < m.

\begin{eqnarray*} T(n, m) &=& \sum_{i=1}^n T(n, m-1) \\ &=& T(n, m-1) + \sum_{i=1}^{n-1} T(n, m-1) \\ &=& T(n, m-1) + T(n-1, m) \\ &=& \binom{n+m-2}{m-1} + \binom{n+m-2}{m} \\ &=& \binom{n+m-1}{m} \end{eqnarray*}

The last line follows from the binomial addition identity

\binom{a}{b} = \binom{a-1}{b} + \binom{a-1}{b-1}

It also turns out that Tn, d is the number of ways to select d things from a set of n with replacement.

Related posts:

Bell numbers

The nth Bell number is the number of ways to partition a set of n labeled items. It’s also equal to the following sum.

B_n = \frac{1}{e}\sum_{k=0}^\infty \frac{k^n}{k!}

You may have to look at that sum twice to see it correctly. It looks a lot like the sum for en except the roles of k and n are reversed in the numerator.

The sum reminded me of the equation

\frac{d}{de}e^x = x e^{x-1}

that I blogged about a few weeks ago. It’s correct, but you may have to stare at it a little while to see why.

Incidentally, the nth Bell number is the nth moment of a Poisson random variable with mean 1.

There’s also a connection with Bell polynomials. The nth Bell number is the sum of the coefficients of the nth complete Bell polynomial. The nth Bell polynomial is sum over k of the partial exponential Bell polynomials Bn,k. The partial (exponential) Bell polynomials are defined here.

Computing Bell numbers

You can compute Bell numbers in Python with sympy.bell and in Mathematical with BellB. You can compute Bell numbers by hand, or write your own function in a language that doesn’t provide one, with the recursion relation B0 = 1 and

B_n = \sum_{k=0}^{n-1}{ n-1 \choose k}B_k

for n > 0. Bell numbers grow very quickly, faster than exponential, so you’ll need extended precision integers if you want to compute very many Bell numbers.

For theoretical purposes, it is sometimes helpful to work with the exponential generating function of the Bell numbers. The nth Bell number is n! times the coefficient of xn in exp(exp(x) – 1).

\sum_{n=0}^\infty \frac{B_n}{n!} x^n = \exp(\exp(x) -1)

An impractical but amusing way to compute Bell numbers would be via simulation, finding the expected value of nth powers of random draws from a Poisson distribution with mean 1.

Relative error in the central limit theorem

If you average a large number independent versions of the same random variable, the central limit theorem says the average will be approximately normal. That is the absolute error in approximating the density of the average by the density of a normal random variable will be small. (Terms and conditions apply. See notes here.)

But the central limit theorem says nothing about relative error. Relative error can diverge to infinity while absolute error converges to zero. We’ll illustrate this with an example.

The average of N independent exponential(1) random variables has a gamma distribution with shape N and scale 1/N.

As N increases, the average becomes more like a normal in distribution. That is, the absolute error in approximating the distribution function of gamma random variable with that of a normal random variable decreases. (Note that we’re talking about distribution functions (CDFs) and not densities (PDFs). The previous post discussed a surprise with density functions in this example.)

The following plot shows that the difference between the distributions functions get smaller as N increases.

But when we look at the ratio of the tail probabilities, that is Pr(X > t) / Pr(Y  > t) where X is the average of N exponential r.v.s and Y is the corresponding normal approximation from the central limit theorem, we see that the ratios diverge, and they diverge faster as N increases.

To make it clear what’s being plotted, here is the Python code used to draw the graphs above.

import matplotlib.pyplot as plt
from scipy.stats import gamma, norm
from scipy import linspace, sqrt

def tail_ratio(ns):

    x = linspace(0, 4, 400)
    for n in ns:
        gcdf = gamma.sf(x, n, scale = 1/n)
        ncdf = norm.sf(x, loc=1, scale=sqrt(1/n))
        plt.plot(x, gcdf/ncdf

    plt.legend(["n = {}".format(n) for n in ns])

def cdf_error(ns):

    x = linspace(0, 6, 400)
    for n in ns:
        gtail = gamma.cdf(x, n, scale = 1/n)
        ntail = norm.cdf(x, loc=1, scale=sqrt(1/n))
        plt.plot(x, gtail-ntail)

    plt.legend(["n = {}".format(n) for n in ns])
ns = [1, 4, 16]

Central limit theorem and Runge phenomena

I was playing around with something this afternoon and stumbled on something like Gibbs phenomena or Runge phenomena for the Central Limit Theorem.

The first place most people encounter Gibbs phenomena is in Fourier series for a step function. The Fourier series develops “bat ears” near the discontinuity. Here’s an example I blogged about before not with Fourier series but with analogous Chebyshev series.

Gibbs phenomena for Chebyshev interpolation

The series converges rapidly in the middle of the flat parts, but under-shoots and over-shoots near the jumps in the step function.

Runge phenomena is similar, where interpolating functions under- and over-shoot the function they’re approximating.

Runge example

Both plots above come from this post.

Here’s the example I ran across with the central limit theorem. The distribution of the average of a set of exponential random variables converges to the distribution of a normal random variable. The nice thing about the exponential distribution is that the averages have a familiar distribution: a gamma distribution. If each exponential has mean 1, the average has a gamma distribution with shape N and scale 1/N. The central limit theorem says this is converging in distribution to a normal distribution with the same mean and variance.

The plot below shows the difference between the density function of the average of N exponential random variables and the density function for its normal approximation, for N = 10 and for N = 400.

Notice that the orange line, corresponding to N = 400, is very flat, most of the time. That is, the normal approximation fits very well. But there’s this spike in the middle, something reminiscent of Gibbs phenomena or Runge phenomena. Going from 10 to 400 samples the average error decreases quite a bit, but the maximum error doesn’t go down much at all.

If you go back and look at the Central Limit Theorem, or its more quantitative counterpart the Berry-Esseen theorem, you’ll notice that it applies to the distribution function, not the density, or in other words, the CDF, not the PDF. I think the density functions do converge in this case because the exponential function has a smooth density, but the rate of convergence depends on the norm. It looks like the convergence is fast in square (L²) norm, but slow in sup norm. A little experiment shows that this is indeed the case.

Maybe the max norm doesn’t converge at all, i.e. the densities don’t converge pointwise. It looks like the max norm may be headed toward a horizontal asymptote, just like Gibbs phenomena.

Update: It seems we do not have uniform convergence. If we let N = 1,000,000, the sup norm of the error 0.1836. It appears the sup norm of the error is approaching a lower bound of approximately this value.

Related posts

Calendars and continued fractions

Calendars are based on three frequencies: the rotation of the Earth on its axis, the rotation of the moon around the Earth, and the rotation of the Earth around the sun. Calendars are complicated because none of these periods is a simple multiple of any other. The ratios are certainly not integers, but they’re not even fractions with a small denominator.

As I wrote about the other day in the context of rational approximations for π, the most economical rational approximations of an irrational number come from convergents of continued fractions. The book Calendrical Calculations applies continued fractions to various kinds of calendars.

Ratio of years to days

The continued fraction for the number of days in a year is as follows.

365.2424177 = 365 + \cfrac{1}{4 + \cfrac{1}{7+ \cfrac{1}{1 + \cfrac{1}{2 + \cfrac{1}{1 + \cfrac{1}{5 + \ldots }}}}}}

This means that the best approximations for the length of a year are 365 days plus a fraction of 1/4, or 7/29, or 8/33, or 23/95, etc.

You could have one leap day every four years. More accurate would be 7 leap days every 29 years, etc. The Gregorian calendar has 97 leap days every 400 years.

Ratio of years to months

If we express the ratio of the length of the year to the length of the month, we get

\frac{365.24244}{29.53059} = 12 +\cfrac{1}{2 + \cfrac{1}{1+ \cfrac{1}{2 + \cfrac{1}{1 + \cfrac{1}{1 + \cfrac{1}{18 + \cfrac{1}{3 + \ldots }}}}}}}

By taking various convergents we get 37/3, 99/8, 136/11, etc. So if you want to design successively more accurate lunisolar calendars, you’d have 37 lunar months every 3 years, 99 lunar months every 8 years, etc.

Related posts

Computing smooth max without overflow

Erik Erlandson sent me a note saying he found my post on computing the soft maximum helpful. (If you’re unfamiliar with the soft maximum, here’s a brief description of what it is and how you might use it.) Erik writes

I used your post on practical techniques for computing smooth max, which will probably be the difference between being able to actually use my result and not. I wrote up what I’m planning to do here.

His post extends the idea in my post to computing the gradient and Hessian of the soft max.

Combinatorics, just beyond the basics

Most basic combinatorial problems can be solved in terms of multiplication, permutations, and combinations. The next step beyond the basics, in my experience, is counting selections with replacement. Often when I run into a problem that is not quite transparent, it boils down to this.

Examples of selection with replacement

Here are three problems that reduce to counting selections with replacement.

Looking ahead in an experiment

For example, suppose you’re running an experiment in which you randomize to n different treatments and you want to know how many ways the next k subjects can be assigned to treatments. So if you had treatments A, B, and C, and five subjects, you could assign all five to A, or four to A and one to B, etc. for a total of 21 possibilities. Your choosing 5 things from a set of 3, with replacement because you can (and will) assign the same treatment more than once.

Partial derivatives

For another example, if you’re taking the kth order partial derivatives of a function of n variables, you’re choosing k things (variables to differentiate with respect to) from a set of n (the variables). Equality of mixed partials for smooth functions says that all that matters is how many times you differentiate with respect to each variable. Order doesn’t matter: differentiating with respect to x and then with respect to y gives the same result as taking the derivatives in the opposite order, as long as the function you’re differentiating has enough derivatives. I wrote about this example here.

Sums over even terms

I recently had an expression come up that required summing over n distinct variables, each with a non-negative even value, and summing to 2k. How many ways can that be done? As many as dividing all the variable values in half and asking that they sum to k. Here the thing being chosen the variable, and since the indexes sum to k, I have a total of k choices to make, with replacement since I can chose a variable more than once. So again I’m choosing k things with replacement from a set of size n.


I wrote up a set of notes on sampling with replacement that I won’t duplicate here, but in a nutshell:

\left( {n \choose k} \right) = {n + k - 1 \choose k}

The symbol on the left is Richard Stanley’s suggested notation for the number of ways to select k things with replacement from a set of size n. It turns out that this equals the expression on the right side. The derivation isn’t too long, but it’s not completely obvious either. See the aforementioned notes for details.

I started by saying basic combinatorial problems can be reduced to multiplication, permutations, and combinations. Sampling with replacement can be reduced to a combination, the right side of the equation above, but with non-obvious arguments. Hence the need to introduce a new symbol, the one on the right, that maps directly to problem statements.

Enumerating possibilities

Sometimes you just need to count how many ways one can select k things with replacement from a set of size n. But often you need to actually enumerate the possibilities, say to loop over them in software.

In conducting a clinical trial, you may want to enumerate all the ways pending data could turn out. If you’re going to act the same way, no matter how the pending data work out, there’s no need to wait for the missing data before proceeding. This is something I did many times when I worked at MD Anderson Cancer Center.

When you evaluate a multivariate Taylor series, you need to carry out a sum that has a term for corresponding to each partial derivative. You could naively sum over all possible derivatives, not taking into account equality of mixed partials, but that could be very inefficient, as described here.

The example above summing over even partitions of an even number also comes out of a problem with multivariate Taylor series, one in which odd terms drop out by symmetry.

To find algorithms for enumerating selections with replacement, see Knuth’s TAOCP, volume 4, Fascicle 3.

Related posts

10 best rational approximations for pi

It’s easy to create rational approximations for π. Every time you write down π to a few decimal places, that’s a rational approximation. For example, 3.14 = 314/100. But that’s not the best approximation.

Think of the denominator of your fraction as something you have to buy. If you have enough budget to buy a three-digit denominator, then you’re better off buying 355/113 rather than 314/100 because the former is more accurate. The approximation 355/113 is accurate to 6 decimal places, whereas 314/100 is only accurate to 2 decimal places.

There’s a way to find the most economical rational approximations to π, or any other irrational number, using continued fractions. If you’re interested in details, see the links below.

Here are the 10 best rational approximations to π, found via continued fractions.

    | Fraction       | Decimals |
    | 3              |      0.8 |
    | 22/7           |      2.9 |
    | 333/106        |      4.1 |
    | 355/113        |      6.6 |
    | 103993/33102   |      9.2 |
    | 104348/33215   |      9.5 |
    | 208341/66317   |      9.9 |
    | 312689/99532   |     10.5 |
    | 833719/265381  |     11.1 |
    | 1146408/364913 |     11.8 |

If you only want to know the number of correct decimal places, ignore the fractional parts of the numbers in the Decimals column above. For example, 22/7 gives two correct decimal places. But it almost gives three. (Technically, the Decimals column gives the -log10 of the absolute error.)

In case you’re curious, here’s a plot of the absolute errors on a log scale.

Errors in rational approximations to pi

Related posts

Fixed points of logistic function

Here’s an interesting problem that came out of a logistic regression application. The input variable was between 0 and 1, and someone asked when and where the logistic transformation

f(x) = 1/(1 + exp(abx))

has a fixed point, i.e. f(x) = x.

So given logistic regression parameters a and b, when does the logistic curve given by yf(x) cross the line yx? Do they always cross? Can they cross twice?

There’s always at least one solution. Because f(x) is strictly between 0 and 1, the function

g(x) = f(x) – x

is positive at 0 and negative at 1, and by the intermediate value theorem g(x) must be zero for some x between 0 and 1.

Sometimes f has only one fixed point. It may have two or three fixed points, as demonstrated in the graph below. The case of two fixed points is unstable: the logistic curve is tangent to the line yx at one point, and a tiny change would turn this tangent point into either no crossing or two crossings.

Logistic curves crossing identity line

If |b| < 1, then you can show that the function f is a contraction map on [0, 1]. In that case there is a unique solution to f(x) = x, and you can find it by starting with an arbitrary value for x and repeatedly applying f to it. For example, if a= 1 and b = 0.8 and we start with x= 0, after applying f ten times we get xf(x) = 0.233790157.

There are a couple questions left to finish this up. How can you tell from a and b how many fixed points there will be? The condition |b| < 1 is sufficient for f to be a contraction map on [0, 1]. Can you find necessary and sufficient conditions?

Related post: Sensitivity of logistic regression prediction