Binomial number system

I just stumbled across the binomial number system in Exercise 5.38 of Concrete Mathematics. The exercise asks the reader to show that every non-negative integer n can be written as

n = \binom{a}{1} + \binom{b}{2} + \binom{c}{3}

and that the representation is unique if you require 0 ≤ a < b < c. The book calls this the binomial number system. I skimmed a paper that said this has some application in signal processing, but I haven’t looked at it closely [1].

You can find a, b, and c much as you would find the representation in many other number systems: first find the largest possible c, then the largest possible b for what’s left, and then the remainder is a.

In order to find c, start with the observation that the binomial coefficient C(k, 3) is at most k³/6, so C(k, 3) ≤ n whenever k is at most the cube root of 6n. This makes the cube root of 6n, rounded down, an initial lower bound on c, and we can search upward incrementally from there. If we wanted to be more efficient, we could do some sort of binary search.

Here’s Python code to find a, b, and c.

from math import comb, factorial

def lower(n, r):
    "Find largest k such that comb(k, r) <= n."
    k = int( (factorial(r)*n)**(1/r) ) # initial guess
    while comb(k, r) <= n: 
        k += 1 
    return k - 1 

def binomial_rep(n):
    c = lower(n, 3)
    cc = comb(c, 3)
    b = lower(n - cc, 2)
    bb = comb(b, 2)
    a = n - cc - bb
    assert c > b > a >= 0
    return (a, b, c)

For example, here’s the binomial number system representation of today’s date.

>>> binomial_rep(20250605)
(79, 269, 496)
>>> comb(496, 3) + comb(269, 2) + comb(79, 1)
20250605

You could use any number of binomial terms, not just three.
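The greedy approach above generalizes directly. Here’s a sketch for d terms, reusing the lower function above; the function name is mine, and this generalization (sometimes called the combinatorial number system) goes beyond the exercise.

def binomial_rep_d(n, d):
    "Greedy sketch: write n as comb(k_d, d) + ... + comb(k_1, 1) with k_d > ... > k_1 >= 0."
    ks = []
    for r in range(d, 0, -1):
        k = lower(n, r)      # largest k with comb(k, r) <= remaining n
        ks.append(k)
        n -= comb(k, r)
    assert n == 0
    return tuple(reversed(ks))   # (k_1, ..., k_d), matching the order of binomial_rep

# binomial_rep_d(20250605, 3) gives (79, 269, 496), same as above.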

[1] I looked back at the paper, and it is using a different kind of binomial number system, expressing numbers as sums of fixed binomial coefficients, not varying the binomial coefficient arguments. This representation has some advantages for error correction.

Additive and multiplicative persistence

Casting out nines is a well-known way of finding the remainder when a number is divided by 9. You add all the digits of a number n. If that sum is bigger than 9, you add all of its digits, and you keep this up until you get a number no bigger than 9.

This is an example of persistence. The persistence of a number, relative to some operation, is the number of times you have to apply that operation before you reach a fixed point, a point where applying the operation further makes no change.

The additive persistence of a number is the number of times you have to take the digit sum before reaching a fixed point (i.e. a single-digit number). For example, the additive persistence of 8675309 is 3 because the digits in 8675309 sum to 38, the digits in 38 sum to 11, and the digits in 11 sum to 2.

The additive persistence of a number in base b is on the order of its iterated logarithm in base b.

The smallest number with additive persistence n is sequence A006050 in OEIS.
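Here’s a short Python sketch of additive persistence, analogous to the multiplicative code below (function names mine):

def digit_sum(n):
    return sum(int(d) for d in str(n))

def additive_persistence(n):
    c = 0
    while n > 9:
        n = digit_sum(n)
        c += 1
    return c

# additive_persistence(8675309) returns 3, matching the example above.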

Multiplicative persistence is analogous, except you multiply digits together. Curiously, it seems that multiplicative persistence is bounded. This is true for numbers with less than 30,000 digits, and it is conjectured to be true for all integers.

The smallest number with multiplicative persistence n is sequence A003001 in OEIS.

Here’s Python code to compute multiplicative persistence.

def digit_prod(n):
    s = str(n)
    prod = 1
    for d in s:
        prod *= int(d)
    return prod

def persistence(n):
    c = 0
    while n > 9:
        n = digit_prod(n)
        c += 1
    return c

You could use this to verify that 277777788888899 has persistence 11. It is conjectured that no number has persistence larger than 11.
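Running the code above on that number confirms it:

>>> persistence(277777788888899)
11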

The persistence of a large number is very likely 1. If you pick a number with 1000 digits at random, for example, it’s very likely that at least one of these digits will be 0. So it would seem that the probability of a large number having persistence larger than 11 would be incredibly small, but I would not have expected it to be zero.

Here are plots to visualize the fixed points of the numbers less than N for increasing values of N.


It’s not surprising that zeros become more common as N increases. And a moment’s thought will explain why even numbers are more common than odd numbers. But it’s a little surprising that 4 is much less common than 2, 6, or 8.

Incidentally, we have worked in base 10 so far. In base 2, the maximum persistence is 1: when you multiply a bunch of 0s and 1s together, you either get 0 or 1. It is conjectured that the maximum persistence in base 3 is 3.

Iterated logarithm

Logarithms give a way to describe large numbers compactly. But sometimes even the logarithm is a huge number, and it would be convenient to take the log again. And maybe again.

For example, consider

googolplex = 10^googol

where

googol = 10^100.

(The name googol is older than “Google,” and in fact the former inspired the latter name.)

A googolplex is a big number. Its log base 10 is still a big number. You have to take the log 3 times to get to a small number:

log₁₀(log₁₀(log₁₀ googolplex)) = 2.

Log star

The iterated logarithm base b of a positive number x, written log*_b(x), is the number of times you have to take the log base b until you get a result less than or equal to 1. So in the example above

log*₁₀ googolplex = 4.

As usual, a missing base implies base e. Also, just as log base 2 is often denoted lg, the iterated log base 2 is sometimes denoted lg*.

I’ve seen sources that say the base of an iterated logarithm doesn’t matter. That’s not true, because, for example,

log*₂ googolplex = 6.

The function log*(x) appears occasionally in the analysis of algorithms.

Python implementation

It’s easy to implement the iterated log, though this is of limited use because the iterated log is usually applied to numbers too large to represent as floating point numbers.

from math import log, exp

def logstar(x, base=exp(1)):
    c = 0
    while x > 1:
        x = log(x, base)
        c += 1
    return c

lgstar = lambda x: logstar(x, 2)

Mersenne primes

The largest known prime, at the time of writing, is

p = 2^136279841 − 1.

We can use the code above to determine that lg* 136279841 = 5 and so lg* p = 6.
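Using the logstar code above:

>>> lgstar(136279841)
5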

Skewes’ bounds

The prime number theorem says that the number of prime numbers less than x, denoted π(x), is asymptotically li(x) where

\text{li}(x) = \int_0^x \frac{dt}{\log t}

You may have seen a different form of the prime number theorem that says π(x) is asymptotically x/log x. That’s true too, because li(x) and x/log x are asymptotically equal. But li(x) gives a better approximation to π(x) for finite x. More on that here.

For every x for which π(x) has been computed, π(x) < li(x). However, Littlewood proved in 1914 that there exists some x for which π(x) > li(x). In 1933 Stanley Skewes gave an upper bound on the smallest x such that π(x) > li(x), assuming the Riemann Hypothesis:

S1 = 10^(10^(10^34))

In 1955 he proved the upper bound

S2 = 10^(10^(10^964))

not assuming the Riemann Hypothesis.

You can calculate that

log*₁₀ S1 = log*₁₀ S2 = 5.

Although S2 is vastly larger than S1, the two towers have the same height, and the top exponents 34 and 964 each take two more base 10 logarithms to fall below 1, so the iterated logs come out the same.

Iterated iterated logarithm

In all the examples above, log*(x) is a small number. But for some numbers, such as Moser’s number and Graham’s number, even log*(x) is unwieldy and you have to do things like take the iterated logarithm of the iterated logarithm

log**(x) = log*( log*(x) )

before the numbers become imaginable.

Related posts

Approximation of Inverse, Inverse of Approximation

“The inverse of an approximation is an approximation of the inverse.” This statement is either obvious or really clever. Maybe both.

Logs and exponents

Let’s start with an example. For moderately small x,

10^x \approx \frac{1 + x}{1 - x}

That means that for x near 1 (that is, x equal to 10 raised to a small number),

\log_{10}x \approx \frac{x - 1}{x + 1}
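Here’s a quick numerical check of both approximations:

from math import log10

for x in [0.05, 0.1, 0.2]:
    print(10**x, (1 + x)/(1 - x))     # e.g. 1.585 vs 1.5 at x = 0.2

for x in [1.1, 1.5, 2.0]:
    print(log10(x), (x - 1)/(x + 1))  # e.g. 0.301 vs 0.333 at x = 2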

The inverse of a good approximation is a good approximation of the inverse. But the measure of goodness might not be the same in both uses of the word “good.” The inverse of a very good approximation might be a mediocre approximation of the inverse, depending on the sensitivity of the problem. See, for example, Wilkinson’s polynomial.

Inverse function theorem

Now let’s look at derivatives and inverse functions. The derivative of a function is the best linear approximation to that function at a point. And it’s true that the best linear approximation to the inverse of a function at a point is the inverse of the best linear approximation to the function. In symbols,

\frac{dx}{dy} = \frac{1}{\dfrac{dy}{dx}}

This is a good example of a bell curve meme, or coming full circle. The novice naively treats the symbols above as fractions and says “of course.” The sophomore says “Wait a minute. These aren’t fractions. You can’t just do that.” And they would be correct, except the symbols are limits of fractions, and in this context naive symbol manipulation in fact leads to the correct result.

However, the position on the right side of the meme is nuanced. Treating derivatives as fractions, especially partial derivatives, can get you into trouble.

The more rigorous statement of the inverse function theorem is harder to parse visually, and so the version above is a good mnemonic. If f(x) = y then

\bigl(f^{-1}\bigr)^\prime (y) = \frac{1}{f^\prime(x)} = \frac{1}{f^\prime(f^{-1}(y))}
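Here’s a quick numerical illustration, using f(x) = tan x so that the inverse is atan:

from math import tan, atan, cos

x = 0.7
y = tan(x)

h = 1e-6
d_inverse = (atan(y + h) - atan(y - h)) / (2*h)  # (f^-1)'(y) by a centered difference
print(d_inverse, 1/(1/cos(x)**2))                # compare with 1/f'(x)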

The same is true in multiple variables. If f is a smooth function from ℝⁿ to ℝⁿ then the best linear approximation to f at a point is a linear transformation, which can be represented by a matrix M. And the best linear approximation to the inverse function at that point is the linear transformation represented by the inverse of M. In symbols,

D\bigl(f^{-1}\bigr)(y) = \left[Df\bigl(f^{-1}(y)\bigr)\right]^{-1}

There’s fine print, such as saying a function has to be differentiable in a neighborhood of x and the derivative has to be non-singular, but that’s the gist of the inverse function theorem.

Note that unlike in the first section, there’s no discussion of how good the approximations are. The approximations are exact, at a point. But it might still be the case that the linear approximation on one side is good over a much larger neighborhood than the linear approximation on the other side.

False witnesses


Fermat’s primality test

Fermat’s little theorem says that if p is a prime and a is not a multiple of p, then

a^(p−1) = 1 (mod p).

The contrapositive of Fermat’s little theorem says if

a^(p−1) ≠ 1 (mod p)

then either p is not prime or a is a multiple of p. The contrapositive is used to test whether a number is prime. Pick a number a less than p (and so a is clearly not a multiple of p) and compute a^(p−1) mod p. If the result is not congruent to 1 mod p, then p is not prime.

So Fermat’s little theorem gives us a way to prove that a number is composite: if a^(p−1) mod p equals anything other than 1, p is definitely composite. But if a^(p−1) mod p does equal 1, the test is inconclusive. Such a result suggests that p is prime, but it might not be.

Examples

For example, suppose we want to test whether 57 is prime. If we apply Fermat’s primality test with a = 2 we find that 57 is not prime, as the following bit of Python shows.

    >>> pow(2, 56, 57)
    4

Now let’s test whether 91 is prime using a = 64. Then Fermat’s test is inconclusive.

    >>> pow(64, 90, 91)
    1

In fact 91 is not prime because it factors into 7 × 13.

If we had used a = 2 as our base we would have had proof that 91 is composite. In practice, Fermat’s test is applied with multiple values of a (if necessary).
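Here’s a sketch of what that looks like in code. The function name, the random choice of bases, and the number of trials are arbitrary, and a serious implementation would use a stronger test such as Miller–Rabin.

from random import randrange

def fermat_test(p, trials=20):
    "Return False if p is definitely composite, True if p passes every round (probably prime)."
    if p < 4:
        return p in (2, 3)
    for _ in range(trials):
        a = randrange(2, p - 1)
        if pow(a, p - 1, p) != 1:
            return False   # a witnesses that p is composite
    return True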

Witnesses

If a^(p−1) = 1 (mod p), then a is a witness to the primality of p. It’s not proof, but it’s evidence. It’s still possible p is not prime, in which case a is a false witness to the primality of p, or p is a pseudoprime to the base a. In the examples above, 64 is a false witness to the primality of 91.

Until I stumbled on a paper [1] recently, I didn’t realize that on average composite numbers have a lot of false witnesses. For large N, the average number of false witnesses for composite numbers less than N is greater than N^(15/23). Although N^(15/23) is big, it’s small relative to N when N is huge. And in applications, N is huge, e.g. N = 2^2048 for RSA encryption.

However, your chances of picking one of these false witnesses, if you were to choose a base a at random, is small, and so probabilistic primality testing based on Fermat’s little theorem is practical, and in fact common.

Note that the result quoted above is a lower bound on the average number of false witnesses. You really want an upper bound if you want to say that you’re unlikely to run into a false witness by accident. The authors in [1] do give upper bounds, but they’re much more complicated expressions than the lower bound.

Related posts

[1] Paul Erdős and Carl Pomerance. On the Number of False Witnesses for a Composite Number. Mathematics of Computation. Volume 46, Number 173 (January 1986), pp. 259–279.

Houston’s long term planning

When I hear American cities criticized for lack of long-term planning, my initial thought is that this is a good thing because the future cannot be predicted or directed very far in advance, and cities should be allowed to grow organically. Houston, in particular, is subject to a lot of criticism because of its lack of zoning, which I also think is a good thing. [1]

With that in mind, I was very surprised to see the following thread [2] from Aaron Renn.

Maybe the craziest thing about Houston’s third beltway, Grand Parkway, is that the idea originated in 1961, when the Houston metro area population was only 1.2 million people. [It’s now 7.8 million. — JC]

“In October 1965, plans for the Grand Parkway became public. The Houston City Planning Commission released a draft version of the 1966 Major Thoroughfare and Freeway Plan showing the new third loop on the official planning map.”

“The City Planning Commission approved the route and scheduled a public hearing for February 17, 1966. The Houston Chronicle enthusiastically endorsed the new route, saying, ‘No sensible citizen can doubt that this freeway will be needed eventually.’”

Aaron’s quotations are from Houston Freeways.

To a very crude approximation, a map of Houston looks like three concentric rings [3]. According to the sources cited above, planning for the outer ring began long before the middle ring was completed. The first section of the Grand Parkway opened in 1994, and it’s maybe three quarters finished currently.

Maybe what has made Houston successful is that it planned far ahead on a macro scale, but has not micromanaged as much as most cities.

[1] Along these lines, see James Scott’s book Seeing Like a State for examples of failures in long-range state planning. Or if you don’t have the time to read the book, you might want to read Venkatesh Rao’s article A Big Little Idea Called Legibility.

[2] It’s not exactly a thread. It’s a series of posts that quote the first sentence.

[3] When the outer ring, the Grand Parkway, is completed, it will be the longest beltway in the US and encompass more area than the state of Rhode Island.

Gardener’s ellipse

There are several ways to define an ellipse. If you want to write software to draw an ellipse, it’s most convenient to have a parametric form:

\begin{align*} x(t) &= h + a \cos(\theta) \cos(t) - b \sin(\theta) \sin(t) \\ y(t) &= k + a \sin(\theta) \cos(t) + b \cos(\theta) \sin(t) \end{align*}

This gives an ellipse centered at (h, k), with semi-major axis a, semi-minor axis b, and with the major axis rotated by an angle θ from horizontal.

But if you’re going to physically draw an ellipse, there’s a more convenient definition: an ellipse is the set of points such that the sum of the distances to two foci is a constant s. This method is most often called the gardener’s ellipse. It’s also called the nails-and-string method, which is more descriptive.

To draw an ellipse on a sheet of plywood, drive a nail into the plywood at each focus, and attach an end of a string of length s to each nail. Then put a pencil in the middle of the string and pull it tight. Keep the string tight as you move the pencil. This will draw an ellipse [1].

Presumably the name gardener’s ellipse comes from the same idea on a larger scale, such as a gardener marking out an elliptical flower bed using two stakes in the ground and some rope.

Going back to drawing on a computer screen, there is a practical use for the gardener’s ellipse: to determine whether a point is inside an ellipse, add the distance from the point to each of the foci of the ellipse. If the sum is less than s then the point is inside the ellipse. If the sum is greater than s the point is outside the ellipse.
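Here’s a minimal sketch of that test in code (the function name is mine):

from math import dist

def inside_ellipse(p, f1, f2, s):
    "Is point p inside the ellipse with foci f1, f2 and string length s?"
    return dist(p, f1) + dist(p, f2) < s

# Ellipse with foci (±3, 0) and string length 10, i.e. a = 5, b = 4:
# inside_ellipse((4, 0), (-3, 0), (3, 0), 10) is True
# inside_ellipse((6, 0), (-3, 0), (3, 0), 10) is False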

From parameters to nails and string

If you have the parametric form of an ellipse, how do you find the gardener’s method representation?

To make it easier to describe points, set θ = 0 and h = k = 0 for a moment. Then the foci are at (±c, 0) where c² = a² − b².

Since (a, 0) is a point on the ellipse, you have to be able to draw it, and so the string must stretch from the focus at (−c, 0) to the point (a, 0) and back to the focus at (c, 0). Therefore the length of the string is 2a.

The length of the string depends on the major axis and not on the minor axis.

When we shift and rotate our ellipse, letting θ, h, and k take on possibly non-zero values, we apply the same transformation to the foci.

\begin{align*} F_1 &= (h - c\cos(\theta), k - c\sin(\theta)) \\ F_2 &= (h + c\cos(\theta), k + c\sin(\theta)) \end{align*}

Rotating and translating the ellipse doesn’t change the length of the major axis, so s stays the same.

From nails and string to parameters

Draw a line between the two “nails” (foci). The slope of this line is tan θ and its midpoint is (h, k). The parameter c is half the length of the line.

The semimajor axis a is half the string length s.

Once you know a and c you can solve for b = √(a² − c²).
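Here are both conversions in code, a sketch of the formulas above assuming a ≥ b so that the major axis is the rotated one (function names mine):

from math import sin, cos, sqrt, atan2, dist

def params_to_nails(h, k, a, b, theta):
    "From center, semi-axes, and rotation angle to the two foci and the string length."
    c = sqrt(a**2 - b**2)
    f1 = (h - c*cos(theta), k - c*sin(theta))
    f2 = (h + c*cos(theta), k + c*sin(theta))
    return f1, f2, 2*a

def nails_to_params(f1, f2, s):
    "From the two foci and the string length to center, semi-axes, and rotation angle."
    h = (f1[0] + f2[0]) / 2
    k = (f1[1] + f2[1]) / 2
    c = dist(f1, f2) / 2
    a = s / 2
    b = sqrt(a**2 - c**2)
    theta = atan2(f2[1] - f1[1], f2[0] - f1[0])
    return h, k, a, b, theta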

 

Related posts

[1] In practice, a carpenter might want to draw an ellipse a different way.

Fitting a parabola to an ellipse and vice versa

The previous post discussed fitting an ellipse and a parabola to the same data. Both fit well, but the ellipse fit a little better. This will often be the case because an ellipse has one more degree of freedom than a parabola.

There is one way to fit a parabola to an ellipse at an extreme point: match the extreme point and the curvature there. This uses up all the degrees of freedom in the parabola.

When you take the analogous approach to fitting an ellipse to a parabola, you have one degree of freedom left over. The curvature depends on a ratio, and so you can adjust the parameters while maintaining the ratio. You can use this freedom to fit the parabola better over an interval while still matching the curvature at the vertex.

The rest of the post will illustrate the ideas outlined above.

Fitting a parabola to an ellipse

Suppose you have an ellipse with equation

(x/a)² + (y/b)² = 1.

The curvature at (±a, 0) equals a/b² and the curvature at (0, ±b) equals b/a².

Now if you have a parabola

x = cy² + d

then its curvature at y = 0 is 2|c|.

If you want to match the parabola and the ellipse at (a, 0) then d = a.

To match the curvatures at (a, 0) we set a/b² = 2|c|. So c = −a/2b². (Without the negative sign the curvatures would match, but the parabola would turn away from the ellipse.)

Similarly, at (−a, 0) we have d = −a and c = a/2b². And at (0, ±b), with the parabola written as y = cx² + d, we have d = ±b and c = ∓b/2a².
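Here’s a quick symbolic sanity check of those curvature values, using sympy and the standard curvature formula for a parametric plane curve:

from sympy import symbols, cos, sin, pi, diff, simplify, Rational

a, b, c, t, y = symbols('a b c t y', positive=True)

# Ellipse traced as (a cos t, b sin t)
x1, y1 = a*cos(t), b*sin(t)
kappa = ((diff(x1, t)*diff(y1, t, 2) - diff(y1, t)*diff(x1, t, 2))
         / (diff(x1, t)**2 + diff(y1, t)**2)**Rational(3, 2))

print(simplify(kappa.subs(t, 0)))      # a/b**2, the curvature at (a, 0)
print(simplify(kappa.subs(t, pi/2)))   # b/a**2, the curvature at (0, b)

# Parabola x = c y**2 + d (the constant d doesn't affect curvature)
x2 = c*y**2
print(simplify(diff(x2, y, 2) / (1 + diff(x2, y)**2)**Rational(3, 2)).subs(y, 0))  # 2*c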

Here’s an example with a golden ellipse.

Fitting an ellipse to a parabola

Now we fix the parabola, say

y = cx²

and find an ellipse

(x/a)² + ((y − y0)/b)² = 1

to fit at the vertex (0, 0). For the ellipse to touch the parabola at its vertex we must have

((0 − y0)/b)² = 1

and so y0 = b. To match curvature we have

b/2a² = c.

So a and b are not uniquely determined, only the ratio b/a². As long as this ratio stays fixed at 2c, every such ellipse will touch the parabola at the vertex and match its curvature there. But larger values of the parameters will match the parabola more closely over a wider range. In the limit as b → ∞ (keeping b/a² = 2c), the ellipses converge to the parabola.

Related posts

The ellipse hidden inside Pascal’s triangle

The nth row of Pascal’s triangle contains the binomial coefficients C(n, r) for r ranging from 0 to n. For large n, if you print out the numbers in the nth row vertically in binary you can see a circular arc.

Here’s an example with n = 50.

You don’t have to use binary. Here are the numbers in the row for n = 100 written in base 10. You may be able to read the numbers if you squint, but you can see that the shape of the curve is something like a piece of a circle.

The length of the numerical representation of a number is roughly proportional to its logarithm. Changing the base only changes the proportionality constant. The examples above suggest that a plot of the logarithms of a row of Pascal’s triangle will be a portion of a circle, up to some scaling of one of the axes, so in general we have an ellipse.

Here’s a plot of log C(n, r) for n = 1000. The shape is the same for all large n, so the choice of n = 1000 doesn’t matter much at all.
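Here’s one way to generate the data for such a plot (a sketch, not the plotting code itself):

from math import comb, log

n = 1000
rs = range(n + 1)
logC = [log(comb(n, r)) for r in rs]   # log C(n, r) for r = 0, ..., n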

The best fitting ellipse is

\left(\frac{r - n/2}{a}\right)^2 + \left(\frac{y-y_0}{b}\right)^2 = 1

where a = 554.2, b = 47.12, and y0 = −19.87.

The plots of log C(1000, r) and the ellipse lie on top of each other; the error is less than the width of a plotting line. Here’s a plot of the relative error in approximating log C(1000, r) with the ellipse.

Parabolic fit

Wolfgang pointed out that the curve should be a parabola rather than an ellipse because the binomial distribution is asymptotically normal. Makes perfect sense.

So I redid my plots with the parabola that interpolates log C(n, r) at 0, n/2, and n. This also gives a very good fit, but not as good!

But that’s not a fair comparison because it’s comparing the best (least squares) elliptical fit to a convenient parabolic fit.

So I redid my plots again with the least squares parabolic fit. The fit was better, but still not as good as the elliptical fit.

Part of the reason the ellipse fits better than the parabola has to do with the limitations of the central limit theorem. First of all, it applies to CDFs, not PDFs. Second, it applies to absolute error, not relative error. In practice, the CLT gives a good approximation in the middle but not in the tails. With all the curves mentioned above, the maximum error is in the tails.

Another part of the reason the ellipse fits better is that ellipses are more flexible than parabolas, and the discussion above suggests such flexibility would be useful. You can approximate a parabola arbitrarily well with an ellipse, so in the limit, elliptical fits include parabolic fits, as demonstrated in the next post.

Stacking positive and negative cannonballs

Imagine you are a soldier in charge of stacking cannonballs. Your fort has a new commander, a man with OCD who wants the cannonballs stacked in a very particular way.

The new commander wants the balls stacked in tetrahedra. The balls on the bottom of the stack are arranged into a triangle. Then the next layer is also a triangle, each ball resting in the space between balls in the previous layer. Keep doing this until there’s one ball on top.

For example, you might have 10 balls on the first layer, 6 on the next layer, then 3, then 1. That’s how you’d stack 20 cannonballs.

The number of cannonballs in a tetrahedron n layers high is called the nth tetrahedral number. We just showed that the 4th tetrahedral number is 20.

Not every number is a tetrahedral number, so it’s not always possible to stack a given number of cannonballs into a tetrahedron. But it’s always possible, as far as we know, to stack any number of cannonballs into no more than five tetrahedra. Usually it is possible to use no more than four tetrahedra. Naturally, your new commander would like the cannonballs arranged into no more than four tetrahedra.

There are two ways of meeting the commander’s wishes. One is to make a list of the number of cannonballs that cannot be arranged into four tetrahedral stacks and have a standing order that whenever the cannon fire is over, you always shoot one more ball to avoid having to stack a forbidden number of balls.

The other possibility is to allow the possibility of negative cannonballs.

We explain both possibilities below.

Tetrahedral numbers

Let T(n, 2) be the number of cannonballs on the bottom layer of a tetrahedron, arranged into a triangle with n balls on each side. This is the nth triangular number. The 2 in T(n, 2) stands for two dimensions.

T(n, 2) = 1 + 2 + 3 + \cdots + n = \frac{n(n+1)}{2}

Now let T(n, 3) be the nth tetrahedral number.

T(n, 3) = T(1, 2) + T(2, 2) + T(3, 2) + \cdots + T(n, 2) = \frac{n(n+1)(n+2)}{6}

It’s possible to define and find a compact expression for T(n, d) for all positive integers d, which I wrote about here. But for this post we only need T(n, 3).

Pollock’s conjecture

Sir Frederick Pollock conjectured in 1850 that every positive integer can be written as the sum of at most 5 tetrahedral numbers, and all but a finite number of integers can be written as the sum of at most 4 tetrahedral numbers.

Both parts of the conjecture are still open. Nobody has found a number requiring more than 5 tetrahedral numbers in its sum, and it is believed that there are exactly 241 numbers that cannot be written as the sum of 4 tetrahedral numbers, the smallest being 17 and the largest being 343867. For a list of these numbers, see OEIS sequence A000797.

Negative cannonballs

It is possible to write any integer as the sum of four tetrahedral numbers if you take the formula for computing T(n, 3) as the definition of T(n, 3), which permits negative values of n [1]. Vadim Ponomarenko [2] showed that this could be done and shows how:

n = T(n, 3) + T(n-2, 3) + 2T(-n-1, 3)

Proof: expand the four terms on the right and simplify.
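Here’s that expansion done with sympy, taking the formula for T(n, 3) as the definition so that negative arguments are allowed:

from sympy import symbols, simplify

n = symbols('n')

def T(k):   # T(k, 3), defined by the formula
    return k*(k + 1)*(k + 2)/6

print(simplify(T(n) + T(n - 2) + 2*T(-n - 1) - n))   # prints 0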

Satisfying the commander

I imagine the commander might not be pleased with the negative cannonball solution. If you could imagine negative cannonballs as balls missing from the stack, that would be one thing, but that’s not where we are. To stack 17 cannonballs into four tetrahedra, you’d need a stack of 969 balls, a stack of 680 balls and two stacks of 816 negative balls.

Better go back to saying certain numbers of cannonballs simply aren’t allowed, or require a fifth stack. But what if you tell the commander 17 balls cannot be arranged into four tetrahedral stacks and he tells you to try harder?

You could write a Python program to show by brute force when an arrangement isn’t possible. Brute force isn’t the most elegant approach, but it’s often the most convincing. This code will check whether n cannonballs can be stacked in four tetrahedral piles.

def t(n): return n*(n + 1)*(n + 2) // 6

def check(n):
    m = n + 1
    s = set()
    for i in range(m):
        for j in range(m):
            for k in range(m):
                for l in range(m):
                    s.add(t(i) + t(j) + t(k) + t(l))
    return n in s

You could be a little more clever by using itertools to eliminate the deeply nested for loops, though this goes against the spirit of brute force.

from itertools import product

def check2(n):
    m = n + 1
    s = set()
    for i, j, k, l in product(range(m), repeat=4):
        s.add(t(i) + t(j) + t(k) + t(l))
    return n in s

Update: The code above could be made much more efficient. See the C code in the comment.
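That C code isn’t reproduced here, but even in Python a meet-in-the-middle version is far faster than the quadruple loop. Here’s a sketch, reusing the function t above:

def check3(n):
    "Is n the sum of four tetrahedral numbers? Checks whether n splits into two pair-sums."
    tets, i = [], 0
    while t(i) <= n:
        tets.append(t(i))
        i += 1
    pair_sums = {x + y for x in tets for y in tets if x + y <= n}
    return any(n - s in pair_sums for s in pair_sums)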

Related posts

[1] We tend to forget definitions once we have convenient formulas. This can be good or bad. It’s bad for comprehension. For example, calculus students usually don’t appreciate that the fundamental theorem of calculus is a theorem. They learn to compute integrals by using antiderivatives and come to think that the antiderivative is the definition of an integral.

The upside of blurring definitions and formulas is that it is common to use a formula to generalize a definition, as we do here. This is often a very useful approach, but it has to be done consciously.

[2] Vadim Ponomarenko. Pollock’s Generalized Tetrahedral Numbers Conjecture. The American Mathematical Monthly Vol. 128, No. 10 (December 2021), p. 921.