The Real Book

I listened to the 99% Invisible podcast about The Real Book this morning and thought back to my first copy.

My first year in college I had a jazz class, and I needed to get a copy of The Real Book, a book of sheet music for jazz standards. The book was illegal at the time, but there was no legal alternative, and I had no scruples about copyright back then.

When a legal version came out later I replaced my original book with the one in the photo below.

The New Real Book Legal

The podcast refers to “When Hal Leonard finally published the legal version of the Real Book in 2004 …” but my book says “Copyright 1988 Sher Music Co.” Maybe Hal Leonard published a version in 2004, but there was a version that came out years earlier.

The podcast also says “Hal Leonard actually hired a copyist to mimic the old Real Book’s iconic script and turn it into a digital font.” But my 1988 version looks not unlike the original. Maybe my version used a kind of typesetting common in jazz, but the Hal Leonard version looks even more like the original handwritten sheet music.

Substack replacing email subscription

The service that sent out my email to blog subscribers stopped working a couple weeks ago, and I’m trying out Substack as a replacement. You can find my Substack account here.

My plan for now is to use this account to make blog post announcements, maybe once a week, with a little introductory commentary for each link. I expect to adjust course in response to feedback. Maybe I’ll write some posts just on Substack, but for now I intend to use it as a place to post blog round-ups.

The Substack is free and I have no intention to ever charge for it. It’s possible I might make some premium add-on in the future, but I doubt it. If I did, the blog post round-up would remain free.


Determinant of an infinite matrix

What does the infinite determinant

D = \left| \begin{array}{lllll} 1 & a_1 & 0 & 0 & \cdots \\ b_1 & 1 & a_2 & 0 & \cdots \\ 0 & b_2 & 1 & a_3 & \\ 0 & 0 & b_3 & 1 & \ddots \\ \vdots & \vdots & & \ddots & \ddots \\ \end{array} \right|

mean and when does it converge?

The determinant D above is the limit of the determinants Dn defined by

D_n = \left| \begin{array}{lllll} 1 & a_1 & & & \\ b_1 & 1 & a_2 & & \\ & b_2 & 1 & \ddots & \\ & & \ddots & \ddots & a_{n-1} \\ & & & b_{n-1} & 1 \\ \end{array} \right|

If all the a's are 1 and all the b's are −1 then this post shows that Dn = Fn, the nth Fibonacci number. The Fibonacci numbers obviously don’t converge, so in this case the determinant of the infinite matrix does not converge.
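
Here’s a quick numerical check of the Fibonacci claim, a minimal Python sketch using the standard three-term recurrence for tridiagonal determinants (the function name is mine):

def tridiag_det(a, b, n):
    # Determinant of the n x n tridiagonal matrix above via the recurrence
    # D_1 = 1, D_2 = 1 - a_1 b_1, D_k = D_{k-1} - a_{k-1} b_{k-1} D_{k-2}
    D_prev, D = 1, 1   # D_0 (the empty determinant) and D_1
    for k in range(2, n + 1):
        D_prev, D = D, D - a[k - 2] * b[k - 2] * D_prev
    return D

a = [1] * 10
b = [-1] * 10
print([tridiag_det(a, b, n) for n in range(1, 11)])
# [1, 2, 3, 5, 8, 13, 21, 34, 55, 89]: Fibonacci numbers, growing without bound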

In 1895, Helge von Koch said in a letter to Poincaré that the infinite determinant is absolutely convergent if and only if the sum

\sum_{i=1}^\infty a_ib_i

is absolutely convergent. A proof is given in [1].

The proof shows that the Dn are bounded by

\prod_{i=1}^n\left(1 + |a_ib_i| \right)

and so the infinite determinant converges if the corresponding infinite product converges. And a theorem on infinite products says

\prod_{i=1}^\infty\left(1 + |a_ib_i| \right)

converges absolutely if the sum in Koch’s theorem converges. In fact,

\prod_{i=1}^\infty\left(1 + |a_ib_i| \right) \leq \exp\left(\sum_{i=1}^\infty |a_ib_i| \right )

and so we have an upper bound on the infinite determinant.
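
As a sketch of the convergent case, take a_i = b_i = 2^−i, so the sum of |a_i b_i| is a geometric series summing to 1/3 and Koch’s criterion holds. The determinants settle down quickly and the partial products respect the exponential bound:

import numpy as np

def det_n(n, a, b):
    # Build the n x n tridiagonal matrix explicitly and take its determinant
    M = np.eye(n)
    for i in range(n - 1):
        M[i, i + 1] = a[i]
        M[i + 1, i] = b[i]
    return np.linalg.det(M)

a = b = [2.0 ** -(i + 1) for i in range(30)]
for n in (5, 10, 20, 30):
    print(n, det_n(n, a, b))   # converges rapidly to a limit

lhs = np.prod([1 + x * y for x, y in zip(a, b)])
rhs = np.exp(sum(x * y for x, y in zip(a, b)))
print(lhs <= rhs)              # True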

Related post: Tridiagonal systems, determinants, and cubic splines

[1] A. A. Shaw. H. von Koch’s First Lemma and Its Generalization. The American Mathematical Monthly, April 1931, Vol. 38, No. 4, pp. 188–194

Area of quadrilateral as a determinant

I’ve written several posts about how determinants come up in geometry. These determinants often look similar, having columns related to coordinates and a column of ones. You can find several examples here along with an explanation for this pattern.

If you have three points z1, z2, and z3 in the complex plane, you can find the area of a triangle with these points as vertices

A(z_1, z_2, z_3) = \frac{i}{4} \, \left| \begin{matrix} z_1 & \bar{z}_1 & 1 \\ z_2 & \bar{z}_2 & 1 \\ z_3 & \bar{z}_3 & 1 \\ \end{matrix} \right|

You can read more about this here.

If you add the second column to the first, and subtract the first column from the second, you can get the equation for the area of a triangle in the real plane.

A = \frac{1}{2} \, \left| \begin{matrix} x_1 & y_1 & 1 \\ x_2 & y_2 & 1 \\ x_3 & y_3 & 1 \\ \end{matrix} \right|

Presumably the former equation was first derived from the latter, but the two are equivalent.
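
Here’s a minimal numerical check, in Python, that the two formulas agree:

import numpy as np

def area_complex(z1, z2, z3):
    # A = (i/4) times the determinant with rows [z, conj(z), 1]
    M = np.array([[z, np.conj(z), 1] for z in (z1, z2, z3)])
    return (0.25j * np.linalg.det(M)).real

def area_real(p1, p2, p3):
    # A = (1/2) times the determinant with rows [x, y, 1]
    M = np.array([[x, y, 1.0] for x, y in (p1, p2, p3)])
    return 0.5 * np.linalg.det(M)

print(area_complex(0, 1, 1j))             # 0.5
print(area_real((0, 0), (1, 0), (0, 1)))  # 0.5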

Now suppose you have a quadrilateral whose vertices are numbered in clockwise or counterclockwise order. Then the area is given by

A = \frac{1}{2} \, \left| \begin{matrix} x_1 & y_1 & 1 & 0\\ x_2 & y_2 & 1 & 1\\ x_3 & y_3 & 1 & 0\\ x_4 & y_4 & 1 & 1\\ \end{matrix} \right|

The proof is easy. If you expand the determinant by minors on the last column, you get the sum of two 3 × 3 determinants corresponding to the areas of the two triangles formed by splitting the quadrilateral along the diagonal connecting the first and third points.
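
And a sketch of the quadrilateral version, checked on a unit square with vertices in counterclockwise order:

import numpy as np

def quad_area(pts):
    # pts: four (x, y) vertices in clockwise or counterclockwise order
    M = np.ones((4, 4))
    M[:, 0] = [x for x, y in pts]
    M[:, 1] = [y for x, y in pts]
    M[:, 3] = [0, 1, 0, 1]   # the column that splits off the two triangles
    return 0.5 * np.linalg.det(M)

print(quad_area([(0, 0), (1, 0), (1, 1), (0, 1)]))   # 1.0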

A very accurate logarithm approximation

The previous post looked at an efficient way to approximate nth roots of fractions near 1 by hand. This post does the same for logarithms.

As before, we assume x = p/q and define

s = p + q
d = p − q

Because we’re interested in values of x near 1, d is small, and small numbers are convenient to work with by hand.

In [1] Kellogg gives the approximation

log x ≈ 3(x² − 1)/((x + 1)² + 2x) = 6ds/(3s² − d²)

So, for example, suppose we wanted to take the natural log of 7/8. Then p = 7, q = 8, s = 15, and d = −1.

log x ≈ (6×15×(−1))/(3×225 − 1) = −90/674 = −45/337.

This approximation is good to six decimal places.
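
Here’s the approximation as a couple of lines of Python (the function name is mine):

import math

def kellogg_log(p, q):
    # Kellogg: log(p/q) ≈ 6ds / (3s² − d²) with s = p + q, d = p − q
    s, d = p + q, p - q
    return 6 * d * s / (3 * s**2 - d**2)

print(kellogg_log(7, 8))   # -0.13353115727...
print(math.log(7 / 8))     # -0.13353139262...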

Kellogg claims that

This value of E [the natural logarithm], if q [what I’ve called x] be between .9 and 1.1, is true to the seventh decimal.

He then goes on to explain how to create an even more accurate approximation, and how to deal with larger values of x.

Here’s a plot verifying Kellogg’s claim.

Note that the scale of the plot is 10⁻⁸. As the flat spot in the middle suggests, you get even more decimal places for x closer to 1.

[1] Ansel N. Kellogg. Empirical formulæ for Approximate Computation. The American Mathematical Monthly. February 1897, Vol. 4, No. 2, pp. 39–49.


Handy approximation for roots of fractions

This post will discuss a curious approximation with a curious history.

Approximation

Let x be a number near 1, written as a fraction

x = p / q.

Then define s and d as the sum and difference of the numerator and denominator.

s = p + q
d = p − q

Since we are assuming x is near 1, s is large relative to d.

We have the following approximation for the nth root of x.

\sqrt[n]{x} \approx \frac{ns + d}{ns - d}

This comes from a paper written in 1897 [1]. At the time there was great interest in approximations that are easy to carry out by hand, and this formula would have been very convenient.

The approximation assumes x is near 1. If not, you can multiply x by a number whose root is known to bring it near 1. There is an example below.

Examples

Positive d

Let’s find the cube root of x = 112/97. We have n = 3, p = 112, q = 97, s = 209, and d = 15. The approximation says

³√x ≈ 642/612 = 107/102 = 1.049019…

while the exact value is 1.049096… .

Negative d

The value of d might be negative, as when x = 31/32. If we want to find the fifth root, n = 5, p = 31, q = 32, s = 63, and d = −1.

⁵√x ≈ 314/316 = 157/158 = 0.9936708…

while the exact value is 0.9936703… .

x not near 1

If x is not near 1, you can make it near 1. For example, suppose you wanted to compute the square root of 3. Since 17² = 289, 300/289 is near 1. You could find the square root of 300/289, then multiply the result by 17/10 to get an approximation to √3.
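
Here’s the whole method in a few lines of Python (the function name is mine):

def root_approx(p, q, n):
    # Mercator’s formula: (p/q)^(1/n) ≈ (ns + d) / (ns − d)
    s, d = p + q, p - q
    return (n * s + d) / (n * s - d)

print(root_approx(112, 97, 3))             # 1.049019..., exact 1.049096...
print(root_approx(31, 32, 5))              # 0.9936708..., exact 0.9936703...
print(root_approx(300, 289, 2) * 17 / 10)  # 1.732047..., exact √3 = 1.732050...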

History

The author refers to this approximation as Mercator’s formula, presumably after Gerardus Mercator (1512–1594) [2] of map projection fame. A brief search did not find this formula because Mercator’s projection drowns out Mercator’s formula in search results.

The author says a proof is given in Hutton’s Tracts on Mathematics, Vol 1. I tracked down this reference, and the full title in all its 19th century charm is

TRACTS
ON
MATHEMATICAL
AND
PHILOSOPHICAL SUBJECTS,
COMPRISING,
AMONG NUMEROUS IMPORTANT ARTICLES,
THE THEORY OF BRIDGES,
WITH SEVERAL PLANS OF IMPROVEMENT,
ALSO,
THE RESULTS OF NUMEROUS EXPERIMENTS ON
THE FORCE OF GUNPOWDER,
WITH APPLICATIONS TO
THE MODERN PRACTICE OF ARTILLERY.
IN THREE VOLUMES
BY CHARLES HUTTON, LL.D. AND F.R.S. &c.
Late Professor of Mathematics in the Royal Military Academy, Woolwich.

Hutton’s book looks interesting. You can find it on Archive.org. Besides bridges and gunpowder, the book has a lot to say about what we’d now call numerical analysis, such as ways to accelerate the convergence of series. Hutton’s version of the formula above does not require that x be near 1.


[1] Ansel N. Kellogg. Empirical formulæ for Approximate Computation. The American Mathematical Monthly. February 1897, Vol. 4, No. 2, pp. 39–49.

[2] Mercator’s projection is so familiar that we may not appreciate what a clever man he was. We can derive his projection now using calculus and logarithms, but Mercator developed it before Napier developed logarithms or Newton developed calculus. More on that here.

Uncovering names masked with stars

Sometimes I’ll see things like my name partially concealed as J*** C*** and think “a lot of good that does.”

Masking letters reveals more than people realize. For example, when you see that someone’s first name is four letters and begins with J, there’s about a 70% chance they’re male and a 44% chance they’re named John. If you know this person is male, there’s a 63% chance their name is John.

If you know a man’s name has the form J***, his name isn’t necessarily John, though that’s the most likely possibility. There’s an 8% chance his name is Jack and a 6% chance his name is Joel.

All these numbers depend on the data set you’re looking at, but these are roughly accurate numbers for looking at any representative sample of American names.

Some names stand out more than others. If I tell you someone’s name is E********, there’s a 90% chance the name is Elizabeth.

If I tell you someone’s name is B*****, there’s a 77% chance this person is female, but it’s harder to guess which name is hers. The most likely possibility is Brenda, but there are several other possibilities that are fairly likely: Bonnie, Brooke, Brandy, etc.

We could go through a similar exercise with last names. You can probably guess who S**** is, though C***** is not so clear.

In short, replacing letters with stars doesn’t do much to conceal someone’s name. It usually doesn’t let you infer someone’s name with certainty, but it definitely improves your chances of guessing correctly. If you have a few good guesses as to someone’s name, and some good guesses on a handful of other attributes, together you have a good chance of identifying someone.
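
If you want to play with this yourself, here’s a sketch of the arithmetic in Python. The frequency counts below are made up for illustration; in practice you’d load something like the SSA baby name data.

def mask_matches(name, mask):
    # A mask like "J***" fixes the length and the unmasked letters
    return len(name) == len(mask) and all(
        m == "*" or m == c for m, c in zip(mask, name))

def guesses(mask, counts):
    # counts maps names to frequencies; returns P(name | mask)
    hits = {n: c for n, c in counts.items() if mask_matches(n, mask)}
    total = sum(hits.values())
    return sorted(((n, c / total) for n, c in hits.items()),
                  key=lambda pair: -pair[1])

counts = {"John": 4400, "Jack": 800, "Joel": 600, "Juan": 900, "Jane": 700}
print(guesses("J***", counts))   # John comes out far ahead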


Almost ASCII

I was working recently with a gigabyte file that had a dozen non-ASCII characters. This is very common. The ASCII character set is not quite big enough for a lot of tasks. Of course it’s completely inadequate if you’re writing Japanese, but it’s almost enough for documents written in English and a few other languages.

Efficient encoding

The world has standardized on Unicode as the way to represent characters across languages. Unicode currently has around 150,000 characters, far more than ASCII’s 128 characters.

But there’s a problem. Since 150,000 > 2¹⁷, it takes more than two bytes (eight bits to a byte) to represent each of 150,000 things. If you use three bytes to represent each character, every file that is almost all ASCII will get three times bigger. If you limit yourself to the most frequently used Unicode characters, those that can be represented with two bytes (the “basic multilingual plane”), then you still double the size of files.

Enter UTF-8, a brilliant solution to this problem. The UTF-8 encoding of an ASCII file is an ASCII file. Pure ASCII files don’t get any larger when interpreted as UTF-8 encoded Unicode. Because 128 = 2⁷, a byte representing an ASCII character has one unused bit. UTF-8 uses this unused bit to signal that what follows is not ASCII. I wrote about the full details here.

Unicode characters outside the ASCII range take 2, 3, or 4 bytes to represent. Inserting a small number of non-ASCII characters into a UTF-8 encoded Unicode file hardly changes the file’s size.

Troubleshooting

I mentioned at the top that I had a gigabyte file with a dozen non-ASCII characters. The command file -I reported the file encoding to be ASCII, because the vast majority of the file was ASCII. But the non-ASCII characters were not valid Unicode characters either.

These invalid Unicode characters would display as �, which is not actually in the file. The � is a valid Unicode character for representing an invalid Unicode character.

Some of the non-ASCII characters were extended ASCII (Windows 1252) characters, but if I remember correctly even that didn’t account for everything. Some of the odd characters were simply file corruption.

It’s kinda interesting how some tools are robust to these kinds of glitches and some are not. My first clue that something funny was going on was when sort refused to sort. I ran a Python script that helps me fix wonky text files and it threw an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 222662: invalid start byte

This may seem like gibberish, but it actually says exactly what’s going on. There was an error interpreting the file as Unicode, because 0x92 is not a valid way to start a non-ASCII character in UTF-8.

The first bit of an ASCII character is 0. The first two bits of a non-ASCII character in UTF-8 are 11. But 9 is 1001 in binary, i.e. it starts with 10, and so the byte 0x92 is neither an ASCII character nor the beginning of a UTF-8 non-ASCII sequence of bytes. More details here.
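
You can classify any byte by its leading bits. A minimal sketch of the rules:

def utf8_byte_class(b):
    # Classify a byte by its high bits, following UTF-8’s design
    if b >> 7 == 0b0:     return "ASCII"
    if b >> 6 == 0b10:    return "continuation byte: cannot start a character"
    if b >> 5 == 0b110:   return "starts a 2-byte character"
    if b >> 4 == 0b1110:  return "starts a 3-byte character"
    if b >> 3 == 0b11110: return "starts a 4-byte character"
    return "not valid anywhere in UTF-8"

print(utf8_byte_class(0x92))   # 0x92 = 10010010: a continuation byte,
                               # hence the "invalid start byte" error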

Removing non-ASCII characters

For my application I could just remove the invalid characters using iconv with the -c option.

iconv -c -f CP1252 -t UTF-8 inputfile > outputfile

If you need to salvage troublesome characters then things are a little more complicated. The iconv utility will work if you know what the intended encoding was. If you don’t know the intended encoding, you may need to do some detective work.
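
One way to do the detective work is the third-party chardet package, which makes a statistical guess at the encoding. A sketch, with a placeholder file name:

import chardet   # pip install chardet

with open("inputfile", "rb") as f:
    guess = chardet.detect(f.read(100_000))   # sample the first 100 KB
print(guess)   # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}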


A knight’s tour of an infinite chessboard

Let ℤ² be the lattice of points in the plane with integer coordinates. You could think of these points as being the centers of the squares in a chessboard extending to infinity in every direction.

Cantor tells us that the points in ℤ² are countable. What’s more surprising is that you could count the points using a knight’s move. If you place a knight at the origin, there is a way for him to visit every point exactly once.

It’s possible to tour every point in a 5 × 5 lattice as shown below.

The knight’s next move would take it out of the 5 × 5 block and position it to begin touring a new 5 × 5 block bordering the first block.

You can repeat this pattern, covering ℤ² in a spiral of 5 × 5 squares.

I produced the first two images using Quiver, an application intended for drawing commutative diagrams, though I’ve found it useful for other drawing tasks as well.

The third image was taken from a note by Robert E. Gilman on an article by Norman Anning: A Recreation. The American Mathematical Monthly, Vol. 37, No. 10 (Dec., 1930), pp. 535–538.
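
Finding such a tour makes a nice little programming exercise. Here’s a brute-force backtracking search in Python, my own sketch rather than the construction shown in the images:

MOVES = [(1, 2), (2, 1), (2, -1), (1, -2),
         (-1, -2), (-2, -1), (-2, 1), (-1, 2)]

def knights_tour(n=5, start=(0, 0)):
    # Depth-first search for a path visiting every square exactly once
    path, visited = [start], {start}

    def extend():
        if len(path) == n * n:
            return True
        x, y = path[-1]
        for dx, dy in MOVES:
            p = (x + dx, y + dy)
            if p not in visited and 0 <= p[0] < n and 0 <= p[1] < n:
                visited.add(p)
                path.append(p)
                if extend():
                    return True
                visited.remove(p)
                path.pop()
        return False

    return path if extend() else None

print(knights_tour())   # 25 squares, each visited exactly once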


Natural one-liners

I learned to use Unix in college—this was before Linux—but it felt a little mysterious. Clearly it was developed by really smart people, but what were the problems that motivated their design choices?

Some of the explanations are widely repeated. For example, commands have terse names because users once had to transmit them over glacial 300 baud network connections.

OK, but why are there so many tools for munging text files, for example? That’s great if your job requires munging text files, but what about everything else? What I didn’t realize at the time was that nearly everything involves munging text files, or can be turned into a problem involving munging text files.

Working with data at the command line

There’s an old joke that Unix is user friendly, it’s just picky about who its friends are. I’d rephrase to say Unix makes more sense when you’re doing the kind of work the Unix developers were doing.

I was writing programs when I learned Unix, so some things about Unix made sense at the time. But I didn’t see the motivation for many of the standard command line tools until I started analyzing datasets years later. I thought awk was cool—it was the first scripting language I encountered—but it wasn’t until years later that I realized awk is essentially a command line spreadsheet. It was written for manipulating tabular data stored in text files.

Mythological scripts

Unix one-liners are impressive, but they can seem like a rabbit out of a hat. How would anyone think to do that?

When you develop your own one-liners, one piece at a time, they seem much more natural. You get a feel for how the impressive one-liners you see on display were developed incrementally. They almost certainly did not pop into the world fully formed like Athena springing from the head of Zeus.

Example: Voter registration data

Here’s an example. I was looking at Washington state voter registration data. There’s a file 20240201_VRDB_Extract.txt. What’s in there?

The first line of a data file often contains column headers. Looking at just the first few lines of a file is a perennial task, so there’s a tool for that: head. By default it shows the first 10 lines of a file. We just want to see the first line, and there’s an option for that: -n 1.

> head -n 1 20240201_VRDB_Extract.txt

StateVoterID|FName|MName|LName|NameSuffix|Birthyear|Gender|RegStNum|RegStFrac|RegStName|RegStType|RegUnitType|RegStPreDirection|RegStPostDirection|RegStUnitNum|RegCity|RegState|RegZipCode|CountyCode|PrecinctCode|PrecinctPart|LegislativeDistrict|CongressionalDistrict|Mail1|Mail2|Mail3|MailCity|MailZip|MailState|MailCountry|Registrationdate|LastVoted|StatusCode

Inserting line breaks

OK, those look like column headers, but they’re hard to read. It would be nice if we could replace all the pipe characters used as field separators with line breaks. There’s a command for that too. The sed tool lets you, among other things, replace one string with another. The tiny sed program

s/|/\n/g

does just what we want. It may look cryptic, but it’s very straightforward. The “s” stands for substitute. The program s/foo/bar/ substitutes the first instance of foo with bar. If you want to replace all instances, you tack a “g” on the end for “global.”

Eliminating temporary files

We could save our list of column headings to a file, and then run sed on the output, but that creates an unnecessary temporary file. If you do this very much, you get a lot of temporary files cluttering your working area, say with names like temp1 and temp2. Then after a while you start to forget what you named each intermediary file.

It would be nice if you could connect your processing steps together without having to create intermediary files. And that’s just what pipes do. So instead of saving our list of column headers to a file, we pipe it through to sed.

> head -n 1 20240201_VRDB_Extract.txt | sed 's/|/\n/g'

StateVoterID
FName
MName
LName
NameSuffix
Birthyear
...

Scrolling and searching

This is much better. But it produces more output than you may be able to see in your terminal. You could see the list, one screenful at a time, by piping the output to less.

> head -n 1 20240201_VRDB_Extract.txt | sed 's/|/\n/g' | less

This file only has 33 columns, but it’s not uncommon for a data file to have hundreds of columns. Suppose there were more columns than you wanted to scan through, and you wanted to know whether one of the columns contained a zip code. You could do that by piping the output through grep to look for “zip.”

> head -n 1 20240201_VRDB_Extract.txt | sed 's/|/\n/g' | grep -i zip

There are no column headings containing “zip,” but there are a couple containing “Zip.” The -i option (for case insensitive) finds the zip code columns.

RegZipCode
MailZip

Our modest little one-liner now has three segments separated by pipes. It might look impressive to someone new to working this way, but it’s really just stringing common commands together in a common way.

A famous one-liner

When you see a more complicated one-liner like

tr -cs A-Za-z '
' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q

you can imagine how it grew incrementally. Incidentally, the one-liner above is somewhat famous. You can find the story behind it here.
