My most popular posts on Hacker News

Here are the most popular posts on my site according to the number of points given on Hacker News.

Names for extremely large numbers

Names for extremely large numbers are kinda pointless. The purpose of a name is to communicate something, and if the meaning of the name is not widely known, the name doesn’t serve that purpose. It’s even counterproductive if two people have incompatible ideas of what the word means. It’s much safer to simply use scientific notation for extremely large numbers.

Obscure or confusing

Everyone knows what a million is. Any larger than that and you may run into trouble. The next large number name, billion, means 10^9 to some people and 10^12 to others.

When writing for those who take one billion to mean 10^9, your audience may or may not know that one trillion is 10^12 according to that convention. You cannot count on people knowing the names of larger numbers: quadrillion, quintillion, sextillion, etc.

To support this last claim, let’s look at the frequency of large number names according to Google’s Ngram viewer. Presumably the less often a word is used, the less likely people are to recognize it.

Here’s a bar chart of the data from 2005, plotted on a logarithmic scale. The bar heights decrease roughly linearly on this scale, which means the frequency of the words drops exponentially. Million is used more than 24,000 times as often as septillion.

Frequency of large number names on log scale

Here’s the raw data.

    |-------------+--------------|
    | Name        |    Frequency |
    |-------------+--------------| 
    | Million     | 0.0096086564 |
    | Billion     | 0.0030243298 |
    | Trillion    | 0.0002595139 |
    | Quadrillion | 0.0000074383 |
    | Quintillion | 0.0000016745 |
    | Sextillion  | 0.0000006676 |
    | Septillion  | 0.0000003970 |
    |-------------+--------------|

Latin prefixes for large numbers

There is a consistency to these names, though they are not well known. Using the convention that these names refer to powers of 1000, the pattern is

[Latin prefix for n] + llion = 1000^(n+1)

So for example, billion is 1000^(2+1) because bi- is the Latin prefix for 2. A trillion is 1000^(3+1) because tri- corresponds to 3, and so forth. A duodecillion is 1000^13 because the Latin prefix duodec- corresponds to 12.

    |-------------------+---------|
    | Name              | Meaning |
    |-------------------+---------|
    | Billion           |   10^09 |
    | Trillion          |   10^12 |
    | Quadrillion       |   10^15 |
    | Quintillion       |   10^18 |
    | Sextillion        |   10^21 |
    | Septillion        |   10^24 |
    | Octillion         |   10^27 |
    | Nonillion         |   10^30 |
    | Decillion         |   10^33 |
    | Undecillion       |   10^36 |
    | Duodecillion      |   10^39 |
    | Tredecillion      |   10^42 |
    | Quattuordecillion |   10^45 |
    | Quindecillion     |   10^48 |
    | Sexdecillion      |   10^51 |
    | Septendecillion   |   10^54 |
    | Octodecillion     |   10^57 |
    | Novemdecillion    |   10^60 |
    | Vigintillion      |   10^63 |
    |-------------------+---------|
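
To make the rule concrete, here’s a small Python sketch that spot-checks a few entries of the table; the argument n is the number the Latin prefix stands for.

    def illion(n):
        # Power of ten named by [Latin prefix for n] + "llion"
        return 1000 ** (n + 1)

    assert illion(2) == 10**9      # billion
    assert illion(12) == 10**39    # duodecillion
    assert illion(20) == 10**63    # vigintillion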

Code

Here’s the Mathematica code that was used to create the plot above.

    BarChart[ 
        {0.0096086564, 0.0030243298, 0.0002595139, 0.0000074383, 
         0.0000016745, 0.0000006676, 0.0000003970}, 
         ChartLabels -> {"million", "billion", "trillion", "quadrillion", 
                         "quintillion", "sextillion", "septillion"}, 
         ScalingFunctions -> "Log", 
         ChartStyle -> "DarkRainbow"
    ]

Biased random number generation

Melissa O’Neill has a new post on generating random numbers from a given range. She gives the example of wanting to pick a card from a deck of 52 by first generating a 32-bit random integer, then taking the remainder when dividing by 52. There’s a slight bias because 2^32 is not a multiple of 52.

Since 2^32 = 82595524*52 + 48, there are 82595525 ways to generate the numbers 0 through 47, but only 82595524 ways to generate the numbers 48 through 51. As Melissa points out in her post, the bias here is small, but the bias increases linearly with the size of our “deck.” To clarify, it is the relative bias that increases, not the absolute bias.

Suppose you want to generate a number between 0 and M, where M is less than 2^32 and not a power of 2. There will be 1 + ⌊2^32/M⌋ ways to generate a 0, but ⌊2^32/M⌋ ways to generate M-1. The difference in the probability of generating 0 vs generating M-1 is 1/2^32, independent of M. However, the ratio of the two probabilities is 1 + 1/⌊2^32/M⌋, or approximately 1 + M/2^32.

As M increases, both the favored and unfavored outcomes become increasingly rare, but the ratio of their respective probabilities approaches 2.
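
Here’s a quick Python check of the counts and of both notions of bias, using the card example and the formulas above:

    M = 52
    q, r = divmod(2**32, M)    # q = 82595524, r = 48
    p_high = (q + 1) / 2**32   # probability of each of 0, ..., r-1
    p_low  = q / 2**32         # probability of each of r, ..., M-1

    print(p_high - p_low)      # absolute bias: 1/2^32, independent of M
    print(p_high / p_low)      # relative bias: 1 + 1/q, roughly 1 + M/2^32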

Whether this makes any practical difference depends on your context. In general, the need for random number generator quality increases with the volume of random numbers needed.

Under conventional assumptions, the sample size necessary to detect a difference between two very small probabilities p1 and p2 is approximately

\frac{8(p_1 + p_2)}{(p_1 - p_2)^2}

and so in the card example, we would have to deal roughly 6 × 10^18 cards to detect the bias between one of the more likely cards and one of the less likely cards.
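
Here’s that calculation in Python, plugging the two card probabilities from above into the sample size approximation:

    p1 = 82595525 / 2**32   # one of the more likely cards
    p2 = 82595524 / 2**32   # one of the less likely cards
    n = 8 * (p1 + p2) / (p1 - p2)**2
    print(f"{n:.2e}")       # roughly 6e18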


Fraktur symbols in mathematics

When mathematicians run out of symbols, they turn to other alphabets. Most math symbols are Latin or Greek letters, but occasionally you’ll run into Russian or Hebrew letters.

Sometimes math uses a new font rather than a new alphabet, such as Fraktur. This is common in Lie theory, where you want to associate related symbols with a Lie group and its Lie algebra. By convention a Lie group is denoted by an ordinary Latin letter and its associated Lie algebra is denoted by the same letter in Fraktur font.

lower case alphabet in Fraktur

LaTeX

To produce Fraktur letters in LaTeX, load the amssymb package and use the command \mathfrak{}.

Symbols such as \mathfrak{A} are math symbols and can only be used in math mode. They are not intended to be a substitute for setting text in Fraktur font. This is consistent with the semantic distinction in Unicode described below.
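
Here’s a minimal sketch of a LaTeX document using the command; the choice of SU(2) is just an arbitrary example.

    \documentclass{article}
    \usepackage{amssymb}
    \begin{document}
    Let $G$ be a Lie group with Lie algebra $\mathfrak{g}$,
    for example $G = \mathrm{SU}(2)$ and $\mathfrak{g} = \mathfrak{su}(2)$.
    \end{document}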

Unicode

The Unicode standard tries to distinguish the appearance of a symbol from its semantics, though there are compromises. For example, the Greek letter Ω has Unicode code point U+03A9, but the symbol Ω for electrical resistance in ohms is U+2126, even though they are rendered the same [1].

The letters a through z, rendered in Fraktur font and used as mathematical symbols, have Unicode values U+1D51E through U+1D537. These values are in the “Supplementary Multilingual Plane” and do not commonly have font support [2].

The corresponding letters A through Z are encoded as U+1D504 through U+1D51D, though interestingly a few letters are missing. The code point U+1D506, which you’d expect to be Fraktur C, is reserved. The spots corresponding to H, I, and R are also reserved. These are reserved because the letters were already encoded in the Letterlike Symbols block: ℭ (U+212D), ℌ (U+210C), ℑ (U+2111), and ℜ (U+211C). However, the corresponding bold versions U+1D56C through U+1D585 have no such gaps [3].

Footnotes

[1] At least they usually are. A font designer could choose to provide different glyphs for the two symbols. I used the same character for both because I thought some readers might not see the Ohm symbol properly rendered.

[2] If you have the necessary fonts installed you should see the alphabet in Fraktur below:
𝔞 𝔟 𝔠 𝔡 𝔢 𝔣 𝔤 𝔥 𝔦 𝔧 𝔨 𝔩 𝔪 𝔫 𝔬 𝔭 𝔮 𝔯 𝔰 𝔱 𝔲 𝔳 𝔴 𝔵 𝔶 𝔷

I can see these symbols from my desktop and from my iPhone, but not from my Android tablet. Same with the symbols below.

[3] Here are the bold upper case and lower case Fraktur letters in Unicode:
𝕬 𝕭 𝕮 𝕯 𝕰 𝕱 𝕲 𝕳 𝕴 𝕵 𝕶 𝕷 𝕸 𝕹 𝕺 𝕻 𝕼 𝕽 𝕾 𝕿 𝖀 𝖁 𝖂 𝖃 𝖄 𝖅
𝖆 𝖇 𝖈 𝖉 𝖊 𝖋 𝖌 𝖍 𝖎 𝖏 𝖐 𝖑 𝖒 𝖓 𝖔 𝖕 𝖖 𝖗 𝖘 𝖙 𝖚 𝖛 𝖜 𝖝 𝖞 𝖟

Summarizing homotopy groups of spheres

I don’t understand homotopy groups of spheres, and it’s OK if you don’t either. Nobody fully understands them. This post is really more about information compression than homotopy. That is, I’ll be looking at ways to summarize what is known without being overly concerned about what the results mean.

The task: map two integers to a list of integers

For each positive integer k, and non-negative integer n, the kth homotopy group of the sphere Sⁿ is a finitely generated Abelian group, something that can be described by a finite list of numbers. So we’re looking at simply writing a function that takes two integers as input and returns a list of integers. This function is implemented in an online calculator that lets you look up homotopy groups.

Computing homotopy groups of spheres is far from easy. The first Fields medal given to a topologist was for partial work along these lines. There are still groups that haven’t been computed, and potentially more Fields medals to win. But our task is much more modest: simply to summarize what has been discovered.

This is not going to be too easy, as suggested by the sample of results in the table below.

table of homotopy groups of spheres

This table was taken from the Homotopy Type Theory book, and was in turn based on the Wikipedia article on homotopy groups of spheres.

Output data representation

To give an example of what we’re after, the table says that π₁₃(S²), the 13th homotopy group of the 2-sphere, is Z₁₂ × Z₂. All we need to know is the subscripts on the Z’s, the orders of the cyclic factors, and so our function would take as input (13, 2) and return (12, 2).

The table tells us that π₈(S⁴) = Z₂². This is another way of writing Z₂ × Z₂, and so our function would take (8, 4) as input and return (2, 2).

When I said above that our function would return a list of integers I glossed over one thing: some of the Z’s don’t have a subscript. That is, some of the factors are the group of integers, not the group of integers mod some finite number. So we need to add an extra symbol to indicate a factor with no subscript. I’ll use ∞ because the integers form the infinite cyclic group. For example, our function would take (7, 4) and return (∞, 12). I will also use 1 to denote the trivial group, the group with 1 element.

Some results are unknown, and so we’ll return an empty list for these.

The order of the numbers in the output doesn’t matter, but we will list the numbers in descending order because that appears to be conventional.

Theorems

Some of the values of our function can be filled in by a general theorem, and some will simply be data.

If we call our function f, then there is a theorem that says f(k, n) = (1) if k < n. This accounts for the zeros in the upper right corner of the chart above.

There’s another theorem that says f(n + m, n) is independent of n if n > m + 1. These are the so-called stable homotopy groups.

The rest of the groups are erratic; we can’t do much better than just listing them as data.

(By the way, the corresponding results for homology rather than homotopy are ridiculously simple by comparison. For k > 0, the kth homology group of Sⁿ is isomorphic to the integers if k = n and is trivial otherwise.)
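
To make the shape of the lookup function concrete, here’s a Python sketch. The dictionary of “erratic” values is only a placeholder holding the three examples from this post, and I’ve used the standard fact that πₙ(Sⁿ) is the integers for the diagonal; a real implementation would also encode the stable-range theorem and a much larger table.

    from math import inf

    # a few values copied from the examples above; a real table would be much larger
    KNOWN = {
        (13, 2): (12, 2),    # pi_13(S^2) = Z_12 x Z_2
        (8, 4):  (2, 2),     # pi_8(S^4)  = Z_2 x Z_2
        (7, 4):  (inf, 12),  # pi_7(S^4)  = Z x Z_12
    }

    def f(k, n):
        if k < n:            # theorem: pi_k(S^n) is trivial for k < n
            return (1,)
        if k == n:           # pi_n(S^n) = Z
            return (inf,)
        return KNOWN.get((k, n), ())   # data, or empty tuple if unknown

    print(f(13, 2), f(3, 5), f(4, 4))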

King of high dimensional space

rook, king, knight

Which has more mobility on an empty chessboard: a rook, a king or a knight?

On an ordinary 8 by 8 chessboard, a rook can move to any one of 14 squares, no matter where it starts. A knight and a king both have 8 possible moves from the middle of the board, fewer moves from an edge.

If we make the board bigger or higher dimensional, what happens to the relative mobility of each piece? Assume we’re playing chess on a hypercube lattice, n points along each edge, in d dimensions. So standard chess corresponds to n = 8 and d = 2.

A rook can move to any square in a coordinate direction, so it can choose one of n-1 squares in each of d dimensions, for a total of (n-1)d possibilities.

A king can move a distance of 0, 1, or -1 in each coordinate. In d dimensions, this gives him 3^d – 1 options. (Minus one for the position he’s initially on: moving 0 in every coordinate isn’t a move!)

What about a knight? There are C(d, 2) ways to choose two coordinate directions. For each pair of directions, there are 8 possibilities: two ways to choose which direction to move two steps in, and then whether to move + or – in each direction. So there are 4d(d – 1) possibilities.

In short:

  • A king’s mobility increases exponentially with dimension and is independent of the size of the board.
  • A rook’s mobility increases linearly with dimension and linearly with the size of the board.
  • A knight’s mobility increases quadratically with dimension and is independent of the size of the board.
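
Here’s a short Python sketch checking these counts, brute-forcing the king’s count and using the formulas above for the others. The board size 8 is just an example, and edge effects are ignored.

    from itertools import product

    def rook_moves(n, d):
        # one of n-1 squares in each of d coordinate directions
        return d * (n - 1)

    def king_moves(d):
        # 0, +1, or -1 in each coordinate, minus the all-zero non-move
        return 3**d - 1

    def knight_moves(d):
        # C(d, 2) pairs of axes, times 8 sign/step choices per pair
        return 4 * d * (d - 1)

    def king_moves_brute(d):
        return sum(1 for step in product((-1, 0, 1), repeat=d) if any(step))

    for d in range(2, 6):
        assert king_moves(d) == king_moves_brute(d)
        print(d, rook_moves(8, d), king_moves(d), knight_moves(d))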

The rules above are not the only way to generalize chess rules to higher dimensions. Here’s an interesting alternative way to define a knight’s moves: a knight can move to any lattice point a distance √5 away. In dimensions up to 4, this corresponds to the definition above. But in dimension 5 and higher, there’s a new possibility: moving one step in each of five dimensions. In that case, the number of possible knight moves increases with dimension like d^5.

Related post: A knight’s random walk in 3 or 4 dimensions

 

3D chess knight moves

I’ve written before about a knight’s random walk on an ordinary chess board. In this post I’d like to look at the generalization to three dimensions (or more).

three dimensional lattice

So what do we mean by 3D chess? For this post, we’ll have a three dimensional lattice of possible positions, of size 8 by 8 by 8. You could think of this as a set of 8 ordinary chess boards stacked vertically. To generalize a knight’s move to this new situation, we’ll say that a knight move consists of moving 2 steps in one direction and 1 step in an orthogonal direction. For example, he might move up two levels and over one position horizontally.

Suppose our knight walks randomly through the chess lattice. At each point, he evaluates all possible moves and chooses one randomly with all possible moves having equal probability. How long on average will it take our knight to return to where he started?

As described in the post about the two dimensional problem, we can find the average return time using a theorem about Markov chains.

The solution is to view the problem as a random walk on a graph. The vertices of the graph are the positions in the lattice and the edges connect positions joined by a legal knight move. The general solution for the time to first return is simply 2N/k where N is the number of edges in the graph, and k is the number of edges meeting at the starting point.

The problem reduces to counting N and k. This is tedious in two dimensions, and gets harder in higher dimensions. Rather than go through a combinatorial argument, I’ll show how to compute the result with a little Python code.

To count the number of edges N, we’ll add up the number of edges at each node in our graph, and then divide by 2 since this process will count each edge twice. We will iterate over our lattice, generate all potential moves, and discard those that would move outside the lattice.

from numpy import all
from itertools import product, combinations

def legal(v):
    return all([ 0 <= t < 8 for t in v])

count = 0
for position in product(range(8), range(8), range(8)):
    # Choose two directions
    for d in combinations(range(3), 2):
        # Move 1 step in one direction
        # and 2 steps in the other.
        for step in [1, 2]:
            for sign0 in [-1, 1]:
                for sign1 in [-1, 1]:
                    move = list(position)
                    move[d[0]] += sign0*step
                    move[d[1]] += sign1*(3-step)
                    if legal(move):
                        count += 1
print(count // 2)

This tells us that there are N = 4,032 edges in our graph of possible moves. The number of starting moves k depends on our starting point. For example, if we start at a corner, then we have 6 possibilities. In this case we should expect our knight to return to his starting point in an average of 2*4032/6 = 1,344 moves.

We can easily modify the code above. To look at different size lattices, we could change all the 8’s above. The function legal would need more work if the lattice were not the same size in each dimension.

We could also look at four dimensional chess by adding one more range to the position loop and changing the combinations to come from range(4) rather than range(3). In case you’re curious, in four dimensions, the graph capturing legal knight moves in an 8 by 8 by 8 by 8 lattice would have 64,512 edges. If our knight started in a corner, he’d have 12 possible starting moves, so we’d expect him to return to his starting position on average after 2*64512/12 = 10,752 moves.
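
Here’s one way to package that generalization, a sketch with the lattice size and dimension as parameters. It reproduces the numbers above: 4,032 edges and a corner return time of 1,344 in three dimensions, and 64,512 edges in four.

    from itertools import product, combinations

    def knight_graph(n, d):
        """Number of edges in the knight-move graph on an n^d lattice,
        and the degree of a corner vertex."""
        def moves(position):
            for axes in combinations(range(d), 2):
                for step in (1, 2):
                    for sign0 in (-1, 1):
                        for sign1 in (-1, 1):
                            move = list(position)
                            move[axes[0]] += sign0 * step
                            move[axes[1]] += sign1 * (3 - step)
                            if all(0 <= t < n for t in move):
                                yield tuple(move)

        degree_sum = sum(len(list(moves(p))) for p in product(range(n), repeat=d))
        corner_degree = len(list(moves((0,) * d)))
        return degree_sum // 2, corner_degree

    for d in (3, 4):
        edges, k = knight_graph(8, d)
        print(d, edges, 2 * edges // k)   # dimension, edges, mean return time from a corner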

Math spinoffs

NASA has fueled the development of lots of spinoff technologies: smoke detectors, memory foam, infrared ear thermometers, etc. NASA didn’t pursue these things directly, but they were a useful side effect.

Something analogous happens in mathematics. While pursuing one goal, mathematicians spin off tools that are useful for other purposes. Algebraic topology might be the NASA of pure mathematics. It’s worth taking a course in algebraic topology even if you don’t care about topology, just to see a lot of widely used ideas in their original habitat.

Number theory has developed an enormous amount of technical machinery. If a graduate student in number theory described to you what she’s working on, you probably wouldn’t recognize it as number theory. But as far as I know, not much of the machinery of number theory has been applied to much besides number theory. Maybe there’s an arbitrage opportunity, not to apply number theory per se, but to apply the machinery of number theory.

Attribution based on tail probabilities

If all you know about a person is that he or she is around 5′ 7″, it’s a toss-up whether this person is male or female. If you know someone is over 6′ tall, they’re probably male. If you hear they are over 7′ tall, they’re almost certainly male. This is a consequence of heights having a thin-tailed probability distribution. Thin, medium, and thick tails all behave differently for the attribution problem, as we will show below.

Thin tails: normal

Suppose you have an observation that either came from X or Y, and a priori you believe that both are equally probable. Assume X and Y are both normally distributed with standard deviation 1, but that the mean of X is 0 and the mean of Y is 1. The probability you assign to X and Y after seeing data will change, depending on what you see. The larger the value you observe, the more sure you can be that it came from Y.

Posterior probability of Y with normal distributions

This plot shows the posterior probability that an observation came from Y, given that we know the observation is greater than the value on the horizontal axis.

Suppose I’ve seen the exact value, and it’s 4, but all I tell you is that it’s bigger than 2. Then you would say it probably came from Y. When I go back and tell you that in fact it’s bigger than 3, you would be more sure it came from Y. The more information I give you, the more convinced you are. This isn’t the case with other probability distributions.

Thick tails: Cauchy

Now let’s suppose X and Y have a Cauchy distribution with unit scale, with X having mode 0 and Y having mode 1. The plot below shows how likely it is that our observation came from Y given a lower bound on the observation.

Posterior distribution of Y assuming Cauchy distributions

We are most confident that our data came from Y when we know that our data is greater than 1. But the larger our lower bound is, the further we look out in the tails, the less confident we are! If we know, for example, that our data is at least 5, then we still think that it’s more likely that it came from Y than from X, but we’re not as sure.

As above, suppose I’ve seen the data value but only tell you lower bounds on its value.  Suppose I see a value of 4, but only tell you it’s positive. When I come back and tell you that the value is bigger than 1, your confidence goes up that the sample came from Y. But as I give you more information, telling you that the sample is bigger than 2, then bigger than 3, your confidence that it came from Y goes down, just the opposite of the normal case.

Explanation

What accounts for the very different behavior?

In the normal example, seeing a value of 5 or more from Y is unlikely, but seeing a value so large from X is very unlikely. Both tails are getting smaller as you move to the right, but in relative terms, the tail of X is getting thinner much faster than the tail of Y.

In the Cauchy example, the value of both tails gets smaller as you move to the right, but the relative difference between the two tails is decreasing. Seeing a value greater than 10, say, from Y is unlikely, but it would only be slightly less likely from X.

Medium tails

In between thin tails and thick tails are medium tails. The tails of the Laplace (double exponential) distribution decay exponentially. Exponential tails are often considered the boundary between thin tails and thick tails, or super-exponential and sub-exponential tails.

Suppose you have two Laplace random variables with the same scale, with X centered at 0 and Y centered at 1. What is the posterior probability that a sample came from Y rather than X, given that it’s at least some value z > 1? It’s constant! Specifically, it’s e/(1 + e).
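
Here’s a quick numerical check of all three cases using the survival functions in scipy.stats, with X centered at 0, Y centered at 1, and equal priors:

    from scipy.stats import norm, cauchy, laplace

    def posterior_Y(dist, z):
        # P(observation came from Y | observation > z) with equal priors
        py = dist.sf(z, loc=1)
        px = dist.sf(z, loc=0)
        return py / (px + py)

    for z in [2, 3, 5, 10]:
        print(z,
              posterior_Y(norm, z),      # increases toward 1
              posterior_Y(cauchy, z),    # decreases toward 1/2
              posterior_Y(laplace, z))   # constant e/(1 + e) = 0.731...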

In the normal and Cauchy examples, it didn’t really matter that one distribution was centered at 0 and the other at 1. We’d get the same qualitative behavior no matter what the shift between the two distributions. The limit would tend to 1 for the normal distribution and 1/2 for the Cauchy. The constant value of the posterior probability with the Laplace example depends on the size of the shift between the two.

We’ve assumed that X and Y are equally likely a priori. The limiting value in the normal case does not depend on the prior probabilities of X and Y as long as they’re both positive. The prior probabilities will affect the limiting values for the Cauchy and Laplace cases.

Technical details

For anyone who wants a more precise formulation of the examples above, let B be a non-degenerate Bernoulli random variable and define Z = BX + (1 − B)Y. We’re computing the conditional probability Pr(B = 0 | Z > z) using Bayes’ theorem. If X and Y are normally distributed, the limit of Pr(B = 0 | Z > z) as z goes to infinity is 1. If X and Y are Cauchy distributed, the limit is the unconditional probability Pr(B = 0).

In the normal case, as z goes to infinity, the distribution of B carries no useful information.

In the Cauchy case, as z goes to infinity, the observation z carries no useful information.

Variable-speed learning

When I was in college, one of the professors seemed to lecture at a sort of quadratic pace, maybe even an exponential pace.

He would proceed very slowly at the beginning of the semester, so slowly that you didn’t see how he could possibly cover the course material by the end. But his pace would gradually increase to the point that he was going very quickly at the end. And yet the pace increased so smoothly that you were hardly aware of it. By understanding the first material thoroughly, you were able to go through the latter material quickly.

If you’ve got 15 weeks to cover 15 chapters, don’t assume the optimal pace is to cover one chapter every week.

I often read technical books the way the professor mentioned above lectured. The density of completely new ideas typically decreases as a book progresses. If your reading pace is proportional to the density of new ideas, you’ll start slow and speed up.

The preface may be the most important part of the book. For some books I’ve read only the preface and felt like I got a lot out of the book.

The last couple chapters of technical books can often be ignored. It’s common for authors to squeeze in something about their research at the end of a book, even if it’s out of character with the rest of the book.

Unit speed curve parameterization

A friend asked me a question the other day that came out of a graphics application. He needed to trace out an ellipse in such a way that the length of curve traced out each second was constant. For a circle, the problem is simple: (cos(t), sin(t)) will trace out a circle covering a constant amount of arc length per unit time. The analogous parameterization for an ellipse, (a cos(t), b sin(t)), will not: it moves slowest near the ends of the longer axis and fastest near the ends of the shorter one.

There’s a general solution to the problem for any curve. Given a parameterization p(t), where p is vector-valued, the length covered from time 0 to time t is

s(t) = \int_0^t || p'(\tau) || \,d\tau

If you change the time parameterization by inverting this function, solving for t as a function of s, then the total length of curve traversed by p(t(s)) up to time s is s. This is called either the unit speed parameterization or parameterization by arc length.

The hard part is inverting s(t). If you had to find a unit speed parameterization in a calculus class, the problem would be carefully designed so the function s(t) is easy to invert. Real applications don’t usually work out so easily.

Digression on elliptic integrals and elliptic functions

My response to my friend’s question was that there probably is a closed-form unit speed parameterization, but it would probably involve elliptic functions. He didn’t need much resolution, and decided to do something ad hoc.

Starting with the parameterization  p(t) = (a cos(t), b sin(t)), the arc length s(t) is given by a special function known as an “incomplete elliptic integral of the second kind.” I remembered that the Jacobi elliptic functions were related to the inverses of elliptic integrals, so my hunch was that you could make a unit speed parameterization using Jacobi elliptic functions. Maybe you can, but it’s not as easy as I thought because the Jacobi functions are related to the inverses of elliptic integrals of the first kind.

Elliptic integrals are so named because they were first identified by computing arc length for a (portion of) an ellipse. Elliptic functions were discovered by inverting elliptic integrals, but not the same class of elliptic integrals that give the arc length of an ellipse. (There may well be a transformation that makes these more directly related, but I’m not aware of it.)

Incidentally, elliptic curves are related to elliptic functions, but they are not ellipses! There is a connection from ellipses to elliptic curves, but it’s historical and indirect.

What if we had a more general curve than an ellipse, say something parameterized by cubic splines? Cubic splines are piecewise cubic polynomials, patched together in such a way that the first and second derivatives are continuous across the patches. We can find the length of a spline by finding the length of each polynomial patch.

If p(t) is the parameterization of a curve in 2 or 3 dimensions (or really any number of dimensions) and each component of p is a cubic polynomial, then each component of the derivative of p is a quadratic polynomial, and so the sum of the squares of the components is a fourth degree polynomial. So finding the arc length involves integrating the square root of a fourth degree polynomial. This makes it an elliptic integral!

Unfortunately, knowing that the arc length of a cubic spline corresponds to an elliptic integral is not so useful because it could be any type of elliptic integral, depending on its parameters. You’d have to do some work first to put it into a form where you could call on elliptic integrals to finish your problem.

Numerically computing arc length and unit speed parameterization

The elliptic integral path is something of a dead end. It could still be useful if you needed high accuracy, or if you had some restriction on the class of curves you’re interested in. But in general, you’d need to use numerical integration to find the arc length.

You could also find unit-speed parameterizations numerically, using root-finding to invert s(t) at particular points. Since s is an increasing function of t, you could use a bisection method, which is not the most efficient approach but very robust.

It takes a fair amount of computation to carry root finding where each function evaluation requires computing a numerical integral. But this would work, and depending on your context it could be efficient enough.
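
Here’s a sketch of that numerical approach for the ellipse, using scipy for the integration and the root-finding. The semi-axes 3 and 1 are arbitrary, and Brent’s method stands in for plain bisection.

    import numpy as np
    from scipy.integrate import quad
    from scipy.optimize import brentq

    a, b = 3.0, 1.0   # arbitrary example semi-axes

    def speed(t):
        # || p'(t) || for p(t) = (a cos t, b sin t)
        return np.hypot(a * np.sin(t), b * np.cos(t))

    def arc_length(t):
        return quad(speed, 0, t)[0]

    def t_of_s(s):
        # invert s(t); this works because arc length is increasing in t
        return brentq(lambda t: arc_length(t) - s, 0, 2 * np.pi)

    total = arc_length(2 * np.pi)
    for s in np.linspace(0, total, 10)[1:-1]:
        t = t_of_s(s)
        print(a * np.cos(t), b * np.sin(t))   # points equally spaced in arc length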

If you needed more efficiency, say for a real-time embedded system, you could take a different approach. Your spline is probably an approximation to something else, and so your solution only needs to be as accurate as the spline approximation. This gives you the wiggle room to do something more efficient. You might change your parameterization slightly to make the arc length calculations easier, or come up with a way to approximately invert the arc length function, something that takes advantage of your particular problem and error tolerance.


Number of digits in Catalan numbers

The Catalan numbers Cₙ are closely related to the central binomial coefficients I wrote about recently:

C_n = \frac{1}{n+1} {2n \choose n}

In this post I’ll write C(n) rather than Cₙ because that will be easier to read later on.

Catalan numbers come up in all kinds of applications. For example, the number of ways to parenthesize an expression with n terms is the nth Catalan number C(n).

Number of digits

Here’s a strange theorem regarding Catalan numbers that I found in Catalan Numbers with Applications by Thomas Koshy:

The number of digits in C(10^n) … converges to the number formed by the digits on the right side of the decimal point in log₁₀ 4 = 0.60205999132…

Let’s see what that means. C(10) equals 16,796 which of course has 5 digits.

Next, C(100) equals

896,519,947,090,131,496,687,170,070,074,100,632,420,837,521,538,745,909,320

which has 57 digits. These numbers are getting really big really fast, so I’ll give a table of just the number of digits of a few more examples rather than listing the Catalan numbers themselves.

    |---+------------------|
    | n | # C(10^n) digits |
    |---+------------------|
    | 1 |                5 |
    | 2 |               57 |
    | 3 |              598 |
    | 4 |             6015 |
    | 5 |            60199 |
    | 6 |           602051 |
    | 7 |          6020590 |
    | 8 |         60205987 |
    |---+------------------|

I stopped at n = 8 because my computer locked up when I tried to compute C(10^9). I was computing these numbers with the following Mathematica code:

    Table[
        IntegerLength[
            CatalanNumber[10^n]
        ], 
        {n, 1, 8}
    ]

Computing CatalanNumber[10^9] was too much for Mathematica. So how might we extend the table above?

Numerical computing

We don’t need to know the Catalan numbers themselves, only how many digits they have. And we can compute the number of digits from an approximate value of their logarithm. Taking logarithms also helps us avoid overflow.

We’ll write Python below to determine the number of digits, in part to show that we don’t need any special capability of something like Mathematica.

We need three facts before we write the code:

  1. The number of decimal digits in a number x is 1 + ⌊log₁₀ x⌋ where ⌊y⌋ is the floor of y, the greatest integer not greater than y.
  2. n! = Γ(n + 1)
  3. log₁₀(x) = log(x) / log(10)

Note that when I write “log” I always mean natural log. More on that here.

This code can compute the number of digits of C(10^n) quickly for large n.

    from numpy import log, floor
    from scipy.special import gammaln
    
    def log_catalan(n):
        return gammaln(2*n+1) - 2*gammaln(n+1) - log(n+1)
            
    def catalan_digits(n):
        return 1 + floor( log_catalan(n)/log(10) )
    
    for n in range(1, 14):
        print( n, catalan_digits(10.**n) )

The code doesn’t run into trouble until n = 306, in which case gammaln overflows. (Note the dot after 10 in the last line. Without the dot Python computes 10**n as an integer, and that has problems when n = 19.)

Proof

How would you go about proving the theorem above? We want to show that the number of digits in C(10^n), divided by 10^n, converges to log₁₀ 4, i.e.

\lim_{n\to\infty} \frac{1 + \lfloor \log_{10}C(10^n)\rfloor}{10^n} = \log_{10} 4

We can switch to natural logs by multiplying both sides by log(10). Also, in the limit we don’t need the 1 or the floor in the numerator. So it suffices to prove

\lim_{n\to\infty} \frac{\log C(10^n)}{10^n} = \log 4

Now we see this has nothing to do with base 10. We only need to prove

\lim_{n\to\infty} \frac{\log C(n)}{n} = \log 4

and that is a simple exercise using Stirling’s approximation, which shows C(n) ∼ 4ⁿ/(n^(3/2)√π), and so log C(n)/n → log 4.

Other bases

Our proof showed that this theorem ultimately doesn’t have anything to do with base 10. If we did everything in another base, we’d get analogous results.

To give a taste of that, let’s work in base 7. If we look at the Catalan numbers C(7^n) and count how many “digits” their base 7 representations have, we get the same pattern as the base 7 representation of log₇ 4.

Note that the base appears in four places:

  1. which Catalan numbers we’re looking at,
  2. which base we express the Catalan numbers in,
  3. which base we take the log of 4 in, and
  4. which base we express that result in.

If you forget one of these, as I did at first, you won’t get the right result!

Here’s a little Mathematica code to do an example with n = 8.

    BaseForm[
        1 + Floor[ 
            Log[7, CatalanNumber[7^8]]
            ], 
        7
    ]

This returns 46623341₇ and the code

    BaseForm[Log[7, 4.], 7]

returns 0.4662336₇.

Fibonacci meets Pythagoras

The sum of the squares of consecutive Fibonacci numbers is another Fibonacci number. Specifically we have

F_n^2 + F_{n+1}^2 = F_{2n+1}

and so we have a right triangle with legs Fₙ and Fₙ₊₁ and hypotenuse √F₂ₙ₊₁.

The hypotenuse will always be irrational because the only Fibonacci numbers that are squares are 1 and 144, and 144 is the 12th Fibonacci number.
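
Here’s a quick numerical check of the identity for small n:

    def fib(n):
        a, b = 0, 1
        for _ in range(n):
            a, b = b, a + b
        return a

    # F_n^2 + F_{n+1}^2 = F_{2n+1}
    for n in range(1, 15):
        assert fib(n)**2 + fib(n + 1)**2 == fib(2*n + 1)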

Bounds on the central binomial coefficient

It’s hard to find bounds on binomial coefficients that are both simple and accurate, but in 1990 E. T. H. Wang gave upper and lower bounds on the central binomial coefficient that are both, valid for n at least 4.

\frac{2^{2n-1}}{\sqrt{n}} < \binom{2n}{n} < \frac{2^{2n-1/2}}{\sqrt{n}}

Here are a few numerical results to give an idea of the accuracy of the bounds on a log scale. The first column is the argument n. The second is the log of the CBC (central binomial coefficient) minus the log of the lower bound. The third column is the log of the upper bound minus the log of the CBC.

    |---+--------+--------|
    | n |  lower |  upper |
    |---+--------+--------|
    | 1 | 0.0000 | 0.3465 |
    | 2 | 0.0588 | 0.2876 |
    | 3 | 0.0793 | 0.2672 |
    | 4 | 0.0896 | 0.2569 |
    | 5 | 0.0958 | 0.2507 |
    | 6 | 0.0999 | 0.2466 |
    | 7 | 0.1029 | 0.2436 |
    | 8 | 0.1051 | 0.2414 |
    | 9 | 0.1069 | 0.2396 |
    |---+--------+--------|
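
The table can be reproduced with a few lines of Python, computing the log of the central binomial coefficient via gammaln to avoid overflow:

    import numpy as np
    from scipy.special import gammaln

    def log_cbc(n):
        # log of the central binomial coefficient C(2n, n)
        return gammaln(2*n + 1) - 2*gammaln(n + 1)

    for n in range(1, 10):
        log_lower = (2*n - 1) * np.log(2) - 0.5 * np.log(n)
        log_upper = (2*n - 0.5) * np.log(2) - 0.5 * np.log(n)
        print(n, round(log_cbc(n) - log_lower, 4), round(log_upper - log_cbc(n), 4))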

And here’s a plot of the same data taking n out further.

Wang's upper and lower bounds on CBC

So the ratio of the upper bound to the CBC and the ratio of the CBC to the lower bound both quickly approach an asymptote, and the lower bound is better.

Redoing division algebra graphs with angular distance

The blog post that kicked off the recent series of posts looked at how far apart xy and yx are for quaternions. There I used the Euclidean distance, i.e. || xy – yx ||. This time I’ll look at the angle between xy and yx, and I’ll make some analogous graphs for octonions.

For vectors x and y in three dimensions, the dot product satisfies

x \cdot y = ||x|| \, ||y|| \cos \theta

where θ is the angle between the two vectors. In higher dimensions, we can turn this theorem around and use it as the definition of the angle between two vectors based on their dot product.

(The plots below are jagged because they’re based on random sampling.)

Here’s the Euclidean distance between xy and yx for quaternions from the earlier post:

histogram

And here’s the corresponding angular distance:

The range has changed from [0, 2] to [0, π], and the distribution now shifts left instead of right.

Here’s a similar graph, looking at the angular distance between xy and yx for octonions, something I haven’t plotted before in Euclidean distance.

This graph is more symmetric, which we might expect: since octonions have less algebraic structure than quaternions, we might expect the relationship between xy and yx to behave more erratically, and for the plot to look more like a normal distribution.

Finally, let’s revisit the distance between (xy)z and x(yz) for octonions. Here is the distribution of the Euclidean distance from a previous post:

histogram of octonion associator norm values

And here is the corresponding histogram based on angular distance.

These plots are based on uniform random samples of quaternions and octonions of length 1, i.e. points from the unit spheres in 4 and 8 dimensions respectively. Quaternions and octonions have the property that the product of unit length vectors is another unit length vector, and the angle between two unit vectors is the inverse cosine of their dot product.
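
For the quaternion case, here’s a sketch of the sampling procedure: draw unit vectors in four dimensions from normalized Gaussian samples, multiply with the Hamilton product, and take the arccosine of the dot product of xy and yx.

    import numpy as np

    def qmult(p, q):
        # Hamilton product of quaternions represented as (w, x, y, z)
        w1, x1, y1, z1 = p
        w2, x2, y2, z2 = q
        return np.array([
            w1*w2 - x1*x2 - y1*y2 - z1*z2,
            w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2,
        ])

    def random_unit(dim, rng):
        v = rng.standard_normal(dim)
        return v / np.linalg.norm(v)

    rng = np.random.default_rng()
    angles = []
    for _ in range(10_000):
        x, y = random_unit(4, rng), random_unit(4, rng)
        xy, yx = qmult(x, y), qmult(y, x)
        # both products are unit quaternions, so the angle is arccos of the dot product
        angles.append(np.arccos(np.clip(xy @ yx, -1.0, 1.0)))

    print(np.mean(angles))   # a histogram of `angles` approximates the plot above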

I thought that sedenions also had this norm property, that the product of unit length vectors has unit length. Apparently not, as I discovered by trying to take the inverse cosine of a number larger than 1. So what is the distribution of lengths that come from multiplying two sedenions of length 1? Apparently the mean is near 1, and here’s a histogram.