Comparing bfloat16 range and precision to other 16-bit numbers


Deep learning has spurred interest in novel floating point formats. Algorithms often don’t need as much precision as standard IEEE-754 doubles or even single precision floats. Lower precision makes it possible to hold more numbers in memory, reducing the time spent swapping numbers in and out of memory. Also, low-precision circuits are far less complex. Together these benefits can give a significant speedup.

Here I want to look at bfloat16, or BF16 for short, and compare it to 16-bit number formats I’ve written about previously, IEEE and posit. BF16 is becoming a de facto standard for deep learning. It is supported by several deep learning accelerators (such as Google’s TPU), and will be supported in Intel processors two generations from now.

Bit layout

The BF16 format is sort of a cross between FP16 and FP32, the 16- and 32-bit formats defined in the IEEE 754-2008 standard, also known as half precision and single precision.

BF16 has 16 bits like FP16, but has the same number of exponent bits as FP32. Each number has 1 sign bit. The rest of the bits in each of the formats are allocated as in the table below.

    |--------+------+----------+----------|
    | Format | Bits | Exponent | Fraction |
    |--------+------+----------+----------|
    | FP32   |   32 |        8 |       23 |
    | FP16   |   16 |        5 |       10 |
    | BF16   |   16 |        8 |        7 |
    |--------+------+----------+----------|

BF16 has as many bits as an FP16, but as many exponent bits as an FP32. The latter makes conversion between BF16 and FP32 simple, except for some edge cases regarding denormalized numbers.
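To illustrate why the conversion is simple, here is a small numpy sketch (mine, not from the original post) that converts an FP32 value to BF16 by truncation. Hardware typically rounds to nearest even rather than truncating, so treat this as illustrative only.

    import numpy as np

    def fp32_to_bf16(x):
        # Keep the sign bit, the 8 exponent bits, and the top 7 fraction bits
        # of an FP32 value by zeroing out the low 16 bits of its bit pattern.
        bits = np.asarray(x, dtype=np.float32).view(np.uint32)
        return (bits & np.uint32(0xFFFF0000)).view(np.float32)

    print(fp32_to_bf16(3.14159265))  # 3.140625, the BF16 value just below pi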

Precision

The epsilon value, the smallest number ε such that 1 + ε > 1 in machine representation, is 2^-f where f is the number of fraction bits. BF16 has much less precision near 1 than the other formats.

    |--------+------------|
    | Format |    Epsilon |
    |--------+------------|
    | FP32   | 0.00000012 |
    | FP16   | 0.00097656 |
    | BF16   | 0.00781250 |
    |--------+------------|

Dynamic range

The dynamic range of bfloat16 is similar to that of an IEEE single precision number. Relative to FP32, BF16 sacrifices precision to retain range. Range is mostly determined by the number of exponent bits, though not entirely.

Dynamic range in decades is the log base 10 of the ratio of the largest to smallest representable positive numbers. The dynamic ranges of the numeric formats are given below. (Python code to calculate dynamic range is given here.)

    |--------+-------|
    | Format | DR    |
    |--------+-------|
    | FP32   | 83.38 |
    | BF16   | 78.57 |
    | FP16   | 12.04 |
    |--------+-------|
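Here is a minimal sketch of my own (not the code linked above) that reproduces the table, assuming subnormals count as representable for all three formats.

    from math import log10

    # (max exponent, min exponent, fraction bits) for each format
    formats = {"FP32": (127, -126, 23), "BF16": (127, -126, 7), "FP16": (15, -14, 10)}

    for name, (emax, emin, p) in formats.items():
        largest = (2 - 2**-p) * 2.0**emax   # largest finite value
        smallest = 2.0**(emin - p)          # smallest positive subnormal
        print(name, round(log10(largest / smallest), 2))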

Comparison to posits

The precision and dynamic range of posit numbers depends on how many bits you allocate to the maximum exponent, denoted es by convention. (Note “maximum.” The number of exponent bits varies for different numbers.) This post explains the anatomy of a posit number.

Posit numbers can achieve more precision and more dynamic range than IEEE-like floating point numbers with the same number of bits. Of course there’s no free lunch. Posits represent large numbers with low precision and small numbers with high precision, but this trade-off is often what you’d want.

For an n-bit posit, the number of fraction bits near 1 is n – 2 – es and so epsilon is 2^(es + 2 – n). The dynamic range is

\log_{10}\left( 2^{2^{\text{\small\emph{es}}} } \right)^{2n-4} = (2n-4)2^{es}\log_{10}2

which is derived here. The dynamic range and epsilon values for 16-bit posits with es ranging from 1 to 4 are given in the table below.

    |----+--------+-----------|
    | es |     DR |   epsilon |
    |----+--------+-----------|
    |  1 |  16.86 | 0.0001221 |
    |  2 |  33.72 | 0.0002441 |
    |  3 |  67.43 | 0.0004883 |
    |  4 | 134.86 | 0.0009766 |
    |----+--------+-----------|

For es up to 3, a 16-bit posit number has a smaller epsilon than either FP16 or BF16; at es = 4 its epsilon matches FP16’s. The dynamic range of a 16-bit posit is larger than that of an FP16 for all values of es, and greater than that of BF16 and FP32 when es = 4.


Continued fraction cryptography

Every rational number can be expanded into a continued fraction with positive integer coefficients. And the process can be reversed: given a sequence of positive integers, you can make them the coefficients in a continued fraction and reduce it to a simple fraction.

In 1952, Arthur Porges published a one-page article [1] pointing out that continued fractions could be used to create a cipher. To encrypt your cleartext, convert it to a list of integers, use them as continued fraction coefficients, and report the resulting fraction. To decrypt, expand the fraction into a continued fraction and convert the coefficients back to text.

We can implement this in Mathematica as follows:

    
    encode[text_] := FromContinuedFraction[ ToCharacterCode[ text ]]
    decode[frac_] := FromCharacterCode[ ContinuedFraction[ frac ]]

So, for example, suppose we want to encrypt “adobe.” If we convert each letter to its ASCII code we get {97, 100, 111, 98, 101}. When we make these numbers coefficients in a continued fraction we get

97 + \cfrac{1}{100 + \cfrac{1}{111 + \cfrac{1}{98 + \cfrac{1}{101}}}}

which reduces to 10661292093 / 109898899.
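For readers who don’t use Mathematica, here is a rough Python equivalent (a sketch of my own) built on fractions.Fraction; the function names simply mirror the Mathematica definitions above.

    from fractions import Fraction

    def encode(text):
        # Use the character codes as continued fraction coefficients
        # and collapse the continued fraction into a single fraction.
        codes = [ord(c) for c in text]
        frac = Fraction(codes[-1])
        for a in reversed(codes[:-1]):
            frac = a + 1/frac
        return frac

    def decode(frac):
        # Expand the fraction back into continued fraction coefficients.
        chars = []
        while True:
            a, rest = divmod(frac, 1)
            chars.append(chr(int(a)))
            if rest == 0:
                break
            frac = 1/rest
        return "".join(chars)

    print(encode("adobe"))                           # 10661292093/109898899
    print(decode(Fraction(10661292093, 109898899)))  # adobe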

This isn’t a secure encryption scheme by any means, but it’s a cute one. It’s more of an encoding scheme, a (non-secret) way to convert a string of numbers into a pair of numbers.


[1] Arthur Porges. A Continued Fraction Cipher. The American Mathematical Monthly, Vol. 59, No. 4 (Apr., 1952), p. 236

How to compute the area of a polygon

If you know the vertices of a polygon, how do you compute its area? It seems like this could be complicated, with special cases for whether the polygon is convex or maybe other considerations. But as long as the polygon is “simple,” i.e. the sides meet at vertices but otherwise do not intersect each other, there is a general formula for the area.

The formula

List the vertices starting anywhere and moving counterclockwise around the polygon: (x1, y1), (x2, y2), …, (xn, yn). Then the area is given by the formula below.

A = \frac{1}{2} \begin{vmatrix} x_1 & x_2 & \ldots & x_n & x_1\\ y_1 & y_2 & \ldots & y_n & y_1 \end{vmatrix}

But what does that mean? The notation is meant to be suggestive of a determinant. It’s not literally a determinant because the matrix isn’t square. But you evaluate it in a way analogous to 2 by 2 determinants: add the terms going down and to the right, and subtract the terms going up and to the right. That is,

x_1 y_2 + x_2 y_3 + \cdots + x_n y_1 - y_1 x_2 - y_2 x_3 - \cdots - y_n x_1

This formula is sometimes called the shoelace formula because the pattern of multiplications resembles lacing a shoe. It’s also called the surveyor’s formula because it’s obviously useful in surveying.

Numerical implementation

As someone pointed out in the comments, in practice you might want to subtract the minimum x value from all the x coordinates and the minimum y value from all the y coordinates before using the formula. Why’s that?

If you add a constant amount to each vertex, you move your polygon but you don’t change the area. So in theory it makes no difference whether you translate the polygon before computing its area. But in floating point, it can make a difference.

The cardinal rule of floating point arithmetic is to avoid subtracting nearly equal numbers. If you subtract two numbers that agree to k significant figures, you can lose up to k figures of precision. We’ll illustrate this with a right triangle whose base is 2 and whose height is π, so its area is π. The area should remain π as we translate the triangle away from the origin, but in floating point we lose precision the further out we translate it.

Here’s a Python implementation of the shoelace formula.

    def area(x, y):
        n = len(x)
        s = 0.0
        for i in range(-1, n-1):
            s += x[i]*y[i+1] - y[i]*x[i+1]
        return 0.5*s

If you’re not familiar with Python, a couple things require explanation. First, range(-1, n-1) runs over the integers from -1 up to, but not including, n-1. Second, the -1 index returns the last element of the array, so the loop wraps around to the first vertex.

Now, watch how the precision of the area degrades as we shift the triangle by powers of 10.

    import numpy as np

    x = np.array([0.0, 0.0, 2.0])
    y = np.array([np.pi, 0.0, 0.0])

    for n in range(0, 10):
        t = 10**n
        print( area(x+t, y+t) )

This produces

    3.141592653589793
    3.1415926535897825
    3.1415926535901235
    3.1415926535846666
    3.141592651605606
    3.1415929794311523
    3.1416015625
    3.140625
    3.0
    0.0

Shifting by small amounts doesn’t make a noticeable difference, but we lose between one and two significant figures each time we multiply t by 10. We only have between 15 and 16 significant figures to start with in a standard floating point number, and so eventually we completely run out of significant figures.

When implementing the shoelace formula, we want to do the opposite of this example: instead of shifting the coordinates away from the origin, where they become large and nearly equal, we want to shift them toward the origin, where the subtractions lose less precision.
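Here is one way to code up that advice, a small sketch (the name stable_area is mine) that reuses the area function defined above.

    import numpy as np

    def stable_area(x, y):
        # Translate the polygon toward the origin before applying the
        # shoelace formula. This doesn't change the area but avoids the
        # loss of precision illustrated above.
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        return area(x - x.min(), y - y.min())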

Related post: The quadratic formula and low-precision arithmetic

Passwords and power laws

According to this paper [1], the empirical distribution of real passwords follows a power law [2]. In the authors’ terms, a Zipf-like distribution. The frequency of the rth most common password is proportional to something like 1/r. More precisely,

f_r = C r^{-s}

where s is on the order of 1. The value of s that best fit the data depended on the set of passwords, but their estimates of s varied from 0.46 to 0.91.

This means that the most common passwords are very common and easy to guess.

Size of password spaces

If passwords come from an alphabet of size A and have length n, then there are A^n possibilities. For example, if a password has length 10 and consists of uppercase and lowercase English letters and digits, there are

62^10 = 839,299,365,868,340,224

possible such passwords. If users chose passwords randomly from this set, brute force password attacks would be impractical. But brute force attacks are practical because passwords are not chosen uniformly from this large space of possibilities, far from it.

Attackers do not randomly try passwords. They start with the most common passwords and work their way down the list. In other words, attackers use Pareto’s rule.

Rules requiring, say, one upper case letter, don’t help much because most users will respond by using exactly one upper case letter, probably the first letter. If passwords must have one special character, most people will use exactly one special character, most likely at the end of the word. Expanding the alphabet size A exponentially increases the possible passwords, but does little to increase the actual number of passwords.

What’s interesting about the power law distribution is that there’s not a dichotomy between naive and sophisticated users. If there were, there would be a lot of common passwords, and all the rest uniformly distributed. Instead, there’s a continuum between the most naive and most sophisticated. That means a lot of people are not exactly naive, but not as secure as they think they are.

Some math

Under the Zipf model [3], the number of times we’d expect to see the most common password is NC where N is the size of the data set. The constant C is what it has to be for the frequencies to sum to 1. That is, C depends on the number of data points N and the exponent s and is given by

C_{N, s} = \frac{1}{\sum_{r=1}^N r^{-s}}

We can bound the sum in the denominator from above and below with integrals, as in the integral test for series convergence. This gives us a way to see more easily how C depends on its parameters.

1 + \int_1^N x^{-s} \, dx > \frac{1}{C} = \sum_{r=1}^N r^{-s} > 1 + \int_2^{N+1} x^{-s}\, dx

When s = 1 this reduces to

1 + \log(N) > \frac{1}{C} > 1 + \log(N+1) - \log(2)

and otherwise to

1 + \frac{N^{1-s} - 1}{1-s} > \frac{1}{C} > 1 + \frac{(N+1)^{1-s} - 2^{1-s}}{1-s}

Suppose you have N = 1,000,000 passwords. The range of s values found by Wang et al varied from roughly 0.5 to 0.9. Let’s first set s = 0.5. Then C is roughly 0.0005. This means the most common password appears about 500 times.

If we keep N the same and set s to 0.9, then C is roughly 0.033, and so the most common password would appear about 33,000 times.
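Here is a quick numerical check of those two estimates under the pure Zipf model (a sketch of my own, not code from the paper).

    import numpy as np

    def zipf_C(N, s):
        # Normalizing constant: C = 1 / sum_{r=1}^{N} r^(-s)
        r = np.arange(1, N + 1)
        return 1.0 / np.sum(r**-s)

    N = 1_000_000
    for s in [0.5, 0.9]:
        C = zipf_C(N, s)
        print(s, C, round(N*C))  # C and the expected count of the most common password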

How to choose passwords

If you need to come up with a password, randomly generated passwords are best, but hard to remember. So people either use weak but memorable passwords, or use strong passwords and don’t try to remember them. The latter varies in sophistication from password management software down to Post-it notes stuck on a monitor.

One compromise is to concatenate a few randomly chosen words. Something like “thebestoftimes” would be weak because it consists of consecutive words from a famous novel. Something like “orangemarbleplungersoap” would be better.
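For example, here’s a sketch of that approach in Python. The word list path is just a common default on Unix-like systems; substitute whatever word list you have.

    import secrets

    # Concatenate four words chosen uniformly at random from a word list.
    with open("/usr/share/dict/words") as f:
        words = [w.strip().lower() for w in f if w.strip().isalpha()]

    print("".join(secrets.choice(words) for _ in range(4)))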

Another compromise, one that takes more effort than most people are willing to expend, is to use Manuel Blum’s mental hash function.


[1] In case the link rots in the future, the paper is “Zipf’s Law in Passwords” by Ding Wang, Haibo Cheng, Ping Wang, Xinyi Huang, and Gaopeng Jian. IEEE Transactions on Information Forensics and Security, vol. 12, no. 11, November 2017.

[2] Nothing follows a power law distribution exactly and everywhere. But that’s OK: nothing exactly follows any other distribution everywhere: not Gaussian, not exponential, etc. But a lot of things, like user passwords, approximately follow a power law, especially over the middle of their range. Power laws like Zipf’s law tend not to fit as well at the beginning and the end.

[3] Here I’m using a pure Zipf model for simplicity. The paper [1] used a Zipf-like model that I’m not using here. Related to the footnote [2] above, it’s often necessary to use a minor variation on the pure Zipf model to get a better fit.

Footnote on fifth root trick

Numberphile has a nice video on the fifth root trick: someone raises a two-digit number to the 5th power, reads the number aloud, and you tell them immediately what the number was.

Here’s the trick in a nutshell. For any number n, n^5 ends in the same last digit as n. You could prove that by brute force or by Euler’s theorem. So when someone tells you n^5, you immediately know the last digit. Now you need to find the first digit, and you can do that by learning, approximately, the powers (10k)^5 for k = 1, 2, 3, …, 9. Then you can determine the first digit by the range.

Here’s where the video is a little vague. It says that you don’t need to know the fifth powers of the multiples of 10 very accurately. This is true, but just how accurately do you need to know the ranges?

If the two-digit number is a multiple of 10, you’ll recognize the zeros at the end, and the last non-zero digit is the first digit of n. For example, if n^5 = 777,600,000 then you know n is a multiple of 10, and since the last non-zero digit is 6, n = 60.

So you need to know the fifth powers of multiples of 10 well enough to distinguish (10k – 1)^5 from (10k + 1)^5. The following table shows what these numbers are.

|---+---------------+---------------|
| k | (10k - 1)^5   | (10k + 1)^5   |
|---+---------------+---------------|
| 1 |        59,049 |       161,051 |
| 2 |     2,476,099 |     4,084,101 |
| 3 |    20,511,149 |    28,629,151 |
| 4 |    90,224,199 |   115,856,201 |
| 5 |   282,475,249 |   345,025,251 |
| 6 |   714,924,299 |   844,596,301 |
| 7 | 1,564,031,349 | 1,804,229,351 |
| 8 | 3,077,056,399 | 3,486,784,401 |
| 9 | 5,584,059,449 | 6,240,321,451 |
|---+---------------+---------------|

So if the fifth power is less than 3 million, the first digit is 1. If it’s between 4 million and 21 million, the first digit is 2. Etc.

You could choose the following boundaries, if you like.

|---+----------------|
| k | upper boundary |
|---+----------------|
| 1 |      1,000,000 |
| 2 |      3,000,000 |
| 3 |     25,000,000 |
| 4 |    100,000,000 |
| 5 |    300,000,000 |
| 6 |    800,000,000 |
| 7 |  1,700,000,000 |
| 8 |  3,200,000,000 |
| 9 |  6,000,000,000 |
|---+----------------|
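Putting the whole trick into code (a sketch of my own, not from the video): the last digit of n comes from the last digit of n^5, and the first digit comes from the eight boundaries above that separate consecutive leading digits. (The 1,000,000 row isn’t needed for that.)

    # Boundaries separating consecutive leading digits, taken from the table above.
    boundaries = [3_000_000, 25_000_000, 100_000_000, 300_000_000,
                  800_000_000, 1_700_000_000, 3_200_000_000, 6_000_000_000]

    def fifth_root(m):
        # Recover a two-digit n from m = n**5.
        if m % 10 == 0:
            # Multiples of 10: the last nonzero digit of m is the first digit of n.
            return 10 * int(str(m).rstrip("0")[-1])
        last_digit = m % 10                  # n**5 ends in the same digit as n
        first_digit = 1 + sum(m >= b for b in boundaries)
        return 10 * first_digit + last_digit

    assert all(fifth_root(n**5) == n for n in range(10, 100))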

The Numberphile video says you should have someone say the number aloud, in words. So as soon as you hear “six billion …”, you know the first digit of n is 9. If you hear “five billion” or “four billion” you know the first digit is 8. If you hear “three billion” then you know to pay attention to the next number, to decide whether the first digit is 7 or 8. Once you hear the first few syllables of the number, you can stop paying attention until you hear the last syllable or two.

A strange sort of product rule

Let u be a real-valued function of n variables, and let v be a vector-valued function of n variables, a function from n variables to a vector of size n. Then we have the following product rule:

D(uv) = v Du + u Dv.

It looks strange that the first term on the right isn’t Du v.

The function uv is a function from n dimensions to n dimensions, so its derivative must be an n by n matrix. So the two terms on the right must be n by n matrices, and they are. But Du v is a 1 by 1 matrix, so it would not make sense on the right side.

Here’s why the product rule above looks strange: the multiplication by u is a scalar product, not a matrix product. Sometimes you can think of real numbers as 1 by 1 matrices and everything works out just fine, but not here. The product uv doesn’t make sense if you think of the output of u as a 1 by 1 matrix. Neither does the product u Dv.

If you think of v as an n by 1 matrix and Du as a 1 by n matrix, everything works. If you think of v and Du as vectors, then v Du is the outer product of the two vectors. You could think of Du as the gradient of u, but be sure you think of it horizontally, i.e. as a 1 by n matrix. And finally, D(uv) and Dv are Jacobian matrices.
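Here’s a quick sanity check of the product rule and the matrix shapes with SymPy, for one particular u and v of my own choosing in two variables.

    import sympy as sp

    x1, x2 = sp.symbols('x1 x2')
    X = sp.Matrix([x1, x2])

    u = x1**2 + sp.sin(x2)             # scalar-valued function
    v = sp.Matrix([x1*x2, x2**3])      # vector-valued function (2 by 1)

    Du = sp.Matrix([[sp.diff(u, x1), sp.diff(u, x2)]])  # 1 by 2: the gradient as a row
    Dv = v.jacobian(X)                                  # 2 by 2 Jacobian
    D_uv = (u*v).jacobian(X)                            # 2 by 2 Jacobian of the product

    # v*Du is a 2 by 2 outer product; u*Dv is a scalar multiple of the Jacobian.
    assert (D_uv - (v*Du + u*Dv)).expand() == sp.zeros(2, 2)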

Update: As Harald points out in the comments, the usual product rule applies if you write the scalar-vector product uv as the matrix product vu, where now we are thinking of u as a 1 by 1 matrix! Now the product rule looks right

D(vu) = Dv u + v Du

but the product vu looks wrong because you always write scalars on the left. But here u isn’t a scalar!

Finite simple group example: PSL(q)

A few days ago I wrote about finite simple groups and said that I might write more about them. This post is a follow-on to that one.

In my earlier post I said that finite simple groups fell into five broad categories, and that three of these were easy to describe. This post will chip away at one of the categories I said was hard to describe briefly, namely groups of Lie type or classical groups. These are finite groups that are analogous to (continuous) Lie groups.

Simplest exponential sum

Today’s exponential sum curve is simply a triangle.

[exponential sum plot for 2018-09-06]

But yesterday’s curve was more complex

[exponential sum plot for 2018-09-05]

and tomorrow’s curve will be more complex as well.

[exponential sum plot for 2018-09-07]

Why is today’s curve so simple? The vertices of the curves are the partial sums of the series

\sum_{n=0}^N \exp\left( 2\pi i \left( \frac{n}{m} + \frac{n^2}{d} + \frac{n^3}{y} \right ) \right )

where m is the month, d is the day, and y is the last two digits of the year. Typically the sums are too complicated to work out explicitly by hand, but today’s sum is fairly simple. We have m = 9, d = 6, and y = 18. The three fractions add to (2n + 3n² + n³)/18. Reduced mod 18, the numerators are

0, 6, 6, 6, 12, 12, 12, 0, 0, 0, 6, 6, 6, 12, 12, 12, 0, 0

The repetition in the terms of the sum leads to the straight lines in the plot. The terms in the exponential sum only take on three values, the three cube roots of 1. These three roots are 1, a, and b where

a = exp(2πi/3) = -1/2 + i√3/2

and

b = exp(-2πi/3) = -1/2 – i√3/2

is the complex conjugate of a.

Using Mathematica we have

    Table[ Exp[2 Pi I (2 n + 3 n^2 + n^3)/18], {n, 0, 17}] 
        /. {Exp[2 Pi I/3] -> a, Exp[-2 Pi I/3] -> b}

which produces

1, a, a, a, b, b, b, 1, 1, 1, a, a, a, b, b, b, 1, 1

When we take the partial sums, we get four points in a straight line because they differ by a:

1, 1 + a, 1 + 2a, 1 + 3a

then three more points in a straight line because the next terms differ by b:

(1 + 3a), (1 + 3a) + b, (1 + 3a) + 2b, (1 + 3a) + 3b

and so forth.
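Here’s a sketch of how you might reproduce today’s plot in Python (this isn’t the code behind the original images).

    import numpy as np
    import matplotlib.pyplot as plt

    m, d, y = 9, 6, 18   # month, day, and last two digits of the year
    n = np.arange(18)    # the numerators repeat with period 18
    terms = np.exp(2j*np.pi*(n/m + n**2/d + n**3/y))
    z = np.concatenate(([0], np.cumsum(terms)))  # partial sums are the vertices

    plt.plot(z.real, z.imag, "-o")
    plt.gca().set_aspect("equal")
    plt.show()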

 

Three ways to sum a divergent series

There’s no direct way to define the sum of an infinite number of terms. Addition takes two arguments, and you can apply the definition repeatedly to define the sum of any finite number of terms. But an infinite sum depends on a theory of convergence. Without a definition of convergence, you have no way to define the value of an infinite sum. And with different definitions of convergence, you can get different values.

In this post I’ll review two ways of assigning a meaning to divergent series that I’ve written about before, then mention a third way.

Asymptotic series

A few months ago I wrote about an asymptotic series solution to the differential equation

y' + y = \frac{1}{x}

You end up with the solution

y = \frac{1}{x} + \frac{1}{x^2} + \frac{2}{x^3} + \cdots + \frac{(n-1)!}{x^n} + \cdots

which diverges for all x. That is, for each x, the partial sums of the series do not get closer to any number that you could call the sum. In fact, the individual terms of the series eventually get bigger and bigger. Surely this is a useless solution, right?

Actually, it is useful if you change your perspective. Instead of holding x fixed and letting n go to infinity, fix a value of n and let x go to infinity. In that sense, the series converges. For fixed n and large x, this gives accurate approximations to the solution of the differential equation.
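One way to see this numerically: if y_N denotes the series truncated after N terms, a little algebra shows that y_N' + y_N - 1/x = -N!/x^(N+1). The sketch below (mine, not from the earlier post) shows this residual shrinking as x grows with N fixed, and blowing up as N grows with x fixed.

    from math import factorial

    def residual(N, x):
        # |y_N' + y_N - 1/x| for the series truncated after N terms
        return factorial(N) / x**(N + 1)

    for x in [5, 10, 20, 40]:
        print("N = 5, x =", x, residual(5, x))      # fixed N, growing x: shrinks

    for N in [5, 10, 20, 40]:
        print("x = 10, N =", N, residual(N, 10.0))  # fixed x, growing N: eventually grows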

Analytic continuation

At the end of a post on Bernoulli numbers I briefly explain the interpretation of the apparently nonsensical equation

1 + 2 + 3 + … = -1/12.

In a nutshell, the Riemann zeta function is defined by a two-step process. First define

\zeta(s) = \sum_{n=1}^\infty \frac{1}{n^s}

for s with real part strictly bigger than 1. Then define the zeta function for the rest of the complex plane (except the point s = 1) by analytic continuation. If the infinite sum for zeta were valid for s = -1, which it is not, then it would equal 1 + 2 + 3 + …

The analytic continuation of the zeta function is defined at -1, and there the function equals -1/12. So to make sense of the sum of the positive integers, interpret the sum as a sort of pun, a funny way to write ζ(-1).
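If you want to see the analytic continuation numerically, mpmath will evaluate the zeta function at -1 (a one-liner, assuming you have mpmath installed).

    from mpmath import zeta

    print(zeta(-1))  # -0.0833333333333333, i.e. -1/12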

p-adic numbers

This is the most radical way to make sense of divergent series: change your number system so that they aren’t divergent!

The sum

1 + 2 + 4 + 8 + …

diverges because the partial sums (1, 3, 7, 15, …) are not getting closer to anything. But you can make the series converge by changing the way you measure distance between numbers. That’s what p-adic numbers do. For any fixed prime number p, define the distance between two numbers as the reciprocal of the largest power of p that divides their difference. That is, numbers are close together if they differ by a large power of p. We can make sense of the sum above in the 2-adic numbers, i.e. the p-adic numbers with p = 2.

The nth partial sum of the series above is 2^n – 1. The 2-adic distance between 2^n – 1 and -1 is 2^-n, which goes to zero, so the series converges to -1.

1 + 2 + 4 + 8 + … = -1.

Note that all the partial sums are the same, whether in the real numbers or the 2-adics, but the two number systems disagree on whether the partial sums converge.
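Here is a small sketch (my own) of the 2-adic absolute value, confirming that the partial sums 2^n - 1 march toward -1 in the 2-adic metric.

    def two_adic_abs(k):
        # 1 over the largest power of 2 dividing k (and 0 for k = 0)
        if k == 0:
            return 0.0
        power = 0
        while k % 2 == 0:
            k //= 2
            power += 1
        return 2.0**-power

    # Partial sums of 1 + 2 + 4 + ... are 2**n - 1.
    for n in range(1, 11):
        partial = 2**n - 1
        print(n, two_adic_abs(partial - (-1)))  # distance to -1 is 2**-n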

If that explanation went by too quickly, here’s a 15-minute video that expands on the same derivation.

Names for extremely large numbers

Names for extremely large numbers are kinda pointless. The purpose of a name is to communicate something, and if the meaning of the name is not widely known, the name doesn’t serve that purpose. It’s even counterproductive if two people have incompatible ideas what the word means. It’s much safer to simply use scientific notation for extremely large numbers.

Obscure or confusing

Everyone knows what a million is. Any larger than that and you may run into trouble. The next large number name, billion, means 10^9 to some people and 10^12 to others.

When writing for those who take one billion to mean 10^9, your audience may or may not know that one trillion is 10^12 according to that convention. You cannot count on people knowing the names of larger numbers: quadrillion, quintillion, sextillion, etc.

To support this last claim, let’s look at the frequency of large number names according to Google’s Ngram viewer. Presumably the less often a word is used, the less likely people are to recognize it.

Here’s a bar chart for the data from 2005, plotted on a logarithmic scale. The chart has a roughly linear slope, which means the frequency of the words drops exponentially. Million is used more than 24,000 times as often as septillion.

Frequency of large number names on log scale

Here’s the raw data.

    |-------------+--------------|
    | Name        |    Frequency |
    |-------------+--------------| 
    | Million     | 0.0096086564 |
    | Billion     | 0.0030243298 |
    | Trillion    | 0.0002595139 |
    | Quadrillion | 0.0000074383 |
    | Quintillion | 0.0000016745 |
    | Sextillion  | 0.0000006676 |
    | Septillion  | 0.0000003970 |
    |-------------+--------------|

Latin prefixes for large numbers

There is a consistency to these names, though they are not well known. Using the convention that these names refer to powers of 1000, the pattern is

[Latin prefix for n] + llion = 1000^(n+1)

So for example, billion is 1000^(2+1) because bi- is the Latin prefix for 2. A trillion is 1000^(3+1) because tri- corresponds to 3, and so forth. A duodecillion is 1000^13 because the Latin prefix duodec- corresponds to 12.

    |-------------------+---------|
    | Name              | Meaning |
    |-------------------+---------|
    | Billion           |   10^09 |
    | Trillion          |   10^12 |
    | Quadrillion       |   10^15 |
    | Quintillion       |   10^18 |
    | Sextillion        |   10^21 |
    | Septillion        |   10^24 |
    | Octillion         |   10^27 |
    | Nonillion         |   10^30 |
    | Decillion         |   10^33 |
    | Undecillion       |   10^36 |
    | Duodecillion      |   10^39 |
    | Tredecillion      |   10^42 |
    | Quattuordecillion |   10^45 |
    | Quindecillion     |   10^48 |
    | Sexdecillion      |   10^51 |
    | Septendecillion   |   10^54 |
    | Octodecillion     |   10^57 |
    | Novemdecillion    |   10^60 |
    | Vigintillion      |   10^63 |
    |-------------------+---------|

Code

Here’s the Mathematica code that was used to create the plot above.

    BarChart[ 
        {0.0096086564, 0.0030243298, 0.0002595139, 0.0000074383, 
         0.0000016745, 0.0000006676, 0.0000003970}, 
         ChartLabels -> {"million", "billion", "trillion", "quadrillion", 
                         "quintillion", "sextillion", "septillion"}, 
         ScalingFunctions -> "Log", 
         ChartStyle -> "DarkRainbow"
    ]

Fraktur symbols in mathematics

When mathematicians run out of symbols, they turn to other alphabets. Most math symbols are Latin or Greek letters, but occasionally you’ll run into Russian or Hebrew letters.

Sometimes math uses a new font rather than a new alphabet, such as Fraktur. This is common in Lie groups when you want to associate related symbols to a Lie group and its Lie algebra. By convention a Lie group is denoted by an ordinary Latin letter and its associated Lie algebra is denoted by the same letter in Fraktur font.

lower case alphabet in Fraktur

LaTeX

To produce Fraktur letters in LaTeX, load the amssymb package and use the command \mathfrak{}.
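For example, a minimal document might look like this.

    \documentclass{article}
    \usepackage{amssymb} % provides \mathfrak
    \begin{document}
    Let $G$ be a Lie group and let $\mathfrak{g}$ denote its Lie algebra;
    for example, the Lie algebra of $SU(2)$ is written $\mathfrak{su}(2)$.
    \end{document}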

Symbols such as \mathfrak{A} are math symbols and can only be used in math mode. They are not intended to be a substitute for setting text in Fraktur font. This is consistent with the semantic distinction in Unicode described below.

Unicode

The Unicode standard tries to distinguish the appearance of a symbol from its semantics, though there are compromises. For example, the Greek letter Ω has Unicode code point U+03A9 but the symbol Ω for electrical resistance in Ohms is U+2126 even though they are rendered the same [1].

The letters a through z, rendered in Fraktur font and used as mathematical symbols, have Unicode values U+1D51E through U+1D537. These values are in the “Supplementary Multilingual Plane” and do not commonly have font support [2].

The corresponding letters A through Z would be U+1D504 through U+1D51D, though interestingly a few letters are missing. The code point U+1D506, which you’d expect to be Fraktur C, is reserved. The spots corresponding to H, I, R, and Z are also reserved. These are reserved because those letters were already encoded in the Letterlike Symbols block (ℭ, ℌ, ℑ, ℜ, ℨ), having been in mathematical use before the Mathematical Alphanumeric Symbols block was added. However, the corresponding bold versions U+1D56C through U+1D585 have no such gaps [3].
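The following Python snippet (my own) prints the Fraktur capitals, pulling the reserved letters from the Letterlike Symbols block instead.

    # Fraktur capitals whose code points live in the Letterlike Symbols block
    letterlike = {"C": "\u212D", "H": "\u210C", "I": "\u2111", "R": "\u211C", "Z": "\u2128"}

    for i, letter in enumerate("ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
        print(letter, letterlike.get(letter, chr(0x1D504 + i)))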

Footnotes

[1] At least they usually are. A font designer could choose to provide different glyphs for the two symbols. I used the same character for both because I thought some readers might not see the Ohm symbol properly rendered.

[2] If you have the necessary fonts installed you should see the alphabet in Fraktur below:
𝔞 𝔟 𝔠 𝔡 𝔢 𝔣 𝔤 𝔥 𝔦 𝔧 𝔨 𝔩 𝔪 𝔫 𝔬 𝔭 𝔮 𝔯 𝔰 𝔱 𝔲 𝔳 𝔴 𝔵 𝔶 𝔷

I can see these symbols from my desktop and from my iPhone, but not from my Android tablet. Same with the symbols below.

[3] Here are the bold upper case and lower case Fraktur letters in Unicode:
𝕬 𝕭 𝕮 𝕯 𝕰 𝕱 𝕲 𝕳 𝕴 𝕵 𝕶 𝕷 𝕸 𝕹 𝕺 𝕻 𝕼 𝕽 𝕾 𝕿 𝖀 𝖁 𝖂 𝖃 𝖄 𝖅
𝖆 𝖇 𝖈 𝖉 𝖊 𝖋 𝖌 𝖍 𝖎 𝖏 𝖐 𝖑 𝖒 𝖓 𝖔 𝖕 𝖖 𝖗 𝖘 𝖙 𝖚 𝖛 𝖜 𝖝 𝖞 𝖟

King of high dimensional space

rook, king, knight

Which has more mobility on an empty chessboard: a rook, a king or a knight?

On an ordinary 8 by 8 chessboard, a rook can move to any one of 14 squares, no matter where it starts. A knight and a king both have 8 possible moves from the middle of the board, fewer moves from an edge.

If we make the board bigger or higher dimensional, what happens to the relative mobility of each piece? Assume we’re playing chess on a hypercube lattice, n points along each edge, in d dimensions. So standard chess corresponds to n = 8 and d = 2.

A rook can move to any square in a coordinate direction, so it can choose one of n-1 squares in each of d dimensions, for a total of (n-1)d possibilities.

A king can move a distance of 0, 1, or -1 in each coordinate. In d dimensions, this gives him 3^d – 1 options. (Minus one for the position he’s initially on: moving 0 in every coordinate isn’t a move!)

What about a knight? There are C(d, 2) ways to choose two coordinate directions. For each pair of directions, there are 8 possibilities: two ways to choose which direction to move two steps in, and then whether to move + or – in each direction. So there are 4d(d – 1) possibilities.

In short:

  • A king’s mobility increases exponentially with dimension and is independent of the size of the board.
  • A rook’s mobility increases linearly with dimension and linearly with the size of the board.
  • A knight’s mobility increases quadratically with dimension and is independent of the size of the board.

The rules above are not the only way to generalize chess rules to higher dimensions. Here’s an interesting alternative way to define a knight’s moves: a knight can move to any lattice point a distance √5 away. In dimensions up to 4, this corresponds to the definition above. But in dimension 5 and higher, there’s a new possibility: moving one step in each of five dimensions. In that case, the number of possible knight moves increases with dimension like d^5.
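Here is a short sketch (my own) that counts moves from the middle of a large board under the distance-√5 definition, alongside the 4d(d − 1) count from the two-steps-plus-one-step definition.

    from math import comb

    def knight_moves_sqrt5(d):
        # Moves of length sqrt(5): either 2 in one coordinate and 1 in another,
        # or 1 in each of five coordinates (possible only when d >= 5).
        two_plus_one = 8 * comb(d, 2)                  # = 4d(d-1)
        five_ones = 2**5 * comb(d, 5) if d >= 5 else 0
        return two_plus_one + five_ones

    for d in range(2, 8):
        print(d, 4*d*(d - 1), knight_moves_sqrt5(d))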

Related post: A knight’s random walk in 3 or 4 dimensions

 

3D chess knight moves

I’ve written before about a knight’s random walk on an ordinary chess board. In this post I’d like to look at the generalization to three dimensions (or more).

three dimensional lattice

So what do we mean by 3D chess? For this post, we’ll have a three dimensional lattice of possible positions, of size 8 by 8 by 8. You could think of this as a set of 8 ordinary chess boards stacked vertically. To generalize a knight’s move to this new situation, we’ll say that a knight move consists of moving 2 steps in one direction and 1 step in an orthogonal direction. For example, he might move up two levels and over one position horizontally.

Suppose our knight walks randomly through the chess lattice. At each point, he evaluates all possible moves and chooses one randomly with all possible moves having equal probability. How long on average will it take our knight to return to where he started?

As described in the post about the two dimensional problem, we can find the average return time using a theorem about Markov chains.

The solution is to view the problem as a random walk on a graph. The vertices of the graph are the squares of a chess board and the edges connect legal knight moves. The general solution for the time to first return is simply 2N/k where N is the number of edges in the graph, and k is the number of edges meeting at the starting point.

The problem reduces to counting N and k. This is tedious in two dimensions, and gets harder in higher dimensions. Rather than go through a combinatorial argument, I’ll show how to compute the result with a little Python code.

To count the number of edges N, we’ll add up the number of edges at each node in our graph, and then divide by 2 since this process will count each edge twice. We will iterate over our lattice, generate all potential moves, and discard those that would move outside the lattice.

    from numpy import all
    from itertools import product, combinations

    def legal(v):
        return all([ 0 <= t < 8 for t in v])

    count = 0
    for position in product(range(8), range(8), range(8)):
        # Choose two directions
        for d in combinations(range(3), 2):
            # Move 1 step in one direction
            # and 2 steps in the other.
            for step in [1, 2]:
                for sign0 in [-1, 1]:
                    for sign1 in [-1, 1]:
                        move = list(position)
                        move[d[0]] += sign0*step
                        move[d[1]] += sign1*(3-step)
                        if legal(move):
                            count += 1
    print(count // 2)

This tells us that there are N = 4,032 edges in our graph of possible moves. The number of starting moves k depends on our starting point. For example, if we start at a corner, then we have 6 possibilities. In this case we should expect our knight to return to his starting point in an average of 2*4032/6 = 1,344 moves.

We can easily modify the code above. To look at different size lattices, we could change all the 8’s above. The function legal would need more work if the lattice were not the same size in each dimension.

We could also look at four dimensional chess by adding one more range to the position loop and changing the combinations to come from range(4) rather than range(3). In case you’re curious, in four dimensions, the graph capturing legal knight moves in an 8 by 8 by 8 by 8 lattice would have 64,512 edges. If our knight started in a corner, he’d have 12 possible starting moves, so we’d expect him to return to his starting position on average after 2*64512/12 = 10,752 moves.
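Here’s one way to write the modification just described, a sketch that generalizes the loop above to any dimension and lattice size and reproduces both edge counts.

    from itertools import product, combinations

    def legal(v, n):
        return all(0 <= t < n for t in v)

    def knight_edges(n, dim):
        # Count edges in the graph of knight moves on an n-by-...-by-n lattice.
        count = 0
        for position in product(range(n), repeat=dim):
            for d in combinations(range(dim), 2):
                for step in [1, 2]:
                    for sign0 in [-1, 1]:
                        for sign1 in [-1, 1]:
                            move = list(position)
                            move[d[0]] += sign0*step
                            move[d[1]] += sign1*(3 - step)
                            if legal(move, n):
                                count += 1
        return count // 2

    print(knight_edges(8, 3))  # 4032
    print(knight_edges(8, 4))  # 64512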

Fibonacci meets Pythagoras

The sum of the squares of consecutive Fibonacci numbers is another Fibonacci number. Specifically we have

F_n^2 + F_{n+1}^2 = F_{2n+1}

and so a right triangle whose legs are consecutive Fibonacci numbers F_n and F_{n+1} has hypotenuse √F_{2n+1}.

The hypotenuse will always be irrational because the only Fibonacci numbers that are squares are 1 and 144, and 144 is the 12th Fibonacci number.
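Here is a quick numerical check of the identity (my own, nothing deep).

    def fib(k):
        a, b = 0, 1
        for _ in range(k):
            a, b = b, a + b
        return a

    # F_n^2 + F_{n+1}^2 = F_{2n+1}
    for n in range(1, 10):
        assert fib(n)**2 + fib(n + 1)**2 == fib(2*n + 1)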

Bounds on the central binomial coefficient

It’s hard to find bounds on binomial coefficients that are both simple and accurate, but in 1990 E. T. H. Wang published upper and lower bounds on the central binomial coefficient that are both, valid for n at least 4.

\frac{2^{2n-1}}{\sqrt{n}} < \binom{2n}{n} < \frac{2^{2n-1/2}}{\sqrt{n}}

Here are a few numerical results to give an idea of the accuracy of the bounds on a log scale. The first column is the argument n, the second is the log of the CBC (central binomial coefficient) minus the log of the lower bound, and the third column is the log of the upper bound minus the log of the CBC.

    |---+--------+--------|
    | n |  lower |  upper |
    |---+--------+--------|
    | 1 | 0.0000 | 0.3465 |
    | 2 | 0.0588 | 0.2876 |
    | 3 | 0.0793 | 0.2672 |
    | 4 | 0.0896 | 0.2569 |
    | 5 | 0.0958 | 0.2507 |
    | 6 | 0.0999 | 0.2466 |
    | 7 | 0.1029 | 0.2436 |
    | 8 | 0.1051 | 0.2414 |
    | 9 | 0.1069 | 0.2396 |
    |---+--------+--------|

And here’s a plot of the same data taking n out further.

Wang's upper and lower bounds on CBC

So the ratio of the upper bound to the CBC and the ratio of the CBC to the lower bound both quickly approach an asymptote, and the lower bound is better.
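Here’s a small sketch (mine) that reproduces the log-scale gaps in the table and shows them leveling off for larger n.

    from math import comb, log

    def gaps(n):
        # Work with logs to avoid overflowing floats for large n.
        log_cbc = log(comb(2*n, n))
        log_lower = (2*n - 1)*log(2) - 0.5*log(n)
        log_upper = (2*n - 0.5)*log(2) - 0.5*log(n)
        return log_cbc - log_lower, log_upper - log_cbc

    for n in [1, 2, 9, 100, 1000]:
        lo, hi = gaps(n)
        print(n, round(lo, 4), round(hi, 4))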