Computing the nth digit of π directly

The following equation, known as the BBP formula [1], will let you compute the nth digit of π directly without having to compute the previous digits.

 \pi = \sum_{n=0}^\infty \frac{1}{16^n} \left( \frac{4}{8n+1} - \frac{2}{8n+4} - \frac{1}{8n+5} - \frac{1}{8n+6}\right)

I’ve seen this claim many times, but I’ve never seen it explained in much detail how this formula lets you extract digits of π.

First of all, this formula lets you calculate the digits of π, not in base 10, but in base 16, i.e. in hexadecimal.

It looks like the BBP formula might let you extract hexadecimal digits. After all, the hexadecimal expansion of π is the set of coefficients an such that

 \pi = \sum_{n=0}^\infty \frac{a_n}{16^n}

where each an is an integer between 0 and 15. But the term multiplying 16n in the BBP formula is not an integer, so it’s not that easy. Fortunately, it’s not that hard either.

Warmup

As a warmup, let’s think how we would find the nth digit of π in base 10. We’d multiply by 10n, throw away the factional part, and look at the last digit. That is, we would calculate

\left\lfloor 10^n \pi \right\rfloor \bmod 10

Another approach would be to multiply by 10n−1, shifting the digit that we’re after to the first digit after the decimal point, then take the integer part of 10 times the fraction part.

\left\lfloor 10 \{10^{n-1} \pi\} \right\rfloor

Here {x} denotes the fractional part of x. This is a little more complicated, but now all the calculation is inside the floor operator.

For example, suppose we wanted to find the 4th digit of π. Multiplying by 103 gives 3141.592… with fractional part 0.592…. Multiplying by 10 gives 5.92… and taking the floor gives us 5.

100th hex digit

Now to find the nth hexadecimal digit we’d do the analogous procedure, replacing 10 with 16. To make things concrete, let’s calculate the 100th hexadecimal digit. We need to calculate

\left\lfloor 16 \left\{\sum_{n=0}^\infty 16^{99-n} \left( \frac{4}{8n+1} - \frac{2}{8n+4} - \frac{1}{8n+5} - \frac{1}{8n+6}\right) \right\} \right\rfloor

We can replace the infinite sum with a sum up to 99 because the remaining terms sum to an amount too small to change our answer. Note that we’re being sloppy here, but this step is justified in this particular example.

Here’s the trick that makes this computation practical: when calculating the fractional part, we can carry out the calculation of the first term mod (8n + 1), and the second part mod (8n + 4), etc. We can use fast modular exponentiation, the same trick that makes, for example, a lot of encryption calculations practical.

Here’s code that evaluates the Nth hexadecimal digit of π by evaluating the expression above.

def hexdigit(N):
    s = 0
    for n in range(N):
        s += 4*pow(16, N-n-1, 8*n + 1) / (8*n + 1)
        s -= 2*pow(16, N-n-1, 8*n + 4) / (8*n + 4)
        s -=   pow(16, N-n-1, 8*n + 5) / (8*n + 5)
        s -=   pow(16, N-n-1, 8*n + 6) / (8*n + 6)
    frac = s - floor(s)
    return floor(16*frac)

Here the three-argument version of pow, introduced into Python a few years ago, carries out modular exponentiation efficiently. That is, pow(b, x, m) calculates bx mod m.

This code correctly calculates that the 100th hex digit of π is C. Note that we did not need 100 hex digits (400 bits) of precision to calculate the 100th hex digit of π. We used standard precision, which is between 15 and 16 bits.

Improved code

We can improve the code above by adding an estimate of the infinite series we ignored.

A more subtle improvement is reducing the sum accumulator variable s mod 1. We only need the fractional part of s, and so by routinely cutting off the integer part we keep s from getting large. This improves accuracy by devoting all the bits in the machine representation of s to the fractional part.

epsilon = np.finfo(float).eps

def term(N, n):
     return 16**(N-1-n) * (4/(8*n + 1) - 2/(8*n+4) - 1/(8*n+5) - 1/(8*n + 6))

def hexdigit(N):
    s = 0
    for n in range(N):
        s += 4*pow(16, N-n-1, 8*n + 1) / (8*n + 1)
        s -= 2*pow(16, N-n-1, 8*n + 4) / (8*n + 4)
        s -=   pow(16, N-n-1, 8*n + 5) / (8*n + 5)
        s -=   pow(16, N-n-1, 8*n + 6) / (8*n + 6)
        s = s%1

    n = N + 1    
    while True:
        t = term(N, n)
        s += t
        n += 1
        if abs(t) < epsilon:
            break
        
    frac = s - floor(s)
    return floor(16*frac)

I checked the output of this code with [1] for N = 106, 107, 108, and 109. The code will fail due to rounding error for some very large N. We could refine the code to push this value of N a little further out, but eventually machine precision will be the limiting factor.

[1] Named after the initials of the authors. Bailey, David H.; Borwein, Peter B.; Plouffe, Simon (1997). On the Rapid Computation of Various Polylogarithmic Constants. Mathematics of Computation. 66 (218) 903–913.

Unicode, Tolkien, and Privacy

When I’m in the mood to write, you can often follow a chain of though in my posts. Recently, a post on LLM tokenization lead to a post on how Unicode characters are tokenized, which led to a post on Unicode surrogates. The latter ended by touching on Unicode’s PUA (Private Use Area), which of course leads to J.R.R. Tolkien and privacy, as we shall see.

To back up a bit, Unicode started as an attempt to one alphabet to rule them all, a coding system large enough to contain all the world’s writing systems. Initially it was thought that 216 = 65,536 symbols would be enough, but that didn’t last long. The Unicode standard now contains nearly 100,000 Chinese characters alone [1], along with myriad other symbols such as graphical drawing elements, mathematical symbols, and emoji.

Private Use Area

Although Unicode is vast, it doesn’t have room to include every symbol anyone might use. Unicode anticipated that contingency and reserved some areas for private use. The most well known example is probably Apple’s use of U+F8FF for their logo . The private use area is also being used for ancient and medieval scripts, as well as for fictional writing systems such as Tolkein’s Tengwar and Cirth.

Because anyone can use the private use area as they wish, there could be conficts. For example, Apple intends U+F8FF to display their logo, but the code point is also used for a Klingon symbol. We’ll see another example of a conflict at the end of this post.

Privacy

Here’s the chain of thought that leads to privacy. When I started thinking about this post, I thought about creating an image of Tengwar writing, and that made me think about font fingerprinting. Browsers let web servers know what fonts you have installed, which serves a benign purpose: a site may want to send you text formatted in a pariticular font if its available but will fall back to a more common font if necessary.

However, the collection of fonts installed on a particular computer may be unique, or at least used in combination with other browser information to uniquely identify a user. Before installing a Tengwar font I thought about how it would be sure to increase the uniqueness of my font fingerprint.

Tolkein’s Tengwar

J. R. R. Tolkein’s Tengwar writing system has been mapped to Unicode range E000 to E07F.

Here’s a sample. No telling what it looks like in your browser:

      

When I view the same text in Tecendil online transcriber I see this:

Not all who wander are lost

But then I open the text in Emacs I see this:

Both are legitimate renderings because nobody owns the private use areas. The characters that Tecendil uses for Tengwar, apparently some font on my laptop uses to display Chinese characters.

I’m especially curious about the last character, U+E000. Tecendil interprets it as the Tengwar symbol tinco but something on my laptop interprets it as the Mozilla mascot.

Mozilla T-rex mascot

[1] It wasn’t so naive to think 16 bits would do, even including Chinese. There may be 100,000 Chinese characters, but a few thousand characters account for nearly all Chinese writing. More on that here. But if you’re going to break the 16-bit barrier anyway, you might as well try to include even the most rare characters.

Unicode surrogates

At the highest level, Unicode is a simply a list of symbols. But when you look closer you find that isn’t entirely true. Some of the symbols are sorta meta symbols. And while a list of symbols is not complicated, this list is adjacent to a lot of complexity.

I’ve explored various rabbit holes of Unicode in previous posts, and this time I’d like to go down the rabbit hole of surrogates. These came up in the recent post on how LLMs tokenize Unicode characters.

Incidentally, every time I hear “surrogates” I think of the Bruce Willis movie by that name.

Historical background

Once upon a time, text was represented in ASCII, and life was simple. Letters, numbers, and a few other symbols all fit into a single eight-bit byte with one bit left over. There wasn’t any need to distinguish ASCII and its encoding because they were synonymous. The nth ASCII character was represented the same way as the number n.

The A in ASCII stands for American, and ASCII was sufficient for American English, sorta. Then people played various tricks with the leftover bit to extend ASCII by adding 128 more symbols. This was enough, for example, to do things like include the tilde on the n in jalapeño, but representing anything like Russian letters, math symbols, or Chinese characters was out of the question.

So along came Unicode. Instead of using one byte to represent characters, Unicode used two bytes. This didn’t double the possibilities; it squared them. Instead of 128 ASCII symbols, or 256 extended ASCII symbols, there was the potential to encode 216 = 65,536 symbols. Surely that would be enough! Except it wasn’t. Soon people wanted even more symbols. Currently Unicode has 154,998 code points.

Encoding

The most obvious way to represent Unicode characters, back when there were fewer than 216 characters, was as a 16-bit number, i.e. to represent each character using two bytes rather than one. The nth Unicode character could be represented the same way as the number n, analogous to the case of ASCII.

But while text could potentially require thousands of symbols, at lot of text can be represented just fine with ASCII. Using two bytes per character would make the text take up twice as much memory. UTF-8 encoding was a very clever solution to this problem. Pure ASCII (i.e. not extended) text is represented exact as in ASCII, taking up no more memory. And if text is nearly all ASCII with occasional non-ASCII characters sprinkled in, the size of the text in memory increases in proportional to the number of non-ASCII characters. Nearly pure ASCII text takes up nearly the same amount of memory.

But the cost of UTF-8 is encoding. The bit representation of the nth character is no longer necessarily the bit representation of the integer n.

UTF-16, while wasteful when representing English prose, is a direct encoding. Or at least it was until Unicode spilled over its initial size limit. You need more than a single pair of bytes to represent more than 216 characters.

Surrogate pairs

The way to represent more than 216 symbols with 16-bit characters is to designate some of the “characters” as meta characters. These special characters represent half of a pair of 16-bit units corresponding to a single Unicode character.

There are 1024 characters reserved as high surrogates (U+D800 through U+DBFF) and 1024 reserved as low surrogates (U+DC00 through U+DFFF). High surrogates contain the high-order bits and low surrogates contain the low-order bits.

Let n > 216 be the code point of a character. Then the last 10 bits of the high surrogate are the highest 10 bits of n, and the last 1o bits of the low surrogate are the lowest 10 bits of n.

In terms of math rather than bit representations, the pair of surrogates (HL) represent code point

216 + 210(H − 55396) + (L − 56320)

where 55396 = D800hex and 56320 = DC00hex.

Example

The rocket emoji has Unicode value U+1F680. The bit pattern representing the emoji is DB3DDE80hex, the combination of high surrogate U+D83D and low surrogate U+DE80. We can verify this with the following Python.

    >>> H, L = 0xD83D, 0xDE80
    >>> 2**16 + 2**10*(H - 0xD800) + L - 0xDC00 == 0x1F680
    True

When we write out the high and low surrogates in binary we can visualize how they contain the high and low bits of the rocket codepoint.

H D83D 1101100000111101 L DE80 1101111010000000 1F680 11111011010000000

High private use surrogates

The high surrogate characters U+D800 through U+DBFF are divided into two ranges. The range U+D800 through U+DB7F are simply called high surrogates, but the remaining characters U+DB80 through U+DBFF are called “high private use surrogates.”

These private use surrogates to not work any differently than the rest. They just correspond to code points in Unicode’s private use area.

Practical consequences of tokenization details

I recently ran across the article Something weird is happening with LLMs and chess. One of the things it mentions is how the a minor variation in a prompt can have a large impact on the ability of an LLM to play chess.

One extremely strange thing I noticed was that if I gave a prompt like “1. e4 e5 2. ” (with a space at the end), the open models would play much, much worse than if I gave a prompt like “1 e4 e5 2.” (without a space) and let the model generate the space itself. Huh?

The author goes on to explain that tokenization probably explains the difference. The intent is to get the LLM to predict the next move, but the extra space confuses the model because it is tokenized differently than the spaces in front of the e’s. The trailing space is tokenized as an individual character, but the spaces in front of the e’s are tokenized with the e’s. I wrote about this a couple days ago in the post The difference between tokens and words.

For example, ChatGPT will tokenize “hello world” as [15339, 1917] and “world hello” as [14957, 24748]. The difference is that the first string is parsed as “hello” and ” world” while the latter is parsed as “world” and ” hello”. Note the spaces attached to the second word in each case.

The previous post was about how ChatGPT tokenizes individual Unicode characters. It mentions UTF-16, which is itself an example of how tokenization matters. The string “UTF-16” it will be represented by three tokens, one each for “UTF”, “-“, and “16”. But the string “UTF16” will be represented by two tokens, one for “UTF” and one for “16”. The string “UTF16” might be more likely to be interpreted as a unit, a Unicode encoding.

ChatGPT tokens and Unicode

I mentioned in the previous post that not every Unicode character corresponds to a token in ChatGPT. Specifically I’m looking at gpt-3.5-turbo in tiktoken. There are 100,256 possible tokens and 155,063 Unicode characters, so the pigeon hole principle says not every character corresponds to a token.

I was curious about the relationship between tokens and Unicode so I looked into it a little further.

Low codes

The Unicode characters U+D800 through U+DFFF all map to a single token, 5809. This is because these are not really characters per se but “surrogates,” code points that are used in pairs to represent other code points [1]. They don’t make sense in isolation.

The character U+FFFD, the replacement character �, also corresponds to 5809. It’s also not a character per se but a way to signal that another character is not valid.

Aside from the surrogates and the replacement characters, every Unicode character in the BMP, characters up to U+FFFF, has a unique representation in tokens. However, most require two or three tokens. For example, the snowman character ☃ is represented by two tokens: [18107, 225].

Note that this discussion is about single characters, not words. As the previous post describes, many words are tokenized as entire words, or broken down into units larger than single characters.

High codes

The rest of the Unicode characters, those outside the BMP, all have unique token representations. Of these, 3,404 are represented by a single token, but the rest require 2, 3, or 4 tokens. The rocket emoji, U+1F680, for example, is represented by three tokens: [9468, 248, 222].

Rocket U+1F680 [9468, 248, 222]

[1] Unicode was originally limited to 16 bits, and UFT-16 represented each character with a 16-bit integer. When Unicode expanded to beyond 216 characters, UTF-16 used pairs of surrogates, one high surrogate and one low surrogate, to represent code points higher than U+FFFF.

The difference between tokens and words

Large language models operate on tokens, not words, though tokens roughly correspond to words.

A list of words would not be practical. There is no definitive list of all English words, much less all words in all languages. Still, tokens correspond roughly to words, while being more flexible.

Words are typically turned into tokens using BPE (byte pair encoding). There are multiple implementations of this algorithm, giving different tokenizations. Here I use the tokenizer gpt-3.5-turbo used in GPT 3.5 and 4.

Hello world!

If we look at the sentence “Hello world!” we see that it turns into three tokens: 9906, 1917, and 0. These correspond to “Hello”, ” world”, and “!”.

In this example, each token corresponds to a word or punctuation mark, but there’s a little more going on. It is true that 0 is simply the token for the exclamation mark—we’ll explain why in a moment—it’s not quite true to say 9906 is the token for “hello” and 1917 is the token for “world”.

Many to one

In fact 1917 is the token for ” world”. Note the leading space. The token 1917 represents the word “world,” not capitalized and not at the beginning of a sentence. At the beginning of a sentence, “World” would be tokenized as 10343. So one word may correspond to several different tokens, depending on how the word is used.

One to many

It’s also true that a word may be broken into several tokens. Consider the sentence “Chuck Mangione plays the flugelhorn.” This sentence turns into 9 tokens, corresponding to

“Chuck”, “Mang”, “ione”, ” plays”, ” fl”, “ug”, “el”, “horn”, “.”

So while there is a token for the common name “Chuck”, there is no token for the less common name “Mangione”. And while there is a single token for ” trumpet” there is no token for the less common “flugelhorn.”

Characters

The tokenizer will break words down as far as necessary to represent them, down to single letters if need be.

Each ASCII character can be represented as a token, as well as many Unicode characters. (There are 100256 total tokens, but currently 154,998 Unicode characters, so not all Unicode characters can be represented as tokens.)

Update: The next post dives into the details of how Unicode characters are handled.

The first 31 ASCII characters are non-printable control characters, and ASCII character 32 is a space. So exclamation point is the first printable, non-space character, with ASCII code 33. The rest of the printable ASCII characters are tokenized as their ASCII value minus 33. So, for example, the letter A, ASCII 65, is tokenized as 65 − 33 = 32.

Tokenizing a dictionary

I ran every line of the american-english word list on my Linux box through the tokenizer, excluding possessives. There are 6,015 words that correspond to a single token, 37,012 that require two tokens, 26,283 that require three tokens, and so on. The maximum was a single word, netzahualcoyotl, that required 8 tokens.

The 6,015 words that correspond to a single token are the most common words in English, and so quite often a token does represent a word. (And maybe a little more, such as whether the word is capitalized.)

A simpler GELU activation function approximation

The GELU (Gaussian Error Linear Units) activation function was proposed in [1]. This function is x Φ(x) where Φ is the CDF of a standard normal random variable. As you might guess, the motivation for the function involves probability. See [1] for details.

The GELU function is not too far from the more familiar ReLU, but it has advantages that we won’t get into here. In this post I wanted to look at approximations to the GELU function.

Since an implementation of Φ is not always available, the authors provide the following approximation:

\text{GELU(x)} \approx 0.5x\left(1 + \tanh\left(\sqrt{\frac{2}{\pi}} (x + 0.044715x^3) \right) \right)

I wrote about a similar but simpler approximation for Φ a while back, and multiplying by x gives the approximation

\text{GELU}(x) \approx 0.5x(1 + \tanh 0.8x)

The approximation in [1] is more accurate, though the difference between the exact values of GELU(x) and those of the simpler approximation are hard to see in a plot.

Since model weights are not usually needed to high precision, the simpler approximation may be indistinguishable in practice from the more accurate approximation.

Related posts

[1] Dan Hendrycks, Kevin Gimpel. Gaussian Error Linear Units (GELUs). Available on arXiv.

Spreading out words in space

A common technique for memorizing numbers is to associate numbers with words. The Major mnemonic system does this by associating consonant sounds with each digit. You form words by inserting vowels as you please.

There are many possible encodings of numbers, but sometimes you want to pick a canonical word for each number, what’s commonly called a peg. Choosing pegs for the numbers 1 through 10, or even 1 through 100, is not difficult. Choosing pegs for a larger set of numbers becomes difficult for a couple reasons. First, it’s hard to think of words to fit some three-digit numbers. Second, you want your pegs to be dissimilar in order to avoid confusion.

Say for example you’ve chosen “syrup” for 049 and you need a peg for 350. You could use “molasses,” but that’s conceptually similar to “syrup.” If you use “Miles Davis” for 350 then there’s no confusion [1].

You could quantify how similar words are using cosine similarity between word embeddings.  A vector embedding associates a high-dimensional vector with each word in such a way that the geometry corresponds roughly with meaning. The famous example is that you might have, at least approximately,

queen = king − man + woman.

This gives you a way to define angles between words that ideally corresponds to conceptual similarity. Similar words would have a small angle between their vectors, while dissimilar words would have larger angles.

If you wanted to write a program to discover pegs for you, say using some corpus like ARPABet, you could have it choose alternatives that spread the words out conceptually. It’s debatable how practical this is, but it’s interesting none the less.

The angles you get would depend on the embedding you use. Here I’ll use the gensim code I used earlier in this post.

The angle between “syrup” and “molasses” is 69° but the angle between “syrup” and “miles” is 84°. The former is larger than I would have expected, but still significantly smaller than the latter. If you were using cosine similarity to suggest mnemonic pegs, hopefully the results would be directionally useful, choosing alternatives that minimize conceptual overlap.

As I said earlier, it’s debatable how useful this is. Mnemonics are very personal. A musician might be fine with using “trumpet” for 143 and “flugelhorn” for 857 because in his mind they’re completely different instruments, but someone else might think they’re too similar. And you might not want to use “Miles Davis” and “trumpet” as separate pegs, even though software will tell you that “miles” and “trumpet” are nearly orthogonal.

Related posts

[1] Here we’re following the convention that only the first three consonants in a word count. This makes it easier to think of pegs.

Duplicating a hand-drawn contour plot

Like the previous post, this post also seeks to approximately reproduce a hand-drawn plot. This time the goal is reproduce figure 7.3 from A&S page 298.

This plot is a visualizing of the function of a complex variable

w(z) = exp(−z²) erfc(− iz)

where erfc is the complementary error function.

A&S calls the graph above an “altitude chart.” This could be a little misleading since it’s the overlay of two plots. One plot is the absolute value of w(z), which could well be called altitude. But it also contains a plot of the phase. To put it another way, if we denote the values of the function in polar form r exp(iθ) the altitude chart is an overlay of a plot of r and a plot of θ.

We begin by defining

f[z_] := Exp[-z^2] Erfc[-I z]

The following code reproduces the lines of constant phase fairly well.

ContourPlot[Arg[f[x + I y]], {x, 0.1, 3.1}, {y, -2.5, 3}, 
    Contours -> 20, ContourShading -> None, AspectRatio -> 1.6]

The lines of constant absolute value take a little more effort to reproduce. If we let Mathematica pick where to put the contour lines, they will not be distributed the same way they were in A&S.

ContourPlot[Abs[f[x + I y]], {x, 0, 3.1}, {y, -2.6, 3}, 
    Contours -> 20, ContourShading -> None, AspectRatio -> 1.6]

We can duplicated the spacing in the original plot by providing Mathematica a list of contour values rather than number of contour values.

ContourPlot[Abs[f[x + I y]], {x, 0, 3.1}, {y, -2.6, 3}, 
    Contours -> {0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 3, 4, 5, 10, 100}, 
    ContourShading -> None, AspectRatio -> 1.6]

(For reasons I don’t understand, Mathematica does not draw the contours corresponding to w = 10 and w = 100.)

When I overlay the phase and absolute value plots with the Show command I get a plot approximately reproducing the original.

Related posts

Reproducing a hand-drawn plot

The plots in old (i.e. pre-computer) math books are impressive. These plots took a lot of effort to produce, and so they weren’t created lightly. Consequently they tend to be aesthetically and mathematically interesting. A few weeks ago I recreated a plot from A&S using Mathematica, and today I’d like to do the same for the plot below from a different source [1].

Here is my approximate reproduction.

I’ll give the mathematical details below for those who are interested.

The plot shows “normalized associated Legendre functions of the first kind.” There are two families of Legendre polynomials, denoted P and Q; we’re interested in the former. “Associated” means polynomials that are derived the Legendre polynomials by taking derivatives. Normalized means the polynomials are multiplied by constants so that their squares integrate to 1 over [−1, 1].

Mathematica has a function LegendreP[n, m, x] that implements the associated Legendre polynomials Pnm(x). I didn’t see that Mathematica has a function for the normalized version of these functions, so I rolled by own.

f[n_, m_, x_] := (-1)^n LegendreP[n, m, x] 
        Sqrt[(2 n + 1) Factorial[n - m]/(2 Factorial[n + m])]

I added the alternating sign term up front after discovering that apparently the original plot used a different convention for defining Pnm than Mathematica uses.

I make my plot by stacking the plots created by

Plot[Table[f[n, n, x], {n, 1, 8}],  {x, 0, 1}]

and

Plot[Table[f[n + 4, n, x], {n, 0, 4}],  {x, 0, 1}]

The original plot shows P4(x). I used the fact that this equals P40(x) to simplify the code. I also left out the flat line plotting P0 because I thought that looked better.

Related post: Duplicating a Hankel function plot from A&S.

[1] Tables Of Functions With Formulae And Curves by Fritz Emde, published in 1945. Available on Archive.org.