Example using a generating function to find asymptotic behavior

The previous post looked at how to compute Q(n), the number of permutations of 1, 2, 3, …, n + 1 that contain no consecutive integers. We found a way to numerically compute Q(n) but no analytic expression that would let us compute asymptotics.

The sequence Q(n) is sequence A000255 in OEIS, and OEIS gives the exponential generating function of Q:

\sum_{n=0}^\infty \frac{Q(n)}{n!} x^n = \frac{\exp(-x)}{(1-x)^2}

We can use this function and Theorem 5.2.1 from [1] to get the asymptotic form of Q(n). According to that theorem, the coefficient of x^n in our generating function is asymptotically the same as the coefficient of x^n in the principal part at the singularity at x = 1. This principal part is

\frac{1}{e(1-x)^2} + \frac{1}{e(1-x)} = \frac{1}{e}\left(2 + 3x + 4x^2 + 5x^3 + \cdots\right)

and so the coefficient of x^n is (n + 2)/e.

So

Q(n) \sim \frac{n + 2}{e} n!

and Q(n) / (n + 1)! approaches 1/e for large n, as conjectured in the previous post.
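As a quick numerical check, here is a sketch that compares Q(n), computed from the recurrence in the previous post, to (n + 2) n!/e. The ratio tends to 1.

from math import e, factorial

def Q(n):
    # recurrence from the previous post: Q(n) = n Q(n-1) + (n-1) Q(n-2)
    q_prev, q = 1, 1               # Q(0), Q(1)
    for k in range(2, n + 1):
        q_prev, q = q, k*q + (k - 1)*q_prev
    return q

for n in [5, 10, 20]:
    print(n, Q(n) / ((n + 2) * factorial(n) / e))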

[1] Herbert S. Wilf. Generatingfunctionology. Second edition.

Permutations with no consecutive elements

I was working on something this week that made me wonder how many permutations break up all consecutive elements. For example, if we’re looking at permutations of abcde then acebd counts, but acdbe does not. I’d like to count the number of such permutations, and estimate for large N the number of permutations of N elements with no consecutive terms.

A brief search led to OEIS sequence A000255, which gives a recursive formula for the answer. Before looking at that, let's define some terms and do some brute force calculation to get a feel for the problem.

Brute force calculation

Given a positive integer n, define Q(n) to be the number of permutations p of 1, 2, 3, …, n + 1 such that for no k does p(k + 1) = p(k) + 1.

from itertools import permutations

def no_consec(perm):
    # return True if no element is immediately followed by its successor
    for k in range(len(perm) - 1):
        if perm[k+1] == perm[k] + 1:
            return False
    return True

def Q(n):
    # count permutations of 0, 1, ..., n with no consecutive elements
    count = 0
    for p in permutations(range(n+1)):
        if no_consec(p):
            count += 1
    return count

We could use this code to compute Q(n) for moderate-sized n, but the time required to compute Q(n) this way is proportional to n! and so that won’t scale well.

Recurrence relation

OEIS sequence A000255 gives the values of what we’re calling Q(n), and the first few values match the output of the code above. But OEIS does better, giving a recursive solution for Q:

Q(n) = n Q(n − 1) + (n − 1) Q(n − 2)

for n > 1 with initial conditions Q(0) = Q(1) = 1.

As with Fibonacci numbers, you could implement this easily as a recursive function

def Q2(n):
    if n in [0, 1]:
        return 1
    return n*Q2(n-1) + (n-1)*Q2(n-2)

but it is much faster to implement iteratively.

def Q3(n):
    q = [1]*(n+2)
    for k in range(2, n+1):
        q[k] = k*q[k-1] + (k-1)*q[k-2]
    return q[n]

You could make the recursive implementation faster with memoization, using lru_cache from functools. However, memoization does not prevent recursion; it just speeds up recursion using a cache, and you can still get an error saying maximum recursion depth exceeded.
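For example, a memoized version might look like the following sketch (the name Q2m is mine). It is fast, but still recursive, so a large enough argument can exceed Python's recursion limit.

from functools import lru_cache

@lru_cache(maxsize=None)
def Q2m(n):
    # same recurrence, but each value is computed only once
    if n in [0, 1]:
        return 1
    return n*Q2m(n-1) + (n-1)*Q2m(n-2)

print(Q2m(30))    # fast
# Q2m(100_000) would still raise RecursionError unless the limit is raised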

Asymptotics

We can calculate Q(n) for large n, but we don’t have an analytic expression that will let us find the asymptotic behavior. I intended to work that out in my next post.

Empirically, it seems Q(n) approaches (n + 1)!/e as n goes to infinity, but I don’t have a proof of that at the time of writing. If this is correct, this means that in the limit the probability of a permutation having no fixed point equals the probability of it having no consecutive values.

Update: A proof using the generating function of Q is given in the post above.

Computing the nth digit of π directly

The following equation, known as the BBP formula [1], will let you compute the nth digit of π directly without having to compute the previous digits.

 \pi = \sum_{n=0}^\infty \frac{1}{16^n} \left( \frac{4}{8n+1} - \frac{2}{8n+4} - \frac{1}{8n+5} - \frac{1}{8n+6}\right)

I’ve seen this claim many times, but I’ve never seen it explained in much detail how this formula lets you extract digits of π.

First of all, this formula lets you calculate the digits of π, not in base 10, but in base 16, i.e. in hexadecimal.

It looks like the BBP formula might let you extract hexadecimal digits. After all, the hexadecimal expansion of π is the sequence of coefficients a_n such that

 \pi = \sum_{n=0}^\infty \frac{a_n}{16^n}

where each a_n is an integer between 0 and 15. But the term multiplying 1/16^n in the BBP formula is not an integer, so it's not that easy. Fortunately, it's not that hard either.

Warmup

As a warmup, let's think about how we would find the nth digit of π in base 10. We'd multiply by 10^n, throw away the fractional part, and look at the last digit. That is, we would calculate

\left\lfloor 10^n \pi \right\rfloor \bmod 10

Another approach would be to multiply by 10^(n−1), shifting the digit that we're after to the first position after the decimal point, then take the integer part of 10 times the fractional part.

\left\lfloor 10 \{10^{n-1} \pi\} \right\rfloor

Here {x} denotes the fractional part of x. This is a little more complicated, but now all the calculation is inside the floor operator.

For example, suppose we wanted to find the 4th digit of π. Multiplying by 103 gives 3141.592… with fractional part 0.592…. Multiplying by 10 gives 5.92… and taking the floor gives us 5.
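Here is a sketch of the second approach in Python. It works as long as the digit we want is well within the precision of a floating point representation of π.

from math import pi, floor

def nth_decimal_digit(x, n):
    # shift the digit we want to just after the decimal point,
    # then take the first digit of the fractional part
    frac = (10**(n - 1) * x) % 1
    return floor(10 * frac)

print(nth_decimal_digit(pi, 4))   # 5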

100th hex digit

Now to find the nth hexadecimal digit we’d do the analogous procedure, replacing 10 with 16. To make things concrete, let’s calculate the 100th hexadecimal digit. We need to calculate

\left\lfloor 16 \left\{\sum_{n=0}^\infty 16^{99-n} \left( \frac{4}{8n+1} - \frac{2}{8n+4} - \frac{1}{8n+5} - \frac{1}{8n+6}\right) \right\} \right\rfloor

We can replace the infinite sum with a sum up to 99 because the remaining terms sum to an amount too small to change our answer. Note that we’re being sloppy here, but this step is justified in this particular example.

Here’s the trick that makes this computation practical: when calculating the fractional part, we can carry out the calculation of the first term mod (8n + 1), and the second part mod (8n + 4), etc. We can use fast modular exponentiation, the same trick that makes, for example, a lot of encryption calculations practical.

Here’s code that evaluates the Nth hexadecimal digit of π by evaluating the expression above.

from math import floor

def hexdigit(N):
    # Nth hexadecimal digit of pi via the BBP formula
    s = 0
    for n in range(N):
        s += 4*pow(16, N-n-1, 8*n + 1) / (8*n + 1)
        s -= 2*pow(16, N-n-1, 8*n + 4) / (8*n + 4)
        s -=   pow(16, N-n-1, 8*n + 5) / (8*n + 5)
        s -=   pow(16, N-n-1, 8*n + 6) / (8*n + 6)
    frac = s - floor(s)
    return floor(16*frac)

Here the three-argument version of pow carries out modular exponentiation efficiently. That is, pow(b, x, m) calculates b^x mod m.

This code correctly calculates that the 100th hex digit of π is C. Note that we did not need 100 hex digits (400 bits) of precision to calculate the 100th hex digit of π. We used ordinary floating point arithmetic, which carries only 15 to 16 significant decimal figures.

Improved code

We can improve the code above by adding an estimate of the infinite series we ignored.

A more subtle improvement is reducing the sum accumulator variable s mod 1. We only need the fractional part of s, and so by routinely cutting off the integer part we keep s from getting large. This improves accuracy by devoting all the bits in the machine representation of s to the fractional part.

from math import floor
import numpy as np

epsilon = np.finfo(float).eps

def term(N, n):
    # nth term of the series after shifting by 16^(N-1)
    return 16**(N-1-n) * (4/(8*n + 1) - 2/(8*n + 4) - 1/(8*n + 5) - 1/(8*n + 6))

def hexdigit(N):
    s = 0
    for n in range(N):
        s += 4*pow(16, N-n-1, 8*n + 1) / (8*n + 1)
        s -= 2*pow(16, N-n-1, 8*n + 4) / (8*n + 4)
        s -=   pow(16, N-n-1, 8*n + 5) / (8*n + 5)
        s -=   pow(16, N-n-1, 8*n + 6) / (8*n + 6)
        s = s % 1    # keep only the fractional part

    # estimate the tail of the series, starting where the loop above stopped
    n = N
    while True:
        t = term(N, n)
        s += t
        n += 1
        if abs(t) < epsilon:
            break

    frac = s - floor(s)
    return floor(16*frac)

I checked the output of this code with [1] for N = 10^6, 10^7, 10^8, and 10^9. The code will fail due to rounding error for some very large N. We could refine the code to push this value of N a little further out, but eventually machine precision will be the limiting factor.

[1] Named after the initials of the authors. Bailey, David H.; Borwein, Peter B.; Plouffe, Simon (1997). On the Rapid Computation of Various Polylogarithmic Constants. Mathematics of Computation. 66 (218) 903–913.

Colossus versus El Capitan: A Tale of Two Supercomputers

Colossus

The xAI Colossus supercomputer contains 100,000 NVIDIA H100 GPUs. Upgrades are planned, ultimately to as many as a million GPUs. The H100 has a theoretical peak speed of at least 60 teraFLOPS (FP64 tensor core), though the actual number depends on the power and frequency cap settings on the GPUs. Admittedly FP64 is overkill for Colossus' intended use for AI model training, though it is required for most scientific and engineering applications on typical supercomputers. This would put Colossus nominally at a theoretical peak speed of 6 exaFLOPS at full FP64 precision for dense matrix multiplies.

El Capitan

El Capitan at Lawrence Livermore National Lab now ranks as the #1 fastest system in the world on the TOP500 list, recently taking the crown from Frontier at Oak Ridge National Lab. Both Frontier and El Capitan were procured under the same collaborative CORAL-2 project by the two respective laboratories. El Capitan uses AMD Instinct MI300A GPUs for a theoretical peak speed of 2.746 exaFLOPS.

Which system is fastest?

You may wonder about the discrepancy: Colossus has more raw FLOPs, while El Capitan is ranked #1. Which system is actually faster? For decades, top system performance has commonly been measured for TOP500 using the High Performance Linpack (HPL) benchmark. Some have expressed concerns that HPL is an unrepresentative “FLOPs-only” benchmark. However, HPL actually measures more than raw rate of floating point operations. HPL performs distributed matrix products on huge matrices that become smaller and smaller in size during the HPL run, with a serial dependency between sequential matrix multiplies. Near the end of the run, performance becomes very limited by network latency, requiring excellent network performance. Furthermore, HPL is also a system stability test, since the system (often made up of brand new hardware for which bad parts must be weeded out) must stay up for a period of hours without crashing and at the end yield a correct answer (my colleague Phil Roth gives a description of this ordeal for Frontier). In short, a system could have lots of FLOPs but fail these basic tests of being able to run a nontrivial application.

Some commercial system owners may choose not to submit an HPL number, for whatever reason (though Microsoft submitted one and currently has a system at #4). In some cases submitting a TOP500 number may not be a mission priority for the owner. Or the system may not have an adequate network or the requisite system stability to produce a good number, in spite of having adequate FLOPs. Companies don’t typically give reasons for not submitting, but their reasons can be entirely valid, and not submitting a number has certainly happened before.

How long to build a system?

You may also wonder how it is that Colossus was stood up in 122 days (indeed a remarkable achievement by a highly capable team) whereas the CORAL-2 Project, which delivered El Capitan and Frontier, spanned multiple years.

Put simply, a system like Colossus stands on the shoulders of many multi-year investments in vendor hardware and software under projects like CORAL-2. Years ago, Oak Ridge National Lab originally put NVIDIA on the map for supercomputing with Titan, the first NVIDIA-powered petascale supercomputer. Some of the core NVIDIA software in use today was developed in part under this multi-year Titan project. Similarly for AMD and CORAL-2. Many systems, including Colossus, have benefitted from these long-term multi-year investments.

Another reason has to do with intended workloads of the respective systems. Colossus is intended primarily for AI model training; even though model architecture variations have slightly different computational patterns, the requirements are similar. El Capitan on the other hand is a general purpose supercomputer, and as such must support many different kinds of science applications with highly diverse requirements (and even more so at other centers like OLCF, ALCF and NERSC) (on system requirements, application diversity and application readiness see here, here and here). It’s much harder to put together a system to meet the needs of such a variety of science projects.

Conclusion

Colossus and El Capitan are both highly capable systems that will provide millions of node-hours of compute for their respective projects. Colossus has a high FLOP rate to support reduced precision matrix multiplies (and presumably high network bandwidth for Allreduce) required for AI model training. El Capitan has a balanced architecture to support a wide variety of science applications at scale.

ADDENDUM: Colossus is now up to 200K GPUs.

Unicode, Tolkien, and Privacy

When I'm in the mood to write, you can often follow a chain of thought in my posts. Recently, a post on LLM tokenization led to a post on how Unicode characters are tokenized, which led to a post on Unicode surrogates. The latter ended by touching on Unicode's PUA (Private Use Area), which of course leads to J.R.R. Tolkien and privacy, as we shall see.

To back up a bit, Unicode started as an attempt at one alphabet to rule them all, a coding system large enough to contain all the world's writing systems. Initially it was thought that 2^16 = 65,536 symbols would be enough, but that didn't last long. The Unicode standard now contains nearly 100,000 Chinese characters alone [1], along with myriad other symbols such as graphical drawing elements, mathematical symbols, and emoji.

Private Use Area

Although Unicode is vast, it doesn't have room to include every symbol anyone might use. Unicode anticipated that contingency and reserved some areas for private use. The best known example is probably Apple's use of U+F8FF for their logo. The private use area is also being used for ancient and medieval scripts, as well as for fictional writing systems such as Tolkien's Tengwar and Cirth.

Because anyone can use the private use area as they wish, there can be conflicts. For example, Apple intends U+F8FF to display their logo, but the code point is also used for a Klingon symbol. We'll see another example of a conflict at the end of this post.

Privacy

Here's the chain of thought that leads to privacy. When I started thinking about this post, I thought about creating an image of Tengwar writing, and that made me think about font fingerprinting. Browsers let web servers know what fonts you have installed, which serves a benign purpose: a site may want to send you text formatted in a particular font if it's available, but will fall back to a more common font if necessary.

However, the collection of fonts installed on a particular computer may be unique, or at least, combined with other browser information, enough to uniquely identify a user. Before installing a Tengwar font, I thought about how it would be sure to increase the uniqueness of my font fingerprint.

Tolkien's Tengwar

J. R. R. Tolkien's Tengwar writing system has been mapped to the Unicode range U+E000 to U+E07F.
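As a quick check (a sketch using Python's standard unicodedata module), code points in this range have Unicode category Co, the category for private use characters.

import unicodedata

for cp in (0xE000, 0xE07F, 0xF8FF):
    print(f"U+{cp:04X}", unicodedata.category(chr(cp)))   # all report 'Co'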

Here’s a sample. No telling what it looks like in your browser:

      

When I view the same text in the Tecendil online transcriber I see this:

Not all who wander are lost

But when I open the text in Emacs I see this:

Both are legitimate renderings because nobody owns the private use areas. The characters that Tecendil uses for Tengwar are apparently used by some font on my laptop to display Chinese characters.

I’m especially curious about the last character, U+E000. Tecendil interprets it as the Tengwar symbol tinco but something on my laptop interprets it as the Mozilla mascot.

Mozilla T-rex mascot

[1] It wasn’t so naive to think 16 bits would do, even including Chinese. There may be 100,000 Chinese characters, but a few thousand characters account for nearly all Chinese writing. More on that here. But if you’re going to break the 16-bit barrier anyway, you might as well try to include even the most rare characters.

Unicode surrogates

At the highest level, Unicode is simply a list of symbols. But when you look closer you find that isn't entirely true. Some of the symbols are sorta meta symbols. And while a list of symbols is not complicated, this list is adjacent to a lot of complexity.

I’ve explored various rabbit holes of Unicode in previous posts, and this time I’d like to go down the rabbit hole of surrogates. These came up in the recent post on how LLMs tokenize Unicode characters.

Incidentally, every time I hear “surrogates” I think of the Bruce Willis movie by that name.

Historical background

Once upon a time, text was represented in ASCII, and life was simple. Letters, numbers, and a few other symbols all fit into a single eight-bit byte with one bit left over. There wasn’t any need to distinguish ASCII and its encoding because they were synonymous. The nth ASCII character was represented the same way as the number n.

The A in ASCII stands for American, and ASCII was sufficient for American English, sorta. Then people played various tricks with the leftover bit to extend ASCII by adding 128 more symbols. This was enough, for example, to do things like include the tilde on the n in jalapeño, but representing anything like Russian letters, math symbols, or Chinese characters was out of the question.

So along came Unicode. Instead of using one byte to represent characters, Unicode used two bytes. This didn't double the possibilities; it squared them. Instead of 128 ASCII symbols, or 256 extended ASCII symbols, there was the potential to encode 2^16 = 65,536 symbols. Surely that would be enough! Except it wasn't. Soon people wanted even more symbols. Currently Unicode has 154,998 code points.

Encoding

The most obvious way to represent Unicode characters, back when there were fewer than 2^16 characters, was as a 16-bit number, i.e. to represent each character using two bytes rather than one. The nth Unicode character could be represented the same way as the number n, analogous to the case of ASCII.

But while text could potentially require thousands of symbols, a lot of text can be represented just fine with ASCII. Using two bytes per character would make the text take up twice as much memory. UTF-8 encoding was a very clever solution to this problem. Pure ASCII (i.e. not extended) text is represented exactly as in ASCII, taking up no more memory. And if text is nearly all ASCII with occasional non-ASCII characters sprinkled in, the size of the text in memory increases in proportion to the number of non-ASCII characters. Nearly pure ASCII text takes up nearly the same amount of memory.

But the cost of UTF-8 is encoding. The bit representation of the nth character is no longer necessarily the bit representation of the integer n.

UTF-16, while wasteful when representing English prose, is a direct encoding. Or at least it was until Unicode spilled over its initial size limit. You need more than a single pair of bytes to represent more than 2^16 characters.

Surrogate pairs

The way to represent more than 2^16 symbols with 16-bit characters is to designate some of the “characters” as meta characters. These special characters represent half of a pair of 16-bit units corresponding to a single Unicode character.

There are 1024 characters reserved as high surrogates (U+D800 through U+DBFF) and 1024 reserved as low surrogates (U+DC00 through U+DFFF). High surrogates contain the high-order bits and low surrogates contain the low-order bits.

Let n ≥ 2^16 be the code point of a character. Then the last 10 bits of the high surrogate are the highest 10 bits of n − 2^16, and the last 10 bits of the low surrogate are the lowest 10 bits of n.

In terms of math rather than bit representations, the pair of surrogates (HL) represent code point

2^16 + 2^10(H − 55296) + (L − 56320)

where 55296 = D800 in hex and 56320 = DC00 in hex.

Example

The rocket emoji has Unicode value U+1F680. The bit pattern representing the emoji is D83DDE80 in hex, the combination of high surrogate U+D83D and low surrogate U+DE80. We can verify this with the following Python.

    >>> H, L = 0xD83D, 0xDE80
    >>> 2**16 + 2**10*(H - 0xD800) + L - 0xDC00 == 0x1F680
    True

When we write out the high and low surrogates in binary we can visualize how they contain the high and low bits of the rocket codepoint.

    H    D83D    1101100000111101
    L    DE80    1101111010000000
         1F680   11111011010000000
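Here is a sketch that packages the surrogate arithmetic into a pair of functions (the function names are mine, not from any standard library).

def to_surrogates(codepoint):
    # split a code point beyond U+FFFF into a (high, low) surrogate pair
    n = codepoint - 0x10000            # 20-bit offset
    return 0xD800 + (n >> 10), 0xDC00 + (n & 0x3FF)

def from_surrogates(H, L):
    # inverse: combine a surrogate pair back into a code point
    return 0x10000 + ((H - 0xD800) << 10) + (L - 0xDC00)

assert to_surrogates(0x1F680) == (0xD83D, 0xDE80)
assert from_surrogates(0xD83D, 0xDE80) == 0x1F680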

High private use surrogates

The high surrogate characters U+D800 through U+DBFF are divided into two ranges. The range U+D800 through U+DB7F are simply called high surrogates, but the remaining characters U+DB80 through U+DBFF are called “high private use surrogates.”

These private use surrogates do not work any differently than the rest. They just correspond to code points in Unicode's private use area.

Practical consequences of tokenization details

I recently ran across the article Something weird is happening with LLMs and chess. One of the things it mentions is how a minor variation in a prompt can have a large impact on the ability of an LLM to play chess.

One extremely strange thing I noticed was that if I gave a prompt like “1. e4 e5 2. ” (with a space at the end), the open models would play much, much worse than if I gave a prompt like “1. e4 e5 2.” (without a space) and let the model generate the space itself. Huh?

The author goes on to explain that tokenization probably explains the difference. The intent is to get the LLM to predict the next move, but the extra space confuses the model because it is tokenized differently than the spaces in front of the e’s. The trailing space is tokenized as an individual character, but the spaces in front of the e’s are tokenized with the e’s. I wrote about this a couple days ago in the post The difference between tokens and words.

For example, ChatGPT will tokenize “hello world” as [15339, 1917] and “world hello” as [14957, 24748]. The difference is that the first string is parsed as “hello” and ” world” while the latter is parsed as “world” and ” hello”. Note the spaces attached to the second word in each case.
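Here is a sketch using the tiktoken library to reproduce the token lists above (gpt-3.5-turbo uses the cl100k_base encoding).

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
print(enc.encode("hello world"))   # [15339, 1917]
print(enc.encode("world hello"))   # [14957, 24748]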

The previous post was about how ChatGPT tokenizes individual Unicode characters. It mentions UTF-16, which is itself an example of how tokenization matters. The string “UTF-16” will be represented by three tokens, one each for “UTF”, “-”, and “16”. But the string “UTF16” will be represented by two tokens, one for “UTF” and one for “16”. The string “UTF16” might be more likely to be interpreted as a unit, a Unicode encoding.

ChatGPT tokens and Unicode

I mentioned in the previous post that not every Unicode character corresponds to a token in ChatGPT. Specifically, I'm looking at gpt-3.5-turbo in tiktoken. There are 100,256 possible tokens and 155,063 Unicode characters, so the pigeonhole principle says not every character corresponds to a token.

I was curious about the relationship between tokens and Unicode so I looked into it a little further.

Low codes

The Unicode characters U+D800 through U+DFFF all map to a single token, 5809. This is because these are not really characters per se but “surrogates,” code points that are used in pairs to represent other code points [1]. They don’t make sense in isolation.

The character U+FFFD, the replacement character �, also corresponds to 5809. It’s also not a character per se but a way to signal that another character is not valid.

Aside from the surrogates and the replacement character, every Unicode character in the BMP (Basic Multilingual Plane), that is, characters up to U+FFFF, has a unique representation in tokens. However, most require two or three tokens. For example, the snowman character ☃ is represented by two tokens: [18107, 225].

Note that this discussion is about single characters, not words. As the previous post describes, many words are tokenized as entire words, or broken down into units larger than single characters.

High codes

The rest of the Unicode characters, those outside the BMP, all have unique token representations. Of these, 3,404 are represented by a single token, but the rest require 2, 3, or 4 tokens. The rocket emoji, U+1F680, for example, is represented by three tokens: [9468, 248, 222].
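Here is a sketch of how to inspect the token representation of individual characters with tiktoken; the comments repeat the token lists quoted above.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
for ch in "☃🚀":
    print(f"U+{ord(ch):04X}", enc.encode(ch))
# U+2603 [18107, 225]
# U+1F680 [9468, 248, 222]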


[1] Unicode was originally limited to 16 bits, and UTF-16 represented each character with a 16-bit integer. When Unicode expanded beyond 2^16 characters, UTF-16 used pairs of surrogates, one high surrogate and one low surrogate, to represent code points higher than U+FFFF.

The difference between tokens and words

Large language models operate on tokens, not words, though tokens roughly correspond to words.

A list of words would not be practical. There is no definitive list of all English words, much less all words in all languages. Still, tokens correspond roughly to words, while being more flexible.

Words are typically turned into tokens using BPE (byte pair encoding). There are multiple implementations of this algorithm, giving different tokenizations. Here I use the tokenizer for gpt-3.5-turbo, which is also used in GPT-4.

Hello world!

If we look at the sentence “Hello world!” we see that it turns into three tokens: 9906, 1917, and 0. These correspond to “Hello”, ” world”, and “!”.

In this example, each token corresponds to a word or punctuation mark, but there's a little more going on. It is true that 0 is simply the token for the exclamation mark (we'll explain why in a moment), but it's not quite true to say 9906 is the token for “hello” and 1917 is the token for “world”.

Many to one

In fact 1917 is the token for ” world”. Note the leading space. The token 1917 represents the word “world,” not capitalized and not at the beginning of a sentence. At the beginning of a sentence, “World” would be tokenized as 10343. So one word may correspond to several different tokens, depending on how the word is used.

One to many

It’s also true that a word may be broken into several tokens. Consider the sentence “Chuck Mangione plays the flugelhorn.” This sentence turns into 9 tokens, corresponding to

“Chuck”, “Mang”, “ione”, ” plays”, ” fl”, “ug”, “el”, “horn”, “.”

So while there is a token for the common name “Chuck”, there is no token for the less common name “Mangione”. And while there is a single token for ” trumpet” there is no token for the less common “flugelhorn.”

Characters

The tokenizer will break words down as far as necessary to represent them, down to single letters if need be.

Each ASCII character can be represented as a token, as well as many Unicode characters. (There are 100256 total tokens, but currently 154,998 Unicode characters, so not all Unicode characters can be represented as tokens.)

Update: The next post dives into the details of how Unicode characters are handled.

The first 32 ASCII characters (codes 0 through 31) are non-printable control characters, and ASCII character 32 is a space. So the exclamation point is the first printable, non-space character, with ASCII code 33. The printable ASCII characters are tokenized as their ASCII value minus 33. So, for example, the letter A, ASCII 65, is tokenized as 65 − 33 = 32.
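Here is a sketch verifying that offset for a couple of printable ASCII characters, again using tiktoken.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
for c in "!A":
    print(repr(c), ord(c), enc.encode(c))   # expect the single token ord(c) - 33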

Tokenizing a dictionary

I ran every line of the american-english word list on my Linux box through the tokenizer, excluding possessives. There are 6,015 words that correspond to a single token, 37,012 that require two tokens, 26,283 that require three tokens, and so on. The maximum was a single word, netzahualcoyotl, that required 8 tokens.

The 6,015 words that correspond to a single token are the most common words in English, and so quite often a token does represent a word. (And maybe a little more, such as whether the word is capitalized.)
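Here is a sketch of that experiment, assuming the word list lives at /usr/share/dict/american-english; exact counts will depend on the version of the word list.

from collections import Counter
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
counts = Counter()
with open("/usr/share/dict/american-english") as f:
    for line in f:
        word = line.strip()
        if "'" in word:            # skip possessives
            continue
        counts[len(enc.encode(word))] += 1

for k in sorted(counts):
    print(k, counts[k])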

A simpler GELU activation function approximation

The GELU (Gaussian Error Linear Units) activation function was proposed in [1]. This function is x Φ(x) where Φ is the CDF of a standard normal random variable. As you might guess, the motivation for the function involves probability. See [1] for details.

The GELU function is not too far from the more familiar ReLU, but it has advantages that we won’t get into here. In this post I wanted to look at approximations to the GELU function.

Since an implementation of Φ is not always available, the authors provide the following approximation:

\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left(\sqrt{\frac{2}{\pi}} (x + 0.044715x^3) \right) \right)

I wrote about a similar but simpler approximation for Φ a while back, and multiplying by x gives the approximation

\text{GELU}(x) \approx 0.5x(1 + \tanh 0.8x)

The approximation in [1] is more accurate, though the difference between the exact values of GELU(x) and those of the simpler approximation is hard to see in a plot.

Since model weights are not usually needed to high precision, the simpler approximation may be indistinguishable in practice from the more accurate approximation.
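Here is a sketch comparing both approximations to the exact GELU on a grid, using scipy's normal CDF for Φ.

import numpy as np
from scipy.stats import norm

x = np.linspace(-6, 6, 2001)
exact   = x * norm.cdf(x)
approx1 = 0.5*x*(1 + np.tanh(np.sqrt(2/np.pi)*(x + 0.044715*x**3)))  # from [1]
approx2 = 0.5*x*(1 + np.tanh(0.8*x))                                 # simpler

# maximum absolute error of each approximation; the one from [1] is smaller
print(np.max(np.abs(exact - approx1)))
print(np.max(np.abs(exact - approx2)))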


[1] Dan Hendrycks, Kevin Gimpel. Gaussian Error Linear Units (GELUs). Available on arXiv.