How UTF-8 works

UTF-8 is a clever way of encoding Unicode text. I’ve mentioned it a couple times lately, but I haven’t blogged about UTF-8 per se. Here goes.

The problem UTF-8 solves

US keyboards can often produce 101 symbols, which suggests 101 symbols would be enough for most English text. Seven bits would be enough to encode these symbols since 2⁷ = 128, and that’s what ASCII does. It represents each character with 8 bits, since computers work with bits in groups whose sizes are powers of 2, but the first bit is always 0 because it’s not needed. Extended ASCII uses the leftover space in ASCII to encode more characters.

A total of 256 characters might serve some users well, but it wouldn’t begin to let you represent, for example, Chinese. Unicode initially wanted to use two bytes instead of one byte to represent characters, which would allow for 2¹⁶ = 65,536 possibilities, enough to capture a lot of the world’s writing systems. But not all, and so Unicode expanded to four bytes.

If you were to store English text using two bytes for every letter, half the space would be wasted storing zeros. And if you used four bytes per letter, three quarters of the space would be wasted. Without some kind of encoding every file containing English text would be two or four times larger than necessary. And not just English, but every language that can be represented with ASCII.

UTF-8 is a way of encoding Unicode so that an ASCII text file encodes to itself. No wasted space, beyond the initial bit of every byte ASCII doesn’t use. And if your file is mostly ASCII text with a few non-ASCII characters sprinkled in, the non-ASCII characters just make your file a little longer. You don’t have to suddenly make every character take up twice or four times as much space just because you want to use, say, a Euro sign € (U+20AC).

How UTF-8 does it

Since the first bit of an ASCII character is always zero, bytes whose first bit is 1 are not used by ASCII and are free to take on a special meaning.

When software reading UTF-8 comes across a byte starting with 1, it counts how many 1’s follow before encountering a 0. For example, in a byte of the form 110xxxxx, there’s a single 1 following the initial 1. Let n be the number of 1’s between the initial 1 and the first 0. The remaining bits in this byte and some bits in the next n bytes will represent a Unicode character. There’s no need for n to be bigger than 3 for reasons we’ll get to later. That is, it takes at most four bytes to represent a Unicode character using UTF-8.

So a byte of the form 110xxxxx says the first five bits of a Unicode character are stored at the end of this byte, and the rest of the bits are coming in the next byte.

A byte of the form 1110xxxx contains four bits of a Unicode character and says that the rest of the bits are coming over the next two bytes.

A byte of the form 11110xxx contains three bits of a Unicode character and says that the rest of the bits are coming over the next three bytes.

Following the initial byte announcing the beginning of a character spread over multiple bytes, bits are stored in bytes of the form 10xxxxxx. Since the initial bytes of a multibyte sequence start with two 1 bits, there’s no ambiguity: a byte starting with 10 cannot mark the start of a new multibyte sequence. That is, UTF-8 is self-punctuating.

So multibyte sequences have one of the following forms.

    110xxxxx 10xxxxxx
    1110xxxx 10xxxxxx 10xxxxxx
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

If you count the x’s in the bottom row, there are 21 of them. So this scheme can only represent numbers with up to 21 bits. Don’t we need 32 bits? It turns out we don’t.

Although a Unicode character is ostensibly a 32-bit number, it actually takes at most 21 bits to encode a Unicode character for reasons explained here. This is why n, the number of 1’s following the initial 1 at the beginning of a multibyte sequence, only needs to be 1, 2, or 3. The UTF-8 encoding scheme could be extended to allow n = 4, 5, or 6, but this is unnecessary.
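
To make the bit patterns concrete, here’s a minimal Python sketch of the encoding scheme described above. (Python’s built-in encode method does all this for you; the function here is just for illustration.)

    def utf8_encode(cp):
        # Encode a single Unicode code point as UTF-8 bytes.
        if cp < 0x80:             # 0xxxxxxx: plain ASCII
            return bytes([cp])
        elif cp < 0x800:          # 110xxxxx 10xxxxxx
            return bytes([0xC0 | (cp >> 6),
                          0x80 | (cp & 0x3F)])
        elif cp < 0x10000:        # 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        else:                     # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return bytes([0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])

    print(utf8_encode(0x20AC).hex())    # e282ac, the Euro sign €
    print("€".encode("utf-8").hex())    # same bytes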

Efficiency

UTF-8 lets you take an ordinary ASCII file and consider it a Unicode file encoded with UTF-8. So UTF-8 is as efficient as ASCII in terms of space. But not in terms of time. If software knows that a file is in fact ASCII, it can take each byte at face value, not having to check whether it is the first byte of a multibyte sequence.

And while plain ASCII is legal UTF-8, extended ASCII is not. So extended ASCII characters would now take two bytes where they used to take one. My previous post was about the confusion that could result from software interpreting a UTF-8 encoded file as an extended ASCII file.


Excel, R, and Unicode

I received some data as an Excel file recently. I cleaned things up a bit, exported the data to a CSV file, and read it into R. Then something strange happened.

Say the CSV file looked like this:

    foo,bar
    1,2
    3,4

I read the file into R with

    df <- read.csv("foobar.csv", header=TRUE)

and could access the second column as df$bar but could not access the first column as df$foo. What’s going on?

When I ran names(df) it showed me that the first column was named not foo but ï..foo. I opened the CSV file in a hex editor and saw this:

    efbb bf66 6f6f 2c62 6172 0d0a 312c 320d

The ASCII code for f is 0x66, o is 0x6f, etc. and so the file makes sense, starting with the fourth byte.

If you saw my post about Unicode the other day, you may have seen Daniel Lemire’s comment:

There are various byte-order masks like EF BB BF for UTF-8 (unused).

Aha! The first three bytes of my data file are exactly the byte order mark that Daniel mentioned. These bytes are intended to announce that the file should be read as UTF-8, a way of encoding Unicode that is equivalent to ASCII if the characters in the file are in the range of ASCII.

Now we can see where the funny characters in front of “foo” came from. Instead of interpreting EF BB BF as a byte order mark, R interpreted the first byte 0xEF as U+00EF, “Latin Small Letter I with Diaeresis.” I don’t know how BB and BF became periods (U+002E). But if I dump the file to a Windows command prompt, I see the first line as

    ï»¿foo,bar

with the first three characters being the Unicode characters U+00EF, U+00BB, and U+00BF.
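
Both facts are easy to check in Python (my sanity check, not part of the original workflow):

    >>> "\ufeff".encode("utf-8").hex()      # the UTF-8 encoding of the BOM
    'efbbbf'
    >>> b"\xef\xbb\xbf".decode("latin-1")   # the same bytes read as extended ASCII
    'ï»¿'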

How to fix the encoding problem with R? The read.csv function has an optional encoding parameter. I tried setting this parameter to “utf-8” and “utf8”. Neither made any difference. I looked at the R documentation, and it seems I need to set it to “UTF-8”. When I did that, the name of the first column became X.U.FEFF.foo [1]. I don’t know what’s up with that, except FEFF is the byte order mark (BOM) I mentioned in my Unicode post.

Apparently my troubles started when I exported my Excel file as CSV UTF-8. I converted the UTF-8 file to ASCII using Notepad and everything worked. I also could have saved the file directly to ASCII. If you look at the list of Excel export options, you’ll first see CSV UTF-8 (that’s why I picked it), but if you go further down you’ll see an option that’s simply CSV, implicitly in ASCII.

Unicode is great when it works. This blog is Unicode encoded as UTF-8, as are most pages on the web. But then you run into weird things like the problem described in this post. Does the fault lie with Excel? With R? With me? I don’t know, but I do know that the problem goes away when I stick to ASCII.

***

[1] A couple people pointed out in the comments that you could use fileEncoding="UTF-8-BOM" to fix the problem. This works, though I didn’t see it in the documentation the first time. The read.csv function takes an encoding parameter that appears to be for this purpose, but is a decoy. You need the fileEncoding parameter. With enough persistence you’ll eventually find that "UTF-8-BOM" is a possible value for fileEncoding.

How fast were dead languages spoken?

A new paper in Science suggests that all human languages carry about the same amount of information per unit time. In languages with fewer possible syllables, people speak faster. In languages with more syllables, people speak slower.

Researchers quantified the information content per syllable in 17 different languages by calculating Shannon entropy. When you multiply the information per syllable by the number of syllables per second, you get around 39 bits per second across a wide variety of languages.

If a language has N possible syllables, and the probability of the ith syllable occurring in speech is p_i, then the average information content of a syllable, as measured by Shannon entropy, is

-\sum_{i=1}^N p_i \log_2 p_i

For example, if a language had only eight possible syllables, all equally likely, then each syllable would carry 3 bits of information. And in general, if there were 2ⁿ syllables, all equally likely, then the information content per syllable would be n bits. Just like n zeros and ones, hence the term bits.

Of course not all syllables are equally likely to occur, and so it’s not enough to know the number of syllables; you also need to know their relative frequency. For a fixed number of syllables, the more evenly the frequencies are distributed, the more information is carried per syllable.

If ancient languages conveyed information at 39 bits per second, as a variety of modern languages do, one could calculate the entropy of the language’s syllables and divide 39 by the entropy to estimate how many syllables the speakers spoke per second.

According to this overview of the research,

Japanese, which has only 643 syllables, had an information density of about 5 bits per syllable, whereas English, with its 6949 syllables, had a density of just over 7 bits per syllable. Vietnamese, with its complex system of six tones (each of which can further differentiate a syllable), topped the charts at 8 bits per syllable.

One could do the same calculations for Latin, or ancient Greek, or Anglo-Saxon that the researchers did for Japanese, English, and Vietnamese.

If all 643 syllables of Japanese were equally likely, the language would convey -log2(1/643) = 9.3 bits of information per syllable. The overview says Japanese carries 5 bits per syllable, and so the efficiency of the language is 5/9.3 or about 54%.

If all 6949 syllables of English were equally likely, a syllable would carry 12.7 bits of information. Since English carries around 7 bits of information per syllable, the efficiency is 7/12.7 or about 55%.

Taking a wild guess by extrapolating from only two data points, maybe around 55% efficiency is common. If so, you could estimate the entropy per syllable of a language just from counting syllables.
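
Putting the arithmetic above in one place, here’s a short Python sketch using the numbers from the quoted overview:

    import math

    def entropy(probs):
        # Shannon entropy in bits per syllable
        return -sum(p * math.log2(p) for p in probs)

    print(entropy([1/8] * 8))    # 3.0 bits: eight equally likely syllables

    # efficiency: reported bits per syllable over the maximum possible
    print(5 / math.log2(643))    # Japanese, about 0.54
    print(7 / math.log2(6949))   # English, about 0.55

    # estimated syllables per second at 39 bits per second
    print(39 / 5)                # Japanese, about 7.8
    print(39 / 7)                # English, about 5.6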


Quiet mode

When you start a programming language like Python or R from the command line, you get a lot of initial text that you probably don’t read. For example, you might see something like this when you start Python.

    Python 2.7.6 (default, Nov 23 2017, 15:49:48)
    [GCC 4.8.4] on linux2
    Type "help", "copyright", "credits" or "license" for more information.

The version number is a good reminder. I’m used to the command python bringing up Python 3+, so seeing the text above would remind me that on that computer I need to type python3 rather than simply python.

But if you’re working at the command line and jumping over to Python for a quick calculation, the start up verbiage separates your previous work from your current work by a few lines. This isn’t such a big deal with Python, but it is with R:

    R version 3.6.1 (2019-07-05) -- "Action of the Toes"
    Copyright (C) 2019 The R Foundation for Statistical Computing
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    R is free software and comes with ABSOLUTELY NO WARRANTY.
    You are welcome to redistribute it under certain conditions.
    Type 'license()' or 'licence()' for distribution details.

      Natural language support but running in an English locale

    R is a collaborative project with many contributors.
    Type 'contributors()' for more information and
    'citation()' on how to cite R or R packages in publications.

    Type 'demo()' for some demos, 'help()' for on-line help, or
    'help.start()' for an HTML browser interface to help.
    Type 'q()' to quit R.

By the time you see all that, your previous work may have scrolled out of sight.

There’s a simple solution: use the option -q for quiet mode. Then you can jump in and out of your REPL with a minimum of ceremony and keep your previous work on screen.

For example, the following shows how you can use Python and bc without a lot of wasted vertical space.

    > python -q
    >>> 3+4
    7
    >>> quit()

    > bc -q
    3+4
    7
    quit

Python added the -q option in version 3, which the example above uses. Python 2 does not have an explicit quiet mode option, but Mike S points out a clever workaround in the comments. You can open a Python 2 REPL in quiet mode by using the following.

    python -ic ""

The combination of the -i and -c options tells Python to run the following script and enter interpreter mode. In this case the script is just the empty string, so Python does nothing but quietly enter the interpreter.

R has a quiet mode option, but by default R has the annoying habit of asking whether you want to save a workspace image when you quit.

    > R.exe -q
    > 3+4
    [1] 7
    > quit()
    Save workspace image? [y/n/c]: n

I have never wanted R to save a workspace image; I just don’t work that way. I’d rather keep my state in scripts. So I alias R to launch with the --no-save option.

So if you launch R with -q and --no-save it takes up no more vertical space than Python or bc.
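
For example, you could put a line like the following in your shell startup file (a sketch, assuming a bash-like shell):

    alias R="R -q --no-save"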


More bc weirdness

As I mentioned in a footnote to my previous post, I just discovered that variable names in the bc programming language cannot contain capital letters. I think I understand why: Capital letters are reserved for hexadecimal constants, though in a weird sort of way.

At first variable names in bc could only be one letter long. (This is still the case in the POSIX version of bc but not in Gnu bc.) And since A through F were reserved, you might as well make things simple and just reserve all capital letters. Maybe that was the thinking.

If you enter A at the bc prompt, you get back 10. Enter B you get 11, etc. So bc assumes a number containing a hex character is a hex number, right? Actually no. It assumes that any single letter that could be a hex number is one. But in numbers with multiple digits, it interprets letters as 9’s. Yes, 9’s.

The full story is a little more complicated. bc will work in multiple bases, and it lets you set the input and output bases with the variables ibase and obase respectively. Both are set to 10 by default. When a number contains multiple characters, letters less than ibase are interpreted as you’d expect. But letters greater than or equal to ibase are interpreted as ibase – 1.

So in base 12 in a number represented by more than one character, A means 10 and B means 11. But C, D, E, and F also mean 11. For example, A0 is 120 and BB is 143. But CC is also 143.

If ibase is set to 10, then the expression E == F evaluates to false, because 14 does not equal 15. But the expression EE == FF evaluates to true, because 99 equals 99.

If you set ibase to 16, then you’re in hex mode and the letters A through F behave exactly as expected.

If you want to go back to base 10, you need to set ibase to A, not 10. If you’re in hex mode, every number you enter is interpreted in hex, and so “10” is interpreted as the number we usually write as 16. In any base, setting ibase to 10 does nothing because it sets the base equal to the base.
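
Here’s a transcript illustrating all of the above, with bc’s response shown after each input line (GNU bc, which prints 1 for true and 0 for false):

    E == F        /* 14 == 15 */
    0
    EE == FF      /* 99 == 99 */
    1
    ibase = 16    /* hex mode */
    FF
    255
    ibase = 10    /* does nothing: "10" is read in hex as 16 */
    FF
    255
    ibase = A     /* back to base ten */
    FF            /* letters >= ibase are read as 9 */
    99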

Asimov’s question about π

In 1977, Isaac Asimov [1] asked how many terms of the slowly converging series

π = 4 – 4/3 + 4/5 – 4/7 + 4/9 – …

would you have to sum before doing better than the approximation

π ≈ 355/113.

A couple years later Richard Johnsonbaugh [2] answered Asimov’s question in the course of an article on techniques for computing the sum of series. Johnsonbaugh said you would need at least N = 3,748,630 terms.

Johnsonbaugh’s answer is based on exact calculations. I wondered how well you’d do with N terms using ordinary floating point arithmetic. Would there be so much rounding error that the result is terrible?

I wrote the most direct implementation in Python, with no tricks to improve the accuracy.

    from math import pi

    s = 0
    N = 3748630
    for n in range(1, N+1):
        s += (-1)**(n+1) * 4/(2*n - 1)

    # compare the error of the partial sum with the error of 355/113
    print(abs(s - pi), abs(355/113 - pi))

I intended to follow this up by showing that you could do better by summing all the positive and negative terms separately, then doing one subtraction at the end. But the naive version actually does quite well. It’s essentially as accurate as 355/113, with both approximations having an error of 2.66764 × 10⁻⁷.

Extended precision with bc

Next, I translated my program to bc [3] so I could control the precision. bc lets you specify your desired precision with its scale parameter.

    /* run as bc -l so the math library defines a(), the arctangent */
    scale = 20
    pi = 4*a(1)
    s = 0
    m = 3748630
    for (n = 1; n <= m; n++) {
        s += (-1)^(n+1) * 4/(2*n - 1)
    }
    pi - s          /* error in the series */
    pi - 355/113    /* error in 355/113 */

Floating point precision is between 15 and 16 decimal places. I added more precision by setting scale to 20, i.e. carrying out calculations to 20 decimal places, and summed the series again.

The absolute error in the series was less than the error in 355/113 in the 14th decimal place. When I used one less term in the series, its error was larger than the error in 355/113 in the 14th decimal place. In other words, the calculations suggest Johnsonbaugh found exactly the minimum number of terms needed.

I doubt Johnsonbaugh ever verified his result computationally. He doesn’t mention computer calculations in his paper [4], and it would have been difficult in 1979 to have access to the necessary hardware and software.

If he had access to an Apple II at the time, it would have run at 1 MHz. My calculation took around a minute to run on a 2 GHz laptop, so I’m guessing the same calculation would have taken a day or two on an Apple II. This assumes he could find extended precision software like bc on an Apple II, which is doubtful.

The bc programming language had been written four years earlier, so someone could have run a program like the one above on a Unix machine somewhere. However, such machines were hard to find, and their owners would have been reluctant to give up a couple days of compute time for a guest to run a frivolous calculation.


[1] Isaac Asimov, Asimov on Numbers, 1977.

[2] Richard Johnsonbaugh, Summing an Alternating Series. The American Mathematical Monthly, Vol. 86, No. 8 (1979), pp. 637–648.

[3] Notice that N from the Python program became m in bc. I’ve used bc occasionally for years, and didn’t know until now that you cannot use capital letters for variables in standard bc. I guess I never tried before. The next post explains why bc doesn’t allow capital letters in variable names.

[4] Johnsonbaugh’s paper does include some numerical calculations, but he only sums up 500 terms, not millions of terms, and it appears he only uses ordinary precision, not extended precision.

National Drug Code (NDC)

The US Food and Drug Administration tracks drugs using an identifier called the NDC or National Drug Code. It is described as a 10-digit code, but it may be more helpful to think of it as a 12-character code.

An NDC contains 10 digits, separated into three segments by two dashes. The three segments are the labeler code, product code, and package code. The FDA assigns the labeler codes to companies, and each company assigns its own product and package codes.

Format

The segments are of variable length and so the dashes are significant. The labeler code could be 4 or 5 digits. The product code could be 3 or 4 digits, and the package code could be 1 or 2 digits. The total number of digits must be 10, so there are three possible combinations:

  • 4-4-2
  • 5-3-2
  • 5-4-1

There’s no way to look at just the digits and know how to separate them into three segments. My previous post looked at self-punctuating codes. The digits of NDC codes are not self-punctuating because they require the dashes. The digit combinations are supposed to be unique, but you can’t tell how to parse a set of digits from the digits alone.

Statistics

I downloaded the NDC data from the FDA to verify whether the codes work as documented, and to see the relative frequency of various formats.

(The data change daily, so you may get different results if you try this yourself.)

Format

All the codes were 12 characters long, and all had the documented format as verified by the regular expression [1]

    \d{4,5}-\d{3,4}-\d{1,2}
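
Here’s a small Python sketch of how one might check a code against the documented format and recover its segment lengths (the function name is mine):

    import re

    ndc_re = re.compile(r"^\d{4,5}-\d{3,4}-\d{1,2}$")

    def ndc_format(code):
        # Return the segment-length pattern, e.g. "5-3-2", or None if invalid.
        if not ndc_re.match(code) or len(code.replace("-", "")) != 10:
            return None
        return "-".join(str(len(segment)) for segment in code.split("-"))

    print(ndc_format("29500-907-77"))    # 5-3-2
    print(ndc_format("29500-9077-7"))    # 5-4-1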

Uniqueness exception

I found one exception to the rule that the sequence of digits should be unique. The command

    sed "s/-//g" ndc.txt | sort | uniq -d

returned 2950090777.

The set of NDC codes contained both 29500-907-77 and 29500-9077-7.

Distribution

About 60% of the codes had the form 5-3-2. About 30% had the form 5-4-1, and the remaining 10% had the form 4-4-2.

There were a total of 252,355 NDC codes with 6,532 different labelers (companies).

There were 9,448 NDC codes associated with the most prolific labeler. The 1,424 least prolific labelers had only one NDC code each. In Pareto-like fashion, the top 20% of labelers accounted for about 90% of the codes.


[1] Programming languages like Python or Perl will recognize this regular expression, but by default grep does not support \d for digits. The Gnu implementation of grep with the -P option will. It will also understand notation like {4,5} to mean a pattern is repeated 4 or 5 times, with or without -P, but I don’t think other implementations of grep necessarily will.

Prefix code examples

In many offices, you can dial a three digit number to reach someone else in the office. In such offices, you usually have to dial 9 to reach an outside number. There’s no ambiguity because no one can have an extension that begins with 9. After you’ve entered three digits, the phone system knows whether you’ve dialed an in-office extension or the first three digits of an outside call.

This is an example of a prefix code: No valid phone number is a prefix of another valid phone number. We’ll look at a few more examples of prefix codes in the context of phone numbers, then look at Roman numerals, Morse code, Unicode encoding, data compression, and RPN calculators.

It used to be true in the US that you could dial four or five digits for a local call, seven digits for a call within the area code, and ten digits for a long distance phone call. This didn’t cause any ambiguity because no local number would begin with the digits of your area code. You had to dial a 1 before dialing a long distance number, and no local or area code numbers begin with 1.

There are still parts of the US where you can dial either a seven digit number or a ten digit number. In most of the US you always enter 10 digits. This is a trivial form of prefix coding because all fixed-length codes are prefix codes.

A final example of prefix codes related to telephony is country calling codes. These codes have varying lengths, but phone exchanges can know where the country code stops and the number within a country starts because no prefix of a country code is a valid country code.

Prefix codes are sometimes called self-punctuating codes. This is because you don’t need an additional symbol, a form of punctuation, to mark the end of codes.

We usually think of Morse code as a system of two symbols: dots and dashes. But it’s really a system of four symbols: dots, dashes, short spaces between letters, and longer spaces between words. As a system of only dots and dashes, Morse code is not self-punctuating. Without spaces, you couldn’t tell, for example, whether dot dash represented an A (dot dash) or an E (dot) followed by a T (dash). It’s possible to design a prefix code with just two symbols, but that’s not what Morse did.

Q codes are not part of Morse code per se but are often used in Morse code. These are three letter codes beginning with Q. For example, QRP is an abbreviation for the question “Should I reduce power?”. Q codes are prefix codes because they have fixed length. If QR were a valid Q code by itself, that would ruin the prefix property; a recipient would not know whether to interpret QR as a complete code until listening for the next letter.

The letter components of Roman numerals are not a prefix code because you can’t tell whether a letter stands for a positive or negative amount until you read the next letter. For example, in CLX the X represents 10 but in CXL the X represents -10. If you wrote Roman numerals backward, the letters would form a prefix code.

In the previous post, I discussed UTF-16 encoding of Unicode. The way UTF-16 encodes characters outside the Basic Multilingual Plane makes it a prefix code; the meaning of a surrogate doesn’t depend on any down-stream information. UTF-8 is also a prefix code, which I discuss in detail here.

You may have seen Huffman coding, a form of data compression that uses a prefix code.
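
As a small illustration, here’s a sketch of the standard Huffman construction in Python. The codes it produces form a prefix code: frequent characters get short codes, and no code is a prefix of another.

    import heapq
    from collections import Counter

    def huffman_codes(text):
        # Build a prefix code from character frequencies.
        heap = [(n, i, {ch: ""}) for i, (ch, n) in enumerate(Counter(text).items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:
            n1, _, c1 = heapq.heappop(heap)    # two least frequent subtrees
            n2, _, c2 = heapq.heappop(heap)
            merged = {ch: "0" + code for ch, code in c1.items()}
            merged.update({ch: "1" + code for ch, code in c2.items()})
            heapq.heappush(heap, (n1 + n2, count, merged))
            count += 1
        return heap[0][2]

    print(huffman_codes("mississippi"))
    # {'s': '0', 'm': '100', 'p': '101', 'i': '11'}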

Reverse Polish Notation is an example of prefix coding. An RPN calculator doesn’t need parentheses for punctuation. You can enter calculations unambiguously with just digits and arithmetic operators because the meaning of a computation does not depend on any future input. Prefix codes are sometimes called instantaneous codes because of this feature.


How many possible Unicode characters there are and why

How many?

The previous post showed how the number of Unicode characters has grown over time.

[Plot: Unicode size by version]

You’ll notice there was a big jump between versions 3.0 and 3.1. That will be important later on.

Unicode started out relatively small, then became much more ambitious. Are they going to run out of room? How many possible Unicode characters are there?

Short answer: There are 1,111,998 possible Unicode characters.

Longer answer: There are 17 × 2¹⁶ − 2048 − 66 = 1,111,998 possible Unicode characters: seventeen 16-bit planes, with 2048 values reserved as surrogates, and 66 reserved as non-characters. More on this below.

Which ones?

Going one level of detail deeper, which numbers correspond to Unicode characters?

The hexadecimal numbers 0 through 10FFFF are potential Unicode characters, with the exception of surrogates and non-characters.

Unicode is divided into 17 planes. The first two hexadecimal “digits” indicate the plane, and the last four indicate a value within the plane. The first plane is known as the BMP, the Basic Multilingual Plane. The rest are known as supplemental planes.

The surrogates are DC00 through DFFF and D800 through DBFF. The first range of 1024 surrogates is known as the low surrogates, and the second range of 1024 as the high surrogates.

The non-characters are FDD0 through FDEF and the last two values in each plane: FFFE, FFFF, 1FFFE, 1FFFF, 2FFFE, 2FFFF, …, 10FFFE, 10FFFF. This is one range of 32 non-characters, plus 2 × 17 = 34 coming from the ends of the planes, for a total of 66.
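
The counts are easy to verify. Here’s a quick Python check of the arithmetic (my verification, not part of the original post):

    surrogates = set(range(0xD800, 0xE000))       # D800 through DFFF
    noncharacters = set(range(0xFDD0, 0xFDF0))    # FDD0 through FDEF
    for plane in range(17):
        base = plane * 0x10000
        noncharacters |= {base + 0xFFFE, base + 0xFFFF}

    print(len(surrogates))                        # 2048
    print(len(noncharacters))                     # 66
    print(17 * 2**16 - 2048 - 66)                 # 1111998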

Why?

Why are there only 17 planes? And what are these mysterious surrogates and non-characters? What purpose do they serve?

The limitations of UTF-16 encoding explain why 17 planes and why surrogates. Non-characters require a different explanation.

UTF-16

This post mentioned at the top that the size of Unicode jumped between versions 3.0 and 3.1. Significantly, the size went from less than 2¹⁶ to more than 2¹⁶. Unicode broke out of the Basic Multilingual Plane.

Unicode needed a way to represent more than 2¹⁶ characters using groups of 16 bits. The solution to this problem was UTF-16 encoding. With this encoding, the surrogate values listed above do not represent characters per se but are a kind of pointer to further values.

Sixteen supplemental planes would take 20 bits to describe, 4 to indicate the plane and 16 for the values within the plane. The idea was to use a high surrogate to represent the first 10 bits and a low surrogate to represent the last 10 bits. The values DC00 through DFFF and D800 through DBFF were unassigned at the time, so they were picked for surrogates.

In a little more detail, a character in one of the supplemental planes is represented by a hexadecimal number between 1 0000 and 10 FFFF. If we subtract off 1 0000 we get a number between 0000 and FFFFF, a 20-bit number. Take the first 10 bits and add them to D800 to get a high surrogate value. Take the last 10 bits and add them to DC00 to get a low surrogate value. This pair of surrogate values represents the value in one of the supplemental planes.
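
Here is that arithmetic as a small Python sketch (the example character is my choice):

    def surrogate_pair(cp):
        # Split a supplementary-plane code point into a UTF-16 surrogate pair.
        assert 0x10000 <= cp <= 0x10FFFF
        v = cp - 0x10000               # now a 20-bit number
        high = 0xD800 + (v >> 10)      # first 10 bits
        low  = 0xDC00 + (v & 0x3FF)    # last 10 bits
        return high, low

    h, l = surrogate_pair(0x1F600)     # U+1F600, an emoji outside the BMP
    print(hex(h), hex(l))              # 0xd83d 0xde00
    print("\U0001F600".encode("utf-16-be").hex())    # d83dde00, the same pair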

When you encounter a surrogate value, you don’t need any further context to tell what it is. You don’t need to look upstream for some indication of how the bits are to be interpreted. It cannot be a BMP character, and there’s no doubt whether it is the beginning or end of a pair of surrogate values since the high surrogates and low surrogates are in different ranges.

UTF-16 can only represent 17 planes, and the Unicode Consortium decided they would not assign values that cannot be represented in UTF-16. So that’s why there are 17 planes.

Non-characters

That leaves the non-characters. Why are a few values reserved to never be used for characters?

One use for non-characters is to return a null value as an error indicator, analogous to a NaN or not-a-number in floating point calculations. A program might return FFFF, for example, to indicate that it was unable to read a character.

Another use for special non-characters is to imply which encoding method is used. For reasons that are too complicated to get into here, computers do not always store the bytes within a word in increasing order. In so-called “little endian” order, lower order bits are stored before higher order bits. (“Big endian” and “little endian” are allusions to the two factions in Gulliver’s Travels that crack boiled eggs on their big end and little end respectively.)

The byte order mark FEFF is inserted at the beginning of a file or stream to imply byte ordering. If it is received in the order FEFF then the byte stream is inferred to be using the big endian convention. But if it is received in the order FFFE then little endian is inferred because FFFE cannot be a character.
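
You can see the two byte orders in Python, for example:

    >>> "\ufeff".encode("utf-16-be").hex()
    'feff'
    >>> "\ufeff".encode("utf-16-le").hex()
    'fffe'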

The preceding paragraphs give a justification for at least two non-characters, FFFF and FFFE, but it’s not clear why 66 are reserved. There could be reasons for each plane to have its own FFFF and FFFE, which would account for 34 non-characters. I’m not clear on why FDD0 through FDEF are non-characters, though I vaguely remember there being some historical reason. In any case, people are free to use the non-characters however they see fit.


Growth of Unicode over time

My previous post quoted Randall Munroe saying Unicode “started out just trying to unify a couple different character sets” and grew much more ambitious.

The first version of Unicode, published in 1991, had 7,191 characters. Now the latest version has 137,994 characters and so is about 19 times bigger. Here’s a plot of the number of characters in Unicode over time.

[Plot: Number of Unicode characters over time]

Here’s a slightly different plot where the horizontal axis is version number rather than time.

[Plot: Number of Unicode characters by standard version number]

There’s plenty of room left in Unicode. The maximum number of possible Unicode characters is 1,111,998 for reasons I get into here.