Almost ASCII

I was working recently with a gigabyte file that had a dozen non-ASCII characters. This is very common. The ASCII character set is not quite big enough for a lot of tasks. Of course it’s completely inadequate if you’re writing Japanese, but it’s almost enough for documents written in English and a few other languages.

Efficient encoding

The world has standardized on Unicode as the way to represent characters across languages. Unicode currently has around 150,000 characters, far more than ASCII’s 128 characters.

But there’s a problem. Since 150,000 > 2^17, it takes more than two bytes (eight bits to a byte) to represent each of 150,000 things. If you use three bytes to represent each character, every file that is almost all ASCII will get three times bigger. If you limit yourself to the most frequently used Unicode characters, those that can be represented with two bytes (the “basic multilingual plane”), then you still double the size of files.
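The arithmetic is quick to check. Here is a minimal sketch, taking 150,000 as the approximate character count quoted above; the variable names are my own.

```python
import math

n_chars = 150_000  # approximate number of Unicode characters (assumed round figure)

# Bits needed to give each of n_chars items a distinct fixed-width code
bits_needed = (n_chars - 1).bit_length()

# Whole bytes needed to hold that many bits
bytes_needed = math.ceil(bits_needed / 8)

print(2**16, 2**17, 2**18)        # 65536 131072 262144
print(bits_needed, bytes_needed)  # 18 3
```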

Enter UTF-8, a brilliant solution to this problem. The UTF-8 encoding of an ASCII file is an ASCII file. Pure ASCII files don’t get any larger when interpreted as UTF-8 encoded Unicode. Because 128 = 2^7, a byte representing an ASCII character has one unused bit. UTF-8 uses this unused bit to signal that what follows is not ASCII. I wrote about the full details here.

Unicode characters outside the ASCII range take 2, 3, or 4 bytes to represent. Inserting a small number of non-ASCII characters into a UTF-8 encoded Unicode file hardly changes the file’s size.
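To see the byte counts concretely, here is a small sketch using Python's built-in UTF-8 codec. The four sample characters are my own choice, one for each possible length.

```python
# One sample character for each possible UTF-8 length
samples = [
    "a",           # U+0061, ASCII
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001d538",  # U+1D538, double-struck capital A
]

lengths = [len(ch.encode("utf-8")) for ch in samples]
for ch, n in zip(samples, lengths):
    print(f"U+{ord(ch):04X} takes {n} byte(s)")

# Pure ASCII text has identical bytes under ASCII and UTF-8
ascii_text = "plain ASCII text"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)  # True
```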

Troubleshooting

I mentioned at the top that I had a gigabyte file with a dozen non-ASCII characters. The command file -I reported the file encoding to be ASCII, because the vast majority of the file was ASCII. But the non-ASCII characters were not valid Unicode characters either.

These invalid Unicode characters would display as �, which is not actually in the file. The � is a valid Unicode character for representing an invalid Unicode character.

Some of the non-ASCII characters were extended ASCII (Windows 1252) characters, but if I remember correctly even that didn’t account for everything. Some of the odd characters were simply file corruption.

It’s kinda interesting how some tools are robust to these kinds of glitches and some are not. My first clue that something funny was going on was when sort refused to sort. I ran a Python script that helps me fix wonky text files and it threw an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 222662: invalid start byte

This may seem like gibberish, but it actually says exactly what’s going on. There was an error interpreting the file as Unicode, because 0x92 is not a valid way to start a non-ASCII character in UTF-8.

The first bit of an ASCII character is 0. The first two bits of a non-ASCII character in UTF-8 are 11. But 9 is 1001 in binary, i.e. it starts with 10, and so the byte 0x92 is neither an ASCII character nor the beginning of a UTF-8 non-ASCII sequence of bytes. More details here.
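The leading-bit rule can be written down directly. The sketch below is only an illustration: the function name is made up, and it looks at leading bits alone, so it ignores the few lead-byte values that UTF-8 forbids for other reasons.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify a single byte by its leading bits (hypothetical helper)."""
    if b >> 7 == 0b0:
        return "ASCII"
    if b >> 6 == 0b10:
        return "continuation"
    if b >> 5 == 0b110:
        return "start of 2-byte sequence"
    if b >> 4 == 0b1110:
        return "start of 3-byte sequence"
    if b >> 3 == 0b11110:
        return "start of 4-byte sequence"
    return "invalid"

print(f"{0x92:08b}", classify_utf8_byte(0x92))  # 10010010 continuation

# Python's decoder reports the same problem for a stray 0x92
try:
    b"it\x92s".decode("utf-8")
except UnicodeDecodeError as err:
    reason = err.reason
    print(err.start, reason)  # 2 invalid start byte
```

A byte beginning with 10 is only legal after a start byte, which is why the decoder complains about an invalid start byte when it meets 0x92 on its own.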

Removing non-ASCII characters

For my application I could just remove the invalid characters using iconv with the -c option.

iconv -c -f CP1252 -t UTF-8 inputfile > outputfile

If you need to salvage troublesome characters then things are a little more complicated. The iconv utility will work if you know what the intended encoding was. If you don’t know the intended encoding, you may need to do some detective work.
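One way to start the detective work is to list the offending bytes and see what they would be under a candidate encoding. This is a sketch, not the script mentioned above: the function name and sample bytes are invented, and for a real file you would read the data in binary mode instead.

```python
def non_ascii_bytes(data: bytes):
    """Return (offset, byte) pairs for every byte outside the ASCII range."""
    return [(i, b) for i, b in enumerate(data) if b >= 0x80]

# Hypothetical sample standing in for a file's contents
sample = b"It\x92s almost ASCII \x97 but not quite.\x81"

hits = non_ascii_bytes(sample)
for offset, b in hits:
    # Try the byte as Windows-1252; a few byte values are undefined there
    try:
        guess = bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        guess = "(undefined in Windows-1252)"
    print(offset, hex(b), guess)
```

Bytes that have no meaning in the candidate encoding, like 0x81 here, are the ones most likely to be corruption instead of text.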

Why “a caret, euro, trademark” â€™ in a file?

A with caret, euro, trademark

Why might you see â€™ in the middle of an otherwise intelligible file? The reason is very similar to the reason you might see �, which I explained in the previous post. You might want to read that post first if you’re not familiar with Unicode and character encodings.

It all has to do with an encoding error, probably. Not necessarily, since, for example, I deliberately put â€™ in the opening sentence. But assuming it is an error, it’s likely an encoding error.

But it’s the opposite of the � error. The � occurs when non-UTF-8 text has been declared as (or implicitly interpreted as) Unicode. In particular, you can run into this error if text encoded in ISO 8859-1 is interpreted as UTF-8.

The â€™ sequence is usually the opposite: UTF-8 encoded text is being interpreted as Windows-1252 (a.k.a. CP-1252) encoded text. In particular, a right single quote (U+2019) encoded in UTF-8 has been interpreted as the Windows-1252 text â€™.

Because Windows-1252 is a superset of ISO 8859-1, the error resulting in � could also be described as a Windows-1252 error. So a � means Windows-1252 text has been interpreted as UTF-8, and â€™ means UTF-8 has been interpreted as Windows-1252. In the former case there is an invalid character. In the latter case all the characters are valid, though they’re not the characters you were supposed to see.
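Both directions of the error are easy to reproduce. Here is a minimal sketch using Python's built-in codecs, where `cp1252` is Python's name for Windows-1252 and the variable names are my own.

```python
quote = "\u2019"  # RIGHT SINGLE QUOTATION MARK

# UTF-8 bytes misread as Windows-1252: three valid but wrong characters
mojibake = quote.encode("utf-8").decode("cp1252")
print([f"U+{ord(c):04X}" for c in mojibake])  # ['U+00E2', 'U+20AC', 'U+2122']

# Windows-1252 byte misread as UTF-8: one invalid byte, shown as U+FFFD
replaced = quote.encode("cp1252").decode("utf-8", errors="replace")
print(f"U+{ord(replaced):04X}")  # U+FFFD

# The first error is reversible as long as no bytes were lost
repaired = mojibake.encode("cp1252").decode("utf-8")
print(repaired == quote)  # True
```

The second error is not reversible in the same way, since the replacement character no longer records which byte it replaced.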

You can fix the error by making your content and your encoding match. Or remove the offending character, replacing the single quote with the HTML entity &rsquo;.

You can find more details in this Stack Overflow post.

A valid character to represent an invalid character

U+FFFD REPLACEMENT CHARACTER

You may have seen a web page with the symbol � scattered throughout the text, especially in older web pages. What is this symbol and why does it appear unexpectedly?

The symbol we’re discussing is a bit of a paradox. It’s the (valid) Unicode character to represent an invalid Unicode character. If you just read the first sentence of this post, without inspecting the code point values, you can’t tell whether the symbol appears because I’ve made a mistake, or whether I’ve deliberately included the symbol.

The symbol in question is U+FFFD, named REPLACEMENT CHARACTER, a perfectly valid Unicode character. But unlike this post, you’re most likely to see it when the author did not intend for you to see it. What’s going on?

It all has to do with character encoding.

If all you want to do is represent Roman letters and common punctuation marks, and control characters, ASCII is fine. There are 128 ASCII characters, so they fit into a single 8-bit byte. But as soon as you want to write façade, jalapeño, or Gödel you have a problem. And of course you have a bigger problem if your language doesn’t use the Roman alphabet at all.

ASCII wastes one bit per byte, so naturally people wanted to take advantage of that extra bit to represent additional characters, such as the ç, ñ, and ö above. One popular way of doing this was described in the standard ISO 8859-1.

Of course there are other ways of encoding characters. If your language is Russian or Hebrew or Chinese, you’re no happier with ISO 8859-1 than you are with ASCII.

Enter Unicode. Let’s represent all the world’s alphabets (and ideograms and common symbols and …) in a single system. Great idea, though there are a mind-boggling number of details to work out. Even so, once you’ve assigned a number to every symbol that you care about, there’s still more work to do.

You could represent every character with two bytes. Now you can represent 65,536 characters. That’s too much and too little. If you want to represent text that is essentially made of Roman letters plus occasional exotic characters, using two bytes per letter makes the text 100% larger.

And while 65,536 sounds like a lot of characters, it’s not enough to represent every Chinese character, much less all the characters in other languages. Turns out we need four bytes to do what Unicode was intended to do.

So now we have to deal with encodings. UTF-8 is a brilliant solution to this problem. It can handle all Unicode characters, but if text is just ASCII, it won’t be any longer: ASCII is a valid UTF-8 encoding of the subset of Unicode that belonged to ASCII.

But there were other systems before Unicode, like ISO 8859-1, and if your file is encoded as ISO 8859-1, but a web browser thinks it’s UTF-8 encoded Unicode, some characters could be invalid. Browsers will use the character � as a replacement for invalid text that they could not otherwise display. That’s probably what’s going on when you see �.
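The same thing can be reproduced outside a browser. Here is a small sketch with Python's codecs, using the word façade from earlier; `errors="replace"` asks the decoder to substitute U+FFFD for bytes it cannot interpret, which is roughly what a browser does.

```python
text = "fa\u00e7ade"           # façade
raw = text.encode("latin-1")   # ISO 8859-1 bytes: the c-cedilla is the single byte 0xE7

# Misinterpret those bytes as UTF-8, replacing whatever is invalid
shown = raw.decode("utf-8", errors="replace")
print(shown)

# Interpreted with the encoding that was actually used, nothing is lost
print(raw.decode("latin-1") == text)  # True
```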

See the article on How UTF-8 works to understand why some characters are not just a different character than intended but illegal UTF-8 byte sequences.

Cross-platform way to enter Unicode characters

The previous post describes the hoops I jumped through to enter Unicode characters on a Mac. Here’s a script to run from the command line that will copy Unicode characters to the system clipboard. It runs anywhere the Python module pyperclip runs.

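    #!/usr/bin/env python3

    import sys
    import pyperclip

    # Hex code point given on the command line, e.g. 03c0
    cp = sys.argv[1]
    # Parse the hex digits directly rather than passing user input to eval
    ch = chr(int(cp, 16))
    print(ch)
    pyperclip.copy(ch)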

I called this script U so I could type

    U 03c0

at the command line, for example, to print π to the terminal and also copy it to the clipboard.

Unlike the MacOS solution in the previous post, this works for any Unicode value, including code points above FFFF.

On my Linux box I had to install xclip before pyperclip would work.

Double-struck capital letters

I’ve needed to use double-struck capital letters lately, also called blackboard bold. There are a few quirks in how they are represented in Unicode and in HTML entities, so I’m leaving some notes for myself here and for anyone else who might need to look this up.

Unicode

The double-struck capital letters are split into two blocks for historical reasons. The double-struck capital letters used most often in math — ℂ, ℍ, ℕ, ℙ, ℚ, ℝ, ℤ — are located in the U+21XX range, while the rest are in the U+1D5XX range.

Low characters

The letters in the U+21XX range were the first to be included in the Unicode standard. There’s no significance to the code points.

ℂ U+2102
ℍ U+210D 
ℕ U+2115
ℙ U+2119 
ℚ U+211A
ℝ U+211D
ℤ U+2124 

The names, however, are consistent. The official name of ℂ is

DOUBLE-STRUCK CAPITAL C

and the rest are analogous.

High characters

The code point for double-struck capital A is U+1D538 and the rest are in alphabetical order: the nth letter of the English alphabet has code point

0x1D537 + n.

However, the codes corresponding to the letters in the low range are missing. For example, U+1D53A, the code that would logically be the code for ℂ, is unused so as not to duplicate the codes in the low range.

The official names of these characters are

MATHEMATICAL DOUBLE-STRUCK CAPITAL C

and so forth. Note the “MATHEMATICAL” prefix that the letters in the low range do not have.

Incidentally, X is the 24th letter, so the new logo for Twitter has code point U+1D54F.

HTML Entities

All double-struck letters have HTML entities of the form &*opf; where * is the letter, whether capital or lower case. For example, &Aopf; is 𝔸 and &aopf; is 𝕒.

The letters with lower Unicode values also have semantic entities.

ℂ &Copf; &complexes;
ℍ &Hopf; &quaternions;
ℕ &Nopf; &naturals;
ℙ &Popf; &primes;
ℚ &Qopf; &rationals;
ℝ &Ropf; &reals;
ℤ &Zopf; &integers;
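If you want to check what an entity resolves to, Python's standard `html` module knows the HTML5 named character references. The entity names in this sketch come from the HTML5 list.

```python
import html
import unicodedata

entities = ["&Aopf;", "&aopf;", "&Copf;", "&complexes;"]
resolved = {e: html.unescape(e) for e in entities}

for entity, ch in resolved.items():
    print(entity, ch, f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The generic and semantic entities for ℂ resolve to the same code point, U+2102.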

LaTeX

The LaTeX command for any double-struck capital letter is \mathbb{}. This only applies to capital letters.

Python

Here’s Python code to illustrate the discussion of Unicode values above.

def doublestrike(ch):

    exceptions = {
        'C' : 0x2102,
        'H' : 0x210D,
        'N' : 0x2115,
        'P' : 0x2119,
        'Q' : 0x211A,
        'R' : 0x211D,
        'Z' : 0x2124
    }
    if ch in exceptions:
        codepoint = exceptions[ch]
    else:
        codepoint = 0x1D538 + ord(ch) - ord('A')
    print(chr(codepoint), f"U+{format(codepoint, 'X')}")

for n in range(ord('A'), ord('Z')+1):
    doublestrike(chr(n))

Russian transliteration hack

I mentioned in the previous post that I had been poking around in HTML entities and noticed symbols for Fourier transforms and such. I also noticed HTML entities for Cyrillic letters. These entities have the form

& + transliteration + cy;.

For example, the Cyrillic letter П is based on the Greek letter Π and its closest English counterpart is P, and its HTML entity is &Pcy;.

The Cyrillic letter Р has HTML entity &Rcy; and not &Pcy; because although it looks like an English P, it sounds more like an English R.

Just as a hack, I decided to write code to transliterate Russian text by converting letters to their HTML entities, then chopping off the initial & and the final cy;.

I don’t speak Russian, but according to Google Translate, the Russian translation of “Hello world” is “Привет, мир.”

Here’s my hello-world program for transliterating Russian.

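    from bs4.dammit import EntitySubstitution

    # Object that maps characters to named HTML entities where one exists
    escaper = EntitySubstitution()

    def transliterate(ch):
        # Drop the leading "&", then the trailing "cy;"
        entity = escaper.substitute_html(ch)[1:]
        return entity[:-3]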
    
    a = [transliterate(c) for c in "Привет, мир."]
    print(" ".join(a))

This prints

P r i v ie t m i r

Here’s what I get trying to transliterate Chebyshev’s native name Пафну́тий Льво́вич Чебышёв.

P a f n u t i j L soft v o v i ch CH ie b y sh io v

I put a space between letters because of possible outputs like “soft v” above.

This was just a fun hack. Here’s what I’d get if I used software intended to be used for transliteration.

    import unidecode

    for x in ["Привет, мир", "Пафну́тий Льво́вич Чебышёв"]:
        print(unidecode.unidecode(x))

This produces

Privet, mir
Pafnutii L’vovich Chebyshiov

The results are similar.

Symbols for transforms

I was looking through HTML entities and ran across &Fouriertrf;. I searched for all entities ending in trf; and also found &Mellintrf;, &Laplacetrf;, and &zeetrf;.

Apparently “trf” stands for “transform” and these symbols are intended to be used to represent the Fourier transform, Mellin transform, Laplace transform, and z-transform.

You would not know from the Unicode names that these symbols are intended to be used for transforms. For example, U+2131 has Unicode name SCRIPT CAPITAL F. But the HTML entity &Fouriertrf; suggests how the symbol could be used.

These are not the symbols I’d use for these transforms if I had access to LaTeX. I’d use {\cal F} for the Fourier transform, for example. But if I were writing about math and restricted to Unicode symbols, as I am on Twitter, I might use these symbols. I could imagine, for example, using ℒ for Laplace transform on @AnalysisFact.

Here’s a table with more on the four transform symbols that have HTML entities.

| ℒ | &Laplacetrf; | U+2112 | SCRIPT CAPITAL L       |
| ℨ | &zeetrf;     | U+2128 | BLACK-LETTER CAPITAL Z |
| ℱ | &Fouriertrf; | U+2131 | SCRIPT CAPITAL F       |
| ℳ | &Mellintrf;  | U+2133 | SCRIPT CAPITAL M       |
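A table like this can be regenerated from Python's standard library, which includes the HTML5 entity list and the Unicode character names. This is a sketch, with the four entity names taken from the HTML5 named character references.

```python
import html
import unicodedata

entities = ["&Laplacetrf;", "&zeetrf;", "&Fouriertrf;", "&Mellintrf;"]

rows = []
for entity in entities:
    ch = html.unescape(entity)  # resolve the entity to a character
    rows.append((ch, entity, f"U+{ord(ch):04X}", unicodedata.name(ch)))

for ch, entity, codepoint, name in rows:
    print(f"| {ch} | {entity:12} | {codepoint} | {name} |")
```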

How to memorize Unicode codepoints

At the end of each month I write a newsletter highlighting the most popular posts of that month. When I looked back at my traffic stats to write this month’s newsletter I noticed that a post I wrote last year about how to memorize the ASCII table continues to be popular. This post is a follow up, how to memorize Unicode values.

Memorizing all 128 ASCII values is doable. Memorizing all Unicode values would be insurmountable. There are nearly 150,000 Unicode characters at the moment, and the list grows over time. But knowing a few Unicode characters is handy. I often need to insert a π symbol, for example, and so I made an effort to remember its Unicode value, U+03C0.

There are convenient ways of inserting common non-ASCII characters without knowing their Unicode values, but these offer a limited range of characters and they work differently in different environments. Inserting Unicode values gives you access to more characters in more environments.

As with ASCII, you can memorize the Unicode value of a symbol by associating an image with a number and associating that image with the symbol. The most common way to associate an image with a number is the Major system. As with everything else, the Major system becomes easier with practice.

However, Unicode presents a couple of challenges. First, Unicode codepoints are nearly always written in hexadecimal, and so you’ll run into the letters A through F as well as digits. Second, Unicode codepoints are four hex digits long (or five outside the Basic Multilingual Plane). We’ll address both of these difficulties shortly.

It may not seem worthwhile to go to the effort of encoding and decoding numbers like this, but it scales well. Brute force is fine for small amounts of data and short-term memory, but image association works much better for large amounts of data and long-term memory.

Unicode is organized into blocks of related characters. For example, U+22xx are math symbols and U+26xx are miscellaneous symbols. If you know what block a symbol is in, you only need to remember the last two hex digits.

You can convert a pair of hex digits to decimal by changing bases. For example, you could convert the C0 in U+03C0 to 192. But this is a moderately difficult mental calculation.

An easier approach would be to leave hex digits alone that correspond to decimal digits, reduce hex digits A through F mod 10, and tack on an extra digit to disambiguate. Stick on a 0, 1, 2, or 3 according to whether no digits, the first digit, the second digit, or both digits had been reduced mod 10. See this page for details. With this system, C0 becomes 201. You could encode 201 as “nest” using the Major system, and imagine a π sitting in a nest, maybe something like the 3Blue1Brown plushie.

For another example, ♕ (U+2655) is the symbol for the white queen in chess. You might think of the White Queen from The Lion, the Witch, and the Wardrobe [2] and associate her with the hex number 0x55. If you convert 0x55 to decimal, you get 85, which you could associate with the Eiffel Tower using the Major system. So maybe imagine the White Queen driving her sleigh under the Eiffel Tower. If you convert 0x55 to 550 as suggested here, you might imagine her driving through a field of lilies.
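The reduce-mod-10 scheme is mechanical enough to write as a few lines of code. This is a sketch of the rule as described above; the function name is made up.

```python
def mnemonic_digits(hex_pair: str) -> str:
    """Encode two hex digits as three decimal digits for the Major system."""
    first, second = (int(d, 16) for d in hex_pair)
    # Flag: 0 = neither digit reduced, 1 = first only, 2 = second only, 3 = both
    flag = (1 if first >= 10 else 0) + (2 if second >= 10 else 0)
    return f"{first % 10}{second % 10}{flag}"

print(mnemonic_digits("C0"))  # 201
print(mnemonic_digits("55"))  # 550
```

The trailing flag digit is what keeps the encoding unambiguous: without it, C0 and 20 would both reduce to 20.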

Often Unicode characters are arranged consecutively in a logical sequence so you can compute the value of the rest of the sequence from knowing the value of the first element. Alphabets are arranged in alphabetical order (mostly [1]), symbols for Roman numerals are arranged in numerical order, symbols for chess pieces are arranged in an order that would make sense to chess players, etc.
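The chess pieces make a good illustration. Here is a sketch that starts from the white king and walks the next eleven code points, printing the official names from Python's `unicodedata` module.

```python
import unicodedata

WHITE_KING = 0x2654  # first code point in the run of chess symbols

names = []
for offset in range(12):
    ch = chr(WHITE_KING + offset)
    names.append(unicodedata.name(ch))
    print(f"U+{ord(ch):04X} {ch} {names[-1]}")
```

So remembering one anchor, such as U+2654, is enough to recover the whole set by counting.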

[1] There are a few exceptions such as Cyrillic Ё and a gap in Greek capital letters.

[2] She’s not really a queen, but she thinks of herself as a queen. See the book for details.

Symbols for angles

I was looking around in the Unicode block for miscellaneous symbols, U+2600, after I needed to look something up, and noticed there are four astrological symbols for angles: ⚹, ⚺, ⚻, and ⚼.

⚹ ⚺ ⚻ ⚼

These symbols are mysterious at first glance but all make sense in hindsight as I’ll explain below.

Sextile

The first symbol, ⚹, U+26B9, is self-explanatory.  It is made of six 60° angles and is called a sextile after the Latin word for six.

Semisextile

The second symbol, ⚺, U+26BA, is less obvious, though the name is obvious: semisextile is the top half of a sextile, so it represents an angle half as wide, 30°.

The symbol looks like ⊻, U+22BB, the logic symbol for XOR (exclusive or), but is unrelated.

Quincunx

The third symbol, ⚻, U+26BB, represents an angle of 150°, the supplementary angle of 30°. Turning the symbol for 30° upside down represents taking the supplementary angle.

The symbol looks like ⊼, U+22BC, the logic symbol for NAND (not and), but is unrelated.

I’ve run into the name quincunx before but not the symbol. Last fall I wrote a post about conformal mapping that mentions the “Peirce quincuncial projection” created by Charles Sanders Peirce using conformal mapping.

Charles Sanders Peirce's quincuncial projection

Because the projection was created using conformal mapping, the projection is angle-preserving.

The name of the projection comes from another use of the term quincunx, meaning the pattern of dots on the 5 side of a die.

Sesquiquadrate

The final symbol, ⚼, U+26BC, represents an angle of 135°. A little thought reveals the reason for the symbol and its name. The symbol is a square and half a square, representing a right angle plus half a right angle. The Latin prefix sesqui- means one and a half. For example, a sesquicentennial is a 150th anniversary.

Unicode arrows: math versus emoji

I used the character ↔︎︎ (U+2194) in a blog post recently and once again got bit by the giant pawn problem. That’s my name for when a character intended to be rendered as text is surprisingly rendered as an emoji. I saw

f (emoji <->) f dx dy dz

when what I intended was

f <-> f dx dy dz

I ran into the same problem a while back in my post on modal logic and security. The page contained

when I intended

This example is more surprising because → (U+2192) is rendered as text while ↔︎ (U+2194) is rendered as an image. This seems capricious. Why are some symbols rendered as text while other closely-related symbols are not? The reason is that some symbols are interpreted as emoji and some are not. You can find a full list of text characters that might be hijacked as emoji here.

One way to fix this problem is to use the Unicode “variation selector” U+FE0E to tell browsers that you’d like your character interpreted as text. You can also use the variation selector U+FE0F if you really do want a character displayed as an emoji. See this post for details.
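In code, a variation selector is just one more code point appended after the character it modifies. A minimal sketch follows, with constant names of my own; how the result renders depends entirely on the font and the application displaying it.

```python
import unicodedata

TEXT_STYLE = "\ufe0e"   # VARIATION SELECTOR-15, requests text presentation
EMOJI_STYLE = "\ufe0f"  # VARIATION SELECTOR-16, requests emoji presentation

arrow = "\u2194"  # LEFT RIGHT ARROW

as_text = arrow + TEXT_STYLE
as_emoji = arrow + EMOJI_STYLE

print(as_text, as_emoji)
for c in as_text:
    print(f"U+{ord(c):04X}", unicodedata.name(c))
```

Each string is two code points long even though it displays as a single symbol.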

(I use U+FE0E several times in this post. Hopefully your browser or RSS reader is rendering it correctly. If it isn’t, please let me know.)

Update: This post looks terrible in the RSS reader on my phone because the reader ignores the variation selectors. See the comments for a screenshot. It renders as intended in my browser, desktop or mobile.

It’s not always convenient or even possible to use variation selectors, and in that case you can simply use different symbols.

Here are four alternatives to ↔︎ (U+2194):

  • ⟷ (U+27F7)
  • ⇔ (U+21D4)
  • ⟺ (U+27FA)
  • ⇆ (U+21C6)

There is no risk of these characters being interpreted as emoji, and their intent may be clearer. In the two examples above, I used a double-headed arrow to mean two different things.

In the first example I wanted to convey that the Hodge operator takes the left side to the right and the right side to the left. Maybe ⇆ (U+21C6) would have been a better choice. In the second example, maybe it would have been clearer if I had used ⇒ (U+21D2) and ⇔ (U+21D4) rather than → (U+2192) and ↔︎ (U+2194).

Another math symbol that is unfortunately turned into an emoji is ↪︎ (U+21AA). In TeX this is \hookrightarrow. I used this symbol quite a bit in grad school to indicate that one space embeds in another. I don’t know of a good alternative to this one. You could use ⤷ (U+2937), though this looks kinda odd.