# Accessing characters by name

You can sometimes make code more readable by using names for characters rather than the characters themselves or their code points. Python and Perl both, for example, let you refer to a character by using its standard Unicode name inside \N{}.

For instance, \N{SNOWMAN} refers to Unicode character U+2603, the snowman ☃. The glyph itself can be hard to read [1], and not many people would see \u2603 and immediately think “Ah yes, U+2603, the snowman.”

A few days ago I wrote about how to get one-liners to work on Windows and Linux. Shells and programming languages have different ways of quoting and escaping special characters, and sometimes these ways interfere with each other.

I said that one way to get around problems with literal quotes inside a quoted string is to use character codes for quotes. This may be overkill, but it works. For example,

    perl -e 'print qq{\x27hello\x27\n}'

and

    python -c "print('\x27hello\x27\n')"

both print 'hello', including the single quotes.

One problem with this is that you may not remember that U+0027 is a single quote. And even if you have that code point memorized [2], someone else reading your code might not.

The official Unicode name for a single quote is APOSTROPHE. So the Python one-liner above could be written

    python -c "print('\N{APOSTROPHE}hello\N{APOSTROPHE}\n')"

This is kinda artificial in a one-liner because such tiny programs optimize for brevity rather than readability. But in an ordinary program rather than on the command line, using character names could make code easier to read.

So how do you find out the name of a Unicode character? The names are standard, independent of any programming language, so you can look them up in any Unicode reference.

A programming language that lets you use Unicode names probably also has a way to let you look up Unicode names. For example, in Python you can use unicodedata.name.

    >>> from unicodedata import name
    >>> name('π')
    'GREEK SMALL LETTER PI'
    >>> name("\u05d0") # א
    'HEBREW LETTER ALEF'
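
Going the other direction, Python's unicodedata module also has a lookup function that maps a standard name back to the character. A quick interactive example:

    >>> from unicodedata import lookup
    >>> lookup('GREEK SMALL LETTER PI')
    'π'
    >>> lookup('APOSTROPHE')
    "'"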


In Perl you could write

    use charnames q{ :full };
    print charnames::viacode(0x22b4); # ⊴


which prints “NORMAL SUBGROUP OF OR EQUAL TO”, illustrating that Unicode names can be quite long.


[1] How this renders varies greatly from platform to platform.

[2] Who memorizes Unicode code points?! Well, I’ve memorized a few landmarks. For example, I memorized where the first letters of the Latin, Greek, and Hebrew alphabets are, so in a pinch I can figure out the rest of the letters.

# Python one-liner to print Roman numerals

Here’s an amusing little Python program to convert integers to Roman numerals:

    def roman(n): print(chr(0x215F + n))

It only works if n is between 1 and 12. That’s because Unicode contains characters for the Roman numerals I through XII.

Here are the characters it produces:

Ⅰ Ⅱ Ⅲ Ⅳ Ⅴ Ⅵ Ⅶ Ⅷ Ⅸ Ⅹ Ⅺ Ⅻ

You may not be able to see these, depending on what fonts you have installed. I can see them in my browser, but when I ran the code above in a terminal window I only saw a missing glyph placeholder.
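
If the characters don't render on your system, you can at least check the code points. Here's a short sketch that prints each value of n alongside its code point and character:

    for n in range(1, 13):
        ch = chr(0x215F + n)
        print(f"{n:2d}  U+{ord(ch):04X}  {ch}")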

If you can be sure your reader can see the characters, say in print rather than on the web, the single-character Roman numerals look nice, and they make it clear that they’re to be interpreted as Roman numerals.

Here’s a screenshot of the symbols.

# Alphabets and Unicode

ASCII codes may seem arbitrary when you’re looking at decimal values, but they make more sense in hex [1]. For example, the ASCII value for 0 is 48. Why isn’t it zero, or at least a number that ends in zero? Well it is, in hex: 0x30. And the codes are in consecutive order, so the ASCII value of a digit d is d + 0x30.
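
You can check this at a Python prompt:

    >>> hex(ord("0")), hex(ord("9"))
    ('0x30', '0x39')
    >>> ord("7") - 0x30
    7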

There are also patterns in ASCII codes for letters, and this post focuses on these patterns and their analogies in the Unicode values assigned to other alphabets.
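
One such pattern, as a quick illustration (my own example, not taken from the post): corresponding upper- and lower-case Latin letters differ by 0x20, and other alphabets, such as Greek, also occupy consecutive code points the way ASCII letters do.

    >>> hex(ord("a") - ord("A"))
    '0x20'
    >>> chr(ord("α") + 1)   # Greek letters are consecutive, like ASCII letters
    'β'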

# Searching Greek and Hebrew with regular expressions

According to the Python Cookbook, “Mixing Unicode and regular expressions is often a good way to make your head explode.” It is thus with fear and trembling that I dip my toe into using Unicode with Greek and Hebrew.

I heard recently that there are anomalies in the Hebrew Bible where the final form of a letter is deliberately used in the middle of a word. That made me think about searching for such anomalies with regular expressions. I’ll come back to that shortly, but I’ll start by looking at Greek where things are a little simpler.
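
As a rough sketch of the idea (my own illustration, not code from the post): the five Hebrew letters with final forms are ך, ם, ן, ף, and ץ, so a final form followed immediately by another Hebrew letter would be such an anomaly.

    import re

    # Final forms: kaf, mem, nun, pe, tsadi (U+05DA, U+05DD, U+05DF, U+05E3, U+05E5).
    # Flag a final form that is followed by another Hebrew letter (U+05D0 through U+05EA),
    # i.e. a final form that is not actually at the end of a word.
    anomaly = re.compile("[\u05DA\u05DD\u05DF\u05E3\u05E5](?=[\u05D0-\u05EA])")

    print(anomaly.findall("שלום עולם"))   # [] -- the final mems here are at word ends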

# Looking at the bits of a Unicode (UTF-8) text file

Suppose you type a little text into a text file, say “123”. If you open this file in a hex editor you’ll see

    3132 33

because the ASCII value for the character ‘1’ is 0x31 in hex, ‘2’ corresponds to 0x32, and ‘3’ corresponds to 0x33. If your file is saved as UTF-8 rather than ASCII, it makes no difference: by design, UTF-8 is backward compatible with the first 128 ASCII characters.

Next, let’s add some Greek letters. Now our file contains “123 αβγ”. The lower-case Greek alphabet starts at 0x03B1, so these three characters are 0x03B1, 0x03B2, and 0x03B3. Now let’s look at the file in our hex editor.

    3132 3320 CEB1 CEB2 CEB3
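
You can reproduce this dump without a hex editor; in Python (3.8 or later, which added the separator argument to the hex method):

    >>> "123 αβγ".encode("utf-8").hex(" ")
    '31 32 33 20 ce b1 ce b2 ce b3'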

The B1, B2, and B3 look familiar, but why do they have “CE” in front rather than “03”? This has to do with the details of UTF-8 encoding. If we look at the same file saved with UTF-16 encoding, which represents each character with 16 bits, the results look more familiar.

    FEFF 0031 0032 0033 0020 03B1 03B2 03B3

So our ASCII characters—1, 2, 3, and space—are padded with a couple zeros, and we see the Unicode values of our Greek letters as we expect. But what’s the FEFF at the beginning? That’s a byte order mark (BOM) that my text editor inserted. This is an invisible marker saying that the bytes are stored in big-endian mode.
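
Here's one way to reproduce that dump in Python, writing the big-endian byte order mark explicitly (the built-in "utf-16" codec would pick an endianness and a BOM for you):

    >>> import codecs
    >>> (codecs.BOM_UTF16_BE + "123 αβγ".encode("utf-16-be")).hex(" ")
    'fe ff 00 31 00 32 00 33 00 20 03 b1 03 b2 03 b3'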

Going back to UTF-8, the ASCII characters are more compact, i.e. no zero padding, but why do the Greek letters start with “CE”?

    3132 3320 CEB1 CEB2 CEB3

As explained in more detail in the post on how UTF-8 works below, UTF-8 is a clever way to save space when representing mostly ASCII text. Since ASCII bytes start with 0, a byte starting with 1 signals that something special is happening and that the following bytes are to be interpreted differently.

In binary, 0xCE expands to

    11001110

I’ll separate the bits with spaces to make them easier to talk about.

    1 1 0 01110

The first 1 says that this byte does not simply represent a single character but is part of the encoding of a sequence of bytes encoding a character. The leading 1 and the first 0 are bookends. The number of 1s in between says how many of the following bytes are part of this character. The bits after the first 0 are the first bits of the character itself, and the rest follow in the next byte.

The continuation bytes begin with 10, and the remaining six bits are part of the character. You know they’re not the start of a new character because there are no 1s between the first 1 and the first 0. With UTF-8, you can look at a byte in isolation and know whether it is an ASCII character, the beginning of a non-ASCII character, or the continuation of a non-ASCII character.

So now let’s look at 0xCEB1, with spaces added to separate the pieces.

    1 1 0 01110 10 110001

The remaining bits, 01110 and 110001, are the bits of our character; strung together they form 01110110001, i.e. the binary number 1110110001, which is 0x03B1 in hex. So we get the Unicode value for α. Similarly the rest of the bytes encode β and γ.
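
You can do this unpacking with a couple of bit operations; a minimal sketch in Python:

    >>> lead, cont = 0xCE, 0xB1
    >>> codepoint = ((lead & 0b00011111) << 6) | (cont & 0b00111111)
    >>> hex(codepoint), chr(codepoint)
    ('0x3b1', 'α')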

It was a coincidence that the last two hex characters of our Greek letters were recognizable in the hex dump of the UTF-8 encoding. We’ll always see the last hex character of the Unicode value in the hex dump, but not always the last two.

For another example, let’s look at a higher Unicode value, U+FB31. This is בּ, the Hebrew letter bet with a dot in the middle. This shows up in a hex editor as

    EFAC B1

or in binary as

    111011111010110010110001

Let’s break this up as before.

    1 11 0 1111 10 101100 10 110001

The first bit is a 1, so we know we have some decoding to do. There are two 1s between the leading 1 and the first 0. This says that the bits of our character are stored in the remainder of the first byte and in the following two bytes.

So the bits of our character are

    1111101100110001

which in hex is 0xFB31, the Unicode value of our character.
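
The same bit arithmetic works here, now keeping four bits from the lead byte and six from each continuation byte:

    >>> b1, b2, b3 = 0xEF, 0xAC, 0xB1
    >>> codepoint = ((b1 & 0b00001111) << 12) | ((b2 & 0b00111111) << 6) | (b3 & 0b00111111)
    >>> hex(codepoint), chr(codepoint)
    ('0xfb31', 'בּ')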

# Entering symbols in Emacs

Emacs has a relatively convenient way to add accents to letters or to insert a Unicode character if you know the code point for the value. See these notes.

But usually you don’t know the Unicode values of symbols. Then what do you do?

## TeX commands

You can enter symbols by typing their corresponding TeX commands after switching the input method with

    M-x set-input-method RET tex

After doing that, you could, for example, enter π by typing \pi.

You’ll see the backslash as you type the command, but once you finish you’ll see the symbol instead [1].

## HTML entities

You may know the HTML entity for a symbol and want to use that to enter characters in Emacs. Unfortunately, the following does NOT work.

    M-x set-input-method RET html

However, there is a slight variation on this that DOES work:

    M-x set-input-method RET sgml

Once you’ve set your input method to sgml, you could, for example, type &radic; to insert a √ symbol.

Why SGML rather than HTML?

HTML was created by simplifying SGML (Standard Generalized Markup Language). Emacs is older than HTML, and so maybe Emacs supported SGML before HTML was written.

There may be some useful SGML entities that are not in HTML, though I don’t know. I imagine these days hardly anyone knows anything about SGML beyond the subset that lives on in HTML and XML.

## Changing input modes

If you want to move between your default input mode and TeX mode, you can use the command toggle-input-method. This is usually mapped to C-\.

You can see a list of all available input methods with list-input-methods. Most of these are spoken languages, such as Arabic or Welsh, rather than technical input modes like TeX and SGML.


[1] I suppose there could be a problem if one command were a prefix of another. That is, if there were symbols \foo and \foobar and you intended to insert the latter, Emacs might think you’re done after you’ve typed the former. But I can’t think of a case where that would happen. TeX commands are nearly prefix codes. There are TeX commands like \tan and \tanh, but these don’t represent symbols per se. Emacs doesn’t need any help to insert the letters “tan” or “tanh” into a file.

# Drawing with Unicode block characters

My previous post contained the image below.

The point of the post was that the numbers that came up in yet another post made the fractal image above when you wrote them in binary. Here’s the trick I used to make that image.

To make the pattern of 0’s and 1’s easier to see, I wanted to make a graphic with black and white squares instead of the characters 0 and 1. I thought about writing a program to create an image using strings of 0’s and 1’s as input, but then I thought of something more direct.

The Unicode character U+2588 is just a solid rectangle. I replaced the 1’s with U+2588 and replaced the 0’s with spaces. So I made a text file that looked like the image above, and took a screen shot of it. The only problem was that the image was rectangular. I thought the pattern would be easier to see if it were a square, so I changed the aspect ratio on the image.
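
The original post doesn't show code for this step; here's a minimal sketch of the substitution in Python, with made-up rows of bits standing in for the real data:

    rows = ["0110", "1001", "0110"]          # stand-in for the actual binary digits
    for row in rows:
        print(row.replace("1", "\u2588").replace("0", " "))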

Here’s another example using Unicode block elements, this time to make a little bar graph, sorta like Edward Tufte’s idea of a sparkline. The characters U+2581 through U+2588 produce bars of increasing height.

To create images like this, your font needs to include glyphs for Unicode block elements. The font needs to be monospaced as well so the letters will line up under the blocks. The example above was created using the APL385 font. It also looks good in Hack and Input Mono. In some fonts the block characters don’t align consistently on the baseline.

Here’s the Python code that produced the above graph of English letter frequencies.

    # Letter frequencies via
    # http://www.norvig.com/mayzner.html
    freq = [
        8.04, 1.48, 3.34, 3.82, 12.49,
        2.40, 1.87, 5.05, 7.57,  0.16,
        0.54, 4.07, 2.51, 7.23,  7.64,
        2.14, 0.12, 6.28, 6.51,  9.28,
        2.73, 1.05, 1.68, 0.23,  1.66,
        0.09,
    ]
    m = max(freq)

    # Scale each frequency to a bar height between 0 and 8 and
    # map heights 1 through 8 to U+2581 through U+2588.
    for i in range(26):
        u = int(8*freq[i]/m)
        ch = chr(0x2580 + u) if u > 0 else " "
        print(ch, end="")

    # Print the letters A through Z under the bars.
    print()
    for i in range(26):
        print(chr(0x41 + i), end="")


# Typesetting zodiac symbols in LaTeX

Typesetting zodiac symbols in LaTeX is admittedly an unusual thing to do. LaTeX is mostly used for scientific publication, and zodiac symbols are commonly associated with astrology. But occasionally zodiac symbols are used in more respectable contexts.

The wasysym package for LaTeX includes miscellaneous symbols, including zodiac symbols. Here are the symbols, their LaTeX commands, and their corresponding Unicode code points.

The only surprise here is that the command for Capricorn is based on the Latin form of the name: \capricornus.

Each zodiac sign is used to denote a 30° region of the sky. Since the Unicode symbols are consecutive, you can compute the code point of a symbol from the longitude angle θ in degrees:

    9800 + ⌊θ/30⌋

Here 9800 is the decimal form of 0x2648, and the half brackets are the floor symbol, i.e. round down to the nearest integer.
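
In Python, for instance, that computation is one line (a sketch; zodiac_symbol is just a name I made up):

    def zodiac_symbol(theta):
        """Return the zodiac symbol for ecliptic longitude theta, 0 <= theta < 360 degrees."""
        return chr(0x2648 + int(theta // 30))

    print(zodiac_symbol(0), zodiac_symbol(45), zodiac_symbol(359))   # ♈ ♉ ♓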

Here’s the LaTeX code that produced the table.

    \documentclass{article}
    \usepackage{wasysym}
    \begin{document}

    \begin{table}
    \begin{tabular}{lll}
    \aries       & \verb|\aries       | & U+2648 \\
    \taurus      & \verb|\taurus      | & U+2649 \\
    \gemini      & \verb|\gemini      | & U+264A \\
    \cancer      & \verb|\cancer      | & U+264B \\
    \leo         & \verb|\leo         | & U+264C \\
    \virgo       & \verb|\virgo       | & U+264D \\
    \libra       & \verb|\libra       | & U+264E \\
    \scorpio     & \verb|\scorpio     | & U+264F \\
    \sagittarius & \verb|\sagittarius | & U+2650 \\
    \capricornus & \verb|\capricornus | & U+2651 \\
    \aquarius    & \verb|\aquarius    | & U+2652 \\
    \pisces      & \verb|\pisces      | & U+2653 \\
    \end{tabular}
    \end{table}
    \end{document}


By the way, you can use the Unicode values in HTML by replacing U+ with &#x and adding a semicolon on the end. For example, &#x2648; produces the Aries symbol ♈.

# How UTF-8 works

UTF-8 is a clever way of encoding Unicode text. I’ve mentioned it a couple times lately, but I haven’t blogged about UTF-8 per se. Here goes.

## The problem UTF-8 solves

US keyboards can often produce 101 symbols, which suggests 101 symbols would be enough for most English text. Seven bits would be enough to encode these symbols since 2⁷ = 128, and that’s what ASCII does. It represents each character with 8 bits since computers work with bits in groups of sizes that are powers of 2, but the first bit is always 0 because it’s not needed. Extended ASCII uses the leftover space in ASCII to encode more characters.

A total of 256 characters might serve some users well, but it wouldn’t begin to let you represent, for example, Chinese. Unicode initially wanted to use two bytes instead of one byte to represent characters, which would allow for 2¹⁶ = 65,536 possibilities, enough to capture a lot of the world’s writing systems. But not all, and so Unicode expanded to four bytes.

If you were to store English text using two bytes for every letter, half the space would be wasted storing zeros. And if you used four bytes per letter, three quarters of the space would be wasted. Without some kind of encoding, every file containing English text would be two or four times larger than necessary. And not just English, but every language that can be represented with ASCII.

UTF-8 is a way of encoding Unicode so that an ASCII text file encodes to itself. No wasted space, beyond the initial bit of every byte ASCII doesn’t use. And if your file is mostly ASCII text with a few non-ASCII characters sprinkled in, the non-ASCII characters just make your file a little longer. You don’t have to suddenly make every character take up twice or four times as much space just because you want to use, say, a Euro sign € (U+20AC).

## How UTF-8 does it

Since the first bit of ASCII characters is set to zero, bytes with the first bit set to 1 are unused and can be used specially.

When software reading UTF-8 comes across a byte starting with 1, it counts how many 1’s follow before encountering a 0. For example, in a byte of the form 110xxxxx, there’s a single 1 following the initial 1. Let n be the number of 1’s between the initial 1 and the first 0. The remaining bits in this byte and some bits in the next n bytes will represent a Unicode character. There’s no need for n to be bigger than 3 for reasons we’ll get to later. That is, it takes at most four bytes to represent a Unicode character using UTF-8.

So a byte of the form 110xxxxx says the first five bits of a Unicode character are stored at the end of this byte, and the rest of the bits are coming in the next byte.

A byte of the form 1110xxxx contains four bits of a Unicode character and says that the rest of the bits are coming over the next two bytes.

A byte of the form 11110xxx contains three bits of a Unicode character and says that the rest of the bits are coming over the next three bytes.

Following the initial byte announcing the beginning of a character spread over multiple bytes, bits are stored in bytes of the form 10xxxxxx. Since the initial bytes of a multibyte sequence start with two 1 bits, there’s no ambiguity: a byte starting with 10 cannot mark the start of a new multibyte sequence. That is, UTF-8 is self-punctuating.

So multibyte sequences have one of the following forms.

    110xxxxx 10xxxxxx
    1110xxxx 10xxxxxx 10xxxxxx
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx


If you count the x’s in the bottom row, there are 21 of them. So this scheme can only represent numbers with up to 21 bits. Don’t we need 32 bits? It turns out we don’t.

Although a Unicode character is ostensibly a 32-bit number, it actually takes at most 21 bits to encode a Unicode character for reasons explained here. This is why n, the number of 1’s following the initial 1 at the beginning of a multibyte sequence, only needs to be 1, 2, or 3. The UTF-8 encoding scheme could be extended to allow n = 4, 5, or 6, but this is unnecessary.
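
Here's a small Python sketch of the byte forms above, written out by hand rather than by calling the built-in codec, with the built-in encoder used only as a check:

    def utf8_encode(cp):
        # Encode a single code point following the byte forms above.
        if cp < 0x80:                      # ASCII: one byte, leading 0
            return bytes([cp])
        if cp < 0x800:                     # two bytes: 110xxxxx 10xxxxxx
            return bytes([0b11000000 | (cp >> 6),
                          0b10000000 | (cp & 0b111111)])
        if cp < 0x10000:                   # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0b11100000 | (cp >> 12),
                          0b10000000 | ((cp >> 6) & 0b111111),
                          0b10000000 | (cp & 0b111111)])
        return bytes([0b11110000 | (cp >> 18),   # four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                      0b10000000 | ((cp >> 12) & 0b111111),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])

    print(utf8_encode(0x20AC).hex())       # e282ac -- the Euro sign
    print("€".encode("utf-8").hex())       # e282ac -- matches the built-in encoder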

## Efficiency

UTF-8 lets you take an ordinary ASCII file and consider it a Unicode file encoded with UTF-8. So UTF-8 is as efficient as ASCII in terms of space. But not in terms of time. If software knows that a file is in fact ASCII, it can take each byte at face value, not having to check whether it is the first byte of a multibyte sequence.

And while plain ASCII is legal UTF-8, extended ASCII is not. So extended ASCII characters would now take two bytes where they used to take one. My previous post was about the confusion that could result from software interpreting a UTF-8 encoded file as an extended ASCII file.

# Excel, R, and Unicode

I received some data as an Excel file recently. I cleaned things up a bit, exported the data to a CSV file, and read it into R. Then something strange happened.

Say the CSV file looked like this:

    foo,bar
    1,2
    3,4


I read the file into R with

    df <- read.csv("foobar.csv", header=TRUE)

and could access the second column as df$bar but could not access the first column as df$foo. What’s going on?

When I ran names(df) it showed me that the first column was named not foo but ï..foo. I opened the CSV file in a hex editor and saw this:

    efbb bf66 6f6f 2c62 6172 0d0a 312c 320d

The ASCII code for f is 0x66, o is 0x6f, etc. and so the file makes sense, starting with the fourth byte.

If you saw my post about Unicode the other day, you may have seen Daniel Lemire’s comment:

> There are various byte-order masks like EF BB BF for UTF-8 (unused).

Aha! The first three bytes of my data file are exactly the byte-order mask that Daniel mentioned. These bytes are intended to announce that the file should be read as UTF-8, a way of encoding Unicode that is equivalent to ASCII if the characters in the file are in the range of ASCII.

Now we can see where the funny characters in front of “foo” came from. Instead of interpreting EF BB BF as a byte-order mask, R interpreted the first byte 0xEF as U+00EF, “Latin Small Letter I with Diaeresis.” I don’t know how BB and BF became periods (U+002E). But if I dump the file to a Windows command prompt, I see the first line as

    ï»¿foo,bar

with the first three characters being the Unicode characters U+00EF, U+00BB, and U+00BF.
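
You can see the same misreading in Python: decode the UTF-8 byte order mark as Latin-1 and you get exactly those three characters.

    >>> import codecs
    >>> codecs.BOM_UTF8.hex()
    'efbbbf'
    >>> codecs.BOM_UTF8.decode("latin-1")
    'ï»¿'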

How to fix the encoding problem with R? The read.csv function has an optional encoding parameter. I tried setting this parameter to “utf-8” and “utf8”. Neither made any difference. I looked at the R documentation, and it seems I need to set it to “UTF-8”. When I did that, the name of the first column became X.U.FEFF.foo [1]. I don’t know what’s up with that, except FEFF is the byte order mark (BOM) I mentioned in my Unicode post.

Apparently my troubles started when I exported my Excel file as CSV UTF-8. I converted the UTF-8 file to ASCII using Notepad and everything worked. I also could have saved the file directly to ASCII. If you look at the list of Excel export options, you’ll first see CSV UTF-8 (that’s why I picked it), but if you go further down you’ll see an option that’s simply CSV, implicitly in ASCII.

Unicode is great when it works. This blog is Unicode encoded as UTF-8, as are most pages on the web. But then you run into weird things like the problem described in this post. Does the fault lie with Excel? With R? With me? I don’t know, but I do know that the problem goes away when I stick to ASCII.

***

[1] A couple people pointed out in the comments that you could use fileEncoding="UTF-8-BOM" to fix the problem. This works, though I didn’t see it in the documentation the first time. The read.csv function takes an encoding parameter that appears to be for this purpose, but is a decoy. You need the fileEncoding parameter. With enough persistence you’ll eventually find that "UTF-8-BOM" is a possible value for fileEncoding.