# Drawing with Unicode block characters

My previous post contained the image below.

The point of the post was that the numbers that came up in yet another post made the fractal image above when you wrote them in binary. Here’s the trick I used to make that image.

To make the pattern of 0’s and 1’s easier to see, I wanted to make a graphic with black and white squares instead of the characters 0 and 1. I thought about writing a program to create an image using strings of 0’s and 1’s as input, but then I thought of something more direct.

The Unicode character U+2588 is just a solid rectangle. I replaced the 1’s with U+2588 and replaced the 0’s with spaces. So I made a text file that looked like the image above, and took a screenshot of it. The only problem was that the image was rectangular. I thought the pattern would be easier to see if it were a square, so I changed the aspect ratio on the image.
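The substitution itself takes only a few lines of Python. This is a minimal sketch rather than the code used for the image; the helper name `bits_to_blocks` and the sample numbers are made up for illustration.

```python
# Hypothetical helper: turn rows of binary digits into block-character art.
FULL_BLOCK = "\u2588"  # solid rectangle

def bits_to_blocks(rows):
    """Replace each '1' with U+2588 and each '0' with a space."""
    table = str.maketrans({"1": FULL_BLOCK, "0": " "})
    return "\n".join(row.translate(table) for row in rows)

# Binary forms of a few arbitrary numbers, padded to the same width.
rows = [format(n, "08b") for n in (255, 170, 204, 240)]
print(bits_to_blocks(rows))
```

Paste the output into a monospaced editor and take a screenshot to get an image.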

Here’s another example using Unicode block elements, this time to make a little bar graph, sorta like Edward Tufte’s idea of a sparkline. The characters U+2581 through U+2588 produce bars of increasing height.

To create images like this, your font needs to include glyphs for Unicode block elements. The font needs to be monospaced as well so the letters will line up under the blocks. The example above was created using the APL385 font. It also looks good in Hack and Input Mono. In some fonts the block characters don’t align consistently on the baseline.

Here’s the Python code that produced the above graph of English letter frequencies.

    # Letter frequencies via
    # http://www.norvig.com/mayzner.html
    freq = [
        8.04, 1.48, 3.34, 3.82, 12.49,
        2.40, 1.87, 5.05, 7.57,  0.16,
        0.54, 4.07, 2.51, 7.23,  7.64,
        2.14, 0.12, 6.28, 6.51,  9.28,
        2.73, 1.05, 1.68, 0.23,  1.66,
        0.09,
    ]
    m = max(freq)

    # Top row: one block per letter, scaled so that the most
    # frequent letter gets the full-height block U+2588.
    for i in range(26):
        u = int(8*freq[i]/m)
        ch = chr(0x2580+u) if u > 0 else " "
        print(ch, end="")

    print()
    # Bottom row: the letters A through Z.
    for i in range(26):
        print(chr(0x41+i), end="")


# Typesetting zodiac symbols in LaTeX

Typesetting zodiac symbols in LaTeX is admittedly an unusual thing to do. LaTeX is mostly used for scientific publication, and zodiac symbols are commonly associated with astrology. But occasionally zodiac symbols are used in more respectable contexts.

The wasysym package for LaTeX includes miscellaneous symbols, including zodiac symbols. Here are the symbols, their LaTeX commands, and their corresponding Unicode code points.

The only surprise here is that the command for Capricorn is based on the Latin form of the name: \capricornus.

Each zodiac sign is used to denote a 30° region of the sky. Since the Unicode symbols are consecutive, you can compute the code point of a symbol from the longitude angle θ in degrees:

$$9800 + \left\lfloor \frac{\theta}{30} \right\rfloor$$

Here 9800 is the decimal form of 0x2648, and the half brackets are the floor symbol, i.e. round down to the nearest integer.
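As a quick check, here is a small Python sketch of that computation. The function name `zodiac_symbol` is made up, and wrapping angles outside 0° to 360° back into range is an assumption added for convenience.

```python
import math

def zodiac_symbol(theta):
    """Return the zodiac character for a longitude theta in degrees."""
    # floor(theta / 30) picks the 30-degree region; the % 12 wraps
    # angles outside [0, 360) back around (an added assumption).
    sign = math.floor(theta / 30) % 12
    return chr(0x2648 + sign)

for theta in (0, 45, 100, 359.9):
    print(theta, zodiac_symbol(theta))
```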

Here’s the LaTeX code that produced the table.

    \documentclass{article}
    \usepackage{wasysym}
    \begin{document}

    \begin{table}
    \begin{tabular}{lll}
    \aries       & \verb|\aries       | & U+2648 \\
    \taurus      & \verb|\taurus      | & U+2649 \\
    \gemini      & \verb|\gemini      | & U+264A \\
    \cancer      & \verb|\cancer      | & U+264B \\
    \leo         & \verb|\leo         | & U+264C \\
    \virgo       & \verb|\virgo       | & U+264D \\
    \libra       & \verb|\libra       | & U+264E \\
    \scorpio     & \verb|\scorpio     | & U+264F \\
    \sagittarius & \verb|\sagittarius | & U+2650 \\
    \capricornus & \verb|\capricornus | & U+2651 \\
    \aquarius    & \verb|\aquarius    | & U+2652 \\
    \pisces      & \verb|\pisces      | & U+2653 \\
    \end{tabular}
    \end{table}
    \end{document}


By the way, you can use the Unicode values in HTML by replacing U+ with &#x and adding a semicolon on the end.
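For example, U+2648 becomes `&#x2648;`. The rule is mechanical enough to sketch in Python; `html_reference` is a hypothetical helper, and the standard library’s `html.unescape` is used only to confirm that the result decodes to the intended character.

```python
import html

def html_reference(code_point):
    """Turn a string like 'U+2648' into the HTML reference '&#x2648;'."""
    return code_point.replace("U+", "&#x", 1) + ";"

ref = html_reference("U+2648")
print(ref, html.unescape(ref))
```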

# How UTF-8 works

UTF-8 is a clever way of encoding Unicode text. I’ve mentioned it a couple times lately, but I haven’t blogged about UTF-8 per se. Here goes.

## The problem UTF-8 solves

US keyboards can often produce 101 symbols, which suggests 101 symbols would be enough for most English text. Seven bits would be enough to encode these symbols since 2^7 = 128, and that’s what ASCII does. It represents each character with 8 bits since computers work with bits in groups of sizes that are powers of 2, but the first bit is always 0 because it’s not needed. Extended ASCII uses the leftover space in ASCII to encode more characters.

A total of 256 characters might serve some users well, but it wouldn’t begin to let you represent, for example, Chinese. Unicode initially wanted to use two bytes instead of one byte to represent characters, which would allow for 2^16 = 65,536 possibilities, enough to capture a lot of the world’s writing systems. But not all, and so Unicode expanded to four bytes.

If you were to store English text using two bytes for every letter, half the space would be wasted storing zeros. And if you used four bytes per letter, three quarters of the space would be wasted. Without some kind of encoding every file containing English text would be two or four times larger than necessary. And not just English, but every language that can be represented with ASCII.

UTF-8 is a way of encoding Unicode so that an ASCII text file encodes to itself. No wasted space, beyond the initial bit of every byte ASCII doesn’t use. And if your file is mostly ASCII text with a few non-ASCII characters sprinkled in, the non-ASCII characters just make your file a little longer. You don’t have to suddenly make every character take up twice or four times as much space just because you want to use, say, a Euro sign € (U+20AC).
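Python’s built-in `str.encode` makes both claims easy to check. The sample strings below are arbitrary.

```python
ascii_text = "plain ASCII text"
# An ASCII string encodes to exactly the same bytes under UTF-8.
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
print(same_bytes)

price = "Price: 10 €"
encoded = price.encode("utf-8")
# Only the Euro sign needs extra room: it takes three bytes.
print(len(price), "characters,", len(encoded), "bytes")
print("€".encode("utf-8").hex())
```

The Euro sign costs two extra bytes, and every other character still takes one byte.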

## How UTF-8 does it

Since the first bit of ASCII characters is set to zero, bytes with the first bit set to 1 are unused and can be used specially.

When software reading UTF-8 comes across a byte starting with 1, it counts how many 1’s follow before encountering a 0. For example, in a byte of the form 110xxxxx, there’s a single 1 following the initial 1. Let n be the number of 1’s between the initial 1 and the first 0. The remaining bits in this byte and some bits in the next n bytes will represent a Unicode character. There’s no need for n to be bigger than 3 for reasons we’ll get to later. That is, it takes at most four bytes to represent a Unicode character using UTF-8.

So a byte of the form 110xxxxx says the first five bits of a Unicode character are stored at the end of this byte, and the rest of the bits are coming in the next byte.

A byte of the form 1110xxxx contains four bits of a Unicode character and says that the rest of the bits are coming over the next two bytes.

A byte of the form 11110xxx contains three bits of a Unicode character and says that the rest of the bits are coming over the next three bytes.

Following the initial byte announcing the beginning of a character spread over multiple bytes, bits are stored in bytes of the form 10xxxxxx. Since the initial bytes of a multibyte sequence start with two 1 bits, there’s no ambiguity: a byte starting with 10 cannot mark the start of a new multibyte sequence. That is, UTF-8 is self-punctuating.

So multibyte sequences have one of the following forms.

    110xxxxx 10xxxxxx
    1110xxxx 10xxxxxx 10xxxxxx
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx


If you count the x’s in the bottom row, there are 21 of them. So this scheme can only represent numbers with up to 21 bits. Don’t we need 32 bits? It turns out we don’t.

Although a Unicode character is ostensibly a 32-bit number, it actually takes at most 21 bits to encode a Unicode character for reasons explained here. This is why n, the number of 1’s following the initial 1 at the beginning of a multibyte sequence, only needs to be 1, 2, or 3. The UTF-8 encoding scheme could be extended to allow n = 4, 5, or 6, but this is unnecessary.
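To make the bit patterns concrete, here is a hand-rolled encoder in Python, compared against the built-in codec. It is a sketch for illustration only: the name `utf8_encode` is made up, and it does no validation (it does not reject surrogates or values above 10FFFF, for instance).

```python
def utf8_encode(cp):
    """Encode one code point as UTF-8 bytes using the patterns above."""
    if cp < 0x80:          # 0xxxxxxx: plain ASCII, 7 bits
        return bytes([cp])
    if cp < 0x800:         # 110xxxxx 10xxxxxx: 11 bits
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # 1110xxxx 10xxxxxx 10xxxxxx: 16 bits
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 21 bits
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for ch in ("A", "é", "€", "\U0001F600"):
    mine = utf8_encode(ord(ch))
    # Show the code point, our bytes, and whether Python agrees.
    print(f"U+{ord(ch):04X}", mine.hex(" "), mine == ch.encode("utf-8"))
```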

## Efficiency

UTF-8 lets you take an ordinary ASCII file and consider it a Unicode file encoded with UTF-8. So UTF-8 is as efficient as ASCII in terms of space. But not in terms of time. If software knows that a file is in fact ASCII, it can take each byte at face value, not having to check whether it is the first byte of a multibyte sequence.

And while plain ASCII is legal UTF-8, extended ASCII is not. So extended ASCII characters would now take two bytes where they used to take one. My previous post was about the confusion that could result from software interpreting a UTF-8 encoded file as an extended ASCII file.

# Excel, R, and Unicode

I received some data as an Excel file recently. I cleaned things up a bit, exported the data to a CSV file, and read it into R. Then something strange happened.

Say the CSV file looked like this:

    foo,bar
    1,2
    3,4


I read the file into R with

    df <- read.csv("foobar.csv", header=TRUE)

and could access the second column as df$bar but could not access the first column as df$foo. What’s going on?

When I ran names(df) it showed me that the first column was named not foo but ï..foo. I opened the CSV file in a hex editor and saw this:

    efbb bf66 6f6f 2c62 6172 0d0a 312c 320d

The ASCII code for f is 0x66, o is 0x6f, etc. and so the file makes sense, starting with the fourth byte.

If you saw my post about Unicode the other day, you may have seen Daniel Lemire’s comment:

There are various byte-order masks like EF BB BF for UTF-8 (unused).

Aha! The first three bytes of my data file are exactly the byte-order mask that Daniel mentioned. These bytes are intended to announce that the file should be read as UTF-8, a way of encoding Unicode that is equivalent to ASCII if the characters in the file are in the range of ASCII.

Now we can see where the funny characters in front of “foo” came from. Instead of interpreting EF BB BF as a byte-order mask, R interpreted the first byte 0xEF as U+00EF, “Latin Small Letter I with Diaeresis.” I don’t know how BB and BF became periods (U+002E). But if I dump the file to a Windows command prompt, I see the first line as

    ï»¿foo,bar

with the first three characters being the Unicode characters U+00EF, U+00BB, and U+00BF.
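The same effect can be reproduced outside of R. The Python sketch below decodes the first ten bytes of the hex dump three ways. Latin-1 stands in here for whatever single-byte code page the command prompt used, which is an assumption; `utf-8-sig` is Python’s name for UTF-8 with an optional leading BOM.

```python
# The first ten bytes from the hex dump above.
raw = bytes.fromhex("efbbbf666f6f2c626172")

as_latin1 = raw.decode("latin-1")       # each byte read as one character
as_utf8 = raw.decode("utf-8")           # BOM survives as U+FEFF
as_utf8_sig = raw.decode("utf-8-sig")   # BOM recognized and dropped

print(as_latin1)
print(repr(as_utf8))
print(as_utf8_sig)
```

Only the last decoding gives a header that starts with a clean `foo`.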

How to fix the encoding problem with R? The read.csv function has an optional encoding parameter. I tried setting this parameter to “utf-8” and “utf8”. Neither made any difference. I looked at the R documentation, and it seems I need to set it to “UTF-8”. When I did that, the name of the first column became X.U.FEFF.foo [1]. I don’t know what’s up with that, except FEFF is the byte order mark (BOM) I mentioned in my Unicode post.

Apparently my troubles started when I exported my Excel file as CSV UTF-8. I converted the UTF-8 file to ASCII using Notepad and everything worked. I also could have saved the file directly to ASCII. If you look at the list of Excel export options, you’ll first see CSV UTF-8 (that’s why I picked it) but if you go further down you’ll see an option that’s simply CSV, implicitly in ASCII.

Unicode is great when it works. This blog is Unicode encoded as UTF-8, as are most pages on the web. But then you run into weird things like the problem described in this post. Does the fault lie with Excel? With R? With me? I don’t know, but I do know that the problem goes away when I stick to ASCII.

***

[1] A couple people pointed out in the comments that you could use fileEncoding="UTF-8-BOM" to fix the problem. This works, though I didn’t see it in the documentation the first time. The read.csv function takes an encoding parameter that appears to be for this purpose, but is a decoy. You need the fileEncoding parameter. With enough persistence you’ll eventually find that "UTF-8-BOM" is a possible value for fileEncoding.

# How many possible Unicode characters there are and why

## How many?

The previous post showed how the number of Unicode characters has grown over time.

You’ll notice there was a big jump between versions 3.0 and 3.1. That will be important later on.

Unicode started out relatively small then became much more ambitious. Are they going to run out of room? How many possible Unicode characters are there?

Short answer: There are 1,111,998 possible Unicode characters.

Longer answer: There are 17×2^16 – 2048 – 66 = 1,111,998 possible Unicode characters: seventeen 16-bit planes, with 2048 values reserved as surrogates, and 66 reserved as non-characters. More on this below.

## Which ones?

Going one level of detail deeper, which numbers correspond to Unicode characters?

The hexadecimal numbers 0 through 10FFFF are potential Unicode characters, with the exception of surrogates and non-characters.

Unicode is divided into 17 planes. The first two hexadecimal “digits” indicate the plane, and the last four indicate a value within the plane. The first plane is known as the BMP, the Basic Multilingual Plane. The rest are known as supplemental planes.

The surrogates are DC00 through DFFF and D800 through DBFF. The first range of 1024 surrogates are known as low surrogates, and the second range of 1024 as high surrogates.

The non-characters are FDD0 through FDEF and the last two values in each plane: FFFE, FFFF, 1FFFE, 1FFFF, 2FFFE, 2FFFF, …, 10FFFE, 10FFFF. This is one range of 32 non-characters, plus 34 coming from the end of each plane, for a total of 66.
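The count is easy to verify by listing the excluded values explicitly. This is a small Python sketch; the variable names are arbitrary.

```python
# Surrogates: D800 through DFFF (high and low ranges together).
surrogates = set(range(0xD800, 0xE000))

# Non-characters: FDD0 through FDEF, plus the last two values of each plane.
noncharacters = set(range(0xFDD0, 0xFDF0))
for plane in range(17):
    base = plane * 0x10000
    noncharacters.update({base + 0xFFFE, base + 0xFFFF})

total = 17 * 2**16 - len(surrogates) - len(noncharacters)
print(len(surrogates), len(noncharacters), total)
```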

## Why?

Why are there only 17 planes? And what are these mysterious surrogates and non-characters? What purpose do they serve?

The limitations of UTF-16 encoding explain why 17 planes and why surrogates. Non-characters require a different explanation.

### UTF-16

This post mentioned at the top that the size of Unicode jumped between versions 3.0 and 3.1. Significantly, the size went from less than 2^16 to more than 2^16. Unicode broke out of the Basic Multilingual Plane.

Unicode needed a way to represent more than 2^16 characters using groups of 16 bits. The solution to this problem was UTF-16 encoding. With this encoding, the surrogate values listed above do not represent characters per se but are a kind of pointer to further values.

Sixteen supplemental planes would take 20 bits to describe, 4 to indicate the plane and 16 for the values within the plane. The idea was to use a high surrogate to represent the first 10 bits and a low surrogate to represent the last 10 bits. The values DC00 through DFFF and D800 through DBFF were unassigned at the time, so they were picked for surrogates.

In a little more detail, a character in one of the supplemental planes is represented by a hexadecimal number between 1 0000 and 10 FFFF. If we subtract off 1 0000 we get a number between 0000 and FFFFF, a 20-bit number. Take the first 10 bits and add them to D800 to get a high surrogate value. Take the last 10 bits and add them to DC00 to get a low surrogate value. This pair of surrogate values represents the value in one of the supplemental planes.
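Here is that recipe as a short Python sketch, checked against Python’s own UTF-16 encoder. The function name `surrogate_pair` and the sample code point U+1F600 are choices made for illustration.

```python
def surrogate_pair(cp):
    """Split a supplemental-plane code point into (high, low) surrogates."""
    offset = cp - 0x10000              # a 20-bit number
    high = 0xD800 + (offset >> 10)     # first 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # last 10 bits
    return high, low

cp = 0x1F600
high, low = surrogate_pair(cp)
print(f"{high:04X} {low:04X}")

# Compare with the bytes produced by the built-in big-endian UTF-16 codec.
expected = chr(cp).encode("utf-16-be")
print(expected == high.to_bytes(2, "big") + low.to_bytes(2, "big"))
```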

When you encounter a surrogate value, you don’t need any further context to tell what it is. You don’t need to look upstream for some indication of how the bits are to be interpreted. It cannot be a BMP character, and there’s no doubt whether it is the beginning or end of a pair of surrogate values since the high surrogates and low surrogates are in different ranges.

UTF-16 can only represent 17 planes, and the Unicode Consortium decided they would not assign values that cannot be represented in UTF-16. So that’s why there are 17 planes.

### Non-characters

That leaves the non-characters. Why are a few values reserved to never be used for characters?

One use for non-characters is to return a null value as an error indicator, analogous to a NaN or not-a-number in floating point calculations. A program might return FFFF, for example, to indicate that it was unable to read a character.

Another use for special non-characters is to imply which encoding method is used. For reasons that are too complicated to get into here, computers do not always store the bytes within a word in increasing order. In so-called “little endian” order, lower order bytes are stored before higher order bytes. (“Big endian” and “little endian” are allusions to the two factions in Gulliver’s Travels that crack boiled eggs on their big end and little end respectively.)

The byte order mark FEFF is inserted at the beginning of a file or stream to imply byte ordering. If it is received in the order FEFF then the byte stream is inferred to be using the big endian convention. But if it is received in the order FFFE then little endian is inferred because FFFE cannot be a character.
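A reader of UTF-16 data could apply this rule as follows. This is only a sketch: `guess_utf16_byte_order` is a hypothetical helper, it assumes the data is UTF-16 at all, and it ignores the fact that a little-endian UTF-32 stream also begins with the bytes FF FE.

```python
import codecs

def guess_utf16_byte_order(data):
    """Infer byte order from a leading BOM, or return None if there is none."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little"
    return None

text = "\ufeffhi"  # a BOM followed by two letters
print(guess_utf16_byte_order(text.encode("utf-16-be")))
print(guess_utf16_byte_order(text.encode("utf-16-le")))
```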

The preceding paragraphs give a justification for at least two non-characters, FFFF and FFFE, but it’s not clear why 66 are reserved. There could be reasons why each plane would have its own FFFF and FFFE, which would account for 34 non-characters. I’m not clear on why FDD0 through FDEF are non-characters, though I vaguely remember there being some historical reason. In any case, people are free to use the non-characters however they see fit.

# Growth of Unicode over time

My previous post quoted Randall Munroe saying Unicode “started out just trying to unify a couple different character sets” and grew much more ambitious.

The first version of Unicode, published in 1991, had 7,191 characters. Now the latest version has 137,994 characters and so is about 19 times bigger. Here’s a plot of the number of characters in Unicode over time.

Here’s a slightly different plot where the horizontal axis is version number rather than time.

There’s plenty of room left in Unicode. The maximum number of possible Unicode characters is 1,111,998 for reasons I get into here.

# The hopeless task of the Unicode Consortium

Randall Munroe, author of xkcd, discussing Unicode on the Triangulation podcast:

I am endlessly delighted by the hopeless task that the Unicode Consortium has created for themselves. … They started out just trying to unify a couple different character sets. And before they quite realized what was happening, they were grappling with decisions at the heart of how we use language, no matter how hard they tried to create policies to avoid these problems. It’s just a fun example of how weird language is and how hard human communication is and how you really can’t really get around those problems. … These are really hard problems and I do not envy them.

Reminds me of Jeffrey Snover’s remark about problems vs dilemmas: problems can be solved, but dilemmas can only be managed. Unicode faces a host of dilemmas.

Regarding Munroe’s comment about Unicode starting out small and getting more ambitious, see the next post for a plot of the number of characters as a function of time and of version number.

# Projecting Unicode to ASCII

Sometimes you need to downgrade Unicode text to more restricted ASCII text. For example, while working on my previous post, I was surprised that there didn’t appear to be an asteroid named after Poincaré. There is one, but it was listed as Poincare in my list of asteroid names.

## Python module

I used the Python module unidecode to convert names to ASCII before searching, and that fixed the problem. Here’s a small example showing how the code works.

    import unidecode

    for x in ["Poincaré", "Gödel"]:
        print(x, unidecode.unidecode(x))


This produces

    Poincaré Poincare
    Gödel Godel


Installing the unidecode module also installs a command line utility by the same name. So you could, for example, pipe text to that utility.

As someone pointed out on Hacker News, this isn’t so impressive for Western languages,

But if you need to project Arabic, Russian or Chinese, unidecode is close to black magic:

    >>> from unidecode import unidecode
    >>> unidecode("北亰")
    'Bei Jing '


(Someone has said in the comments that 北亰 is a typo and should be 北京. I can’t say whether this is right, but I can say that unidecode transliterates both to “Bei Jing.”)

## Projections

I titled this post “Projecting Unicode to ASCII” because this code is a projection in the mathematical sense. A projection is a function P such that for all inputs x,

P( P(x) ) = P(x).

That is, applying the function twice does the same thing as applying the function once. The name comes from projection in the colloquial sense, such as projecting a three dimensional object onto a two dimensional plane. An equivalent term is to say P is idempotent. [1]

The unidecode function maps the full range of Unicode characters into the range 0x00 to 0x7F, and if you apply it to a character already in that range, the function leaves it unchanged. So the function is a projection, or you could say the function is idempotent.
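The projection property is easy to test in code. To keep the sketch below free of third-party dependencies, it uses a much cruder stand-in than unidecode: the hypothetical `to_ascii` decomposes accented letters with the standard library’s `unicodedata` and simply drops whatever is not ASCII. It does not transliterate the way unidecode does, but it is a projection in the same sense.

```python
import unicodedata

def to_ascii(text):
    """Crude ASCII projection: split off accents, drop non-ASCII characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

for name in ("Poincaré", "Gödel", "plain ASCII"):
    once = to_ascii(name)
    twice = to_ascii(once)
    # Applying the function a second time changes nothing.
    print(name, once, once == twice)
```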

Projection is such a simple condition that it hardly seems worth giving it a name. And yet it is extremely useful. A general principle in user interface design is to make something a projection if the user expects it to be a projection. Users probably don’t have the vocabulary to say “I expected this to be a projection” but they’ll be frustrated if something is almost a projection but not quite.

For example, if software has a button to convert an image from color to grayscale, it would be surprising if (accidentally) clicking the button a second time had any effect. It would be unexpected if it returned the original color image, and it would be even more unexpected if it did something else, such as keeping the image in grayscale but lowering the resolution.

[1] The term “idempotent” may be used more generally than “projection,” the latter being more common in linear algebra. Some people may think of a projection as a linear idempotent function. We’re not exactly doing linear algebra here, but people do think of portions of Unicode geometrically, speaking of “planes.”

# Trademark symbol, LaTeX, and Unicode

Earlier this year I was a coauthor on a paper about the Cap Score™ test for male fertility from Androvia Life Sciences [1]. I just noticed today that when I added the publication to my CV, it caused some garbled text to appear in the PDF.

Here is the corresponding LaTeX source code.

## Fixing the LaTeX problem

There were two problems: the trademark symbol and the non-printing symbol denoted by a red underscore in the source file. The trademark was a non-ASCII character (Unicode U+2122) and the underscore represented a non-printing character, the non-breaking space (U+00A0). At first I only noticed the trademark symbol, and I fixed it by including a LaTeX package to allow Unicode characters:

    \usepackage[utf8x]{inputenc}

An alternative fix, one that doesn’t require including a new package, would be to replace the trademark Unicode character with \texttrademark\. Note the trailing backslash. Without the backslash there would be no space after the trademark symbol. The problem with the unprintable character would remain, but the character could just be deleted.

I found out there are two Unicode code points that render the trademark glyph, U+0099 and U+2122. The former is in the Latin 1 Supplement section and is officially a control character. The correct code point for the trademark symbol is the latter. Unicode files U+2122 under Letterlike Symbols and gives it the official name TRADE MARK SIGN.
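Python’s `unicodedata` module can confirm the status of both code points. The last line of the sketch hints at one possible reason U+0099 sometimes shows up as a trademark glyph, which is a guess rather than something established above: in the Windows-1252 code page, the byte 0x99 is the trademark sign.

```python
import unicodedata

trade_mark = "\u2122"
control = "\u0099"

print(unicodedata.name(trade_mark))      # official Unicode name
print(unicodedata.category(trade_mark))  # So: symbol, other
print(unicodedata.category(control))     # Cc: control character

# The byte 0x99 read as Windows-1252 is the trademark sign.
from_cp1252 = bytes([0x99]).decode("cp1252")
print(from_cp1252 == trade_mark)
```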

[1] Jay Schinfeld, Fady Sharara, Randy Morris, Gianpiero D. Palermo, Zev Rosenwaks, Eric Seaman, Steve Hirshberg, John Cook, Cristina Cardona, G. Charles Ostermeier, and Alexander J. Travis. Cap-Score™ Prospectively Predicts Probability of Pregnancy, Molecular Reproduction and Development. To appear.

# Typesetting modal logic

Modal logic extends propositional logic with two new operators, □ (“box”) and ◇ (“diamond”). There are many interpretations of these two symbols, the most common being necessity and possibility respectively. That is, □p means the proposition p is necessary, and ◇p means that p is possible. Another interpretation is using the symbols to represent things a person knows to be true and things that may be true as far as that person knows.

There are also many axiom systems for inference concerning these operators. For example, some axiom systems include the rule

$$\Box p \rightarrow \Box \Box p$$

and some do not. If you interpret □ as saying a proposition is provable, this axiom says whatever is provable is provably provable, which makes sense. But if you take □ to be a statement about what an agent knows, you may not want to say that if an agent knows something, it knows that it knows it.

See the next post for an example of applying logic to security, a logic with lots of modal operators and axioms. But for now, we’ll focus on how to typeset the box and diamond operators.

## LaTeX

In LaTeX, the most obvious commands would be \box and \diamond, but that doesn’t work. There is no \box command, though there is a \square command. And although there is a \diamond command, it produces a symbol much smaller than \square and so the two look odd together. The two operators are dual in the sense that

$$\Box p = \neg \Diamond \neg p, \qquad \Diamond p = \neg \Box \neg p$$

and so they should have symbols of similar size. A better approach is to use \Box and \Diamond. Those were used in the displayed equations above.

## Unicode

There are many box-like and diamond-like symbols in Unicode. It seems reasonable to use U+25A1 for box and U+25C7 for diamond. I don’t know of any more semantically appropriate characters. There are no Unicode characters with “modal” in their name, for example.

## HTML

You can always insert Unicode characters into HTML by using &#x, followed by the hexadecimal value of the codepoint, followed by a semicolon. For example, I typed &#x25a1; and &#x25c7; to enter the box and diamond symbols above.

If you want to stick to HTML entities because they’re easier to remember, you’re mostly out of luck. There is no HTML entity for the box operator. There is an entity &loz; for “lozenge,” the typographical term for a diamond. This HTML entity corresponds to U+25CA and is smaller than the U+25C7 recommended above. As discussed in the context of LaTeX, you want the box and diamond operators to have a similar size.
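To see which character each reference produces, you can decode it with Python’s standard `html` module and look up its Unicode name. The helper `describe` is made up for this sketch.

```python
import html
import unicodedata

def describe(reference):
    """Decode an HTML reference; return the character, code point, and name."""
    ch = html.unescape(reference)
    return ch, f"U+{ord(ch):04X}", unicodedata.name(ch)

for reference in ("&#x25a1;", "&#x25c7;", "&loz;"):
    print(reference, *describe(reference))
```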