# Letter-like Unicode symbols

Unicode provides a way to distinguish symbols that look alike but have different meanings.

We can illustrate this with the following Python code.

    import unicodedata as u

    for pair in [('K', 'K'), ('Ω', 'Ω'), ('ℵ', 'א')]:
        for c in pair:
            print(format(ord(c), '4X'), u.bidirectional(c), u.name(c))


This produces

      4B L LATIN CAPITAL LETTER K
    212A L KELVIN SIGN
     3A9 L GREEK CAPITAL LETTER OMEGA
    2126 L OHM SIGN
    2135 L ALEF SYMBOL
     5D0 R HEBREW LETTER ALEF


Even though K and K look similar, the former is a Roman letter and the latter is a symbol for temperature in Kelvin. Similarly, Ω and Ω are semantically different even though they look alike.

Or rather, they probably look similar. A font may or may not use different glyphs for different code points. The editor I’m using to write this post uses a font that makes no difference between ohm and omega. The letter K and the Kelvin symbol are slightly different if I look very closely. The two alefs appear substantially different.

Note that the mathematical symbol alef is a left-to-right character and the Hebrew letter alef is a right-to-left character. The former could be useful to tell a word processor “This isn’t really a Hebrew letter; it’s a symbol that looks like a Hebrew letter. Don’t change the direction of my text or switch any language-sensitive features like spell checking.”

These letter-like symbols can be used to provide semantic information, but they can also be used to deceive. For example, a malicious website could change a K in a URL to a Kelvin sign.
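One way to catch this kind of substitution is Unicode normalization. The following is a minimal sketch, not a complete defense against look-alike characters: it assumes that compatibility normalization (NFKC) through Python's standard `unicodedata.normalize` is an acceptable way to fold the three symbols above back to their letter counterparts. The helper names `fold_symbols` and `has_lookalike_symbols` are made up for illustration.

```python
import unicodedata

def fold_symbols(text):
    """Fold letter-like symbols to the letters they decompose to (NFKC)."""
    return unicodedata.normalize("NFKC", text)

def has_lookalike_symbols(text):
    """Return True if normalization changes the text, i.e. something was folded."""
    return fold_symbols(text) != text

# Hypothetical URL whose first path letter is U+212A KELVIN SIGN, not a Latin K.
suspicious = "example.com/\u212aelvin"
plain = "example.com/Kelvin"

print(has_lookalike_symbols(suspicious))  # the Kelvin sign folds to K
print(has_lookalike_symbols(plain))       # plain ASCII is left alone
```

This only catches characters that Unicode itself maps to a letter. Look-alikes from other scripts, such as a Cyrillic letter that resembles a Latin one, pass through normalization unchanged.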

# Corner quotes in Unicode

In his book Mastering Regular Expressions, Jeffrey Friedl uses corner quotes to delimit regular expressions. Here’s an example I found by opening his book at random:

    ⌜(\.\d\d[1-9]?)\d*⌟

The upper-left corner at the beginning and the lower-right corner at the end are not part of the regular expression. This particularly comes in handy if a regular expression begins or ends with white space.

(It wouldn’t do to, say, use quotation marks because this would invite confusion between the regular expression itself and a quoted string used to express that regular expression in a programming language.)

I’ve thought about using Friedl’s convention but I didn’t think it could be done with plain text. It can, using Unicode character U+231C at the beginning and U+231F at the end.
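As a small sketch of the convention in plain text, the hypothetical helpers below wrap a pattern in the top-left and bottom-right corners used in the example above, and strip them again. The function names are assumptions made for illustration, not part of any library.

```python
TOP_LEFT = "\u231c"      # TOP LEFT CORNER
BOTTOM_RIGHT = "\u231f"  # BOTTOM RIGHT CORNER

def corner_quote(pattern):
    """Delimit a regular expression so leading/trailing spaces stay visible."""
    return f"{TOP_LEFT}{pattern}{BOTTOM_RIGHT}"

def strip_corner_quotes(text):
    """Recover the bare pattern if the text is corner-quoted; else return it as is."""
    if len(text) >= 2 and text.startswith(TOP_LEFT) and text.endswith(BOTTOM_RIGHT):
        return text[1:-1]
    return text

quoted = corner_quote(r" \d+ ")   # pattern with significant spaces at both ends
print(quoted)
print(repr(strip_corner_quotes(quoted)))
```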

There are four corner quotes:

    |------+--------+---------------------|
    | Char | Code   | Name                |
    |------+--------+---------------------|
    | ⌜    | U+231C | TOP LEFT CORNER     |
    | ⌝    | U+231D | TOP RIGHT CORNER    |
    | ⌞    | U+231E | BOTTOM LEFT CORNER  |
    | ⌟    | U+231F | BOTTOM RIGHT CORNER |
    |------+--------+---------------------|


Corner quotes are also used in logic to denote Gödel numbers, e.g. ⌜φ⌝ denotes the Gödel number for φ.

Corner quotes are also known as Quine quotes. They usually come in the pair top left and top right, rather than top left and bottom right as in Friedl’s usage.

Update: As Rob Wells points out in the comments, it seems Friedl used CJK quote marks 「 (U+300C) and 」 (U+300D) rather than the corner quotes, which makes sense given that Friedl speaks Japanese.

# Fractions in Unicode

There are Unicode characters for a few fractions, such as ½. This looks a little better than 1/2, depending on the context.

Here’s the Taylor series for log(1 + x) written in pure HTML:

log(1 + x) = x – ½x² + ⅓x³ – ¼x⁴ + ⅕x⁵ – ⋯

See this post for how the exponents were made.

Notice that the three dots ⋯ on the end are centered vertically, like \cdots in LaTeX. This was done with &ctdot; (U+22EF).

## Available fractions

The selection of available fraction number forms is small and a little strange.

There are characters for fractions with denominator d equal to 2, 3, 4, 5, 6, and 8, with numerators 1 through d-1, except for fractions that can be reduced.

If d = 7, 9, or 10, there’s a character for 1/d but not for fractions with numerators other than 1. For example, there is a character for ⅐ but not for 2/7.
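The rules above can be checked from Python. This is a rough sketch that assumes the single-character fractions live in the ranges U+00BC to U+00BE and U+2150 to U+215E, and that `unicodedata.numeric` reports their values; the function name `fraction_char` is hypothetical.

```python
import unicodedata
from fractions import Fraction

# Code points assumed to hold the single-character fractions discussed above.
CANDIDATES = list(range(0xBC, 0xBF)) + list(range(0x2150, 0x215F))

def fraction_char(n, d):
    """Return the one-character form of n/d, or None if there isn't one."""
    target = Fraction(n, d)
    for cp in CANDIDATES:
        ch = chr(cp)
        # numeric() returns a float; recover the small exact fraction from it.
        value = Fraction(unicodedata.numeric(ch)).limit_denominator(10)
        if value == target:
            return ch
    return None

print(fraction_char(3, 5))   # a character exists for 3/5
print(fraction_char(1, 7))   # and for 1/7
print(fraction_char(2, 7))   # but not for 2/7
```

Because the lookup goes through `Fraction`, a reducible input such as 2/4 finds the character for its reduced form.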

## HTML Entities

For denominators 2, 3, 4, 5, 6, and 8 the HTML entity for characters is easy: they all have the form

& frac <n> <d> ;

where n is the numerator and d is the denominator. For example, &frac35; is the HTML entity for ⅗.

There are no HTML entities for 1/7, 1/9, or 1/10.
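Python's standard library ships the HTML5 entity table as `html.entities.html5`, which makes the naming pattern easy to test. The sketch below assumes that table is a fair proxy for what browsers accept; `fraction_entity` is a made-up helper name.

```python
from html.entities import html5

def fraction_entity(n, d):
    """Return the character named by &frac<n><d>; or None if no such entity exists."""
    return html5.get(f"frac{n}{d};")

print(fraction_entity(3, 5))   # the character for three fifths
print(fraction_entity(1, 7))   # None: no entity for 1/7

# List every entity that follows the frac<n><d> pattern.
print(sorted(name for name in html5 if name.startswith("frac") and name.endswith(";")))
```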

# Number sets in HTML and Unicode

When I started blogging I was very cautious about what characters I used because browsers often didn’t have font support for uncommon characters. Things have changed since then and I’ve gotten less cautious. Nobody has complained, so I assume readers are seeing the characters I intend them to see.

There are Unicode characters for sets of numbers such as the integers and the real numbers, double-struck letters similar to the blackboard bold letters \mathbb{Z} etc. in LaTeX.

Here’s a table of the characters, their Unicode values, and two HTML entities associated with each.

    ℕ U+2115 &Nopf; &naturals;
    ℤ U+2124 &Zopf; &integers;
    ℚ U+211A &Qopf; &rationals;
    ℝ U+211D &Ropf; &reals;
    ℂ U+2102 &Copf; &complexes;
    ℍ U+210D &Hopf; &quaternions;


If you’re going to use these symbols, you will likely also need to use ∈ (U+2208, &in;) and ∉ (U+2209, &notin;).
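If you want to confirm that both entity names in each row decode to the same character, a quick check with Python's `html.unescape` might look like the following. The table is transcribed from the text above, and `mismatches` is a hypothetical helper name.

```python
import html

# (character, entity names...) transcribed from the table and sentence above.
TABLE = [
    ("\u2115", "Nopf", "naturals"),
    ("\u2124", "Zopf", "integers"),
    ("\u211a", "Qopf", "rationals"),
    ("\u211d", "Ropf", "reals"),
    ("\u2102", "Copf", "complexes"),
    ("\u210d", "Hopf", "quaternions"),
    ("\u2208", "in"),
    ("\u2209", "notin"),
]

def mismatches(table=TABLE):
    """Return the entity names that do not decode to the listed character."""
    bad = []
    for char, *names in table:
        for name in names:
            if html.unescape(f"&{name};") != char:
                bad.append(name)
    return bad

# An empty list means every entity decodes to the character in its row.
print(mismatches())
```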

## More letters

If you want more letters in the style of those above, you can find them starting at U+1D538 for 𝔸. However, the characters corresponding to letters above are reserved.

So for example, 𝔸 is U+1D538, 𝔹 is U+1D539, but U+1D53A is reserved and you must use ℂ (U+2102) instead.

One letter not mentioned above is ℙ (U+2119). It has HTML entities &Popf; and &primes;.

So the double-struck versions of C, H, N, P, Q, R, and Z are down in the BMP (Basic Multilingual Plane) and the rest are in the SMP (Supplementary Multilingual Plane). I suspect characters in the SMP are less likely to have font support, but that may not be a problem.
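The split between the two planes can be captured in a few lines. This is a sketch under the assumptions stated in the text: capitals start at U+1D538, and the seven letters listed above use their BMP code points instead. The name `double_struck` is made up for illustration.

```python
# Double-struck capitals that live in the BMP; their SMP slots are reserved.
BMP_EXCEPTIONS = {
    "C": 0x2102, "H": 0x210D, "N": 0x2115, "P": 0x2119,
    "Q": 0x211A, "R": 0x211D, "Z": 0x2124,
}

def double_struck(letter):
    """Return the double-struck form of a capital letter A-Z."""
    if len(letter) != 1 or not ("A" <= letter <= "Z"):
        raise ValueError("expected a single capital letter A-Z")
    if letter in BMP_EXCEPTIONS:
        return chr(BMP_EXCEPTIONS[letter])
    # Everything else sits at a fixed offset from U+1D538 in the SMP.
    return chr(0x1D538 + ord(letter) - ord("A"))

print("".join(double_struck(c) for c in "ABCNZ"))
```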

# Unicode superscripts and subscripts

There are alternatives to using <sup> and <sub> tags for superscripts and subscripts in HTML. These alternatives may look better, depending on context, and they can be used in plain (Unicode) text files where HTML markup isn’t available.

## Superscripts

When I started blogging I would use <sup>2</sup> and <sup>3</sup> for squares and cubes. Then somewhere along the way someone told me about &sup2; (U+00B2) and &sup3; (U+00B3) and I started using these. The superscript characters generally produce slightly smaller superscripts and look nicer in my opinion.

Example with sup tags:

a<sup>2</sup> + b<sup>2</sup> = c<sup>2</sup>

Example with superscript characters:

a² + b² = c²

But there are no characters for exponents larger than 3. Or so I thought until recently.

There are no HTML entities for exponents larger than 3, nothing written in notation analogous to &sup2; and &sup3;. There are Unicode characters for other superscripts up to 9, but they don’t have corresponding HTML entities.

The Unicode code point for superscript n is

2070<sub>hex</sub> + n

except for n = 1, 2, or 3, which are the older characters U+00B9, U+00B2, and U+00B3. For example, U+2075 is a superscript 5. So you could write x⁵ as

<em>x</em>&#x2075;.

## Subscripts

There are also Unicode symbols for subscripts, though they don’t have corresponding HTML entities. The Unicode code point for subscript n = 0, 1, 2, …, 9 is

2080<sub>hex</sub> + n

For example, U+2087 is a subscript 7.
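Both rules fit in a pair of translation tables. The sketch below is one way to do it in Python, with hypothetical function names. Superscript 1 is U+00B9, so it is handled as an exception alongside 2 and 3, while the subscripts follow the formula with no exceptions.

```python
# Superscripts: 0x2070 + n, except 1, 2, 3 which are U+00B9, U+00B2, U+00B3.
SUPERSCRIPTS = str.maketrans(
    "0123456789",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079",
)

# Subscripts: 0x2080 + n for every digit n.
SUBSCRIPTS = str.maketrans({str(n): chr(0x2080 + n) for n in range(10)})

def to_superscript(digits):
    """Replace ASCII digits with superscript characters."""
    return digits.translate(SUPERSCRIPTS)

def to_subscript(digits):
    """Replace ASCII digits with subscript characters."""
    return digits.translate(SUBSCRIPTS)

print("x" + to_superscript("5"))
print("x" + to_subscript("7"))
```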

The subscript characters don’t extend below the baseline as far as subscripts in <sub> tags do. Here are x’s with subscripts in <sub> tags.

x<sub>0</sub>, x<sub>1</sub>, x<sub>2</sub>, x<sub>3</sub>, x<sub>4</sub>, x<sub>5</sub>, x<sub>6</sub>, x<sub>7</sub>, x<sub>8</sub>, x<sub>9</sub>

And here are the single character subscripts.

x₀, x₁, x₂, x₃, x₄, x₅, x₆, x₇, x₈, x₉

I think the former looks better, but subscripts in HTML paragraphs may increase the vertical spacing between lines. If consistent line spacing is more important than conventional subscripts, you might prefer the subscript characters.

# Planetary code golf

Suppose you’re asked to write a function that takes a number and returns a planet. We’ll number the planets in order from the sun, starting at 1, and for our purposes Pluto is the 9th planet.

Here’s an obvious solution:

    def planet(n):
        planets = [
            "Mercury",
            "Venus",
            "Earth",
            "Mars",
            "Jupiter",
            "Saturn",
            "Uranus",
            "Neptune",
            "Pluto"
        ]
        # subtract 1 to correct for unit offset
        return planets[n-1]


Now let’s have some fun with this and play a little code golf. Here’s a much shorter program.

    def planet(n): return chr(0x263e+n)


I was deliberately ambiguous at the top of the post, saying the code should return a planet. The first function naturally interprets that to mean the name of a planet, but the second function takes it to mean the symbol of a planet.

The symbol for Mercury has Unicode value U+263F, and Unicode lists the symbols for the planets in order from the sun. So if we numbered the planets starting from 0, the Unicode character for the nth planet would be the character for Mercury plus n. But conventionally we call Mercury the 1st planet, not the 0th planet, so we have to subtract one. That’s why the code contains 0x263e rather than 0x263f.

We could make the function just a tiny bit shorter by using a lambda expression and using a decimal number for 0x263e.

    planet = lambda n: chr(9790+n)
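To see that the golfed version really does walk through the planets in order, you can ask the standard `unicodedata` module for each character's official name. This is only a quick sanity check, not part of the golf.

```python
import unicodedata

def planet(n):
    return chr(0x263e + n)

# Print the number, the symbol, and its Unicode name for planets 1 through 9.
for n in range(1, 10):
    symbol = planet(n)
    print(n, symbol, unicodedata.name(symbol))
```

The names for the 2nd and 4th symbols come back as FEMALE SIGN and MALE SIGN rather than Venus and Mars.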

## Display issues

Here’s a brief way to list the symbols from the Python REPL.

    >>> [chr(0x263f+n) for n in range(9)]
    ['☿', '♀', '♁', '♂', '♃', '♄', '♅', '♆', '♇']


You may see the symbols for Venus and Mars above rendered with a colored background, depending on your browser and OS.

Here’s what the lines above looked like at my command line.

Here’s what the symbols looked like when I posted them on Twitter.

For reasons explained here, some software interprets some characters as emoji rather than literal characters. The symbols for Venus and Mars are used as emoji for female and male respectively, based on the use of the symbols in biology. Incidentally, these symbols were used to represent planets long before biologists used them to represent sexes.
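Unicode does offer a way to ask for one rendering or the other, though it is not something this post relies on: appending U+FE0E (VARIATION SELECTOR-15) requests text presentation and U+FE0F (VARIATION SELECTOR-16) requests emoji presentation. The sketch below only builds the strings. Whether a given terminal, browser, or app honors the request is an assumption you would have to test, and the function names are made up.

```python
TEXT_STYLE = "\ufe0e"    # VARIATION SELECTOR-15: request text presentation
EMOJI_STYLE = "\ufe0f"   # VARIATION SELECTOR-16: request emoji presentation

def as_text(symbol):
    """Append the selector that asks for a plain text glyph."""
    return symbol + TEXT_STYLE

def as_emoji(symbol):
    """Append the selector that asks for an emoji glyph."""
    return symbol + EMOJI_STYLE

venus = "\u2640"
print(as_text(venus), as_emoji(venus))   # rendering depends on the software
```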

# Unicode and Emoji, or The Giant Pawn Mystery

I generally despise emoji, but I reluctantly learned a few things about them this morning.

My last couple of blog posts involved chess, and I sent out a couple of tweets using chess symbols. Along the way I ran into a mystery: sometimes the black pawn is much larger than other chess symbols. I first noticed this in Excel. Then I noticed that sometimes it happens in the Twitter app, and sometimes not, sometimes on the Twitter web site, and sometimes not.

For example, the following screen shot is from Safari on iOS.

What’s going on? I explained in a footnote to this post, but I wanted to make this its own post to make it easier to find in the future.

In a nutshell, something in the software environment is deciding that 11 of the 12 chess characters are to be taken literally, but the character for the black pawn is to be interpreted as an emojus [1] representing chess. I’m not clear on whether this is happening in the font or in an app. Probably one, both, or neither depending on circumstances.

I erroneously thought that emoji were all outside Unicode’s BMP (Basic Multilingual Plane) so as not to be confused with ordinary characters. Alas, that is not true.
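Checking which plane a character lives in takes one comparison. The sketch below uses a hypothetical helper, `in_bmp`, to show that all twelve chess characters, the black pawn included, sit in the BMP, while a typical face emoji such as U+1F600 does not.

```python
def in_bmp(ch):
    """True if the character's code point is at most U+FFFF."""
    return ord(ch) <= 0xFFFF

chess = [chr(cp) for cp in range(0x2654, 0x2660)]   # white king .. black pawn
print(all(in_bmp(c) for c in chess))   # every chess character is in the BMP
print(in_bmp("\U0001f600"))            # a face emoji outside the BMP
```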

Here is a full list of Unicode characters interpreted (by …?) as emoji. There are 210 emoji characters in the BMP and 380 outside, i.e. 210 below FFFF and 380 above FFFF.

***

[1] I know that “emoji” is a Japanese word, not a Latin word, but to my ear the singular of “emoji” should be “emojus.”

# How to make a chessboard in Excel

I needed to make an image of a chessboard for the next blog post, and I’m not very good at image editing, so I made one using Excel.

There are Unicode characters for chess pieces (the white king is U+2654, etc.), and so you can make a chessboard out of (Unicode) text.

    ♔♕♖♗♘♙♚♛♜♝♞♟

I placed the character for each piece in a cell and changed the formatting for all the cells to be centered horizontally and vertically. The following is a screenshot of the Excel file.

The trickiest part is getting the cells to be square. By default Excel uses different units for height and width, with no apparent way to change the units. But if you switch the View to Page Layout, you can set row height and column width in inches or centimeters.

Another quirk is that you may have to experiment with the font to get all the pieces the same size. In some fonts, the black pawns were larger than everything else [1].

You can download my Excel file here. You could make any chessboard configuration with this file by copying and pasting characters where you want them.

When I tested the file in LibreOffice, it worked, but I had to reset the row height to match the column width.
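The same text-only idea works outside a spreadsheet. As a rough sketch, the Python below builds the starting position as eight strings using the code points U+2654 through U+265F. The function name and the choice of a middle dot for empty squares are assumptions made for illustration, and how well the rows line up depends on your font.

```python
# Piece letters in the order Unicode lists them: king, queen, rook, bishop, knight, pawn.
ORDER = "KQRBNP"
WHITE = {p: chr(0x2654 + i) for i, p in enumerate(ORDER)}
BLACK = {p: chr(0x265A + i) for i, p in enumerate(ORDER)}
BACK_RANK = "RNBQKBNR"

def starting_board(empty="\u00b7"):
    """Return the initial position as 8 strings, black's back rank first."""
    return [
        "".join(BLACK[p] for p in BACK_RANK),
        BLACK["P"] * 8,
        *[empty * 8 for _ in range(4)],
        WHITE["P"] * 8,
        "".join(WHITE[p] for p in BACK_RANK),
    ]

print("\n".join(starting_board()))
```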

[1] Thanks to a reply on Twitter I now understand why black’s pawn is sometimes outsized. The black pawn is used as an emoji, a sort of synecdoche representing chess. That’s why some fonts treat U+265E, black knight, entirely differently than U+265F, black pawn. The latter is interpreted not as a peer of the other pieces, but as the chess emoji. See the chess pawn entry in Emojipedia.

# Upper case, lower case, title case

Converting text to all upper case or all lower case is a fairly common task.

One way to convert text to upper case would be to use the tr utility to replace the letters a through z with the letters A through Z. For example,

    $ echo Now is the time | tr '[a-z]' '[A-Z]'
    NOW IS THE TIME

You could convert to lower case by reversing the arguments to tr.

The approach above works if your text consists of only unadorned Roman letters. But it wouldn’t work, for example, if you gave it a jalapeño or π:

    $ echo jalapeño π | tr '[a-z]' '[A-Z]'
    JALAPEñO π


Using the character classes [:lower:] and [:upper:] won’t help either.
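For comparison, Python's built-in string methods are Unicode-aware, so the same input comes out fully capitalized. This is just a sketch of an alternative to tr; the wrapper names are hypothetical.

```python
def to_upper(text):
    """Upper-case text using Python's Unicode-aware str.upper()."""
    return text.upper()

def to_lower(text):
    """Lower-case text using Python's Unicode-aware str.lower()."""
    return text.lower()

print(to_upper("jalape\u00f1o \u03c0"))   # JALAPEÑO Π
print(to_lower("JALAPE\u00d1O \u03a0"))   # jalapeño π
```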

## Tussling with Unicode

One alternative would be to use the uc command from the Unicode::Tussle package [1] I mentioned a few days ago. There’s also a lc counterpart, and a tc for title case. These utilities handle far more than Roman letters.

    $ echo jalapeño π | uc
    JALAPEÑO Π

Unicode capitalization rules are a black hole, but we’ll just look at one example and turn around quickly before we cross the event horizon.

Suppose you want to send all the letters in the Greek word σόφος to upper case.

    $ echo σόφος | uc
    ΣΌΦΟΣ


Greek has two lower case forms of sigma: ς at the end of a word and σ everywhere else. But there’s only one upper case sigma, so both get mapped to Σ. This means that if we convert the text to upper case and then to lower case, we won’t end up exactly where we started.
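The collapse of the two sigmas is easy to confirm in Python, shown here only as an illustration of the same Unicode mapping rather than of how uc works internally. Once both forms have become Σ, any lower-casing routine has to decide from context which one to restore.

```python
FINAL_SIGMA = "\u03c2"   # ς, used at the end of a word
SIGMA = "\u03c3"         # σ, used everywhere else

def same_upper(a, b):
    """True if two strings become identical when converted to upper case."""
    return a.upper() == b.upper()

print(FINAL_SIGMA.upper(), SIGMA.upper())
print(same_upper(FINAL_SIGMA, SIGMA))   # both map to the one capital sigma
```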

    $ echo σόφος | uc | lc
    σόφοσ

Note that the lc program chose σ as the lower case of Σ and didn’t take into account that it was at the end of a word.

[1] “Tussle” is an acronym for Tom [Christiansen]’s Unicode Scripts So Life is Easier.

# Removing Unicode formatting

Several people responded to my previous post asserting that screen readers would not be able to read text formatted via Unicode variants. Maybe some screen readers can’t handle this, but there’s no reason they couldn’t.

Before I go any further, I’d like to repeat my disclaimer from the previous post:

It’s a dirty hack, and I’d recommend not overdoing it. But it could come in handy occasionally. On the other hand, some people may not see what you intend them to see.

This formatting is gimmicky and there are reasons to only use it sparingly or not at all. But I don’t see why screen readers need to be stumped by it.

In the example below, I format the text “The quick brown fox” by running it through unifont as in the previous post. If we pipe the output through unidecode then we mostly recover the original text. (I wrote about unidecode here.)

    $ unifont The quick brown fox | unidecode

    Double-Struck: The quick brown fox
    Monospace: The quick brown fox
    Sans-Serif: The quick brown fox
    Sans-Serif Italic: The quick brown fox
    Sans-Serif Bold: The quick brown fox
    Sans-Serif Bold Italic: The quick brown fox
    Script: T he quick brown fox
    Italic: The quick brown fox
    Bold: The quick brown fox
    Bold Italic: The quick brown fox
    Fraktur: T he quick brown fox
    Bold Fraktur: T he quick brown fox


The only problem is that sometimes there’s an extra space after capital letters. I don’t know whether this is a quirk of unifont or unidecode.

This isn’t perfect, but it’s a quick proof of concept that suggests this shouldn’t be a hard thing for a screen reader to do.

Maybe you don’t want to normalize Unicode characters this way all the time, but you could have some configuration option to only do this for Twitter, or to only do it for characters outside a certain character range.
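A similar proof of concept is possible without unidecode. The sketch below swaps in a different technique, Unicode compatibility normalization (NFKC) from Python's standard library, which maps the mathematical styled letters back to plain ones. The function name is hypothetical, and this is an assumption about what a screen reader could do, not a description of what any of them does.

```python
import unicodedata

def strip_formatting(text):
    """Map styled letters (double-struck, bold, Fraktur, ...) to plain ones."""
    return unicodedata.normalize("NFKC", text)

# Double-struck T, h, e built from their code points.
fancy = "".join(chr(cp) for cp in (0x1D54B, 0x1D559, 0x1D556))
print(fancy, "->", strip_formatting(fancy))
```

NFKC is a blunt instrument: it also rewrites other compatibility characters, such as superscript digits and single-character fractions, so a real tool might apply it only to selected character ranges.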