Dominoes in Unicode

I was spelunking around in Unicode and found that there are assigned characters for representing domino tiles and that the characters are enumerated in a convenient order. Here is the code chart.

There are codes for representing tiles horizontally or vertically. And even though, for example, the 5-3 is the same domino as the 3-5, there are separate characters for representing the orientation of the tile: one for 3 on the left and one for 5 on the left.

When you include orientation like this, a domino becomes essentially a base 7 number: the number of spots on one end is the number of 7s and the number of spots on the other end is the number of 1s. And the order of the characters corresponds to the order as base 7 numbers:

0-0, 0-1, 0-2, …, 1-0, 1-1, 1-2, … 6-6.

The horizontal dominoes start with the double blank at U+1F031 and the vertical dominoes start with U+1F063, a difference of 32 in hexadecimal or 50 in base 10. So you can rotate a domino tile by adding or subtracting 50 to its code point.

The following tiny Python function gives the codepoint for the domino with a spots on the left (or top) and b spots on the right (or bottom).

    def code(a, b, wide):
        cp = 0x1f031 if wide else 0x1f063
        return cp + 7*a + b

We can use this function to print a (3, 5) tile horizontally and a (6, 4) tile vertically.

    print( chr(code(3, 5, True )),
           chr(code(6, 4, False)) )

To my surprise, my computer had the fonts installed to display the results. This isn’t guaranteed for such high Unicode values.

horizontal 3-5 domino and vertical 6-4

Letter-like Unicode symbols

Unicode provides a way to distinguish symbols that look alike but have different meanings.

We can illustrate this with the following Python code.

    import unicodedata as u

    for pair in [('K', 'K'), ('Ω', 'Ω'), ('ℵ', 'א')]:
        for c in pair:
            print(format(ord(c), '4X'), u.bidirectional(c), u.name(c))

This produces

      4B L LATIN CAPITAL LETTER K
    212A L KELVIN SIGN
     3A9 L GREEK CAPITAL LETTER OMEGA
    2126 L OHM SIGN
    2135 L ALEF SYMBOL
     5D0 R HEBREW LETTER ALEF

Even though K and K look similar, the former is a Roman letter and the latter is a symbol for temperature in Kelvin. Similarly, Ω and Ω are semantically different even though they look alike.

Or rather, they probably look similar. A font may or may not use different glyphs for different code points. The editor I’m using to write this post uses a font that makes no difference between ohm and omega. The letter K and the Kelvin symbol are slightly different if I look very closely. The two alefs appear substantially different.

Note that the mathematical symbol alef is a left-to-right character and the Hebrew latter alef is a right-to-left character. The former could be useful to tell a word processor “This isn’t really a Hebrew letter; it’s a symbol that looks like a Hebrew letter. Don’t change the direction of my text or switch any language-sensitive features like spell checking.”

These letter-like symbols can be used to provide semantic information, but they can also be used to deceive. For example, a malicious website could change a K in a URL to a Kelvin sign.

Related posts

Corner quotes in Unicode

In his book Mastering Regular Expressions, Jeffrey Friedl uses corner quotes to delimit regular expressions. Here’s an example I found by opening his book a random:

    ⌜(\.\d\d[1-9]?)\d*⌟

The upper-left corner at the beginning and the lower-right corner at the end are not part of the regular expression. This particularly comes in handy if a regular expression begins or ends with white space.

(It wouldn’t do to, say, use quotation marks because this would invite confusion between the regular expression itself and a quoted string used to express that regular expression in a programming language.)

I’ve thought about using Friedl’s convention but I didn’t think it could be done with plain text. It can, using Unicode character U+231C at the beginning and U+231D at the end.

There are four corner quotes:

    |------+--------+---------------------|
    | Char | Code   | Name                |
    |------+--------+---------------------|
    | ⌜    | U+231C | TOP LEFT CORNER     |
    | ⌝    | U+231D | TOP RIGHT CORNER    |
    | ⌞    | U+231E | BOTTOM LEFT CORNER  |
    | ⌟    | U+231F | BOTTOM RIGHT CORNER |
    |------+--------+---------------------|

Corner quotes are also used in logic to denote Gödel numbers, e.g. ⌜φ⌝ denotes the Gödel number for φ.

Corner quotes are also known as Quine quotes. They usually come in the pair top left and top right, rather than top left and bottom right as in Friedl’s usage.

Update: As Rob Wells points out in the comments, it seems Friedl used CJK quote marks 「 (U+300C) and 」 (U+300D) rather than the corner quotes, which makes sense given that Friedl speaks Japanese.

Related posts

Fractions in Unicode

There are Unicode characters for a few fractions, such as ½. This looks a little better than 1/2, depending on the context.

Here’s the Taylor series for log(1 + x) written in pure HTML:

log(1 + x) = x – ½x² + ⅓x³ – ¼x⁴ + ⅕x⁵ – ⋯

See this post for how the exponents were made.

Notice that the three dots ⋯ on the end are centered vertically, like \cdots in LaTeX. This was done with ⋯ (U+22EF).

Available fractions

The selection of available fraction number forms is small and a little strange.

There are characters for fractions with denominator d equal to 2, 3, 4, 5, 6, and 8, with numerators 1 through d-1, except for fractions that can be reduced.

If d = 7, 9, or 10, there’s a character for 1/d but not for fractions with numerators other than 1. For example, there is a character for ⅐ but not for 2/7.

HTML Entities

For denominators 2, 3, 4, 5, 6, and 8 the HTML entity for characters is easy: they all have the form

& frac <n> <d> ;

where n is the numerator and d is the denominator. For example, &frac35; is the HTML entity for ⅗.

There are no HTML entities for 1/7, 1/9, or 1/10.

Related posts

Number sets in HTML and Unicode

When I started blogging I was very cautious about what characters I used because browsers often didn’t have font support for uncommon characters. Things have changed since then and I’ve gotten less cautious. Nobody has complained, so I assume readers are seeing the characters I intend them to see.

There are Unicode characters for sets of numbers such as the integers and the real numbers, double-struck letters similar to the blackboard bold letters \mathbb{Z} etc. in LaTeX.

\mathbb{N} \quad \mathbb{Z} \quad \mathbb{Q} \quad \mathbb{R} \quad \mathbb{C} \quad \mathbb{H}

Here’s a table of the characters, their Unicode values, and two HTML entities associated with each.

    ℕ U+2115 &Nopf; &naturals;
    ℤ U+2124 &Zopf; &integers;
    ℚ U+211A &Qopf; &rationals;
    ℝ U+211D &Ropf; &reals;
    ℂ U+2102 &Copf; &complexes;
    ℍ U+210D &Hopf; &quaternions;

If you’re going to use these symbols, you will likely also need to use ∈ (U+2208, &in;) and ∉ (U+2209, &notin;).

More letters

If you want more letters in the style of those above, you can find them starting at U+1D538 for . However, the characters corresponding to letters above are reserved.

So for example, is U+1D538, is U+1D539, but U+1D53A is reserved and you must use ℂ (U+2102) instead.

One letter not mentioned above is ℙ (U+2119). It has HTML entities &Popf; and &primes;.

So the double-struck versions of C, H, N, P, Q, R, and Z are down in the BMP (Basic Multilingual Plane) and the rest are in the SMP (Supplementary Multilingual Plane). I suspect characters in the SMP are less likely to have font support, but that may not be a problem.

Unicode superscripts and subscripts

There are alternatives to using <sup> and <sub> tags for superscripts and subscripts in HTML. These alternatives may look better, depending on context, and they can be used in plain (Unicode) text files where HTML markup isn’t available.

Superscripts

When I started blogging I would use <sup>2</sup> and <sup>3</sup> for squares and cubes. Then somewhere along the way someone told me about &sup2; (U+00B2) and &sup3; (U+00B3) and I started using these. The superscript characters generally produce slightly smaller subscripts and look nicer in my opinion.

Example with sup tags:

a2 + b2 = c2

Example with superscript characters:

a² + b² = c²

But there are no characters for exponents larger than 3. Or so I thought until recently.

There are no HTML entities for exponents larger than 3, nothing written in notation analogous to &sup2; and &sup3;. There are Unicode characters for other superscripts up to 9, but they don’t have corresponding HTML entities.

The Unicode code point for superscript n is

2070hex + n

except for n = 2 or 3. For example, U+2075 is a superscript 5. So you could write x⁵ as

<em>x</em>&#x2075;.

Subscripts

There are also Unicode symbols for subscripts, though they don’t have corresponding HTML entities. The Unicode code point for superscript n = 0, 1, 2, … 9 is

2080hex + n

For example, U+2087 is a subscript 7.

The subscript characters don’t extend below the baseline as far as subscripts in <sub> tags do. Here are x‘s with subsubcripts in <sub> tags.

x0, x1, x2, x3, x4, x5, x6, x7, x8, x9

And here are the single character subscripts.

x₀, x₁, x₂, x₃, x₄, x₅, x₆, x₇, x₈, x

I think the former looks better, but subscripts in HTML paragraphs may increase the vertical spacing between lines. If consistent line spacing is more important than conventional subscripts, you might prefer the subscript characters.

Related posts

Planetary code golf

Suppose you’re asked to write a function that takes a number and returns a planet. We’ll number the planets in order from the sun, starting at 1, and for our purposes Pluto is the 9th planet.

Here’s an obvious solution:

    def planet(n):
        planets = [
            "Mercury",
            "Venus",
            "Earth",
            "Mars",
            "Jupiter",
            "Saturn",
            "Uranus",
            "Neptune",
            "Pluto"
        ]
        # subtract 1 to correct for unit offset
        return planets[n-1] 

Now let’s have some fun with this and play a little code golf. Here’s a much shorter program.

    def planet(n): return chr(0x263e+n)

I was deliberately ambiguous at the top of the post, saying the code should return a planet. The first function naturally interprets that to mean the name of a planet, but the second function takes it to mean the symbol of a planet.

The symbol for Mercury has Unicode value U+263F, and Unicode lists the symbols for the planets in order from the sun. So the Unicode character for the nth planet is the character for Mercury plus n. That would be true if we numbered the planets starting from 0, but conventionally we call Mercury the 1st planet, not the 0th planet, so we have to subtract one. That’s why the code contains 0x263e rather than 0x263f.

We could make the function just a tiny bit shorter by using a lambda expression and using a decimal number for 0x263e.

    planet = lambda n: chr(9790+n)

Display issues

Here’s a brief way to list the symbols from the Python REPL.

    >>> [chr(0x263f+n) for n in range(9)]
    ['☿', '♀', '♁', '♂', '♃', '♄', '♅', '♆', '♇']

You may see the symbols for Venus and Mars above rendered with a colored background, depending on your browser and OS.

Here’s what the lines above looked like at my command line.

screen shot of repl

Here’s what the symbols looked like when I posted them on Twitter.

symbols for planets

For reasons explained here, some software interprets some characters as emoji rather than literal characters. The symbols for Venus and Mars are used as emoji for female and male respectively, based on the use of the symbols in biology. Incidentally, these symbols were used to represent planets long before biologists used them to represent sexes.

Related posts

Unicode and Emoji, or The Giant Pawn Mystery

I generally despise emoji, but I reluctantly learned a few things about them this morning.

My latest couple blog posts involved chess, and I sent out a couple tweets using chess symbols. Along the way I ran into a mystery: sometimes the black pawn is much larger than other chess symbols. I first noticed this in Excel. Then I noticed that sometimes it happens in the Twitter app, and sometimes not, sometimes on the twitter web site, and sometimes not.

For example, the following screen shot is from Safari on iOS.

screenshot of tweet with giant pawn

What’s going on? I explained in a footnote to this post, but I wanted to make this its own post to make it easier to find in the future.

In a nutshell, something in the software environment is deciding that 11 of the twelve chess characters are to be taken literally, but the character for the black pawn is to be interpreted as an emojus [1] representing chess. I’m not clear on whether this is happening in the font or in an app. Probably one, both, or neither depending on circumstances.

I erroneously thought that emoji were all outside Unicode’s BMP (Basic Multilingual Plane) so as not to be confused with ordinary characters. Alas, that is not true.

Here is a full list of Unicode characters interpreted (by …?) as emoji. There are 210 emoji characters in the BMP and 380 outside, i.e. 210 below FFFF and 380 above FFFF.

***

[1] I know that “emoji” is a Japanese word, not a Latin word, but to my ear the singular of “emoji” should be “emojus.”

How to make a chessboard in Excel

I needed to make an image of a chessboard for the next blog post, and I’m not very good at image editing, so I make one using Excel.

There are Unicode characters for chess pieces— white king is U+2654, etc.—and so you can make a chessboard out of (Unicode) text.

    ♔♕♖♗♘♙♚♛♜♝♞♟

I placed the character for each piece in an cell and changed the formatting for all the cells to be centered horizontally and vertically. The following is a screenshot of the Excel file.

Chessboard: screenshot from Excel file

The trickiest part is getting the cells to be square. By default Excel uses different units for height and width, with no apparent way to change the units. But if you switch the View to Page Layout, you can set row height and column width in inches or centimeters.

Another quirk is that you may have to experiment with the font to get all the pieces the same size. In some fonts, the black pawns were larger than everything else [1].

You can download my Excel file here. You could make any chessboard configuration with this file by copying and pasting characters where you want them.

When I tested the file in Libre Office, it worked, but I had to reset the row height to match the column width.

Related posts

[1] Thanks to a reply on twitter I now understand why black’s pawn is sometimes outsized. The black pawn is used as an emoji, a sort of synecdoche representing chess. That’s why some fonts treat U+265E, black knight, entirely differently than U+265F, black pawn. The latter is interpreted not as a peer of the other pieces, but as the chess emoji. See the chess pawn entry in Emojipedia.

Upper case, lower case, title case

Converting text to all upper case or all lower case is a fairly common task.

One way to convert text to upper case would be to use the tr utility to replace the letters a through z with the letters A through Z. For example,

    $ echo Now is the time | tr '[a-z]' '[A-Z]'
    NOW IS THE TIME

You could convert to lower case by reversing the arguments to tr.

The approach above works if your text consists of only unadorned Roman letters. But it wouldn’t work, for example, if you gave it a jalapeño or π:

    $ echo jalapeño π | tr '[a-z]' '[A-Z]'
    JALAPEñO π

Using the character classes [:lower:] and [:upper:] won’t help either.

Tussling with Unicode

One alternative would be to use the uc command from the Unicode::Tussle package [1] I mentioned a few days ago. There’s also a lc counterpart, and a tc for title case. These utilities handle far more than Roman letters.

    $ echo jalapeño π | uc
    JALAPEÑO Π

Unicode capitalization rules are a black hole, but we’ll just look at one example and turn around quickly before we cross the event horizon.

Suppose you want to send all the letters in the Greek word σόφος to upper case.

    $ echo σόφος | uc
    ΣΌΦΟΣ

Greek has two lower case forms of sigma: ς at the end of a word and σ everywhere else. But there’s only one upper case sigma, so both get mapped to Σ. This means that if we convert the text to upper case and then to lower case, we won’t end up exactly where we started.

    $ echo σόφος | uc | lc
    σόφοσ

Note that the lc program chose σ as the lower case of Σ and didn’t take into account that it was at the end of a word.

Related posts

[1] “Tussle” is an acronym for Tom [Christiansen]’s Unicode Scripts So Life is Easier.