Making flags in Unicode

Posted on 2 October 2022 by John

I recently found out [1] that the Unicode sequences for flag emoji are created by taking the two-letter country abbreviation (ISO 3166-1 alpha-2) and replacing both letters with their counterparts in the range U+1F1E6 through U+1F1FF.

For example, the abbreviation for Canada is CA, and the characters 🇨 (U+1F1E8) and 🇦 (U+1F1E6) together create 🇨🇦.

boxed C plus boxed A = Canadian flag

This is illustrated by the following Python code.

    import iso3166

    def flag_emoji(name):
        alpha = iso3166.countries.get(name).alpha2
        box = lambda ch: chr( ord(ch) + 0x1f1a5 )
        return box(alpha[0]) + box(alpha[1])
    print(flag_emoji("Canada"))

The name we give to flag_emoji need not be the full country name, like Canada. It can be anything that iso3166.countries.get supports, which also includes two-letter abbreviations like CA, three-letter abbreviations like CAN, or ISO 3166 numeric codes like 124.
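The arithmetic itself does not depend on the iso3166 package. Here is a minimal sketch that works directly on a two-letter code and also inverts the mapping. The names OFFSET, flag_from_alpha2, and alpha2_from_flag are my own for illustration, and neither function checks that the code belongs to an actual country.

```python
# Offset between an ASCII capital and its regional indicator symbol:
# U+1F1E6 (regional indicator A) minus U+0041 (A).
OFFSET = 0x1F1E6 - ord("A")

def flag_from_alpha2(code):
    """Turn a two-letter code such as 'CA' into a pair of regional indicators."""
    code = code.upper()
    if len(code) != 2 or not all("A" <= ch <= "Z" for ch in code):
        raise ValueError("expected two ASCII letters")
    return "".join(chr(ord(ch) + OFFSET) for ch in code)

def alpha2_from_flag(flag):
    """Recover the two-letter code from a flag sequence."""
    return "".join(chr(ord(ch) - OFFSET) for ch in flag)

print(flag_from_alpha2("ca"), alpha2_from_flag(flag_from_alpha2("ca")))
```

Whether the pair renders as a flag or as two boxed letters depends on the font, so the round trip back to letters is a handy check.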

We can use the following code to print a collage of flags:

    def print_all_flags():
        for i, c in enumerate( iso3166.countries ):
            print(flag_emoji(c.name), end="")
            if i%25 == 24: print()

10 by 25 array of flags

[1] I learned this from watching Dylan Beattie’s talk Plain Text on YouTube.

Preventing characters from displaying as emoji

Posted on 30 September 2022 by John

I rarely intentionally use emoji, and yet I often run into them unbidden. This is because some Unicode characters double as emoji.

For example, the zodiac symbol for Aries is used both in celestial navigation and in astrology. The latter is much more common, and so when some software sees U+2648 it interprets the character as the emoji for the horoscope sign.

There is a way to prevent this: append the “variation selector” character U+FE0E after the symbol. This tells software that you want the preceding character to be displayed as plain text rather than as an emoji. And if you want to request that a character be displayed as an emoji, you can append U+FE0F instead. But a particular software package may or may not honor your request.
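As a small sketch, the two requests can be wrapped in helper functions. The names as_text and as_emoji are hypothetical, and the functions only append the selector; they make no attempt to check whether the character has both presentations.

```python
TEXT_STYLE = "\ufe0e"   # VARIATION SELECTOR-15: request text presentation
EMOJI_STYLE = "\ufe0f"  # VARIATION SELECTOR-16: request emoji presentation

def as_text(ch):
    """Ask for the plain-text form of ch."""
    return ch + TEXT_STYLE

def as_emoji(ch):
    """Ask for the emoji form of ch."""
    return ch + EMOJI_STYLE

aries, pawn = "\u2648", "\u265f"
print(as_text(aries), as_emoji(pawn))
```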

I tried this in several terminals to see what would happen. On Linux and Mac, the symbol for Aries (U+2648) prints as an emoji by default, and the symbol for a black pawn (U+265F) does not. But when I added variation selectors to reverse the defaults, the shells complied.

Here’s the same output as text:

    >>> print("\u2648")
    ♈
    >>> print("\u2648\ufe0e")
    ♈︎
    >>> print("\u265f")
    ♟
    >>> print("\u265f\ufe0f")
    ♟️

When I tested this on Windows I got different results in different terminals. The default cmd terminal was unable to display either Aries or the pawn. The ConEmu terminal displayed both as characters, even when I requested emoji. The new Windows Terminal app displayed both as emoji, even when I requested plain characters. This probably has something to do with the fonts the terminals use as well as the terminal software itself.

Update: View this page to see how your browser renders various characters as text or emoji. Also see this Twitter thread on how Twitter renders these characters.

Katakana, Hiragana, and Unicode

Posted on 25 September 2022 by John

I figured out something that I wasn’t able to find by searching, so I’m posting it here in case other people have the same question and the same difficulty finding an answer.

I’m sure other people have written about this, but I couldn’t find it. Maybe lots of people have written about this in Japanese but not many in English.

Japanese kana consists of two syllabaries, hiragana and katakana, that are like phonetic alphabets. Each has 46 basic characters, and each corresponds to a block of 96 Unicode characters. I had two simple questions:

How do the 46 characters map into the 96 characters?
Do they map the same way for both hiragana and katakana?

Hiragana / katakana correspondence

I’ll start with the second question because it’s easier. Hiragana and katakana are different ways of representing the same sounds, and they correspond one to one. For example, the full name of U+3047 (ぇ) is

HIRAGANA LETTER SMALL E

and the full name of its katakana counterpart U+30A7 (ェ) is

KATAKANA LETTER SMALL E

The only difference as far as Unicode goes is that katakana has three code points whose hiragana counterpart is unused, but these are not part of the basic letters.

The following Python code shows that the names of all the characters are the same except for the name of the system.

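    from unicodedata import name

    # Offsets from the start of each block whose hiragana code point
    # is unassigned: U+3040, U+3097, and U+3098.
    unused = [0x00, 0x57, 0x58]

    # Offsets up to 0x58 cover the letters; the marks after them differ.
    for i in range(0x59):
        if i in unused:
            continue
        h = name(chr(0x3040 + i))
        k = name(chr(0x30a0 + i))
        assert h == k.replace("KATAKANA", "HIRAGANA")
    print("done")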

Mapping 46 into 50 and 96

You’ll see kana written in a grid called a gojūon (五十音), with one side labeled with 5 vowels and the other with 10 consonants. That’s 50 cells, and in fact gojūon literally means 50 sounds, so how do we get 46? Five cells are empty because those sounds are unused or archaic, and one extra letter doesn’t fit the grid structure: 50 − 5 + 1 = 46.

In the image below, the table on the left is for hiragana and the table on the right is for katakana. HTML versions of the tables are available here.

Left out of each table is ん in hiragana and ン in katakana.

So how does each set of 46 characters map into its Unicode code block?

Unicode numbers the letters consecutively if you traverse the grid increasing vowels first, then consonants, and adding the straggler at the end. But the reason 46 letters expand into more code points is that each letter can have one, two, or three variations. And there are various miscellaneous other symbols in the Unicode block.

For example, there is a LETTER E as well as the LETTER SMALL E mentioned above. Other variations seem to correspond to voiced and unvoiced versions of a consonant with a phonetic marker added to the voiced version. For example, く is U+304F, HIRAGANA LETTER KU, and ぐ is U+3050, HIRAGANA LETTER GU.
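To see the variations directly, here is a short sketch that prints the names of three consecutive code points starting at は. The choice of this particular range is just for illustration.

```python
from unicodedata import name

# Entries in the "h" row of the grid are three code points apart because
# each plain letter is followed by its two marked variants.
for cp in range(0x306F, 0x3072):
    print(f"U+{cp:04X} {chr(cp)} {name(chr(cp))}")
```

This lists HA, BA, and PA in that order, which is why the h row advances by 3 per letter while the k row advances by 2.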

Here is how hiragana maps into Unicode. Each cell shows the hexadecimal value to add to U+3000.

         a  i  u  e  o 
        42 44 46 48 4A 
     k  4B 4D 4F 51 53 
     s  55 57 59 5B 5D 
     t  5F 61 64 66 68 
     n  6A 6B 6C 6D 6E 
     h  6F 72 75 78 7B 
     m  7E 7F 80 81 82 
     y  84    86    88 
     r  89 8A 8B 8C 8D 
     w  8F          92

The corresponding table for katakana is the previous table plus 0x60:

         a  i  u  e  o 
        A2 A4 A6 A8 AA 
     k  AB AD AF B1 B3 
     s  B5 B7 B9 BB BD 
     t  BF C1 C4 C6 C8 
     n  CA CB CC CD CE 
     h  CF D2 D5 D8 DB 
     m  DE DF E0 E1 E2 
     y  E4    E6    E8 
     r  E9 EA EB EC ED 
     w  EF          F2

In each case, the letter missing from the table is the next consecutive value after the last in the table, i.e. ん is U+3093 and ン is U+30F3.
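Since the two blocks line up, converting hiragana to katakana amounts to adding 0x60. Here is a minimal sketch; the function name hira_to_kata is my own, and it only shifts the letters U+3041 through U+3096, leaving marks and everything else untouched.

```python
def hira_to_kata(text):
    """Shift hiragana letters U+3041..U+3096 up by 0x60; leave the rest alone."""
    return "".join(
        chr(ord(ch) + 0x60) if 0x3041 <= ord(ch) <= 0x3096 else ch
        for ch in text
    )

print(hira_to_kata("ひらがな"))
```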

Dominoes in Unicode

Posted on 22 September 2022 by John

I was spelunking around in Unicode and found that there are assigned characters for representing domino tiles and that the characters are enumerated in a convenient order. Here is the code chart.

There are codes for representing tiles horizontally or vertically. And even though, for example, the 5-3 is the same domino as the 3-5, there are separate characters for representing the orientation of the tile: one for 3 on the left and one for 5 on the left.

When you include orientation like this, a domino becomes essentially a base 7 number: the number of spots on one end is the number of 7s and the number of spots on the other end is the number of 1s. And the order of the characters corresponds to the order as base 7 numbers:

0-0, 0-1, 0-2, …, 1-0, 1-1, 1-2, … 6-6.

The horizontal dominoes start with the double blank at U+1F031 and the vertical dominoes start with U+1F063, a difference of 32 in hexadecimal or 50 in base 10. So you can rotate a domino tile by adding or subtracting 50 to its code point.
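Here is a sketch of that rotation. The function name rotate and the constants are my own; the only facts used are the two starting code points and the count of 7 × 7 = 49 oriented tiles in each direction.

```python
HORIZONTAL_START = 0x1F031  # horizontal double blank
VERTICAL_START = 0x1F063    # vertical double blank
NUM_TILES = 49              # 7 * 7 oriented tiles in each direction

def rotate(tile):
    """Swap a domino character between horizontal and vertical."""
    cp = ord(tile)
    if HORIZONTAL_START <= cp < HORIZONTAL_START + NUM_TILES:
        return chr(cp + 50)
    if VERTICAL_START <= cp < VERTICAL_START + NUM_TILES:
        return chr(cp - 50)
    raise ValueError("not a domino face")
```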

The following tiny Python function gives the codepoint for the domino with a spots on the left (or top) and b spots on the right (or bottom).

    def code(a, b, wide):
        cp = 0x1f031 if wide else 0x1f063
        return cp + 7*a + b

We can use this function to print a (3, 5) tile horizontally and a (6, 4) tile vertically.

    print( chr(code(3, 5, True )),
           chr(code(6, 4, False)) )
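Going the other way is a matter of reading off the base 7 digits. Here is a sketch of an inverse; the name decode is hypothetical, and it returns the same (a, b, wide) triple that the function above takes.

```python
def decode(tile):
    """Return (a, b, wide) for a domino character."""
    cp = ord(tile)
    for start, wide in ((0x1F031, True), (0x1F063, False)):
        if start <= cp < start + 49:
            # The offset is a two-digit base 7 number: a sevens and b ones.
            a, b = divmod(cp - start, 7)
            return a, b, wide
    raise ValueError("not a domino face")
```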

To my surprise, my computer had the fonts installed to display the results. This isn’t guaranteed for such high Unicode values.

horizontal 3-5 domino and vertical 6-4

Letter-like Unicode symbols

Posted on 16 June 2022 by John

Unicode provides a way to distinguish symbols that look alike but have different meanings.

We can illustrate this with the following Python code.

    import unicodedata as u

    # Escapes keep each look-alike distinct even if the file gets normalized.
    for pair in [('K', '\u212a'), ('\u03a9', '\u2126'), ('\u2135', '\u05d0')]:
        for c in pair:
            print(format(ord(c), '4X'), u.bidirectional(c), u.name(c))

This produces

      4B L LATIN CAPITAL LETTER K
    212A L KELVIN SIGN
     3A9 L GREEK CAPITAL LETTER OMEGA
    2126 L OHM SIGN
    2135 L ALEF SYMBOL
     5D0 R HEBREW LETTER ALEF

Even though K (U+004B) and the Kelvin sign (U+212A) look similar, the former is a Roman letter and the latter is a symbol for temperature in Kelvin. Similarly, the Greek letter omega (U+03A9) and the ohm sign (U+2126) are semantically different even though they look alike.

Or rather, they probably look similar. A font may or may not use different glyphs for different code points. The editor I’m using to write this post uses a font that makes no difference between ohm and omega. The letter K and the Kelvin symbol are slightly different if I look very closely. The two alefs appear substantially different.

Note that the mathematical symbol alef is a left-to-right character and the Hebrew letter alef is a right-to-left character. The former could be useful to tell a word processor “This isn’t really a Hebrew letter; it’s a symbol that looks like a Hebrew letter. Don’t change the direction of my text or switch any language-sensitive features like spell checking.”

These letter-like symbols can be used to provide semantic information, but they can also be used to deceive. For example, a malicious website could change a K in a URL to a Kelvin sign.
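One rough way to spot such substitutions is to check whether Unicode normalization changes the text, since NFKC maps the Kelvin sign, the ohm sign, and the alef symbol back to the ordinary letters. This is only a sketch: the function name is my own, and the check would not catch look-alikes from different scripts, such as a Cyrillic letter standing in for a Latin one.

```python
import unicodedata

def changes_under_nfkc(text):
    """True if NFKC normalization would alter the text."""
    return unicodedata.normalize("NFKC", text) != text

print(changes_under_nfkc("OK"))       # plain ASCII letters
print(changes_under_nfkc("O\u212a"))  # second character is KELVIN SIGN
```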

Corner quotes in Unicode

Posted on 8 January 2022 by John

In his book Mastering Regular Expressions, Jeffrey Friedl uses corner quotes to delimit regular expressions. Here’s an example I found by opening his book at random:

    ⌜(\.\d\d[1-9]?)\d*⌟

The upper-left corner at the beginning and the lower-right corner at the end are not part of the regular expression. This particularly comes in handy if a regular expression begins or ends with white space.

(It wouldn’t do to, say, use quotation marks because this would invite confusion between the regular expression itself and a quoted string used to express that regular expression in a programming language.)

I’ve thought about using Friedl’s convention but I didn’t think it could be done with plain text. It can, using Unicode character U+231C at the beginning and U+231F at the end.
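Here is a minimal sketch of the convention in code, using the top left and bottom right corners to match the example above. The helper names show_regex and strip_corners are hypothetical.

```python
TOP_LEFT = "\u231c"      # TOP LEFT CORNER
BOTTOM_RIGHT = "\u231f"  # BOTTOM RIGHT CORNER

def show_regex(pattern):
    """Wrap a pattern in corner quotes for display."""
    return TOP_LEFT + pattern + BOTTOM_RIGHT

def strip_corners(display):
    """Remove the corner quotes again, if both are present."""
    if display.startswith(TOP_LEFT) and display.endswith(BOTTOM_RIGHT):
        return display[1:-1]
    return display

# The trailing space in this pattern stays visible inside the corners.
print(show_regex(r"\d+ "))
```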

There are four corner quotes:

    |------+--------+---------------------|
    | Char | Code   | Name                |
    |------+--------+---------------------|
    | ⌜    | U+231C | TOP LEFT CORNER     |
    | ⌝    | U+231D | TOP RIGHT CORNER    |
    | ⌞    | U+231E | BOTTOM LEFT CORNER  |
    | ⌟    | U+231F | BOTTOM RIGHT CORNER |
    |------+--------+---------------------|

Corner quotes are also used in logic to denote Gödel numbers, e.g. ⌜φ⌝ denotes the Gödel number for φ.

Corner quotes are also known as Quine quotes. They usually come in the pair top left and top right, rather than top left and bottom right as in Friedl’s usage.

Update: As Rob Wells points out in the comments, it seems Friedl used CJK quote marks 「 (U+300C) and 」 (U+300D) rather than the corner quotes, which makes sense given that Friedl speaks Japanese.

Fractions in Unicode

Posted on 2 November 2021 by John

There are Unicode characters for a few fractions, such as ½. This looks a little better than 1/2, depending on the context.

Here’s the Taylor series for log(1 + x) written in pure HTML:

log(1 + x) = x − ½x² + ⅓x³ − ¼x⁴ + ⅕x⁵ − ⋯

See this post for how the exponents were made.

Notice that the three dots ⋯ on the end are centered vertically, like \cdots in LaTeX. This was done with &ctdot; (U+22EF).

Available fractions

The selection of available fraction number forms is small and a little strange.

There are characters for fractions with denominator d equal to 2, 3, 4, 5, 6, and 8, with numerators 1 through d − 1, except for fractions that can be reduced.

If d = 7, 9, or 10, there’s a character for 1/d but not for fractions with numerators other than 1. For example, there is a character for ⅐ but not for 2/7.
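The following sketch builds a lookup table from the Unicode character database and falls back to plain n/d when no single character exists. The names fraction_table, TABLE, and vulgar are my own, and the scan is limited to U+00BC through U+00BE and U+2150 through U+215E, which is where I expect these characters to live.

```python
import unicodedata
from fractions import Fraction

def fraction_table():
    """Map each fraction value to its single-character form."""
    table = {}
    # U+00BC..U+00BE in Latin-1 Supplement, U+2150..U+215E in Number Forms.
    for cp in list(range(0x00BC, 0x00BF)) + list(range(0x2150, 0x215F)):
        ch = chr(cp)
        if unicodedata.name(ch, "").startswith("VULGAR FRACTION"):
            # numeric() returns a float, so recover the exact small fraction.
            value = Fraction(unicodedata.numeric(ch)).limit_denominator(10)
            table[value] = ch
    return table

TABLE = fraction_table()

def vulgar(n, d):
    """Return the fraction character for n/d, or plain 'n/d' if none exists."""
    return TABLE.get(Fraction(n, d), f"{n}/{d}")

print(vulgar(1, 2), vulgar(3, 5), vulgar(1, 7), vulgar(2, 7))
```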

HTML Entities

For denominators 2, 3, 4, 5, 6, and 8 the HTML entity for characters is easy: they all have the form

& frac <n> <d> ;

where n is the numerator and d is the denominator. For example, &frac35; is the HTML entity for ⅗.

There are no HTML entities for 1/7, 1/9, or 1/10.
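Python’s standard html module knows the HTML5 named entities, so it offers a quick way to check these claims. This is a small sketch; frac_entity is a hypothetical helper that simply builds the entity name from the pattern above.

```python
import html
from html.entities import html5

def frac_entity(n, d):
    """Build an entity of the form &frac<n><d>; without checking it exists."""
    return f"&frac{n}{d};"

print(html.unescape(frac_entity(3, 5)))  # decodes to the three-fifths character
print("frac17;" in html5)                # is there an entity for one seventh?
```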

Number sets in HTML and Unicode

Posted on 1 November 2021 by John

When I started blogging I was very cautious about what characters I used because browsers often didn’t have font support for uncommon characters. Things have changed since then and I’ve gotten less cautious. Nobody has complained, so I assume readers are seeing the characters I intend them to see.

There are Unicode characters for sets of numbers such as the integers and the real numbers, double-struck letters similar to the blackboard bold letters \mathbb{Z} etc. in LaTeX.

$\mathbb{N} \quad \mathbb{Z} \quad \mathbb{Q} \quad \mathbb{R} \quad \mathbb{C} \quad \mathbb{H}$

Here’s a table of the characters, their Unicode values, and two HTML entities associated with each.

    ℕ U+2115 &Nopf; &naturals;
    ℤ U+2124 &Zopf; &integers;
    ℚ U+211A &Qopf; &rationals;
    ℝ U+211D &Ropf; &reals;
    ℂ U+2102 &Copf; &complexes;
    ℍ U+210D &Hopf; &quaternions;

If you’re going to use these symbols, you will likely also need to use ∈ (U+2208, &in;) and ∉ (U+2209, &notin;).
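As a quick check of the table, here is a sketch that decodes a few of these entities with Python’s html.unescape and prints the resulting code points. The particular entities chosen are just a sample.

```python
import html

for entity in ["&Zopf;", "&integers;", "&Ropf;", "&reals;", "&notin;"]:
    ch = html.unescape(entity)
    print(entity, ch, f"U+{ord(ch):04X}")
```

Both entities in each pair decode to the same character, so which one to use is a matter of taste.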

More letters

If you want more letters in the style of those above, you can find them starting at U+1D538 for 𝔸. However, the characters corresponding to letters above are reserved.

So for example, 𝔸 is U+1D538, 𝔹 is U+1D539, but U+1D53A is reserved and you must use ℂ (U+2102) instead.

One letter not mentioned above is ℙ (U+2119). It has HTML entities &Popf; and &primes;.

So the double-struck versions of C, H, N, P, Q, R, and Z are down in the BMP (Basic Multilingual Plane) and the rest are in the SMP (Supplementary Multilingual Plane). I suspect characters in the SMP are less likely to have font support, but that may not be a problem.
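Here is a sketch of a function that handles the split between the two planes. The name double_struck and the exception table are my own; the table simply lists the seven BMP characters mentioned above.

```python
# Double-struck capitals that live in the BMP; their SMP slots are reserved.
BMP_EXCEPTIONS = {
    "C": "\u2102", "H": "\u210d", "N": "\u2115", "P": "\u2119",
    "Q": "\u211a", "R": "\u211d", "Z": "\u2124",
}

def double_struck(letter):
    """Return the double-struck form of a capital letter A-Z."""
    if len(letter) != 1 or not "A" <= letter <= "Z":
        raise ValueError("expected a capital letter A-Z")
    if letter in BMP_EXCEPTIONS:
        return BMP_EXCEPTIONS[letter]
    # Everything else sits at a fixed offset from U+1D538.
    return chr(0x1D538 + ord(letter) - ord("A"))
```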

Unicode superscripts and subscripts

Posted on 31 October 2021 by John

There are alternatives to using <sup> and <sub> tags for superscripts and subscripts in HTML. These alternatives may look better, depending on context, and they can be used in plain (Unicode) text files where HTML markup isn’t available.

Superscripts

When I started blogging I would use <sup>2</sup> and <sup>3</sup> for squares and cubes. Then somewhere along the way someone told me about ² (U+00B2) and ³ (U+00B3) and I started using these. The superscript characters generally produce slightly smaller superscripts and look nicer in my opinion.

Example with sup tags:

a<sup>2</sup> + b<sup>2</sup> = c<sup>2</sup>

Example with superscript characters:

a² + b² = c²

But there are no characters for exponents larger than 3. Or so I thought until recently.

There are no HTML entities for exponents larger than 3, nothing written in notation analogous to &sup2; and &sup3;. There are Unicode characters for other superscripts up to 9, but they don’t have corresponding HTML entities.

The Unicode code point for superscript n is

2070_hex + n

except for n = 1, 2, or 3, which use the older characters ¹ (U+00B9), ² (U+00B2), and ³ (U+00B3). For example, U+2075 is a superscript 5. So you could write x⁵ as

x&#x2075;.
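Here is a sketch of the rule as a function. The name superscript is my own, and note that superscript 1 needs the same special handling as 2 and 3, since ¹ is U+00B9 rather than U+2071.

```python
def superscript(n):
    """Return the superscript character for a single digit n."""
    if not 0 <= n <= 9:
        raise ValueError("expected a digit from 0 to 9")
    # 1, 2, and 3 were inherited from Latin-1 and sit outside the U+207x run.
    special = {1: "\u00b9", 2: "\u00b2", 3: "\u00b3"}
    return special.get(n, chr(0x2070 + n))

print("x" + superscript(5))
```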

Subscripts

There are also Unicode symbols for subscripts, though they don’t have corresponding HTML entities. The Unicode code point for subscript n = 0, 1, 2, … 9 is

2080_hex + n

For example, U+2087 is a subscript 7.
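Because the subscript digits are consecutive with no exceptions, a translation table handles multi-digit subscripts in one step. This is a sketch; the names SUBSCRIPTS and subscript are my own.

```python
# Map each ASCII digit to the subscript digit at U+2080 + n.
SUBSCRIPTS = str.maketrans(
    "0123456789", "".join(chr(0x2080 + n) for n in range(10))
)

def subscript(number):
    """Render a non-negative integer using subscript digits."""
    return str(number).translate(SUBSCRIPTS)

print("x" + subscript(7), "x" + subscript(12))
```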

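The subscript characters don’t extend below the baseline as far as subscripts in <sub> tags do. Here are x’s with subscripts in <sub> tags.

x<sub>0</sub>, x<sub>1</sub>, x<sub>2</sub>, x<sub>3</sub>, x<sub>4</sub>, x<sub>5</sub>, x<sub>6</sub>, x<sub>7</sub>, x<sub>8</sub>, x<sub>9</sub>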

And here are the single character subscripts.

x₀, x₁, x₂, x₃, x₄, x₅, x₆, x₇, x₈, x₉

I think the former looks better, but subscripts in HTML paragraphs may increase the vertical spacing between lines. If consistent line spacing is more important than conventional subscripts, you might prefer the subscript characters.

Planetary code golf

Posted on 18 September 2021 by John

Suppose you’re asked to write a function that takes a number and returns a planet. We’ll number the planets in order from the sun, starting at 1, and for our purposes Pluto is the 9th planet.

Here’s an obvious solution:

    def planet(n):
        planets = [
            "Mercury",
            "Venus",
            "Earth",
            "Mars",
            "Jupiter",
            "Saturn",
            "Uranus",
            "Neptune",
            "Pluto"
        ]
        # subtract 1 to correct for unit offset
        return planets[n-1]

Now let’s have some fun with this and play a little code golf. Here’s a much shorter program.

    def planet(n): return chr(0x263e+n)

I was deliberately ambiguous at the top of the post, saying the code should return a planet. The first function naturally interprets that to mean the name of a planet, but the second function takes it to mean the symbol of a planet.

The symbol for Mercury has Unicode value U+263F, and Unicode lists the symbols for the planets in order from the sun. If we numbered the planets starting from 0, the Unicode character for the nth planet would be the character for Mercury plus n. But conventionally we call Mercury the 1st planet, not the 0th planet, so we have to subtract one. That’s why the code contains 0x263e rather than 0x263f.

We could make the function just a tiny bit shorter by using a lambda expression and using a decimal number for 0x263e.

    planet = lambda n: chr(9790+n)

Display issues

Here’s a brief way to list the symbols from the Python REPL.

    >>> [chr(0x263f+n) for n in range(9)]
    ['☿', '♀', '♁', '♂', '♃', '♄', '♅', '♆', '♇']

You may see the symbols for Venus and Mars above rendered with a colored background, depending on your browser and OS.

Here’s what the lines above looked like at my command line.

Here’s what the symbols looked like when I posted them on Twitter.

For reasons explained here, some software interprets some characters as emoji rather than literal characters. The symbols for Venus and Mars are used as emoji for female and male respectively, based on the use of the symbols in biology. Incidentally, these symbols were used to represent planets long before biologists used them to represent sexes.
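The official Unicode names reflect this double duty. Here is a short sketch that prints the name of each symbol in the run starting at U+263F.

```python
from unicodedata import name

for n in range(9):
    ch = chr(0x263F + n)
    print(f"U+{ord(ch):04X}", name(ch))
```

The entries for Venus and Mars come out as FEMALE SIGN and MALE SIGN, while the others carry planet names such as MERCURY and JUPITER.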
