Arabic numerals and numerals that are Arabic

The characters 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 are called Arabic numerals, but there are a lot of other numerals that are Arabic.

I discovered this when reading the documentation on Perl regular expressions, perlre. Here’s the excerpt from that page that caught my eye.

Many scripts have their own sets of digits equivalent to the Western 0 through 9 ones. A few, such as Arabic, have more than one set. For a string to be considered a script run, all digits in it must come from the same set of ten, as determined by the first digit encountered.

Emphasis added.
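The digit rule in that excerpt can be approximated in a few lines of Python. This is only a sketch of the "same set of ten" condition, not of Perl's full script run logic, and the helper names `digit_zero` and `digits_from_one_set` are made up for illustration. It relies on the fact that each set of decimal digits occupies ten consecutive code points, so subtracting a digit's value from its code point gives the zero of its set.

```python
import unicodedata

def digit_zero(ch):
    """Code point of the zero in ch's set of ten, or None if ch is not a decimal digit."""
    if not ch.isdecimal():
        return None
    return ord(ch) - unicodedata.decimal(ch)

def digits_from_one_set(s):
    """True if all decimal digits in s come from a single set of ten."""
    zeros = {digit_zero(ch) for ch in s if ch.isdecimal()}
    return len(zeros) <= 1

print(digits_from_one_set("2024"))          # ASCII digits only
print(digits_from_one_set("\u0661\u0662"))  # ARABIC-INDIC DIGIT ONE, TWO
print(digits_from_one_set("\u0661\u06F2"))  # ARABIC-INDIC ONE, EXTENDED ARABIC-INDIC TWO
```

The last call mixes the two Arabic digit sets, which is exactly the situation the documentation describes.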

I took some code I’d written for previous posts on Unicode numbers and modified it to search the range of Arabic Unicode characters and report all characters that represent 0 through 9.

    from unicodedata import numeric, name

    # Unicode blocks that contain Arabic-script characters
    a = set(range(0x00600, 0x006FF+1)) | \
        set(range(0x00750, 0x0077F+1)) | \
        set(range(0x008A0, 0x008FF+1)) | \
        set(range(0x00870, 0x0089F+1)) | \
        set(range(0x0FB50, 0x0FDFF+1)) | \
        set(range(0x0FE70, 0x0FEFF+1)) | \
        set(range(0x10EC0, 0x10EFF+1)) | \
        set(range(0x1EE00, 0x1EEFF+1)) | \
        set(range(0x1EC70, 0x1ECBF+1)) | \
        set(range(0x1ED00, 0x1ED4F+1)) | \
        set(range(0x10E60, 0x10E7F+1))

    def uni(i):
        return "U+" + format(i, "X")

    # Write to a UTF-8 file; the with block closes it when done
    with open('digits.txt', 'w', encoding='utf8') as f:
        for i in sorted(a):
            ch = chr(i)
            # Keep only characters whose numeric value is 0 through 9
            if ch.isnumeric() and numeric(ch) in range(10):
                print(ch, uni(i), numeric(ch), name(ch), file=f)

Apparently there are two ways to write 0, eight ways to write 2, and seven ways to write 1, 3, 4, 5, 6, 7, 8, and 9. I’ll include the full results at the bottom of the post.
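The counts can be tallied rather than read off the list by eye. Here is a minimal sketch that reuses the block ranges from the script above and counts how many characters carry each value. The exact numbers depend on the version of the Unicode database bundled with your Python, so none are hard-coded here.

```python
from collections import Counter
from unicodedata import numeric

# Same Arabic-related block ranges as in the script above
blocks = [
    (0x0600, 0x06FF), (0x0750, 0x077F), (0x0870, 0x089F), (0x08A0, 0x08FF),
    (0xFB50, 0xFDFF), (0xFE70, 0xFEFF), (0x10E60, 0x10E7F), (0x10EC0, 0x10EFF),
    (0x1EC70, 0x1ECBF), (0x1ED00, 0x1ED4F), (0x1EE00, 0x1EEFF),
]

counts = Counter()
for lo, hi in blocks:
    for i in range(lo, hi + 1):
        ch = chr(i)
        if ch.isnumeric() and numeric(ch) in range(10):
            counts[int(numeric(ch))] += 1

# One line per digit value: the value and the number of characters that have it
for value in range(10):
    print(value, counts[value])
```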

I first wrote my Python script to write to the command line and redirected the output to a file. This resulted in some of the Arabic characters being replaced with a blank or with 0. Then I changed the script as above to write to a file opened to receive UTF-8 text. All the characters were preserved, though I can’t see most of them because the font my editor is using doesn’t have glyphs for the characters outside the BMP (i.e. those with Unicode values above 0xFFFF).

٠ U+660 0.0 ARABIC-INDIC DIGIT ZERO
١ U+661 1.0 ARABIC-INDIC DIGIT ONE
٢ U+662 2.0 ARABIC-INDIC DIGIT TWO
٣ U+663 3.0 ARABIC-INDIC DIGIT THREE
٤ U+664 4.0 ARABIC-INDIC DIGIT FOUR
٥ U+665 5.0 ARABIC-INDIC DIGIT FIVE
٦ U+666 6.0 ARABIC-INDIC DIGIT SIX
٧ U+667 7.0 ARABIC-INDIC DIGIT SEVEN
٨ U+668 8.0 ARABIC-INDIC DIGIT EIGHT
٩ U+669 9.0 ARABIC-INDIC DIGIT NINE
۰ U+6F0 0.0 EXTENDED ARABIC-INDIC DIGIT ZERO
۱ U+6F1 1.0 EXTENDED ARABIC-INDIC DIGIT ONE
۲ U+6F2 2.0 EXTENDED ARABIC-INDIC DIGIT TWO
۳ U+6F3 3.0 EXTENDED ARABIC-INDIC DIGIT THREE
۴ U+6F4 4.0 EXTENDED ARABIC-INDIC DIGIT FOUR
۵ U+6F5 5.0 EXTENDED ARABIC-INDIC DIGIT FIVE
۶ U+6F6 6.0 EXTENDED ARABIC-INDIC DIGIT SIX
۷ U+6F7 7.0 EXTENDED ARABIC-INDIC DIGIT SEVEN
۸ U+6F8 8.0 EXTENDED ARABIC-INDIC DIGIT EIGHT
۹ U+6F9 9.0 EXTENDED ARABIC-INDIC DIGIT NINE
 U+10E60 1.0 RUMI DIGIT ONE
 U+10E61 2.0 RUMI DIGIT TWO
 U+10E62 3.0 RUMI DIGIT THREE
 U+10E63 4.0 RUMI DIGIT FOUR
 U+10E64 5.0 RUMI DIGIT FIVE
 U+10E65 6.0 RUMI DIGIT SIX
 U+10E66 7.0 RUMI DIGIT SEVEN
 U+10E67 8.0 RUMI DIGIT EIGHT
 U+10E68 9.0 RUMI DIGIT NINE
 U+1EC71 1.0 INDIC SIYAQ NUMBER ONE
 U+1EC72 2.0 INDIC SIYAQ NUMBER TWO
 U+1EC73 3.0 INDIC SIYAQ NUMBER THREE
 U+1EC74 4.0 INDIC SIYAQ NUMBER FOUR
 U+1EC75 5.0 INDIC SIYAQ NUMBER FIVE
 U+1EC76 6.0 INDIC SIYAQ NUMBER SIX
 U+1EC77 7.0 INDIC SIYAQ NUMBER SEVEN
 U+1EC78 8.0 INDIC SIYAQ NUMBER EIGHT
 U+1EC79 9.0 INDIC SIYAQ NUMBER NINE
 U+1ECA3 1.0 INDIC SIYAQ NUMBER PREFIXED ONE
 U+1ECA4 2.0 INDIC SIYAQ NUMBER PREFIXED TWO
 U+1ECA5 3.0 INDIC SIYAQ NUMBER PREFIXED THREE
 U+1ECA6 4.0 INDIC SIYAQ NUMBER PREFIXED FOUR
 U+1ECA7 5.0 INDIC SIYAQ NUMBER PREFIXED FIVE
 U+1ECA8 6.0 INDIC SIYAQ NUMBER PREFIXED SIX
 U+1ECA9 7.0 INDIC SIYAQ NUMBER PREFIXED SEVEN
 U+1ECAA 8.0 INDIC SIYAQ NUMBER PREFIXED EIGHT
 U+1ECAB 9.0 INDIC SIYAQ NUMBER PREFIXED NINE
 U+1ECB1 1.0 INDIC SIYAQ NUMBER ALTERNATE ONE
 U+1ECB2 2.0 INDIC SIYAQ NUMBER ALTERNATE TWO
 U+1ED01 1.0 OTTOMAN SIYAQ NUMBER ONE
 U+1ED02 2.0 OTTOMAN SIYAQ NUMBER TWO
 U+1ED03 3.0 OTTOMAN SIYAQ NUMBER THREE
 U+1ED04 4.0 OTTOMAN SIYAQ NUMBER FOUR
 U+1ED05 5.0 OTTOMAN SIYAQ NUMBER FIVE
 U+1ED06 6.0 OTTOMAN SIYAQ NUMBER SIX
 U+1ED07 7.0 OTTOMAN SIYAQ NUMBER SEVEN
 U+1ED08 8.0 OTTOMAN SIYAQ NUMBER EIGHT
 U+1ED09 9.0 OTTOMAN SIYAQ NUMBER NINE
 U+1ED2F 2.0 OTTOMAN SIYAQ ALTERNATE NUMBER TWO
 U+1ED30 3.0 OTTOMAN SIYAQ ALTERNATE NUMBER THREE
 U+1ED31 4.0 OTTOMAN SIYAQ ALTERNATE NUMBER FOUR
 U+1ED32 5.0 OTTOMAN SIYAQ ALTERNATE NUMBER FIVE
 U+1ED33 6.0 OTTOMAN SIYAQ ALTERNATE NUMBER SIX
 U+1ED34 7.0 OTTOMAN SIYAQ ALTERNATE NUMBER SEVEN
 U+1ED35 8.0 OTTOMAN SIYAQ ALTERNATE NUMBER EIGHT
 U+1ED36 9.0 OTTOMAN SIYAQ ALTERNATE NUMBER NINE

Writing math with Unicode

A LaTeX document looks better than an HTML document, but an HTML document looks better than an awkward hybrid of HTML and inline images created by LaTeX.

My rule is to only use LaTeX-generated images for displayed equations and not for math symbols in the middle of a sentence. This works pretty well, but it’s less than ideal. I use HTML for displayed equations too when I can. Over time I’ve learned how to do more in HTML, and browser support for special characters has improved [1].

My personal prohibition on inline images requires saying in words what I would say in symbols if I were writing LaTeX. For example, in the context of complex variables I have written “z conjugate” rather than put a conjugate bar over z.

There’s a way to fix the particular problem of typesetting conjugates: the Unicode character U+0305 will put a bar (i.e. conjugation symbol) over a character. For example:

If z = a + bi then z̅ = a − bi.

Here’s the HTML code for the sentence above:

    If <em>z</em> = <em>a</em> + <em>b</em>&#x200a;<em>i</em> 
    then <em>z&#x0305;</em> = <em>a</em> &#x2212; <em>b</em>&#x200a;<em>i</em>.

Note that the character U+0305, written in HTML as &#x0305;, goes inside the em tags. A couple other things: I put a hair space (U+200A) between b and i, and I used a minus sign (U+2212) rather than a hyphen.
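If you build such markup programmatically, a small helper can produce the character references. The following is a sketch only, and `html_entity` is a hypothetical name rather than anything from a library. It also uses the standard `unicodedata` module to confirm that U+0305 is a combining mark, which is why it has to come right after the letter it decorates.

```python
import unicodedata

def html_entity(ch):
    """Hexadecimal HTML character reference for a single character."""
    return f"&#x{ord(ch):04x};"

overline = "\u0305"
print(unicodedata.name(overline))       # official name of U+0305
print(unicodedata.combining(overline))  # nonzero for combining marks
print("z" + overline)                   # base letter followed by the mark
print(html_entity(overline), html_entity("\u200a"), html_entity("\u2212"))
```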

I normally just use a hyphen for a minus sign when I’m blogging about math, but sometimes this doesn’t look right. For example, yesterday’s post about fractional factorial designs had three tables filled with plus signs and minus signs. I first used a hyphen, but that didn’t look right because it was too narrow to visually pair with the plus signs.

Just as you can use U+0305 to put a bar on top of a character, you can use U+20D7 to put a vector on top. For example,

x⃗ = (x₁, x₂, x₃).

This was created with

    <em>x&#x20d7;</em> = (<em>x</em>&#x2081;, 
    <em>x</em>&#x2082;, <em>x</em>&#x2083;).

Here I used the Unicode characters for subscript 1, subscript 2, and subscript 3. Sometimes these look better than <sub>1</sub> etc, but not always. Here’s the equation for x⃗ using sub tags:

x⃗ = (x1, x2, x3).
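The Unicode subscript digits sit at U+2080 through U+2089, so converting an index to subscript form is a simple translation. This is a minimal sketch; `subscript` is an illustrative helper name, and how good the result looks still depends on the reader's fonts.

```python
# Map ASCII digits to SUBSCRIPT ZERO (U+2080) through SUBSCRIPT NINE (U+2089)
SUBSCRIPTS = str.maketrans("0123456789", "".join(chr(0x2080 + d) for d in range(10)))

def subscript(n):
    """Render a non-negative integer with Unicode subscript digits."""
    return str(n).translate(SUBSCRIPTS)

vec_x = "x\u20d7"  # x followed by COMBINING RIGHT ARROW ABOVE
components = ", ".join("x" + subscript(i) for i in range(1, 4))
print(f"{vec_x} = ({components}).")
```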

Unicode typically has all the symbols you need to write mathematics. You can use this page to find the Unicode counterpart to most LaTeX symbols. But text is inherently linear, and you need more than text to lay out typesetting in two dimensions.

Update: Looks like I spoke too soon. The tricks presented here work well on Linux and Mac but not on Windows. Some readers are saying the vector symbol is missing on Windows. On my Windows laptop the bar and vector appear but are not centered over the intended character.

[1] When I started blogging you couldn’t count on browsers having font support for all the mathematical symbols you might want to use. (This post summarizes my experience as of nine years ago.) Now I believe you can, especially for any symbol in the BMP (Basic Multilingual Plane, code points U+0000 through U+FFFF). I haven’t gotten feedback from anyone saying they’re missing symbols that I use.

Ideograph numerals

This post is a follow-up to my previous post on Unicode numbers. I always welcome feedback from readers, but I especially welcome it here because I’m walking into an area I know next to nothing about.

Consecutive code points

Unicode generally assigns code points to number-like things in consecutive order. For example, the Python code

    for n in range(1,10):
        print(chr(0x30+n), chr(0x24f4+n), chr(0x215f+n))

prints

    1 ⓵ Ⅰ
    2 ⓶ Ⅱ
    3 ⓷ Ⅲ
    4 ⓸ Ⅳ
    5 ⓹ Ⅴ
    6 ⓺ Ⅵ
    7 ⓻ Ⅶ
    8 ⓼ Ⅷ
    9 ⓽ Ⅸ

showing that ASCII digits, double circled digits, and Roman numerals are encoded consecutively.

Parenthesized and circled ideographs

So the same is probably true for ideographs representing digits, right?

一 二 三 四 五 六 七 八 九 ㈠ ㈡ ㈢ ㈣ ㈤ ㈥ ㈦ ㈧ ㈨ ㊀ ㊁ ㊂ ㊃ ㊄ ㊅ ㊆ ㊇ ㊈

No, but before we get into that, the following code shows that parenthesized ideographs and circled ideographs for digits are numbered consecutively. The code

    from unicodedata import numeric, name

    for i in range(1, 10):
        cp = 0x321f + i
        ch = chr(cp)
        print(ch, hex(cp), numeric(ch), name(ch))
    
    for i in range(1, 10):
        cp = 0x327f + i
        ch = chr(cp)
        print(ch, hex(cp), numeric(ch), name(ch))    

prints

    ㈠ 0x3220 1.0 PARENTHESIZED IDEOGRAPH ONE
    ㈡ 0x3221 2.0 PARENTHESIZED IDEOGRAPH TWO
    ㈢ 0x3222 3.0 PARENTHESIZED IDEOGRAPH THREE
    ㈣ 0x3223 4.0 PARENTHESIZED IDEOGRAPH FOUR
    ㈤ 0x3224 5.0 PARENTHESIZED IDEOGRAPH FIVE
    ㈥ 0x3225 6.0 PARENTHESIZED IDEOGRAPH SIX
    ㈦ 0x3226 7.0 PARENTHESIZED IDEOGRAPH SEVEN
    ㈧ 0x3227 8.0 PARENTHESIZED IDEOGRAPH EIGHT
    ㈨ 0x3228 9.0 PARENTHESIZED IDEOGRAPH NINE
    ㊀ 0x3280 1.0 CIRCLED IDEOGRAPH ONE
    ㊁ 0x3281 2.0 CIRCLED IDEOGRAPH TWO
    ㊂ 0x3282 3.0 CIRCLED IDEOGRAPH THREE
    ㊃ 0x3283 4.0 CIRCLED IDEOGRAPH FOUR
    ㊄ 0x3284 5.0 CIRCLED IDEOGRAPH FIVE
    ㊅ 0x3285 6.0 CIRCLED IDEOGRAPH SIX
    ㊆ 0x3286 7.0 CIRCLED IDEOGRAPH SEVEN
    ㊇ 0x3287 8.0 CIRCLED IDEOGRAPH EIGHT
    ㊈ 0x3288 9.0 CIRCLED IDEOGRAPH NINE

CJK Unified Ideographs

Now let’s take the parentheses and circles off.

The following code shows that the CJK unified ideographs for digits are not digits (!) according to Unicode, but they are numeric. It also shows that their code points are not assigned in any apparent order.

    numerals = "一二三四五六七八九十"
    for n in numerals:
        print(n, hex(ord(n)), n.isdigit(), numeric(n))

This outputs the following.

    一 0x4e00 False 1.0
    二 0x4e8c False 2.0
    三 0x4e09 False 3.0
    四 0x56db False 4.0
    五 0x4e94 False 5.0
    六 0x516d False 6.0
    七 0x4e03 False 7.0
    八 0x516b False 8.0
    九 0x4e5d False 9.0
    十 0x5341 False 10.0
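One way to see how far the code point order is from the numerical order is to sort the same ten characters by code point and look at the values that come out. A small sketch using only the standard library:

```python
from unicodedata import numeric

numerals = "一二三四五六七八九十"

# Sort the same ten characters by code point rather than by value
by_code_point = sorted(numerals, key=ord)
values = [int(numeric(ch)) for ch in by_code_point]

print("".join(by_code_point))
print(values)
```

If code points followed numeric value, the second line would simply count from 1 to 10.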

I assume the ordering of ideographs in Unicode has its own internal logic (with exceptions and historical quirks) that I know nothing about. If anyone knows of any patterns of how code points are assigned to ideographs, please let me know.

The names of the characters above say nothing about what the characters mean. For example, the official Unicode name for 九 (U+4E5D) is CJK UNIFIED IDEOGRAPH-4E5D. The name says nothing about the ideograph representing the digit 9, though the numeric property of the character is indeed 9. My guess is that when that character represents a digit, it represents 9, but maybe it can mean other things in other contexts.

Unicode numbers

There are 10 digits in ASCII, and I bet you can guess what they are. In ASCII, a digit is a decimal is a number.

Things are much wilder in Unicode. There are hundreds of decimals, digits, and numeric characters, and they’re different sets.

 𝟠 ꩓ ٦ ³ ⓶ ₅ ⅕ Ⅷ ㊈

The following Python code loops through all possible Unicode characters, extracting the set of decimals, digits, and numbers.

    numbers  = set()
    decimals = set() 
    digits   = set()

    for i in range(1, 0x110000):
        ch = chr(i)
        if ch.isdigit():
            digits.add(ch)
        if ch.isdecimal():
            decimals.add(ch)
        if ch.isnumeric():
            numbers.add(ch)

These sets are larger than you may expect. The code

    print(len(decimals), len(digits), len(numbers))

tells us that the sizes of the three sets are 650, 778, and 1862 respectively.

The following code verifies that decimals are a proper subset of digits and that digits are a proper subset of numerical characters.

    assert(decimals < digits < numbers)
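Because the three sets are nested, each character can be labeled with the narrowest set it belongs to. Here is a minimal sketch; `classify` is just an illustrative helper built on the same three string methods.

```python
def classify(ch):
    """Narrowest of Python's three numeric categories that ch belongs to."""
    if ch.isdecimal():
        return "decimal"
    if ch.isdigit():
        return "digit"
    if ch.isnumeric():
        return "numeric"
    return "none"

# ASCII 7, Arabic-Indic 6, superscript 3, subscript 5, one fifth, Roman VIII, letter x
for ch in "7\u0666\u00b3\u2085\u2155\u2167x":
    print(f"U+{ord(ch):04X}", classify(ch))
```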

Now let’s look at the characters in the image above. The following code describes what each character is and how it is classified. The first three characters are decimals, the next three are digits but not decimals, and the last three are numeric but not digits.

    from unicodedata import name
    for c in "\U0001D7E0꩓٦":
        print(name(c))
        assert(c.isdecimal())
    for c in "³⓶₅":
        print(name(c))    
        assert(c.isdigit() and not c.isdecimal())
    for c in "⅕Ⅷ㊈":
        print(name(c))    
        assert(c.isnumeric() and not c.isdigit())

The names of the characters are

  1. MATHEMATICAL DOUBLE-STRUCK DIGIT EIGHT
  2. CHAM DIGIT THREE
  3. ARABIC-INDIC DIGIT SIX
  4. SUPERSCRIPT THREE
  5. DOUBLE CIRCLED DIGIT TWO
  6. SUBSCRIPT FIVE
  7. VULGAR FRACTION ONE FIFTH
  8. ROMAN NUMERAL EIGHT
  9. CIRCLED IDEOGRAPH NINE

Update: See the next post on ideographic numerals.

Update: There are 142 distinct numbers that correspond to the numerical value associated with a Unicode character. This page gives a list of the values and an example of each value.
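The set of distinct values can be collected with a short loop. This sketch passes a default to `unicodedata.numeric` so that characters without a numeric value are skipped instead of raising an exception. The count you get depends on the Unicode version that ships with your Python, so it may not match the figure above.

```python
from unicodedata import numeric

values = set()
for i in range(0x110000):
    v = numeric(chr(i), None)  # None for characters with no numeric value
    if v is not None:
        values.add(v)

print(len(values))
print(min(values), max(values))
```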

Making flags in Unicode

I recently found out [1] that the Unicode sequences for flag emoji are created by taking the two-letter country abbreviation (ISO 3166-1 alpha-2) and replacing both letters with their counterparts in the range U+1F1E6 through U+1F1FF.

For example, the abbreviation for Canada is CA, and the characters 🇨 (U+1F1E8) and 🇦 (U+1F1E6) together create 🇨🇦.

boxed C plus boxed A = Canadian flag

This is illustrated by the following Python code.

    import iso3166

    def flag_emoji(name):
        alpha = iso3166.countries.get(name).alpha2
        box = lambda ch: chr( ord(ch) + 0x1f1a5 )
        return box(alpha[0]) + box(alpha[1])
    print(flag_emoji("Canada"))

The name we give to flag_emoji need not be the full country name, like Canada. It can be anything that iso3166.countries.get supports, which also includes two-letter abbreviations like CA, three-letter abbreviations like CAN, or ISO 3166 numeric codes like 124.
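If you already have the two-letter code, the mapping needs no third-party package at all. The sketch below goes in both directions. The function names are hypothetical, and note that it does not check whether the two letters are an assigned ISO 3166-1 code; it only does the arithmetic.

```python
REGIONAL_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A

def flag_from_alpha2(code):
    """Two-letter code such as "CA" -> pair of regional indicator symbols."""
    code = code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError("expected two ASCII letters")
    return "".join(chr(REGIONAL_A + ord(c) - ord("A")) for c in code)

def alpha2_from_flag(flag):
    """Inverse mapping: regional indicator pair -> two-letter code."""
    return "".join(chr(ord(c) - REGIONAL_A + ord("A")) for c in flag)

print(flag_from_alpha2("CA"), alpha2_from_flag(flag_from_alpha2("CA")))
```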

We can use the following code to print a collage of flags:

    def print_all_flags():
        for i, c in enumerate( iso3166.countries ):
            print(flag_emoji(c.name), end="")
            if i%25 == 24: print()

10 by 25 array of flags

[1] I learned this from watching Dylan Beattie’s talk Plain Text on YouTube.

Preventing characters from displaying as emoji

I rarely intentionally use emoji, and yet I often run into them unbidden. This is because some Unicode characters double as emoji.

For example, the zodiac symbol for Aries is used both in celestial navigation and in astrology. The latter is much more common, and so when some software sees U+2648 it interprets the character as the emoji for the horoscope sign.

There is a way to prevent this: append the “variation selector” character U+FE0E after the symbol. This tells software that you want the preceding character to be interpreted as a character. And if you want to request that a character be displayed as an emoji, you can append U+FE0F. But a particular software package may or may not honor your request.
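In code, the request amounts to appending one code point. Here is a minimal sketch with hypothetical helper names; it first strips any selector already present so that a symbol never ends up carrying both.

```python
import unicodedata

TEXT_VS = "\ufe0e"   # VARIATION SELECTOR-15, requests text presentation
EMOJI_VS = "\ufe0f"  # VARIATION SELECTOR-16, requests emoji presentation

def strip_selectors(s):
    return s.replace(TEXT_VS, "").replace(EMOJI_VS, "")

def as_text(symbol):
    """Request text presentation for a single-character symbol."""
    return strip_selectors(symbol) + TEXT_VS

def as_emoji(symbol):
    """Request emoji presentation for a single-character symbol."""
    return strip_selectors(symbol) + EMOJI_VS

aries = "\u2648"
print(as_text(aries), as_emoji(aries))
print(unicodedata.name(TEXT_VS), "/", unicodedata.name(EMOJI_VS))
```

Whether the two printed symbols actually look different is up to the terminal and its fonts.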

I tried this in several terminals to see what would happen. On Linux and Mac, the symbol for Aries (U+2648) prints as an emoji by default, and the symbol for a black pawn (U+265F) does not. But when I added variation selectors to reverse the defaults, the shells complied.

Here’s the same output as text:

    >>> print("\u2648")
    ♈
    >>> print("\u2648\ufe0e")
    ♈︎
    >>> print("\u265f")
    ♟
    >>> print("\u265f\ufe0f")
    ♟️

When I tested this on Windows I got different results in different terminals. The default cmd terminal was unable to display either Aries or the pawn. The ConEmu terminal displayed both as characters, even when I requested emoji. The new Windows Terminal app displayed both as emoji, even when I requested plain characters. This probably has something to do with the fonts the terminals use as well as the terminal software itself.

Update: View this page to see how your browser renders various characters as text or emoji. Also see this Twitter thread on how Twitter renders these characters.

Katakana, Hiragana, and Unicode

I figured out something that I wasn’t able to find by searching, so I’m posting it here in case other people have the same question and the same difficulty finding an answer.

I’m sure other people have written about this, but I couldn’t find it. Maybe lots of people have written about this in Japanese but not many in English.

Japanese kana consists of two syllabaries, hiragana and katakana, that are like phonetic alphabets. Each has 46 basic characters, and each corresponds to a block of 96 Unicode characters. I had two simple questions:

  1. How do the 46 characters map into the 96 characters?
  2. Do they map the same way for both hiragana and katakana?

Hiragana / katakana correspondence

I’ll start with the second question because it’s easier. Hiragana and katakana are different ways of representing the same sounds, and they correspond one to one. For example, the full name of U+3047 (ぇ) is

HIRAGANA LETTER SMALL E

and the full name of its katakana counterpart U+30A7 (ェ) is

KATAKANA LETTER SMALL E

The only difference as far as Unicode goes is that katakana has three code points whose hiragana counterpart is unused, but these are not part of the basic letters.

The following Python code shows that the names of all the letters are the same except for the name of the system.

    from unicodedata import name

    # Offsets from U+3040 that are unassigned in hiragana: U+3040, U+3097, U+3098
    unused = [0x00, 0x57, 0x58]

    # The letters end at offset 0x58; the code points after that are marks and symbols
    for i in range(0x59):
        if i in unused:
            continue
        h = name(chr(0x3040 + i))
        k = name(chr(0x30a0 + i))
        assert(h == k.replace("KATAKANA", "HIRAGANA"))
    print("done")

Mapping 46 into 50 and 96

You’ll see kana written in a grid called a gojūon (五十音), with one side labeled with 5 vowels and the other with 10 consonants. That’s 50 cells, and in fact gojūon literally means 50 sounds, so how do we get 46? Five cells are empty because the sounds are unused or archaic, leaving 45, and one extra letter doesn’t fit the grid structure, bringing the total to 46.

In the image below, the table on the left is for hiragana and the table on the right is for katakana. HTML versions of the tables are available here.

Left out of each table is ん in hiragana and ン in katakana.

So how does each set of 46 characters map into its Unicode code block?

Unicode numbers the letters consecutively if you traverse the grid increasing vowels first, then consonants, and adding the straggler at the end. But the reason 46 letters expand into more code points is that each letter can have one, two, or three variations. And there are various miscellaneous other symbols in the Unicode block.

For example, there is a LETTER E as well as the LETTER SMALL E mentioned above. Other variations seem to correspond to voiced and unvoiced versions of a consonant with a phonetic marker added to the voiced version. For example, く is U+304F, HIRAGANA LETTER KU, and ぐ is U+3050, HIRAGANA LETTER GU.

Here is how hiragana maps into Unicode. Each cell shows a hexadecimal value; the code point is U+3000 plus that value.

         a  i  u  e  o 
        42 44 46 48 4A 
     k  4B 4D 4F 51 53 
     s  55 57 59 5B 5D 
     t  5F 61 64 66 68 
     n  6A 6B 6C 6D 6E 
     h  6F 72 75 78 7B 
     m  7E 7F 80 81 82 
     y  84    86    88 
     r  89 8A 8B 8C 8D 
     w  8F          92 
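The hiragana grid can be rebuilt from the official character names, since Unicode names each basic letter by consonant and vowel (for instance HIRAGANA LETTER TU for つ). The sketch below looks names up with `unicodedata.lookup` and skips combinations that have no letter. It also skips WI and WE, which exist in Unicode as archaic letters but are left out of the grid; that choice, like the variable names, is mine.

```python
from unicodedata import lookup

vowels = "AIUEO"
consonants = ["", "K", "S", "T", "N", "H", "M", "Y", "R", "W"]
archaic = {"WI", "WE"}  # assigned in Unicode but left out of the grid

offsets = {}
for c in consonants:
    for v in vowels:
        if c + v in archaic:
            continue
        try:
            ch = lookup(f"HIRAGANA LETTER {c}{v}")
        except KeyError:
            continue  # no letter with this name, e.g. YI or WU
        if 0x3041 <= ord(ch) <= 0x3096:
            offsets[(c, v)] = ord(ch) - 0x3000

# Print one row per consonant, leaving blanks for empty cells
for c in consonants:
    row = [format(offsets[(c, v)], "02X") if (c, v) in offsets else "  " for v in vowels]
    print(f"{c.lower():>2}  " + " ".join(row))
```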

The corresponding table for katakana is the previous table plus 0x60:

         a  i  u  e  o 
        A2 A4 A6 A8 AA 
     k  AB AD AF B1 B3 
     s  B5 B7 B9 BB BD 
     t  BF C1 C4 C6 C8 
     n  CA CB CC CD CE 
     h  CF D2 D5 D8 DB 
     m  DE DF E0 E1 E2 
     y  E4    E6    E8 
     r  E9 EA EB EC ED 
     w  EF          F2 

In each case, the letter missing from the table is the next consecutive value after the last in the table, i.e. ん is U+3093 and ン is U+30F3.
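Since the two blocks line up with a constant offset of 0x60, converting between the syllabaries is a matter of shifting code points. Here is a minimal sketch with illustrative function names. It only shifts the letters that have a counterpart in the other block (U+3041 through U+3096 and U+30A1 through U+30F6) and passes everything else through unchanged.

```python
def to_katakana(s):
    """Shift hiragana letters U+3041..U+3096 up by 0x60; leave other characters alone."""
    return "".join(
        chr(ord(ch) + 0x60) if 0x3041 <= ord(ch) <= 0x3096 else ch
        for ch in s
    )

def to_hiragana(s):
    """Shift katakana letters U+30A1..U+30F6 down by 0x60; leave other characters alone."""
    return "".join(
        chr(ord(ch) - 0x60) if 0x30A1 <= ord(ch) <= 0x30F6 else ch
        for ch in s
    )

print(to_katakana("ひらがな"), to_hiragana("カタカナ"))
```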

Dominoes in Unicode

I was spelunking around in Unicode and found that there are assigned characters for representing domino tiles and that the characters are enumerated in a convenient order. Here is the code chart.

There are codes for representing tiles horizontally or vertically. And even though, for example, the 5-3 is the same domino as the 3-5, there are separate characters for representing the orientation of the tile: one for 3 on the left and one for 5 on the left.

When you include orientation like this, a domino becomes essentially a base 7 number: the number of spots on one end is the number of 7s and the number of spots on the other end is the number of 1s. And the order of the characters corresponds to the order as base 7 numbers:

0-0, 0-1, 0-2, …, 1-0, 1-1, 1-2, … 6-6.

The horizontal dominoes start with the double blank at U+1F031 and the vertical dominoes start with U+1F063, a difference of 32 in hexadecimal or 50 in base 10. So you can rotate a domino tile by adding 50 to, or subtracting 50 from, its code point.

The following tiny Python function gives the codepoint for the domino with a spots on the left (or top) and b spots on the right (or bottom).

    def code(a, b, wide):
        cp = 0x1f031 if wide else 0x1f063
        return cp + 7*a + b

We can use this function to print a (3, 5) tile horizontally and a (6, 4) tile vertically.

    print( chr(code(3, 5, True )),
           chr(code(6, 4, False)) )
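The base 7 structure also makes it easy to go the other way, from a code point back to the spots, and to rotate a tile by the offset of 50. The following sketch assumes the two starting code points given above; `decode` and `rotate` are hypothetical names, not part of any library.

```python
HORIZONTAL_00 = 0x1F031  # horizontal double blank
VERTICAL_00 = 0x1F063    # vertical double blank

def decode(cp):
    """Inverse of code(): code point -> (a, b, wide)."""
    wide = cp < VERTICAL_00
    base = HORIZONTAL_00 if wide else VERTICAL_00
    offset = cp - base
    if not 0 <= offset < 49:
        raise ValueError("not a domino face code point")
    a, b = divmod(offset, 7)  # a is the number of 7s, b the number of 1s
    return a, b, wide

def rotate(cp):
    """Switch between horizontal and vertical by moving 50 code points."""
    a, b, wide = decode(cp)
    return cp + 50 if wide else cp - 50

print(decode(0x1F031 + 7*3 + 5))
print(decode(rotate(0x1F031 + 7*3 + 5)))
```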

To my surprise, my computer had the fonts installed to display the results. This isn’t guaranteed for such high Unicode values.

horizontal 3-5 domino and vertical 6-4

Letter-like Unicode symbols

Unicode provides a way to distinguish symbols that look alike but have different meanings.

We can illustrate this with the following Python code.

    import unicodedata as u

    # Escapes keep the look-alike characters unambiguous in the source
    for pair in [('K', '\u212A'), ('\u03A9', '\u2126'), ('\u2135', '\u05D0')]:
        for c in pair:
            print(format(ord(c), '4X'), u.bidirectional(c), u.name(c))

This produces

      4B L LATIN CAPITAL LETTER K
    212A L KELVIN SIGN
     3A9 L GREEK CAPITAL LETTER OMEGA
    2126 L OHM SIGN
    2135 L ALEF SYMBOL
     5D0 R HEBREW LETTER ALEF

Even though K and K look similar, the former is a Roman letter and the latter is a symbol for temperature in Kelvin. Similarly, Ω and Ω are semantically different even though they look alike.

Or rather, they probably look similar. A font may or may not use different glyphs for different code points. The editor I’m using to write this post uses a font that makes no difference between ohm and omega. The letter K and the Kelvin symbol are slightly different if I look very closely. The two alefs appear substantially different.

Note that the mathematical symbol alef is a left-to-right character and the Hebrew letter alef is a right-to-left character. The former could be useful to tell a word processor “This isn’t really a Hebrew letter; it’s a symbol that looks like a Hebrew letter. Don’t change the direction of my text or switch any language-sensitive features like spell checking.”

These letter-like symbols can be used to provide semantic information, but they can also be used to deceive. For example, a malicious website could change a K in a URL to a Kelvin sign.
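One common way to reduce this kind of confusion, offered here as an assumption rather than anything the code above does, is to apply Unicode compatibility normalization (NFKC) before comparing strings. For the three symbols in this post, NFKC maps each one to its ordinary letter. It is not a complete defense against look-alike characters, since many confusable pairs are unrelated under normalization.

```python
import unicodedata

lookalikes = {
    "\u212A": "KELVIN SIGN",
    "\u2126": "OHM SIGN",
    "\u2135": "ALEF SYMBOL",
}

# Show which ordinary letter each symbol becomes under NFKC
for ch, label in lookalikes.items():
    folded = unicodedata.normalize("NFKC", ch)
    print(label, "->", f"U+{ord(folded):04X}", unicodedata.name(folded))
```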

Corner quotes in Unicode

In his book Mastering Regular Expressions, Jeffrey Friedl uses corner quotes to delimit regular expressions. Here’s an example I found by opening his book at random:

    ⌜(\.\d\d[1-9]?)\d*⌟

The upper-left corner at the beginning and the lower-right corner at the end are not part of the regular expression. This particularly comes in handy if a regular expression begins or ends with white space.

(It wouldn’t do to, say, use quotation marks because this would invite confusion between the regular expression itself and a quoted string used to express that regular expression in a programming language.)

I’ve thought about using Friedl’s convention but I didn’t think it could be done with plain text. It can, using Unicode character U+231C at the beginning and U+231F at the end.

There are four corner quotes:

    |------+--------+---------------------|
    | Char | Code   | Name                |
    |------+--------+---------------------|
    | ⌜    | U+231C | TOP LEFT CORNER     |
    | ⌝    | U+231D | TOP RIGHT CORNER    |
    | ⌞    | U+231E | BOTTOM LEFT CORNER  |
    | ⌟    | U+231F | BOTTOM RIGHT CORNER |
    |------+--------+---------------------|
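To use the convention when printing patterns, you could wrap and unwrap them with a pair of small helpers. This is a sketch with made-up function names, using the top left and bottom right corners as in the example from the book.

```python
import unicodedata

TOP_LEFT, BOTTOM_RIGHT = "\u231C", "\u231F"

def corner_quote(pattern):
    """Wrap a regex in corner marks for display."""
    return TOP_LEFT + pattern + BOTTOM_RIGHT

def strip_corners(text):
    """Recover the pattern; the marks are never part of the regex itself."""
    if text.startswith(TOP_LEFT) and text.endswith(BOTTOM_RIGHT):
        return text[1:-1]
    return text

shown = corner_quote(r"\d+ ")  # the trailing space is now visible
print(shown)
print(unicodedata.name(TOP_LEFT), "/", unicodedata.name(BOTTOM_RIGHT))
```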

Corner quotes are also used in logic to denote Gödel numbers, e.g. ⌜φ⌝ denotes the Gödel number for φ.

Corner quotes are also known as Quine quotes. They usually come in the pair top left and top right, rather than top left and bottom right as in Friedl’s usage.

Update: As Rob Wells points out in the comments, it seems Friedl used CJK quote marks 「 (U+300C) and 」 (U+300D) rather than the corner quotes, which makes sense given that Friedl speaks Japanese.
