This post is a follow-on to my previous post on Unicode numbers. I always welcome feedback from readers, but I especially welcome it here because I’m walking into an area I know next to nothing about.
Consecutive code points
Unicode generally assigns code points to number-like things in consecutive order. For example, the Python code
for n in range(1, 10):
    print(chr(0x30+n), chr(0x24f4+n), chr(0x215f+n))
prints

1 ⓵ Ⅰ
2 ⓶ Ⅱ
3 ⓷ Ⅲ
4 ⓸ Ⅳ
5 ⓹ Ⅴ
6 ⓺ Ⅵ
7 ⓻ Ⅶ
8 ⓼ Ⅷ
9 ⓽ Ⅸ
showing that ASCII digits, double-circled digits, and Roman numerals are encoded consecutively.
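The same claim can be checked without eyeballing glyphs. Here is a minimal sketch that asks Unicode for the numeric value of each character; the helper name `is_consecutive` is mine, not anything from the standard library.

```python
from unicodedata import numeric

def is_consecutive(base, count=9):
    """Return True if chr(base + n) has numeric value n for n = 1..count."""
    # numeric() returns the default (None here) for characters with no numeric value.
    return all(numeric(chr(base + n), None) == n for n in range(1, count + 1))

# The three starting points used in the loop above.
for base in (0x30, 0x24f4, 0x215f):
    print(hex(base), is_consecutive(base))
```

All three ranges should report `True`. A range that starts in the middle of the alphabet, such as `0x40`, reports `False` because letters have no numeric value.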
Parenthesized and circled ideographs
So the same is probably true for ideographs representing digits, right?
No, but before we get into that, the following code shows that parenthesized ideographs and circled ideographs for digits are numbered consecutively. The code
from unicodedata import numeric, name

# Parenthesized ideographs, U+3220 through U+3228
for i in range(1, 10):
    cp = 0x321f + i
    ch = chr(cp)
    print(ch, hex(cp), numeric(ch), name(ch))

# Circled ideographs, U+3280 through U+3288
for i in range(1, 10):
    cp = 0x327f + i
    ch = chr(cp)
    print(ch, hex(cp), numeric(ch), name(ch))
prints

㈠ 0x3220 1.0 PARENTHESIZED IDEOGRAPH ONE
㈡ 0x3221 2.0 PARENTHESIZED IDEOGRAPH TWO
㈢ 0x3222 3.0 PARENTHESIZED IDEOGRAPH THREE
㈣ 0x3223 4.0 PARENTHESIZED IDEOGRAPH FOUR
㈤ 0x3224 5.0 PARENTHESIZED IDEOGRAPH FIVE
㈥ 0x3225 6.0 PARENTHESIZED IDEOGRAPH SIX
㈦ 0x3226 7.0 PARENTHESIZED IDEOGRAPH SEVEN
㈧ 0x3227 8.0 PARENTHESIZED IDEOGRAPH EIGHT
㈨ 0x3228 9.0 PARENTHESIZED IDEOGRAPH NINE
㊀ 0x3280 1.0 CIRCLED IDEOGRAPH ONE
㊁ 0x3281 2.0 CIRCLED IDEOGRAPH TWO
㊂ 0x3282 3.0 CIRCLED IDEOGRAPH THREE
㊃ 0x3283 4.0 CIRCLED IDEOGRAPH FOUR
㊄ 0x3284 5.0 CIRCLED IDEOGRAPH FIVE
㊅ 0x3285 6.0 CIRCLED IDEOGRAPH SIX
㊆ 0x3286 7.0 CIRCLED IDEOGRAPH SEVEN
㊇ 0x3287 8.0 CIRCLED IDEOGRAPH EIGHT
㊈ 0x3288 9.0 CIRCLED IDEOGRAPH NINE
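If you would rather have the computer read that listing, the following sketch checks each block's names and values programmatically. The `WORDS` list and the `check_block` helper are my own scaffolding for illustration.

```python
from unicodedata import name, numeric

WORDS = ["ONE", "TWO", "THREE", "FOUR", "FIVE", "SIX", "SEVEN", "EIGHT", "NINE"]

def check_block(start, prefix):
    """Return True if start, start+1, ... are named prefix + ONE..NINE with values 1..9."""
    for i, word in enumerate(WORDS, start=1):
        ch = chr(start + i - 1)
        if name(ch) != f"{prefix} {word}" or numeric(ch) != i:
            return False
    return True

print(check_block(0x3220, "PARENTHESIZED IDEOGRAPH"))
print(check_block(0x3280, "CIRCLED IDEOGRAPH"))
```

Both calls should print `True`, matching the output listed above.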
CJK Unified Ideographs
Now let’s take the parentheses and circles off.
The following code shows that the CJK unified ideographs for digits are not digits (!) according to Unicode, but they are numeric. It also shows that their code points are not assigned in any apparent order.
from unicodedata import numeric

numerals = "一二三四五六七八九十"
for n in numerals:
    print(n, hex(ord(n)), n.isdigit(), numeric(n))
This outputs the following.
一 0x4e00 False 1.0
二 0x4e8c False 2.0
三 0x4e09 False 3.0
四 0x56db False 4.0
五 0x4e94 False 5.0
六 0x516d False 6.0
七 0x4e03 False 7.0
八 0x516b False 8.0
九 0x4e5d False 9.0
十 0x5341 False 10.0
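To make the lack of numeric ordering easier to see, here is a small sketch that sorts the same ten characters by code point and lists the numeric value of each in that order.

```python
from unicodedata import numeric

numerals = "一二三四五六七八九十"

# Sort the characters by code point, then look up each one's numeric value.
by_code_point = sorted(numerals, key=ord)
values = [int(numeric(ch)) for ch in by_code_point]

print("".join(by_code_point))  # 一七三九二五八六十四
print(values)                  # [1, 7, 3, 9, 2, 5, 8, 6, 10, 4]
```

In code point order the values come out as 1, 7, 3, 9, 2, 5, 8, 6, 10, 4, so the ordering clearly follows something other than numeric value.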
I assume the ordering of ideographs in Unicode has its own internal logic (with exceptions and historical quirks) that I know nothing about. If anyone knows of any patterns of how code points are assigned to ideographs, please let me know.
The names of the characters above say nothing about what the characters mean. For example, the official Unicode name for 九 (U+4E5D) is CJK UNIFIED IDEOGRAPH-4E5D. The name gives no hint that the ideograph represents the number 9, though the numeric value Unicode assigns to the character is indeed 9. My guess is that when that character represents a digit, it represents 9, but maybe it can mean other things in other contexts.
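To put the contrast side by side, the sketch below compares the ASCII digit 9 with 九 using the standard `unicodedata` lookups and Python's string predicates. The `describe` helper is just a convenience I made up for this comparison.

```python
from unicodedata import category, name, numeric

def describe(ch):
    """Collect the name, general category, string predicates, and numeric value of ch."""
    return (name(ch), category(ch), ch.isdecimal(), ch.isdigit(), ch.isnumeric(), numeric(ch))

for ch in ("9", "九"):
    print(ch, *describe(ch))
```

The ASCII digit is in category Nd and passes all three predicates. The ideograph is in category Lo, an ordinary letter as far as its general category is concerned, and only `isnumeric()` returns `True` for it.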