This post is a follow-on to my previous post on Unicode numbers. I always welcome feedback from readers, but I especially welcome it here because I’m walking into an area I know next to nothing about.
Consecutive code points
Unicode generally assigns code points to number-like things in consecutive order. For example, the Python code

for n in range(1, 10):
    print(chr(0x30+n), chr(0x24f4+n), chr(0x215f+n))

prints

1 ⓵ Ⅰ
2 ⓶ Ⅱ
3 ⓷ Ⅲ
4 ⓸ Ⅳ
5 ⓹ Ⅴ
6 ⓺ Ⅵ
7 ⓻ Ⅶ
8 ⓼ Ⅷ
9 ⓽ Ⅸ

showing that ASCII digits, double-circled digits, and Roman numerals are encoded consecutively.
Parenthesized and circled ideographs
So the same is probably true for ideographs representing digits, right?
No, but before we get into that, the following code shows that parenthesized ideographs and circled ideographs for digits are numbered consecutively. It prints each character along with its code point, numeric value, and Unicode name.
from unicodedata import numeric, name

for i in range(1, 10):
    cp = 0x321f + i
    ch = chr(cp)
    print(ch, hex(cp), numeric(ch), name(ch))

for i in range(1, 10):
    cp = 0x327f + i
    ch = chr(cp)
    print(ch, hex(cp), numeric(ch), name(ch))
㈠ 0x3220 1.0 PARENTHESIZED IDEOGRAPH ONE
㈡ 0x3221 2.0 PARENTHESIZED IDEOGRAPH TWO
㈢ 0x3222 3.0 PARENTHESIZED IDEOGRAPH THREE
㈣ 0x3223 4.0 PARENTHESIZED IDEOGRAPH FOUR
㈤ 0x3224 5.0 PARENTHESIZED IDEOGRAPH FIVE
㈥ 0x3225 6.0 PARENTHESIZED IDEOGRAPH SIX
㈦ 0x3226 7.0 PARENTHESIZED IDEOGRAPH SEVEN
㈧ 0x3227 8.0 PARENTHESIZED IDEOGRAPH EIGHT
㈨ 0x3228 9.0 PARENTHESIZED IDEOGRAPH NINE
㊀ 0x3280 1.0 CIRCLED IDEOGRAPH ONE
㊁ 0x3281 2.0 CIRCLED IDEOGRAPH TWO
㊂ 0x3282 3.0 CIRCLED IDEOGRAPH THREE
㊃ 0x3283 4.0 CIRCLED IDEOGRAPH FOUR
㊄ 0x3284 5.0 CIRCLED IDEOGRAPH FIVE
㊅ 0x3285 6.0 CIRCLED IDEOGRAPH SIX
㊆ 0x3286 7.0 CIRCLED IDEOGRAPH SEVEN
㊇ 0x3287 8.0 CIRCLED IDEOGRAPH EIGHT
㊈ 0x3288 9.0 CIRCLED IDEOGRAPH NINE
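Rather than eyeballing the listing, the same check can be done programmatically. The following is a minimal sketch; the helper name `is_consecutive_run` is made up for illustration and is not part of `unicodedata`. It asks whether nine code points in a row carry the numeric values 1 through 9.

```python
from unicodedata import numeric

def is_consecutive_run(start, count=9):
    """Return True if code points start, start+1, ... have numeric values 1..count.

    Hypothetical helper for illustration. numeric(ch, None) returns None
    for characters with no numeric value, so such characters fail the test.
    """
    return all(numeric(chr(start + i), None) == i + 1 for i in range(count))

print(is_consecutive_run(0x3220))  # parenthesized ideographs -> True
print(is_consecutive_run(0x3280))  # circled ideographs -> True
print(is_consecutive_run(0x4e00))  # starts at 一, but the run breaks -> False
```

The last call previews the next section: 一 has numeric value 1, but the code points that follow it do not continue with 2, 3, and so on.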
CJK Unified Ideographs
Now let’s take the parentheses and circles off.
The following code shows that the CJK unified ideographs for digits are not digits (!) according to Unicode, but they are numeric. It also shows that their code points are not assigned in any apparent order.
numerals = "一二三四五六七八九十"
for n in numerals:
    print(n, hex(ord(n)), n.isdigit(), numeric(n))
This outputs the following.
一 0x4e00 False 1.0
二 0x4e8c False 2.0
三 0x4e09 False 3.0
四 0x56db False 4.0
五 0x4e94 False 5.0
六 0x516d False 6.0
七 0x4e03 False 7.0
八 0x516b False 8.0
九 0x4e5d False 9.0
十 0x5341 False 10.0
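To see just how scrambled the order is, here is a small sketch that sorts the same ten ideographs by code point and prints the numeric values in that order. The variable names are my own; only `numeric`, `ord`, and `sorted` come from Python itself.

```python
from unicodedata import numeric

numerals = "一二三四五六七八九十"

# Sort the ideographs by code point instead of by the number they represent.
by_code_point = sorted(numerals, key=ord)

for ch in by_code_point:
    print(ch, hex(ord(ch)), int(numeric(ch)))

# The numeric values, listed in code point order.
values_in_code_point_order = [int(numeric(ch)) for ch in by_code_point]
print(values_in_code_point_order)  # [1, 7, 3, 9, 2, 5, 8, 6, 10, 4]
```

In code point order the values run 1, 7, 3, 9, 2, 5, 8, 6, 10, 4, so the code points are clearly not assigned by numeric value.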
I assume the ordering of ideographs in Unicode has its own internal logic (with exceptions and historical quirks) that I know nothing about. If anyone knows of any patterns of how code points are assigned to ideographs, please let me know.
The names of the characters above say nothing about what the characters mean. For example, the official Unicode name for 九 (U+4E5D) is CJK UNIFIED IDEOGRAPH-4E5D. The name says nothing about the ideograph representing the digit 9, though the numeric property of the digit is indeed 9. My guess is that when that character represents a digit, it represents 9, but maybe it can mean other things in other contexts.
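The gap between the name and the numeric property can be seen directly with the `unicodedata` module. The sketch below uses a helper I've called `numeric_profile` (a made-up name) to collect a character's name along with its decimal, digit, and numeric values, passing `None` as the default so that a missing property shows up as `None` rather than raising an exception.

```python
import unicodedata

def numeric_profile(ch):
    """Collect the name and the three number-related properties of a character.

    Hypothetical helper for illustration. Each lookup is given a default,
    so a missing property is reported as None instead of raising ValueError.
    """
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "decimal": unicodedata.decimal(ch, None),
        "digit": unicodedata.digit(ch, None),
        "numeric": unicodedata.numeric(ch, None),
    }

for ch in "9九":
    print(ch, numeric_profile(ch))
```

For the ASCII character 9 all three properties are set, while for 九 only the numeric value is, which is consistent with `isdigit()` returning `False` above.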
5 thoughts on “Ideograph numerals”
Almost always 9.
The numbering has something to do with Han Unification (https://en.wikipedia.org/wiki/Han_unification , https://en.wikipedia.org/wiki/CJK_Unified_Ideographs ) but the details are beyond me. Probably they started with one of the national standards (more likely than not, a Chinese one), and the ordering there may have originally been based on one of the many “dictionary order” schemes (how to look up a word in the dictionary when it isn’t made out of letters is its own interesting topic).
The ideograph 九 represents the number nine, rather than the digit 9. This becomes more apparent when you consider that there are ideographs for numbers such as ten (十), one hundred (百) and so on. The number 109, for example, could be written as 百九.
The ordering is by radical and number of additional strokes. See “dictionary lookup” in Wikipedia’s “radical” page.
So kind of like that old elementary school brain teaser, “Complete the sequence 8, 5, 4, 9, 1, 7, ….”, where the numbers are sorted alphabetically.