Katakana, Hiragana, and Unicode

I figured out something that I wasn’t able to find by searching, so I’m posting it here in case other people have the same question and the same difficulty finding an answer.

I’m sure other people have written about this, but I couldn’t find it. Maybe lots of people have written about this in Japanese but not many in English.

Japanese kana consists of two syllabaries, hiragana and katakana, that are like phonetic alphabets. Each has 46 basic characters, and each corresponds to a block of 96 Unicode characters. I had two simple questions:

  1. How do the 46 characters map into the 90 characters?
  2. Do they map the same way for both hiragana and katakana?

Hiragana / katakana correspondence

I’ll start with the second question because it’s easier. Hiragana and katakana are different ways of representing the same sounds, and they correspond one to one. For example, the full name of U+3047 () is

HIRAGANA LETTER SMALL E

and the full name of its katakana counterpart U+30A7 () is

KATAKANA LETTER SMALL E

The only difference as far as Unicode goes is that katakana has three code points whose hiragana counterpart is unused, but these are not part of the basic letters.

The following Python code shows that the names of all the characters are the same except for the name of the system.

    from unicodedata import name

    unused = [0, 151, 152] # not in hiragana

    for i in range(0,63):
        if i in unused:
            continue
        h = name(chr(0x3040 + i)) 
        k = name(chr(0x30a0 + i))
        assert(h == k.replace("KATAKANA", "HIRAGANA"))
    print("done")

Mapping 46 into 50 and 96

You’ll see kana written in grid with one side labeled with 5 vowels and the other labeled with 10 consonants called a gojūon (五十音). That’s 50 cells, and in fact gojūon literally means 50 sounds, so how do we get 46? Five cells are empty, and one letter doesn’t fit into the grid. The empty cells are unused or archaic, and the extra character doesn’t fit the grid structure.

In the image below, the table on the left is for hiragana and the table on the right is for katakana. HTML versions of the tables available here.

Left out of each table is in hiragana and in katakana.

So does each set of 46 characters map into its Unicode code block?

Unicode numbers the letters consecutively if you traverse the grid increasing vowels first, then consonants, and adding the straggler at the end. But the reason 46 letters expand into more code points is that each letter can have one, two, or three variations. And there are various miscellaneous other symbols in the Unicode block.

For example, there is a LETTER E as well as the SMALL LETTER E mentioned above. Other variations seem to correspond to voiced and unvoiced versions of a consonant with a phonetic marker added to the voiced version. For example, く is U+304F, HIRAGANA LETTER KU, and ぐ is U+3050, HIRAGANA LETTER GU.

Here is how hiragana maps into Unicode. Each cell should be U+3000 plus the characters show.

         a  i  u  e  o 
        42 44 46 48 4A 
     k  4B 4D 4F 51 53 
     s  55 57 59 5B 5D 
     t  5F 61 64 66 68 
     n  6A 6B 6C 6D 6E 
     h  6F 72 75 78 7B 
     m  7E 7F 80 81 82 
     y  84    86    88 
     r  89 8A 8B 8C 8D 
     w  8F          92 

The corresponding table for katakana is the previous table plus 0x60:

         a  i  u  e  o 
        A2 A4 A6 A8 AA 
     k  AB AD AF B1 B3 
     s  B5 B7 B9 BB BD 
     t  BF C1 C4 C6 C8 
     n  CA CB CC CD CE 
     h  CF D2 D5 D8 DB 
     m  DE DF E0 E1 E2 
     y  E4    E6    E8 
     r  E9 EA EB EC ED 
     w  EF          F2 

In each case, the letter missing from the table is the next consecutive value after the last in the table, i.e. is U+30F3.

Related posts