Yesterday I wrote about cjhebrew, a LaTeX package that lets you insert Hebrew text by using a sort of transliteration scheme. That reminded me of unidecode, a Python package for transliterating Unicode to ASCII, which I've written about before. I wondered how the two compare, and this post answers that question.
Transliteration is a crude approximation. I started to say it’s no substitute for a proper translation, but in fact sometimes it is a substitute for a proper translation. It takes in the smallest context possible—one character—and is utterly devoid of nuance, but it still might be good enough for some purposes. It might, for example, help in searching some text for relevant content worth the effort of a proper translation.
Here’s a short bit of code to display unidecode’s transliterations of the Hebrew alphabet.
    import unidecode

    # 22 letters plus 5 final forms, starting at alef
    for i in range(22 + 5):
        ch = chr(i + ord('א'))
        print(ch, unidecode.unidecode(ch))
I wrote 22 + 5 rather than 27 above to give a hint that the extra values are the final forms of five letters. Also, if ord('א') doesn’t work for you, you can replace it with the code point of alef, 0x05d0.
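If typing the Hebrew literal is a nuisance, the standard library can confirm the layout of this block of code points without any third-party package. The following is a minimal sketch that relies only on Python's built-in `unicodedata` module; the names `ALEF`, `letters`, and `finals` are my own choices for illustration.

```python
import unicodedata

ALEF = 0x05D0  # code point of alef, the same value as ord('א')

# The 27 consecutive code points: 22 letters plus 5 final forms.
letters = [chr(ALEF + i) for i in range(22 + 5)]

# Final forms are the characters whose Unicode name contains "FINAL".
finals = [ch for ch in letters if "FINAL" in unicodedata.name(ch)]

for ch in letters:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Printing the names makes it easy to see where each final form sits relative to its ordinary form.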
Here’s a comparison of the transliterations used in cjhebrew and unidecode. I’ve abbreviated the column headings to cj and ud to make a narrower table.
|---------+---+----+----|
| Unicode |   | cj | ud |
|---------+---+----+----|
| U+05d0  | א | '  | A  |
| U+05d1  | ב | b  | b  |
| U+05d2  | ג | g  | g  |
| U+05d3  | ד | d  | d  |
| U+05d4  | ה | h  | h  |
| U+05d5  | ו | w  | v  |
| U+05d6  | ז | z  | z  |
| U+05d7  | ח | .h | KH |
| U+05d8  | ט | .t | t  |
| U+05d9  | י | y  | y  |
| U+05da  | ך | K  | k  |
| U+05db  | כ | k  | k  |
| U+05dc  | ל | l  | l  |
| U+05dd  | ם | M  | m  |
| U+05de  | מ | m  | m  |
| U+05df  | ן | N  | n  |
| U+05e0  | נ | n  | n  |
| U+05e1  | ס | s  | s  |
| U+05e2  | ע | `  | `  |
| U+05e3  | ף | P  | p  |
| U+05e4  | פ | p  | p  |
| U+05e5  | ץ | .S | TS |
| U+05e6  | צ | .s | TS |
| U+05e7  | ק | q  | q  |
| U+05e8  | ר | r  | r  |
| U+05e9  | ש | /s | SH |
| U+05ea  | ת | t  | t  |
|---------+---+----+----|
The transliterations are pretty similar, despite different design goals. The unidecode module is trying to pick the best mapping to ASCII characters. The cjhebrew package is trying to use mnemonic ASCII sequences to map into Hebrew. The former doesn’t need to be unique, since several Hebrew letters can share one ASCII output, but the latter does, since each ASCII sequence has to pick out a single Hebrew letter. The post on cjhebrew explains, for example, that it uses capital letters for final forms of Hebrew letters.
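The uniqueness point can be checked mechanically. Here is a rough sketch that copies the two columns of the consonant table into lists, in code point order from U+05D0 to U+05EA, and groups letters that share a code. It assumes tsadi is entered as `.s` in cjhebrew, matching its final form `.S`. The names `CJ`, `UD`, `LETTERS`, and `collisions` are hypothetical helpers, and nothing here calls cjhebrew or unidecode.

```python
from collections import defaultdict

# Transliterations in code point order, U+05D0 through U+05EA,
# copied from the consonant table (tsadi assumed to be ".s").
CJ = "' b g d h w z .h .t y K k l M m N n s ` P p .S .s q r /s t".split()
UD = "A b g d h v z KH t y k k l m m n n s ` p p TS TS q r SH t".split()
LETTERS = [chr(0x05D0 + i) for i in range(27)]

def collisions(codes):
    """Group letters by code and keep the codes shared by more than one letter."""
    groups = defaultdict(list)
    for letter, code in zip(LETTERS, codes):
        groups[code].append(letter)
    return {code: ls for code, ls in groups.items() if len(ls) > 1}

print(collisions(CJ))  # empty: each cjhebrew code names exactly one letter
print(collisions(UD))  # the codes k, m, n, p, t and TS are each shared by two letters
```

Five of the shared unidecode codes pair a letter with its own final form; the sixth, `t`, merges tet and tav, which cjhebrew keeps apart as `.t` and `t`.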
Here’s the corresponding table for vowel points (niqqud).
|---------+---+----+----|
| Unicode |   | cj | ud |
|---------+---+----+----|
| U+05b0  | ְ  | :  | @  |
| U+05b1  | ֱ  | E: | e  |
| U+05b2  | ֲ  | a: | a  |
| U+05b3  | ֳ  | A: | o  |
| U+05b4  | ִ  | i  | i  |
| U+05b5  | ֵ  | e  | e  |
| U+05b6  | ֶ  | E  | e  |
| U+05b7  | ַ  | a  | a  |
| U+05b8  | ָ  | A  | a  |
| U+05b9  | ֹ  | o  | o  |
| U+05ba  | ֺ  | o  | o  |
| U+05bb  | ֻ  | u  | u  |
|---------+---+----+----|
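To see what character-at-a-time transliteration looks like on a pointed word, here is a small sketch that builds a lookup table from the ud columns of the two tables and applies it with `str.translate`. It only mimics the tables in this post and does not call unidecode itself. The names `CONSONANTS`, `POINTS`, `UD_TABLE`, and `transliterate` are made up for the example.

```python
# Hypothetical lookup table built from the ud columns of the two tables;
# this does not call unidecode itself.
CONSONANTS = "A b g d h v z KH t y k k l m m n n s ` p p TS TS q r SH t".split()
POINTS = "@ e a o i e e a a o o u".split()

UD_TABLE = {0x05D0 + i: s for i, s in enumerate(CONSONANTS)}        # U+05D0..U+05EA
UD_TABLE.update({0x05B0 + i: s for i, s in enumerate(POINTS)})      # U+05B0..U+05BB

def transliterate(text):
    """Replace each character independently, with no surrounding context."""
    return text.translate(UD_TABLE)

# The word for "king": mem, segol, lamed, segol, final kaf, sheva
word = "\u05de\u05b6\u05dc\u05b6\u05da\u05b0"
print(transliterate(word))  # melek@
```

Characters that are not in the table, such as ASCII letters or other Hebrew marks, pass through unchanged.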
Unicode lists the final form of each of these letters before its ordinary form. For example, final kaf has Unicode value U+05da and kaf has value U+05db.