Entering Russian characters in Vim with digraphs

The purpose of this post is to expand on the following sentence [1]:

Russian letters are created by entering [Ctrl-k followed by] a corresponding Latin letter followed by an equals sign =, or, in a few places, a percent sign %.

The Russian alphabet has 33 letters, so there can’t be a distinct Latin letter for every Russian letter. Also, there are Latin letters that don’t have a Russian counterpart and vice versa. So the mapping can’t be simple. But still, the above summary is nice: try Ctrl-k followed by the English analog and an equals sign. If that doesn’t work, try a percent sign instead.

Which Latin letters does Vim choose as corresponding to Russian letters? Does it go by sound or appearance? For example, the Russian letter Н looks like a Latin H but it sounds like a Latin N. Vim goes by sound. You would enter the Russian letter Н by typing Ctrl-k N =.

For full details, see the Vim documentation :h digraph-table. I give a simplified excerpt from the documentation below. I just look at capital letters because the lower case letters are analogous. All the official Unicode names begin with CYRILLIC CAPITAL LETTER and so I cut that part out.

char    digraph hex     official name 
А       A=      0410    A
Б       B=      0411    BE
В       V=      0412    VE
Г       G=      0413    GHE
Д       D=      0414    DE
Е       E=      0415    IE
Ё       IO      0401    IO
Ж       Z%      0416    ZHE
З       Z=      0417    ZE
И       I=      0418    I
Й       J=      0419    SHORT I
К       K=      041A    KA
Л       L=      041B    EL
М       M=      041C    EM
Н       N=      041D    EN
О       O=      041E    O
П       P=      041F    PE
Р       R=      0420    ER
С       S=      0421    ES
Т       T=      0422    TE
У       U=      0423    U
Ф       F=      0424    EF
Х       H=      0425    HA
Ц       C=      0426    TSE
Ч       C%      0427    CHE
Ш       S%      0428    SHA
Щ       Sc      0429    SHCHA
Ъ       ="      042A    HARD SIGN
Ы       Y=      042B    YERU
Ь       %"      042C    SOFT SIGN
Э       JE      042D    E
Ю       JU      042E    YU
Я       JA      042F    YA

Note that the end of the alphabet is more complicated than simply using a Latin letter and either an equal or percent sign. Also, the table is in alphabetical order, which doesn’t quite correspond to Unicode numerical order because of a quirk with the letter Ё (U+0401) explained here.
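Here’s a short Python sketch that makes the quirk concrete: it prints the capital Cyrillic letters in code point order, and Ё (U+0401) sits outside the contiguous А–Я block even though it is the seventh letter of the alphabet.

    # Capital Cyrillic letters in code point order.
    # Ё comes first numerically even though it is seventh alphabetically.
    letters = [0x0401] + list(range(0x0410, 0x0430))
    for cp in letters:
        print(f"U+{cp:04X} {chr(cp)}")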

[1] Arnold Robbins and Elbert Hannah. Learning the vi & Vim Editors, 8th edition

Chebyshev and Russian transliteration

It’s not simple to transliterate Russian names to English. Sometimes there is a unique mapping, or at least a standard mapping, of a particular name, but often there is not.

An example that comes up frequently in mathematics is Pafnuty Lvovich Chebyshev (1821–1894). This Russian mathematician’s name Пафну́тий Льво́вич Чебышёв has been transliterated as Tchebichef, Tchebychev, Tchebycheff, Tschebyschev, Tschebyschef, Tschebyscheff, Čebyčev, Čebyšev, Chebysheff, Chebychov, Chebyshov, etc.

The American Mathematical Society has settled on “Chebyshev” as its standard, and this is now common in English mathematical writing. But things named after Chebyshev, such as Chebyshev polynomials, are often denoted with a T because the French prefer “Tchebyshev.”

There is an ISO standard, ISO 9, for transliterating Cyrillic characters into Latin characters. Under this standard, Чебышёв becomes Čebyšëv. This maps Cyrillic into Latin characters with diacritical marks but not into ASCII. The AMS realized that the vast majority of Americans would not type Čebyšëv into a search bar, for example, and chose Chebyshev instead.

A deck of cards

One time when I was in grad school, I was a teaching assistant for a business math class that included calculus and a smattering of other topics, including a little bit of probability. I made up examples involving a deck of cards, but then learned to my surprise that not everyone was familiar with playing cards. I was young and had a lot to learn about seeing things through the eyes of other people.

Later I learned that a “standard” deck of cards is not quite as standard as I thought. The “standard 52-card deck” is indeed standard in the English-speaking world, but there are variations used in other countries.

The Unicode values for the characters representing playing cards are laid out in a nice pattern. The last digit of the Unicode value corresponds to the point value of the card (aces low), and the second-to-last digit corresponds to the suit.

The ace of spades has code point U+1F0A1, and the aces of hearts, diamonds, and clubs have values  U+1F0B1, U+1F0C1, and U+1F0D1 respectively. The three of spades is U+1F0A3 and the seven of hearts is U+1F0B7.

But there’s a surprise if you’re only aware of the standard 52-card deck: there’s a knight between the jack and the queen. The Unicode values for jacks end in B (hex B is decimal 11) as you’d expect, but queens end in D and kings end in E. The cards ending in C are knights, cards that don’t exist in the standard 52-card deck.

Here’s a little Python code that illustrates the Unicode layout.

for i in range(4):            # one row per suit: spades, hearts, diamonds, clubs
    for j in range(14):       # fourteen ranks per suit, including the knight
        print(chr(0x1f0a1 + j + 16*i), end='')
    print()

The output is too small to read as plain text, but here’s a screen shot from running the code above in a terminal with a huge font.

Unicode cards

Playing cards have their origin in tarot cards, which are commonly associated with the occult. But John C. Wright argues that tarot cards were based on Christian iconography before they became associated with the occult.

Almost ASCII

I was working recently with a gigabyte file that had a dozen non-ASCII characters. This is very common. The ASCII character set is not quite big enough for a lot of tasks. Of course it’s completely inadequate if you’re writing Japanese, but it’s almost enough for documents written in English and a few other languages.

Efficient encoding

The world has standardized on Unicode as the way to represent characters across languages. Unicode currently has around 150,000 characters, far more than ASCII’s 128 characters.

But there’s a problem. Since 150,000 > 2¹⁷, it takes more than two bytes (eight bits to a byte) to represent each of 150,000 things. If you use three bytes to represent each character, every file that is almost all ASCII will get three times bigger. If you limit yourself to the most frequently used Unicode characters, those that can be represented with two bytes (the “basic multilingual plane”), then you still double the size of files.

Enter UTF-8, a brilliant solution to this problem. The UTF-8 encoding of an ASCII file is an ASCII file. Pure ASCII files don’t get any larger when interpreted as UTF-8 encoded Unicode. Because 128 = 2⁷, a byte representing an ASCII character has one unused bit. UTF-8 uses this unused bit to signal that what follows is not ASCII. I wrote about the full details here.

Unicode characters outside the ASCII range take 2, 3, or 4 bytes to represent. Inserting a small number of non-ASCII characters into a UTF-8 encoded Unicode file hardly changes the file’s size.
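Here’s a quick Python check of byte counts under UTF-8, just as an illustration:

    # Bytes per character when encoded as UTF-8
    for ch in ["A", "é", "€", "𝕏"]:
        print(ch, len(ch.encode("utf-8")))
    # prints: A 1, é 2, € 3, 𝕏 4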

Troubleshooting

I mentioned at the top that I had a gigabyte file with a dozen non-ASCII characters. The command file -I reported the file encoding to be ASCII, because the vast majority of the file was ASCII. But the non-ASCII characters were not valid Unicode characters either.

These invalid Unicode characters would display as �, which is not actually in the file. The � is a valid Unicode character for representing an invalid Unicode character.

Some of the non-ASCII characters were extended ASCII (Windows 1252) characters, but if I remember correctly even that didn’t account for everything. Some of the odd characters were simply file corruption.

It’s kinda interesting how some tools are robust to these kinds of glitches and some are not. My first clue that something funny was going on was when sort refused to sort. I ran a Python script that helps me fix wonky text files and it threw an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 222662: invalid start byte

This may seem like gibberish, but it actually says exactly what’s going on. There was an error interpreting the file as Unicode, because 0x92 is not a valid way to start a non-ASCII character in UTF-8.

The first bit of an ASCII character is 0. The first two bits of the leading byte of a non-ASCII character in UTF-8 are 11, and the continuation bytes that follow it start with 10. Now 0x9 is 1001 in binary, so the byte 0x92 starts with 10: it is neither an ASCII character nor the beginning of a UTF-8 non-ASCII sequence of bytes. More details here.
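Here’s a small Python sketch of that bit-pattern argument; classify_byte is just an illustrative helper, not a library function.

    def classify_byte(b):
        if b >> 7 == 0:        # leading bit 0: plain ASCII
            return "ASCII"
        if b >> 6 == 0b10:     # leading bits 10: continuation byte
            return "continuation byte, cannot start a character"
        return "leading byte of a multi-byte sequence"

    print(f"{0x92:08b}", classify_byte(0x92))
    # prints: 10010010 continuation byte, cannot start a character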

Removing non-ASCII characters

For my application I could just remove the invalid characters using iconv with the -c option.

iconv -c -f CP1252 -t UTF-8 inputfile > outputfile

If you need to salvage troublesome characters then things are a little more complicated. The iconv utility will work if you know what the intended encoding was. If you don’t know the intended encoding, you may need to do some detective work.
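One way to do that detective work is to let the chardet package guess the encoding from a sample of bytes. A minimal sketch, reusing the input file name from the iconv example above:

    import chardet  # third-party package for guessing encodings

    with open("inputfile", "rb") as f:
        sample = f.read(100_000)  # a sample is usually enough

    print(chardet.detect(sample))
    # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}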

Why â€™ in a file?

â (a with circumflex), € (euro sign), ™ (trademark sign)

Why might you see ’ in the middle of an otherwise intelligible file? The reason is very similar to the reason you might see �, which I explained in the previous post. You might want to read that post first if you’re not familiar with Unicode and character encodings.

It probably has to do with an encoding error, though not necessarily: I deliberately put â€™ in the opening sentence, for example. But assuming it is an error, it’s likely an encoding error.

But it’s the opposite of the � error. The � occurs when non-UTF-8 text has been declared as (or implicitly interpreted as) UTF-8. In particular, you can run into this error if text encoded in ISO 8859-1 is interpreted as UTF-8.

The ’ sequence is usually the opposite: UTF-8 encoded text is being interpreted as Windows-1252 (a.k.a. CP-1252) encoded text. In particular, a single quote (U+2019) encoded in UTF-8 has been interpreted as the Windows-1252 text ’.

Since Windows-1252 is a superset of ISO 8859-1, the error resulting in � could also be described as a Windows-1252 error. So a � means Windows-1252 text has been interpreted as UTF-8, and â€™ means UTF-8 text has been interpreted as Windows-1252. In the former case there is an invalid character. In the latter case all the characters are valid, though they’re not the characters you were supposed to see.

You can fix the error by making your content and your encoding match. Or you could edit the offending text, replacing each â€™ with a single quote (’).
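Here’s a little Python that demonstrates both the damage and the repair, just as an illustration:

    s = "’"  # U+2019 RIGHT SINGLE QUOTATION MARK
    mojibake = s.encode("utf-8").decode("cp1252")
    print(mojibake)  # â€™
    print(mojibake.encode("cp1252").decode("utf-8"))  # ’  (the round trip undoes the damage)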

You can find more details in this Stack Overflow post.

A valid character to represent an invalid character

U+FFFD REPLACEMENT CHARACTER

You may have seen a web page with the symbol � scattered throughout the text, especially in older web pages. What is this symbol and why does it appear unexpectedly?

The symbol we’re discussing is a bit of a paradox. It’s the (valid) Unicode character to represent an invalid Unicode character. If you just read the first sentence of this post, without inspecting the code point values, you can’t tell whether the symbol appears because I’ve made a mistake, or whether I’ve deliberately included the symbol.

The symbol in question is U+FFFD, named REPLACEMENT CHARACTER, a perfectly valid Unicode character. But unlike this post, you’re most likely to see it when the author did not intend for you to see it. What’s going on?

It all has to do with character encoding.

If all you want to do is represent Roman letters, common punctuation marks, and control characters, ASCII is fine. There are 128 ASCII characters, so they fit into a single 8-bit byte. But as soon as you want to write façade, jalapeño, or Gödel you have a problem. And of course you have a bigger problem if your language doesn’t use the Roman alphabet at all.

ASCII wastes one bit per byte, so naturally people wanted to take advantage of that extra bit to represent additional characters, such as the ç, ñ, and ö above. One popular way of doing this was described in the standard ISO 8859-1.

Of course there are other ways of encoding characters. If your language is Russian or Hebrew or Chinese, you’re no happier with ISO 8859-1 than you are with ASCII.

Enter Unicode. Let’s represent all the world’s alphabets (and ideograms and common symbols and …) in a single system. Great idea, though there are a mind-boggling number of details to work out. Even so, once you’ve assigned a number to every symbol that you care about, there’s still more work to do.

You could represent every character with two bytes. Now you can represent 65,536 characters. That’s too much and too little. If you want to represent text that is essentially made of Roman letters plus occasional exotic characters, using two bytes per letter makes the text 100% larger.
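Here’s a rough Python check of that claim, using UTF-16 to stand in for a fixed two-byte encoding:

    s = "Hello, world"
    print(len(s.encode("utf-8")))      # 12 bytes
    print(len(s.encode("utf-16-le")))  # 24 bytes, twice as large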

And while 65,536 sounds like a lot of characters, it’s not enough to represent every Chinese character, much less all the characters in other languages. Turns out we need four bytes to do what Unicode was intended to do.

So now we have to deal with encodings. UTF-8 is a brilliant solution to this problem. It can handle all Unicode characters, but if text is just ASCII, it won’t get any longer: an ASCII file is already a valid UTF-8 encoding of the corresponding Unicode text.

But there were other systems before Unicode, like ISO 8859-1, and if your file is encoded as ISO 8859-1 but a web browser thinks it’s UTF-8 encoded Unicode, some byte sequences will be invalid. Browsers use the character � as a replacement for invalid text they cannot otherwise display. That’s probably what’s going on when you see �.
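A quick Python sketch of that situation: a byte that is perfectly valid ISO 8859-1 gets replaced with � when decoded as UTF-8.

    data = "café".encode("latin-1")                 # b'caf\xe9'
    print(data.decode("utf-8", errors="replace"))   # caf�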

See the article on How UTF-8 works to understand why some characters are not just a different character than intended but illegal UTF-8 byte sequences.

Cross-platform way to enter Unicode characters

The previous post describes the hoops I jumped through to enter Unicode characters on a Mac. Here’s a script to run from the command line that will copy Unicode characters to the system clipboard. It runs anywhere the Python module pyperclip runs.

    #!/usr/bin/env python3
    # Copy the Unicode character with the given hex code point to the clipboard.

    import sys
    import pyperclip

    cp = sys.argv[1]        # code point as a hex string, e.g. 03c0
    ch = chr(int(cp, 16))   # convert the hex string to a character
    print(ch)
    pyperclip.copy(ch)

I called this script U so I could type

    U 03c0

at the command line, for example, and it would print π to the command line and also copy it to the clipboard.

Unlike the macOS solution in the previous post, this works for any Unicode value, including code points above FFFF.

On my Linux box I had to install xclip before pyperclip would work.

Double-struck capital letters

I’ve needed to use double-struck capital letters lately, also called blackboard bold. There are a few quirks in how they are represented in Unicode and in HTML entities, so I’m leaving some notes for myself here and for anyone else who might need to look this up.

Unicode

The double-struck capital letters are split into two blocks for historical reasons. The double-struck capital letters used most often in math — ℂ, ℍ, ℕ, ℙ, ℚ, ℝ, ℤ — are located in the U+21XX range, while the rest are in the U+1D5XX range.

Low characters

The letters in the U+21XX range were the first to be included in the Unicode standard. There’s no significance to the code points.

ℂ U+2102
ℍ U+210D 
ℕ U+2115
ℙ U+2119 
ℚ U+211A
ℝ U+211D
ℤ U+2124 

The names, however, are consistent. The official name of ℂ is

DOUBLE-STRUCK CAPITAL C

and the rest are analogous.

High characters

The code point for double-struck capital A is U+1D538 and the rest are in alphabetical order: the nth letter of the English alphabet has code point

0x1D537 + n.

However, the codes corresponding to the letters in the low range are missing. For example, U+1D53A, the code that would logically be the code for ℂ, is unused so as not to duplicate the codes in the low range.

The official names of these characters are

MATHEMATICAL DOUBLE-STRUCK CAPITAL C

and so forth. Note the “MATHEMATICAL” prefix that the letters in the low range do not have.

Incidentally, X is the 24th letter, so the new logo for Twitter has code point U+1D54F.

HTML Entities

All double-struck letters have HTML entities of the form &*opf; where * is the letter, whether capital or lower case. For example, &Aopf; is 𝔸 and &aopf; is 𝕒.

The letters with lower Unicode values also have semantic entities.

ℂ  &Copf;  &complexes;
ℍ  &Hopf;  &quaternions;
ℕ  &Nopf;  &naturals;
ℙ  &Popf;  &primes;
ℚ  &Qopf;  &rationals;
ℝ  &Ropf;  &reals;
ℤ  &Zopf;  &integers;

LaTeX

The LaTeX command for any double-struck capital letter is \mathbb{}. This only applies to capital letters.
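Here’s a minimal LaTeX example, assuming the amssymb package (which provides \mathbb):

    \documentclass{article}
    \usepackage{amssymb}
    \begin{document}
    The reals $\mathbb{R}$, the complex numbers $\mathbb{C}$, and the integers $\mathbb{Z}$.
    \end{document}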

Python

Here’s Python code to illustrate the discussion of Unicode values above.

def doublestrike(ch):
    """Print the double-struck version of a capital letter and its code point."""
    # The seven letters already encoded in the U+21xx block are exceptions.
    exceptions = {
        'C' : 0x2102,
        'H' : 0x210D,
        'N' : 0x2115,
        'P' : 0x2119,
        'Q' : 0x211A,
        'R' : 0x211D,
        'Z' : 0x2124
    }
    if ch in exceptions:
        codepoint = exceptions[ch]
    else:
        # The rest are laid out alphabetically starting at U+1D538 for A.
        codepoint = 0x1D538 + ord(ch) - ord('A')
    print(chr(codepoint), f"U+{codepoint:X}")

for n in range(ord('A'), ord('Z')+1):
    doublestrike(chr(n))

Russian transliteration hack

I mentioned in the previous post that I had been poking around in HTML entities and noticed symbols for Fourier transforms and such. I also noticed HTML entities for Cyrillic letters. These entities have the form

& + transliteration + cy;.

For example, the Cyrillic letter П is based on the Greek letter Π, its closest English counterpart is P, and its HTML entity is &Pcy;.

The Cyrillic letter Р has HTML entity &Rcy; rather than &Pcy; because although it looks like an English P, it sounds more like an English R.

Just as a hack, I decided to write code to transliterate Russian text by converting letters to their HTML entities, then chopping off the initial & and the final cy;.

I don’t speak Russian, but according to Google Translate, the Russian translation of “Hello world” is “Привет, мир.”

Here’s my hello-world program for transliterating Russian.

    from bs4.dammit import EntitySubstitution

    escaper = EntitySubstitution()

    def transliterate(ch):
        # e.g. П -> "&Pcy;" -> "Pcy;" -> "P"
        entity = escaper.substitute_html(ch)[1:]
        return entity[:-3]

    a = [transliterate(c) for c in "Привет, мир."]
    print(" ".join(a))

This prints

P r i v ie t m i r

Here’s what I get trying to transliterate Chebyshev’s native name Пафну́тий Льво́вич Чебышёв.

P a f n u t i j L soft v o v i ch CH ie b y sh io v

I put a space between letters because of possible outputs like “soft v” above.

This was just a fun hack. Here’s what I’d get if I used software intended to be used for transliteration.

    import unidecode

    for x in ["Привет, мир", "Пафну́тий Льво́вич Чебышёв"]:
        print(unidecode.unidecode(x))

This produces

Privet, mir
Pafnutii L’vovich Chebyshiov

The results are similar.

Symbols for transforms

I was looking through HTML entities and ran across &Fouriertrf; (ℱ). I searched for all entities ending in trf; and also found &Mellintrf; (ℳ), &Laplacetrf; (ℒ), and &zeetrf; (ℨ).

Apparently “trf” stands for “transform” and these symbols are intended to be used to represent the Fourier transform, Mellin transform, Laplace transform, and z-transform.

You would not know from the Unicode names that these symbols are intended to be used for transforms. For example, U+2131 has Unicode name SCRIPT CAPITAL F. But the HTML entity &Fouriertrf; suggests how the symbol could be used.

These are not the symbols I’d use for these transforms if I had access to LaTeX. I’d use {\cal F} for the Fourier transform, for example. But if I were writing about math and restricted to Unicode symbols, as I am on Twitter, I might use these symbols. I could imagine, for example, using ℒ for Laplace transform on @AnalysisFact.

Here’s a table with more on the four transform symbols that have HTML entities.

| Symbol | HTML entity  | Code point | Unicode name           |
| ℒ      | &Laplacetrf; | U+2112     | SCRIPT CAPITAL L       |
| ℨ      | &zeetrf;     | U+2128     | BLACK-LETTER CAPITAL Z |
| ℱ      | &Fouriertrf; | U+2131     | SCRIPT CAPITAL F       |
| ℳ      | &Mellintrf;  | U+2133     | SCRIPT CAPITAL M       |
