# Projecting Unicode to ASCII

Sometimes you need to downgrade Unicode text to more restricted ASCII text. For example, while working on my previous post, I was surprised that there didn’t appear to be an asteroid named after Poincaré. There is one, but it was listed as Poincare in my list of asteroid names.

## Python module

I used the Python module unidecode to convert names to ASCII before searching, and that fixed the problem. Here’s a small example showing how the code works.

    import unidecode

    for x in ["Poincaré", "Gödel"]:
        print(x, unidecode.unidecode(x))


This produces

    Poincaré Poincare
    Gödel Godel


Installing the unidecode module also installs a command line utility by the same name. So you could, for example, pipe text to that utility.

As someone pointed out on Hacker News, this isn’t so impressive for Western languages,

But if you need to project Arabic, Russian or Chinese, unidecode is close to black magic:

    >>> from unidecode import unidecode
    >>> unidecode("北亰")
    'Bei Jing '


(Someone has said in the comments that 北亰 is a typo and should be 北京. I can’t say whether this is right, but I can say that unidecode transliterates both to “Bei Jing.”)

## Projections

I titled this post “Projecting Unicode to ASCII” because this code is a projection in the mathematical sense. A projection is a function P such that for all inputs x,

    P(P(x)) = P(x).

That is, applying the function twice does the same thing as applying the function once. The name comes from projection in the colloquial sense, such as projecting a three dimensional object onto a two dimensional plane. An equivalent term is to say P is idempotent. [1]

The unidecode function maps the full range of Unicode characters into the range 0x00 to 0x7F, and if you apply it to a character already in that range, the function leaves it unchanged. So the function is a projection, or you could say the function is idempotent.
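The projection property is easy to check in code. The sketch below does not use unidecode itself: `to_ascii` is a hypothetical stand-in built from the standard library (decompose with NFKD, then drop everything outside ASCII), so it handles accented Latin letters but not Chinese or Cyrillic. It is only meant to show what P(P(x)) = P(x) looks like for a function of this kind.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Crude stand-in for unidecode (hypothetical helper, not the library's algorithm).

    Decompose accented letters, then drop every code point above 0x7F.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


for name in ["Poincaré", "Gödel", "plain ASCII"]:
    once = to_ascii(name)
    twice = to_ascii(once)
    # A projection satisfies P(P(x)) == P(x).
    print(f"{name!r} -> {once!r}, idempotent: {once == twice}")
```

The same loop should work with `unidecode.unidecode` in place of `to_ascii` if the module is installed.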

Projection is such a simple condition that it hardly seems worth giving it a name. And yet it is extremely useful. A general principle in user interface design is to make something a projection if the user expects it to be a projection. Users probably don’t have the vocabulary to say “I expected this to be a projection” but they’ll be frustrated if something is almost a projection but not quite.

For example, if software has a button to convert an image from color to grayscale, it would be surprising if (accidentally) clicking the button a second time had any effect. It would be unexpected if it returned the original color image, and it would be even more unexpected if it did something else, such as keeping the image in grayscale but lowering the resolution.


[1] The term “idempotent” may be used more generally than “projection,” the latter being more common in linear algebra. Some people may think of a projection as a linear idempotent function. We’re not exactly doing linear algebra here, but people do think of portions of Unicode geometrically, speaking of “planes.”

# Trademark symbol, LaTeX, and Unicode

Earlier this year I was a coauthor on a paper about the Cap Score™ test for male fertility from Androvia Life Sciences [1]. I just noticed today that when I added the publication to my CV, it caused some garbled text to appear in the PDF.

Here is the corresponding LaTeX source code.

## Fixing the LaTeX problem

There were two problems: the trademark symbol and the non-printing symbol denoted by a red underscore in the source file. The trademark was a non-ASCII character (Unicode U+2122) and the underscore represented a non-printing character (U+00A0, the no-break space). At first I only noticed the trademark symbol, and I fixed it by including a LaTeX package to allow Unicode characters:

    \usepackage[utf8x]{inputenc}

An alternative fix, one that doesn’t require including a new package, would be to replace the trademark Unicode character with \texttrademark\. Note the trailing backslash. Without the backslash there would be no space after the trademark symbol. The problem with the unprintable character would remain, but the character could just be deleted.

I found out there are two Unicode code points that render the trademark glyph, U+0099 and U+2122. The former is in the Latin-1 Supplement block and is officially a control character. The correct code point for the trademark symbol is the latter. Unicode files U+2122 under Letterlike Symbols and gives it the official name TRADE MARK SIGN.
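The difference between the two code points can be checked with Python's standard `unicodedata` module. The last line shows one plausible reason U+0099 sometimes displays as a trademark sign: byte 0x99 is ™ in the Windows-1252 encoding. That explanation is an assumption added here, not something established above.

```python
import unicodedata

tm = "\u2122"
c1 = "\u0099"

print(unicodedata.name(tm))               # official character name
print(unicodedata.category(tm))           # general category of the symbol
print(unicodedata.category(c1))           # general category of U+0099
print(unicodedata.name(c1, "<no name>"))  # control characters carry no name

# Byte 0x99 decodes to the trademark sign under Windows-1252.
print(b"\x99".decode("cp1252") == tm)
```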


[1] Jay Schinfeld, Fady Sharara, Randy Morris, Gianpiero D. Palermo, Zev Rosenwaks, Eric Seaman, Steve Hirshberg, John Cook, Cristina Cardona, G. Charles Ostermeier, and Alexander J. Travis. Cap-Score™ Prospectively Predicts Probability of Pregnancy, Molecular Reproduction and Development. To appear.

# Typesetting modal logic

Modal logic extends propositional logic with two new operators, □ (“box”) and ◇ (“diamond”). There are many interpretations of these two symbols, the most common being necessity and possibility respectively. That is, □p means the proposition p is necessary, and ◇p means that p is possible. Another interpretation is using the symbols to represent things a person knows to be true and things that may be true as far as that person knows.

There are also many axiom systems for inference concerning these operators. For example, some axiom systems include the rule

    \Box p \rightarrow \Box \Box p

and some do not. If you interpret □ as saying a proposition is provable, this axiom says whatever is provable is provably provable, which makes sense. But if you take □ to be a statement about what an agent knows, you may not want to say that if an agent knows something, it knows that it knows it.

See the next post for an example of applying logic to security, a logic with lots of modal operators and axioms. But for now, we’ll focus on how to typeset the box and diamond operators.

## LaTeX

In LaTeX, the most obvious commands would be \box and \diamond, but that doesn’t work. There is no \box command, though there is a \square command. And although there is a \diamond command, it produces a symbol much smaller than \square and so the two look odd together. The two operators are dual in the sense that

    \Box p = \neg \Diamond \neg p
    \Diamond p = \neg \Box \neg p

and so they should have symbols of similar size. A better approach is to use \Box and \Diamond. Those were used in the displayed equations above.

## Unicode

There are many box-like and diamond-like symbols in Unicode. It seems reasonable to use U+25A1 for box and U+25C7 for diamond. I don’t know of any more semantically appropriate characters. There are no Unicode characters with “modal” in their name, for example.

## HTML

You can always insert Unicode characters into HTML by using &#x, followed by the hexadecimal value of the codepoint, followed by a semicolon. For example, I typed &#x25a1; and &#x25c7; to enter the box and diamond symbols above.

If you want to stick to HTML entities because they’re easier to remember, you’re mostly out of luck. There is no HTML entity for the box operator. There is an entity &loz; for “lozenge,” the typographical term for a diamond. This HTML entity corresponds to U+25CA and is smaller than U+25C7 recommended above. As discussed in the context of LaTeX, you want the box and diamond operators to have a similar size.
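Python's `html` module offers a quick way to check these references. The sketch below decodes the two numeric references and `&loz;`, then looks up each code point in `html.entities.codepoint2name`, which holds the HTML 4.01 entity names. The larger HTML5 table is deliberately not consulted here.

```python
import html
from html.entities import codepoint2name

box = html.unescape("&#x25a1;")
diamond = html.unescape("&#x25c7;")
lozenge = html.unescape("&loz;")

print(box, diamond, lozenge)
print(f"U+{ord(lozenge):04X}")   # code point behind &loz;

# HTML 4.01 entity name, if any, for each code point.
for cp in (0x25A1, 0x25C7, 0x25CA):
    print(f"U+{cp:04X}", codepoint2name.get(cp, "no HTML 4.01 entity"))
```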

# Fraktur symbols in mathematics

When mathematicians run out of symbols, they turn to other alphabets. Most math symbols are Latin or Greek letters, but occasionally you’ll run into Russian or Hebrew letters.

Sometimes math uses a new font rather than a new alphabet, such as Fraktur. This is common in Lie groups when you want to associate related symbols to a Lie group and its Lie algebra. By convention a Lie group is denoted by an ordinary Latin letter and its associated Lie algebra is denoted by the same letter in Fraktur font.

## LaTeX

To produce Fraktur letters in LaTeX, load the amssymb package and use the command \mathfrak{}.

Symbols such as \mathfrak{A} are math symbols and can only be used in math mode. They are not intended to be a substitute for setting text in Fraktur font. This is consistent with the semantic distinction in Unicode described below.

## Unicode

The Unicode standard tries to distinguish the appearance of a symbol from its semantics, though there are compromises. For example, the Greek letter Ω has Unicode code point U+03A9 but the symbol Ω for electrical resistance in ohms is U+2126 even though they are rendered the same [1].
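Unicode itself records how close the two characters are: the ohm sign U+2126 has a canonical decomposition to the Greek capital omega, so normalization folds one into the other. A small check with the standard `unicodedata` module:

```python
import unicodedata

omega = "\u03a9"   # GREEK CAPITAL LETTER OMEGA
ohm = "\u2126"     # OHM SIGN

print(unicodedata.name(omega), "/", unicodedata.name(ohm))
print(omega == ohm)                                 # compared as raw code points
print(unicodedata.normalize("NFC", ohm) == omega)   # compared after normalization
```

So a search that normalizes its input treats the two as the same character, while a raw comparison does not.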

The letters a through z, rendered in Fraktur font and used as mathematical symbols, have Unicode values U+1D51E through U+1D537. These values are in the “Supplementary Multilingual Plane” and do not commonly have font support [2].

The corresponding capital letters occupy the slots U+1D504 through U+1D51D, though interestingly a few letters are missing. The code point U+1D506, which you’d expect to be Fraktur C, is reserved. The spots corresponding to H, I, R, and Z are also reserved. These letters are not absent from Unicode: they were already encoded in the Letterlike Symbols block (for example ℭ at U+212D and ℜ at U+211C), so their slots in the mathematical block are left empty. The corresponding bold versions U+1D56C through U+1D585 have no such gaps [3].
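The gaps can be listed by scanning the block with `unicodedata`. This is a sketch that assumes the 26 consecutive slots starting at Fraktur capital A stand for A through Z, and it depends on the Unicode data bundled with your Python. The helper name `unassigned_letters` is made up for this example.

```python
import unicodedata
from string import ascii_uppercase

FRAKTUR_A = 0x1D504        # MATHEMATICAL FRAKTUR CAPITAL A
BOLD_FRAKTUR_A = 0x1D56C   # MATHEMATICAL BOLD FRAKTUR CAPITAL A


def unassigned_letters(start: int) -> list[str]:
    """Letters whose slot in the 26-slot run starting at `start` has no character name."""
    return [
        letter
        for offset, letter in enumerate(ascii_uppercase)
        if unicodedata.name(chr(start + offset), "") == ""
    ]


print("Fraktur gaps:", unassigned_letters(FRAKTUR_A))
print("Bold Fraktur gaps:", unassigned_letters(BOLD_FRAKTUR_A))
```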

## Footnotes

[1] At least they usually are. A font designer could choose to provide different glyphs for the two symbols. I used the same character for both because I thought some readers might not see the Ohm symbol properly rendered.

[2] If you have the necessary fonts installed you should see the alphabet in Fraktur below:
𝔞 𝔟 𝔠 𝔡 𝔢 𝔣 𝔤 𝔥 𝔦 𝔧 𝔨 𝔩 𝔪 𝔫 𝔬 𝔭 𝔮 𝔯 𝔰 𝔱 𝔲 𝔳 𝔴 𝔵 𝔶 𝔷

I can see these symbols from my desktop and from my iPhone, but not from my Android tablet. Same with the symbols below.

[3] Here are the bold upper case and lower case Fraktur letters in Unicode:
𝕬 𝕭 𝕮 𝕯 𝕰 𝕱 𝕲 𝕳 𝕴 𝕵 𝕶 𝕷 𝕸 𝕹 𝕺 𝕻 𝕼 𝕽 𝕾 𝕿 𝖀 𝖁 𝖂 𝖃 𝖄 𝖅
𝖆 𝖇 𝖈 𝖉 𝖊 𝖋 𝖌 𝖍 𝖎 𝖏 𝖐 𝖑 𝖒 𝖓 𝖔 𝖕 𝖖 𝖗 𝖘 𝖙 𝖚 𝖛 𝖜 𝖝 𝖞 𝖟

# Why don’t you simply use XeTeX?

From an FAQ post I wrote a few years ago:

This may seem like an odd question, but it’s actually one I get very often. On my TeXtip twitter account, I include tips on how to create non-English characters such as using \AA to produce Å. Every time someone will ask “Why not use XeTeX and just enter these characters?”

If you can “just enter” non-English characters, then you don’t need a tip. But a lot of people either don’t know how to do this or don’t have a convenient way to do so. Most English speakers only need to type foreign characters occasionally, and will find it easier, for example, to type \AA or \ss than to learn how to produce Å or ß from a keyboard. If you frequently need to enter Unicode characters, and know how to do so, then XeTeX is great.


# Unicode / LaTeX page updated

Almost three years ago I put up a web page to let you go back and forth between Unicode code points and LaTeX commands. Here’s the page and here’s a blog post explaining it.

I’ve expanded the data the page uses by merging in data from the STIX Project. More queries should return successfully now.

* * *

# Graphemes

Here’s something amusing I ran across in the glossary of Programming Perl:

grapheme A graphene is an allotrope of carbon arranged in a hexagonal crystal lattice one atom thick. Grapheme, or more fully, a grapheme cluster string is a single user-visible character, which in turn may be several characters (codepoints) long. For example … a “ȫ” is a single grapheme but one, two, or even three characters, depending on normalization.

In case the character ȫ doesn’t display correctly for you, here it is:

First, graphene has little to do with grapheme, but it’s geeky fun to include it anyway. (Both are related to writing. A grapheme has to do with how characters are written, and the word graphene comes from graphite, the “lead” in pencils. The origin of grapheme has nothing to do with graphene but was an analogy to phoneme.)

Second, the example shows how complicated the details of Unicode can get. The Perl code below expands on the details of the comment about ways to represent ȫ.

This demonstrates that the character . in regular expressions matches any single character, but \X matches any single grapheme. (Well, almost. The character . usually matches any character except a newline, though this can be modified via optional switches. But \X matches any grapheme including newline characters.)
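The same three spellings of ȫ can be compared in Python. The built-in `re` module has no `\X` (the third-party `regex` module does), so this sketch sticks to the standard library and uses normalization instead: NFC composes each spelling as far as possible and NFD decomposes it fully.

```python
import unicodedata

versions = [
    "\u022b",          # precomposed: o with diaeresis and macron
    "\u00f6\u0304",    # (o with diaeresis) + combining macron
    "o\u0308\u0304",   # o + combining diaeresis + combining macron
]

for s in versions:
    nfc = unicodedata.normalize("NFC", s)
    nfd = unicodedata.normalize("NFD", s)
    print(len(s), len(nfc), len(nfd))

# After normalization all three spellings compare equal.
print(len({unicodedata.normalize("NFC", s) for s in versions}) == 1)
```

Comparing strings only after normalizing them is the usual way to avoid treating these spellings as different text.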


    use feature 'say';
    binmode STDOUT, ':encoding(UTF-8)';

    # U+022B, o with diaeresis and macron
    my $a = "\x{22B}";

    # U+00F6 U+0304, (o with diaeresis) + macron
    my $b = "\x{F6}\x{304}";

    # o U+0308 U+0304, o + diaeresis + macron
    my $c = "o\x{308}\x{304}";

    my @versions = ($a, $b, $c);

    # All versions display the same.
    say @versions;

    # The versions have length 1, 2, and 3.
    # Only $a contains one character and so matches .
    say map {length $_ if /^.$/} @versions;

    # All versions consist of one grapheme.
    say map {length $_ if /^\X$/} @versions;

For daily tips on regular expressions, follow @RegexTip on Twitter.

# Which Unicode characters can you depend on?

Unicode is supported everywhere, but font support for Unicode characters is sparse. When you use any slightly uncommon character, you have no guarantee someone else will be able to see it.

I’m starting a Twitter account @MusicTheoryTip and so I wanted to know whether I could count on followers seeing music symbols. I asked whether people could see ♭ (flat, U+266D), ♮ (natural, U+266E), and ♯ (sharp, U+266F). Most people could see all three symbols, from desktop or phone, browser or Twitter app. However, several were unable to see the natural sign from an Android phone, whether using a browser or a Twitter app. One person said none of the symbols show up on his Blackberry.

I also asked @diff_eq followers whether they could see the math symbols ∂ (partial, U+2202), Δ (Delta, U+0394), and ∇ (gradient, U+2207). One person said he couldn’t see the gradient symbol, but the rest of the feedback was positive.

So what characters can you count on nearly everyone being able to see? To answer this question, I looked at the characters in the intersection of several common fonts: Verdana, Georgia, Times New Roman, Arial, Courier New, and Droid Sans. My thought was that this would make a very conservative set of characters.

There are 585 characters supported by all the fonts listed above. Most of the characters with code points up to U+01FF are included. This range includes the code blocks for Basic Latin, Latin-1 Supplement, Latin Extended-A, and some of Latin Extended-B. The rest of the characters in the intersection are Greek and Cyrillic letters and a few scattered symbols. Flat, natural, sharp, and gradient didn’t make the cut.
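The intersection step itself is simple once you have each font's coverage as a set of code points. The sketch below uses toy coverage data invented for illustration (the font names and sets are not real measurements), and `to_ranges` is a hypothetical helper that collapses the result into ranges. Reading real coverage out of font files would need a font library, which is left out here.

```python
def to_ranges(code_points: set[int]) -> list[tuple[int, int]]:
    """Collapse a set of code points into sorted inclusive (start, end) runs."""
    ranges: list[tuple[int, int]] = []
    for cp in sorted(code_points):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)   # extend the current run
        else:
            ranges.append((cp, cp))            # start a new run
    return ranges


# Toy coverage data, invented for illustration only.
coverage = {
    "FontA": set(range(0x20, 0x7F)) | {0x2202, 0x2207, 0x266D},
    "FontB": set(range(0x20, 0x7F)) | {0x2202, 0x266D, 0x266E},
    "FontC": set(range(0x20, 0x7F)) | {0x2202, 0x2207},
}

# Characters supported by every font.
common = set.intersection(*coverage.values())

for start, end in to_ranges(common):
    print(f"0x{start:04x}" if start == end else f"0x{start:04x} - 0x{end:04x}")
```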
There are a dozen math symbols included:

    0x2202 ∂
    0x2206 ∆
    0x220F ∏
    0x2211 ∑
    0x2212 −
    0x221A √
    0x221E ∞
    0x222B ∫
    0x2248 ≈
    0x2260 ≠
    0x2264 ≤
    0x2265 ≥

Interestingly, even in such a conservative set of characters, there are three characters included for semantic distinction: the minus sign (i.e. not a hyphen), the difference operator (i.e. not the Greek letter Delta), and the summation operator (i.e. not the Greek letter Sigma).

And in case you’re interested, here’s the complete list of the Unicode characters in the intersection of the fonts listed here. (Update: Added notes to indicate the start of a new code block and listed some of the isolated characters.)

    0x0009            Basic Latin
    0x000d
    0x0020 - 0x007e
    0x00a0 - 0x017f   Latin-1 supplement
    0x0192
    0x01fa - 0x01ff
    0x0218 - 0x0219
    0x02c6 - 0x02c7
    0x02c9
    0x02d8 - 0x02dd
    0x0300 - 0x0301
    0x0384 - 0x038a   Greek and Coptic
    0x038c
    0x038e - 0x03a1
    0x03a3 - 0x03ce
    0x0401 - 0x040c
    0x040e - 0x044f   Cyrillic
    0x0451 - 0x045c
    0x045e - 0x045f
    0x0490 - 0x0491
    0x1e80 - 0x1e85   Latin extended additional
    0x1ef2 - 0x1ef3
    0x200c - 0x200f   General punctuation
    0x2013 - 0x2015
    0x2017 - 0x201e
    0x2020 - 0x2022
    0x2026
    0x2028 - 0x202e
    0x2030
    0x2032 - 0x2033
    0x2039 - 0x203a
    0x203c
    0x2044
    0x206a - 0x206f
    0x207f
    0x20a3 - 0x20a4   Currency symbols ₣ ₤
    0x20a7            ₧
    0x20ac            €
    0x2105            Letterlike symbols ℅
    0x2116            №
    0x2122            ™
    0x2126            Ω
    0x212e            ℮
    0x215b - 0x215e   ⅛ ⅜ ⅝ ⅞
    0x2202            Mathematical operators ∂
    0x2206            ∆
    0x220f            ∏
    0x2211 - 0x2212   ∑ −
    0x221a            √
    0x221e            ∞
    0x222b            ∫
    0x2248            ≈
    0x2260            ≠
    0x2264 - 0x2265   ≤ ≥
    0x25ca            Geometric shapes ◊
    0xfb01 - 0xfb02   Alphabetic presentation forms ﬁ ﬂ

# Unicode to LaTeX

I’ve run across a couple of web sites that let you enter a LaTeX symbol and get back its Unicode value. But I didn’t find a site that does the reverse, going from Unicode to LaTeX, so I wrote my own.

Unicode / LaTeX Conversion

If you enter Unicode, it will return LaTeX. If you enter LaTeX, it will return Unicode.
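A two-way lookup of this kind can be sketched with a pair of dictionaries. The table below is a tiny hand-picked sample and the names `UNICODE_TO_LATEX`, `LATEX_TO_UNICODE`, and `convert` are hypothetical; this is not the code or data behind the page. Two code points share the command `\Omega`, so the reverse table has to pick one of them.

```python
# Tiny illustrative table: code point -> LaTeX command.
UNICODE_TO_LATEX = {
    0x221E: r"\infty",
    0x2202: r"\partial",
    0x03A9: r"\Omega",   # Greek capital omega
    0x2126: r"\Omega",   # ohm sign, same LaTeX command
}

# Reverse table; when several code points share a command, keep the smallest.
LATEX_TO_UNICODE: dict[str, int] = {}
for cp, cmd in sorted(UNICODE_TO_LATEX.items(), reverse=True):
    LATEX_TO_UNICODE[cmd] = cp


def convert(query: str) -> str:
    """Map 'U+XXXX' to a LaTeX command, or a backslash command to 'U+XXXX'."""
    if query.upper().startswith("U+"):
        return UNICODE_TO_LATEX[int(query[2:], 16)]
    if query.startswith("\\"):
        return f"U+{LATEX_TO_UNICODE[query]:04X}"
    raise ValueError(f"expected 'U+...' or a LaTeX command, got {query!r}")


print(convert("U+221E"))
print(convert(r"\infty"))
print(convert(convert("U+2126")))   # Unicode -> LaTeX -> Unicode
```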
It interprets a string starting with “U+” as a Unicode code point, and a string starting with a backslash as a LaTeX command. For example, the screenshot above shows what happens if you enter U+221E and click “convert.” You could also enter \infty and get back U+221E.

However, if you go from Unicode to LaTeX to Unicode, you won’t always end up where you started. There may be multiple Unicode values that map to a single LaTeX symbol. This is because Unicode is semantic and LaTeX is not. For example, Unicode distinguishes between the Greek letter Ω and the symbol Ω for ohms, the unit of electrical resistance, but LaTeX does not.

* * *

For daily tips on LaTeX and typography, follow @TeXtip on Twitter.

# Letters that fell out of the alphabet

Mental Floss had an interesting article called 12 letters that didn’t make the alphabet. A more accurate title might be 12 letters that fell out of the modern English alphabet.

I thought it would have been better if the article had included the Unicode values of the letters, so I did a little research and created the following table.

| Name | Capital | Small |
|---|---|---|
| Thorn | U+00DE | U+00FE |
| Wynn | U+01F7 | U+01BF |
| Yogh | U+021C | U+021D |
| Ash | U+00C6 | U+00E6 |
| Eth | U+00D0 | U+00F0 |
| Ampersand | U+0026 | |
| Insular g | U+A77D | U+1D79 |
| Thorn with stroke | U+A764 | U+A765 |
| Ethel | U+0152 | U+0153 |
| Tironian ond | U+204A | |
| Long s | | U+017F |
| Eng | U+014A | U+014B |

Once you know the Unicode code point for a symbol, you can find out more about it, for example, here.

Related posts: Entering Unicode characters in Windows and Linux.

To enter a Unicode character in Emacs, you can type C-x 8 <return>, then enter the value.

# Draw a symbol, look it up

LaTeX users may know about Detexify, a web site that lets you draw a character then looks up its TeX command. Now there’s a new site Shapecatcher that does the same thing for Unicode. According to the site, “Currently, there are 10,007 Unicode character glyphs in the database.” It does not yet support Chinese, Japanese, or Korean.
For example, I drew a treble clef on the page:

The site came back with a list of possible matches, and the first one was what I was hoping for:

Interestingly, the sixth possible match on the list was a symbol for contour integration:

Notice the treble clef response has a funny little box on the right side. That’s because my browser did not have a glyph to display that Unicode character. The browser did have a glyph for the contour integration symbol and displayed it.

Another Unicode resource I recommend is this Unicode Codepoint Chart. It is organized by code point value, in blocks of 256. If you were looking for the contour integration symbol above, for example, you could click on a link “U+2200 to U+22FF: Mathematical Operators” and see a grid of 256 symbols and click on the one you’re looking for. This site gives more detail about each character than does Shapecatcher. So you might use Shapecatcher to find where to start looking, then go to the Unicode Codepoint Chart to find related symbols or more details.

# The disappointing state of Unicode fonts

Modern operating systems understand Unicode internally, but font support for Unicode is spotty. For an example of the problems this can cause, take a look at these screen shots of how the same Twitter message appears differently depending on what program is used to read it.

No font can display all Unicode characters. According to Wikipedia

… it would be impossible to create such a font in any common font format, as Unicode includes over 100,000 characters, while no widely-used font format supports more than 65,535 glyphs.

However, the biggest problem isn’t the number of characters a font can display. Most Unicode characters are quite rare. About 30,000 characters are enough to display the vast majority of characters in use in all the world’s languages as well as a generous selection of symbols.
However, Unicode fonts vary greatly in their support even for the more commonly used ranges of characters. See this comparison chart. The only range completely covered by all Unicode fonts in the chart is the 128 characters of Latin Extended-A.

Unifont supports all printable characters in the basic multilingual plane, characters U+0000 through U+FFFF. This includes the 30,000 characters mentioned above plus many more. Unifont isn’t pretty, but it’s complete. As far as I know, it’s the only font that covers the characters below U+FFFF.

# Unicode function names

Keith Hill has a fun blog post on using Unicode characters in PowerShell function names. Here’s an example from his article using the square root symbol for the square root function.

    PS> function √($num) { [Math]::Sqrt($num) }
    PS> √ 81
    9

As Keith points out, these symbols are not practical since they’re difficult to enter, but they’re fun to play around with.

Here’s another example using the symbol for pounds sterling for the function to convert British pounds to US dollars.

    PS> function £($num) { 1.44*$num }
    PS> £ 300.00
    432

(As I write this, a British pound is worth $1.44 USD. If you wanted to get fancy, you could call a web service in your function to get the current exchange rate.)

I read once that someone (Larry Wall?) had semi-seriously suggested using the Japanese Yen currency symbol ¥ for the “zip” function in Perl 6 since the symbol looks like a zipper.

Mathematica lets you use Greek letters as variable and function names, and it provides convenient ways to enter these characters, either graphically or via their TeX representations. I think this is a great idea. It could make mathematical source code much more readable. But I don’t use it because I’ve never got into the habit of doing so.

There are some dangers to allowing Unicode characters in programming languages. Because Unicode characters are semantic rather than visual, two characters may have the same graphical representation. Here are a couple of examples. The Roman letter A (U+0041) and the capital Greek letter Α (U+0391) look the same but correspond to different characters. Also, the Greek letter Ω (U+03A9) and the symbol Ω (U+2126) for ohms (unit of electrical resistance) have the same visual representation but are different characters. (Or at least they may have the same visual representation. A font designer may choose, for example, to distinguish Omega and Ohm, but that’s not a concern to the Unicode Consortium.)
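A quick way to see the problem is to ask Unicode for each character's name. The sketch below also includes a rough heuristic for spotting identifiers that mix scripts: `scripts_used` is a hypothetical helper that takes the first word of each character name as a stand-in for its script. Real tooling would use the Unicode script property and confusables data instead.

```python
import unicodedata


def scripts_used(identifier: str) -> set[str]:
    """Rough script guess: the first word of each character's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in identifier}


latin_a = "\u0041"
greek_alpha = "\u0391"

print(unicodedata.name(latin_a))
print(unicodedata.name(greek_alpha))
print(latin_a == greek_alpha)

# An identifier that mixes the two looks ordinary but spans two scripts.
suspicious = "\u0391rea"   # Greek Alpha followed by Latin "rea"
print(scripts_used("Area"), scripts_used(suspicious))
```

A linter could flag any identifier for which the helper returns more than one script, at the cost of false positives for legitimately mixed names.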

* * *

For a daily dose of computer science and related topics, follow @CompSciFact on Twitter.

# Sharps and flats in HTML

Apparently there’s no HTML entity for the flat symbol, ♭. In my previous post, I just spelled out B-flat because I thought that was safer; it’s possible not everyone would have the fonts installed to display B♭ correctly.

So how do you display music symbols for flat, sharp, and natural in HTML? You can insert any symbol if you know its Unicode value, though you run the risk that someone viewing the page may not have the necessary fonts installed to view the symbol. The Unicode values for flat, natural, and sharp are U+266D (♭), U+266E (♮), and U+266F (♯).

Since the flat sign has Unicode value U+266D, you could enter &#x266d; into HTML to display that symbol.
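If you have many such symbols, the references can be generated rather than typed. This is a minimal sketch: `to_numeric_refs` is a hypothetical helper that replaces every non-ASCII character with its hexadecimal character reference, and `html.unescape` from the standard library is used to confirm the round trip.

```python
import html


def to_numeric_refs(text: str) -> str:
    """Replace every non-ASCII character with a hexadecimal character reference."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):x};" for ch in text)


for symbol in "♭♮♯":
    print(symbol, to_numeric_refs(symbol))

encoded = to_numeric_refs("B♭ major")
print(encoded)
print(html.unescape(encoded))   # decodes back to the original text
```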

The sharp sign raises an interesting question. I’m sure most web pages referring to G-sharp would use the number sign # (U+0023) rather than the sharp sign ♯ (U+266F). And why not? The number sign is conveniently located on a standard keyboard and the sharp sign isn’t. It would be nice if people used sharp symbols rather than number signs. It would make it easier to search on specifically musical terms. But it’s not going to happen.

Update: See this post on font support for Unicode. Most people can see all three symbols, but some, especially Android users, might not see the natural sign.
