Why don’t you simply use XeTeX?

From an FAQ post I wrote a few years ago:

This may seem like an odd question, but it’s actually one I get very often. On my TeXtip twitter account, I include tips on how to create non-English characters such as using \AA to produce Å. Every time someone will ask “Why not use XeTeX and just enter these characters?”

If you can “just enter” non-English characters, then you don’t need a tip. But a lot of people either don’t know how to do this or don’t have a convenient way to do so. Most English speakers only need to type foreign characters occasionally, and will find it easier, for example, to type \AA or \ss than to learn how to produce Å or ß from a keyboard. If you frequently need to enter Unicode characters, and know how to do so, then XeTeX is great.

One does not simply type Unicode characters.

Related posts:


Here’s something amusing I ran across in the glossary of Programming Perl:

grapheme A graphene is an allotrope of carbon arranged in a hexagonal crystal lattice one atom thick. Grapheme, or more fully, a grapheme cluster string is a single user-visible character, which in turn may be several characters (codepoints) long. For example … a “ȫ” is a single grapheme but one, two, or even three characters, depending on normalization.

In case the character ȫ doesn’t display correctly for you, here it is:

Unicode character U_022B

First, graphene has little to do with grapheme, but it’s geeky fun to include it anyway. (Both are related to writing. A grapheme has to do with how characters are written, and the word graphene comes from graphite, the “lead” in pencils. The origin of grapheme has nothing to do with graphene but was an analogy to phoneme.)

Second, the example shows how complicated the details of Unicode can get. The Perl code below expands on the details of the comment about ways to represent ȫ.

This demonstrates that the character . in regular expressions matches any single character, but \X matches any single grapheme. (Well, almost. The character . usually matches any character except a newline, though this can be modified via optional switches. But \X matches any grapheme including newline characters.)

# U+0226, o with diaeresis and macron 
my $a = "\x{22B}"; 

# U+00F6 U+0304, (o with diaeresis) + macron 
my $b = "\x{F6}\x{304}";    
# o U+0308 U+0304, o + diaeresis + macron   
my $c = "o\x{308}\x{304}"; 

my @versions = ($a, $b, $c);

# All versions display the same.
say @versions;

# The versions have length 1, 2, and 3.
# Only $a contains one character and so matches .
say map {length $_ if /^.$/} @versions;

# All versions consist of one grapheme.
say map {length $_ if /^\X$/} @versions;

For daily tips on regular expressions, follow @RegexTip on Twitter.

Regex tip icon

Notes on HTML, XML, TeX, and Unicode

This week’s resource post: some notes on typesetting, Unicode, etc.

See also blog posts tagged LaTeX, HTML, and Unicode.

For daily tips on LaTeX and typography, follow @TeXtip on Twitter.

TeXtip logo

Last week: C++ resources

Next week: Special functions


Which Unicode characters can you depend on?

Unicode is supported everywhere, but font support for Unicode characters is sparse. When you use any slightly uncommon character, you have no guarantee someone else will be able to see it.

I’m starting a Twitter account @MusicTheoryTip and so I wanted to know whether I could count on followers seeing music symbols. I asked whether people could see ♭ (flat, U+266D), ♮ (natural, U+266E), and ♯ (sharp, U+266F). Most people could see all three symbols, from desktop or phone, browser or Twitter app. However, several were unable to see the natural sign from an Android phone, whether using a browser or a Twitter app. One person said none of the symbols show up on his Blackberry.

I also asked @diff_eq followers whether they could see the math symbols ∂ (partial, U+2202), Δ (Delta, U+0394), and ∇ (gradient, U+2207). One person said he couldn’t see the gradient symbol, but the rest of the feedback was positive.

So what characters can you count on nearly everyone being able to see? To answer this question, I looked at the characters in the intersection of several common fonts: Verdana, Georgia, Times New Roman, Arial, Courier New, and Droid Sans. My thought was that this would make a very conservative set of characters.

There are 585 characters supported by all the fonts listed above. Most of the characters with code points up to U+01FF are included. This range includes the code blocks for Basic Latin, Latin-1 Supplement, Latin Extended-A, and some of Latin Extended-B.

The rest of the characters in the intersection are Greek and Cyrillic letters and a few scattered symbols. Flat, natural, sharp, and gradient didn’t make the cut.

There are a dozen math symbols included:

0x2202 ∂
0x2206 ∆
0x220F ∏
0x2211 ∑
0x2212 −
0x221A √
0x221E ∞
0x222B ∫
0x2248 ≈
0x2260 ≠
0x2264 ≤
0x2265 ≥

Interestingly, even in such a conservative set of characters, there are a three characters included for semantic distinction: the minus sign (i.e. not a hyphen), the difference operator (i.e. not the Greek letter Delta), and the summation operator (i.e. not the Greek letter Sigma).

And in case you’re interested, here’s the complete list of the Unicode characters in the intersection of the fonts listed here. (Update: Added notes to indicate the start of a new code block and listed some of the isolated characters.)

0x0009           Basic Latin
0x0020 - 0x007e 
0x00a0 - 0x017f  Latin-1 supplement
0x01fa - 0x01ff
0x0218 - 0x0219  
0x02c6 - 0x02c7  
0x02d8 - 0x02dd 
0x0300 - 0x0301 
0x0384 - 0x038a  Greek and Coptic
0x038e - 0x03a1
0x03a3 - 0x03ce
0x0401 - 0x040c 
0x040e - 0x044f  Cyrillic
0x0451 - 0x045c
0x045e - 0x045f
0x0490 - 0x0491
0x1e80 - 0x1e85  Latin extended additional
0x1ef2 - 0x1ef3
0x200c - 0x200f  General punctuation
0x2013 - 0x2015
0x2017 - 0x201e
0x2020 - 0x2022
0x2028 - 0x202e
0x2032 - 0x2033
0x2039 - 0x203a
0x206a - 0x206f  
0x20a3 - 0x20a4  Currency symbols ₣ ₤
0x20a7           ₧
0x20ac           €
0x2105           Letterlike symbols ℅
0x2116           №
0x2122           ™
0x2126           Ω
0x212e           ℮
0x215b - 0x215e  ⅛ ⅜ ⅝ ⅞
0x2202 	         Mathematical operators ∂
0x2206           ∆
0x220f           ∏
0x2211 - 0x2212  ∑ −
0x221a           √
0x221e           ∞
0x222b           ∫
0x2248           ≈
0x2260           ≠
0x2264 - 0x2265  ≤ ≥
0x25ca           Box drawing ◊
0xfb01 - 0xfb02  Alphabetic presentation forms fi fl

Unicode to LaTeX

I’ve run across a couple web sites that let you enter a LaTeX symbol and get back its Unicode value. But I didn’t find a site that does the reverse, going from Unicode to LaTeX, so I wrote my own.

Unicode / LaTeX Conversion

If you enter Unicode, it will return LaTeX. If you enter LaTeX, it will return Unicode. It interprets a string starting with “U+” as a Unicode code point, and a string starting with a backslash as a LaTeX command.

screenshot of www.johndcook.com/unicode_latex.png

For example, the screenshot above shows what happens if you enter U+221E and click “convert.” You could also enter infty and get back U+221E.

However, if you go from Unicode to LaTeX to Unicode, you won’t always end up where you started. There may be multiple Unicode values that map to a single LaTeX symbol. This is because Unicode is semantic and LaTeX is not. For example, Unicode distinguishes between the Greek letter Ω and the symbol Ω for ohms, the unit of electrical resistance, but LaTeX does not.

* * *

For daily tips on LaTeX and typography, follow @TeXtip on Twitter.

TeXtip logo

Letters that fell out of the alphabet

Mental Floss had an interesting article called 12 letters that didn’t make the alphabet. A more accurate title might be 12 letters that fell out of the modern English alphabet.

I thought it would have been better if the article had included the Unicode values of the letters, so I did a little research and created the following table.

Insular gU+A77DU+1D79
Thorn with strokeU+A764U+A765
Tironian ondU+204A
Long sU+017F


Once you know the Unicode code point for a symbol, you can find out more about it, for example, here.

Related posts:

Entering Unicode characters in Windows and Linux.

To enter a Unicode character in Emacs, you can type C-x 8 <return>, then enter the value.

Draw a symbol, look it up

LaTeX users may know about Detexify, a web site that lets you draw a character then looks up its TeX command. Now there’s a new site Shapecatcher that does the same thing for Unicode. According to the site, “Currently, there are 10,007 Unicode character glyphs in the database.” It does not yet support Chinese, Japanese, or Korean.

For example, I drew a treble clef on the page:

The site came back with a list of possible matches, and the first one was what I was hoping for:

Interestingly, the sixth possible match on the list was a symbol for contour integration:

Notice the treble clef response has a funny little box on the right side. That’s because my browser did not have a glyph to display that Unicode character. The browser did have a glyph for the contour integration symbol and displayed it.

Another Unicode resource I recommend is this Unicode Codepoint Chart. It is organized by code point value, in blocks of 256. If you were looking for the contour integration symbol above, for example, you could click on a link “U+2200 to U+22FF: Mathematical Operators” and see a grid of 256 symbols and click on the one you’re looking for. This site gives more detail about each character than does Shapecatcher. So you might use Shapecatcher to find where to start looking, then go to the Unicode Codepoint Chart to find related symbols or more details.

Other posts on Unicode:

For daily tips on LaTeX and typography, follow @TeXtip on Twitter.

TeXtip logo

The disappointing state of Unicode fonts

Modern operating systems understand Unicode internally, but font support for Unicode is spotty. For an example of the problems this can cause, take a look at these screen shots of how the same Twitter message appears differently depending on what program is used to read it.

No font can display all Unicode characters. According to Wikipedia

… it would be impossible to create such a font in any common font format, as Unicode includes over 100,000 characters, while no widely-used font format supports more than 65,535 glyphs.

However, the biggest problem isn’t the number of characters a font can display. Most Unicode characters are quite rare. About 30,000 characters are enough to display the vast majority of characters in use in all the world’s languages as well as a generous selection of symbols. However Unicode fonts vary greatly in their support even for the more commonly used ranges of characters. See this comparison chart. The only range completely covered by all Unicode fonts in the chart is the 128 characters of Latin Extended-A.

Unifont supports all printable characters in the basic multilingual plane, characters U+0000 through U+FFFF. This includes the 30,000 characters mentioned above plus many more. Unifont isn’t pretty, but it’s complete. As far as I know, it’s the only font that covers the characters below U+FFFF.

Related posts:

Unicode function names

Keith Hill has a fun blog post on using Unicode characters in PowerShell function names. Here’s an example from his article using the square root symbol for the square root function.

PS> function √($num) { [Math]::Sqrt($num) }
PS> √ 81

As Keith points out, these symbols are not practical since they’re difficult to enter, but they’re fun to play around with.

Here’s another example using the symbol for pounds sterling

for the function to convert British pounds to US dollars.

PS> function £($num) { 1.44*$num }
PS> £ 300.00

(As I write this, a British pound is worth $1.44 USD. If you wanted to get fancy, you could call a web service in your function to get the current exchange rate.)

I read once that someone (Larry Wall?) had semi-seriously suggested using the Japanese Yen currency symbol

for the “zip” function in Perl 6 since the symbol looks like a zipper.

Mathematica lets you use Greek letters as variable and function names, and it provides convenient ways to enter these characters, either graphically or via their TeX representations. I think this is a great idea. It could make mathematical source code much more readable. But I don’t use it because I’ve never got into the habit of doing so.

There are some dangers to allowing Unicode characters in programming languages. Because Unicode characters are semantic rather than visual, two characters may have the same graphical representation. Here are a couple examples. The Roman letter A (U+0041) and the capital Greek letter Α (U+0391) look the same but correspond to different characters. Also, the the Greek letter Ω (U+03A9) and the symbol Ω (U+2126) for Ohms (unit of electrical resistance) have the same visual representation but are different characters. (Or at least they may have the same visual representation. A font designer may choose, for example, to distinguish Omega and Ohm, but that’s not a concern to the Unicode Consortium.)

* * *

For a daily dose of computer science and related topics, follow @CompSciFact on Twitter.

CompSciFact twitter icon

Sharps and flats in HTML

Apparently there’s no HTML entity for the flat symbol, ♭. In my previous post, I just spelled out B-flat because I thought that was safer; it’s possible not everyone would have the fonts installed to display B♭ correctly.

So how do you display music symbols for flat, sharp, and natural in HTML? You can insert any symbol if you know its Unicode value, though you run the risk that someone viewing the page may not have the necessary fonts installed to view the symbol. Here are the Unicode values for flat, natural, and sharp.

Since the flat sign has Unicode value U+266D, you could enter &#x266d; into HTML to display that symbol.

The sharp sign raises an interesting question. I’m sure most web pages referring to G-sharp would use the number sign # (U+0023) rather than the sharp sign ♯ (U+266F). And why not? The number sign is conveniently located on a standard keyboard and the sharp sign isn’t. It would be nice if people used sharp symbols rather than number signs. It would make it easier to search on specifically musical terms. But it’s not going to happen.

Update: See this post on font support for Unicode. Most people can see all three symbols, but some, especially Android users, might not see the natural sign.

Related posts:

Shorter URLs by using Unicode

Tinyarro.ws is a service like tinyurl.com and others that shorten URLs. However, unlike similar services, Tinyarro.ws uses Unicode characters, allowing it to encode more possibilities into each character. These sub-compact URLs may contain Chinese characters, for example, or other symbols unfamiliar to many users. They’re no good for reading aloud, say over the phone or on a podcast. But they’re ideal for Twitter because you only have to click on the link, not type it into a browser.

Here’s a URL I got when I tried the Tinyarro.ws site:

screen shot from tinyarro.ws

The resulting URL may not display correctly in your browser depending on what fonts you have installed: http://➡.ws/㣸.

I pasted the URL into Microsoft Word and used Alt-x to see what the Unicode characters were. (See Three ways to enter Unicode characters in Windows.) The arrow is code point U+27A1 and the final character is code point U+38F8. I have no idea what that character means. I would appreciate someone letting me know in the comments.

Unicode character U=38F8

Related post: How to insert graphics in Twitter messages

How to insert graphics in Twitter messages

I saw a couple interesting messages from John Udell on Twitter yesterday.

screen shot of Twitter posts

Apparently Bill Zeller had the clever idea of using Unicode characters to put sparklines in Twitter messages. It didn’t occur to me that Twitter would accept Unicode. I’m sure the intent of Unicode support is multi-lingual text, not clever hacks such as creating sparklines, but this is fun to play around with.

The sparkline above was created using block element symbols. I tried a few other Unicode characters. Some worked, but most didn’t.  The Twitter protocol supports Unicode, but particular Twitter clients may not have fonts for displaying some characters. For example, I found some characters would display correctly when I went to twitter.com that didn’t display from the Twhirl client.

To enter a Unicode character, you can find out the numeric value here. Once you know the value, here are tips for how to insert Unicode characters in Windows and Linux applications. You might, for example, create the symbols you want using Microsoft Word and then paste them into your Twitter client.

Update (8 January 2010):

Here are some examples of how the differently same tweet may appear on the same computer using some screen shots I took this morning.

First, here’s the view from Twitter’s web site using Firefox 3.5.7 on Windows:

Here’s Firefox 3.5.3 on Ubuntu 9.10, nearly the same but a couple characters are missing.

Here’s the view from IE 8 on Windows:

And now Safari on Windows:

Here’s the view from TweetDeck on Windows with its default font:

And now TweetDeck on Windows with its international font:

I imagine TweetDeck would look the same on other operating systems since Adobe Air is largely self-contained.

Related links:

PowerShell output redirection: Unicode or ASCII?

What does the redirection operator > in PowerShell do to text: leave it as Unicode or convert it to ASCII? The answer depends on whether the thing to the right of the > operator is a file or a program.

Strings inside PowerShell are 16-bit Unicode, instances of .NET’s System.String class. When you redirect the output to a file, the file receives Unicode text. As Bruce Payette says in his book Windows PowerShell in Action,

myScript > file.txt is just syntactic sugar for myScript | out-file -path file.txt

and out-file defaults to Unicode. The advantage of explicitly using out-file is that you can then specify the output format using the -encoding parameter. Possible encoding values include Unicode, UTF8, ASCII, and others.

If the thing on the right side of the redirection operator is a program rather than a file, the encoding is determined by the variable $OutputEncoding. This variable defaults to ASCII encoding because most existing applications do not handle Unicode correctly. However, you can set this variable so PowerShell sends applications Unicode. See Jeffrey Snover’s blog post OuputEncoding to the rescue for details.

Of course if you’re passing strings between pieces of PowerShell code, everything says in Unicode.

Thanks to J_Tom_Moon_79 for suggesting a blog post on this topic.

Entering Unicode characters in Linux

Won currency symbol

I ran across this post from Aaron Toponce explaining how to enter Unicode characters in Linux applications. Hold down the shift and control keys while typing “u” and the hex values of the Unicode character you wish to enter. I tried this and it worked in Firefox, GEdit, and Gnome Terminal, but not in OpenOffice. I was running Ubuntu 7.10.

See also  Three ways to enter Unicode characters in Windows.

* * *

For daily tips on using Unix, follow @UnixToolTip on Twitter.

UnixToolTip twitter icon