Posts tagged as:

Unicode

The disappointing state of Unicode fonts

by John on January 16, 2010

Modern operating systems understand Unicode internally, but font support for Unicode is spotty. For an example of the problems this can cause, take a look at these screen shots of how the same Twitter message appears differently depending on what program is used to read it.

No font can display all Unicode characters. According to Wikipedia,

… it would be impossible to create such a font in any common font format, as Unicode includes over 100,000 characters, while no widely-used font format supports more than 65,535 glyphs.

However, the biggest problem isn’t the number of characters a font can display. Most Unicode characters are quite rare. About 30,000 characters are enough to display the vast majority of characters in use in all the world’s languages as well as a generous selection of symbols. However, Unicode fonts vary greatly in their support even for the more commonly used ranges of characters. See this comparison chart. The only range completely covered by all Unicode fonts in the chart is the 128 characters of Latin Extended-A.

Unifont supports all printable characters in the basic multilingual plane, characters U+0000 through U+FFFF. This includes the 30,000 characters mentioned above plus many more. Unifont isn’t pretty, but it’s complete. As far as I know, it’s the only font that covers the characters below U+FFFF.
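To get a rough feel for how many characters live in the basic multilingual plane, here is a minimal Python sketch that counts the code points from U+0000 through U+FFFF that have a name in the Unicode database bundled with Python. The function name is my own, the count depends on the Unicode version your Python ships with, and "has a name" is only a proxy for "printable", so treat the result as approximate.

```python
import unicodedata


def count_named_bmp_characters():
    """Count BMP code points that have a name in Python's Unicode database."""
    count = 0
    for code_point in range(0x10000):
        # unicodedata.name returns the default (None here) for code points
        # without a name: unassigned, surrogate, private-use, and control.
        if unicodedata.name(chr(code_point), None) is not None:
            count += 1
    return count


if __name__ == "__main__":
    print(count_named_bmp_characters())
```

Whatever the exact number on your system, it should be well above the 30,000 or so characters mentioned above.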

Related posts:

Why Unicode is subtle

Entering Unicode characters in Windows, Linux

{ 15 comments }

Unicode function names

by John on March 22, 2009

Keith Hill has a fun blog post on using Unicode characters in PowerShell function names. Here’s an example from his article using the square root symbol for the square root function.

PS> function √($num) { [Math]::Sqrt($num) }
PS> √ 81
9

As Keith points out, these symbols are not practical since they’re difficult to enter, but they’re fun to play around with.

Here’s another example using the symbol for pounds sterling, £, for the function to convert British pounds to US dollars.

PS> function £($num) { 1.44*$num }
PS> £ 300.00
432

(As I write this, a British pound is worth $1.44 USD. If you wanted to get fancy, you could call a web service in your function to get the current exchange rate.)

I read once that someone (Larry Wall?) had semi-seriously suggested using the Japanese Yen currency symbol, ¥, for the “zip” function in Perl 6 since the symbol looks like a zipper.

Mathematica lets you use Greek letters as variable and function names, and it provides convenient ways to enter these characters, either graphically or via their TeX representations. I think this is a great idea. It could make mathematical source code much more readable. But I don’t use it because I’ve never got into the habit of doing so.
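Mathematica isn't the only language that accepts Greek letters in names. As a small sketch in Python (my choice for illustration, not something the PowerShell examples above rely on), letters such as φ and π are valid identifiers, while symbols such as √ and £ are not. The function name φ is just an example.

```python
import math


def φ(x):
    """Standard normal density, named with the Greek letter often used for it."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


if __name__ == "__main__":
    print(φ(0.0))
    # Letters are allowed in Python identifiers; math and currency symbols are not.
    for candidate in ("π", "φ", "√", "£"):
        print(candidate, candidate.isidentifier())
```

So the √ and £ function names from the PowerShell examples would not carry over to Python, but Greek-letter math notation would.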

There are some dangers to allowing Unicode characters in programming languages. Because Unicode characters are semantic rather than visual, two characters may have the same graphical representation. Here are a couple of examples. The Roman letter A (U+0041) and the capital Greek letter Α (U+0391) look the same but correspond to different characters. Also, the Greek letter Ω (U+03A9) and the symbol Ω (U+2126) for Ohms (unit of electrical resistance) have the same visual representation but are different characters. (Or at least they may have the same visual representation. A font designer may choose, for example, to distinguish Omega and Ohm, but that’s not a concern to the Unicode Consortium.)
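The look-alike pairs above are easy to tell apart in code even when they are not on screen. Here is a minimal Python sketch using the standard unicodedata module; the helper names are my own. It also shows that Unicode normalization (NFC) folds the Ohm sign into the Greek Omega but leaves Latin A and Greek Alpha distinct.

```python
import unicodedata


def describe(ch):
    """Return the code point and official Unicode name of a character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def same_after_nfc(a, b):
    """Compare two strings after canonical (NFC) normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    pairs = [("\u0041", "\u0391"), ("\u03a9", "\u2126")]
    for a, b in pairs:
        print(describe(a), "|", describe(b))
        print("  equal:", a == b, " equal after NFC:", same_after_nfc(a, b))
```

Normalization helps with Omega and Ohm, but it is no cure for look-alikes in general, as the A and Alpha pair shows.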

{ 1 comment }

Sharps and flats in HTML

by John on March 16, 2009

Apparently there’s no HTML entity for the flat symbol, ♭. In my previous post, I just spelled out B-flat because I thought that was safer; it’s possible not everyone would have the fonts installed to display B♭ correctly.

So how do you display music symbols for flat, sharp, and natural in HTML? You can insert any symbol if you know its Unicode value, though you run the risk that someone viewing the page may not have the necessary fonts installed to view the symbol. Here are the Unicode values for flat, natural, and sharp.

flat ♭ U+266D
natural ♮ U+266E
sharp ♯ U+266F

Since the flat sign has Unicode value U+266D, you could enter &#x266D; into HTML to display that symbol.
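If you need to do this for more than one symbol, a few lines of code can generate the numeric character references. This is a minimal Python sketch; numeric_reference is a name I made up, and html.unescape from the standard library is used only to check the round trip.

```python
import html


def numeric_reference(ch):
    """Return the hexadecimal HTML character reference for one character."""
    return f"&#x{ord(ch):X};"


if __name__ == "__main__":
    for name, ch in [("flat", "\u266d"), ("natural", "\u266e"), ("sharp", "\u266f")]:
        ref = numeric_reference(ch)
        # html.unescape turns the reference back into the character.
        print(name, ref, html.unescape(ref) == ch)
```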

The sharp sign raises an interesting question. I’m sure most web pages referring to G-sharp would use the number sign # (U+0023) rather than the sharp sign ♯ (U+266F). And why not? The number sign is conveniently located on a standard keyboard and the sharp sign isn’t. It would be nice if people used sharp symbols rather than number signs. It would make it easier to search on specifically musical terms. But it’s not going to happen.

Related posts:

Entering Unicode characters in Linux
Three ways to enter Unicode characters in Windows
Greek letters and math symbols in (X)HTML

{ 2 comments }

Shorter URLs by using Unicode

by John on March 12, 2009

Tinyarro.ws is a service like tinyurl.com and others that shorten URLs. However, unlike similar services, Tinyarro.ws uses Unicode characters, allowing it to encode more possibilities into each character. These sub-compact URLs may contain Chinese characters, for example, or other symbols unfamiliar to many users. They’re no good for reading aloud, say over the phone or on a podcast. But they’re ideal for Twitter because you only have to click on the link, not type it into a browser.

Here’s a URL I got when I tried the Tinyarro.ws site:

screen shot from tinyarro.ws

The resulting URL may not display correctly in your browser depending on what fonts you have installed: http://➡.ws/㣸.

I pasted the URL into Microsoft Word and used Alt-x to see what the Unicode characters were. (See Three ways to enter Unicode characters in Windows.) The arrow is code point U+27A1 and the final character is code point U+38F8. I have no idea what that character means. I would appreciate someone letting me know in the comments.
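If you don't have Word handy, the same lookup can be sketched in Python. The function name below is my own; it lists the code point and Unicode name of every non-ASCII character in a string.

```python
import unicodedata


def non_ascii_code_points(text):
    """List (code point, name) for each non-ASCII character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<no name>"))
        for ch in text
        if ord(ch) > 127
    ]


if __name__ == "__main__":
    for code_point, name in non_ascii_code_points("http://\u27a1.ws/\u38f8"):
        print(code_point, name)
```

The Unicode name of a CJK ideograph is just its code point, so this identifies the character but doesn't tell you what it means.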

Unicode character U+38F8

Related post: How to insert graphics in Twitter messages

{ 4 comments }

How to insert graphics in Twitter messages

by John on January 14, 2009

I saw a couple of interesting messages from Jon Udell on Twitter yesterday.

screen shot of Twitter posts

Apparently Bill Zeller had the clever idea of using Unicode characters to put sparklines in Twitter messages. It didn’t occur to me that Twitter would accept Unicode. I’m sure the intent of Unicode support is multi-lingual text, not clever hacks such as creating sparklines, but this is fun to play around with.

The sparkline above was created using block element symbols. I tried a few other Unicode characters. Some worked, but most didn’t. The Twitter protocol supports Unicode, but particular Twitter clients may not have fonts for displaying some characters. For example, I found some characters would display correctly when I went to twitter.com that didn’t display from the Twhirl client.
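To make the idea concrete, here is a minimal Python sketch of a sparkline built from the eight block element characters U+2581 through U+2588. It is my own illustration, not necessarily how the tweets above were generated: each value is scaled linearly between the minimum and maximum and mapped to the nearest block height.

```python
BLOCKS = "\u2581\u2582\u2583\u2584\u2585\u2586\u2587\u2588"  # lowest to tallest


def sparkline(values):
    """Map a sequence of numbers to a string of block element characters."""
    if not values:
        return ""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values equal: draw a flat line of the lowest block.
        return BLOCKS[0] * len(values)
    top = len(BLOCKS) - 1
    # Scale each value to 0..top and round to the nearest block index.
    return "".join(BLOCKS[int((v - lo) * top / span + 0.5)] for v in values)


if __name__ == "__main__":
    print(sparkline([3, 1, 4, 1, 5, 9, 2, 6]))
```

Whether the result displays correctly still depends on the fonts available to whatever client shows it.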

To enter a Unicode character, you can find out the numeric value here. Once you know the value, here are tips for how to insert Unicode characters in Windows and Linux applications. You might, for example, create the symbols you want using Microsoft Word and then paste them into your Twitter client.

Update (8 January 2010):

Here are some examples of how the same tweet may appear differently on the same computer, using some screen shots I took this morning.

First, here’s the view from Twitter’s web site using Firefox 3.5.7 on Windows:

Here’s Firefox 3.5.3 on Ubuntu 9.10, nearly the same but a couple characters are missing.

Here’s the view from IE 8 on Windows:

And now Safari on Windows:

Here’s the view from TweetDeck on Windows with its default font:

And now TweetDeck on Windows with its international font:

I imagine TweetDeck would look the same on other operating systems since Adobe Air is largely self-contained.

{ 2 comments }

What does the redirection operator > in PowerShell do to text: leave it as Unicode or convert it to ASCII? The answer depends on whether the thing to the right of the > operator is a file or a program.

Strings inside PowerShell are 16-bit Unicode, instances of .NET’s System.String class. When you redirect the output to a file, the file receives Unicode text. As Bruce Payette says in his book Windows PowerShell in Action,

myScript > file.txt is just syntactic sugar for myScript | out-file -path file.txt

and out-file defaults to Unicode. The advantage of explicitly using out-file is that you can then specify the output format using the -encoding parameter. Possible encoding values include Unicode, UTF8, ASCII, and others.

If the thing on the right side of the redirection operator is a program rather than a file, the encoding is determined by the variable $OutputEncoding. This variable defaults to ASCII encoding because most existing applications do not handle Unicode correctly. However, you can set this variable so PowerShell sends applications Unicode. See Jeffrey Snover’s blog post OuputEncoding to the rescue for details.
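To see why the choice of encoding matters, here is a Python stand-in that shows what happens to the bytes of a short string under three encodings. It does not call PowerShell. I'm assuming "utf-16-le" as the closest match to what PowerShell calls Unicode (minus the byte order mark a file would get), and the "?" substitution for ASCII is Python's behavior, which I'd expect to resemble what an ASCII-only program receives.

```python
def encode_for_output(text, encoding):
    """Encode text, replacing characters the target encoding cannot represent."""
    # errors="replace" substitutes "?" for unencodable characters
    # instead of raising an exception.
    return text.encode(encoding, errors="replace")


if __name__ == "__main__":
    text = "\u221a 81"  # square root sign, space, 81
    for encoding in ("utf-16-le", "utf-8", "ascii"):
        data = encode_for_output(text, encoding)
        print(f"{encoding:10} {len(data)} bytes  {data!r}")
```

The ASCII version silently loses the square root sign, which is the kind of damage you risk when piping non-ASCII text to a program with the default $OutputEncoding.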

Of course, if you’re passing strings between pieces of PowerShell code, everything stays in Unicode.

Thanks to J_Tom_Moon_79 for suggesting a blog post on this topic.

{ 3 comments }

Entering Unicode characters in Linux

by John on August 18, 2008

I ran across this post from Aaron Toponce explaining how to enter Unicode characters in Linux applications. Hold down the shift and control keys while typing “u” and the hex values of the Unicode character you wish to enter. I tried this and it worked in Firefox, GEdit, and Gnome Terminal, but not in OpenOffice. I was running Ubuntu 7.10.
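The hex value you type is just the character's code point written in base 16. As a minimal Python sketch of the same conversion (the function name is my own):

```python
def char_from_hex(hex_digits):
    """Convert a string of hex digits, such as "266d", to the character it names."""
    # int(..., 16) parses the hex digits; chr raises ValueError
    # if the result is beyond U+10FFFF, the top of the Unicode range.
    return chr(int(hex_digits, 16))


if __name__ == "__main__":
    for digits in ("3c0", "266d", "221a"):
        print(digits, char_from_hex(digits))
```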

See also Three ways to enter Unicode characters in Windows.

{ 4 comments }

Here are three approaches to entering Unicode characters in Windows. See the next post for entering Unicode characters in Linux.

(1) In Microsoft Word you can insert Unicode characters by typing the hex value of the character then typing Alt-x. You can also see the Unicode value of a character by placing the cursor immediately after the character and pressing Alt-x. This also works in applications that use the Windows rich edit control such as WordPad and Outlook.

Pros: Nothing to install or configure. You can see the numeric value before you turn it into a symbol. It’s handy to be able to go the opposite direction, looking up Unicode values for characters.

Cons: Does not work with many applications.

(2) Another approach, which works with more applications, is as follows. First create a registry value named EnableHexNumpad of type REG_SZ under HKEY_CURRENT_USER\Control Panel\Input Method, set it to 1, and reboot. Then you can enter Unicode symbols by holding down the Alt key and typing the plus sign on the numeric keypad followed by the hex value of the character. When you release the Alt key, the symbol will appear. This approach worked with most applications I tried, including Firefox and Safari, but not with Internet Explorer.

Pros: Works with many applications. No software to install.

Cons: Requires a registry edit and a reboot. It’s awkward to hold down the Alt key while typing several other keys. You cannot see the numbers you’re typing. Doesn’t work with every application.

(3) Another option is to install the UnicodeInput utility. This worked with every application I tried, including Internet Explorer. Once installed, the window below pops up whenever you hold down the Alt key and type the plus sign on the numeric keypad. Type the numeric value of the character in the box, click the Send button, and the character will be inserted into the window that had focus when you pressed Alt-plus.

UnicodeInput screenshot

Pros: Works everywhere (as far as I’ve tried). The software is free. Easy to use.

Cons: Requires installing software.

Related links:

Entering Unicode characters in Linux
Unicode resources
Greek letters in HTML, XML, TeX, and Unicode

{ 2 comments }

This morning I listened to a podcast interview with Kate Gregory. She used some terms I hadn’t heard in years: BSTR, OLE strings, etc.

Around a decade ago I was working with COM in C++ and had to deal with the menagerie of string types Kate Gregory mentioned. I wrote an article to get all the various types straight in my head: all the different memory allocation rules, conventions for use, conversions between types, etc. I never published the article. When I started my personal web site I thought about posting the article there, but then I thought that by now nobody cared about such things. But the interview I listened to this morning made me think more people might be interested than I’d thought. So I posted my article Unravelling Strings in Visual C++ in case someone finds it useful.

{ 0 comments }

Why Unicode is subtle

by John on April 5, 2008

On its surface, Unicode is simple. It’s a replacement for ASCII to make room for more characters. Joel Spolsky assures us that it’s not that hard. But then how did Jukka Korpela have enough to say to fill his 678-page book Unicode Explained? Why is the Unicode standard 1472 printed pages?

It’s hard to say anything pithy about Unicode that is entirely correct. The best way to approach Unicode may be through a sequence of partially true statements.

The first approximation to a description of Unicode is that it is a 16-bit character set. Sixteen bits are enough to represent the union of all previous character set standards. It’s enough to contain nearly 30,000 CJK (Chinese-Japanese-Korean) characters with space left for mathematical symbols, braille, dingbats, etc.

Actually, Unicode is a 32-bit character set. It started out as a 16-bit character set. The first 16-bit range of the Unicode standard is called the Basic Multilingual Plane (BMP), and is complete for most purposes. The regions outside the BMP contain characters for archaic and fictional languages, rare CJK characters, and various symbols.
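Here is a small Python sketch of what "outside the BMP" looks like in practice, using the musical G clef at U+1D11E as an example of my own choosing. Its code point doesn't fit in 16 bits, so UTF-16 has to spend two 16-bit units (a surrogate pair) on it. As a further refinement of the statements above, valid code points stop at U+10FFFF, so they need 21 bits rather than a full 32.

```python
import unicodedata


def plane(ch):
    """Return the Unicode plane number of a character (0 is the BMP)."""
    return ord(ch) >> 16


def utf16_units(ch):
    """Return how many 16-bit code units UTF-16 needs for a character."""
    return len(ch.encode("utf-16-le")) // 2


if __name__ == "__main__":
    for ch in ("\u266d", "\U0001d11e"):
        print(f"U+{ord(ch):04X}", unicodedata.name(ch),
              "plane", plane(ch), "UTF-16 units", utf16_units(ch))
```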

So essentially Unicode is just a catalog of characters with each character assigned a number and a standard name. What could be so complicated about that?

Well, for starters there’s the issue of just what constitutes a character. For example, Greek writes the letter sigma as σ in the middle of a word but as ς at the end of a word. Are σ and ς two representations of one character or two characters? (Unicode says two characters.) Should the Greek letter π and the mathematical constant π be the same character? (Unicode says yes.) Should the Greek letter Ω and the symbol for electrical resistance in Ohms Ω be the same character? (Unicode says no.) The difficulties get more subtle (and politically charged) when considering Asian ideographs.
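The two sigmas show how a cataloging decision leaks into everyday string handling. In this minimal Python sketch (the helper name is my own), σ and ς are different characters with different names, so a plain comparison treats them as unequal. Case folding maps both to the same letter, and Python's lower() picks the final form at the end of a word.

```python
import unicodedata


def same_ignoring_case(a, b):
    """Compare two strings after Unicode case folding."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    medial, final = "\u03c3", "\u03c2"
    print(unicodedata.name(medial))
    print(unicodedata.name(final))
    print("equal:", medial == final)
    print("equal ignoring case:", same_ignoring_case(medial, final))
    # Capital omicron followed by capital sigma, lowercased.
    print("\u039f\u03a3".lower())
```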

Once you have agreement on how to catalog tens of thousands of characters, there’s still the question of how to map the Unicode characters to bytes. You could think of each byte representation as a compression or compatibility scheme. The most commonly used systems are UTF-8 and UTF-16. The former is more compact (for Western languages) and compatible with ASCII. The latter is simpler to process. Once you agree on a byte representation, there’s the issue of how to order the bytes (endianness).
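A short Python sketch makes the trade-offs visible. The sample strings are my own. English text takes half as many bytes in UTF-8 as in UTF-16, Japanese text goes the other way, and the two byte orders of UTF-16 differ only in which byte of each 16-bit unit comes first.

```python
def encoded_sizes(text):
    """Return the number of bytes text occupies in UTF-8 and in UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # no byte order mark
    }


if __name__ == "__main__":
    for sample in ("Hello", "\u65e5\u672c\u8a9e"):
        print(sample, encoded_sizes(sample))

    # The same character in little-endian and big-endian UTF-16.
    print("A".encode("utf-16-le"), "A".encode("utf-16-be"))
    # UTF-8 leaves ASCII text byte-for-byte unchanged.
    print("Hello".encode("utf-8") == "Hello".encode("ascii"))
```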

Once you’ve resolved character sets and encoding, there remain issues of software compatibility. For example, which web browsers and operating systems support which representations of Unicode? Which operating systems supply fonts for which characters? How do they behave when the desired font is unavailable? How do various programming languages support Unicode? What software can be used to produce Unicode? What happens when you copy a Unicode string from one program and paste it into another?

Things get even more complicated when you want to process Unicode text because this brings up internationalization and localization issues. These are extremely complex, though they’re not complexities with Unicode per se.

For more links, see my Unicode resources.

{ 0 comments }