Posts Tagged ‘Unicode’

Entering Unicode characters in Linux

Monday, August 18th, 2008

I ran across this post from Aaron Toponce explaining how to enter Unicode characters in Linux applications. Hold down the shift and control keys while typing “u” and the hex values of the Unicode character you wish to enter. I tried this and it worked in Firefox, GEdit, and Gnome Terminal, but not in OpenOffice. I was running Ubuntu 7.10.

See also  Three ways to enter Unicode characters in Windows.

Three ways to enter Unicode characters in Windows

Sunday, August 17th, 2008

Here are three approaches to entering Unicode characters in Windows. See the next post for entering Unicode characters in Linux.

(1) In Microsoft Word you can insert Unicode characters by typing the hex value of the character then typing Alt-x. You can also see the Unicode value of a character by placing the cursor immediately after the character and pressing Alt-x. This also works in applications that use the Windows rich edit control such as WordPad and Outlook.

Pros: Nothing to install or configure. You can see the numeric value before you turn it into a symbol. It’s handy to be able to go the opposite direction, looking up Unicode values for characters.

Cons: Does not work with many applications.

(2) Another approach which works with more applications is as follows. First create a registry key under HKEY_CURRENT_USER of type REG_SZ called EnableHexNumpad, set its value to 1, and reboot. Then you can enter Unicode symbols by holding down the Alt key and typing the plus sign on the numeric keypad followed by the character value. When you release the Alt key, the symbol will appear. This approach worked with most applications I tried, including Firefox and Safari, but did not with Internet Explorer.

Pros: Works with many applications. No software to install.

Cons: Requires a registry edit and a reboot. It’s awkward to hold down the Alt key while typing several other keys. You cannot see the numbers you’re typing. Doesn’t work with every application.

(3) Another option is to install the UnicodeInput utility. This worked with every application I tried, including Internet Explorer. Once installed, the window below pops up whenever you hold down the Alt key and type the plus sign on the numeric keypad. Type the numeric value of the character in the box, click the Send button, and the character will be inserted into the window that had focus when you clicked Alt-plus.

UnicodeInput screenshot

Pros: Works everywhere (as far as I’ve tried). The software is free. Easy to use.

Cons: Requires installing software.

Multiple string types: BSTR, wchar_t, etc.

Saturday, July 26th, 2008

This morning I listened to a podcast interview with Kate Gregory. She used some terms I hadn’t heard in years: BSTR, OLE strings, etc.

Around a decade ago I was working with COM in C++ and had to deal with the menagerie of string types Kate Gregory mentioned. I wrote an article to get all the various types straight in my head: all the different memory allocation rules, conventions for use, conversions between types, etc. I never published the article. When I started my personal web site I thought about posting the article there, but then I thought that by now nobody cared about such things. But the interview I listened to this morning made me think more people might be interested than I’d thought. So I posted my article Unravelling Strings in Visual C++ in case someone finds it useful.

Why Unicode is subtle

Saturday, April 5th, 2008

On it’s surface, Unicode is simple. It’s a replacement for ASCII to make room for more characters. Joel Spolsky assures us that it’s not that hard. But then how did Jukka Korpela have enough to say to fill his 678-page book Unicode Explained? Why is the Unicode standard 1472 printed pages?

It’s hard to say anything pithy about Unicode that is entirely correct. The best way to approach Unicode may be through a sequence of partially true statements.

The first approximation to a description of Unicode is that it is a 16 bit character set. Sixteen bits are enough to represent the union of all previous character set standards. It’s enough to contain nearly 30,000 CJK (Chinese-Japanese-Korean) characters with space left for mathematical symbols, braille, dingbats, etc.

Actually, Unicode is a 32-bit character set. It started out as a 16-bit character set. The first 16 bit range of the Unicode standard is called the Basic Multilingual Plane (BMP), and is complete for most purposes. The regions outside the BMP contain characters for archaic and fictional languages, rare CJK characters, and various symbols.

So essentially Unicode is just a catalog of characters with each character assigned a number and a standard name. What could be so complicated about that?

Well, for starters there’s the issue of just what constitutes a character. For example, Greek writes the letter sigma as σ in the middle of a word but as ς at the end of a word. Are σ and ς two representations of one character or two characters? (Unicode says two characters.) Should the Greek letter π and the mathematical constant π be the same character? (Unicode says yes.) Should the Greek letter Ω and the symbol for electrical resistence in Ohms Ω be the same character? (Unicode says no.) The difficulties get more subtle (and politically charged) when considering Asian ideographs.

Once have agreement on how to catalog tens of thousands of characters, there’s still the question of how to map the Unicode characters to bytes. You could think of each byte representation as a compression or compatibility scheme. The most commonly used systems are UTF-8, and  UTF-16. The former is more compact (for Western languages) and compatible with ASCII. The latter is simpler to process. Once you agree on a byte representation, there’s the issue of how to order the bytes (endianness).

Once you’ve resolved character sets and encoding, there remain issues of software compatibility. For example, which web browsers and operating systems support which representations of Unicode? Which operating systems supply fonts for which characters? How do they behave when the desired font is unavailable? How do various programming languages support Unicode? What software can be used to produce Unicode? What happens when you copy a Unicode string from one program and paste it into another?

Things get even more complicated when you want to process Unicode text because this brings up internationalization and localization issues. These are extremely complex, though they’re not complexities with Unicode per se.

For more links, see my Unicode resources.