Why Unicode is subtle

On its surface, Unicode is simple. It’s a replacement for ASCII to make room for more characters. Joel Spolsky assures us that it’s not that hard. But then how did Jukka Korpela have enough to say to fill his 678-page book Unicode Explained? Why is the Unicode standard 1472 printed pages?

It’s hard to say anything pithy about Unicode that is entirely correct. The best way to approach Unicode may be through a sequence of partially true statements.

The first approximation to a description of Unicode is that it is a 16-bit character set. Sixteen bits are enough to represent the union of all previous character set standards. It’s enough to contain nearly 30,000 CJK (Chinese-Japanese-Korean) characters with space left over for mathematical symbols, braille, dingbats, etc.

Actually, Unicode is a 32-bit character set. It started out as a 16-bit character set. The first 16 bit range of the Unicode standard is called the Basic Multilingual Plane (BMP), and is complete for most purposes. The regions outside the BMP contain characters for archaic and fictional languages, rare CJK characters, and various symbols.
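The plane structure is easy to see by inspecting code points directly. Here is a quick Python sketch (my choice of language, not anything from the standard itself); any code point at or below U+FFFF lies in the BMP:

```python
# A code point's plane is its value divided by 0x10000; plane 0 is the BMP.
for ch in ["A", "中", "\U0001d54a"]:  # Latin, CJK, and mathematical 𝕊 (U+1D54A)
    cp = ord(ch)
    print(f"U+{cp:04X} is in plane {cp >> 16}")

# The largest code point is U+10FFFF, which is why 21 bits suffice.
print(hex(0x10FFFF), (0x10FFFF).bit_length())
```

The third character, outside the BMP, is the kind that trips up software written on the assumption that one character is one 16-bit unit.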

So essentially Unicode is just a catalog of characters with each character assigned a number and a standard name. What could be so complicated about that?

Well, for starters there’s the issue of just what constitutes a character. For example, Greek writes the letter sigma as σ in the middle of a word but as ς at the end of a word. Are σ and ς two representations of one character or two characters? (Unicode says two characters.) Should the Greek letter π and the mathematical constant π be the same character? (Unicode says yes.) Should the Greek letter Ω and the symbol for electrical resistance in Ohms Ω be the same character? (Unicode says no.) The difficulties get more subtle (and politically charged) when considering Asian ideographs.
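These decisions are recorded in the Unicode character database and can be queried. A small Python illustration of the sigma and omega cases above, using explicit code point escapes to avoid ambiguity:

```python
import unicodedata

# Sigma and final sigma are two distinct characters with their own names.
assert unicodedata.name("\u03c3") == "GREEK SMALL LETTER SIGMA"
assert unicodedata.name("\u03c2") == "GREEK SMALL LETTER FINAL SIGMA"

# The ohm sign U+2126 is a separate character from capital omega U+03A9...
assert unicodedata.name("\u2126") == "OHM SIGN"
assert unicodedata.name("\u03a9") == "GREEK CAPITAL LETTER OMEGA"

# ...but NFC normalization folds the ohm sign into the Greek letter.
assert unicodedata.normalize("NFC", "\u2126") == "\u03a9"
```

So the answer to "are they the same character?" can be "no, but normalization treats them as equivalent," which is a further layer of subtlety.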

Once we have agreement on how to catalog tens of thousands of characters, there’s still the question of how to map the Unicode characters to bytes. You could think of each byte representation as a compression or compatibility scheme. The most commonly used encodings are UTF-8 and UTF-16. The former is more compact (for Western languages) and compatible with ASCII. The latter is simpler to process. Once you agree on a byte representation, there’s the issue of how to order the bytes (endianness).
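A short Python example makes these trade-offs concrete: UTF-8 leaves ASCII bytes untouched, while UTF-16 comes in two byte orders:

```python
# The Greek letter π is code point U+03C0.
print("π".encode("utf-8"))       # two bytes: 0xCF 0x80
print("π".encode("utf-16-be"))   # big-endian 16-bit unit:    0x03 0xC0
print("π".encode("utf-16-le"))   # same unit, bytes swapped:  0xC0 0x03

# A pure-ASCII string is its own UTF-8 encoding -- the compatibility claim.
print("plain ASCII".encode("utf-8"))
```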

Once you’ve resolved character sets and encoding, there remain issues of software compatibility. For example, which web browsers and operating systems support which representations of Unicode? Which operating systems supply fonts for which characters? How do they behave when the desired font is unavailable? How do various programming languages support Unicode? What software can be used to produce Unicode? What happens when you copy a Unicode string from one program and paste it into another?

Things get even more complicated when you want to process Unicode text because this brings up internationalization and localization issues. These are extremely complex, though they’re not complexities with Unicode per se.

For more links, see my Unicode resources.

7 thoughts on “Why Unicode is subtle”

  1. Actually, Unicode is a 21-bit character set.

    Then, the question about σ and ς is probably a historical one. Unicode is committed to a 1:1 mapping of legacy character sets into Unicode. So every character that was once included in a character set has a corresponding code point in the UCS. This isn’t nice from an idealist point of view, but Unicode is essentially a tool that tries to work in the current (non-ideal) world.

    Another thing to note about Unicode is that it’s very complex as a whole. The character set as such is only a tiny part. The standard also includes collation rules for various languages, algorithms for bidirectional text display and handling and many more things.

    Another point that makes dealing with it so complex is that the application working with the raw data is one thing, the proper rendering and fonts are a different beast. Latin is one of the easiest scripts to support and yet even there things can go horribly wrong. A layout engine (such as Uniscribe in Windows) has to do proper ligatures of characters in Indic or Arabic scripts, contextual glyphs are needed in the latter as well. Diacritical marks and other combining characters must be positioned properly, mixing writing directions due to script changes is also a fairly complex matter. Rules for embedding sinographs and Latin into Mongolic script (all of them use a different writing direction) exist. Things like those make the standard quite complex to begin with – a necessary but unfortunate consequence of supporting every written language that exists (or existed).
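The 1:1 legacy-mapping commitment the commenter describes can be checked mechanically. For example, ISO 8859-7 (Greek) already distinguished σ from ς, so both must round-trip through their own Unicode code points (a Python sketch):

```python
# σ (U+03C3) and ς (U+03C2) each occupy one byte in ISO 8859-7,
# and each maps to its own Unicode code point and back unchanged.
for ch in ["\u03c3", "\u03c2"]:
    b = ch.encode("iso-8859-7")
    assert len(b) == 1
    assert b.decode("iso-8859-7") == ch
```

This is why Unicode could not merge σ and ς into one character even if it had wanted to.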

  2. Marius: Because in most cases, you can simply treat UTF-16 as UCS-2, since the need for mapping into 0x1???? points rarely comes up for most applications. As for endianness, that’s what the BOM bytes are for, if present. If not, just treat as system-default endian and user beware.
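The UCS-2-vs-UTF-16 distinction the commenter mentions shows up exactly at the characters outside the BMP, which UTF-16 encodes as surrogate pairs; the BOM bytes are likewise easy to inspect. A Python sketch:

```python
import codecs

# The musical G clef, U+1D11E, lies outside the BMP, so UTF-16 encodes it
# as a surrogate pair: high surrogate D834, low surrogate DD1E.
print("\U0001d11e".encode("utf-16-be").hex())

# The BOM (byte order mark, U+FEFF) signals endianness at the start of a stream.
print(codecs.BOM_UTF16_BE)  # b'\xfe\xff'
print(codecs.BOM_UTF16_LE)  # b'\xff\xfe'
```

Software that treats UTF-16 as UCS-2 sees that clef as two separate "characters," which is fine until it isn't.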

  3. John, thank you for sharing your Unicode wisdom. Since you’ve now become an expert on the subject in my mind, I’m coming to you with a question. I’m just starting to play with the J language.

    As a derivative of APL, J doesn’t always stick to ASCII. For example if you run a: in the J console, you get the following list of characters which comprise its alphabet:

    

    ┌┬┐├┼┤└┴┘│─ !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~ [the remaining characters, bytes 128–255, were garbled in the paste]

    If you know of any J-specific unicode resources, would you mind adding them to your list?

  4. Nice post! But I think it could be clarified/improved by distinguishing Unicode from the UCS from the various encodings. This is something that I tried to do in my post Unicode – the basics where I wrote:

    Unicode is actually not one thing, but two separate and distinct things. The first is a character set and the second is a set of encodings.

    Unicode = one character set, plus several encodings

    I love your statement (because it is so straight-forward and clear):

    So essentially Unicode is just a catalog of characters with each character assigned a number and a standard name.

    … but it would be more accurate to say:

    So essentially the Unicode Universal Character Set (UCS) is just a catalog of characters with each character assigned a number and a standard name.

    And I’d like to point out that statements like “Unicode is a 16-bit (or 21-bit, or 32-bit) character set” are, if not false, at least misleading. Unicode, in addition to the Universal Character Set (UCS) defines a number of different kinds of encodings — some fixed-length and some variable-length — of different lengths.

    One statement that I very much agree with is

    The best way to approach Unicode may be through a sequence of partially true statements.

    So in my post I really should have written:

    Unicode = one character set, plus several encodings, plus Other Stuff

    Other Stuff includes the really interesting and challenging stuff, such as your question “What constitutes a character?”

    Another of those interesting questions is “What is (or how do you make) a collating sequence for the characters in [name of a natural language goes here]?” How do you, for instance, define a collation sequence (so you can, for instance, make a dictionary) for a language such as Chinese? Or languages where the same character (depending on context) can be represented by different glyphs, or multiple glyphs? Interesting stuff!
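Collation is a good example of the Other Stuff: sorting by raw code point gives an order that no dictionary uses, even for Latin-script languages. A Python sketch with French words chosen arbitrarily:

```python
# Code-point order puts every accented letter after all of unaccented ASCII,
# so "côte" sorts after "coté"; French dictionary order interleaves them.
words = ["côté", "cote", "côte", "coté"]
print(sorted(words))  # ['cote', 'coté', 'côte', 'côté'] by raw code point

# Proper collation needs the Unicode Collation Algorithm plus per-locale
# tailoring, e.g. via the third-party PyICU library.
```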
