Comments on: Why Unicode is subtle

By: Stephen Ferg

Stephen Ferg — Tue, 03 Apr 2012 16:37:58 +0000

Nice post! But I think it could be clarified/improved by distinguishing among Unicode, the UCS, and the various encodings. This is something that I tried to do in my post “Unicode – the basics”, where I wrote:

Unicode is actually not one thing, but two separate and distinct things. The first is a character set and the second is a set of encodings.

Unicode = one character set, plus several encodings

I love your statement (because it is so straightforward and clear):

So essentially Unicode is just a catalog of characters with each character assigned a number and a standard name.

… but it would be more accurate to say:

So essentially the Unicode Universal Character Set (UCS) is just a catalog of characters with each character assigned a number and a standard name.

And I’d like to point out that statements like “Unicode is a 16-bit (or 21-bit, or 32-bit) character set” are, if not false, at least misleading. Unicode, in addition to the Universal Character Set (UCS), defines several kinds of encodings of different widths, some fixed-length and some variable-length.
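To make the fixed-length versus variable-length distinction concrete, here is a minimal Python sketch. The sample characters and the helper name `encoded_lengths` are chosen purely for illustration; the codec names are the ones that ship with Python's standard library.

```python
# Encode a few code points in UTF-8, UTF-16 and UTF-32 and count the bytes.
# Big-endian variants are used so that no byte order mark is added.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, é, €, and an emoji


def encoded_lengths(ch):
    """Return the byte length of ch in (UTF-8, UTF-16, UTF-32)."""
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be"))


for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

UTF-32 always uses four bytes per code point, while UTF-8 uses one to four and UTF-16 uses two or four. The character set (the code point numbers) is the same in every case; only the byte representation differs.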

One statement that I very much agree with is

The best way to approach Unicode may be through a sequence of partially true statements.

So in my post I really should have written:

Unicode = one character set, plus several encodings, plus Other Stuff

Other Stuff includes the really interesting and challenging stuff, such as your question “What constitutes a character?”

Another of those interesting questions is “What is (or how do you make) a collating sequence for the characters in [name of a natural language goes here]?” How do you, for instance, define a collation sequence (so that you can make a dictionary, say) for a language such as Chinese? Or for languages where the same character (depending on context) can be represented by different glyphs, or multiple glyphs? Interesting stuff!
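As a small illustration of why collation is a separate problem from assigning code points, the following Python sketch sorts three words by raw code point and then by a deliberately crude key. The word list and the `crude_key` helper are assumptions made for this example; real collation is language-specific and needs a proper collation library, which this sketch does not attempt to replace.

```python
import unicodedata

words = ["zebra", "Äpfel", "apple"]

# Plain sorting compares code point numbers, so "Ä" (U+00C4) lands after "z".
by_code_point = sorted(words)


def crude_key(word):
    """Strip combining marks and fold case. A rough stand-in, not a real collation."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.casefold()


by_crude_key = sorted(words, key=crude_key)

print(by_code_point)
print(by_crude_key)
```

The crude key happens to give a plausible order for these three words, but it ignores every language-specific rule, which is exactly the "Other Stuff" the standard has to address.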

By: John

John — Wed, 30 Nov 2011 03:21:44 +0000

In reply to human mathematics.

I don’t know anything about the J language, but here’s their web site: http://www.jsoftware.com/

Also, Tracy Harms (@kaleidic on Twitter) often writes about J.

By: human mathematics

human mathematics — Tue, 29 Nov 2011 23:54:35 +0000

John, thank you for sharing your Unicode wisdom. Since you've now become an expert on the subject in my mind, I'm coming to you with a question. I'm just starting to play with the J language. As a derivative of APL, J doesn't always stick to ASCII. For example, if you run a: in the J console, you get the following list of characters that make up its alphabet: ┌┬┐├┼┤└┴┘│─ !"#$%&'()*+,-./0123456789:;?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~�� If you know of any J-specific Unicode resources, would you mind adding them to your list?

By: Chris Charabaruk

Chris Charabaruk — Mon, 14 Nov 2011 23:41:24 +0000

Marius: Because in most cases, you can simply treat UTF-16 as UCS-2, since the need for mapping into 0x1???? points rarely comes up for most applications. As for endianness, that’s what the BOM bytes are for, if present. If not, just treat as system-default endian and user beware.
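The BOM handling described here can be sketched in Python roughly as follows. The function name `decode_utf16` and its `default` parameter are hypothetical, introduced only for this example; the BOM constants come from the standard `codecs` module.

```python
import codecs


def decode_utf16(data, default="utf-16-le"):
    """Decode UTF-16 bytes, honouring a BOM if present.

    Without a BOM the byte order cannot be known for certain, so the
    caller-supplied default is used (an assumption, as in the comment above).
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    return data.decode(default)


print(decode_utf16(b"\xff\xfeh\x00i\x00"))  # little-endian BOM
print(decode_utf16(b"\xfe\xff\x00h\x00i"))  # big-endian BOM
print(decode_utf16(b"h\x00i\x00"))          # no BOM, falls back to the default
```

Python's built-in "utf-16" codec already performs a similar BOM check on its own; the explicit version is shown only to make the logic visible.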

By: Marius Gedminas

Marius Gedminas — Sat, 12 Nov 2011 14:37:01 +0000

UTF-16 simpler to process? How so? It combines the disadvantages of UTF-8 (variable-length character encoding) with the disadvantages of UTF-32 (endianness issues).
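Both points can be seen in a few lines of Python. This is only an illustrative sketch; the helper `utf16_units` is a name made up for the example.

```python
def utf16_units(text, byteorder="big"):
    """Split text into its 16-bit UTF-16 code units for the given byte order."""
    enc = "utf-16-be" if byteorder == "big" else "utf-16-le"
    data = text.encode(enc)
    return [int.from_bytes(data[i:i + 2], byteorder) for i in range(0, len(data), 2)]


# Variable length: a BMP character is one unit, U+1F600 needs a surrogate pair.
print([hex(u) for u in utf16_units("A")])
print([hex(u) for u in utf16_units("\U0001F600")])

# Endianness: the same code unit is serialized as different byte sequences.
print("A".encode("utf-16-be"), "A".encode("utf-16-le"))
```

The code units are the same regardless of byte order; only their serialization to bytes changes, which is why a reader needs either a BOM or outside knowledge of the byte order.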

By: Johannes

Johannes — Mon, 28 Jun 2010 22:23:20 +0000

Actually, Unicode is a 21-bit character set.
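The 21-bit figure refers to the range of code points, not to any encoding. A quick Python check of the arithmetic (the constant name `MAX_CODE_POINT` is just a label for this example):

```python
import sys

MAX_CODE_POINT = 0x10FFFF  # the highest code point Unicode defines

# 0x10FFFF needs 21 bits to write down in binary.
print(MAX_CODE_POINT.bit_length())

# The code space is 17 planes of 65,536 code points each.
print(MAX_CODE_POINT + 1 == 17 * 0x10000)

# Python's own upper limit for chr() matches this value.
print(sys.maxunicode == MAX_CODE_POINT)
```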

Then, the question about σ and ς is probably a historical one. Unicode is committed to a 1:1 mapping of legacy character sets into Unicode. So every character that was once included in a character set has a corresponding code point in the UCS. This isn’t nice from an idealist point of view, but Unicode is essentially a tool that tries to work in the current (non-ideal) world.
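For reference, the two sigmas can be inspected with Python's standard `unicodedata` module. This sketch only shows that they are separate code points that share an uppercase form; it does not demonstrate the historical reason given above.

```python
import unicodedata

for ch in "σς":
    # Print the code point number and the standard name of each character.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Both lowercase forms map to the same capital letter...
print("σ".upper() == "ς".upper())

# ...and case folding, meant for caseless comparison, treats them as equal.
print("σ".casefold() == "ς".casefold())
```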

Another thing to note about Unicode is that it’s very complex as a whole. The character set as such is only a tiny part. The standard also includes collation rules for various languages, algorithms for bidirectional text display and handling and many more things.

Another point that makes dealing with it so complex is that the application working with the raw data is one thing; proper rendering and fonts are a different beast. Latin is one of the easiest scripts to support, and yet even there things can go horribly wrong. A layout engine (such as Uniscribe in Windows) has to do proper ligatures of characters in Indic or Arabic scripts, and contextual glyphs are needed in the latter as well. Diacritical marks and other combining characters must be positioned properly, and mixing writing directions due to script changes is also a fairly complex matter. Rules exist for embedding sinographs and Latin into Mongolic script (all of them use a different writing direction). Things like those make the standard quite complex to begin with – a necessary but unfortunate consequence of supporting every written language that exists (or existed).
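Combining characters cause trouble even before rendering, at the level of raw data. A minimal Python sketch, using "é" as an arbitrary example, shows two code point sequences that should display identically but do not compare equal until normalized:

```python
import unicodedata

composed = "\u00e9"     # é as a single precomposed code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

# Same text to a reader, different sequences to the program.
print(composed == decomposed)
print(len(composed), len(decomposed))

# Normalizing both to the same form (NFC here) makes them comparable.
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)
```

Placing the accent correctly over the base letter is then the job of the layout engine and the font, which is the rendering side described above.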