Unicode resources

Unicode is essentially a universal character set. It contains nearly every character in every human language. However, Unicode is subtle. As I point out in my blog article on Unicode, it’s hard to say anything pithy about Unicode that is entirely correct. Every simple statement requires footnotes. Here are some resources I’ve found useful in understanding and using Unicode.

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Great introduction to Unicode for developers, as the title suggests.

Unicode Standard by the Unicode Consortium
This 1472-page tome is the indispensable reference for Unicode. The Unicode Consortium has made much of the information in this book available online.

In general, Unicode characters can be inserted into HTML by putting their hexadecimal code point between &#x and a semicolon. For example, the Greek theta (θ) can be inserted into HTML by typing &#x03B8;. Some commonly used characters have mnemonic counterparts, such as &theta; for θ. However, HTML 4 defines only 252 such entities, while Unicode has well over 100,000 characters. Also, in general HTML mnemonic entities cannot be used in XML. There are four exceptions: &amp;, &gt;, &lt;, and &quot;. Note that just because a character is legal HTML does not mean the client’s browser will display it or display it correctly. See also math symbols and Greek letters.
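
The hexadecimal form described above can be generated mechanically. Here is a minimal Python sketch, assuming only the standard library; the helper name to_hex_reference is made up for illustration and is not part of any library.

```python
import html


def to_hex_reference(ch: str) -> str:
    """Return the hexadecimal numeric character reference for one character.

    Hypothetical helper for illustration: it wraps the character's code
    point, written in hexadecimal, between "&#x" and ";".
    """
    if len(ch) != 1:
        raise ValueError("expected exactly one code point")
    return f"&#x{ord(ch):X};"


# Greek small letter theta is U+03B8.
theta_ref = to_hex_reference("θ")

# html.unescape decodes numeric references and named entities alike,
# so both spellings should round-trip to the same character.
decoded_numeric = html.unescape(theta_ref)
decoded_named = html.unescape("&theta;")
```

Leading zeros in a reference are optional, so the sketch emits the shortest form rather than padding to four digits.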

Unicode in XML
Unicode characters can be inserted into XML by quoting their code point numbers in hexadecimal, much like HTML. However, some characters are illegal or at least discouraged because they could confuse XML processors.
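
As a rough illustration of both points, the Python sketch below parses a hexadecimal reference with the standard library's ElementTree and checks code points against the Char production of XML 1.0. The function names is_xml10_char and parses are hypothetical, and the check covers only outright illegal characters, not the merely discouraged ones.

```python
import xml.etree.ElementTree as ET


def is_xml10_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def parses(document: str) -> bool:
    """True if ElementTree accepts the document as well-formed XML."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# A hexadecimal reference to theta (U+03B8) is expanded by the parser.
theta_text = ET.fromstring("<p>&#x3B8;</p>").text

# U+0001 is a control character outside the Char production, so a
# reference to it should make the document ill-formed.
control_char_ok = parses("<p>&#x1;</p>")
```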

XeTeX is a version of TeX that works with Unicode. There is a XeLaTeX version of LaTeX as well.

All about Unicode and Python
Very extensive article.

There Ain’t No Such Thing as Plain Text by Jeff Atwood
Mostly about Unicode encodings such as UTF-8.

Unicode and ISO 10646
Why these are not exactly the same thing and just what the relationship between the two is.

Unicode Explained by Jukka Korpela
This book gets into many of the issues surrounding Unicode that are not part of Unicode per se, such as internationalization and software compatibility.

Andrew West offers numerous Unicode resources on his BabelStone web site, including his BabelPad and BabelMap software. See Andrew West’s blog.
