<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Why Unicode is subtle</title>
	<atom:link href="http://www.johndcook.com/blog/2008/04/05/why-unicode-is-subtle/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.johndcook.com/blog/2008/04/05/why-unicode-is-subtle/</link>
	<description>The blog of John D. Cook</description>
	<lastBuildDate>Sat, 11 Feb 2012 01:10:06 -0500</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: John</title>
		<link>http://www.johndcook.com/blog/2008/04/05/why-unicode-is-subtle/comment-page-1/#comment-118329</link>
		<dc:creator>John</dc:creator>
		<pubDate>Wed, 30 Nov 2011 03:21:44 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/2008/04/05/why-unicode-is-subtle/#comment-118329</guid>
		<description>I don&#039;t know anything about the J language, but here&#039;s their web site: http://www.jsoftware.com/

Also, Tracy Harms (@kaleidic on Twitter) often writes about J.</description>
		<content:encoded><![CDATA[<p>I don&#8217;t know anything about the J language, but here&#8217;s their web site: <a href="http://www.jsoftware.com/" rel="nofollow">http://www.jsoftware.com/</a></p>
<p>Also, Tracy Harms (@kaleidic on Twitter) often writes about J.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: human mathematics</title>
		<link>http://www.johndcook.com/blog/2008/04/05/why-unicode-is-subtle/comment-page-1/#comment-118294</link>
		<dc:creator>human mathematics</dc:creator>
		<pubDate>Tue, 29 Nov 2011 23:54:35 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/2008/04/05/why-unicode-is-subtle/#comment-118294</guid>
		<description>John, thank you for sharing your Unicode wisdom. Since you&#039;ve now become an expert on the subject in my mind, I&#039;m coming to you with a question. I&#039;m just starting to play with the &lt;code&gt;J&lt;/code&gt; language.

As a derivative of APL, &lt;code&gt;J&lt;/code&gt; doesn&#039;t always stick to ASCII. For example, if you run &lt;code&gt;a.&lt;/code&gt; in the &lt;code&gt;J&lt;/code&gt; console, you get the following list of characters, which make up its alphabet:





┌┬┐├┼┤└┴┘│─ !&quot;#$%&amp;&#039;()*+,-./0123456789:;&lt;=&gt;?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{&#124;}~



If you know of any J-specific unicode resources, would you mind adding them to your list?</description>
		<content:encoded><![CDATA[<p>John, thank you for sharing your Unicode wisdom. Since you&#8217;ve now become an expert on the subject in my mind, I&#8217;m coming to you with a question. I&#8217;m just starting to play with the <code>J</code> language.</p>
<p>As a derivative of APL, <code>J</code> doesn&#8217;t always stick to ASCII. For example, if you run <code>a.</code> in the <code>J</code> console, you get the following list of characters, which make up its alphabet:</p>
<p></p>
<p>┌┬┐├┼┤└┴┘│─ !&#8221;#$%&amp;&#8217;()*+,-./0123456789:;?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~������������������������������������������������������������������������������������������������������������������������������</p>
<p>If you know of any J-specific unicode resources, would you mind adding them to your list?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Chris Charabaruk</title>
		<link>http://www.johndcook.com/blog/2008/04/05/why-unicode-is-subtle/comment-page-1/#comment-114834</link>
		<dc:creator>Chris Charabaruk</dc:creator>
		<pubDate>Mon, 14 Nov 2011 23:41:24 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/2008/04/05/why-unicode-is-subtle/#comment-114834</guid>
		<description>Marius: Because in most cases, you can simply treat UTF-16 as UCS-2, since the need for mapping into 0x1???? points rarely comes up for most applications. As for endianness, that&#039;s what the BOM bytes are for, if present. If not, just treat as system-default endian and user beware.</description>
		<content:encoded><![CDATA[<p>Marius: Because in most cases, you can simply treat UTF-16 as UCS-2, since the need for mapping into 0x1???? points rarely comes up for most applications. As for endianness, that&#8217;s what the BOM bytes are for, if present. If not, just treat as system-default endian and user beware.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Marius Gedminas</title>
		<link>http://www.johndcook.com/blog/2008/04/05/why-unicode-is-subtle/comment-page-1/#comment-114373</link>
		<dc:creator>Marius Gedminas</dc:creator>
		<pubDate>Sat, 12 Nov 2011 14:37:01 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/2008/04/05/why-unicode-is-subtle/#comment-114373</guid>
		<description>UTF-16 simpler to process?  How so?  It combines the disadvantages of UTF-8 (variable-length character encoding) with the disadvantages of UTF-32 (endianness issues).</description>
		<content:encoded><![CDATA[<p>UTF-16 simpler to process?  How so?  It combines the disadvantages of UTF-8 (variable-length character encoding) with the disadvantages of UTF-32 (endianness issues).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Draw a symbol, look it up &#8212; The Endeavour</title>
		<link>http://www.johndcook.com/blog/2008/04/05/why-unicode-is-subtle/comment-page-1/#comment-114369</link>
		<dc:creator>Draw a symbol, look it up &#8212; The Endeavour</dc:creator>
		<pubDate>Sat, 12 Nov 2011 14:18:42 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/2008/04/05/why-unicode-is-subtle/#comment-114369</guid>
		<description>[...] Why Unicode is subtle The disappointing state of Unicode fonts Entering Unicode characters in Windows and Linux Inserting graphics in Twitter messages [...]</description>
		<content:encoded><![CDATA[<p>[...] Why Unicode is subtle The disappointing state of Unicode fonts Entering Unicode characters in Windows and Linux Inserting graphics in Twitter messages [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Johannes</title>
		<link>http://www.johndcook.com/blog/2008/04/05/why-unicode-is-subtle/comment-page-1/#comment-40871</link>
		<dc:creator>Johannes</dc:creator>
		<pubDate>Mon, 28 Jun 2010 23:23:20 +0000</pubDate>
		<guid isPermaLink="false">http://www.johndcook.com/blog/2008/04/05/why-unicode-is-subtle/#comment-40871</guid>
		<description>Actually, Unicode is a 21-bit character set.

Then, the question about σ and ς is probably a historical one. Unicode is committed to a 1:1 mapping of legacy character sets into Unicode, so every character that was once included in a character set has a corresponding code point in the UCS. This isn&#039;t nice from an idealist point of view, but Unicode is essentially a tool that tries to work in the current (non-ideal) world.

Another thing to note about Unicode is that it&#039;s very complex as a whole. The character set as such is only a tiny part. The standard also includes collation rules for various languages, algorithms for bidirectional text display and handling, and many more things.

Another point that makes dealing with it so complex is that the application working with the raw data is one thing, while proper rendering and fonts are a different beast. Latin is one of the easiest scripts to support, and yet even there things can go horribly wrong. A layout engine (such as Uniscribe in Windows) has to form proper ligatures of characters in Indic or Arabic scripts, and contextual glyphs are needed in the latter as well. Diacritical marks and other combining characters must be positioned properly, and mixing writing directions due to script changes is also a fairly complex matter. Rules exist for embedding sinographs and Latin into Mongolic script (each of which uses a different writing direction). Things like those make the standard quite complex to begin with, a necessary but unfortunate consequence of supporting every written language that exists (or existed).</description>
		<content:encoded><![CDATA[<p>Actually, Unicode is a 21-bit character set.</p>
<p>Then, the question about σ and ς is probably a historical one. Unicode is committed to a 1:1 mapping of legacy character sets into Unicode, so every character that was once included in a character set has a corresponding code point in the UCS. This isn&#8217;t nice from an idealist point of view, but Unicode is essentially a tool that tries to work in the current (non-ideal) world.</p>
<p>Another thing to note about Unicode is that it&#8217;s very complex as a whole. The character set as such is only a tiny part. The standard also includes collation rules for various languages, algorithms for bidirectional text display and handling, and many more things.</p>
<p>Another point that makes dealing with it so complex is that the application working with the raw data is one thing, while proper rendering and fonts are a different beast. Latin is one of the easiest scripts to support, and yet even there things can go horribly wrong. A layout engine (such as Uniscribe in Windows) has to form proper ligatures of characters in Indic or Arabic scripts, and contextual glyphs are needed in the latter as well. Diacritical marks and other combining characters must be positioned properly, and mixing writing directions due to script changes is also a fairly complex matter. Rules exist for embedding sinographs and Latin into Mongolic script (each of which uses a different writing direction). Things like those make the standard quite complex to begin with, a necessary but unfortunate consequence of supporting every written language that exists (or existed).</p>
]]></content:encoded>
	</item>
</channel>
</rss>


