Accessing Unicode characters by name

[Image: snowman, U+2603]

You can sometimes make code more readable by using names for characters rather than the characters themselves or their code points. Python and Perl, for example, both let you refer to a character by putting its standard Unicode name inside \N{}.

For instance, \N{SNOWMAN} refers to Unicode character U+2603, shown at the top of the post [1]. The literal ☃ can be hard to make out, and not many people would read \u2603 and immediately think “Ah yes, U+2603, the snowman.”
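
As a quick check in Python, the named escape, the numeric escape, and the literal character all denote the same one-character string. This is only a small sketch; the variable names are made up for illustration.

```python
# Three spellings of the same one-character string.
by_name = "\N{SNOWMAN}"
by_code = "\u2603"
literal = "☃"

assert by_name == by_code == literal
print(len(by_name), hex(ord(by_name)))  # 1 0x2603
```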

A few days ago I wrote about how to get one-liners to work on Windows and Linux. Shells and programming languages have different ways of quoting and escaping special characters, and sometimes these ways interfere with each other.

I said that one way to get around problems with literal quotes inside a quoted string is to use character codes for quotes. This may be overkill, but it works. For example,

    perl -e 'print qq{\x27hello\x27\n}'

and

    python -c "print('\x27hello\x27\n')"

both print 'hello', including the single quotes.

One problem with this is that you may not remember that U+0027 is a single quote. And even if you have that code point memorized [2], someone else reading your code might not.

The official Unicode name for a single quote is APOSTROPHE. So the Python one-liner above could be written

    python -c "print('\N{APOSTROPHE}hello\N{APOSTROPHE}\n')"
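
If you want to convince yourself that the hex escape and the named escape are interchangeable, a short script along these lines should do it. The variable names are hypothetical.

```python
import unicodedata

hex_escape = "\x27"
named_escape = "\N{APOSTROPHE}"

# Both escapes produce an ordinary single quote.
assert hex_escape == named_escape == "'"

print(unicodedata.name(hex_escape))   # APOSTROPHE
print(f"U+{ord(named_escape):04X}")   # U+0027
```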

This is kinda artificial in a one-liner because such tiny programs optimize for brevity rather than readability. But in an ordinary program rather than on the command line, using character names could make code easier to read.
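
Here is a sketch of what that might look like in an ordinary Python program. The constants and the two helper functions are invented for illustration. The point is only that the names say what each character is, which matters most for characters that are invisible or easy to confuse, such as a no-break space.

```python
# Named constants document themselves; the raw characters would not.
LEFT_QUOTE = "\N{LEFT DOUBLE QUOTATION MARK}"
RIGHT_QUOTE = "\N{RIGHT DOUBLE QUOTATION MARK}"
NBSP = "\N{NO-BREAK SPACE}"


def quote(text: str) -> str:
    """Wrap text in typographic double quotes."""
    return f"{LEFT_QUOTE}{text}{RIGHT_QUOTE}"


def with_unit(value: float, unit: str) -> str:
    """Join a number and its unit so they will not be split across lines."""
    return f"{value}{NBSP}{unit}"


print(quote("hello"))          # “hello”
print(with_unit(3.5, "kg"))    # 3.5 kg, joined by a no-break space
```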

So how do you find out the name of a Unicode character? The names are standard, independent of any programming language, so you can look them up in any Unicode reference.

A programming language that lets you use Unicode names probably also has a way to let you look up Unicode names. For example, in Python you can use unicodedata.name.

    >>> from unicodedata import name
    >>> name('π')
    'GREEK SMALL LETTER PI'
    >>> name("\u05d0") # א
    'HEBREW LETTER ALEF'
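
The same module also goes in the other direction: unicodedata.lookup takes a name and returns the character. The sketch below shows the round trip, plus the optional default argument to name, which is handy because some characters, control characters for example, have no name and would otherwise raise ValueError. The placeholder string "<unnamed>" is just something made up for the example.

```python
from unicodedata import lookup, name

# Name -> character -> name round trip.
pi = lookup("GREEK SMALL LETTER PI")
assert pi == "π"
assert name(pi) == "GREEK SMALL LETTER PI"

# A newline has no Unicode name, so supply a default instead of
# letting name() raise ValueError.
print(name("\n", "<unnamed>"))  # <unnamed>
```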

In Perl you could write

    use charnames ':full';
    print charnames::viacode(0x22b4); # ⊴

which prints “NORMAL SUBGROUP OF OR EQUAL TO”, illustrating that Unicode names can be quite long.
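
For comparison, a rough Python counterpart to the Perl snippet converts the code point to a character with chr and then asks for its name. The variable names are hypothetical.

```python
from unicodedata import name

symbol = chr(0x22B4)        # ⊴
symbol_name = name(symbol)

print(symbol_name)          # NORMAL SUBGROUP OF OR EQUAL TO
```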

[1] How this renders varies greatly from platform to platform. The original post showed screenshots of the snowman as rendered on Windows with Firefox, on an iPad with Firefox, and on an iPad with Inoreader.

[2] Who memorizes Unicode code points?! Well, I’ve memorized a few landmarks. For example, I memorized where the first letters of the Latin, Greek, and Hebrew alphabets are, so in a pinch I can figure out the rest of the letters.
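
To make the landmark idea concrete, here is a small sketch that prints the code point and name of the first letter of each alphabet and then counts forward from one of them. It assumes the lowercase forms for Latin and Greek; the footnote doesn't say which case is meant. Counting forward isn't always this simple, since some blocks interleave variant forms. Greek, for example, has a final sigma at U+03C2 between rho and sigma.

```python
from unicodedata import name

# Assumed landmarks: lowercase Latin and Greek, and Hebrew alef.
landmarks = {"Latin": "a", "Greek": "α", "Hebrew": "א"}

for script, first in landmarks.items():
    print(f"{script}: U+{ord(first):04X} {name(first)}")
# Latin: U+0061 LATIN SMALL LETTER A
# Greek: U+03B1 GREEK SMALL LETTER ALPHA
# Hebrew: U+05D0 HEBREW LETTER ALEF

# Fourth Greek letter: three positions past alpha.
fourth = chr(ord("α") + 3)
print(fourth, name(fourth))  # δ GREEK SMALL LETTER DELTA
```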

2 thoughts on “Accessing characters by name”

Rikk

19 December 2020 at 07:54

My feed reader has a quirk whereby any post containing a character from a right-to-left script is rendered right-to-left in its entirety. I normally see this behaviour from language blogs, but the aleph in this post triggered it. As a result, the block-quoted code snippets exhibit some behaviour I’ve never seen before.

The console prompts now lead from the right, and point left (i.e. “<<<”). Any trailing comments now “trail” from the right, and in the Perl example, the terminating semicolon is to the left of the preceding code (but to the right of any comments). The code itself still reads left-to-right, and the whole effect is pretty trippy. It looks like you’re just using tags, but it surprises me the browser is putting so much work into such a doomed venture.

My description probably isn’t doing this justice, so I’m going to link a screenshot and hope it doesn’t get eaten by your spam filter.

https://i.imgur.com/2DTbBW3.png

Jonathan

20 December 2020 at 09:12

Rikk, that was horrible and more appropriate for the end of October, not December.

I’m expecting some day to read that someone has proved that the problem of deciding how to render text that mixes RtL and LtR is either undecidable, NP-hard, or Turing complete.
