Graphemes

Here’s something amusing I ran across in the glossary of Programming Perl:

grapheme A graphene is an allotrope of carbon arranged in a hexagonal crystal lattice one atom thick. Grapheme, or more fully, a grapheme cluster string is a single user-visible character, which in turn may be several characters (codepoints) long. For example … a “ȫ” is a single grapheme but one, two, or even three characters, depending on normalization.

In case the character ȫ doesn’t display correctly for you, here it is:

Unicode character U+022B

First, graphene has little to do with grapheme, but it’s geeky fun to include it anyway. (Both are related to writing. A grapheme has to do with how characters are written, and the word graphene comes from graphite, the “lead” in pencils. The origin of grapheme has nothing to do with graphene but was an analogy to phoneme.)

Second, the example shows how complicated the details of Unicode can get. The Perl code below spells out the three ways of representing ȫ that the glossary entry alludes to.

This demonstrates that the character . in regular expressions matches any single character, but \X matches any single grapheme. (Well, almost. The character . usually matches any character except a newline, though this can be modified via optional switches. But \X matches any grapheme including newline characters.)
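The parenthetical caveat about newlines can be checked on its own, separately from the main example. The following is a minimal sketch, not part of the original post; the names `$nl`, `$crlf`, and `%matches` are made up for illustration. It uses `\A` and `\z` rather than `^` and `$` so that a trailing newline cannot be silently skipped by the end anchor.

```perl
use strict;
use warnings;
use feature 'say';

my $nl   = "\n";     # a single newline character
my $crlf = "\r\n";   # carriage return + line feed: two characters

# Record whether each pattern matches the whole string.
my %matches = (
    dot       => ($nl   =~ /\A.\z/  ? 'yes' : 'no'),  # . without /s
    dot_s     => ($nl   =~ /\A.\z/s ? 'yes' : 'no'),  # . with the /s switch
    X_newline => ($nl   =~ /\A\X\z/ ? 'yes' : 'no'),  # \X, no switches
    X_crlf    => ($crlf =~ /\A\X\z/ ? 'yes' : 'no'),  # \X on CR LF
);

say "$_: $matches{$_}" for sort keys %matches;
say 'length of CR LF: ', length $crlf;
```

Only the plain `.` should report "no". The CR LF pair has length 2 but is matched by a single `\X`, since it counts as one grapheme.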

   
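# say requires Perl 5.10 or later; binmode avoids "Wide character" warnings.
use feature 'say';
binmode STDOUT, ':encoding(UTF-8)';

# U+022B, o with diaeresis and macron
my $a = "\x{22B}";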

# U+00F6 U+0304, (o with diaeresis) + macron 
my $b = "\x{F6}\x{304}";    
     
# o U+0308 U+0304, o + diaeresis + macron   
my $c = "o\x{308}\x{304}"; 

my @versions = ($a, $b, $c);

# All versions display the same.
say @versions;

# The versions have length 1, 2, and 3.
# Only $a contains one character and so matches .
say map { length $_ } grep { /^.$/ } @versions;

# All versions consist of one grapheme.
say map { length $_ } grep { /^\X$/ } @versions;
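The glossary entry says the character count depends on normalization. Here is a rough sketch of what that means, using the core `Unicode::Normalize` module. It is not part of the original post, and the helper `codepoints` is a hypothetical name introduced only for display. NFC composes as far as possible and NFD decomposes fully, so each of the three spellings should come out the same under either form.

```perl
use strict;
use warnings;
use feature 'say';
use Unicode::Normalize qw(NFC NFD);

# The same three spellings of the character as in the example above.
my @versions = (
    "\x{22B}",           # precomposed
    "\x{F6}\x{304}",     # (o with diaeresis) + macron
    "o\x{308}\x{304}",   # o + diaeresis + macron
);

# Hypothetical helper: show a string as a list of code points.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $str;
}

# Print each spelling alongside its composed and decomposed forms.
for my $v (@versions) {
    say codepoints($v),
        ' | NFC: ', codepoints(NFC($v)),
        ' | NFD: ', codepoints(NFD($v));
}

# Once all strings are in one normalization form, eq compares them sensibly.
my ($x, $y, $z) = map { NFC($_) } @versions;
say 'equal after NFC: ', ($x eq $y && $y eq $z ? 'yes' : 'no');
```

Normalizing both sides to the same form before comparing is the usual way to make strings like these compare equal. Language-aware sorting is a separate problem that normalization alone does not solve.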

For daily tips on regular expressions, follow @RegexTip on Twitter.

5 thoughts on “Graphemes”

  1. Ok, that’s it, everyone go home, switch off the computers on your way out.

    String comparison in Unicode is rapidly becoming a version of graph isomorphism – is there a canonical way to decompose Unicode characters so that you could get the same representation for all 3 of those, and thus permit even rudimentary comparison and sorting?

  2. .. but the graph in grapheme would be the same graph as in graphite and hence graphene, right?

  3. What’s the official difference between a grapheme and a glyph?

    I know “ffi” (three characters) can be rendered as one “font thing” in OpenType when ligatures are supported. (Which can, to be annoying, also be represented using one codepoint as U+FB03 “ffi”.)

  4. APL folk have been calling them ‘overstruck’ characters for quite a while. There are a lot of them in Unicode page, or is it block, 35 (23h) which I could enumerate, given an hour.
