Upper case, lower case, Unicode, and the command line

Converting text to all upper case or all lower case is a fairly common task.

One way to convert text to upper case would be to use the tr utility to replace the letters a through z with the letters A through Z. For example,

    $ echo Now is the time | tr '[a-z]' '[A-Z]'
    NOW IS THE TIME

You could convert to lower case by reversing the arguments to tr.
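For comparison, here is a minimal Python sketch of the same idea: a translation table that covers only the 26 unadorned letters. The function names `ascii_upper` and `ascii_lower` are made up for illustration, and this mimics what the tr command above does rather than how tr is implemented.

```python
import string

# Translation tables covering only a-z and A-Z, like the tr arguments above.
TO_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)
TO_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_upper(text):
    """Upper-case a-z and leave every other character alone."""
    return text.translate(TO_UPPER)


def ascii_lower(text):
    """Lower-case A-Z and leave every other character alone."""
    return text.translate(TO_LOWER)


print(ascii_upper("Now is the time"))  # NOW IS THE TIME
print(ascii_lower("Now is the time"))  # now is the time
```

Swapping the two arguments to `str.maketrans` plays the same role as reversing the arguments to tr.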

The approach above works if your text consists of only unadorned Roman letters. But it wouldn’t work, for example, if you gave it a jalapeño or π:

    $ echo jalapeño π | tr '[a-z]' '[A-Z]'
    JALAPEñO π

Using the character classes [:lower:] and [:upper:] won’t help either, at least not with GNU tr, which works on bytes rather than on multibyte characters.
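The output above is what you would expect from a tool that translates one byte at a time. In UTF-8, ñ and π each take two bytes, and none of those bytes falls in the range a through z. The Python sketch below imitates that behavior. It assumes a byte-oriented tr and UTF-8 input, and `bytewise_upper` is a hypothetical name used only here.

```python
# Byte-level table: only the bytes for a-z are remapped.
TABLE = bytes.maketrans(
    b"abcdefghijklmnopqrstuvwxyz",
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ",
)


def bytewise_upper(text):
    """Encode to UTF-8, translate byte by byte, and decode again."""
    raw = text.encode("utf-8")
    return raw.translate(TABLE).decode("utf-8")


# Each non-ASCII letter becomes two bytes, both outside the a-z range.
print("\u00f1".encode("utf-8"))  # ñ is b'\xc3\xb1'
print("\u03c0".encode("utf-8"))  # π is b'\xcf\x80'

print(bytewise_upper("jalape\u00f1o \u03c0"))  # JALAPEñO π
```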

Tussling with Unicode

One alternative would be to use the uc command from the Unicode::Tussle package [1] I mentioned a few days ago. There’s also an lc counterpart, and a tc for title case. These utilities handle far more than Roman letters.

    $ echo jalapeño π | uc
    JALAPEÑO Π
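If you don't have Unicode::Tussle installed, Python 3 strings offer a rough analogue. The sketch below uses the built-in `str.upper`, `str.lower`, and `str.title` methods. These are a separate implementation of Unicode case mapping, not the uc, lc, and tc scripts, so the two may disagree on edge cases.

```python
text = "jalapeño π"

# str.upper works on Unicode code points, not bytes.
print(text.upper())          # JALAPEÑO Π

# For this particular string the round trip is lossless.
print(text.upper().lower())  # jalapeño π

# str.title capitalizes the first letter of each word.
print(text.title())          # Jalapeño Π
```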

Unicode capitalization rules are a black hole, but we’ll just look at one example and turn around quickly before we cross the event horizon.

Suppose you want to send all the letters in the Greek word σόφος to upper case.

    $ echo σόφος | uc
    ΣΌΦΟΣ

Greek has two lower case forms of sigma: ς at the end of a word and σ everywhere else. But there’s only one upper case sigma, so both get mapped to Σ. This means that if we convert the text to upper case and then to lower case, we won’t end up exactly where we started.

    $ echo σόφος | uc | lc
    σόφοσ

Note that the lc program chose σ as the lower case of Σ and didn’t take into account that it was at the end of a word.
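The difference between context-free and context-aware lower-casing can be sketched in Python. This assumes Python 3.3 or later, where `str.lower` applies the final sigma rule when given a whole string. The helper `per_char_lower` is a made-up name: it lower-cases one character at a time, so no character can see its neighbors, which reproduces the kind of output shown above. It is not a claim about how lc is implemented.

```python
word = "σόφος"

upper = word.upper()
print(upper)  # ΣΌΦΟΣ


def per_char_lower(text):
    """Lower-case each character in isolation, ignoring word position."""
    return "".join(ch.lower() for ch in text)


# Without context, the last Σ becomes σ, as in the lc output above.
print(per_char_lower(upper))  # σόφοσ

# With the whole string available, Python picks ς at the end of the word.
print(upper.lower())          # σόφος
```

So whether the round trip returns to the starting point depends on whether the lower-casing step looks at context, not just on the character tables.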

[1] “Tussle” is an acronym for Tom [Christiansen]’s Unicode Scripts So Life is Easier.

3 thoughts on “Upper case, lower case, title case”

Diethard Michaelis

23 July 2021 at 01:58

Even within the Latin script we have rather unexpected stuff like the “Turkish i problem”, where lower case i doesn’t map to upper case I but to İ, an I with a dot on top.
https://en.wikipedia.org/wiki/Dotted_and_dotless_I
https://news.ycombinator.com/item?id=8892157

Apostolos Tsompanopoulos

23 July 2021 at 05:40

Just a minor correction, since I’m Greek :-)
%s/σόφος/σοφός/g

[FWIW, “σοφός” means “wise man”]

Frank Wilhoit

23 July 2021 at 16:26

This is a special case of machine translation.

Even if the stated requirement is to translate characters, the unstated requirement is to translate graphemes, which may also be morphemes.

In German, letter case is (potentially) semantic, as nouns are capitalized.

If lc ought to be able to spot word breaks in Greek, then ought it also to be able to spot beginnings of sentences, which are capitalized in many/most languages that use flavors or extensions of the Latin alphabet?

And therefore, ought any of these tools, if given full syntactical context, refuse to perform characterwise translations that would violate morphological rules? Or perhaps only if given a -strict flag?

And how does all of this intersect with typography, as the output of such tools is probably often used to drive typographical tools? What about display fonts that are all-lowercase or all-uppercase? Do any of the standards for font metadata allow a font to declare whether it forwards or suppresses the codepoints for the unimplemented case?

Pandora called. She wants her box back.
