Why assign two Unicode characters to the same symbol?

Unicode often counts the same symbol (glyph) as two or more different characters. For example, Ω is U+03A9 when it represents the Greek letter omega and U+2126 when it represents ohms, the unit of electrical resistance. Similarly, M is U+004D when it’s used as a Latin letter but U+216F when it’s used as the Roman numeral for 1,000.
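A quick way to see the distinction is to ask a Unicode database for each character's code point and official name. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `describe` is just an illustration, not something from the original post.

```python
import unicodedata

def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Each pair looks the same in most fonts but is a different character.
pairs = [
    ("\u03a9", "\u2126"),  # Greek omega vs. ohm sign
    ("\u004d", "\u216f"),  # Latin M vs. Roman numeral one thousand
]

for a, b in pairs:
    print(describe(a), "|", describe(b), "| same character:", a == b)
```

Although the glyphs are indistinguishable in many fonts, a plain string comparison treats each pair as different characters.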

The purpose of such distinctions is to capture semantic differences. One example of how this could be useful is increased accessibility. A text-to-speech reader should pronounce things the same way people do. When such software sees “a 25 Ω resistor” it should say “a twenty-five ohm resistor” and not “a twenty-five uppercase omega resistor,” just as a person would. [1]
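To make the idea concrete, here is a toy sketch of how a reader could choose a pronunciation from the code point alone. The lookup table `SPOKEN` and the function `speak` are hypothetical, invented for this illustration; real text-to-speech software is far more involved and may not work this way at all.

```python
# Hypothetical pronunciation table keyed by code point, not by glyph.
SPOKEN = {
    "\u2126": "ohm",              # OHM SIGN
    "\u03a9": "uppercase omega",  # GREEK CAPITAL LETTER OMEGA
}

def speak(text):
    """Replace whitespace-separated tokens that have a spoken form."""
    return " ".join(SPOKEN.get(token, token) for token in text.split())

print(speak("a 25 \u2126 resistor"))  # uses the ohm sign
print(speak("a 25 \u03a9 resistor"))  # uses the Greek letter
```

The two inputs look identical on screen, yet the software can only pick the right word because the characters differ.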

Making text more accessible to the blind helps everyone else too. For example, it also makes the text more accessible to search engines. As Elliotte Rusty Harold points out in Refactoring HTML:

Wheelchair ramps are far more commonly used by parents with strollers, students with bicycles, and delivery people with hand trucks than they are by people in wheelchairs. When properly done, increasing accessibility for the disabled increases accessibility for everyone.

However, there are practical limits to how many semantic distinctions Unicode can make without becoming impossibly large, and so the standard is full of compromises. It can be quite difficult to decide when two uses of the same glyph should correspond to separate characters, and no standard could satisfy everyone.

* * *

[1] Someone may discover that when I wrote “a 25 Ω resistor” above, I actually used an Omega (Ω, U+03A9) rather than an Ohm character (Ω, U+2126). That’s because font support for Unicode is disappointing. If I had used the technically correct Ohm character, some people would not be able to see it. Ironically, this would make the text less accessible.

On my Android phone, I can see Ω (Ohm) but I cannot see Ⅿ (Roman numeral M) because the installed fonts have a glyph for the former but not the latter.

* * *

This post first appeared on Symbolism, a blog that I’ve now shut down.

4 thoughts on “Why assign two characters to the same symbol?”

Luke Hutchison

14 December 2014 at 05:51

It’s disappointing that internationalization libraries and font rendering libraries don’t do a better job of using fallback glyph or fallback font support in cases like Ⅿ vs. M and Ω vs. Ω, so you can at least see something helpful rather than an empty box when a glyph is not available, even if it’s in the wrong font.

dghf

15 December 2014 at 16:27

One semantic distinction that would be useful but that Unicode doesn’t make is between the apostrophe and the single (closing) quotation mark: both are encoded as either U+0027 (ugly typewriter-style) or U+2019 (typographic).

Johannes

2 January 2015 at 05:43

What you fail to capture here is the fact that U+2126 should not be used. It exists as a compatibility character for converting from legacy character sets. As such it must exist in Unicode, but U+03A9 is favored for new content. This may also explain why font support is so horrible.

The reasoning that scientific units get their own symbol in Unicode, even when they are represented by Latin or Greek letters, doesn’t hold up, because there are not that many more characters like U+2126. There is no separate encoding of m for meter, for example.

alex

17 February 2023 at 17:10

To follow up on Johannes’ remarks, the decomposition of U+2126 is actually the (single) codepoint U+03A9. There are legacy kilo- and megaohm symbols in Unicode, as well, and they similarly decompose to sequences containing U+03A9.

Assuming your software does canonicalization (which it probably needs to), you won’t even see a U+2126 codepoint, even if a user intentionally typed it that way.

Making text more semantically rich could make it more accessible to both people and software, which would be great, but that’s not what the Unicode Consortium decided to do here.
