The disappointing state of Unicode fonts

Modern operating systems understand Unicode internally, but font support for Unicode is spotty. For an example of the problems this can cause, take a look at these screen shots of how the same Twitter message appears differently depending on what program is used to read it.

No font can display all Unicode characters. According to Wikipedia

… it would be impossible to create such a font in any common font format, as Unicode includes over 100,000 characters, while no widely-used font format supports more than 65,535 glyphs.

However, the biggest problem isn’t the number of characters a font can display. Most Unicode characters are quite rare. About 30,000 characters are enough to display the vast majority of characters in use in all the world’s languages as well as a generous selection of symbols. Yet Unicode fonts vary greatly in their support even for the more commonly used ranges of characters. See this comparison chart. The only range completely covered by all Unicode fonts in the chart is the 128 characters of Latin Extended-A.

Unifont supports all printable characters in the Basic Multilingual Plane, characters U+0000 through U+FFFF. This includes the 30,000 characters mentioned above plus many more. Unifont isn’t pretty, but it’s complete. As far as I know, it’s the only font that covers all the characters up to U+FFFF.
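The arithmetic behind these numbers can be checked with a short Python sketch. It uses only the standard `unicodedata` module; the helper name `is_assigned` and the choice of which categories to skip are my own assumptions, and the exact counts depend on the Unicode version bundled with your Python, so no totals are quoted here.

```python
import unicodedata

GLYPH_LIMIT = 65_535  # glyph cap quoted above for common font formats

def is_assigned(cp):
    """True if the code point is assigned and is not a control,
    surrogate, or private-use code point (an assumed definition)."""
    return unicodedata.category(chr(cp)) not in {"Cn", "Cc", "Cs", "Co"}

# Code points in the Basic Multilingual Plane, U+0000 through U+FFFF.
bmp = sum(is_assigned(cp) for cp in range(0x0000, 0x10000))

# Code points across all of Unicode, U+0000 through U+10FFFF.
everything = sum(is_assigned(cp) for cp in range(0x0000, 0x110000))

# Latin Extended-A occupies U+0100 through U+017F.
latin_extended_a = len(range(0x0100, 0x0180))

print(f"BMP: {bmp}, fits in one font: {bmp <= GLYPH_LIMIT}")
print(f"All planes: {everything}, fits in one font: {everything <= GLYPH_LIMIT}")
print(f"Latin Extended-A: {latin_extended_a} code points")
```

The BMP count stays under the glyph limit while the full count does not, which is why a font like Unifont can be complete for the BMP but no single font can cover everything.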

Related posts:

Why Unicode is subtle

Entering Unicode characters in Windows, Linux

17 thoughts on “The disappointing state of Unicode fonts”

  1. Unifont is useless when it comes to Arabic or Indic scripts. In my view it would be better to leave them out than to generate totally incorrect display.

    It’s not clear to me why anyone cares whether a single font covers everything. The OS couldn’t care less, and the user is much better off with individual fonts optimally designed for each script.

  2. Your OS should support it by font substitution. So all you need is really a set of serif’ed fonts that together cover the set you need, and a similar set for sans and possibly monospace (though many symbols and characters won’t need separate fonts for sans and/or monospace of course). Will they look identical? No. But then, they often don’t need to; a normal serif kanji font will more or less match any normal book-type roman font about equally well for instance.

  3. Hi John, I agree with you, but the ‘Arial Unicode MS’ font supports the glyphs of most living languages and you can trust it. I have never seen a font like it, and it has been available since Windows Vista.

    Have you tried Arial Unicode MS?

  4. @Janne I don’t know how font substitution algorithms work, but the screen shots I linked to in the post show missing characters on both Windows and Linux. I don’t know how Mac OS would do.

    @Nasser Arial Unicode MS is a high-quality TrueType font with a large number of characters. It may be the best Unicode font for a combination of quality and range. But there have been times when it was missing a character I wanted to display.

  5. @John, I have published open source software at http://unicode.codeplex.com that can (or should) display any character.

    Test it in the Unicode Information section and tell me whether it meets your needs.

    Also, if it is not able to display the character, please let me know what you are trying to display.

    In this software I have used the Unicode Character Database provided by the Unicode Consortium, and it contains all of the Unicode 5.2 characters (Unicode Annex #44).

  6. Anonymouse Cow-art

    urxvt uses different fonts for different characters picking them from a list of configured fonts to use, based on whether the needed glyph is available in the first, second, etc. font of the list.

  7. John, sorry – my point was simply that the OS will/should substitute with whatever fonts are available, so you don’t need any one single font to cover all characters. Instead you make sure you have fonts that collectively cover the entire range you use (perhaps installing extra fonts as needed – many of us don’t normally need Tamil or Sanskrit for instance).

    This does work pretty painlessly under Ubuntu – those missing characters in the screenshot are probably only one package install away – and I would be very surprised if Windows did not provide a similar mechanism. Perhaps not all Windows applications do fonts the “right” way, and instead hardcode what font they use.

  8. Janne, that makes sense. Often I’d rather see a glyph from another font than to see nothing, though sometimes I’d rather see some indication that my font is missing a character.

    Windows shows a font-missing symbol, typically a square, if a font is missing a character, even if that character is available in another font installed on Windows. You’re saying that Ubuntu will borrow from another font if it can and only display a missing character indicator if no installed font can represent the character. That makes sense. I think I’ve heard that Macintosh works that way too. Is that right?

    I like the way Ubuntu shows you the code point for missing characters, but I could see how Microsoft wouldn’t do that since they appeal to a mass market and most users wouldn’t know what to do with a Unicode code point.

  9. John: That is correct, Macs normally substitute whatever font is needed to display a given character, so it makes sense to have fonts tailored for each script rather than one font for all of them. But is it really true that Windows does not do this? If you have your font set on Arial and try to type Chinese or Japanese you just get boxes and have to manually reset the font to one for those scripts?

  10. thg: I’ve done some experiments and I’m confused. I think Windows sometimes does font substitutions and sometimes does not. I was experimenting with Word 2007 and I don’t know what behavior to attribute to Windows and what to attribute to Word.

    When I paste a block of Unicode text into Word and ask it to set the font to Arial Unicode MS, some characters change to that font and some do not. When I set the text to Unifont and then back to Arial Unicode MS, some characters stay in Unifont.

  11. John: Some of the odd behavior of Arial Unicode might result from the fact that it is missing any characters added to Unicode since version 2.1 of 1998.

  12. Well, Arial Unicode MS is, like all pan-Unicode fonts, a *fallback* font. It isn’t intended to do pretty typesetting. It’s intended to show glyphs that aren’t available in any other font as a last-resort measure.

    There are also a few very good reasons why there is no font that supports everything. The first one you noted yourself: 65,535 glyphs are enough for the BMP but for nothing beyond it.

    Then there is the whole problem of style. Latin has a categorization into sans-serif and serif typefaces, which are usually available in Roman and Italic/Oblique. This may still work for Greek and Cyrillic, but those styles are unknown in Arabic, Hebrew, Chinese, Japanese, Indic scripts and many more. Since a font *has* to categorize itself into those broad categories, this doesn’t quite fit. A set of fonts, covering specific scripts with their individual styles, is much easier to work with and leads to better typographical results if done right.

    A third problem is that many more complex scripts require much more layout work than Latin. Simply providing the glyphs is only a very small part of the task. Both the layout engine and the font have to do some work to get some scripts to render correctly. Many pan-Unicode fonts neglect this; they’re only meant as a last-resort font or for displaying individual glyphs, not for typesetting actual text. Without the proper tables, combining marks and additional glyphs for ligatures in the font, you’ll get a mess with no proper layout, which can be essentially unreadable in certain scripts.

    Windows 7 comes with approximately 98% Unicode coverage in its preinstalled fonts, which is quite good, I think. Essentially this allows the layout engine to choose, for each glyph, a font that is designed for that script or block. Sure, many things should be (and are) common to nearly all fonts. This includes Basic Latin, punctuation (including all kinds of spaces), maybe a few symbols commonly needed for typesetting text (arrows, &c.).
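The per-character fallback described in comment 6 for urxvt, and in the later comments for Ubuntu and the Mac, can be sketched in a few lines of Python. This is only an illustration of the idea, not how any of those systems is actually implemented: the font names and coverage ranges below are made up, and real engines read coverage from each font's character map and also handle shaping, which this ignores. When no font in the list covers a character, the sketch returns the code point label, roughly as Ubuntu is described as doing.

```python
def pick_font(ch, fonts):
    """Return the name of the first font in the ordered list that
    covers ch, or None if no font in the list does."""
    for name, coverage in fonts:
        if ord(ch) in coverage:
            return name
    return None

def render_plan(text, fonts):
    """Pair each character with the font chosen for it, or with a
    'U+XXXX' label when every font in the list is missing it."""
    plan = []
    for ch in text:
        name = pick_font(ch, fonts)
        plan.append((ch, name if name is not None else f"U+{ord(ch):04X}"))
    return plan

# Hypothetical fonts and coverage ranges, in order of preference.
fonts = [
    ("LatinFont", range(0x0020, 0x0180)),
    ("CJKFont", range(0x4E00, 0xA000)),
]

# A Latin letter, a CJK ideograph, and a Devanagari letter that
# neither hypothetical font covers.
plan = render_plan("a\u4e2d\u0915", fonts)
for ch, choice in plan:
    print(repr(ch), "->", choice)
```

The order of the list matters: the first font that has the glyph wins, so a preferred text font goes first and broad fallback fonts go last.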
