A valid character to represent an invalid character


You may have seen a web page with the symbol � scattered throughout the text, especially in older web pages. What is this symbol and why does it appear unexpected?

The symbol we’re discussing is a bit of a paradox. It’s the (valid) Unicode character to represent an invalid Unicode character. If you just read the first sentence of this post, without inspecting the code point values, you can’t tell whether the symbol appears because I’ve made a mistake, or whether I’ve deliberately included the symbol.

The symbol in question is U+FFFD, named REPLACEMENT CHARACTER, a perfectly valid Unicode character. But unlike this post, you’re most likely to see it when the author did not intend for you to see it. What’s going on?

It all has to do with character encoding.

If all you want to do is represent Roman letters and common punctuation marks, and control characters, ASCII is fine. There are 128 ASCII characters, so they fit into a single 8-bit byte. But as soon as you want to write façade, jalapeño, or Gödel you have a problem. And of course you have a bigger problem if your language doesn’t use the Roman alphabet at all.

ASCII wastes one bit per byte, so naturally people wanted to take advantage of that extra bit to represent additional characters, such as the ç, ñ, and ö above. One popular way of doing this was described in the standard ISO 8859-1.

Of course there are other ways of encoding characters. If your language is Russian or Hebrew or Chinese, you’re no happier with ISO 8859-1 than you are with ASCII.

Enter Unicode. Let’s represent all the word’s alphabets (and ideograms and common symbols and …) in a single system. Great idea, though there are a mind-boggling number of details to work out. Even so, once you’ve assigned a number to every symbol that you care about, there’s still more work to do.

You could represent every character with two bytes. Now you can represent 65,536 characters. That’s too much and too little. If you want to represent text that is essentially made of Roman letters plus occasional exotic characters, using two bytes per letter makes the text 100% larger.

And while 65,536 sounds like a lot of characters, it’s not enough to represent every Chinese character, much less all the characters in other languages. Turns out we need four bytes to do what Unicode was intended to do.

So now we have to deal with encodings. UTF-8 is a brilliant solution to this problem. It can handle all Unicode characters, but if text is just ASCII, it won’t be any longer: ASCII is a valid UTF-8 encoding of the subset of Unicode that belonged to ASCII.

But there were other systems before Unicode, like ISO 8859-1, and if your file is encoded as ISO 8859-1, but a web browser thinks its UTF-8 encoded Unicode, some characters could be invalid. Browsers will use the character � as a replacement for invalid text that it could not otherwise display. That’s probably what’s going on when you see �.

See the article on How UTF-8 works to understand why some characters are not just a different character than intended but illegal UTF-8 byte sequences.