Why “a caret, euro, trademark” â€™ in a file?

A with caret, euro, trademark

Why might you see â€™ in the middle of an otherwise intelligible file? The reason is very similar to the reason you might see �, which I explained in the previous post. You might want to read that post first if you’re not familiar with Unicode and character encodings.

It all has to do with an encoding error, probably. Not necessarily, since, for example, I deliberately put â€™ in the opening sentence. But assuming it is an error, it’s likely an encoding error.

But it’s the opposite of the � error. The � occurs when non- UTF-8 text has been declared (or implicitly interpreted as) Unicode. In particular, you can run into this error if text encoded in ISO 8859-1 is interpreted as as UTF-8.

The â€™ sequence is usually the opposite: UTF-8 encoded text is being interpreted as Windows-1252 (a.k.a. CP-1252) encoded text. In particular, a single quote (U+2019) encoded in UTF-8 has been interpreted as the Windows-1252 text â€™.

Windows-1252 is a superset of IDO 8859-1, the error resulting in � could also be described as a Windows-1252 error. So a � means Windows-1252 text has been interpreted as UTF-8, and â€™ means UTF-8 has been interpreted as Windows-1252. In the former case there is an invalid character. In the latter case all the characters are valid, though they’re not the characters you were supposed to see.

You can fix the error by making your content and your encoding match. Or remove the offending character, replacing the single quote with ’.

You can find more details in this Stack Overflow post.

One thought on “Why “a caret, euro, trademark” â€™ in a file?”

Marnix Klooster

12 January 2024 at 00:57

Fun fact: Seeing this in my Android RSS reader app, literally my first thought was how again somewhere some UTF-8 vs ISO 8859-1 encoding got twisted… Thanks for posts like this one!

Comments are closed.