Almost ASCII

I was working recently with a gigabyte file that had a dozen non-ASCII characters. This is very common. The ASCII character set is not quite big enough for a lot of tasks. Of course it’s completely inadequate if you’re writing Japanese, but it’s almost enough for documents written in English and a few other languages.

Efficient encoding

The world has standardized on Unicode as the way to represent characters across languages. Unicode currently has around 150,000 characters, far more than ASCII’s 128 characters.

But there’s a problem. Since 150,000 > 2^17, it takes more than two bytes (eight bits to a byte) to represent each of 150,000 things. If you use three bytes to represent each character, every file that is almost all ASCII will get three times bigger. If you limit yourself to the most frequently used Unicode characters, those that can be represented with two bytes (the “basic multilingual plane”), then you still double the size of files.

Enter UTF-8, a brilliant solution to this problem. The UTF-8 encoding of an ASCII file is an ASCII file. Pure ASCII files don’t get any larger when interpreted as UTF-8 encoded Unicode. Because 128 = 2^7, a byte representing an ASCII character has one unused bit. UTF-8 uses this unused bit to signal that what follows is not ASCII. I wrote about the full details here.

Unicode characters outside the ASCII range take 2, 3, or 4 bytes to represent. Inserting a small number of non-ASCII characters into a UTF-8 encoded Unicode file hardly changes the file’s size.
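To see this concretely, here is a small Python illustration (not from the original script): encode a few strings and compare the byte counts.

# Pure ASCII text: the UTF-8 encoding is byte-for-byte identical to the ASCII encoding.
ascii_text = "hello"
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")
assert len(ascii_text.encode("utf-8")) == len(ascii_text)

# Characters outside ASCII take 2, 3, or 4 bytes each.
print(len("é".encode("utf-8")))   # 2 bytes
print(len("€".encode("utf-8")))   # 3 bytes, still in the basic multilingual plane
print(len("𝄞".encode("utf-8")))   # 4 bytes, outside the basic multilingual plane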

Troubleshooting

I mentioned at the top that I had a gigabyte file with a dozen non-ASCII characters. The command file -I reported the file encoding to be ASCII, because the vast majority of the file was ASCII. But the non-ASCII characters were not valid Unicode characters either.
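If you want to hunt down the offending bytes yourself, a sketch like the following works; the file name is a placeholder, and the file is read in chunks so a gigabyte file doesn’t have to fit in memory.

# Print the offset and value of every non-ASCII byte in a file.
def find_non_ascii(path, chunk_size=1 << 20):
    offset = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            for i, b in enumerate(chunk):
                if b > 0x7F:
                    print(f"offset {offset + i}: byte 0x{b:02X}")
            offset += len(chunk)

find_non_ascii("inputfile")  # placeholder file name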

These invalid Unicode characters would display as �, which is not actually in the file. The � is a valid Unicode character for representing an invalid Unicode character.
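For example, asking Python to decode the stray byte leniently substitutes U+FFFD rather than failing (a small illustration):

# A lenient decoder replaces the invalid byte with U+FFFD,
# the Unicode replacement character, which displays as �.
print(bytes([0x92]).decode("utf-8", errors="replace"))  # �
print(hex(ord("\ufffd")))                               # 0xfffd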

Some of the non-ASCII characters were extended ASCII (Windows 1252) characters, but if I remember correctly even that didn’t account for everything. Some of the odd characters were simply file corruption.

It’s kinda interesting how some tools are robust to these kinds of glitches and some are not. My first clue that something funny was going on was when sort refused to sort. I ran a Python script that helps me fix wonky text files and it threw an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 222662: invalid start byte

This may seem like gibberish, but it actually says exactly what’s going on. There was an error interpreting the file as Unicode, because 0x92 is not a valid way to start a non-ASCII character in UTF-8.

The first bit of an ASCII character is 0. The first two bits of a non-ASCII character in UTF-8 are 11. But 9 is 1001 in binary, i.e. it starts with 10, and so the byte 0x92 is neither an ASCII character nor the beginning of a UTF-8 non-ASCII sequence of bytes. More details here.
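You can check the bit pattern and reproduce the error in a couple of lines of Python (an illustration, not the script I was running):

# 0x92 = 10010010: the leading bits are 10, the pattern UTF-8 reserves
# for continuation bytes, so it cannot begin a character.
print(format(0x92, "08b"))     # 10010010

try:
    b"\x92".decode("utf-8")    # strict decoding
except UnicodeDecodeError as e:
    print(e)                   # ... invalid start byte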

Removing non-ASCII characters

For my application I could just remove the invalid characters using iconv with the -c option.

iconv -c -f CP1252 -t UTF-8 inputfile > outputfile

If you need to salvage troublesome characters then things are a little more complicated. The iconv utility will work if you know what the intended encoding was. If you don’t know the intended encoding, you may need to do some detective work.
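If you’d rather do the same thing in Python, something like the following works; the file names are placeholders, and Windows 1252 is just the guess for the intended encoding.

# Decode with the guessed encoding, or drop invalid bytes outright,
# then write the result back out as UTF-8.
with open("inputfile", "rb") as f:
    data = f.read()

text = data.decode("cp1252")                    # if you know (or guess) the encoding
# text = data.decode("utf-8", errors="ignore")  # or simply discard invalid bytes

with open("outputfile", "w", encoding="utf-8") as f:
    f.write(text)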

One thought on “Almost ASCII”

  1. I recently contended with a file that started out as plain text but as it went along it became progressively more and more gibberish. Some analysis revealed that I was dealing with LZUTF8, a variant of LZ77 / LZSS which is designed for encoding UTF-8 files using the file itself as the dictionary.

    The interesting characteristic of LZUTF8 is that a valid UTF8 text file is also a valid LZUTF8 file (though the reverse is not true). It seems mostly to be used in web applications to compress large JSON documents because there is a nice JavaScript implementation.

    Because the algorithm works off of back-references, the file starts out clean. But then when a string longer than two bytes is encountered that has appeared before, it is replaced by a back reference. The sliding window size is enforced by the binary format itself, as back references can be at most 15 bits long.

    In short: if you encounter something which seems to be a text file, quickly and progressively becomes binary data, and the UNIX file command identifies it just as “data”, consider that it may be LZUTF8.
