I received some data as an Excel file recently. I cleaned things up a bit, exported the data to a CSV file, and read it into R. Then something strange happened.
Say the CSV file looked like this:
foo,bar 1,2 3,4
I read the file into R with
df <- read.csv("foobar.csv", header=TRUE)
and could access the second column as df$bar
but could not access the first column as df$foo
. What’s going on?
When I ran names(df)
it showed me that the first column was named not foo
but ï..foo
. I opened the CSV file in a hex editor and saw this:
efbb bf66 6f6f 2c62 6172 0d0a 312c 320d
The ASCII code for f is 0x66, o is 0x6f, etc. and so the file makes sense, starting with the fourth byte.
If you saw my post about Unicode the other day, you may have seen Daniel Lemire’s comment:
There are various byte-order masks like EF BB BF for UTF-8 (unused).
Aha! The first three bytes of my data file are exactly the byte-order mask that Daniel mentioned. These bytes are intended to announce that the file should be read as UTF-8, a way of encoding Unicode that is equivalent to ASCII if the characters in the file are in the range of ASCII.
Now we can see where the funny characters in front of “foo” came from. Instead of interpreting EF BB BF as a byte-order mask, R interpreted the first byte 0xEF as U+00EF, “Latin Small Letter I with Diaeresis.” I don’t know how BB and BF became periods (U+002E). But if I dump the file to a Windows command prompt, I see the first line as
foo,bar
with the first three characters being the Unicode characters U+00EF, U+00BB, and U+00BF.
How to fix the encoding problem with R? The read.csv
function has an optional encoding
parameter. I tried setting this parameter to “utf-8” and “utf8”. Neither made any difference. I looked at the R documentation, and it seems I need to set it to “UTF-8”. When I did that, the name of the first column became X.U.FEFF.foo
[1]. I don’t know what’s up with that, except FEFF is the byte order mark (BOM) I mentioned in my Unicode post.
Apparently my troubles started when I exported my Excel file as CSV UTF-8. I converted the UTF-8 file to ASCII using Notepad and everything worked. I also could have saved the file directly to ASCII. If you the list of Excel export options, you’ll first see CSV UTF-8 (that’s why I picked it) but if you go further down you’ll see an option that’s simply CSV, implicitly in ASCII.
Unicode is great when it works. This blog is Unicode encoded as UTF-8, as are most pages on the web. But then you run into weird things like the problem described in this post. Does the fault lie with Excel? With R? With me? I don’t know, but I do know that the problem goes away when I stick to ASCII.
***
[1] A couple people pointed out in the comments that you could use fileEncoding="UTF-8-BOM"
to fix the problem. This works, though I didn’t see it in the documentation the first time. The read.csv
function takes an encoding
parameter that appears to be for this purpose, but is a decoy. You need the fileEncoding
parameter. With enough persistence you’ll eventually find that "UTF-8-BOM"
is a possible value for fileEncoding
.
Notepad solves much. We had the same problem at one startup where I worked. The release engineer calls me at 4 am and expects me to drive in and resave all the ascii files. Just open them in Notepad and save them.
Every file has a header that tells you what application can open the file. At another place, we would open a range of addresses by appending a header and increment through our set of headers until the file made sense. The header might have been overwritten, so we added them back.
Deleting a file doesn’t delete it. The contents are eventually overwritten.
IMHO you are saving up problems John? With a global internet more and more users are settling on utf-8 as the C21 ASCII?
I think you’ve learned the lesson here, ready for next time BOM’s come to bug you?
Have you tried read.csv(…, fileEncoding = “UTF-8-BOM”)?
Usually this problem can be solved by using something like:
read.csv(…, fileEncoding = “UTF-8-BOM”)
This should remove the byte order mark I think.
Or you could have just installed the readxl package for R and read the worksheet directly. When something gets in my way, I go ’round it.
UTF-8 is “hard”. The fact that you can get around the problems, somehow… does not change the difficulty.
What can we do? One option is to avoid UTF-8 when it is not needed. This is actually a great idea. At the software-layer… processing ASCII can be *much* faster. Try to ingest UTF-8 data and then ASCII… typically, if the software is well written, pure ASCII will be processed faster.
Now… what if you need extended (non-English) characters? Then I would argue that UTF-8 is probably the best you can get right now. It is well supported. It is a direct extension of ASCII (if you represent ASCII as UTF-8… it should be byte-for-byte identical). It is relatively compact.
But things will still go wrong… why? It is an instance of the “all abstractions are leaky” rule. Our software abstracts away the character encodings so that we don’t need to think about it… but the abstraction unavoidably fails at some point… and then you are faced with our limited understanding of the infrastructure underneath.
Not knowledgeable in R specifically, but this seems to me to be a deficiency in R’s text reading. By default, when reading a text from a file, if the first 2-3 bytes are the BOM in UTF-16-LE, UTF-16-BE, or UTF-8, it should be stripped and the rest of the bytes read in that encoding. Otherwise, read it as ASCII (which implies that any bytes 0x80 and higher are errors). In the modern context, this is what “plain text” is. If any other encoding is needed, specify it explicitly.
Also I would respectfully disagree that UTF-8 is the best you can get “right now”. It’s a brilliant solution and the only way it could realistically be better would be if UTF-16 had never been contemplated. At this point all other encodings are a legacy inconvenience that one encounters occasionally; when designing new systems and interfaces one should always specify UTF-8 for text interchange. I defy anybody to come up with a counterexample.
Oops, I should have said if no BOM, interpret it as UTF-8. This includes the ASCII case as a subset while not requiring the BOM for UTF-8.
And there is something weird with the commenting. When I submit my comment, the article appears with only the 1st comment. Reload the same URL and it appears properly with all comments, including the one I just made.
This was very interesting (including the “decoy” *and* the solution).
I might also mention that R does some “processing” of the column names that I have seen before. I think I have seen it replace “unprintable” characters with dots. I think I have also seen it replace the special characters that would cause the parser to fail. And finally, I believe I have also seen it “fix-up” duplicate column names with dots and digits or something (to make them unique).
UTF-8 should have no BOM – there are no distinct UTF-8-LE and UTF-8-BE encodings, like there is for UTF-16.
Strangely enough, I see this problem only when using read.csv on windows, not on mac