I recently received two large data files from a client, with names like
foo.csv and foo.xlsx. Presumably the two are redundant; the latter is probably an export of the former. I did a spot check, and that seemed to be the case.
Then I had a bright idea: use
pandas to make sure the two files are the same. It’s an elegant solution: import both files as data frames, then use the
compare() method to verify that they’re the same.
Except it didn’t work. I got a series of mysterious and/or misleading error messages as I tried to track down the source of the problem, playing whack-a-mole with the data. There could be any number of reasons why
compare() might not work on imported data: character encodings, inferred data types, etc.
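For the record, the happy path looks something like this. A minimal sketch, using small in-memory stand-ins for the client files (the file names in the comments are hypothetical); reading everything as strings sidesteps at least the inferred-dtype problem:

```python
import io
import pandas as pd

# Stand-ins for the two client files. In practice this would be something
# like pd.read_csv("foo.csv", dtype=str) and pd.read_excel("foo.xlsx", dtype=str).
file_a = io.StringIO("id,height\n1,69\n2,73\n")
file_b = io.StringIO("id,height\n1,69\n2,74\n")

# dtype=str disables type inference, one common source of
# spurious differences between a CSV import and an Excel import.
df1 = pd.read_csv(file_a, dtype=str)
df2 = pd.read_csv(file_b, dtype=str)

# compare() returns only the cells that differ; an empty frame means a match.
diff = df1.compare(df2)
print(diff)
```

Note that compare() requires the two data frames to have identical row and column labels, which is itself one of the ways the comparison can blow up before you learn anything about the data.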
So I used brute force. I exported the Excel file as CSV and compared the text files. This is low-tech, but transparent. It’s easier to compare text files than to plumb the depths of pandas.
One of the problems was that the data contained heights, such as
5'9". This causes problems with quoting, whether you enclose strings in single or double quotes. A couple quick
sed one-liners resolved most of the mismatches. (Though not all. Of course it couldn’t be that simple …)
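To illustrate the kind of trouble a value like 5'9" causes (the data here is made up, not the client’s), Python’s csv module shows one safe round trip: quote every field and double any embedded quote character.

```python
import csv
import io

# A height like 5'9" contains both an apostrophe and a double quote,
# so it defeats naive quoting whichever quote character you pick.
row = ["Alice", "5'9\""]

# QUOTE_ALL doubles the embedded double quote on output: "5'9"""
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_ALL).writerow(row)
line = buf.getvalue().strip()
print(line)

# A conforming reader undoes the doubling and recovers the original field.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed == row)
```

sed can apply the same doubling in a one-liner; which fix is right depends on how the original export quoted (or failed to quote) its fields.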
It’s easier to work with data in a high-level environment like
pandas. But it’s also handy to be able to use low-level tools like
sed for troubleshooting.
I suppose someone could write a book on how to import CSV files. If all goes well, it’s one line of code. Then there are a handful of patterns that handle the majority of remaining cases. Then there’s the long tail of unique pathologies. As Tolstoy would say, happy data sets are all alike, but unhappy data sets are each unhappy in their own way.