Munging CSV files with standard Unix tools

This post briefly discusses working with CSV (comma separated value) files using command line tools that are usually available on any Unix-like system. This will raise two objections: why CSV and why dusty old tools?

Why CSV?

In theory, and occasionally in practice, CSV can be a mess. But CSV is the de facto standard format for exchanging data. Some people like this, some lament this, but that’s the way it is.

A minor variation on comma-separated values is tab-separated values [1].

Why standard utilities?

Why use standard Unix utilities? I’ll point out some of their quirks, which are arguments for using something else. But the assumption here is that you don’t want to use something else.

Maybe you already know the standard utilities and don’t think that learning more specialized tools is worth the effort.

Maybe you’re already at the command line and in a command line state of mind, and don’t want to interrupt your work flow by doing something else.

Maybe you’re on a computer where you don’t have the ability to install any software and so you need to work with what’s there.

Whatever your reasons, we’ll go with the assumption that we’re committed to using commands that have been around for decades.

cut, sort, and awk

The tools I want to look at are cut, sort, and awk. I wrote about cut the other day and apparently the post struck a chord with some readers. This post is a follow-up to that one.

These three utilities are standard on Unix-like systems. You can also download them for Windows from GOW. The port of sort will be named gsort in order to not conflict with the native Windows sort function. There’s no need to rename the other two utilities since they don’t have counterparts that ship with Windows.

The sort command is simple and useful. There are just a few options you’ll need to know about. The utility sorts fields as text by default, but the -n tells it to sort numerically.

Since we’re talking about CSV files, you’ll need to know that -t, is the option to tell sort that fields are separated by commas rather than white space. And to specify which field to sort on, you give it the -k option.

The last utility, awk, is more than a utility. It’s a small programming language. But it works so well from the command line that you can almost think of it as a command line utility. It’s very common to pipe output to an awk program that’s only a few characters long.

You can get started quickly with awk by reading Greg Grothous’ article Why you should learn just a little awk.

Inconsistencies

Now for the bad news: these programs are inconsistent in their options. The two most common things you’ll need to do when working with CSV files is to set your field delimiter to a comma and specify what field you want to grab. Unfortunately this is done differently in every utility.

cut uses -d or --delimiter to specify the field delimiter and -f or --fields to specify fields. Makes sense.

sort uses -t or --field-separator to specify the field delimiter and -k or --key to specify the field. When you’re talking about sorting things, it’s common to call the fields keys, and so the way sort specifies fields makes sense in context. I see no reason for -t other than -f was already taken. (In sorting, you talk about folding upper case to lower case, so -f stands for fold.)

awk uses -F or --field-separator to specify the field delimiter. At least the verbose option is consistent with sort. Why -F for the short option instead of -f? The latter was already taken for file. To tell awk to read a program from a file rather than the command line you use the -f option.

awk handles fields differently than cut and sort. Because it is a programming language designed to parse delimited text files, each field has a built-in variable: $1 holds the content of the first field, $2 the second, etc.

The following compact table summarizes how you tell each utility that you’re working with comma-separated files and that you’re interested in the second field.

    |------+-----+-----|
    | cut  | -d, | -f2 |
    | sort | -t, | -k2 |
    | awk  | -F, | $2  |
    |------+-----+-----|

Trade-offs

Some will object that the inconsistencies documented above are a good example of why you shouldn’t work with CSV files using cut, sort, and awk. You could use other command line utilities designed for working with CSV files. Or pull your CSV file into R or Pandas. Or import it somewhere to work with it in SQL. Etc.

The alternatives are all appropriate for different uses. The premise here is that in some circumstances, the inconsistencies cataloged above are a regrettable but acceptable price to pay to stay at the command line.

Related

[1] Things get complicated if you have a CSV file and fields contain commas inside strings. Tab-separated files are more convenient in this case, unless, of course, your strings contain tabs. The utilities mentioned here all support tab as a delimiter by default.

11 thoughts on “Munging CSV files with standard Unix tools

  1. Omar Antolín-Camarena

    The best reason not to use cut and sort for CSV and also not to use awk naively is simply that it doesn’t work at all! None of those tools understand that

    hello,”how are you, John?”

    is only two fields.

  2. Yes, that’s one of the problems that can come up. But it’s been a long time since I’ve run into that personally. My data files are mostly numbers, or sometimes text that doesn’t contain punctuation.

    These tools are good enough for interactive exploration, but not production.

  3. Not sure why people say these tools do not work for CSV, maybe they need some practice. Sure, the options are inconsistent, but if you plan to use these tools well, you need to know regex, which makes the options pale in complexity.

    Sometimes you need to use imagination to rip apart the data and bring out the parts you need.

  4. csvtool never lets you down, very quick as it will call shell-functions for every row. Also, termsql allows you to do SQL on csv by importing it into a temporary sqlite database.

  5. When you need to select a proper tool outside traditional Unix utilities, try xsv. It is fast and solid, and it won’t be confused by commas between quotes!

  6. -> Omar Antolín-Camarena
    sed is a powerfull text formatter
    % echo hello,”how are you, John.” | sed ‘s/([^,]*)|(”[^”]*”)/&,/g;s/,$//’ | awk -F,, ‘{print $1, $2}’
    hello ”how are you, John.”
    %

  7. I appreciate the tips.

    So many times I have had people ask, “Why command line?” when there are graphical tools to do similar things, and the answers tend to boil down to scale and flexibility., and that may only apply because I work in IT. If you do not work in IT, use a spreadsheet to hack through CSV data, and use a spreadsheet to act as a database, and use a spreadsheet to manage your calendar, because it’s easy enough for most, and flexible for limited things, but when you want to do bigger things the time it takes to learn the command line pays for itself on the first project, and every other project is even faster. If you’re not using the CLI, you either don’t work in IT (managing data regularly), or you aren’t very good at the data manipulation tasks given to you.

    I know, that sounds elitist, but I think at worst it’s defeatist, meaning I’ve been defeated by the computer, so now I’m speaking its language rather than trying to get it to speak mine (or at least use tools that are more human-friendly than computer-friendly). Natural (human) languages work well with humans, and computers can even interact with them, but ultimately if you want the job done perfectly by a computer it makes sense to speak the same language as the computer, and that’s the command line; its difficulties are directly linked to its precision; if it were so simple anybody could do it with English, it would have all the imprecision of English (or any other natural language). Don’t know what I mean? Read a contract (any contract) in legalese, or read a social media post, or read a tweet, and see how much wording is either needed to get the meaning across (contract), or how much meaning can be assumed to be other than it is via short comments (social media, twitter, etc.).

    I use csvprintf sometimes to quickly format a CSV into another CSV. It handles quotes nicely, and I usually use it to convert from comma-separated to pipe-separated values, dropping quotes in the process, which leads to my last comment.

    If you are using commas as a delimiter either you are blessed to have comma-less data, or you are doing it wrong; just use a better delimiter, like maybe tabs (though I like them less because sometimes you could have tabs in data entered by humans) or pipes (if you’re not a computer person you may not know what those are, and that’s kind of the point; they’re almost never found in anything typed by a human; they’re what you get if you hold shift and press the backslash key on a US keyboard).

Leave a Reply

Your email address will not be published. Required fields are marked *