Munging CSV files with standard Unix tools

This post briefly discusses working with CSV (comma separated value) files using command line tools that are usually available on any Unix-like system. This will raise two objections: why CSV and why dusty old tools?

Why CSV?

In theory, and occasionally in practice, CSV can be a mess. But CSV is the de facto standard format for exchanging data. Some people like this, some lament this, but that’s the way it is.

A minor variation on comma-separated values is tab-separated values [1].

Why standard utilities?

Why use standard Unix utilities? I’ll point out some of their quirks, which are arguments for using something else. But the assumption here is that you don’t want to use something else.

Maybe you already know the standard utilities and don’t think that learning more specialized tools is worth the effort.

Maybe you’re already at the command line and in a command line state of mind, and don’t want to interrupt your work flow by doing something else.

Maybe you’re on a computer where you don’t have the ability to install any software and so you need to work with what’s there.

Whatever your reasons, we’ll go with the assumption that we’re committed to using commands that have been around for decades.

cut, sort, and awk

The tools I want to look at are cut, sort, and awk. I wrote about cut the other day and apparently the post struck a chord with some readers. This post is a follow-up to that one.

These three utilities are standard on Unix-like systems. You can also download them for Windows from GOW. The port of sort will be named gsort in order to not conflict with the native Windows sort function. There’s no need to rename the other two utilities since they don’t have counterparts that ship with Windows.

The sort command is simple and useful. There are just a few options you’ll need to know about. The utility sorts fields as text by default, but the -n tells it to sort numerically.

Since we’re talking about CSV files, you’ll need to know that -t, is the option to tell sort that fields are separated by commas rather than white space. And to specify which field to sort on, you give it the -k option.

The last utility, awk, is more than a utility. It’s a small programming language. But it works so well from the command line that you can almost think of it as a command line utility. It’s very common to pipe output to an awk program that’s only a few characters long.

You can get started quickly with awk by reading Greg Grothous’ article Why you should learn just a little awk.


Now for the bad news: these programs are inconsistent in their options. The two most common things you’ll need to do when working with CSV files is to set your field delimiter to a comma and specify what field you want to grab. Unfortunately this is done differently in every utility.

cut uses -d or --delimiter to specify the field delimiter and -f or --fields to specify fields. Makes sense.

sort uses -t or --field-separator to specify the field delimiter and -k or --key to specify the field. When you’re talking about sorting things, it’s common to call the fields keys, and so the way sort specifies fields makes sense in context. I see no reason for -t other than -f was already taken. (In sorting, you talk about folding upper case to lower case, so -f stands for fold.)

awk uses -F or --field-separator to specify the field delimiter. At least the verbose option is consistent with sort. Why -F for the short option instead of -f? The latter was already taken for file. To tell awk to read a program from a file rather than the command line you use the -f option.

awk handles fields differently than cut and sort. Because it is a programming language designed to parse delimited text files, each field has a built-in variable: $1 holds the content of the first field, $2 the second, etc.

The following compact table summarizes how you tell each utility that you’re working with comma-separated files and that you’re interested in the second field.

    | cut  | -d, | -f2 |
    | sort | -t, | -k2 |
    | awk  | -F, | $2  |


Some will object that the inconsistencies documented above are a good example of why you shouldn’t work with CSV files using cut, sort, and awk. You could use other command line utilities designed for working with CSV files. Or pull your CSV file into R or Pandas. Or import it somewhere to work with it in SQL. Etc.

The alternatives are all appropriate for different uses. The premise here is that in some circumstances, the inconsistencies cataloged above are a regrettable but acceptable price to pay to stay at the command line.


[1] Things get complicated if you have a CSV file and fields contain commas inside strings. Tab-separated files are more convenient in this case, unless, of course, your strings contain tabs. The utilities mentioned here all support tab as a delimiter by default.

10 thoughts on “Munging CSV files with standard Unix tools

  1. Omar Antolín-Camarena

    The best reason not to use cut and sort for CSV and also not to use awk naively is simply that it doesn’t work at all! None of those tools understand that

    hello,”how are you, John?”

    is only two fields.

  2. Yes, that’s one of the problems that can come up. But it’s been a long time since I’ve run into that personally. My data files are mostly numbers, or sometimes text that doesn’t contain punctuation.

    These tools are good enough for interactive exploration, but not production.

  3. Not sure why people say these tools do not work for CSV, maybe they need some practice. Sure, the options are inconsistent, but if you plan to use these tools well, you need to know regex, which makes the options pale in complexity.

    Sometimes you need to use imagination to rip apart the data and bring out the parts you need.

  4. csvtool never lets you down, very quick as it will call shell-functions for every row. Also, termsql allows you to do SQL on csv by importing it into a temporary sqlite database.

  5. When you need to select a proper tool outside traditional Unix utilities, try xsv. It is fast and solid, and it won’t be confused by commas between quotes!

  6. -> Omar Antolín-Camarena
    sed is a powerfull text formatter
    % echo hello,”how are you, John.” | sed ‘s/([^,]*)|(”[^”]*”)/&,/g;s/,$//’ | awk -F,, ‘{print $1, $2}’
    hello ”how are you, John.”

Leave a Reply

Your email address will not be published. Required fields are marked *