Unix one-liners working with data files at the command line

I learned to use Unix in college—this was before Linux—but it felt a little mysterious. Clearly it was developed by really smart people, but what were the problems that motivated their design choices?

Some of these are widely repeated. For example, commands have terse names because you may have to transmit commands over a glacial 300 baud network connection.

OK, but why are there so many tools for munging text files, for example? That’s great if your job requires munging text files, but what about everything else. What I didn’t realize at the time was that nearly everything involves munging text files, or can be turned into a problem involving munging text files.

Working with data at the command line

There’s an old joke that Unix is user friendly, it’s just picky about who its friends are. I’d rephrase to say Unix makes more sense when you’re doing the kind of work the Unix developers were doing.

I was writing programs when I learned Unix, so some things about Unix made sense at the time. But I didn’t see the motivation for many of the standard command line tools until I started analyzing datasets years later. I thought awk was cool—it was the first scripting language I encountered—but it wasn’t until years later that I realized awk is essentially a command line spreadsheet. It was written for manipulating tabular data stored in text files.

Mythological scripts

Unix one-liners are impressive, but they can seem like a rabbit out of a hat. How would anyone think to do that?

When you develop your own one liners, one piece at a time, they seem much more natural. You get a feel for how the impressive one-liners you see on display were developed incrementally. They almost certainly did not pop into the world fully formed like Athena springing from the head of Zeus.

Example: Voter registration data

Here’s an example. I was looking at Washington state voter registration data. There’s a file 20240201_VRDB_Extract.txt. What’s in there?

The first line of a data file often contains column headers. Looking at just the first few lines of a file is a perennial task, so there’s a tool for that: head. By default it shows the first 10 lines of a file. We just want to see the first line, and there’s an option for that: -n 1.

> head -n 1 20240201_VRDB_Extract.txt

StateVoterID|FName|MName|LName|NameSuffix|Birthyear|Gender|RegStNum|RegStFrac|RegStName|RegStType|RegUnitType|RegStPreDirection|RegStPostDirection|RegStUnitNum|RegCity|RegState|RegZipCode|CountyCode|PrecinctCode|PrecinctPart|LegislativeDistrict|CongressionalDistrict|Mail1|Mail2|Mail3|MailCity|MailZip|MailState|MailCountry|Registrationdate|LastVoted|StatusCode

Inserting line breaks

OK, those look like column headers, but they’re hard to read. It would be nice if we could replace all the pipe characters used as field separators with line breaks. There’s a command for that too. The sed tool let’s you, among other things, replace one string with another. The tiny sed program

s/|/\n/g

does just what we want. It may look cryptic, but it’s very straight forward. The “s” stands for substitute. The program s/foo/bar/ substitutes the first instance of foo with bar. If you want to replace all instances, you tack on a “g” on the end for “global.”

Eliminating temporary files

We could save our list of column headings to a file, and then run sed on the output, but that creates an unnecessary temporary file. If you do this very much, you get a lot of temporary files cluttering your working area, say with names like temp1 and temp2. Then after a while you start to forget what you named each intermediary file.

It would be nice if you could connect your processing steps together without having to create intermediary files. And that’s just what pipes do. So instead of saving our list of column headers to a file, we pipe it through to sed.

> head -n 1 20240201_VRDB_Extract.txt | sed 's/|/\n/g'

FName
MName
LName
NameSuffix
Birthyear
...

Scrolling and searching

This is much better. But it produces more output than you may be able to see in your terminal. You could see the list, one terminal window at a time, by piping the output to less.

> head -n 1 20240201_VRDB_Extract.txt | sed 's/|/\n/g' | less

This file only has 33 columns, but it’s not uncommon for a data file to have hundreds of columns. Suppose there were more columns than you wanted to scan through, and you wanted to know whether one of the columns contained a zip code. You could do that by piping the output through grep to look for “zip.”

> head -n 1 20240201_VRDB_Extract.txt | sed 's/|/\n/g' | grep -i zip

There are no column headings containing “zip”, but there are a couple containing “Zip.” Adding -i (for case insensitive) finds the zip code columns.

RegZipCode
MailZip

Our modest little one-liner now has three segments separated by pipes. It might look impressive to someone new to working this way, but it’s really just stringing common commands together in a common way.

A famous one-liner

When you see a more complicated one-liner like

tr -cs A-Za-z '
' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q

you can imagine how it grew incrementally. Incidentally, the one-liner above is somewhat famous. You can find the story behind it here.

One thought on “Natural one-liners”

Ross

26 February 2024 at 09:44

“… awk is essentially a command line spreadsheet…”

Brilliant.

I’ll say it again: “Thar’s a book in them thar blog posts, matey!”

Seriously, given the right editor and copious free time, creating a book from The Cook Corpus™ would provide one-of-a-kind practical resource.

Natural one-liners