I learned to use Unix in college—this was before Linux—but it felt a little mysterious. Clearly it was developed by really smart people, but what were the problems that motivated their design choices?
Some of these are widely repeated. For example, commands have terse names because you may have to transmit commands over a glacial 300 baud network connection.
OK, but why are there so many tools for munging text files, for example? That’s great if your job requires munging text files, but what about everything else? What I didn’t realize at the time was that nearly everything involves munging text files, or can be turned into a problem involving munging text files.
Working with data at the command line
There’s an old joke that Unix is user friendly, it’s just picky about who its friends are. I’d rephrase to say Unix makes more sense when you’re doing the kind of work the Unix developers were doing.
I was writing programs when I learned Unix, so some things about Unix made sense at the time. But I didn’t see the motivation for many of the standard command line tools until I started analyzing datasets years later. I thought
awk was cool—it was the first scripting language I encountered—but it wasn’t until years later that I realized
awk is essentially a command line spreadsheet. It was written for manipulating tabular data stored in text files.
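To make the spreadsheet analogy concrete, here is a one-line awk program that sums a column, much as you would with a spreadsheet’s SUM function. The numbers are made-up sample data, not from any real file:

```shell
# awk splits each line into fields ($1, $2, ...), so summing the
# second column takes one pattern-action rule plus an END block.
printf '1 10\n2 20\n3 30\n' | awk '{ total += $2 } END { print total }'
# prints 60
```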
Unix one-liners are impressive, but they can seem like a rabbit out of a hat. How would anyone think to do that?
When you develop your own one-liners, one piece at a time, they seem much more natural. You get a feel for how the impressive one-liners you see on display were developed incrementally. They almost certainly did not pop into the world fully formed like Athena springing from the head of Zeus.
Example: Voter registration data
Here’s an example. I was looking at Washington state voter registration data. There’s a file
20240201_VRDB_Extract.txt. What’s in there?
The first line of a data file often contains column headers. Looking at just the first few lines of a file is a perennial task, so there’s a tool for that:
head. By default it shows the first 10 lines of a file. We just want to see the first line, and there’s an option for that:
> head -n 1 20240201_VRDB_Extract.txt
Inserting line breaks
OK, those look like column headers, but they’re hard to read. It would be nice if we could replace all the pipe characters used as field separators with line breaks. There’s a command for that too. The
sed tool lets you, among other things, replace one string with another. The tiny program s/|/\n/ does just what we want. It may look cryptic, but it’s very straightforward. The “s” stands for substitute. The program s/foo/bar/ substitutes the first instance of foo with bar. If you want to replace all instances, you tack a “g” on the end for “global.”
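The difference between the two is easy to see with a quick experiment; the string “foo foo” here is just sample input:

```shell
echo "foo foo" | sed 's/foo/bar/'    # replaces only the first match: bar foo
echo "foo foo" | sed 's/foo/bar/g'   # replaces every match: bar bar
```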
Eliminating temporary files
We could save our list of column headings to a file, and then run
sed on the output, but that creates an unnecessary temporary file. If you do this very much, you get a lot of temporary files cluttering your working area, say with names like
temp2. Then after a while you start to forget what you named each intermediary file.
It would be nice if you could connect your processing steps together without having to create intermediary files. And that’s just what pipes do. So instead of saving our list of column headers to a file, we pipe it through to sed.
> head -n 1 20240201_VRDB_Extract.txt | sed 's/|/\n/g'
Scrolling and searching
This is much better. But it produces more output than you may be able to see in your terminal. You could see the list, one terminal window at a time, by piping the output to less.
> head -n 1 20240201_VRDB_Extract.txt | sed 's/|/\n/g' | less
This file only has 33 columns, but it’s not uncommon for a data file to have hundreds of columns. Suppose there were more columns than you wanted to scan through, and you wanted to know whether one of the columns contained a zip code. You could do that by piping the output through
grep to look for “zip.”
> head -n 1 20240201_VRDB_Extract.txt | sed 's/|/\n/g' | grep -i zip
A plain grep zip would come up empty: there are no column headings containing “zip,” though there are a couple containing “Zip.” The -i option (for case insensitive) finds the zip code columns.
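You can see the effect of -i with a toy example. The column names below are hypothetical stand-ins, not taken from the voter file:

```shell
# Case-sensitive search misses the capitalized heading...
printf 'RegZipCode\nFirstName\n' | grep zip     # no output
# ...while a case-insensitive search finds it.
printf 'RegZipCode\nFirstName\n' | grep -i zip  # prints RegZipCode
```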
Our modest little one-liner now has three segments separated by pipes. It might look impressive to someone new to working this way, but it’s really just stringing common commands together in a common way.
A famous one-liner
When you see a more complicated one-liner like
tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q
you can imagine how it grew incrementally. Incidentally, the one-liner above is somewhat famous: it comes from Doug McIlroy’s review of Donald Knuth’s word-counting program in Jon Bentley’s Programming Pearls column.
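As a quick check, here is roughly the same word-counting pipeline applied to a made-up sentence, keeping a sort before uniq -c (which uniq needs, since it only counts adjacent duplicate lines) and using head instead of a final sed step to truncate the output:

```shell
# Break text into one word per line, lowercase it, then count
# occurrences and list the most frequent words first.
echo "The quick fox jumps over the lazy dog and the cat" |
    tr -cs A-Za-z '\n' |
    tr A-Z a-z |
    sort |
    uniq -c |
    sort -rn |
    head -n 3
# the top line shows the most frequent word: 3 the
```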