The hard part in becoming a command line wizard

I’ve long been impressed by shell one-liners. They seem like magical incantations. Pipe a few terse commands together, et voilà! Out pops the solution to a problem that would seem to require pages of code.

Source: http://dilbert.com/strip/1995-06-24

Are these one-liners real or mythology? To some extent, they’re both. Below I’ll give a famous real example. Then I’ll argue that even though such examples do occur, they may create unrealistic expectations.

Bentley’s exercise

In 1986, Jon Bentley posted the following exercise:

Given a text file and an integer k, print the k most common words in the file (and the number of their occurrences) in decreasing frequency.

Donald Knuth wrote an elegant program in response. Knuth’s program runs for 17 pages in his book Literate Programming.

Doug McIlroy responded with a critique of Knuth’s program and a shell pipeline of his own, short enough to quote in full [1].

    tr -cs A-Za-z '
    ' |
    tr A-Z a-z |
    sort |
    uniq -c |
    sort -rn |
    sed ${1}q
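Read stage by stage, the pipeline is less mysterious than it first appears. Here is a sketch of the same pipeline run over a small made-up sample, with k fixed at 2 for the demo (McIlroy’s `sed ${1}q` takes k as the script’s first argument):

```shell
# Each stage of McIlroy's pipeline, annotated, over a fabricated sample.
# tr -cs A-Za-z '\n'  -- replace each run of non-letters with a newline: one word per line
# tr A-Z a-z          -- fold everything to lowercase so "The" and "the" merge
# sort | uniq -c      -- group identical words and prefix each with its count
# sort -rn            -- order by count, largest first
# sed 2q              -- print the first 2 lines and quit (here k = 2)
printf 'The quick brown fox and the lazy dog and the cat\n' |
tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed 2q
```

This prints `3 the` and then `2 and` (with `uniq -c`’s leading count column), since “the” appears three times and “and” twice in the sample.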

McIlroy’s response to Knuth was like Abraham Lincoln’s response to Edward Everett at Gettysburg. Lincoln’s famous address was 50x shorter than that of the orator who preceded him [2].

Knuth and McIlroy had very different objectives and placed different constraints on themselves, and so their solutions are not directly comparable. But McIlroy’s solution has become famous. Knuth’s solution is remembered, if at all, as the verbose program that McIlroy responded to.

The stereotype of a Unix wizard is someone who could improvise programs like the one above. Maybe McIlroy carefully thought about his program for days, looking for the most elegant solution. That would seem plausible, but in fact he says the script was “written on the spot and worked on the first try.” He said that the script was similar to one he had written a year before, but it still counts as an improvisation.

Why can’t I write scripts like that?

McIlroy’s script was a real example of the kind of wizardry attributed to Unix adepts. Why can’t more people quickly improvise scripts like that?

The exercise that Bentley posed was the kind of problem that programmers like McIlroy solved routinely at the time. The tools he piped together were developed precisely for such problems. McIlroy didn’t see his solution as extraordinary but said “Old UNIX hands know instinctively how to solve this one in a jiffy.”

The traditional Unix toolbox is full of utilities for text manipulation. Not only are they useful, but they compose well. This composability depends not only on the tools themselves, but also the shell environment they were designed to operate in. (The latter is why some utilities don’t work as well when ported to other operating systems, even if the functionality is duplicated.)

Bentley’s exercise was clearly text-based: given a text file, produce a text file. What about problems that are not text manipulation? The trick to being productive from the command line is to turn problems into text manipulation problems. The output of a shell command is text. Programs are text. Once you get into the necessary mindset, everything is text. This may not be the most efficient approach to a given problem, but it’s a possible strategy.
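For instance, a question like “which file extensions are most common in this directory?” is not a text problem until you decide to treat it as one: `ls` renders the directory as text, and then the same counting idiom from the word-frequency script applies. A sketch, with a fabricated temporary directory and made-up filenames:

```shell
# Turn a filesystem question into a text-manipulation question.
# The directory and its contents are fabricated for the demo.
d=$(mktemp -d)
touch "$d/notes.txt" "$d/draft.txt" "$d/README.md"

ls -1 "$d" |          # the directory listing, as text, one name per line
sed 's/.*\.//' |      # keep only the extension of each name
sort | uniq -c |      # count each distinct extension
sort -rn              # most common first

rm -r "$d"
```

The output is `2 txt` followed by `1 md`: once the filenames are lines of text, `sort | uniq -c | sort -rn` does the rest.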

The hard part

The hard part on the path to becoming a command line wizard, or any kind of wizard, is thinking about how to apply existing tools to your particular problems. You could memorize McIlroy’s script and be prepared next time you need to report word frequencies, but applying the spirit of his script to your particular problems takes work. Reading one-liners that other people have developed for their work may be inspiring, or intimidating, but they’re no substitute for thinking hard about your particular work.

Repetition

You get faster at anything with repetition. Maybe you don’t solve any particular kind of problem often enough to be fluent at solving it. If someone can solve a problem by quickly typing a one-liner in a shell, maybe they are clever, or maybe their job is repetitive. Or maybe both: maybe they’ve found a way to make semi-repetitive tasks repetitive enough to automate. One way to become more productive is to split semi-repetitive tasks into more creative and more repetitive parts.


[1] The odd looking line break is a quoted newline.

[2] Everett’s speech contained 13,607 words while Lincoln’s Gettysburg Address contained 272, a ratio of almost exactly 50 to 1.

21 thoughts on “The hard part in becoming a command line wizard”

  1. When you say “the traditional UNIX toolbox is full of utilities for text manipulation”, does this also apply to GNU/Linux?

  2. This seems similar to writing programs in a language like R, or in Python with packages like numpy. Once you are familiar with the tools available, it is possible to write many programs in a lot fewer lines of code than most people would expect.

    Of course, we can debate whether such a program really counts as a short program. (You can write any program in one line, provided you’ve already created a function that implements it.)

  3. I maintain several software packages for both Linux and Windows. One of the things that bugs me about working with Visual Studio is the amount of clicking around in windows necessary to do something as simple as increment the version number of a program and change the Guid for the installation package.

    I ended up writing a quick shell script on Linux named msi-chver to automate the process.

    It’s called from the command line as

    msi-chver basefilename x y z

    Now I have a new installer for version x.y.z of my code that will update the program properly on Windows.

    I check the new file into git, push it to my repo, and then pull it from my Windows machine (all of the development is done on Linux). I can then build and package the Windows version with very little extra work. I sometimes have to tweak code for windows, but I spend very little time there.

    UUID=`uuidgen | tr 'a-z' 'A-Z'`

    if [ -f "$1.vdproj" ]; then
        sed "/\"ProductCode\" = \"8:{/s/{[^}]*}/{$UUID}/;/ProductVersion/s/:[^\"]*\"/:${2}.${3}.${4}\"/;/OutputFilename/s/CAPS-DAQ-[^\"]*-Installer/CAPS-DAQ-${2}.${3}.${4}-Installer/;/PostBuildEvent/s/CAPS-DAQ [^\\\\\"]*/CAPS-DAQ ${2}.${3}.${4}/;/Title/s/:CAPS-DAQ-.*-Installer/:CAPS-DAQ-${2}.${3}.${4}-Installer/" "$1.vdproj" > tmp.vdproj
        mv tmp.vdproj "$1.vdproj"
    else
        echo "Couldn't find the file $1.vdproj"
    fi

    I just switched to using the WiX Tool Set for building Windows installation packages. Now I have to rewrite my script to work with the new files. *sigh* However, it’s well worth the time to rewrite the script. It replaces a lot of tedious and error-prone GUI manipulation. I get all of the errors out of the way once, and then everything works smoothly for years.

  4. The shell command prompt is quite beguiling: Why open an IDE to solve a task I can handle with a one-liner?

    Well, it soon became a few wrapped lines of nested ‘for’ loops, so I echoed the monstrosity to a text file, to then be opened with a trivial text editor.

    After a suitable amount of enhancement and evolution, I know something has gone badly/sadly wrong when I catch myself* lamenting the absence of multi-dimensional arrays in Bash.

    *Yes, this happened to me last week.

  5. Saurish Chakrabarty

    Nice!
    Where do k (number of words) and the filename go in McIlroy’s script?
    I have put them in the following way to get it working (had to remove the sed):

    k=5
    cat file.txt | tr -cs A-Za-z '
    ' | tr A-Z a-z | sort | uniq -c | sort -rn | head -$k

  6. (Answer to Saurish’s question)
    Apparently the McIlroy pipeline is intended to be put in a command script file. The number of words, $1, is then the first argument to the script. The script acts as a filter, so you should use redirection or a pipeline to provide the input. For example, if the pipeline is saved to a file called top_n_words and that script is made executable with chmod +x top_n_words, you could invoke it like this:
    top_n_words 5 <file.txt

    You can take your code and make it also operate on files provided on the command line with a small amount of extra work. This lacks error checking, but shows the idea:

    k=$1
    shift
    cat "$@" | tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | head -$k

    and now top_n_words 5 file.txt will work without needing a redirection, but the redirection also still works.

    The best book for learning Unix, the Unix command line and shell scripting is still "The UNIX Programming Environment" by Kernighan and Pike. Unfortunately it hasn't been updated since the first edition in 1983. At that time Unix was only about 13 years old (and was spelled "UNIX"), today Unix is almost 50. Almost everything in the book still works 35+ years later. Most Linux users will need more, in particular distribution specific information about how to install and update software and perform administration tasks, but those things can be found on the internet.

    As an example, in the chapter on filters, in section 4.2, Kernighan and Pike present a simple pipeline to print the 10 most frequent words, with code equivalent to the McIlroy pipeline. Really, for someone familiar with the basic Unix toolset including sort and uniq, the only tricky part of the problem is figuring out how to get each word on a separate line, using tr with the squeeze and complement flags to replace runs of non-letters with a newline. After that, sort | uniq -c | sort -n is just a standard Unix idiom that comes up repeatedly in command line work. The earliest versions of Unix did not have the head command, hence McIlroy used sed ${1}q instead of head -n $1. (Some graybeards disdain head, pointing out that sed 5q is less typing than head -n 5.)

  7. FWIR, somewhere in the AWK book they say something to the effect of “this will take you longer the first time, but the work will pay off”. When I learned the command line, there were fewer tools for solving these problems: a C program, some shell script, and awk. Nowadays people often come to this tool set with bits of Python, Perl, or some other scripting language in their toolbelt. That, and the time it takes to ramp up on the orthogonal tool set vs. the “batteries included” tools, is friction that is hard to justify. I still think using the command line toolset makes you more productive at the command line, and it also helps you understand, when reaching for another scripting language, that you don’t have to solve it all in one go. Write the bit that is hard using the existing toolset and pipe it together. Now you have a more general tool to extend your toolset.

  8. @Doug, my solution for the exercise eats newlines and spaces in a different way:

    (for i in `tr A-Z a-z < a`;do echo $i;done)|sort | uniq -c| sort -rn | head -n5

  9. Awk is the standard tool that was designed for this sort of thing.
    Create an executable bash script with the following two lines:

    # The first argument is an integer limit, the remaining args are the file names
    k=$1; shift; cat "$@" | awk '{x[$1]++} END{for(y in x) printf("%d : [%s]\n", x[y], y)}' | sort -rn | head -$k

  10. cat file.txt | awk '{words[$0]+=1} END{for (word in words) printf("%s\t %d\n", word, words[word])}' | sort -nr -k 2 | head -$k

  11. I had posted these two solutions to the problem your post is about – which I called Bentley-Knuth problem – here, a while ago:

    The Bentley-Knuth problem and solutions

    https://jugad2.blogspot.com/2012/07/the-bentley-knuth-problem-and-solutions.html

    I had also reviewed the book you recommend, The Linux Command Line, for O’Reilly, here, again a while ago:

    https://jugad2.blogspot.com/2012/08/oreilly-book-review-linux-command-line.html

    The link in that post to the original review on the O’Reilly site is now broken, after they reorganized their site and removed all reviews since they are now mainly focusing on “selling” their books via their Safari Online product, or via Amazon.

    But the same review is available at the post I linked above.

  12. The book David mentioned is now out of print and costs a lot on the used-book market (although I agree with him that it is worth it :-) However, I got back the rights to the book from McGraw-Hill (they stopped selling books like that), and I’d like to have it printed again.

    However, I do have an electronic version ready that I would like to sell some day. If any of you have any ideas, please let me know.

    If you want one of my books, here’s one that is still in print:

    https://www.apress.com/us/book/9781484217023

    https://www.amazon.com/Harley-Hahns-Emacs-Field-Guide/dp/1484217020

    There’s a very good introduction to Unix/Linux in the first part of the book.

    – Harley Hahn

  13. Unix/Linux shells are truly wonderful. Here is a general purpose utility that I developed some time ago using the Bourne Shell, very useful if you have an operation that you want to perform in the same way for a list of files, but it can’t be done with simple wildcarding. The file name is simply “wl”.

    #!/bin/sh
    j=`echo "$@" | sed "s/__/\&/g"`
    sed "s~.*~$j~g" | csh -v

    #-SFW Generalised wildcard utility.
    #-SFW
    #-SFW Written sometime around 1990 whilst working at Moldflow Pty Ltd.
    #-SFW Try to figure out how it works. It’s a good brain teaser.
    #-SFW
    #-SFW Documentation written 18 Jan 2012 (see, it only took me 22 years).
    #-SFW
    #-SFW Usage: Create a list of text entries (can be anything), then construct
    #-SFW a command that you want to issue to process all of these text entries.
    #-SFW In the command where you would insert the text entry, use “__”
    #-SFW (double underscore). Pipe the list of text entries to this utility,
    #-SFW and give the command as a string argument. Each instance of “__”
    #-SFW will be replaced by one of the lines coming via the pipe, and
    #-SFW the command will be executed, then it will proceed to the next line.
    #-SFW
    #-SFW This utility is especially helpful when you want to process a bunch
    #-SFW of files in a special way, e.g. rename them all to something different.
    #-SFW
    #-SFW Examples:
    #-SFW
    #-SFW [1] Compile all .f files
    #-SFW
    #-SFW ls -1 *.f | wl 'gfortran -c __'
    #-SFW
    #-SFW
    #-SFW [2] Compare all .txt files with equivalent files in the directory above
    #-SFW
    #-SFW ls -1 *.txt | wl 'diff __ ../__'
    #-SFW
    #-SFW
    #-SFW [3] Rename all files beginning with AAA to make them begin with BBB
    #-SFW
    #-SFW ls -1 | sed 's/^AAA//' | wl 'mv AAA__ BBB__'
