Unix via etymology

There are similarities across Unix tools that I’ve seldom seen explicitly pointed out. For example, the dollar sign $ refers to the end of things in multiple contexts. In regular expressions, it marks the end of a string. In sed, it refers to the last line of a file. In vi, it is the command to move to the end of a line.
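
For example, here is a quick sketch of the first two cases (the file name is made up):

    # regular expressions: match lines that end in "done"
    grep 'done$' log.txt

    # sed: print only the last line of the file
    sed -n '$p' log.txt

And in vi, pressing $ moves the cursor to the end of the current line.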

Think of various Unix utilities lined up in columns: a column for grep, a column for less, a column for awk, etc. I’m thinking about patterns that cut horizontally across the tools.

Tools are usually presented vertically, one tool at a time, and for obvious reasons. If you need to use grep, for example, then you want to learn grep, not study a dozen tools simultaneously, accumulating horizontal cross sections like plastic laid down by a 3D printer, until you finally know what you need to get your job done.

On the other hand, I think it would be useful to point out some of the similarities across tools that experienced users pick up eventually but seldom verbalize. At a minimum, this would make an interesting blog post. (That’s not what this post is; I’m just throwing out an idea.) There could be enough material to make a book or a course.

I used the word etymology in the title, but that’s not exactly what I mean. Maybe I should have said commonality. I’m more interested in pedagogy than history. There are books on Unix history, but that’s not quite what I have in mind. I have in mind a reader who would appreciate help mentally organizing a big pile of syntax, not someone who is necessarily a history buff.

If the actual etymology isn’t too complicated, then history and pedagogy can coincide. For example, it’s helpful to know that sed and vi have common syntax that they inherited from ed. But pseudo-history, a sort of historical fiction, might be more useful than factually correct history with all its rabbit trails and dead ends. This would be a narrative that tells a story of how things could have developed, even while acknowledging that the actual course of events was more complicated.

Related post: Command option patterns

tcgrep: grep rewritten in Perl

In The Perl Cookbook, Tom Christiansen gives his rewrite of the Unix utility grep that he calls tcgrep. You don’t have to know Perl to use tcgrep, but you can send it Perl regular expressions.

Why not grep with PCRE?

You can get basically the same functionality as tcgrep by using grep with its PCRE option -P. Since tcgrep searches directories recursively, a more direct comparison would be

    grep -R -P
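
For a concrete, hypothetical example, a recursive search for a pattern that uses Perl-style lookahead might look like this (the pattern and directory are made up):

    # GNU grep with PCRE, searching a directory tree
    grep -R -P 'foo(?=bar)' src/

    # roughly the same search with tcgrep, which recurses by default
    tcgrep 'foo(?=bar)' src/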

However, your version of grep might not support -P. And if it does, its Perl-compatible regular expressions might not be completely Perl-compatible. The man page for grep on my machine says

    -P, --perl-regexp
        Interpret the pattern as a Perl-compatible regular 
        expression (PCRE). This is experimental and grep -P 
        may warn of unimplemented features.

The one implementation of regular expressions guaranteed to be fully Perl-compatible is Perl.

If the version of grep on your system supports the -P option and is adequately Perl-compatible, it will run faster than tcgrep. But if you find yourself on a computer that has Perl but not a recent version of grep, you may find tcgrep handy.

Installation

tcgrep is included as part of the Unicode::Tussle Perl module; since tcgrep is a wrapper around Perl, it is as Unicode-compliant as Perl is. So you could install tcgrep (and several more utilities) with

    cpan Unicode::Tussle

This worked for me on Linux without any issues but the install failed on Windows.

I installed tcgrep on Windows by simply copying the source code. (I don’t recall now where I found the source code. I didn’t see it this morning when I searched for it, but I imagine I could have found it if I’d been more persistent.) I commented out the definition of %Compress to disable searching inside compressed files since this feature required Unix utilities not available on Windows.

Consistency

Another reason to use tcgrep is consistency. Perl is criticized for being inconsistent. The Camel book itself says

In general, Perl functions do exactly what you want—unless you want consistency.

But Perl’s inconsistencies are different, and in my opinion less annoying, than the inconsistencies of Unix tools.

Perl is inconsistent in the sense that functions behave differently in different contexts, such as a scalar context or a list context.

Unix utilities are inconsistent across platforms and across tools. For example, a tool like sed will have different features on different platforms, and it will not support the same regular expressions as another tool such as awk.
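
As a small illustration of how regular expression dialects differ from tool to tool, the + repetition operator means different things to awk and to grep in its default mode:

    echo aaa | awk '/a+/'      # ERE: + means "one or more", so this matches
    echo aaa | grep 'a+'       # BRE: + is a literal character, so no match
    echo aaa | grep -E 'a+'    # -E switches grep to ERE, and it matches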

Perl was written to be a “portable distillation of Unix culture.” As inconsistent as Perl is, it’s more consistent than Unix.


Opening Windows files from bash and eshell

I often work in a sort of amphibious environment, using Unix software on Windows. As you can well imagine, this causes headaches. But I’ve found such headaches are generally more manageable than the headaches from alternatives I’ve tried.

On the Windows command line, you can type the name of a file and Windows will open the file with the default application associated with its file extension. For example, typing foo.docx and pressing Enter will open the file by that name using Microsoft Word, assuming that is your default application for .docx files.

Unix shells don’t work that way. The first thing you type at the command prompt must be a command, and foo.docx is not a command. The Windows command line generally works this way too, but it makes an exception for files with recognized extensions; the command is inferred from the extension and the file name is an argument to that command.

WSL bash

When you’re running bash on Windows, via WSL (Windows Subsystem for Linux), you can run the Windows utility start which will open a file according to its extension. For example,

    cmd.exe /C start foo.pdf

will open the file foo.pdf with your default PDF viewer.

You can also use start to launch applications without opening a particular file. For example, you could launch Word from bash with

    cmd.exe /C start winword.exe

Emacs eshell

Eshell is a shell written in Emacs Lisp. If you’re running Windows and you do not have access to WSL but you do have Emacs, you can run eshell inside Emacs for a Unix-like environment.

If you try running

    start foo.pdf

that will probably not work because eshell does not use the Windows PATH environment variable.

I got around this by creating a Windows batch file named mystart.bat and putting it in my path. The batch file simply calls start with its argument:

    REM pass the first argument through to the Windows start command
    start %1

Now I can open foo.pdf from eshell with

    mystart foo.pdf

The solution above for bash

    cmd.exe /C start foo.pdf

also works from eshell.

(I just realized I said two contradictory things: that eshell does not use your path, and that it found a batch file in my path. I don’t know why the latter works. I keep my batch files in c:/bin, which is a Unix-like location, and maybe eshell looks there, not because it’s in my Windows path, but because it’s in what it would expect to be my path based on Unix conventions. I’ve searched the eshell documentation, and I don’t see how to tell what it uses for a path.)


Computational survivalist


Some programmers and systems engineers try to do everything they can with basic command line tools on the grounds that someday they may be in an environment where that’s all they have. I think of this as a sort of computational survivalism.

I’m not much of a computational survivalist, but I’ve come to appreciate such a perspective. It’s an efficiency/robustness trade-off, and in general I’ve come to appreciate the robustness side of such trade-offs more over time. It especially makes sense for consultants who find themselves working on someone else’s computer with no ability to install software. I’m not often in that position, but that’s kinda where I am on one project.

Example

I’m working on a project where all my work has to be done on the client’s laptop, and the laptop is locked down for security. I can’t install anything. I can request to have software installed, but it takes a long time to get approval. It’s a Windows box, and I requested a set of ports of basic Unix utilities at the beginning of the project, not knowing what I might need them for. That has turned out to be a fortunate choice on several occasions.

For example, today I needed to count how many times certain characters appear in a large text file. My first instinct was to write a Python script, but I don’t have Python. My next idea was to use grep -c, but that would count the number of lines containing a given character, not the number of occurrences of the character per se.

I did a quick search and found a Stack Overflow question “How can I use the UNIX shell to count the number of times a letter appears in a text file?” On the nose! The top answer said to use grep -o and pipe it to wc -l.

The -o option tells grep to output the regex matches, one per line. So counting the number of lines with wc -l gives the number of matches.
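
For instance, counting how many times the letter e appears in a file might look like this (the file name is made up):

    # print each match on its own line, then count the lines
    grep -o 'e' bigfile.txt | wc -l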

Computational minimalism

Computational minimalism is a variation on computational survivalism. Computational minimalists limit themselves to a small set of tools, maybe the same set of tools as a computational survivalist, but for different reasons.

I’m more sympathetic to minimalism than survivalism. You can be more productive by learning to use a small set of tools well than by hacking away with a large set of tools you hardly know how to use. I use a lot of different applications, but not as many as I once used.


What sticks in your head

This morning I read an article by Dennis Felsing about his impressive/intimidating Linux desktop setup. He uses a lot of tools that are not the easiest way to get things done immediately but are long-term productivity investments.

Remembrance of syntax past

Felsing apparently is able to remember the syntax of scores of tools and programming languages. I cannot. Part of the reason is practice. I cannot remember the syntax of any software I don’t use regularly. It’s tempting to say that’s the end of the story: use it or lose it. Everybody has their set of things they use regularly and remember.

But I don’t think that’s all. I remember bits of math that I haven’t used in 30 years. Math fits in my head and sticks. Presumably software syntax sticks in the heads of people who use a lot of software tools.

There is some software syntax I can remember, however, and that’s software closely related to math. As I commented here, it was easy to come back to Mathematica and LaTeX after not using them for a few years.

Imprinting

Imprinting has something to do with this too: it’s easier to remember what we learn when we’re young. Felsing says he started using Linux in 2006, and his site says he graduated college in 2012, so presumably he was a high school or college student when he learned Linux.

When I was a student, my software world consisted primarily of Unix, Emacs, LaTeX, and Mathematica. These are all tools that I quit using for a few years, later came back to, and use today. I probably remember LaTeX and Mathematica syntax in part because I used them when I was a student. (I also think Mathematica in particular has an internal consistency that makes its syntax easier to remember.)

Picking your memory battles

I see the value in Felsing’s choice of tools. For example, the xmonad window manager. I’ve tried it, and I could imagine that it would make you more productive if you mastered it. But I don’t see myself mastering it.

I’ve learned a few tools with lots of arbitrary syntax, e.g. Emacs. But since I don’t have a prodigious memory for such things, I have to limit the number of tools I try to keep loaded in memory. Other things I load as needed, such as a language a client wants me to use that I haven’t used in a while.

Revisiting a piece of math doesn’t feel to me like revisiting a programming language. Brushing up on something from differential equations, for example, feels like pulling a book off a mental shelf. Brushing up on C# feels like driving to a storage unit, bringing back an old couch, and struggling to cram it in the door.

Middle ground

There are things you use so often that you remember their syntax without trying. And there are things you may never use again, and it’s not worth memorizing their syntax just in case. Some things are in the middle: things you don’t use often enough to remember naturally, but often enough that you’d like to remember them deliberately. Some of these are what I call bicycle skills, things that you can’t learn just-in-time. For things in this middle ground, you might try something like Anki, a flashcard program with spaced repetition.

However, this middle ground should be very narrow, at least in my experience/opinion. For the most part, if you don’t use something often enough to keep it loaded in memory, I’d say either let it go or practice using it regularly.


The hard part in becoming a command line wizard

I’ve long been impressed by shell one-liners. They seem like magical incantations. Pipe a few terse commands together, et voilà! Out pops the solution to a problem that would seem to require pages of code.

(Comic: http://dilbert.com/strip/1995-06-24)

Are these one-liners real or mythology? To some extent, they’re both. Below I’ll give a famous real example. Then I’ll argue that even though such examples do occur, they may create unrealistic expectations.

Bentley’s exercise

In 1986, Jon Bentley posted the following exercise:

Given a text file and an integer k, print the k most common words in the file (and the number of their occurrences) in decreasing frequency.

Donald Knuth wrote an elegant program in response. Knuth’s program runs for 17 pages in his book Literate Programming.

In response, Doug McIlroy wrote a shell pipeline short enough to quote in full below [1].

    tr -cs A-Za-z '
    ' |
    tr A-Z a-z |
    sort |
    uniq -c |
    sort -rn |
    sed ${1}q
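
To make the ${1} at the end concrete: if the pipeline were saved as an executable shell script named, say, wordfreq (a name I’m making up), k would be passed as the script’s first argument and the text would come from standard input:

    # print the 10 most common words in book.txt
    ./wordfreq 10 < book.txt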

McIlroy’s response to Knuth was like Abraham Lincoln’s response to Edward Everett at Gettysburg. Lincoln’s famous address was 50x shorter than that of the orator who preceded him [2]. (Update: There’s more to the story. See [3].)

Knuth and McIlroy had very different objectives and placed different constraints on themselves, and so their solutions are not directly comparable. But McIlroy’s solution has become famous. Knuth’s solution is remembered, if at all, as the verbose program that McIlroy responded to.

The stereotype of a Unix wizard is someone who could improvise programs like the one above. Maybe McIlroy carefully thought about his program for days, looking for the most elegant solution. That would seem plausible, but in fact he says the script was “written on the spot and worked on the first try.” He said that the script was similar to one he had written a year before, but it still counts as an improvisation.

Why can’t I write scripts like that?

McIlroy’s script was a real example of the kind of wizardry attributed to Unix adepts. Why can’t more people quickly improvise scripts like that?

The exercise that Bentley posed was the kind of problem that programmers like McIlroy solved routinely at the time. The tools he piped together were developed precisely for such problems. McIlroy didn’t see his solution as extraordinary but said “Old UNIX hands know instinctively how to solve this one in a jiffy.”

The traditional Unix toolbox is full of utilities for text manipulation. Not only are they useful, but they compose well. This composability depends not only on the tools themselves, but also on the shell environment they were designed to operate in. (The latter is why some utilities don’t work as well when ported to other operating systems, even if the functionality is duplicated.)

Bentley’s exercise was clearly text-based: given a text file, produce a text file. What about problems that are not text manipulation? The trick to being productive from a command line is to turn problems into text manipulation problems.  The output of a shell command is text. Programs are text. Once you get into the necessary mindset, everything is text. This may not be the most efficient approach to a given problem, but it’s a possible strategy.
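
For example, “which file extensions are most common in this directory?” doesn’t sound like a text problem, but a listing of file names is text, so a rough sketch (not a robust script) might be

    # count files by extension; files without a dot are counted under their full name
    ls | sed 's/.*\.//' | sort | uniq -c | sort -rn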

The hard part

The hard part on the path to becoming a command line wizard, or any kind of wizard, is thinking about how to apply existing tools to your particular problems. You could memorize McIlroy’s script and be prepared next time you need to report word frequencies, but applying the spirit of his script to your particular problems takes work. Reading one-liners that other people have developed for their work may be inspiring, or intimidating, but they’re no substitute for thinking hard about your particular work.

Repetition

You get faster at anything with repetition. Maybe you don’t solve any particular kind of problem often enough to be fluent at solving it. If someone can solve a problem by quickly typing a one-liner in a shell, maybe they are clever, or maybe their job is repetitive. Or maybe both: maybe they’ve found a way to make semi-repetitive tasks repetitive enough to automate. One way to become more productive is to split semi-repetitive tasks into more creative and more repetitive parts.


[1] The odd-looking line break is a quoted newline.

[2] Everett’s speech contained 13,607 words while Lincoln’s Gettysburg Address contained 272, a ratio of almost exactly 50 to 1.

[3] See Hillel Wayne’s post Donald Knuth was Framed. Here’s an excerpt:

Most of the “eight pages” aren’t because Knuth is doing LP [literate programming], but because he’s Donald Knuth:

  • One page is him setting up the problem (“what do we mean by ‘word’? What if multiple words share the same frequency?”) and one page is just the index.
  • Another page is just about working around specific Pascal issues no modern language has, like “how do we read in an integer” and “how do we identify letters when Pascal’s character set is poorly defined.”
  • Then there’s almost four pages of handrolling a hash trie.

The “eight pages” refers to the length of the original publication. I described the paper as 17 pages because that is the length in the book where I found it.

Improving on the Unix shell

Yesterday I ran across Askar Safin’s blog post The Collapse of the UNIX Philosophy. Two quotes from the post stood out. One was from Rob Pike about the Unix ideal of little tools that each do one job:

Those days are dead and gone and the eulogy was delivered by Perl.

The other was a line from James Hague:

… if you romanticize Unix, if you view it as a thing of perfection, then you lose your ability to imagine better alternatives and become blind to potentially dramatic shifts in thinking.

This brings up something I’ve long wondered about: What did the Unix shell get right that has made it so hard to improve on? It has some truly awful quirks, and yet people keep coming back to it. Alternatives that seem more rational don’t work so well in practice. Maybe it’s just inertia, but I don’t think so. There are other technologies from the 1970’s that had inertia behind them but have been replaced. The Unix shell got something so right that it makes it worth tolerating the flaws. Maybe some of the flaws aren’t even flaws but features that serve some purpose that isn’t obvious.

(By the way, when I say “the Unix shell” I have in mind similar environments as well, such as the Windows command line.)

On a related note, I’ve wondered why programming languages and shells work so differently. We want different things from a programming language and from a shell or REPL. Attempts to bring a programming language and shell closer together sound great, but they inevitably run into obstacles. At some point, we have different expectations of languages and shells and don’t want the two to be too similar.

Anthony Scopatz and I discussed this in an interview a while back in the context of xonsh, “a Python-powered, cross-platform, Unix-gazing shell language and command prompt.” While writing this post I went back to reread Anthony’s comments and appreciate them more now than I did then.

Maybe the Unix shell is near a local optimum. It’s hard to make much improvement without making big changes. As Anthony said, “you quickly end up where many traditional computer science people are not willing to go.”

Related post: What’s your backplane?

Three views of Windows and Unix

Rob Pike gave a presentation in 2001 entitled “The Good, the Bad, and the Ugly: The Unix Legacy.” His main point is that diversity has been bad for Unix. He opens his presentation with a couple of quotes to set this up.

“The number of UNIX installations has grown to 10, with more expected.” — The UNIX Programmer’s Manual, 2nd Edition, June 1972.

The number of UNIX variants has grown to dozens, with more expected.

He discusses much more than diversity, and I believe the more interesting parts of his talk are on other topics, but he begins and ends with diversity. One of his last slides says

Microsoft succeeds not because it’s good, but because there’s only one of them. … Unixes of the World, Unite!

Joel Spolsky has a different take on the differences between the operating systems in his article Biculturalism. Spolsky says that Unix software is programmer-friendly but Windows software is user-friendly for the vast majority of users who are not programmers. But Spolsky does touch on the diversity issue that Pike raised.

For example, Unix has a value of separating policy from mechanism which, historically, came from the designers of X. This directly led to a schism in user interfaces; nobody has ever quite been able to agree on all the details of how the desktop UI should work, and they think this is OK, because their culture values this diversity, but for Aunt Marge it is very much not OK to have to use a different UI to cut and paste in one program than she uses in another.

Just to throw in my two cents worth, I’ll mention my blog post Where the Unix philosophy breaks down. The Unix philosophy is to write little programs that do one thing well, then sew these little programs together to do your work. The problem is that many people lack the desire or skill to do the sewing. They want to avoid the transaction costs of switching software applications. Pike alludes to this problem, dismissively saying that people want “hand-holding” rather than pipes.

I don’t think this desire for integrated applications is necessarily a problem for Unix, only for the Unix philosophy that Unix doesn’t follow too strictly. The emphasis on orthogonal programs is a laudable ideal. It just needs to be tempered a bit for the convenience of mortal users.

Software knowledge shelf life

In my experience, software knowledge has a longer useful shelf life in the Unix world than in the Microsoft world. (In this post Unix is a shorthand for Unix and Linux.)

A pro-Microsoft explanation would say that Microsoft is more progressive, always improving their APIs and tools, and that Unix is stagnant.

A pro-Unix explanation would say that Unix got a lot of things right the first time, that it is more stable, and that Microsoft’s technology turn-over is more churn than progress.

Pick your explanation. But for better or worse, change comes slower on the Unix side. And when it comes, it’s less disruptive.

At least that’s how it seems to me. Although I’ve used Windows and Unix, I’ve done different kinds of work on the two platforms. Maybe the pace of change relates more to the task than the operating system. Also, I have more experience with Windows and so perhaps I’m more aware of the changes there. But nearly everything I knew about Unix 20 years ago is still useful, and much of what I knew about Windows 10 years ago is not.


Awk one-liners

Peteris Krumins has written a fine little book Awk One-Liners Explained. It’s just 58 pages, and it’s an easy read.

As I commented here, I typically try to master the languages I use. But for some languages, like awk and sed, it makes sense to learn just a small, powerful subset. (The larger a language is, the harder it can be to just learn part of it because the features intertwine.) Krumins’ book would be good for someone looking to learn just a little awk rather than wanting to explore every dark corner of the language.

Awk One-Liners Explained is exactly what the title would lead you to expect. It has 70 awk one-liners along with a commentary on each. Some of the one-liners solve common specific problems, such as converting between Windows and Unix line endings. Most of the one-liners are solutions to general types of problems rather than code anyone is likely to run verbatim. For example, one of the one-liners is

Change “scarlet” or “ruby” or “puce” to “red.”

I doubt anybody has ever had to solve that exact problem, but it’s not hard to imagine wanting to do something similar.
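
For the record, that one-liner is along these lines (my reconstruction, not a quote from the book; the file name is made up):

    # replace every occurrence of scarlet, ruby, or puce with red
    awk '{ gsub(/scarlet|ruby|puce/, "red"); print }' somefile.txt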

Because the book is entirely about one-line programs, it doesn’t cover how to write complex programs in awk. That’s perfect for me. If something takes more than one line of awk, I probably don’t want to use awk. I use awk for quick file filtering. If a task requires writing several lines of code, I’d use Python.
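
Quick filtering here might mean something like pulling out lines of a delimited file that meet a simple condition, for example (the field number and file name are made up):

    # print lines of a CSV whose third field is greater than 100
    awk -F, '$3 > 100' data.csv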

You can get an idea of the style of the book by reading the author’s blog post Famous Awk One-Liners Explained, Part I: File Spacing, Numbering and Calculations.

* * *

If you’d like to learn the basics of sed and awk by receiving one tip per day, you can follow @SedAwkTip on Twitter.