Working with wide text files at the command line

Suppose you have a data file with obnoxiously long lines and you’d like to preview it from the command line. For example, the other day I downloaded some data from the American Community Survey and wanted to see what the files contained. I ran something like

    head data.csv

to look at the first few lines of the file and got this back:

[Screen shot: data scrolling past, lines too wide to read]

That was not at all helpful. The part I was interested in was at the beginning, but that part scrolled off the screen quickly. To see just how wide the lines are, I ran

    head -n 1 data.csv | wc

and found that the first line of the file is 4822 characters long.

How can you see just the first part of long lines? Use the cut command. It comes with Linux systems and you can download it for Windows as part of GOW.

You can see the first 30 characters of the first few lines by piping the output of head to cut.

    head data.csv | cut -c -30

This shows

    "GEO_ID","NAME","DP05_0001E","
    "id","Geographic Area Name","E
    "8600000US01379","ZCTA5 01379"
    "8600000US01440","ZCTA5 01440"
    "8600000US01505","ZCTA5 01505"
    "8600000US01524","ZCTA5 01524"
    "8600000US01529","ZCTA5 01529"
    "8600000US01583","ZCTA5 01583"
    "8600000US01588","ZCTA5 01588"
    "8600000US01609","ZCTA5 01609"

which is much more useful. The syntax -30 says to show up to the 30th character. You could do the opposite with 30- to show everything starting with the 30th character. And you can show a range, such as 20-30 to show the 20th through 30th characters.
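
For example, to see just the 20th through 30th characters of each of the first few lines, you could run

    head data.csv | cut -c 20-30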

You can also use cut to pick out fields with the -f option. The default delimiter is tab, but our file is delimited with commas so we need to add -d, to tell it to split fields on commas.

We could see just the second column of data, for example, with

    head data.csv | cut -d, -f 2

This produces

    "NAME"
    "Geographic Area Name"
    "ZCTA5 01379"
    "ZCTA5 01440"
    "ZCTA5 01505"
    "ZCTA5 01524"
    "ZCTA5 01529"
    "ZCTA5 01583"
    "ZCTA5 01588"
    "ZCTA5 01609"

You can also specify a range of fields, say by replacing 2 with 3-4 to see the third and fourth columns.
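
For example,

    head data.csv | cut -d, -f 3-4

shows the third and fourth columns of the first few lines.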

The humble cut command is a good one to have in your toolbox.


Random sampling from a file

I recently learned about the Linux command line utility shuf from browsing The Art of Command Line. This could be useful for random sampling.

Given just a file name, shuf randomly permutes the lines of the file.

With the option -n you can specify how many lines to return. So it’s doing sampling without replacement. For example,

    shuf -n 10 foo.txt

would select 10 lines from foo.txt.

Actually, it would select at most 10 lines. You can’t select 10 lines without replacement from a file with fewer than 10 lines. If you ask for more lines than the file contains, the -n option is effectively ignored and you get every line.
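
As a quick illustration, if foo.txt contains only three lines, then

    shuf -n 10 foo.txt

prints just those three lines, in random order.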

You can also sample with replacement using the -r option. In that case you can select more lines than are in the file since lines may be reused. For example, you could run

    shuf -r -n 10 foo.txt

to select 10 lines drawn with replacement from foo.txt, regardless of how many lines foo.txt has. For example, when I ran the command above on a file containing

    alpha
    beta
    gamma

I got the output

    beta
    gamma
    gamma
    beta
    alpha
    alpha
    gamma
    gamma
    beta

I don’t know how shuf seeds its random generator. Maybe from the system time. But if you run it twice you will get different results. Probably.


The hard part in becoming a command line wizard

I’ve long been impressed by shell one-liners. They seem like magical incantations. Pipe a few terse commands together, et voilà! Out pops the solution to a problem that would seem to require pages of code.

[Dilbert comic. Source: http://dilbert.com/strip/1995-06-24]

Are these one-liners real or mythology? To some extent, they’re both. Below I’ll give a famous real example. Then I’ll argue that even though such examples do occur, they may create unrealistic expectations.

Bentley’s exercise

In 1986, Jon Bentley posted the following exercise:

Given a text file and an integer k, print the k most common words in the file (and the number of their occurrences) in decreasing frequency.

Donald Knuth wrote an elegant program in response. Knuth’s program runs for 17 pages in his book Literate Programming.

Doug McIlroy responded with a shell pipeline short enough to quote in full below [1].

    tr -cs A-Za-z '
    ' |
    tr A-Z a-z |
    sort |
    uniq -c |
    sort -rn |
    sed ${1}q
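
The final command, sed ${1}q, quits after printing line k, where k is the script’s first argument, so only the top k lines of the frequency-sorted list appear. To try it out, you could save the pipeline as a shell script and feed it text on standard input; the file names below are hypothetical.

    sh wordfreq.sh 10 < book.txt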

McIlroy’s response to Knuth was like Abraham Lincoln’s response to Edward Everett at Gettysburg. Lincoln’s famous address was 50x shorter than that of the orator who preceded him [2]. (Update: There’s more to the story. See [3].)

Knuth and McIlroy had very different objectives and placed different constraints on themselves, and so their solutions are not directly comparable. But McIlroy’s solution has become famous. Knuth’s solution is remembered, if at all, as the verbose program that McIlroy responded to.

The stereotype of a Unix wizard is someone who could improvise programs like the one above. Maybe McIlroy carefully thought about his program for days, looking for the most elegant solution. That would seem plausible, but in fact he says the script was “written on the spot and worked on the first try.” He said that the script was similar to one he had written a year before, but it still counts as an improvisation.

Why can’t I write scripts like that?

McIlroy’s script was a real example of the kind of wizardry attributed to Unix adepts. Why can’t more people quickly improvise scripts like that?

The exercise that Bentley posed was the kind of problem that programmers like McIlroy solved routinely at the time. The tools he piped together were developed precisely for such problems. McIlroy didn’t see his solution as extraordinary but said “Old UNIX hands know instinctively how to solve this one in a jiffy.”

The traditional Unix toolbox is full of utilities for text manipulation. Not only are they useful, but they compose well. This composability depends not only on the tools themselves, but also on the shell environment they were designed to operate in. (The latter is why some utilities don’t work as well when ported to other operating systems, even if the functionality is duplicated.)

Bentley’s exercise was clearly text-based: given a text file, produce a text file. What about problems that are not text manipulation? The trick to being productive from a command line is to turn problems into text manipulation problems.  The output of a shell command is text. Programs are text. Once you get into the necessary mindset, everything is text. This may not be the most efficient approach to a given problem, but it’s a possible strategy.

The hard part

The hard part on the path to becoming a command line wizard, or any kind of wizard, is thinking about how to apply existing tools to your particular problems. You could memorize McIlroy’s script and be prepared next time you need to report word frequencies, but applying the spirit of his script to your particular problems takes work. Reading one-liners that other people have developed for their work may be inspiring, or intimidating, but they’re no substitute for thinking hard about your particular work.

Repetition

You get faster at anything with repetition. Maybe you don’t solve any particular kind of problem often enough to be fluent at solving it. If someone can solve a problem by quickly typing a one-liner in a shell, maybe they are clever, or maybe their job is repetitive. Or maybe both: maybe they’ve found a way to make semi-repetitive tasks repetitive enough to automate. One way to become more productive is to split semi-repetitive tasks into more creative and more repetitive parts.


[1] The odd-looking line break is a quoted newline.

[2] Everett’s speech contained 13,607 words while Lincoln’s Gettysburg Address contained 272, a ratio of almost exactly 50 to 1.

[3] See Hillel Wayne’s post Donald Knuth was Framed. Here’s an excerpt:

Most of the “eight pages” aren’t because Knuth is doing LP [literate programming], but because he’s Donald Knuth:

  • One page is him setting up the problem (“what do we mean by ‘word’? What if multiple words share the same frequency?”) and one page is just the index.
  • Another page is just about working around specific Pascal issues no modern language has, like “how do we read in an integer” and “how do we identify letters when Pascal’s character set is poorly defined.”
  • Then there’s almost four pages of handrolling a hash trie.

The “eight pages” refers to the length of the original publication. I described the program as 17 pages because that is the length in the book where I found it.

Windows command line tips

I use Windows, Mac, and Linux, each for different reasons. When I run Windows, I like to have a shell that works sorta like bash, but doesn’t run in a subsystem. That is, I like to have the utility programs and command editing features that I’m used to from bash on Mac or Linux, but I want to run native Windows code and not a little version of Linux hosted by Windows. [1]

It took a while to find something I’m happy with. It’s easier to find Linux subsystems like Cygwin. But cmder does exactly what I want. It’s a bash-like shell running on Windows. I also use Windows ports of common Unix utilities. Since these are native Windows programs, I can run them and other Windows applications in the same environment. No error messages along the lines of “I know it looks like you’re running Windows, but you’re not really. So you can’t open that Word document.”

I’ve gotten Unix-like utilities for Windows from several sources. GOW (Gnu on Windows) is one source. I’ve also collected utilities from other miscellaneous sources.

Tab completion and file association

There’s one thing that was a little annoying about cmder: tab completion doesn’t work if you want to enter a file name. For example, if you want to open a Word document foo.docx from the basic Windows command prompt cmd.exe, you can type fo followed by the tab key and the file will open if foo.docx is the first file in your working directory that begins with “fo.”

In cmder, tab completion works for programs first, and then for file names. If you type in fo followed by the tab key, cmder will look for an application whose name starts with “fo.” If you wanted to move foo.docx somewhere, you could type mv fo and tab. In this context, cmder knows that “fo” is the beginning of a file name.

On Mac, you use the open command to open a file with its associated application. For example, on the Mac command line you’d type open foo.docx rather than just foo.docx to open the file with Word.

If there were something like open on Windows, then tab completion would work in cmder. And there is! It’s the start command. In fact, if you’re accustomed to using open on Mac, you can alias open to start on Windows [2]. So in cmder, you can type start fo and hit tab, and get tab completion for the file name.
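
For example, typing

    start foo.docx

opens the document in whatever application is associated with .docx files, just as open foo.docx would on a Mac.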

Miscellaneous

The command assoc shows you which application is associated with a file extension. (Include the “.” when using this command, so you’d type assoc .docx rather than assoc docx.)

You can direct Windows shell output to the clip command to put the output onto the Windows clipboard.
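
For example,

    dir | clip

copies the current directory listing to the clipboard, ready to paste elsewhere.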

The control command opens the Windows control panel.

This post shows how to have branching logic in an Emacs config file so you can use the same config file across operating systems.


[1] Usually on Windows I want to run Windows. But if I do want to run Linux without having to go to another machine, I use WSL (Windows Subsystem for Linux) and I can use it from cmder. Since cmder supports multiple tabs, I can have one tab running ordinary cmd.exe and another tab running bash on WSL.

[2] In the directory containing cmder.exe, edit the file config/user-aliases.cmd. Add a line open=start $1.

Perl as a better grep

I like Perl’s pattern matching features more than Perl as a programming language. I’d like to take advantage of the former without having to go any deeper than necessary into the latter.

The book Minimal Perl is useful in this regard. It has chapters on Perl as a better grep, a better awk, a better sed, and a better find. While Perl is not easy to learn, it might be easier to learn a minimal subset of Perl than to learn each of the separate utilities it could potentially replace. I wrote about this a few years ago and have been thinking about it again recently.

Here I want to zoom in on Perl as a better grep. What’s the minimum Perl you need to know in order to use Perl to search files the way grep would?

By using Perl as your grep, you get to use Perl’s more extensive pattern matching features. Also, you get to use one regex syntax rather than wondering about the specifics of numerous regex dialects supported across various programs.

Let RE stand for a generic regular expression. To search a file foo.txt for lines containing the pattern RE, you could type

    perl -ln -e "/RE/ and print;" foo.txt

The Perl one-liner above requires more typing than using grep would, but you could wrap this code in a shell script if you’d like.
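
Here is a minimal sketch of such a wrapper, assuming the pattern contains no slashes or double quotes; the script name perlgrep.sh is hypothetical.

    #!/bin/sh
    # perlgrep.sh RE [file ...] -- hypothetical grep-like wrapper
    re="$1"; shift
    perl -ln -e "/$re/ and print;" "$@"

With no file arguments it reads standard input, just as grep does.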

If you’d like to print lines that don’t match a regex, change the and to or:

    perl -ln -e "/RE/ or print;" foo.txt

By learning just a little Perl you can customize your search results. For example, if you’d like to just print the part of the line that matched the regex, not the entire line, you could modify the code above to

    perl -ln -e "/RE/ and print $&;" foo.txt

because $& is a special variable that holds the result of the latest match.
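
For example, to print just the five-digit numbers appearing in a file (the pattern and file name here are hypothetical):

    perl -ln -e "/\d{5}/ and print $&;" addresses.txt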

Update: If you’d like to use Perl regular expressions but you’d rather not write Perl code, you might like tcgrep. It uses Perl regular expressions but has an interface like grep.

Improving on the Unix shell

Yesterday I ran across Askar Safin’s blog post The Collapse of the UNIX Philosophy. Two quotes from the post stood out. One was from Rob Pike about the Unix ideal of little tools that each do one job:

Those days are dead and gone and the eulogy was delivered by Perl.

The other was a line from James Hague:

… if you romanticize Unix, if you view it as a thing of perfection, then you lose your ability to imagine better alternatives and become blind to potentially dramatic shifts in thinking.

This brings up something I’ve long wondered about: What did the Unix shell get right that has made it so hard to improve on? It has some truly awful quirks, and yet people keep coming back to it. Alternatives that seem more rational don’t work so well in practice. Maybe it’s just inertia, but I don’t think so. There are other technologies from the 1970s that had inertia behind them but have been replaced. The Unix shell got something so right that it’s worth tolerating its flaws. Maybe some of the flaws aren’t even flaws but features that serve some purpose that isn’t obvious.

(By the way, when I say “the Unix shell” I have in mind similar environments as well, such as the Windows command line.)

On a related note, I’ve wondered why programming languages and shells work so differently. We want different things from a programming language and from a shell or REPL. Attempts to bring a programming language and shell closer together sound great, but they inevitably run into obstacles. At some point, we have different expectations of languages and shells and don’t want the two to be too similar.

Anthony Scopatz and I discussed this in an interview a while back in the context of xonsh, “a Python-powered, cross-platform, Unix-gazing shell language and command prompt.” While writing this post I went back to reread Anthony’s comments and appreciate them more now than I did then.

Maybe the Unix shell is near a local optimum. It’s hard to make much improvement without making big changes. As Anthony said, “you quickly end up where many traditional computer science people are not willing to go.”


Anthony Scopatz on xonsh and shells in general

Anthony Scopatz did an interview for Podcast.__init__ recently talking about xonsh, a command shell that blends Python and some traditions from bash. One line from the interview jumped out at me:

… thinking very critically about what shells get used for and what they’re actually good at and what they’re not good at.

I’ve wondered about this but never reached any satisfying conclusions. I was curious to hear Anthony’s ideas, so I asked him for another interview. (I interviewed Anthony and his co-author Katy Huff regarding their book Effective Computation in Physics.)

* * *

JC: If your shell speaks your programming language, then what else does it need to do?

AS: It’s an interesting question. People have tried to use Python as a shell for years and years, and they came up with a bunch of different potential solutions, but none of them quite worked because the language wasn’t built around that idea. It ended up being more verbose than people want from a shell. The main purpose of the shell, in my opinion, is to run other code and to glue things together. Python does that really well for libraries and functions, but it doesn’t do that so well for executables. Bash deals with executables really well, but it’s terrible for dealing with even simple conditional logic. Like a lot of people, I wanted something that would do all these things simultaneously and do them all well. But you quickly end up where many traditional computer science people are not willing to go: context-sensitive parsing. It’s something they teach you to be afraid of in school.

JC: But you do it all the time. How can you get away from it?

AS: You can’t, but people want to avoid it in their core languages. The major programming languages keep it out. You’ll find it quarantined to domain-specific languages where the damage is small.

JC: So you have something in mind like Perl? There the behavior of a function can depend entirely on whether it’s being used in a scalar context or an array context.

AS: That’s right. Perl does some of this. The language Forth is completely built around this. It’s all context-sensitive.

You brought up something interesting [in a previous email] about the overlap between shells and editors. Those things are completely separate in my mind, but for a lot of people they get merged very quickly. For instance, Emacs has the ability to run a shell inside the editor, and people use that all the time.

JC: The way I work is that I start something at the command line, then it gets a little complicated, and I switch over to writing a script and regret not having done that sooner. I especially do that with something like R. This is just going to be a few quick calculations, so I’ll do it right from the REPL. Then things get more complicated …

AS: IPython sorta has that too, the old IPython readline shell. You just wanted to do something simple that bash couldn’t do quickly or easily, so you open up the IPython command line. Inevitably it ends up taking more lines than you wanted it to.  That is part of why the Jupyter notebook is so great.

JC: One thing I noticed about PowerShell was that system administrators were ecstatic when it came out and would say how much they loved the command line. Then Microsoft put out this ISE, sort of an IDE for PowerShell, and everyone moved there. So they’re not really using the command line anymore. They’re excited about PowerShell as a programming language, not as an interactive shell per se.

In Bruce Payette’s PowerShell book he fields questions asking why PowerShell did something some way they find odd and his answer is always “Because it’s a shell.”

AS: Do you have any examples?

JC: For example, functions don’t use parentheses around their arguments or commas between their arguments because that’s not what people expect from a shell. You expect to type something like ls, not ls() with parentheses at the end. There were more subtle examples than this, but they’re not fresh on my mind.

AS: That’s where I think that tools like Python plumbum are lacking. It’s an all-Python environment, so you have to use Python syntax even when it’s cumbersome. It prevents you from having to import subprocess and worry about that all the time, but it doesn’t do much more than that.

JC: When you were writing xonsh, were there times you wished you could change the Python language? Or things you’d do differently in the shell if you weren’t aiming for 100% Python compatibility?

AS: That’s interesting. Python is deceptively simple. It has a lot of little pieces to it. It’s very natural and intuitive to use, but re-implementing the parser for Python was more work than I expected. There are a lot of little gotchas in the parser. I spent a lot of time on tuples and function argument grouping. The way they’re handled looks very similar but they’re handled completely differently for no reason that’s readily apparent.

There’s also this ambiguity between Python commands and shell commands if you’re trying to do both simultaneously, and that’s frustrating. That’s the hard part, figuring out when you’re in a subprocess and when you’re in Python mode.

JC: It’s hard for you as an implementer, but hopefully users can be blissfully ignorant of the issues and it just does what they expect.

I guess you’re walking a fine line, because as soon as you say you want the shell to infer what people mean, you start getting into the kinds of complications you have in Perl where things depend so heavily on context, and that sort of thing is contrary to the spirit of Python.

AS:  Yeah, exactly! After going through this exercise, there is one thing I’d like to change about Python. Python is white space-sensitive at the beginning of a line, but not after the first non-white space character. For example, you can put as many spaces around a binary operator as you like, or none at all. That’s really, really frustrating. If you enforced PEP 8, requiring exactly one white space around every binary operator, you’d be able to resolve these currently ambiguous cases between subprocess mode and Python mode very naturally. But I can’t imagine a world in which people would agree to this.

JC: What shell would you use if you weren’t using xonsh?

AS: I probably would use bash. Fish is really nice in some ways, and things like zsh have nice features too. What I used to do is go back and forth between working in an IPython shell and a bash shell, and between those two I could pretty much get the job done.

JC: Do you use Emacs?

AS: No, I don’t use Emacs or Vim or any of those editors. I use an editor I wrote, kinda like nano. I’ve used Emacs and Vim, but they got in my way too much, so I wanted something else. This is sort of the same thing as xonsh; I want my tools to get out of my way. I want the barrier to entry to doing what I want to be basically zero. You can spend years and years becoming a master of some of these tools and then you’re really effective, but I want to just open up the editor and start typing text. The same thing with the shell. I just want to open it up and get to work and not have to keep going back to the documentation.

Unix-like shells on Windows

This post gives some notes on ways to create a Unix-like command line experience on Windows, without using a virtual machine like VMWare or a quasi-virtual machine like Cygwin.

Finding Windows ports of Unix utilities is easy. The harder part is finding a shell that behaves as expected. (Of course “as expected” depends on your expectations!)

There have been many projects to port Unix utilities to Windows, particularly GnuWin32 and Gow. Some of the command shells I’ve tried are:

  • Cmd
  • PowerShell
  • Eshell
  • Bash
  • Clink

I’d recommend the combination of Gow and Clink for most people. If you’re an Emacs power user you might like Eshell.

Cmd

The built-in command line on Windows is cmd. It’s sometimes called the “DOS prompt” though that’s misleading. DOS died two decades ago and the cmd shell has improved quite a bit since then.

cmd has some features you might not expect, such as pushd and popd. However, I don’t believe it has anything analogous to dirs to let you see the directory stack.

PowerShell

PowerShell is a very sophisticated scripting environment, but the interactive shell itself (e.g. command editing functionality) is basically cmd. (I haven’t kept up with PowerShell and that may have changed.) This means that writing a PowerShell script is completely different from writing a batch file, but the experience of navigating the command line is essentially the same as cmd.

Eshell

You can run shells inside Emacs. By default, M-x shell brings up a cmd prompt inside an Emacs buffer. You can also use Emacs’ own shell with the command M-x eshell.

Eshell is a shell implemented in Emacs Lisp. Using Eshell is very similar across platforms. On a fresh Windows machine, with nothing like Gow installed, Eshell provides some of the most common Unix utilities. You can use the which command to see whether you’re using a native executable or Emacs Lisp code. For example, if you type which ls into Eshell, you get the response

    eshell/ls is a compiled Lisp function in `em-ls.el'

The primary benefit of Eshell is that it provides integration with Emacs. As the documentation says:

Eshell is not a replacement for system shells such as bash or zsh. Use Eshell when you want to move text between Emacs and external processes …

Eshell does not provide some of the command editing features you might expect from bash. But the reason for this is clear: if you’re inside Emacs, you’d want to use the full power of Emacs editing, not the stripped-down editing features of a command line. For example, you cannot use ^foo^bar to replace foo with bar in the previous command. Instead, you could retrieve the previous command and edit it just as you edit any other line inside Emacs.

In bash you can use !^ to recall the first argument of the previous command and !$ to recall the last; in Eshell you can get the last argument using $_ instead. Many of the other bash shortcuts that begin with ! work as expected: !foo, !!, !-3, etc. Directory navigation commands like cd -, pushd, and popd work as in bash.
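
For example, in bash (the file names here are hypothetical):

    mv notes.txt backup/notes.txt
    cat !$        # expands to: cat backup/notes.txt

In Eshell you would type cat $_ for the same effect.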

Bash

Gow comes with a bash shell, a Windows command line program that creates a bash-like environment. I haven’t had much experience with it, but it seems to be a faithful bash implementation with few compromises for Windows, for better and for worse. For example, it doesn’t understand backslashes as directory separators.

There are other implementations of bash on Windows, but I either haven’t tried them (e.g. win-bash) or have had bad experience with them (e.g. Cygwin).

Clink

Clink is not a shell per se but an extension to cmd. It adds the functionality of the Gnu readline library to the Windows command line and so you can use all the Emacs-like editing commands that you can with bash: Control-a to move to the beginning of a line, Control-k to delete the rest of a line, etc.

Clink also gives you Windows-like behavior that Windows itself doesn’t provide, such as being able to paste text onto the command line with Control-v.

I’ve heard that Clink will work with PowerShell, but I was not able to make it work.

The command editing and history shortcuts beginning with ! mentioned above all work with Clink, as do substitutions like ^foo^bar.

Conclusion

In my opinion, the combination of Gow and Clink gives a good compromise between a Windows and Unix work environment. And if you’re running Windows, a compromise is probably what you want. Otherwise, simply run a (possibly virtual) Linux machine. Attempts to make Windows too Unix-like run down an uncanny valley where it’s easy to waste a lot of time.

Bringing bash and PowerShell a little closer together

I recently ran across PSReadLine, a project that makes the PowerShell console act more like a bash shell. I’ve just started using it, but it seems promising. I’m switching between Linux and Windows frequently these days and it’s nice to have a little more in common between the two.

I’d rather write a PowerShell script than a bash script, but I’d rather use the bash console interactively. The PowerShell console is essentially the old cmd.exe console. (I haven’t kept up with PowerShell in a while, so maybe there have been some improvements, but it’s my impression that the scripting language has moved forward and the console has not.) PSReadLine adds some bash-like console conveniences such as Emacs-like editing at the command prompt.

Update: Thanks to Will for pointing out Clink in the comments. Clink sounds like it may be even better than PSReadLine.


Shell != REPL

A shell is not the same as a REPL (Read Evaluate Print Loop). They look similar, but they have deep differences.

Shells are designed for one-line commands, and they’re a little awkward when used as programming languages.

Scripting languages are designed for files of commands, and they’re a little awkward to use from a REPL.

IPython is an interesting hybrid. You could think of it as a Python REPL with shell-like features added. Eshell is another interesting compromise, a shell implemented in Emacs Lisp that also works as a Lisp REPL. These hybrids are evidence that as much as people like their programming languages, they appreciate additions to a pure language REPL.