# Computing pi with bc

I wanted to stress test the `bc` calculator a little and so I calculated π to 10,000 digits a couple different ways.

First I ran

`    time bc -l <<< "scale=10000;4*a(1)"`

which calculates π as 4 arctan(1). This took 2 minutes and 38 seconds.

I imagine `bc` is using some sort of power series to compute arctan, and so smaller arguments should converge faster. So next I used a formula due to John Machin (c. 1686–1751).

`    time bc -l <<< "scale=10000;16*a(1/5) - 4*a(1/239)"`

This took 52 seconds.

Both results were correct to 9,998 decimal places.
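You can check that the two formulas agree without waiting on 10,000 digits. Here's a quick sketch at scale 30, comparing all but the last few digits since the trailing digits may differ by truncation error:

```shell
a=$(bc -l <<< "scale=30; 4*a(1)")
b=$(bc -l <<< "scale=30; 16*a(1/5) - 4*a(1/239)")
# Compare all but the last three digits, which may be off due to truncation
[ "${a%???}" = "${b%???}" ] && echo "agree"
```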

When you set the `scale` variable to `n`, `bc` doesn’t just carry calculations out to `n` decimal places; it uses more digits internally and tries to deliver `n` correct decimal places in the final result.

## Why bc

This quirky little calculator is growing on me. For one thing, I like its limitations. If I need to do something that isn’t easy to do with bc, that probably means that I should write a script rather than trying to work directly at the command line.

Another thing I like about it is that it launches instantly. It doesn’t give you a command prompt, and so if you launch it in quiet mode you could think that it’s still loading when in fact it’s waiting on you. And if you send `bc` code with a here-string as in the examples above, you don’t even have to launch it per se.

If you want to try `bc`, I’d recommend launching it with the options `-lq`. You might even want to alias `bc` to `bc -lq`. The `-l` option loads the standard math library. You’d think that would be the default for a calculator, but `bc` was written in a more resource-constrained time when you didn’t load much by default. The `-l` option also sets `scale` to 20, i.e. you get twenty decimal places of precision; the default is zero!

The `-q` option isn’t necessary, but it starts `bc` in quiet mode, suppressing three lines of copyright and warranty announcements.

As part of its minimalist design, `bc` includes only a few math functions, and you have to bootstrap the rest. For example, it includes sine and cosine but not tangent, so you have to build the missing functions out of the ones it provides.

# Splitting lines and numbering the pieces

As I mentioned in my computational survivalist post, I’m working on a project where I have a dedicated computer with little more than basic Unix tools, ported to Windows. It’s given me new appreciation for how the standard Unix tools fit together; I’ve had to rely on them for tasks I’d usually do a different way.

I’d seen the `nl` command before for numbering lines, but I thought “Why would you ever want to do that? If you want to see line numbers, use your editor.” That way of thinking looks at the tools one at a time, asking what each can do, rather than thinking about how they might work together.

Today, for the first time ever, I wanted to number lines from the command line. I had a delimited text file and wanted to see a numbered list of the column headings. I’ve written before about how you can extract columns using cut, but you have to know the number of a column to select it. So it would be nice to see a numbered list of column headings.

The data I’m working on is proprietary, so I downloaded a PUMS (Public Use Microdata Sample) file named `ss04hak.csv` from the US Census to illustrate instead. The first line of this file is

`RT,SERIALNO,DIVISION,MSACMSA,PMSA,PUMA,REGION,ST,ADJUST,WGTP,NP,TYPE,ACR,AGS,BDS,BLD,BUS,CONP,ELEP,FULP,GASP,HFL,INSP,KIT,MHP,MRGI,MRGP,MRGT,MRGX,PLM,RMS,RNTM,RNTP,SMP,TEL,TEN,VACS,VAL,VEH,WATP,YBL,FES,FINCP,FPARC,FSP,GRNTP,GRPIP,HHL,HHT,HINCP,HUPAC,LNGI,MV,NOC,NPF,NRC,OCPIP,PSF,R18,R65,SMOCP,SMX,SRNT,SVAL,TAXP,WIF,WKEXREL,FACRP,FAGSP,FBDSP,FBLDP,FBUSP,FCONP,FELEP,FFSP,FFULP,FGASP,FHFLP,FINSP,FKITP,FMHP,FMRGIP,FMRGP,FMRGTP,FMRGXP,FMVYP,FPLMP,FRMSP,FRNTMP,FRNTP,FSMP,FSMXHP,FSMXSP,FTAXP,FTELP,FTENP,FVACSP,FVALP,FVEHP,FWATP,FYBLP`

I want to grab the first line of this file, replace commas with newlines, and number the results. That’s what the following one-liner does.

`    head -n 1 ss04hak.csv | sed "s/,/\n/g" | nl`

The output looks like this:

```
     1  RT
     2  SERIALNO
     3  DIVISION
     4  MSACMSA
     5  PMSA
   ...
   100  FWATP
   101  FYBLP
```

Now if I wanted to look at a particular field, I could see the column number without putting my finger on my screen and counting. Then I could use that column number as an argument to `cut -f`.
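Here's the whole workflow on a tiny made-up file (the file name and columns are hypothetical):

```shell
printf 'name,age,city\nalice,34,denver\nbob,27,austin\n' > /tmp/demo.csv

# Number the column headings
head -n 1 /tmp/demo.csv | sed "s/,/\n/g" | nl
# nl reports that "city" is field 3, so:
cut -d, -f3 /tmp/demo.csv
```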

# File character counts

Once in a while I need to know what characters are in a file and how often each appears. One reason I might do this is to look for statistical anomalies. Another reason might be to see whether a file has any characters it’s not supposed to have, which is often the case.

A few days ago Fatih Karakurt left an elegant solution to this problem in a comment:

`    fold -w1 file | sort | uniq -c`

The `fold` command breaks the content of a file into lines 80 characters long by default, but you can specify the line width with the `-w` option. Setting that to 1 makes each character its own line. Then `sort` prepares the input for `uniq`, and the `-c` option causes `uniq` to display counts.
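For example, on a short string (the input is arbitrary):

```shell
printf 'abracadabra\n' | fold -w1 | sort | uniq -c | sort -rn
```

This reports that a appears five times, b and r twice each, and c and d once.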

This works on ASCII files but not Unicode files: `fold` counts bytes, so it will split multi-byte characters. For a Unicode file, you might do something like the following Python code.

```
import collections

count = collections.Counter()
with open("myfile", "r", encoding="utf8") as f:
    for line in f:
        for c in line.strip("\n"):
            count[ord(c)] += 1

for p in sorted(count):
    print(chr(p), hex(p), count[p])
```

# Computational survivalist

Some programmers and systems engineers try to do everything they can with basic command line tools on the grounds that someday they may be in an environment where that’s all they have. I think of this as a sort of computational survivalism.

I’m not much of a computational survivalist, but I’ve come to appreciate such a perspective. It’s an efficiency/robustness trade-off, and in general I’ve come to appreciate the robustness side of such trade-offs more over time. It especially makes sense for consultants who find themselves working on someone else’s computer with no ability to install software. I’m not often in that position, but that’s kinda where I am on one project.

## Example

I’m working on a project where all my work has to be done on the client’s laptop, and the laptop is locked down for security. I can’t install anything. I can request to have software installed, but it takes a long time to get approval. It’s a Windows box, and I requested a set of ports of basic Unix utilities at the beginning of the project, not knowing what I might need them for. That has turned out to be a fortunate choice on several occasions.

For example, today I needed to count how many times certain characters appear in a large text file. My first instinct was to write a Python script, but I don’t have Python. My next idea was to use `grep -c`, but that would count the number of lines containing a given character, not the number of occurrences of the character per se.

I did a quick search and found a Stack Overflow question “How can I use the UNIX shell to count the number of times a letter appears in a text file?” On the nose! The top answer said to use `grep -o` and pipe it to `wc -l`.

The `-o` option tells `grep` to output the regex matches, one per line. So counting the number of lines with `wc -l` gives the number of matches.
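Here's the difference on a small example (the file name is hypothetical):

```shell
printf 'banana\nbread\n' > /tmp/words.txt
grep -c a /tmp/words.txt          # lines containing 'a': 2
grep -o a /tmp/words.txt | wc -l  # occurrences of 'a': 4
```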

## Computational minimalism

Computational minimalism is a variation on computational survivalism. Computational minimalists limit themselves to a small set of tools, maybe the same set of tools as a computational survivalist, but for different reasons.

I’m more sympathetic to minimalism than survivalism. You can be more productive by learning to use a small set of tools well than by hacking away with a large set of tools you hardly know how to use. I use a lot of different applications, but not as many as I once used.

# Quiet mode

When you start a programming language like Python or R from the command line, you get a lot of initial text that you probably don’t read. For example, you might see something like this when you start Python.

```
Python 2.7.6 (default, Nov 23 2017, 15:49:48)
[GCC 4.8.4] on linux2
```

The version number is a good reminder. I’m used to the command `python` bringing up Python 3+, so seeing the text above would remind me that on that computer I need to type `python3` rather than simply `python`.

But if you’re working at the command line and jumping over to Python for a quick calculation, the start up verbiage separates your previous work from your current work by a few lines. This isn’t such a big deal with Python, but it is with R:

```
R version 3.6.1 (2019-07-05) -- "Action of the Toes"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
```

By the time you see all that, your previous work may have scrolled out of sight.

There’s a simple solution: use the option `-q` for quiet mode. Then you can jump in and out of your REPL with a minimum of ceremony and keep your previous work on screen.

For example, the following shows how you can use Python and bc without a lot of wasted vertical space.

```
> python -q
>>> 3+4
7
>>> quit()

> bc -q
3+4
7
quit

```

Python added the `-q` option in version 3, which the example above uses. Python 2 does not have an explicit quiet mode option, but Mike S points out a clever workaround in the comments. You can open a Python 2 REPL in quiet mode by using the following.

`    python -ic ""`

The combination of the `-i` and `-c` options tells Python to run the given script and then enter interactive mode. In this case the script is just the empty string, so Python does nothing but quietly enter the interpreter.

R has a quiet mode option, but by default R has the annoying habit of asking whether you want to save a workspace image when you quit.

```
> R.exe -q
> 3+4
[1] 7
> quit()
Save workspace image? [y/n/c]: n

```

I have never wanted R to save a workspace image; I just don’t work that way. I’d rather keep my state in scripts. So I alias `R` to launch with the `--no-save` option.

So if you launch R with `-q` and `--no-save` it takes up no more vertical space than Python or bc.

# Munging CSV files with standard Unix tools

This post briefly discusses working with CSV (comma separated value) files using command line tools that are usually available on any Unix-like system. This will raise two objections: why CSV and why dusty old tools?

## Why CSV?

In theory, and occasionally in practice, CSV can be a mess. But CSV is the de facto standard format for exchanging data. Some people like this, some lament this, but that’s the way it is.

A minor variation on comma-separated values is tab-separated values [1].

## Why standard utilities?

Why use standard Unix utilities? I’ll point out some of their quirks, which are arguments for using something else. But the assumption here is that you don’t want to use something else.

Maybe you already know the standard utilities and don’t think that learning more specialized tools is worth the effort.

Maybe you’re already at the command line and in a command line state of mind, and don’t want to interrupt your work flow by doing something else.

Maybe you’re on a computer where you don’t have the ability to install any software and so you need to work with what’s there.

Whatever your reasons, we’ll go with the assumption that we’re committed to using commands that have been around for decades.

## cut, sort, and awk

The tools I want to look at are `cut`, `sort`, and `awk`. I wrote about cut the other day and apparently the post struck a chord with some readers. This post is a follow-up to that one.

These three utilities are standard on Unix-like systems. You can also download them for Windows from GOW. The port of `sort` is named `gsort` in order not to conflict with the native Windows `sort` command. There’s no need to rename the other two utilities since they don’t have counterparts that ship with Windows.

The `sort` command is simple and useful. There are just a few options you’ll need to know about. The utility sorts fields as text by default, but the `-n` option tells it to sort numerically.

Since we’re talking about CSV files, you’ll need to know that `-t,` is the option to tell `sort` that fields are separated by commas rather than white space. And to specify which field to sort on, you give it the `-k` option.

The last utility, `awk`, is more than a utility. It’s a small programming language. But it works so well from the command line that you can almost think of it as a command line utility. It’s very common to pipe output to an awk program that’s only a few characters long.

You can get started quickly with `awk` by reading Greg Grothaus’ article “Why you should learn just a little Awk.”

## Inconsistencies

Now for the bad news: these programs are inconsistent in their options. The two most common things you’ll need to do when working with CSV files are to set the field delimiter to a comma and to specify which field you want to grab. Unfortunately this is done differently in every utility.

`cut` uses `-d` or `--delimiter` to specify the field delimiter and `-f` or `--fields` to specify fields. Makes sense.

`sort` uses `-t` or `--field-separator` to specify the field delimiter and `-k` or `--key` to specify the field. When you’re talking about sorting things, it’s common to call the fields keys, and so the way `sort` specifies fields makes sense in context. I see no reason for `-t` other than `-f` was already taken. (In sorting, you talk about folding upper case to lower case, so `-f` stands for fold.)

`awk` uses `-F` or `--field-separator` to specify the field delimiter. At least the verbose option is consistent with `sort`. Why `-F` for the short option instead of `-f`? The latter was already taken for file. To tell `awk` to read a program from a file rather than the command line you use the `-f` option.

`awk` handles fields differently than `cut` and `sort`. Because `awk` is a programming language designed to parse delimited text files, it exposes each field as a built-in variable: `$1` holds the content of the first field, `$2` the second, and so on.

The following compact table summarizes how you tell each utility that you’re working with comma-separated files and that you’re interested in the second field.

```
|------+-----+-----|
| cut  | -d, | -f2 |
| sort | -t, | -k2 |
| awk  | -F, | $2  |
|------+-----+-----|
```
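Putting the table to work on a small made-up file:

```shell
printf 'bob,12\nalice,7\ncarol,3\n' > /tmp/scores.csv

cut -d, -f2 /tmp/scores.csv           # extract the second field
sort -t, -k2 -n /tmp/scores.csv       # sort rows by the second field, numerically
awk -F, '{print $2}' /tmp/scores.csv  # extract the second field, awk style
```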

Some will object that the inconsistencies documented above are a good example of why you shouldn’t work with CSV files using `cut`, `sort`, and `awk`. You could use other command line utilities designed for working with CSV files. Or pull your CSV file into R or Pandas. Or import it somewhere to work with it in SQL. Etc.

The alternatives are all appropriate for different uses. The premise here is that in some circumstances, the inconsistencies cataloged above are a regrettable but acceptable price to pay to stay at the command line.

## Related

[1] Things get complicated if you have a CSV file and fields contain commas inside strings. Tab-separated files are more convenient in this case, unless, of course, your strings contain tabs. The utilities mentioned here all support tab as a delimiter by default.

# Working with wide text files at the command line

Suppose you have a data file with obnoxiously long lines and you’d like to preview it from the command line. For example, the other day I downloaded some data from the American Community Survey and wanted to see what the files contained. I ran something like

`    head data.csv`

to look at the first few lines of the file, and what came back was a screenful of wrapped text.

That was not at all helpful. The part I was interested in was at the beginning, but that part scrolled off the screen quickly. To see just how wide the lines are I ran

`    head -n 1 data.csv | wc`

and found that the first line of the file is 4822 characters long.

How can you see just the first part of long lines? Use the `cut` command. It comes with Linux systems and you can download it for Windows as part of GOW.

You can see the first 30 characters of the first few lines by piping the output of `head` to `cut`.

`    head data.csv | cut -c -30`

This shows

```
"GEO_ID","NAME","DP05_0001E","
"id","Geographic Area Name","E
"8600000US01379","ZCTA5 01379"
"8600000US01440","ZCTA5 01440"
"8600000US01505","ZCTA5 01505"
"8600000US01524","ZCTA5 01524"
"8600000US01529","ZCTA5 01529"
"8600000US01583","ZCTA5 01583"
"8600000US01588","ZCTA5 01588"
"8600000US01609","ZCTA5 01609"
```

which is much more useful. The syntax `-30` says to show up to the 30th character. You could do the opposite with `30-` to show everything starting with the 30th character. And you can show a range, such as `20-30` to show the 20th through 30th characters.
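The three forms side by side, on an arbitrary test string:

```shell
echo "abcdefghij" | cut -c -3   # abc
echo "abcdefghij" | cut -c 3-   # cdefghij
echo "abcdefghij" | cut -c 3-5  # cde
```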

You can also use `cut` to pick out fields with the `-f` option. The default delimiter is tab, but our file is delimited with commas so we need to add `-d,` to tell it to split fields on commas.

We could see just the second column of data, for example, with

`    head data.csv | cut -d, -f 2`

This produces

```
"NAME"
"Geographic Area Name"
"ZCTA5 01379"
"ZCTA5 01440"
"ZCTA5 01505"
"ZCTA5 01524"
"ZCTA5 01529"
"ZCTA5 01583"
"ZCTA5 01588"
"ZCTA5 01609"
```

You can also specify a range of fields, say by replacing 2 with 3-4 to see the third and fourth columns.

The humble `cut` command is a good one to have in your toolbox.

# Random sampling from a file

I recently learned about the Linux command line utility `shuf` from browsing The Art of Command Line. This could be useful for random sampling.

Given just a file name, `shuf` randomly permutes the lines of the file.

With the option `-n` you can specify how many lines to return. So it’s doing sampling without replacement. For example,

`    shuf -n 10 foo.txt`

would select 10 lines from `foo.txt`.

Actually, it would select at most 10 lines. You can’t select 10 lines without replacement from a file with fewer than 10 lines. If you ask for more lines than the file contains, you simply get all of the lines, in random order.
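A quick check on a three-line file (the file name is hypothetical):

```shell
printf 'alpha\nbeta\ngamma\n' > /tmp/three.txt
shuf -n 10 /tmp/three.txt | wc -l   # prints 3: every line, once, in random order
```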

You can also sample with replacement using the `-r` option. In that case you can select more lines than are in the file since lines may be reused. For example, you could run

`    shuf -r -n 10 foo.txt`

to select 10 lines drawn with replacement from `foo.txt`, regardless of how many lines `foo.txt` has. For example, when I ran the command above on a file containing

```
alpha
beta
gamma
```

I got the output

```
beta
gamma
gamma
beta
alpha
alpha
gamma
gamma
beta
```

I don’t know how `shuf` seeds its random generator. Maybe from the system time. But if you run it twice you will get different results. Probably.
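If you need reproducible output, GNU `shuf` has a `--random-source` option: point it at any fixed byte stream and the permutation becomes deterministic. A sketch:

```shell
printf 'alpha\nbeta\ngamma\n' > /tmp/three.txt
# /dev/zero supplies the same bytes every time, so both runs agree
shuf --random-source=/dev/zero /tmp/three.txt
shuf --random-source=/dev/zero /tmp/three.txt
```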

# The hard part in becoming a command line wizard

I’ve long been impressed by shell one-liners. They seem like magical incantations. Pipe a few terse commands together, et voilà! Out pops the solution to a problem that would seem to require pages of code.

Are these one-liners real or mythology? To some extent, they’re both. Below I’ll give a famous real example. Then I’ll argue that even though such examples do occur, they may create unrealistic expectations.

## Bentley’s exercise

In 1986, Jon Bentley posted the following exercise:

> Given a text file and an integer k, print the k most common words in the file (and the number of their occurrences) in decreasing frequency.

Donald Knuth wrote an elegant program in response. Knuth’s program runs for 17 pages in his book Literate Programming.

Doug McIlroy responded with a shell pipeline short enough to quote in full [1].

```
tr -cs A-Za-z '
' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q
```
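To try the pipeline yourself, write the quoted newline as `'\n'` and replace the `${1}` parameter with a literal count. For example, the two most common words of a test sentence:

```shell
printf 'the cat and the dog and the bird\n' |
tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort | uniq -c | sort -rn | sed 2q
```

This prints `3 the` and `2 and`, with the counts padded on the left by `uniq -c`.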

McIlroy’s response to Knuth was like Abraham Lincoln’s response to Edward Everett at Gettysburg. Lincoln’s famous address was 50x shorter than that of the orator who preceded him [2]. (Update: There’s more to the story. See [3].)

Knuth and McIlroy had very different objectives and placed different constraints on themselves, and so their solutions are not directly comparable. But McIlroy’s solution has become famous. Knuth’s solution is remembered, if at all, as the verbose program that McIlroy responded to.

The stereotype of a Unix wizard is someone who could improvise programs like the one above. Maybe McIlroy carefully thought about his program for days, looking for the most elegant solution. That would seem plausible, but in fact he says the script was “written on the spot and worked on the first try.” He said that the script was similar to one he had written a year before, but it still counts as an improvisation.

## Why can’t I write scripts like that?

McIlroy’s script was a real example of the kind of wizardry attributed to Unix adepts. Why can’t more people quickly improvise scripts like that?

The exercise that Bentley posed was the kind of problem that programmers like McIlroy solved routinely at the time. The tools he piped together were developed precisely for such problems. McIlroy didn’t see his solution as extraordinary but said “Old UNIX hands know instinctively how to solve this one in a jiffy.”

The traditional Unix toolbox is full of utilities for text manipulation. Not only are they useful, but they compose well. This composability depends not only on the tools themselves, but also the shell environment they were designed to operate in. (The latter is why some utilities don’t work as well when ported to other operating systems, even if the functionality is duplicated.)

Bentley’s exercise was clearly text-based: given a text file, produce a text file. What about problems that are not text manipulation? The trick to being productive from a command line is to turn problems into text manipulation problems.  The output of a shell command is text. Programs are text. Once you get into the necessary mindset, everything is text. This may not be the most efficient approach to a given problem, but it’s a possible strategy.

## The hard part

The hard part on the path to becoming a command line wizard, or any kind of wizard, is thinking about how to apply existing tools to your particular problems. You could memorize McIlroy’s script and be prepared next time you need to report word frequencies, but applying the spirit of his script to your particular problems takes work. Reading one-liners that other people have developed for their work may be inspiring, or intimidating, but they’re no substitute for thinking hard about your particular work.

## Repetition

You get faster at anything with repetition. Maybe you don’t solve any particular kind of problem often enough to be fluent at solving it. If someone can solve a problem by quickly typing a one-liner in a shell, maybe they are clever, or maybe their job is repetitive. Or maybe both: maybe they’ve found a way to make semi-repetitive tasks repetitive enough to automate. One way to become more productive is to split semi-repetitive tasks into more creative and more repetitive parts.


[1] The odd-looking line break is a quoted newline.

[2] Everett’s speech contained 13,607 words while Lincoln’s Gettysburg Address contained 272, a ratio of almost exactly 50 to 1.

[3] See Hillel Wayne’s post Donald Knuth was Framed. Here’s an excerpt:

> Most of the “eight pages” aren’t because Knuth is doing LP [literate programming], but because he’s Donald Knuth:
>
> - One page is him setting up the problem (“what do we mean by ‘word’? What if multiple words share the same frequency?”) and one page is just the index.
> - Another page is just about working around specific Pascal issues no modern language has, like “how do we read in an integer” and “how do we identify letters when Pascal’s character set is poorly defined.”
> - Then there’s almost four pages of handrolling a hash trie.

The “eight pages” refers to the length of the original publication. I described the paper as 17 pages because that is its length in the book where I found it.

# Windows command line tips

I use Windows, Mac, and Linux, each for different reasons. When I run Windows, I like to have a shell that works sorta like bash, but doesn’t run in a subsystem. That is, I like to have the utility programs and command editing features that I’m used to from bash on Mac or Linux, but I want to run native Windows code and not a little version of Linux hosted by Windows. [1]

It took a while to find something I’m happy with. It’s easier to find Linux subsystems like Cygwin. But cmder does exactly what I want. It’s a bash-like shell running on Windows. I also use Windows ports of common Unix utilities. Since these are native Windows programs, I can run them and other Windows applications in the same environment. No error messages along the lines of “I know it looks like you’re running Windows, but you’re not really. So you can’t open that Word document.”

I’ve gotten Unix-like utilities for Windows from several sources. GOW (Gnu on Windows) is one source. I’ve also collected utilities from other miscellaneous sources.

## Tab completion and file association

There’s one thing that was a little annoying about cmder: tab completion doesn’t work if you want to enter a file name. For example, if you want to open a Word document `foo.docx` from the basic Windows command prompt cmd.exe, you can type `fo` followed by the tab key and the file will open if `foo.docx` is the first file in your working directory that begins with “fo.”

In cmder, tab completion works for programs first, and then for file names. If you type in `fo` followed by the tab key, `cmder` will look for an application whose name starts with “fo.” If you wanted to move `foo.docx` somewhere, you could type `mv fo` and tab. In this context, `cmder` knows that “fo” is the beginning of a file name.

On Mac, you use the `open` command to open a file with its associated application. For example, on the Mac command line you’d type `open foo.docx` rather than just `foo.docx` to open the file with Word.

If there were something like `open` on Windows, then tab completion would work in `cmder`. And there is! It’s the `start` command. In fact, if you’re accustomed to using `open` on Mac, you could create an alias so that `open` invokes `start` on Windows [2]. So in `cmder`, you can type `start fo` and hit tab, and get tab completion for the file name.

## Miscellaneous

The `assoc` command shows you which application is associated with a file extension. (Include the “.” when using this command, so you’d type `assoc .docx` rather than `assoc docx`.)

You can pipe Windows shell output to the `clip` command to put the output onto the Windows clipboard.

The `control` command opens the Windows control panel.

This post shows how to have branching logic in an Emacs config file so you can use the same config file across operating systems.


[1] Usually on Windows I want to run Windows. But if I do want to run Linux without having to go to another machine, I use WSL (Windows Subsystem for Linux) and I can use it from cmder. Since cmder supports multiple tabs, I can have one tab running ordinary cmd.exe and another tab running bash on WSL.

[2] In the directory containing cmder.exe, edit the file `config/user-aliases.cmd`. Add a line `open=start $1`.