A shell one-liner to search directories

I started this post by wanting to look at the frequency of LaTeX commands, but then thought that some people might find the code to find the frequencies more interesting than the frequencies themselves.

So I’m splitting this into two posts. This post will look at the shell one-liner to find command frequencies, and the next post will look at the actual frequencies.

I want to explore LaTeX files, so I’ll start by using find to find such files.

    find . -name "*.tex"

This searches for files ending in .tex, starting with the current directory (hence .) and searching recursively into subdirectories. The find command explores subdirectories by default; you have to tell it not to if that’s not what you want.

Next, I want to use grep to search the LaTeX files. If I pipe the output of find to grep it will search the file names, but I want it to search the file contents. The xargs command takes care of this, receiving the file names and passing them along as file names, i.e. not as text input.

    find . -name "*.tex" | xargs grep ...

LaTeX commands have the form of a backslash followed by letters, so the regular expression I’ll pass is \\[a-z]+. This says to look for a literal backslash followed by one or more letters.

I’ll give grep four option flags. I’ll use -i to ask it to use case-insensitive matching, because LaTeX commands can contain capital letters. I’ll use -E to tell it I want to use extended regular expressions [1].

I’m after just the commands, not the lines containing commands, and so I use the -o option to tell grep to return just the commands, one per line. But that’s not enough. It would be enough if we were only searching one file, but since we’re searching multiple files, the default behavior is for grep to return the file name as well. The -h option tells it to return only the matches, no file names.

So now we’re up to this:

    find . -name "*.tex" | xargs grep -oihE '\\[a-z]+'

Next I want to count how many times each command occurs, and I need to sort the output first so that uniq will count correctly, since uniq only counts duplicates that appear on adjacent lines.

    find . -name "*.tex" | xargs grep -oihE '\\[a-z]+' | sort | uniq -c

And finally I want to sort the output by frequency, in descending order. The -n option tells sort to sort numerically, and -r says to sort in descending order rather than the default ascending order. This produces a lot of output, so I pipe everything to less to view it one screen at a time.

    find . -name "*.tex" | xargs grep -oihE '\\[a-z]+' | sort | uniq -c | sort -rn | less

That’s my one-liner. In the next post I’ll look at the results.
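
One small variation: if you only want to see the most common commands rather than paging through all of the output, you could replace less with head, keeping, say, the top 20 lines.

    find . -name "*.tex" | xargs grep -oihE '\\[a-z]+' | sort | uniq -c | sort -rn | head -n 20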

[1] I learned regular expressions from writing Perl long ago. What I think of as simply a regular expression is what grep calls “extended” regular expressions, so adding the -E option keeps me out of trouble in case I use a feature that grep considers an extension. You could use egrep instead, which is essentially the same as grep -E.

Doing a database join with CSV files

It’s easy to manipulate CSV files with basic command line tools until you need to do a join. When your data is spread over two different files, like two tables in a normalized database, joining the files is more difficult unless the two files have the same keys in the same order. Fortunately, the xsv utility is just the tool for the job. Among other useful features, xsv supports database-like joins.
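
xsv is written in Rust. If you have the Rust toolchain installed, one way to get it is through cargo; prebuilt binaries are also available from the project’s releases page.

    cargo install xsv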

Suppose you want to look at weights broken down by sex, but weights are in one file and sex is in another. The weight file alone doesn’t tell you whether the weights belong to men or women.

Suppose a file weight.csv has the following rows:

    ID,weight
    123,200
    789,155
    999,160

and a file person.csv has the following:

    ID,sex
    123,M
    456,F
    789,F

Note that the two files have different ID values: 123 and 789 are in both files, 999 is only in weight.csv and 456 is only in person.csv. We want to join the two tables together, analogous to the JOIN command in SQL.

The command

    xsv join ID person.csv ID weight.csv

does just this, producing

    ID,sex,ID,weight
    123,M,123,200
    789,F,789,155

by joining the two tables on their ID columns.

The command includes ID twice, once for the field called ID in person.csv and once for the field called ID in weight.csv. The fields could have different names. For example, if the first column of person.csv were renamed Key, then the command

    xsv join Key person.csv ID weight.csv

would produce

    Key,sex,ID,weight
    123,M,123,200
    789,F,789,155

We’re not interested in the ID columns per se; we only want to use them to join the two files. We could suppress them by asking xsv to select only the second and fourth columns of the output:

    xsv join Key person.csv ID weight.csv | xsv select 2,4

which would return

    sex,weight
    M,200
    F,155

We can do other kinds of joins by passing a modifier to join. For example, if we do a left join, we will include all rows in the left file, person.csv, even if there isn’t a match in the right file, weight.csv. The weight will be missing for such records, and so

    xsv join --left Key person.csv ID weight.csv

produces

    Key,sex,ID,weight
    123,M,123,200
    456,F,,
    789,F,789,155

Right joins are analogous, including every record from the second file, and so

    xsv join --right Key person.csv ID weight.csv

produces

    Key,sex,ID,weight
    123,M,123,200
    789,F,789,155
    ,,999,160

You can also do a full join, with

    xsv join --full Key person.csv ID weight.csv

producing

    Key,sex,ID,weight
    123,M,123,200
    456,F,,
    789,F,789,155
    ,,999,160

Exporting Excel files to CSV with in2csv

This post shows how to export an Excel file to a CSV file using in2csv from the csvkit package.

You could always use Excel itself to export an Excel file to CSV, but there are several reasons you might not want to. First and foremost, you might not have Excel. Another reason is that you might want to work from the command line in order to automate the process. Finally, you might not want the kind of CSV format that Excel exports.

For illustration I made a tiny Excel file. In order to show how commas are handled, populations contain commas but areas do not.

    State   Population    Area
    CA      39,500,000    163695
    TX      28,300,000    268596
    FL      31,000,000    65758

When I ask Excel to export the file I get

    State,Population,Area
    CA,"39,500,000",163695
    TX,"28,300,000",268596
    FL,"31,000,000",65758

Note that areas are exported as plain integers, but populations are exported as quoted strings containing commas.

Using csvkit

Now install csvkit and run in2csv.
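
(If you don’t have csvkit yet, it’s a Python package; assuming you have pip available, installing it is typically just the following.)

    $ pip install csvkit

Then running in2csv on the Excel file gives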

    $ in2csv states.xlsx
    State,Population,Area
    CA,39500000,163695
    TX,28300000,268596
    FL,31000000,65758

The output goes to standard out, though of course you could redirect it to a file. All numbers are exported as numbers, with no thousands separators. This makes the output easier to use from a program that does crude parsing [1]. For example, suppose we save states.xlsx to states.csv using Excel and then ask cut for the second column. We don’t get what we want.

    $ cut -d, -f2 states.csv
    Population
    "39
    "28
    "31

But if we use in2csv to create states.csv then we get what we’d expect.

    $ cut -d, -f2 states.csv
    Population
    39500000
    28300000
    31000000

Multiple sheets

So far we’ve assumed our Excel file has a single sheet. I added a second sheet with data on US territories. The sheet doesn’t have a header row, just to show that a header row isn’t required.

    PR      3,300,000    5325
    Guam    161,700      571

I named the two sheets “States” and “Territories” respectively.

Now if we ask in2csv to export our Excel file as before, it only exports the first sheet. But if we specify the Territories sheet, it will export that.

    $ in2csv --sheet Territories states.xlsx
    PR,3300000,5325
    Guam,161700,571
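
As an aside, in2csv can also list the sheet names in a workbook, which is handy if you don’t remember what they’re called. If I recall correctly the option is --names; check in2csv --help for your version.

    $ in2csv --names states.xlsx   # should list States and Territories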

[1] The cut utility isn’t specialized for CSV files and doesn’t understand that commas inside quotation marks are not field separators. A specialized utility like csvtool would make this distinction. You could extract the second column with

    csvtool col 2 states.csv

or

    csvtool namedcol Population states.csv

This parses the columns correctly, but you still have to remove the quotation marks and thousands separators.
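
If all you need is a quick fix for a small file like this one, a crude sketch is to strip both the quotation marks and the commas with tr:

    csvtool namedcol Population states.csv | tr -d '",'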

Minimizing context switching between shell and Python

Sometimes you’re in the flow using the command line and you’d like to briefly switch over to Python without too much interruption. Or it could be the other way around: you’re in the Python REPL and need to issue a quick shell command.

One solution would be to run your shell and Python session in different terminals, but let’s assume that for whatever reason you’re working in just one terminal. For example, maybe you want the output of a shell command to stay visible on screen when you run Python, or vice versa.

Calling Python from shell

You can run a Python one-liner from the shell by calling Python with the -c option. For example,

    $ python -c "print(3*7)"
    21

I hardly ever do this because I want to run more than a one-liner. What I find more useful is to launch Python with the -q option to suppress all the start-up verbiage and simply bring up a prompt.

    $ python -q
    >>>

More on this in my post on quiet modes.

Calling shell from Python

If you run Python with the ipython command rather than the default python, you get much better shell integration. IPython lets you type a shell command at any point simply by preceding it with a !. For example, the following command tells us this is the 364th day of the year.

    In [1]: ! date +%j
    364 

You can run some of the most common shell commands, such as cd and ls without even a bang prefix. These are “magic” commands that do what you’d expect if you forgot for a moment that you’re in a Python REPL rather than a command shell.

    In [2]: cd ..
    Out[2]: '/mnt/c/Users'

IPython also supports other forms of shell integration such as capturing the output of a shell command as a Python variable, or using a Python variable as an argument to a shell command.
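
For example, here is a sketch of both directions; the variable name files is my own choice, and the ls command assumes there are some .tex files in the current directory.

    In [3]: files = !ls *.tex
    In [4]: !echo {len(files)}

The first line captures the output of ls as a Python list; the second interpolates the Python expression len(files) into a shell command.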

Why can’t grep find negative numbers?

Suppose you’re looking for instances of -42 in a file foo.txt. The command

    grep -42 foo.txt

won’t work. Instead you’ll get a warning message like the following.

    Usage: grep [OPTION]... PATTERN [FILE]...
    Try 'grep --help' for more information.

Putting single or double quotes around -42 won’t help. The problem is that grep interprets 42 as a command line option, and doesn’t have such an option. This is a problem if you’re searching for negative numbers, or any pattern that begins with a dash, such as -able or --version.

The solution is to put -e in front of a regular expression that begins with a dash. That tells grep that the next token on the command line is a regular expression, not a command line option. So

    grep -e -42 foo.txt

will work.

You can also use -e several times to give grep several regular expressions to search for. For example,

    grep -e cat -e dog foo.txt

will search for “cat” and “dog.”

See the previous post for another example of where grep doesn’t seem to work. By default grep supports a restricted regular expression syntax and may need to be told to use “extended” regular expressions.

Why doesn’t grep work?

If you learned regular expressions by using a programming language like Perl or Python, you may be surprised when tools like grep seem broken. That’s because what you think of as simply regular expressions, these tools consider extended regular expressions. Tell them to search on extended regular expressions and some of your frustration will go away.

As an example, we’ll revisit a post I wrote a while back about searching for ICD-9 and ICD-10 codes with regular expressions. From that post:

Most ICD-9 diagnosis codes are just numbers, but they may also start with E or V. Numeric ICD-9 codes are at least three digits. Optionally there may be a decimal point followed by one or two more digits. … Sometimes the decimals are left out.

Let’s start with the following regular expression.

    [0-9]{3}\.?[0-9]{0,2}

This says to look for three instances of the digits 0 through 9, optionally followed by a literal period, followed by zero, one, or two more digits. (Since . is a special character in regular expressions, we have to use a backslash to literally match a period.)

The regular expression above will work with Perl or Python, but not with grep or sed by default. That’s because it uses two features of extended regular expressions (ERE), but programs like grep and sed support basic regular expressions (BRE) by default.

Basic regular expressions would use \{3\} rather than {3} to match a pattern three times. So, for example,

    echo 123 | grep "[0-9]\{3\}"

would return 123, but

    echo 123 | grep "[0-9]{3}"

would return nothing.

Similarly,

    echo 123 | sed -n "/[0-9]\{3\}/p"

would return 123 but

    echo 123 | sed -n "/[0-9]{3}/p"

returns nothing.

(The -n option to sed tells it not to print every line by default. The p following the regular expression tells sed to print those lines that match the pattern. Here there’s only one line, the output of echo, but typically grep and sed would be used on files with multiple lines.)

Turning on ERE support

You can tell grep and sed that you want to use extended regular expressions by giving either one the -E option. So, for example, both

    echo 123 | grep -E "[0-9]{3}"

and

    echo 123 | sed -E -n "/[0-9]{3}/p"

will print 123.

You can use egrep as a synonym for grep -E, at least with Gnu implementations.

Incidentally, awk uses extended regular expressions, and so

    echo 123 | awk "/[0-9]{3}/"

will also print 123.

Going back to our full regular expression, using \.? for an optional period works with grep and sed if we ask for ERE support. The following commands all print 123.4.

    echo 123.4 | grep -E "[0-9]{3}\.?[0-9]{0,2}"
    echo 123.4 | sed -E -n "/[0-9]{3}\.?[0-9]{0,2}/p"
    echo 123.4 | awk "/[0-9]{3}\.?[0-9]{0,2}/"

Without the -E option, grep and sed will not return a match.

This doesn’t fix everything

At the top of the post I said that if you tell tools you want extended regular expression support “some of your frustration will go away.” The regular expression from my ICD code post was actually

    \d{3}\.?\d{0,2}

rather than

    [0-9]{3}\.?[0-9]{0,2}

I used the shortcut \d to denote a digit. Python, Perl, and Awk will understand this, but grep will not, even with the -E option.

grep will understand \d if instead you use the -P option, telling it you want to use Perl-compatible regular expressions (PCRE). The Gnu version of grep supports this option, but the man page says “This is experimental and grep -P may warn of unimplemented features.” I don’t know whether other implementations of grep support PCRE. And sed does not have an option to support PCRE.
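
For example, with Gnu grep the following should print 123.

    echo 123 | grep -P "\d{3}"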

Set theory at the command line

Often you have two lists, and you want to know what items belong to both lists, or which things belong to one list but not the other. The command line utility comm was written for this task.

Given two files, A and B, comm returns its output in three columns:

  1. A − B
  2. B − A
  3. A ∩ B

Here the minus sign means set difference, i.e. A − B is the set of things in A but not in B.

[Venn diagram of comm parameters]

The numbering above corresponds to the command line options for comm. More on that shortly.

Difference and intersection

Here’s an example. Suppose we have a file states.txt containing a list of US states

    Alabama
    California
    Georgia
    Idaho
    Virginia

and a file names.txt containing female names.

    Charlotte
    Della
    Frances
    Georgia
    Margaret
    Marilyn
    Virginia

Then the command

    comm states.txt names.txt

returns the following.

    Alabama
    California
            Charlotte
            Della
            Frances
                    Georgia
    Idaho
            Margaret
            Marilyn
                    Virginia

The first column is states which are not female names. The second column is female names which are not states. The third column is states which are also female names.

Filtering output

The output of comm is tab-separated. So you could pull out one of the columns by piping the output through cut, but you probably don’t want to do that, for a couple of reasons. First, you get unwanted blank lines. For example,

    comm states.txt names.txt | cut -f 1

returns

    Alabama
    California




    Idaho

Worse, if you ask cut to return the second column, you won’t get what you expect.

    Alabama
    California
    Charlotte
    Della
    Frances

    Idaho
    Margaret
    Marilyn

This is because although comm uses tabs, it doesn’t produce a typical tab-separated file. Lines in the first column contain no tab at all, and when a line doesn’t contain the delimiter, cut prints the whole line.

The way to filter comm output is to tell comm which columns to suppress.

If you only want the first column, states that are not names, use the option -23, i.e. to select column 1, tell comm not to print columns 2 or 3. So

    comm -23 states.txt names.txt

returns

    Alabama
    California
    Idaho

with no blank lines. Similarly, if you just want column 2, use comm -13. And if you want the intersection, column 3, use comm -12.
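
For example, to get just the intersection of the two sample files above you would run

    comm -12 states.txt names.txt

which prints Georgia and Virginia, the two lines common to both files.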

Sorting

The comm utility assumes its input files are sorted. If they’re not, it will warn you.

If your files are not already sorted, you could sort them and send them to comm in a one-liner [1].

    comm <(sort states.txt) <(sort names.txt)

Multisets

If your input files are truly sets, i.e. no item appears twice, comm will act as you expect. If you actually have a multiset, i.e. items may appear more than once, the output may surprise you at first, though on further reflection you may agree it does what you’d hope.

Suppose you have a file places.txt of places

    Cincinnati
    Cleveland
    Washington
    Washington

and a file of US Presidents.

    Adams
    Adams
    Cleveland
    Cleveland
    Jefferson
    Washington

Then

    comm places.txt presidents.txt

produces the following output.

            Adams
            Adams
    Cincinnati
                    Cleveland
            Cleveland
            Jefferson
                    Washington
    Washington

Cleveland was on the list of presidents twice. (Grover Cleveland was the 22nd and the 24th president, the only US president to serve non-consecutive terms.) The output of comm lists Cleveland once in the intersection of places and presidents (say, the 22nd president and the city in Ohio), but also lists him once as a president not corresponding to a place. That is because while there are two Clevelands on our list of presidents, there is only one in our list of places.

Suppose we had included five places named Cleveland (there are cities named Cleveland in several states). Then comm would list Cleveland three times as a place but not a president, and two times as both a place and a president.

In general, the comm utility follows the mathematical conventions for multisets. Suppose an item x appears m times in multiset A and n times in multiset B. Then x appears max(m − n, 0) times in A − B, max(n − m, 0) times in B − A, and min(m, n) times in A ∩ B.

Union

To form the union of two files as multisets of lines, just combine them into one file, with duplicates. You can join file1 and file2 with cat (short for “concatenate”).

    cat file1 file2 > multiset_union_file

To find the union of two files as sets, first find the union as multisets, then remove duplicates.

    cat file1 file2 | sort | uniq > set_union_file

Note that uniq, like comm, assumes files are sorted.

[1] The comm utility is native to Unix-like systems, but has been ported to Windows. The examples in this post will work on Windows with ports of the necessary utilities, except for where we sort a file before sending it on to comm with <(sort file). That’s not a feature of sort or comm but of the shell, bash in my case. The Windows command line doesn’t support this syntax. (But bash ported to Windows would.)

Formatting numbers at the command line

The utility numfmt, part of Gnu Coreutils, formats numbers. The main uses are grouping digits and converting to and from unit suffixes like k for kilo and M for mega. This is somewhat useful for individual invocations, but as with most command line utilities, the real value comes from using it as part of a pipeline.

The --grouping option will separate digits according to the rules of your locale. So on my computer

    numfmt --grouping 123456789

returns 123,456,789. On a French computer, it would return 123.456.789 because the French use commas as decimal separators and use periods to group digits [1].

You can also use numfmt to convert between ordinary numbers and numbers with units. Unfortunately, there’s some ambiguity regarding what units like kilo and mega mean. A kilogram is 1,000 grams, but a kilobyte is 2^10 = 1,024 bytes. (Units like kibi and mebi were introduced to remove this ambiguity, but the previous usage is firmly established.)

If you want to convert 2M to an ordinary number, you have to specify whether you mean 2 × 10^6 or 2 × 2^20. For the former, use --from=si (for Système international d’unités) and for the latter use --from=iec (for International Electrotechnical Commission).

    $ numfmt --from=si 2M
    2000000
    $ numfmt --from=iec 2M
    2097152 

One possible gotcha is that the symbol for kilo is capital K rather than lower case k; all units from kilo to Yotta use a capital letter. Another is that there must be no space between the numerals and the suffix, e.g. 2G is legal but 2 G is not.

You can use Ki for kibi, Mi for mebi etc. if you use --from=iec-i.

    $ numfmt --from=iec-i 2Gi  
    2147483648   

To convert from ordinary numbers to numbers with units use the --to option.

    $ numfmt --to=iec 1000000 
    977K  
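
numfmt also reads standard input when you don’t give it numbers on the command line, so it drops easily into a pipeline. As a sketch, assuming Gnu ls with the file size in the fifth column of ls -l output, you could make a long directory listing easier to scan:

    ls -l | numfmt --header --field 5 --to=iec

The --header option passes the “total” line at the top of the listing through unchanged.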

[1] I gave a presentation in France years ago. Much to my chagrin, the software I was demonstrating had a hard-coded assumption that the decimal separator was a period. I had tested the software on a French version of Windows, but had only entered integers and so I didn’t catch the problem.

To make matters worse, there was redundant input validation. Entering 3.14 rather than 3,14 would satisfy the code my team wrote, but the input was first validated by a Windows API which rejected 3.14 as invalid in the user’s locale.

Computing pi with bc

I wanted to stress test the bc calculator a little and so I calculated π to 10,000 digits a couple different ways.

First I ran

    time bc -l <<< "scale=10000;4*a(1)"

which calculates π as 4 arctan(1). This took 2 minutes and 38 seconds.

I imagine bc is using some sort of power series to compute arctan, and so smaller arguments should converge faster. So next I used a formula due to John Machin (1680–1752).

    time bc -l <<< "scale=10000;16*a(1/5) - 4*a(1/239)"

This took 52 seconds.

Both results were correct to 9,998 decimal places.

When you set the scale variable to n, bc doesn’t just carry calculations out to n decimal places; it uses more and tries to deliver n correct decimal places in the final result.

Why bc

This quirky little calculator is growing on me. For one thing, I like its limitations. If I need to do something that isn’t easy to do with bc, that probably means that I should write a script rather than trying to work directly at the command line.

Another thing I like about it is that it launches instantly. It doesn’t give you a command prompt, and so if you launch it in quiet mode you could think that it’s still loading when in fact it’s waiting on you. And if you send bc code with a here-string as in the examples above, you don’t even have to launch it per se.

If you want to try bc, I’d recommend launching it with the options -lq. You might even want to alias bc to bc -lq. The -l option loads math libraries. You’d think that would be the default for a calculator, but bc was written in a more resource-constrained time when you didn’t load much by default. The -l option also sets scale to 20, i.e. you get twenty decimal places of precision; the default is zero!
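
For example, here is the default behavior next to the behavior with -l.

    $ bc <<< "2/3"
    0
    $ bc -l <<< "2/3"
    .66666666666666666666

With the default scale of zero, division truncates to an integer; with -l you get 20 decimal places.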

The -q option isn’t necessary, but it starts bc in quiet mode, suppressing three lines of copyright and warranty announcements.

As part of its minimalist design, bc only includes a few math functions, and you have to bootstrap the rest. For example, it includes sine and cosine but not tangent. More on how to use the built-in functions to compute more functions here.
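
For example, here is a minimal sketch of bootstrapping a tangent function from the built-in sine s and cosine c; the function name t and the test value are my own choices.

    printf 'define t(x) { return (s(x)/c(x)) }\nt(1)\n' | bc -l

This should print tan(1) ≈ 1.5574, carried out to the twenty decimal places that -l sets.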