Why can’t grep find negative numbers?

Suppose you’re looking for instances of -42 in a file foo.txt. The command

    grep -42 foo.txt

won’t work. Instead you’ll get an error message like the following.

    Usage: grep [OPTION]... PATTERN [FILE]...
    Try 'grep --help' for more information.

Putting single or double quotes around -42 won’t help. The problem is that grep interprets -42 as a command line option, and grep doesn’t have such an option. This is a problem if you’re searching for negative numbers, or any pattern that begins with a dash, such as -able or --version.

The solution is to put -e in front of a regular expression that begins with a dash. That tells grep that the next token on the command line is a regular expression, not a command line option. So

    grep -e -42 foo.txt

will work.

You can also use -e several times to give grep several regular expressions to search for. For example,

    grep -e cat -e dog foo.txt

will search for “cat” or “dog.”

See the previous post for another example of where grep doesn’t seem to work. By default grep supports a restricted regular expression syntax and may need to be told to use “extended” regular expressions.

Why doesn’t grep work?

If you learned regular expressions by using a programming language like Perl or Python, you may be surprised when tools like grep seem broken. That’s because what you think of simply as regular expressions, these tools treat as extended regular expressions. Tell them to search on extended regular expressions and some of your frustration will go away.

As an example, we’ll revisit a post I wrote a while back about searching for ICD-9 and ICD-10 codes with regular expressions. From that post:

Most ICD-9 diagnosis codes are just numbers, but they may also start with E or V. Numeric ICD-9 codes are at least three digits. Optionally there may be a decimal followed by one or two more digits. … Sometimes the decimals are left out.

Let’s start with the following regular expression.

    [0-9]{3}\.?[0-9]{0,2}

This says to look for three instances of the digits 0 through 9, optionally followed by a literal period, followed by zero, one, or two more digits. (Since . is a special character in regular expressions, we have to use a backslash to literally match a period.)

The regular expression above will work with Perl or Python, but not with grep or sed by default. That’s because it uses two features of extended regular expressions (ERE), but programs like grep and sed support basic regular expressions (BRE) by default.

Basic regular expressions would use \{3\} rather than {3} to match a pattern three times. So, for example,

    echo 123 | grep "[0-9]\{3\}"

would return 123, but

    echo 123 | grep "[0-9]{3}"

would return nothing.

Similarly,

    echo 123 | sed -n "/[0-9]\{3\}/p"

would return 123 but

    echo 123 | sed -n "/[0-9]{3}/p"

returns nothing.

(The -n option to sed tells it not to print every line by default. The p following the regular expression tells sed to print those lines that match the pattern. Here there’s only one line, the output of echo, but typically grep and sed would be used on files with multiple lines.)

Turning on ERE support

You can tell grep and sed that you want to use extended regular expressions by giving either one the -E option. So, for example, both

    echo 123 | grep -E "[0-9]{3}"

and

    echo 123 | sed -E -n "/[0-9]{3}/p"

will print 123.

You can use egrep as a synonym for grep -E, at least with Gnu implementations.

Incidentally, awk uses extended regular expressions, and so

    echo 123 | awk "/[0-9]{3}/"

will also print 123.

Going back to our full regular expression, using \.? for an optional period works with grep and sed if we ask for ERE support. The following commands all print 123.4.

    echo 123.4 | grep -E "[0-9]{3}\.?[0-9]{0,2}"
    echo 123.4 | sed -E -n "/[0-9]{3}\.?[0-9]{0,2}/p"
    echo 123.4 | awk "/[0-9]{3}\.?[0-9]{0,2}/"

Without the -E option, grep and sed will not return a match.

This doesn’t fix everything

At the top of the post I said that if you tell tools you want extended regular expression support “some of your frustration will go away.” The regular expression from my ICD code post was actually

    \d{3}\.?\d{0,2}

rather than

    [0-9]{3}\.?[0-9]{0,2}

I used the shortcut \d to denote a digit. Python, Perl, and Awk will understand this, but grep will not, even with the -E option.

grep will understand \d if instead you use the -P option, telling it you want to use Perl-compatible regular expressions (PCRE). The Gnu version of grep supports this option, but the man page says “This is experimental and grep -P may warn of unimplemented features.” I don’t know whether other implementations of grep support PCRE. And sed does not have an option to support PCRE.
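
For example, with a version of grep that supports -P, the following prints 123.

    echo 123 | grep -P "\d{3}"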


Set theory at the command line

Often you have two lists, and you want to know what items belong to both lists, or which things belong to one list but not the other. The command line utility comm was written for this task.

Given two files, A and B, comm returns its output in three columns:

  1. A − B
  2. B − A
  3. A ∩ B.

Here the minus sign means set difference, i.e. A − B is the set of things in A but not in B.

[Venn diagram of comm parameters]
The numbering above corresponds to command line options for comm. More on that shortly.

Difference and intersection

Here’s an example. Suppose we have a file states.txt containing a list of US states

    Alabama
    California
    Georgia
    Idaho
    Virginia

and a file names.txt containing female names.

    Charlotte
    Della
    Frances
    Georgia
    Margaret
    Marilyn
    Virginia

Then the command

    comm states.txt names.txt

returns the following.

    Alabama
    California
            Charlotte
            Della
            Frances
                    Georgia
    Idaho
            Margaret
            Marilyn
                    Virginia

The first column is states which are not female names. The second column is female names which are not states. The third column is states which are also female names.

Filtering output

The output of comm is tab-separated. So you could pull out one of the columns by piping the output through cut, but you probably don’t want to do that for a couple of reasons. First, you might get unwanted blank lines. For example,

    comm states.txt names.txt | cut -f 1

returns

    Alabama
    California




    Idaho

Worse, if you ask cut to return the second column, you won’t get what you expect.

    Alabama
    California
    Charlotte
    Della
    Frances

    Idaho
    Margaret
    Marilyn

This is because although comm uses tabs, it doesn’t produce a typical tab-separated file. Lines in the first column have no leading tab, lines in the second have one, and lines in the third have two, so the fields don’t line up the way cut expects. And cut passes lines containing no delimiter through unchanged, which is why the state names show up in the second-column output above.

The way to filter comm output is to tell comm which columns to suppress.

If you only want the first column, states that are not names, use the option -23, i.e. to select column 1, tell comm not to print columns 2 or 3. So

    comm -23 states.txt names.txt

returns

    Alabama
    California
    Idaho

with no blank lines. Similarly, if you just want column 2, use comm -13. And if you want the intersection, column 3, use comm -12.
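
For example, the intersection of our two files, the states that are also female names, is given by

    comm -12 states.txt names.txt

which returns Georgia and Virginia.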

Sorting

The comm utility assumes its input files are sorted. If they’re not, it will warn you.

If your files are not already sorted, you could sort them and send them to comm in a one-liner [1].

    comm <(sort states.txt) <(sort names.txt)

Multisets

If your input files are truly sets, i.e. no item appears twice, comm will act as you expect. If you actually have a multiset, i.e. items may appear more than once, the output may surprise you at first, though on further reflection you may agree it does what you’d hope.

Suppose you have a file places.txt of places

    Cincinnati
    Cleveland
    Washington
    Washington

and a file of US Presidents.

    Adams
    Adams
    Cleveland
    Cleveland
    Jefferson
    Washington

Then

    comm places.txt presidents.txt

produces the following output.

            Adams
            Adams
    Cincinnati
                    Cleveland
            Cleveland
            Jefferson
                    Washington
    Washington

Cleveland was on the list of presidents twice. (Grover Cleveland was the 22nd and the 24th president, the only US president to serve non-consecutive terms.) The output of comm lists Cleveland once in the intersection of places and presidents (say, the 22nd president and a city in Ohio), and once more as a president not corresponding to a place. That is because there are two Clevelands on our list of presidents but only one on our list of places.

Suppose we had included five places named Cleveland (there are cities named Cleveland in several states). Then comm would list Cleveland three times as a place that is not a president, and twice as both a place and a president.

In general, the comm utility follows the mathematical conventions for multisets. Suppose an item x appears m times in multiset A and n times in multiset B. Then x appears max(m − n, 0) times in A − B, max(n − m, 0) times in B − A, and min(m, n) times in A ∩ B.

Union

To form the union of two files as multisets of lines, just combine them into one file, with duplicates. You can join file1 and file2 with cat (short for “concatenate”).

    cat file1 file2 > multiset_union_file

To find the union of two files as sets, first find the union as multisets, then remove duplicates.

    cat file1 file2 | sort | uniq > set_union_file

Note that uniq, like comm, assumes files are sorted.
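
For example, using the states.txt and names.txt files from earlier,

    cat states.txt names.txt | sort | uniq

lists ten distinct lines; Georgia and Virginia appear in both files but show up only once in the output.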


[1] The comm utility is native to Unix-like systems, but has been ported to Windows. The examples in this post will work on Windows with ports of the necessary utilities, except where we sort a file before sending it on to comm with <(sort file). That’s not a feature of sort or comm but of the shell, bash in my case. The Windows command line doesn’t support this syntax. (But bash ported to Windows would.)

Formatting numbers at the command line

The utility numfmt, part of Gnu Coreutils, formats numbers. The main uses are grouping digits and converting to and from unit suffixes like k for kilo and M for mega. This is somewhat useful for individual invocations, but as with most command line utilities, the real value comes from using it as part of a pipeline.

The --grouping option will separate digits according to the rules of your locale. So on my computer

    numfmt --grouping 123456789

returns 123,456,789. On a French computer, it would return 123.456.789 because the French use commas as decimal separators and use periods to group digits [1].

You can also use numfmt to convert between ordinary numbers and numbers with units. Unfortunately, there’s some ambiguity regarding what units like kilo and mega mean. A kilogram is 1,000 grams, but a kilobyte is 2^10 = 1,024 bytes. (Units like kibi and mebi were introduced to remove this ambiguity, but the previous usage is firmly established.)

If you want to convert 2M to an ordinary number, you have to specify whether you mean 2 × 10^6 or 2 × 2^20. For the former, use --from=si (for Système international d’unités) and for the latter use --from=iec (for International Electrotechnical Commission).

    $ numfmt --from=si 2M
    2000000
    $ numfmt --from=iec 2M
    2097152 

One possible gotcha is that the symbol for kilo is capital K rather than lower case k; all units from kilo to Yotta use a capital letter. Another is that there must be no space between the numerals and the suffix, e.g. 2G is legal but 2 G is not.
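
For example, K means 1,000 with --from=si and 1,024 with --from=iec, but in either case it has to be a capital K:

    $ numfmt --from=si 2K
    2000
    $ numfmt --from=iec 2K
    2048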

You can use Ki for kibi, Mi for mebi etc. if you use --from=iec-i.

    $ numfmt --from=iec-i 2Gi  
    2147483648   

To convert from ordinary numbers to numbers with units use the --to option.

    $ numfmt --to=iec 1000000 
    977K  
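
Like the other examples here, numfmt also reads standard input, converting the first field of each line by default, so it drops naturally into a pipeline. A small illustration:

    $ printf '%s\n' 1024 1048576 | numfmt --to=iec
    1.0K
    1.0M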


[1] I gave a presentation in France years ago. Much to my chagrin, the software I was demonstrating had a hard-coded assumption that the decimal separator was a period. I had tested the software on a French version of Windows, but had only entered integers and so I didn’t catch the problem.

To make matters worse, there was redundant input validation. Entering 3.14 rather than 3,14 would satisfy the code my team wrote, but the input was first validated by a Windows API which rejected 3.14 as invalid in the user’s locale.

Computing pi with bc

I wanted to stress test the bc calculator a little and so I calculated π to 10,000 digits a couple different ways.

First I ran

    time bc -l <<< "scale=10000;4*a(1)"

which calculates π as 4 arctan(1). This took 2 minutes and 38 seconds.

I imagine bc is using some sort of power series to compute arctan, and so smaller arguments should converge faster. So next I used a formula due to John Machin (1686–1751).

    time bc -l <<< "scale=10000;16*a(1/5) - 4*a(1/239)"

This took 52 seconds.

Both results were correct to 9,998 decimal places.

When you set the scale variable to n, bc doesn’t just carry calculations out to n decimal places; it uses more and tries to deliver n correct decimal places in the final result.

Why bc

This quirky little calculator is growing on me. For one thing, I like its limitations. If I need to do something that isn’t easy to do with bc, that probably means that I should write a script rather than trying to work directly at the command line.

Another thing I like about it is that it launches instantly. It doesn’t give you a command prompt, and so if you launch it in quiet mode you could think that it’s still loading when in fact it’s waiting on you. And if you send bc code with a here-string as in the examples above, you don’t even have to launch it per se.

If you want to try bc, I’d recommend launching it with the options -lq. You might even want to alias bc to bc -lq. The -l option loads math libraries. You’d think that would be the default for a calculator, but bc was written in a more resource-constrained time when you didn’t load much by default. The -l option also sets scale to 20, i.e. you get twenty decimal places of precision; the default is zero!
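
For example, with the default scale of zero, division truncates to an integer, while -l gives you twenty decimal places. (Here -q just suppresses the banner, as recommended above.)

    $ bc -q <<< "1/3"
    0
    $ bc -lq <<< "1/3"
    .33333333333333333333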

The -q option isn’t necessary, but it starts bc in quiet mode, suppressing three lines of copyright and warranty announcements.

As part of its minimalist design, bc only includes a few math functions, and you have to bootstrap the rest. For example, it includes sine and cosine but not tangent. More on how to use the built-in functions to compute more functions here.
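
For example, the -l library provides s(x) and c(x) for sine and cosine, so inside bc you could define a tangent function yourself; the name t below is just my choice.

    define t(x) { return(s(x)/c(x)) }
    t(1)

The second line then prints tan(1) ≈ 1.5574 to the current scale.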

Splitting lines and numbering the pieces

As I mentioned in my computational survivalist post, I’m working on a project where I have a dedicated computer with little more than basic Unix tools, ported to Windows. It’s given me new appreciation for how the standard Unix tools fit together; I’ve had to rely on them for tasks I’d usually do a different way.

I’d seen the nl command before for numbering lines, but I thought “Why would you ever want to do that? If you want to see line numbers, use your editor.” That way of thinking looks at the tools one at a time, asking what each can do, rather than thinking about how they might work together.

Today, for the first time ever, I wanted to number lines from the command line. I had a delimited text file and wanted to see a numbered list of the column headings. I’ve written before about how you can extract columns using cut, but you have to know the number of a column to select it. So it would be nice to see a numbered list of column headings.

The data I’m working on is proprietary, so I downloaded a PUMS (Public Use Microdata Sample) file named ss04hak.csv from the US Census to illustrate instead. The first line of this file is

RT,SERIALNO,DIVISION,MSACMSA,PMSA,PUMA,REGION,ST,ADJUST,WGTP,NP,TYPE,ACR,AGS,BDS,BLD,BUS,CONP,ELEP,FULP,GASP,HFL,INSP,KIT,MHP,MRGI,MRGP,MRGT,MRGX,PLM,RMS,RNTM,RNTP,SMP,TEL,TEN,VACS,VAL,VEH,WATP,YBL,FES,FINCP,FPARC,FSP,GRNTP,GRPIP,HHL,HHT,HINCP,HUPAC,LNGI,MV,NOC,NPF,NRC,OCPIP,PSF,R18,R65,SMOCP,SMX,SRNT,SVAL,TAXP,WIF,WKEXREL,FACRP,FAGSP,FBDSP,FBLDP,FBUSP,FCONP,FELEP,FFSP,FFULP,FGASP,FHFLP,FINSP,FKITP,FMHP,FMRGIP,FMRGP,FMRGTP,FMRGXP,FMVYP,FPLMP,FRMSP,FRNTMP,FRNTP,FSMP,FSMXHP,FSMXSP,FTAXP,FTELP,FTENP,FVACSP,FVALP,FVEHP,FWATP,FYBLP

I want to grab the first line of this file, replace commas with newlines, and number the results. That’s what the following one-liner does.

    head -n 1 ss04hak.csv | sed "s/,/\n/g" | nl

The output looks like this:

     1  RT 
     2  SERIALNO 
     3  DIVISION  
     4  MSACMSA
     5  PMSA
...
   100  FWATP
   101  FYBLP

Now if I wanted to look at a particular field, I could see the column number without putting my finger on my screen and counting. Then I could use that column number as an argument to cut -f.
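
For example, counting across the header shows that ST, the state code, is field 8, so

    cut -d, -f8 ss04hak.csv | head

would print the ST heading followed by the first few values in that column.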

File character counts

Once in a while I need to know what characters are in a file and how often each appears. One reason I might do this is to look for statistical anomalies. Another reason might be to see whether a file has any characters it’s not supposed to have, which is often the case.

A few days ago Fatih Karakurt left an elegant solution to this problem in a comment:

    fold -w1 file | sort | uniq -c

The fold utility breaks the content of a file into lines 80 characters long by default, but you can specify the line width with the -w option. Setting that to 1 makes each character its own line. Then sort prepares the input for uniq, and the -c option causes uniq to display counts.
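
If you want the most common characters listed first, you could tack one more sort onto the end of the pipeline:

    fold -w1 file | sort | uniq -c | sort -rn

The final sort -rn orders the lines numerically by the leading count, largest first.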

This works on ASCII files but not Unicode files. For a Unicode file, you might do something like the following Python code.

    import collections

    # Count occurrences of each character, keyed by Unicode code point.
    count = collections.Counter()
    with open("myfile", "r", encoding="utf8") as file:
        for line in file:
            for c in line.strip("\n"):
                count[ord(c)] += 1

    # Print each character, its code point in hex, and its count.
    for p in sorted(count):
        print(chr(p), hex(p), count[p])

Computational survivalist


Some programmers and systems engineers try to do everything they can with basic command line tools on the grounds that someday they may be in an environment where that’s all they have. I think of this as a sort of computational survivalism.

I’m not much of a computational survivalist, but I’ve come to appreciate such a perspective. It’s an efficiency/robustness trade-off, and in general I’ve come to appreciate the robustness side of such trade-offs more over time. It especially makes sense for consultants who find themselves working on someone else’s computer with no ability to install software. I’m not often in that position, but that’s kinda where I am on one project.

Example

I’m working on a project where all my work has to be done on the client’s laptop, and the laptop is locked down for security. I can’t install anything. I can request to have software installed, but it takes a long time to get approval. It’s a Windows box, and I requested a set of ports of basic Unix utilities at the beginning of the project, not knowing what I might need them for. That has turned out to be a fortunate choice on several occasions.

For example, today I needed to count how many times certain characters appear in a large text file. My first instinct was to write a Python script, but I don’t have Python. My next idea was to use grep -c, but that would count the number of lines containing a given character, not the number of occurrences of the character per se.

I did a quick search and found a Stack Overflow question “How can I use the UNIX shell to count the number of times a letter appears in a text file?” On the nose! The top answer said to use grep -o and pipe it to wc -l.

The -o option tells grep to output the regex matches, one per line. So counting the number of lines with wc -l gives the number of matches.
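
For example, to count how many times the letter x occurs in a file (the file name here is just a placeholder):

    grep -o x myfile.txt | wc -l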

Computational minimalism

Computational minimalism is a variation on computational survivalism. Computational minimalists limit themselves to a small set of tools, maybe the same set of tools as a computational survivalist, but for different reasons.

I’m more sympathetic to minimalism than survivalism. You can be more productive by learning to use a small set of tools well than by hacking away with a large set of tools you hardly know how to use. I use a lot of different applications, but not as many as I once used.


Quiet mode

When you start a programming language like Python or R from the command line, you get a lot of initial text that you probably don’t read. For example, you might see something like this when you start Python.

    Python 2.7.6 (default, Nov 23 2017, 15:49:48)
    [GCC 4.8.4] on linux2
    Type "help", "copyright", "credits" or "license" for more information.

The version number is a good reminder. I’m used to the command python bringing up Python 3+, so seeing the text above would remind me that on that computer I need to type python3 rather than simply python.

But if you’re working at the command line and jumping over to Python for a quick calculation, the start up verbiage separates your previous work from your current work by a few lines. This isn’t such a big deal with Python, but it is with R:

    R version 3.6.1 (2019-07-05) -- "Action of the Toes"
    Copyright (C) 2019 The R Foundation for Statistical Computing
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    R is free software and comes with ABSOLUTELY NO WARRANTY.
    You are welcome to redistribute it under certain conditions.
    Type 'license()' or 'licence()' for distribution details.

      Natural language support but running in an English locale

    R is a collaborative project with many contributors.
    Type 'contributors()' for more information and
    'citation()' on how to cite R or R packages in publications.

    Type 'demo()' for some demos, 'help()' for on-line help, or
    'help.start()' for an HTML browser interface to help.
    Type 'q()' to quit R.

By the time you see all that, your previous work may have scrolled out of sight.

There’s a simple solution: use the option -q for quiet mode. Then you can jump in and out of your REPL with a minimum of ceremony and keep your previous work on screen.

For example, the following shows how you can use Python and bc without a lot of wasted vertical space.

    > python -q
    >>> 3+4
    7
    >>> quit()

    > bc -q
    3+4
    7
    quit

Python added the -q option in version 3, which the example above uses. Python 2 does not have an explicit quiet mode option, but Mike S points out a clever workaround in the comments. You can open a Python 2 REPL in quiet mode by using the following.

    python -ic ""

The combination of the -i and -c options tells Python to run the following script and enter interpreter mode. In this case the script is just the empty string, so Python does nothing but quietly enter the interpreter.

R has a quiet mode option, but by default R has the annoying habit of asking whether you want to save a workspace image when you quit.

    > R.exe -q
    > 3+4
    [1] 7
    > quit()
    Save workspace image? [y/n/c]: n

I have never wanted R to save a workspace image; I just don’t work that way. I’d rather keep my state in scripts. So I made R an alias that launches it with the --no-save option.

So if you launch R with -q and --no-save it takes up no more vertical space than Python or bc.
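
For example, one way to set this up in bash:

    alias R='R -q --no-save'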


Munging CSV files with standard Unix tools

This post briefly discusses working with CSV (comma separated value) files using command line tools that are usually available on any Unix-like system. This will raise two objections: why CSV and why dusty old tools?

Why CSV?

In theory, and occasionally in practice, CSV can be a mess. But CSV is the de facto standard format for exchanging data. Some people like this, some lament this, but that’s the way it is.

A minor variation on comma-separated values is tab-separated values [1].

Why standard utilities?

Why use standard Unix utilities? I’ll point out some of their quirks, which are arguments for using something else. But the assumption here is that you don’t want to use something else.

Maybe you already know the standard utilities and don’t think that learning more specialized tools is worth the effort.

Maybe you’re already at the command line and in a command line state of mind, and don’t want to interrupt your work flow by doing something else.

Maybe you’re on a computer where you don’t have the ability to install any software and so you need to work with what’s there.

Whatever your reasons, we’ll go with the assumption that we’re committed to using commands that have been around for decades.

cut, sort, and awk

The tools I want to look at are cut, sort, and awk. I wrote about cut the other day and apparently the post struck a chord with some readers. This post is a follow-up to that one.

These three utilities are standard on Unix-like systems. You can also download them for Windows from GOW. The port of sort will be named gsort in order to not conflict with the native Windows sort function. There’s no need to rename the other two utilities since they don’t have counterparts that ship with Windows.

The sort command is simple and useful. There are just a few options you’ll need to know about. The utility sorts fields as text by default, but the -n option tells it to sort numerically.

Since we’re talking about CSV files, you’ll need to know that -t, is the option to tell sort that fields are separated by commas rather than white space. And to specify which field to sort on, you give it the -k option.

The last utility, awk, is more than a utility. It’s a small programming language. But it works so well from the command line that you can almost think of it as a command line utility. It’s very common to pipe output to an awk program that’s only a few characters long.

You can get started quickly with awk by reading Greg Grothaus’s article Why you should learn just a little awk.

Inconsistencies

Now for the bad news: these programs are inconsistent in their options. The two most common things you’ll need to do when working with CSV files are to set your field delimiter to a comma and to specify which field you want to grab. Unfortunately this is done differently in every utility.

cut uses -d or --delimiter to specify the field delimiter and -f or --fields to specify fields. Makes sense.

sort uses -t or --field-separator to specify the field delimiter and -k or --key to specify the field. When you’re talking about sorting things, it’s common to call the fields keys, and so the way sort specifies fields makes sense in context. I see no reason for -t other than -f was already taken. (In sorting, you talk about folding upper case to lower case, so -f stands for fold.)

awk uses -F or --field-separator to specify the field delimiter. At least the verbose option is consistent with sort. Why -F for the short option instead of -f? The latter was already taken for file. To tell awk to read a program from a file rather than the command line you use the -f option.

awk handles fields differently than cut and sort. Because it is a programming language designed to parse delimited text files, each field has a built-in variable: $1 holds the content of the first field, $2 the second, etc.

The following compact table summarizes how you tell each utility that you’re working with comma-separated files and that you’re interested in the second field.

    |------+-----+-----|
    | cut  | -d, | -f2 |
    | sort | -t, | -k2 |
    | awk  | -F, | $2  |
    |------+-----+-----|
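
For example, to pull out, sort on, or print the second field of a hypothetical comma-separated file data.csv:

    cut -d, -f2 data.csv
    sort -t, -k2 data.csv
    awk -F, '{print $2}' data.csv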

Trade-offs

Some will object that the inconsistencies documented above are a good example of why you shouldn’t work with CSV files using cut, sort, and awk. You could use other command line utilities designed for working with CSV files. Or pull your CSV file into R or Pandas. Or import it somewhere to work with it in SQL. Etc.

The alternatives are all appropriate for different uses. The premise here is that in some circumstances, the inconsistencies cataloged above are a regrettable but acceptable price to pay to stay at the command line.


[1] Things get complicated if you have a CSV file and fields contain commas inside strings. Tab-separated files are more convenient in this case, unless, of course, your strings contain tabs. The utilities mentioned here all support tab as a delimiter by default.