Why doesn’t grep work?

If you learned regular expressions by using a programming language like Perl or Python, you may be surprised when tools like grep seem broken. That’s because what you think of as simply regular expressions, these tools consider extended regular expressions. Tell them to search on extended regular expressions and some of your frustration will go away.

As an example, we’ll revisit a post I wrote a while back about searching for ICD-9 and ICD-10 codes with regular expressions. From that post:

Most ICD-9 diagnosis codes are just numbers, but they may also start with E or V. Numeric ICD-9 codes are at least three digits. Optionally there may be a decimal followed by one or two more digits. … Sometimes the decimals are left out.

Let’s start with the following regular expression.

    [0-9]{3}\.?[0-9]{0,2}

This says to look for three instances of the digits 0 through 9, optionally followed by a literal period, followed by zero, one, or two more digits. (Since . is a special character in regular expressions, we have to use a backslash to literally match a period.)

The regular expression above will work with Perl or Python, but not with grep or sed by default. That’s because it uses two features of extended regular expressions (ERE), but programs like grep and sed support basic regular expressions (BRE) by default.

Basic regular expressions would use \{3\} rather than {3} to match a pattern three times. So, for example,

    echo 123 | grep "[0-9]\{3\}"

would return 123, but

    echo 123 | grep "[0-9]{3}"

would return nothing.

Similarly,

    echo 123 | sed -n "/[0-9]\{3\}/p"

would return 123 but

    echo 123 | sed -n "/[0-9]{3}/p"

returns nothing.

(The -n option to sed tells it not to print every line by default. The p following the regular expression tells sed to print those lines that match the pattern. Here there’s only one line, the output of echo, but typically grep and sed would be used on files with multiple lines.)
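For comparison, the full pattern from the start of the post can also be written in BRE syntax. POSIX BRE has no ? operator, so this sketch uses the interval \{0,1\} for the optional period. The function name bre_match is made up for illustration.

```shell
# Hypothetical helper: print the input if it matches the ICD-9-style
# pattern written in basic regular expression (BRE) syntax.
bre_match() {
    # \{3\} and \{0,2\} are BRE intervals; \.\{0,1\} is an optional literal period.
    printf '%s\n' "$1" | grep '[0-9]\{3\}\.\{0,1\}[0-9]\{0,2\}'
}

bre_match 123.4                  # matches, so grep echoes the line
bre_match 12 || echo "no match"  # only two digits, so grep finds nothing
```

The backslashes make the pattern harder to read, which is a practical reason to prefer ERE syntax when a tool offers it.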

Turning on ERE support

You can tell grep and sed that you want to use extended regular expressions by giving either one the -E option. So, for example, both

    echo 123 | grep -E "[0-9]{3}"

and

    echo 123 | sed -E -n "/[0-9]{3}/p"

will print 123.

You can use egrep as a synonym for grep -E, at least with GNU implementations, though recent GNU grep releases warn that egrep is obsolescent and recommend grep -E instead.

Incidentally, awk uses extended regular expressions, and so

    echo 123 | awk "/[0-9]{3}/"

will also print 123.

Going back to our full regular expression, using \.? for an optional period works with grep and sed if we ask for ERE support. The following commands all print 123.4.

    echo 123.4 | grep -E "[0-9]{3}\.?[0-9]{0,2}"
    echo 123.4 | sed -E -n "/[0-9]{3}\.?[0-9]{0,2}/p"
    echo 123.4 | awk "/[0-9]{3}\.?[0-9]{0,2}/"

Without the -E option, grep and sed will not return a match.
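Note that grep reports a line if the pattern matches anywhere in it. Here is a minimal sketch, assuming we want the whole string to look like a numeric ICD-9 code, that adds the -x option to require a full-line match. The helper name icd9_numeric and the sample strings are hypothetical.

```shell
# Hypothetical helper: succeed only if the entire string matches the pattern.
icd9_numeric() {
    # -x requires the whole line to match; -q suppresses output.
    printf '%s\n' "$1" | grep -E -x -q '[0-9]{3}\.?[0-9]{0,2}'
}

for code in 123 123.4 123.45 12 1234.567; do
    if icd9_numeric "$code"; then
        echo "$code: match"
    else
        echo "$code: no match"
    fi
done
```

The first three strings match. The string 12 has too few digits, and 1234.567 fails only because of -x, since without it the substring 1234 would satisfy the pattern.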

This doesn’t fix everything

At the top of the post I said that if you tell tools you want extended regular expression support “some of your frustration will go away.” The regular expression from my ICD code post was actually

    \d{3}\.?\d{0,2}

rather than

    [0-9]{3}\.?[0-9]{0,2}

I used the shortcut \d to denote a digit. Python and Perl will understand this, but grep will not, even with the -E option. Awk implementations such as gawk generally do not treat \d as a digit class either.

grep will understand \d if instead you use the -P option, telling it you want to use Perl-compatible regular expressions (PCRE). The GNU version of grep supports this option, but the man page says “This is experimental and grep -P may warn of unimplemented features.” I don’t know whether other implementations of grep support PCRE. And sed does not have an option to support PCRE.
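One portable workaround, not from the original ICD post, is the POSIX bracket class [[:digit:]], which can stand in for \d in both BRE and ERE. The sketch below assumes a grep and sed that accept -E. The helper name posix_digits is made up for illustration.

```shell
# Hypothetical helper: use the POSIX class [[:digit:]] where Perl would use \d.
posix_digits() {
    printf '%s\n' "$1" | grep -E '[[:digit:]]{3}\.?[[:digit:]]{0,2}'
}

posix_digits 123.4

# The same class works in sed with ERE support turned on.
printf '%s\n' 123.4 | sed -E -n '/[[:digit:]]{3}\.?[[:digit:]]{0,2}/p'
```

Both commands print 123.4. The class is more verbose than \d, but it does not depend on PCRE support being compiled in.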


Top cryptography posts of 2019

Toward the end of each year I write a post or two listing the most popular posts by category. This year the categories will be a little different. I’ll start by listing my most popular posts about cryptography this year.

The next categories will be command line tools, privacy, and math.

(When I wrote this, I started with crypto because I didn’t think I’d write any more posts on the topic. Then the announcement about RSA-240 came out and so I wrote something about it yesterday.)

New RSA factoring challenge solved

How hard is it to factor large numbers? And how secure are encryption methods based on the difficulty of factoring large numbers?

The RSA factoring challenges were set up to address these questions. Last year RSA-230 was factored, and this week RSA-240 was factored. This is a 240-digit (795-bit) number, the product of two primes.

Researchers solved two related problems at the same time, factoring RSA-240 and solving a discrete logarithm problem. Together these problems took about 4,000 core-years to solve. It’s not clear from the announcement how long it would have taken just to factor RSA-240 alone.

If you were to rent the computing power used, I imagine the cost would be somewhere in the six figures.

This makes 2048-bit and 3072-bit RSA keys look very conservative. However, the weakest link in RSA encryption is implementation flaws, not the ability to factor big numbers.

Assume for a moment that breaking RSA encryption requires factoring keys. (This may not be true in theory [*] or in practice.) How long would it take to factor a 2048-bit or 3072-bit key?

The time required to factor a number n using the number field sieve is proportional to

\exp\left( \left(\sqrt[3]{\frac{64}{9}} + o(1)\right)(\ln n)^{\frac{1}{3}}(\ln \ln n)^{\frac{2}{3}}\right)

Here o(1) roughly means terms that go away as n gets larger. (More on the notation here.) For simplicity we’ll assume we can ignore these terms.

This suggests that factoring a 2048-bit key is 12 orders of magnitude harder than factoring RSA-240, and that factoring a 3072-bit key is 18 orders of magnitude harder.
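Here is a rough sketch of that arithmetic using awk. It evaluates the exponent in the formula above with the o(1) term dropped, taking n to be about 2 raised to the number of bits, and reports the difference from RSA-240 (795 bits) in powers of ten. The function names nfs_orders and log_work are made up for illustration.

```shell
# Hypothetical helper: orders of magnitude by which the number field sieve
# estimate for a key of the given bit length exceeds that for RSA-240.
nfs_orders() {
    awk -v bits="$1" -v ref=795 '
    # Natural log of the work estimate, ignoring the o(1) term.
    function log_work(b,    ln_n) {
        ln_n = b * log(2)    # ln n for an n of about 2^b
        return (64 / 9) ^ (1 / 3) * ln_n ^ (1 / 3) * log(ln_n) ^ (2 / 3)
    }
    BEGIN { printf "%.1f\n", (log_work(bits) - log_work(ref)) / log(10) }'
}

nfs_orders 2048
nfs_orders 3072
```

This prints about 11.8 and 18.4, in line with the 12 and 18 orders of magnitude quoted above.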

However, I don’t think anyone believes that breaking RSA with 2048-bit keys would require a quadrillion core-years. If the NSA believed this, they wouldn’t be recommending that everyone move to 3072-bit keys.

Why such a large discrepancy? Here are a few reasons. As mentioned above, RSA encryption often has exploitable implementation flaws. And even if implemented perfectly, there is no proof that breaking RSA encryption is as hard as factoring. And there could be breakthroughs in factoring algorithms. And large-scale quantum computers may become practical, in which case factoring would become much easier.

***

[*] Factoring is sufficient to break RSA, but there’s no proof that it’s necessary. Michael Rabin’s variation on RSA is provably as hard to break as factoring: decryption would enable you to factor the key. But as far as I know, Rabin’s method isn’t used anywhere. Even if you know your method is as hard as factoring, maybe factoring isn’t as hard as it seems. Lower bounds on computational difficulty are much harder to obtain than upper bounds.

Distracted by the hard part

Last night I was helping my daughter with calculus homework. I told her that a common mistake was to forget what the original problem was after getting absorbed in sub-problems that have to be solved. I saw this over and over when I taught college.

Then a few minutes later, we both did exactly what I warned her against. She took the answer to a difficult sub-problem to be the final answer. I checked her work and confirmed that it was correct, until I saw we hadn’t actually answered the original question.

As I was waking up this morning, I realized I was about to make the same mistake on a client’s project. The goal was to write software to implement a function f which is a trivial composition of two other functions g and h. These two functions took a lot of work, including a couple levels of code generation. I felt I was done after testing g and h, but I forgot to write tests for f, the very thing I was asked to deliver.

This is a common pattern that goes beyond calculus homework and software development. It’s why checklists are so valuable. We resist checklists because they insult our intelligence, and yet they greatly reduce errors. Experienced people in every field can skip a step, most likely a simple step, without some structure to help them keep track.
