If you learned regular expressions by using a programming language like Perl or Python, you may be surprised when tools like grep seem broken. That’s because what you think of as simply regular expressions, these tools consider extended regular expressions. Tell them to search on extended regular expressions and some of your frustration will go away.
As an example, we’ll revisit a post I wrote a while back about searching for ICD-9 and ICD-10 codes with regular expressions. From that post:
Most ICD-9 diagnosis codes are just numbers, but they may also start with E or V. Numeric ICD-9 codes are at least three digits. Optionally there may be a decimal followed by one or two more digits. … Sometimes the decimals are left out.
Let’s start with the following regular expression.
[0-9]{3}\.?[0-9]{0,2}
This says to look for three instances of the digits 0 through 9, optionally followed by a literal period, followed by zero, one, or two more digits. (Since . is a special character in regular expressions, we have to use a backslash to literally match a period.)
The regular expression above will work with Perl or Python, but not with grep or sed by default. That’s because it uses two features of extended regular expressions (ERE), unescaped braces for repetition counts and ? for an optional item, but programs like grep and sed support basic regular expressions (BRE) by default.
Basic regular expressions would use \{3\} rather than {3} to match a pattern three times. So, for example,
echo 123 | grep "[0-9]\{3\}"
would return 123, but
echo 123 | grep "[0-9]{3}"
would return nothing.
Similarly,
echo 123 | sed -n "/[0-9]\{3\}/p"
would return 123, but
echo 123 | sed -n "/[0-9]{3}/p"
returns nothing.
(The -n option to sed tells it not to print every line by default. The p following the regular expression tells sed to print those lines that match the pattern. Here there’s only one line, the output of echo, but typically grep and sed would be used on files with multiple lines.)
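The same backslash convention can carry the whole ICD-9 pattern without leaving BRE. The following is a minimal sketch, not something from the original post: it assumes the optional period is written as the interval \{0,1\}, since ? is not a portable operator in basic regular expressions. The variable name bre is just an illustrative choice.

```shell
# Full ICD-9-style pattern written as a basic regular expression (BRE).
#   [0-9]\{3\}    exactly three digits
#   \.\{0,1\}     an optional literal period (interval used in place of ?)
#   [0-9]\{0,2\}  zero, one, or two more digits
bre='[0-9]\{3\}\.\{0,1\}[0-9]\{0,2\}'

# No -E option here: both tools read the pattern as BRE.
echo 123.4 | grep "$bre"
echo 123.4 | sed -n "/$bre/p"
```

Both commands should print 123.4. The pattern is harder to read than the ERE version, which is a good reason to ask for ERE support instead.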
Turning on ERE support
You can tell grep and sed that you want to use extended regular expressions by giving either one the -E option. So, for example, both
echo 123 | grep -E "[0-9]{3}"
and
echo 123 | sed -E -n "/[0-9]{3}/p"
will print 123.
You can use egrep as a synonym for grep -E, at least with GNU implementations, though recent GNU grep releases warn that egrep is obsolescent, so grep -E is the safer spelling.
Incidentally, awk uses extended regular expressions, and so
echo 123 | awk "/[0-9]{3}/"
will also print 123.
Going back to our full regular expression, using \.? for an optional period works with grep and sed if we ask for ERE support. The following commands all print 123.4.
echo 123.4 | grep -E "[0-9]{3}\.?[0-9]{0,2}"
echo 123.4 | sed -E -n "/[0-9]{3}\.?[0-9]{0,2}/p"
echo 123.4 | awk "/[0-9]{3}\.?[0-9]{0,2}/"
Without the -E option, grep and sed will not return a match.
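To see how the ERE pattern behaves on more than one line of input, here is a small sketch. The sample values in codes are made up for illustration and are not taken from any real data set. It also shows one thing worth knowing: the pattern is unanchored, so it matches any line that merely contains three digits in a row. The -x option to grep requires the whole line to match.

```shell
# Hypothetical sample input: one candidate code per line.
codes='250
250.0
414.01
12
1234567'

# Unanchored search: prints every line containing a match anywhere,
# so the seven-digit line is printed too.
printf '%s\n' "$codes" | grep -E '[0-9]{3}\.?[0-9]{0,2}'

# -x makes the pattern match the entire line, so only lines that are
# exactly three digits, an optional period, and up to two digits remain.
printf '%s\n' "$codes" | grep -E -x '[0-9]{3}\.?[0-9]{0,2}'
```

The first command should print every line except 12. The second should print only 250, 250.0, and 414.01.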
This doesn’t fix everything
At the top of the post I said that if you tell tools you want extended regular expression support “some of your frustration will go away.” The regular expression from my ICD code post was actually
\d{3}\.?\d{0,2}
rather than
[0-9]{3}\.?[0-9]{0,2}
I used the shortcut \d to denote a digit. Python and Perl will understand this, but grep will not, even with the -E option, and neither will awk, because \d is not part of POSIX extended regular expressions.
grep will understand \d if instead you use the -P option, telling it you want to use Perl-compatible regular expressions (PCRE). The GNU version of grep supports this option, but the man page says “This is experimental and grep -P may warn of unimplemented features.” I don’t know whether other implementations of grep support PCRE. And sed does not have an option to support PCRE.
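If you want something shorter to think about than [0-9] but cannot rely on -P, one option is the POSIX character class [[:digit:]], which is available in both BRE and ERE. The sketch below is my own substitution for \d, not the pattern from the ICD code post.

```shell
# [[:digit:]] is a POSIX character class; it stands in for \d here.
# Single quotes keep the shell from touching the backslash.
echo 123.4 | grep -E '[[:digit:]]{3}\.?[[:digit:]]{0,2}'
echo 123.4 | sed -E -n '/[[:digit:]]{3}\.?[[:digit:]]{0,2}/p'
```

Both commands should print 123.4. It is more typing than \d, but it does not depend on PCRE support.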