Awk's regular expression features

These notes describe awk’s regular expression flavor.

Awk supports POSIX extended regular expressions (ERE). From the perspective of someone like me who is accustomed to Perl-compatible regular expressions (PCRE), “extended” regular expressions are restricted regular expressions, and POSIX “basic” regular expressions (BRE) are restricted regular expressions that require a lot of extra backslashes.

What works

I believe gawk supports POSIX extended regular expressions, but not necessarily POSIX “enhanced” regular expression features. (See the re_format man page for more on distinctions between POSIX regex features.)

If you’re coming from Perl, C#, Java, etc. then the following features of awk regular expressions work as expected:

.
*
^
$
[…]
[^…]
+
?
(…)
|
{n}, {n,}, {n,m}
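
As a quick check of a few of the operators above, here is a minimal sketch that combines anchors, grouping, +, ?, and alternation in one pattern. The function name ere_demo and the sample lines are made up for illustration. Interval expressions such as {n} are left out because some older awk implementations do not support them.

```shell
# Keep lines that are one or more "ab" optionally followed by "c",
# or lines that start with "x".
ere_demo() {
  printf '%s\n' 'abab' 'ababc' 'xyz' 'abd' |
    awk '/^(ab)+c?$|^x/'
}

ere_demo
# prints:
# abab
# ababc
# xyz
```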

The character classes \w, \W, \s, and \S work in gawk by default. They do not work if you add the -c or -P compatibility flags. (The -c flag restricts gawk to the features in Brian Kernighan’s version of awk. The -P flag restricts gawk to POSIX compatibility.)
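
Here is a small sketch of \w and \s in gawk's default mode. It assumes gawk is installed under the name gawk; the function name word_space_word and the sample input are hypothetical.

```shell
# Match lines made of word characters, one whitespace character,
# then more word characters. \w covers letters, digits, and underscore.
word_space_word() {
  printf '%s\n' 'foo_bar 42' '!!! ???' |
    gawk '/^\w+\s\w+$/'
}

word_space_word
# prints:
# foo_bar 42
```

Running the same pattern with -c or -P is a quick way to see what the compatibility modes do with these escapes on your own system.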

What doesn’t work

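I’ve read that gawk supports GNU regular expressions, but gawk does not support \b as a word boundary anchor. In awk, \b already means the backspace character, so gawk uses \y for a word boundary instead. Gawk does support \B, which matches a position inside a word and is roughly the opposite of \y. It also supports \< for the beginning of a word and \> for the end of a word.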
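
The following sketch shows whole-word matching in gawk, first with \< and \>, then with \y, which the gawk manual documents as matching at either the beginning or the end of a word. It assumes gawk is installed; the function names and sample lines are made up for illustration.

```shell
# Lines used by both functions below.
sample_words() {
  printf '%s\n' 'cat' 'concatenate' 'cat food'
}

# Match "cat" only as a whole word, using start/end-of-word anchors.
whole_word() {
  sample_words | gawk '/\<cat\>/'
}

# The same match using gawk's \y word-boundary operator.
whole_word_y() {
  sample_words | gawk '/\ycat\y/'
}

whole_word
# prints:
# cat
# cat food
```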

The only character class shortcuts are the ones listed above. For example, \d for digits is not supported in gawk; use a bracket expression such as [0-9] or the POSIX class [[:digit:]] instead.
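
A minimal sketch of the usual substitute for \d: a plain bracket expression. The function name first_number and the sample lines are hypothetical. The POSIX class [[:digit:]] should behave the same way in awk implementations that support POSIX character classes.

```shell
# Print the first run of digits on each line that has one.
# match() sets RSTART and RLENGTH when the regex matches.
first_number() {
  printf '%s\n' 'order 66' 'no digits' 'room 101b' |
    awk 'match($0, /[0-9]+/) { print substr($0, RSTART, RLENGTH) }'
}

first_number
# prints:
# 66
# 101
```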

There are no regex modifiers, such as /i to make a regex case-insensitive. Perl patterns that overload the question mark, such as look-arounds or comments, are not supported.
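
One common workaround for the missing /i modifier is to lowercase the text before matching. The sketch below does that with tolower(); the function name ci_match and the sample lines are made up for illustration. Gawk also has an IGNORECASE variable that makes matching case-insensitive when set to a nonzero value, but that is a gawk extension rather than something to rely on in other awks.

```shell
# Case-insensitive match by lowercasing the record first.
# This works in any POSIX awk, not only gawk.
ci_match() {
  printf '%s\n' 'Error: disk full' 'all good' 'ERROR again' |
    awk 'tolower($0) ~ /error/'
}

ci_match
# prints:
# Error: disk full
# ERROR again
```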

Relation to sed and grep

I believe the GNU versions of sed and grep, each with the -E flag, support all the regex features of gawk. So you could think of awk’s regex flavor as a lowest common denominator of these tools.
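
To illustrate the overlap, here is a sketch that runs one ERE pattern through grep -E, sed -E, and awk on the same input. The helper names and the sample lines are hypothetical, and this checks only one simple pattern, not every feature.

```shell
# Shared input and pattern.
sample() { printf '%s\n' 'abab' 'ababc' 'abd'; }
pattern='^(ab)+c?$'

# The same filter written three ways.
with_grep() { sample | grep -E "$pattern"; }
with_sed()  { sample | sed -E -n "/$pattern/p"; }
with_awk()  { sample | awk -v re="$pattern" '$0 ~ re'; }

with_grep
# prints:
# abab
# ababc
```

All three functions should print the same two lines. Note that awk -v interprets backslash escapes in the assigned value, so a pattern containing backslashes would need extra care when passed this way.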