These notes describe awk’s regular expression flavor.
Awk supports POSIX extended regular expressions (ERE). To someone like me who is accustomed to Perl-compatible regular expressions (PCRE), “extended” regular expressions are restricted regular expressions, and POSIX “basic” regular expressions (BRE) are restricted regular expressions that require a lot of extra backslashes.
What works
I believe gawk supports POSIX extended regular expressions, but not necessarily POSIX “enhanced” regular expression features. (See the re_format man page for more on distinctions between POSIX regex features.)
If you’re coming from Perl, C#, Java, etc. then the following features of awk regular expressions work as expected:
- .
- *
- ^
- $
- […]
- [^…]
- +
- ?
- (…)
- |
- {n}, {n,}, {n,m}
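As a small illustration of the operators listed above, here is a sketch that combines anchors, an optional character, grouping, and alternation in one pattern. The sample input and the function name ere_demo are made up for this example. Interval expressions are left out because some older awk builds do not enable them by default.

```shell
# Hypothetical sample data: only lines of the form "color: red" or
# "colour: blue" should match, and the second field is printed.
ere_demo() {
  printf '%s\n' 'color: red' 'colour: blue' 'colr: green' 'shade: red' |
    awk '/^colou?r: (red|blue)$/ { print $2 }'
}

ere_demo
```

The pattern reads the same way it would in Perl: ^ and $ anchor the line, u? makes the u optional, and (red|blue) is a grouped alternation.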
The character classes \w, \W, \s, and \S work in gawk by default. They do not work if you add the -c or -P compatibility flags. (The -c flag restricts gawk to the features in Brian Kernighan’s version of awk. The -P flag restricts gawk to POSIX compatibility.)
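One way to see the difference is to run the same substitution with and without a compatibility flag. This is only a sketch: it assumes gawk is installed (the function does nothing otherwise), and class_demo is a name invented for the example.

```shell
# Replace each run of word characters with X, first in gawk's default
# mode and then in POSIX mode.
class_demo() {
  command -v gawk >/dev/null 2>&1 || return 0   # skip if gawk is absent
  # Default mode: \w is a word-character class, so both words are replaced.
  echo 'foo bar' | gawk '{ gsub(/\w+/, "X"); print }'
  # POSIX mode: per the notes above, \w should no longer act as a class
  # here. gawk may also print a warning to stderr.
  echo 'foo bar' | gawk -P '{ gsub(/\w+/, "X"); print }'
}

class_demo
```

The first line of output should be X X. The second line depends on how your gawk version treats the unrecognized escape, so it is worth checking locally.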
What doesn’t work
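I’ve read that gawk supports GNU regular expressions, but gawk does not support \b as a word boundary anchor, because awk reserves \b for the backspace character. Gawk uses \y for a word boundary instead. It does support \B, which matches the empty string between two word characters. It also supports \< for the beginning of a word and \> for the end of a word.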
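The following sketch shows whole-word matching with \< and \>. It also tries \y, which the gawk manual documents as its word boundary operator. It assumes gawk is installed, and word_demo is a hypothetical name.

```shell
# Match "cat" only as a whole word, not inside "concatenate".
word_demo() {
  command -v gawk >/dev/null 2>&1 || return 0   # skip if gawk is absent
  printf '%s\n' 'the cat sat' 'concatenate' | gawk '/\<cat\>/'
  # \y is gawk's word boundary operator (a gawk extension).
  printf '%s\n' 'the cat sat' 'concatenate' | gawk '/\ycat\y/'
}

word_demo
```

Both commands print only the line "the cat sat".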
The only character class shortcuts are the ones listed above. For example, \d for digits is not supported in gawk.
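In place of \d, a bracket expression does the same job. The sketch below shows two spellings, a plain range and the POSIX class [[:digit:]]. The input string and the name digit_demo are invented for the example, and very old awk implementations may not recognize the POSIX class.

```shell
# Replace each run of digits with N, using two equivalent bracket expressions.
digit_demo() {
  echo 'order 66 shipped 2024' | awk '{ gsub(/[0-9]+/, "N"); print }'
  echo 'order 66 shipped 2024' | awk '{ gsub(/[[:digit:]]+/, "N"); print }'
}

digit_demo
```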
There are no regex modifiers, such as /i to make a regex case-insensitive. Perl patterns that overload the question mark, such as look-arounds or comments, are not supported.
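A common workaround for the missing /i modifier is to lowercase the text before matching. This is a sketch of that approach using the standard tolower() function. It is a substitute technique, not a regex feature, and nocase_demo is a made-up name.

```shell
# Print lines that start with "error" in any mix of upper and lower case.
nocase_demo() {
  printf '%s\n' 'Error: disk full' 'all good' 'ERROR: no route' |
    awk 'tolower($0) ~ /^error/'
}

nocase_demo

# gawk-only alternative: set the IGNORECASE variable instead, e.g.
#   gawk 'BEGIN { IGNORECASE = 1 } /^error/'
```

The original line is printed unchanged because tolower() returns a lowercased copy and does not modify $0.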
Relation to sed and grep
I believe the GNU versions of sed and grep, each with the -E flag, support all the features of gawk. So you could think of awk’s regex flavor as a lowest common denominator of these tools.
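To illustrate the overlap, the sketch below feeds one pattern, restricted to plain ERE operators, to all three tools. The pattern, the sample words, and the function names are invented for the example. It assumes a sed that accepts -E, which GNU sed does.

```shell
# One ERE shared by grep -E, sed -E, and awk.
ERE='^(foo|ba+r)[0-9]?$'

sample() { printf '%s\n' foo baar7 bar12; }

with_grep() { sample | grep -E "$ERE"; }
with_sed()  { sample | sed -n -E "/$ERE/p"; }
with_awk()  { sample | awk -v re="$ERE" '$0 ~ re'; }

with_grep
with_sed
with_awk
```

Each tool should print foo and baar7 and skip bar12, since the pattern allows at most one trailing digit. Note that a pattern passed through awk -v has its backslash escapes processed once as a string, so patterns containing backslashes need extra care there.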