Regular expressions that cross lines

One of the fiddly parts of regular expressions is how to handle line breaks. Should regular expression searches be applied one line at a time, or should an entire file be treated as a single line?

This morning I was trying to track down a LaTeX file that said “discussed in the Section” rather than simply “discussed in Section.” I wanted to search on “the Section” to see whether I had a similar error in other files.

Line breaks don’t matter to LaTeX [1], so “the” could be at the end of one line and “Section” at the beginning of another. I found what I was after by using

    grep -Pzo "the\s+Section" foo.tex

Here -P tells grep to use Perl regular expressions. That’s not necessary here, but I imprinted on Perl regular expressions long ago, and I use PCRE (Perl compatible regular expressions) whenever possible so I don’t have to remember the annoying little syntax differences between various regex implementations.

The -z option says to treat the entire file as one long string. This eliminates the line break issue.

The -o option says to output only what the regular expression matches. Otherwise grep will return the matching line. Ordinarily that wouldn’t be so bad, but because of the -z option, the matching line is the entire file.

The \s+ characters between the and Section represent one or more whitespace characters, such as a space or a newline.

The -P flag is a Gnu feature, so it works on Linux. But macOS ships with BSD-derived versions of its utilities, and its version grep does not support the -P option. On my Macbook I have ggrep mapped to the Gnu version of grep.

Another option is to use ripgrep rather than grep. It uses Perl-like regular expressions, and so there is no need for anything like the -P flag. The analog of -z in ripgrep is -U, so the counterpart of the command above would be

    ripgrep -Uo "the\s+Section" foo.tex

Usually regular expression searches are so fast that execution time doesn’t matter. But when it does matter, ripgrep can be an order of magnitude faster than grep.

Related posts

[1] LaTeX decides how to break lines in the output independent of line breaks in the input. This allows you to arrange the source file logically rather than aesthetically.

2 thoughts on “Regular expressions that cross lines

  1. From this post I’ve learned the -z option. Had never seen it before, but it looks so useful for multiline grepping.

    In your example, it might also work using the boundary matching \b with the standard -E (extended regexp), e.g.:

    grep -Eo ‘\bthe\s+Section\b’ foo.tex

    The above would work equally on any Debian based and also Busybox based distros (like Alpine). Personally I use Alpine a lot on servers, and get very limited on unusual options on many linux utilities (e.g. no support for -P nor -z on grep on Busybox distros right now).

Comments are closed.