Efficiency of regular expressions

I’ve never optimized a regular expression.  I typically use regular expressions in scripts where efficiency doesn’t matter. And sometimes I do some regular expression processing as part of a larger program in which the bottleneck is somewhere else. But I’ve never worried about the efficiency of a regular expression.

Regular expression efficiency can matter. There are some regular expressions that can be astonishingly slow to match with some regular expression implementations. Russ Cox gives an example of a regular expression that takes Perl a minute to match against a string that’s only 29 characters long. Another regular expression implementation does the same match six orders of magnitude faster.

The example is contrived to show the difference between the approaches. I’m not sure whether I’ve ever run into a horribly inefficient regular expression on accident. Maybe once. I imagine people run into efficiency issues in practical applications, though I don’t.

I’d suggest worrying about regex efficiency if and when it becomes a problem. If your experience matches mine, that need may never come.

But if you do have a regex efficiency problem, or simply find regex implementation details interesting, I’d recommend Russ Cox’s article.

Related links

Notes on using regular expressions

6 thoughts on “Efficiency of regular expressions

  1. Obviously, for an adhoc program with a short handwritten regex, if it works; it works.

    However, I have encountered these corner cases in practice, and they’re surprisingly hard to work around. Frankly, it’s a pretty shameful state of affairs; this is a region of computer science that’s actually well understood and has a decent solution – albeit one that Perl’s syntax extensions make impossible to fully implement and onerous to mostly implement. It’s not surprising such mistakes were made; it’s surprising major software vendors such as microsoft and oracle/sun have perpetuated this problem by standardizing it in their libraries.

    In any case, it’s a real problem when Regexes grow – it’s not at all easy to understand which segment of a regex is responsible for a slowdown when one occurs in a large regex. And if you attempt to machine-generate regexes you’re pretty much out of luck – it’s probably less work to implement your own FSA matcher than it is to reliably avoid all the corner cases; particularly if there’s a risk of hostile user input.

  2. Aleksander Balicki

    Learn about automata and regular languages theory, all the questions will be answered. You can’t put all regexes from perl in one box, because the whole class of perl regexps recognizes recursively enumerable languages, but if you limit it to certain subset (concatenation, star,alternative and all the things you can create from it like +, [sets] etc) you’ll get a regular language, which is linear time and constant space solvable (very fast, as fast as input reading)

  3. I think there’s an opportunity on the web for a “help me write this regex” service. There are so many time when I would like nothing better than to pay someone 99c to write a regex for me.

  4. I ran across a regex problem years ago using one the email modes within Emacs. On occasion Emacs would appear to hang when displaying a single short email message. It turned out regular expressions were used to parse the email and if the message contained a line of hyphens it would bog down in the regex engine. I experimented and found that fewer that 10 hyphens was tolerable, but if you had 20 it would take a minute and if you had 80, then who knows. I didn’t wait around to find out. If you’re using a late 90’s Emacs mail mode to read this blog, then good luck!
    ——————————————————————————–
    Alex

Comments are closed.