Why are regular expressions difficult?

Regular expressions are challenging, but not for the reasons commonly given.

Non-reasons

Here are some reasons given for the difficulty of regular expressions that I don’t agree with.

Cryptic syntax

I think complaints about cryptic syntax miss the mark. Some people say that Greek is hard to learn because it uses a different alphabet. If that were the only difficulty, you could easily learn Greek in a couple of days. No, Greek is difficult for English speakers to learn because it is a very different language than English. The differences go much deeper than the alphabet, and in fact the alphabets are not entirely different.

The basic symbol choices for regular expressions — . to match any character, ? to denote that something is optional, etc. — were arbitrary, but any choice would be. As it is, the chosen symbols are sometimes mnemonic, or at least consistent with notation conventions in other areas.

Density

Regular expressions are dense. This makes them hard to read, but not in proportion to the information they carry. Certainly 100 characters of regular expression syntax is harder to read than 100 consecutive characters of ordinary prose or 100 characters of C code. But if a typical pattern described by 100 characters of regular expression were expanded into English prose or C code, the result would be hard to read as well, not because it is dense but because it is verbose.

Crafting expressions

The heart of using regular expressions is looking at a pattern and crafting a regular expression to match that pattern. I don’t think this is difficult, especially when you have some error tolerance. Very often in applications of regular expressions it’s OK to have a few false positives, as long as a human scans the output. For example, see this post on looking for ICD-10 codes.
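Here is a minimal sketch of what error tolerance looks like in practice. The pattern is a deliberately loose stand-in that I am assuming for illustration (a capital letter, two digits, and an optional decimal part); it is not the pattern from the post mentioned above and not a full description of the ICD-10 format. The sample text is made up.

```python
import re

# Hypothetical, deliberately loose pattern: a capital letter, two digits,
# and optionally a period followed by one to four digits.
loose_code = re.compile(r"\b[A-Z]\d{2}(?:\.\d{1,4})?\b")

# Made-up sample text for illustration.
note = "Dx: J45.909, E11.9; patient takes vitamin B12."

candidates = loose_code.findall(note)
print(candidates)
```

The pattern picks up the two code-shaped strings and also "B12", which is a vitamin and not a diagnosis. A person scanning the output would discard it in a moment, which is often cheaper than making the pattern strict enough to exclude it.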

I suspect that many people who think that writing regular expressions is difficult actually find some peripheral issue difficult, not the core activity of describing patterns per se.

Reasons

Now for what I believe are the reasons why regular expressions are difficult.

Overloaded syntax and context

Regular expressions use a small set of symbols, and so some of these symbols do double duty. For example, symbols take on different meanings inside and outside of character classes. (See point #4 here.) Extensions to the basic syntax are worse. People wanting to add new features to regular expressions (look-behind, comments, named matches, etc.) had to come up with syntax that wouldn’t conflict with earlier usage, which meant strange combinations of symbols that would have previously been illegal.
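A short sketch of this overloading, using Python's re module as one example dialect. The sample strings and variable names are my own, chosen only for illustration.

```python
import re

text = "banana"

# Outside a character class, ^ anchors the match at the start of the string.
caret_outside = re.findall(r"^a", text)    # "banana" starts with "b"
# As the first character inside a class, ^ negates the class.
caret_inside = re.findall(r"[^a]", text)   # every character that is not "a"

# Outside a class, . matches any character except a newline (by default).
dot_outside = re.findall(r".", "a.b")
# Inside a class, . is just a literal period.
dot_inside = re.findall(r"[.]", "a.b")

# ? on its own makes the previous item optional ...
optional = re.fullmatch(r"colou?r", "color") is not None
# ... after a quantifier it makes that quantifier lazy ...
lazy = re.match(r"<.+?>", "<a><b>").group()
greedy = re.match(r"<.+>", "<a><b>").group()
# ... and "(?" opens an extension. Here (?<=\$) is a look-behind
# and (?P<amount>...) is a named group.
m = re.search(r"(?<=\$)(?P<amount>\d+)", "cost: $42")
amount = m.group("amount")
```

A parenthesis followed by a question mark would have been an error in the original syntax, since there is nothing before the ? to make optional. That is exactly why it was available for extensions.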

Dialects

If you use regular expressions in multiple programming languages, you’ll run into numerous slight variations. Can I write \d for a digit, or do I need to write [0-9]? If I want to group a subexpression, do I need to put a backslash in front of the parentheses? Can I write non-capturing groups?

These variations are difficult to remember, not because they’re completely different, but because they’re so similar. It reminds me of my French teacher saying “Does literature have a double t in English and one t in French, or the other way around? I can never remember.”
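As an illustration, here is how one dialect, Python's re module, answers the three questions above. The comments about other tools reflect my understanding of POSIX basic regular expressions; check the documentation for whatever tool you are using.

```python
import re

# In Python, \d applied to a str matches any Unicode decimal digit,
# so it is broader than [0-9] unless the re.ASCII flag is given.
arabic_indic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
d_unicode = re.fullmatch(r"\d", arabic_indic_three) is not None
d_ascii = re.fullmatch(r"\d", arabic_indic_three, flags=re.ASCII) is not None
class_09 = re.fullmatch(r"[0-9]", arabic_indic_three) is not None

# In Python, bare parentheses group and \( matches a literal parenthesis.
# POSIX basic regular expressions (the default in grep and sed) do the reverse.
grouped = re.match(r"(ab)+", "ababab").group()
literal = re.match(r"\(ab\)", "(ab)").group()

# Python supports non-capturing groups: (?:...) groups without capturing.
capturing = re.match(r"(ab)+c", "ababc").groups()
non_capturing = re.match(r"(?:ab)+c", "ababc").groups()
```

So even the question of whether \d and [0-9] are interchangeable depends on the dialect and on the flags in effect.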

Use

It’s difficult to remember the variations on expression syntax in various programming languages, but I find it even more difficult to remember how to use the expressions. If you want to replace all instances of some regular expression with a string, the way to do that could be completely different in two languages, even if the languages use the exact same dialect of regular expressions.
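For instance, here is a sketch of replacement in Python. The date pattern and sample text are my own, chosen for illustration.

```python
import re

text = "2019-06-20 and 2021-09-29"
date = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# sub replaces every non-overlapping match by default.
all_replaced = date.sub("DATE", text)
# The count argument limits the number of replacements.
first_only = date.sub("DATE", text, count=1)
# Captured groups are written \1, \2, ... (or \g<1>) in the replacement.
reordered = date.sub(r"\3/\2/\1", text)

print(all_replaced)
print(first_only)
print(reordered)
```

Other languages make different choices even where the pattern itself would be the same. In JavaScript, for example, replace with a regular expression changes only the first match unless the expression carries the g flag, and captured groups are written $1 rather than \1 in the replacement string.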

Resources

Here are notes on regular expressions I’ve written over the years, largely for my own reference.

3 thoughts on “Why are regular expressions difficult?”

Steve

20 June 2019 at 09:32

I think the main reason many people find them hard is the combination of many of the reasons you gave above: cryptic syntax, density, etc. COMBINED WITH the fact that (in general) you don’t use them all that often. This leads to going back and re-learning whenever you need to craft an R.E., because you forgot it all in the interim.

Joost

29 September 2021 at 14:40

The article is spot on, except for one more reason: there are at least three different variations of regular expressions. For systems integrators, who don’t really have a favourite language/system to work with, this is a terrible source of headaches. There is GNU, perl, posix, original and what more…

Lawrence San

20 July 2022 at 15:07

Strongly agree/identify with Steve’s comment (way back in 2019). I have had to relearn regexps six times, after forgetting them each time, simply because there are such long gaps between my uses. Each time, it gets easier to relearn, however.
