Regular expressions and successive approximation

Regular expressions can do a lot of tasks in practice that they cannot do in theory. That’s because a particular application of regular expressions comes with context and with error tolerance.

For example, much has been said about how regular expressions cannot parse HTML. This is strictly true, but it says nothing about how well your particular regular expression might work for your task.

Maybe you’re searching HTML generated by a particular person or program, not all potential HTML documents, and a regular expression can find what you’re looking for in your idiomatic subset of HTML.

Types of errors

If you’re searching for a pattern, you can err by finding something that isn’t a match or by failing to find something that is a match, false positives and false negatives.

False positives are often not a problem. A very common use for regular expressions is to filter a huge amount of text down to a small amount of text to visually inspect. Maybe your first attempt at using a regular expression to find a needle in a haystack returns a smaller haystack. Then you refine your regular expression until it returns a small number of sharp pointy things in addition to the needle you’re after.

False negatives are more of a problem. If you can’t find what you’re looking for, you generalize your regular expression until you then have a false positive problem. If you still can’t find what you’re looking for, then you have to wonder whether the thing you’re after exists. Proving a negative is hard in general.

But even false negatives may not be a problem. Maybe there are a dozen needles in your haystack, but you don’t need to find all of them; you just need to find one of them.

In my work in data privacy I’ve needed to test whether personal information is somewhere it’s not supposed to be. If my regular expression finds any personal information, then I can tell my client they have a problem. If I don’t find any, I have more work to do.

Example: call signs

The FCC has fairly complicated rules for assigning amateur radio call signs. For starters,

Each call sign has a one letter prefix (K, N, W) or a two letter prefix (AA-AL, KA-KZ, NA-NZ, WA-WZ) and a one, two, or three letter suffix separated by a numeral (0-9) indicating the geographic region.

This is enough information to craft a regular expression has no false negatives, but may have some false positives. For example, you might start with

    [A-Z]+[0-9][A-Z]+

which matches a string of capital letters, followed by a digit, followed by another string of capital letters. Maybe that’s good enough. But you could get more specific, such as

    [AKNW][A-Z]?[0-9][A-Z]{1,3}

which comes closer to duplicating the quotation above from the FCC rules: one of A, K, N, or W, optionally followed by another letter, followed by a digit, followed by between one and three letters. But you could do better:

    \b(A[A-L]|[KNW][A-Z])[0-9][A-Z]{1,3}\b

This adds the restriction that our pattern appear on word boundaries, and captures the restriction on letters that can follow ‘A’.

You could keep going, crafting ever more complicated regular expressions that reduce your false positive rate. The FCC has a lot more restrictions than the ones quoted above, and you could incorporate some of these into your expression.

But false positives are inevitable. The ultimate determinant of what is a valid call sign is the FCC database of call signs, which changes continually. You refine your expression until you hit diminishing return. If you need more precision than that, don’t use regular expressions.

Prototype vs Perfect

I get pushback every time I write about regular expressions. I try to make it clear that I’m talking about quick and useful approximations, but perhaps I should do more to emphasize that. I suppose critics of regular expressions are reacting to enthusiasts who want to use regular expressions for everything.

It’s often possible to create a 99% solution in 1/1000 of the time it would take to create a 100% solution. Whether or not 99% is good enough depends on context.

More regex posts