Bringing regex modifiers into the regex

Suppose you’re using a program that takes a regular expression as an argument. You didn’t get the match you expected, then you realize you’d like your search to be case-insensitive.

If you were using grep you’d go back and add a -i flag.

If you were writing a Perl script, you could add a /i at the end of the regex.

If you were using Python, you could add re.IGNORECASE as a function argument.

But the premise of the post isn’t that you’re using grep or Perl or Python. The premise is that you are using a program that takes a regular expression as an argument, and regular expression modifiers are not regular expressions per se.

However, you can incorporate regular expression modifiers into a regular expression, if your regular expression implementation supports it. In particular, you can add (?i) to a regex to indicate that the remainder of the search pattern is to be interpreted as case-insensitive. You can also use (?-i) to turn case sensitivity back on.

For example, the regex

   foo(?i)baz(?-i)quz

will make the baz portion of the expression case insensitive but the rest is case sensitive. For example, the expression will match fooBaZqux but not foobazQux.

You can also use things like (?s) and (?m) where you would use /s and /m in Perl, or re.S and re.M in Python.

These scoped pattern modifiers are not supported everywhere. They were introduced in Perl and have been adopted in other languages and applications.

Related posts

Regex to match ICD-11 code

ICD codes are diagnostic codes created by the WHO. (Three TLAs in just the opening paragraph!)

The latest version, ICD-11, went into effect in January of this year. A few countries are using ICD-11 now; it’s expected to be at least a couple years before the US moves from ICD-10 to ICD-11. (I still see ICD-9 data even though ICD-10 came out in 1994.)

One way that ICD-11 codes differ from ICD-10 codes is that the new codes do not use the letters I or O in order to prevent possible confusion with the digits 1 and 0. In the code below, “alphabetic” and “alphanumeric” implicitly exclude the letters I and O.

Another way the codes differ is the that the second character in an ICD-10 is a digit whereas the second character in an ICD-11 code is a letter.

What follows is a heavily-commented regular expression for matching ICD-11 codes, along with a few tests to show that the regex matches things it should and does not match things it should not.

Of course you could verify an ICD-11 code by searching against an exhaustive list of such codes, but the following is much simpler and may match some false positives. However, it is future-proof against false negatives: ICD-11 codes added in the future will conform to the pattern in the regular expression.

import re

icd11_re = re.compile(r"""
    ^                  # beginning of string
    [A-HJ-NP-Z0-9]     # alphanumeric
    [A-HJ-NP-Z]        # alphabetic
    [0-9]              # digit
    [A-HJ-NP-Z0-9]     # alphanumeric
    ((\.               # optional starting with .
    [A-HJ-NP-Z0-9])    # alphanumeric
    [A-HJ-NP-Z0-9]?)?  # optional further refinement
    $                  # end of string
    """, re.VERBOSE)

good = [
    "ND52",   # fracture of arm, level unspecified
    "9D00.3", # presbyopia
    "8B60.Y", # other specified increased intercranial pressure
    "DB98.7Z" # portal hypertension, unspecified
]

bad = [
    "ABCD",    # third character must be digit
    "AB3D.",   # dot must be followed by alphanumeric
    "9D0O.3",  # letter 'O' should be number 0
    "DB9872",  # missing dot
    "AB3",     # too short
    "DB90.123" # too long
]

for g in good:
    assert(icd11_re.match(g))
for b in bad:
    assert(icd11_re.match(b) == None)

Related posts

Regular expressions and successive approximation

Regular expressions can do a lot of tasks in practice that they cannot do in theory. That’s because a particular application of regular expressions comes with context and with error tolerance.

For example, much has been said about how regular expressions cannot parse HTML. This is strictly true, but it says nothing about how well your particular regular expression might work for your task.

Maybe you’re searching HTML generated by a particular person or program, not all potential HTML documents, and a regular expression can find what you’re looking for in your idiomatic subset of HTML.

Types of errors

If you’re searching for a pattern, you can err by finding something that isn’t a match or by failing to find something that is a match, false positives and false negatives.

False positives are often not a problem. A very common use for regular expressions is to filter a huge amount of text down to a small amount of text to visually inspect. Maybe your first attempt at using a regular expression to find a needle in a haystack returns a smaller haystack. Then you refine your regular expression until it returns a small number of sharp pointy things in addition to the needle you’re after.

False negatives are more of a problem. If you can’t find what you’re looking for, you generalize your regular expression until you then have a false positive problem. If you still can’t find what you’re looking for, then you have to wonder whether the thing you’re after exists. Proving a negative is hard in general.

But even false negatives may not be a problem. Maybe there are a dozen needles in your haystack, but you don’t need to find all of them; you just need to find one of them.

In my work in data privacy I’ve needed to test whether personal information is somewhere it’s not supposed to be. If my regular expression finds any personal information, then I can tell my client they have a problem. If I don’t find any, I have more work to do.

Example: call signs

The FCC has fairly complicated rules for assigning amateur radio call signs. For starters,

Each call sign has a one letter prefix (K, N, W) or a two letter prefix (AA-AL, KA-KZ, NA-NZ, WA-WZ) and a one, two, or three letter suffix separated by a numeral (0-9) indicating the geographic region.

This is enough information to craft a regular expression has no false negatives, but may have some false positives. For example, you might start with

    [A-Z]+[0-9][A-Z]+

which matches a string of capital letters, followed by a digit, followed by another string of capital letters. Maybe that’s good enough. But you could get more specific, such as

    [AKNW][A-Z]?[0-9][A-Z]{1,3}

which comes closer to duplicating the quotation above from the FCC rules: one of A, K, N, or W, optionally followed by another letter, followed by a digit, followed by between one and three letters. But you could do better:

    \b(A[A-L]|[KNW][A-Z])[0-9][A-Z]{1,3}\b

This adds the restriction that our pattern appear on word boundaries, and captures the restriction on letters that can follow ‘A’.

You could keep going, crafting ever more complicated regular expressions that reduce your false positive rate. The FCC has a lot more restrictions than the ones quoted above, and you could incorporate some of these into your expression.

But false positives are inevitable. The ultimate determinant of what is a valid call sign is the FCC database of call signs, which changes continually. You refine your expression until you hit diminishing return. If you need more precision than that, don’t use regular expressions.

Prototype vs Perfect

I get pushback every time I write about regular expressions. I try to make it clear that I’m talking about quick and useful approximations, but perhaps I should do more to emphasize that. I suppose critics of regular expressions are reacting to enthusiasts who want to use regular expressions for everything.

It’s often possible to create a 99% solution in 1/1000 of the time it would take to create a 100% solution. Whether or not 99% is good enough depends on context.

More regex posts

Word problems, logic, and regular expressions

Word problems

Suppose you have a sequence of symbols and a set of rewriting rules for replacing some patterns of symbols with others. Now you’re given two such sequences. Can you tell whether there’s a way to turn one of them into the other?

This is known as the word problem, and in general it’s undecidable. In general the problem cannot be solved by a program, but some instances can. We’ll look at a word problem that can be solved with a few regular expressions.

Modal logic

Basic modal logic has two symbols, □ (“box”) and ◇ (“diamond”), and concatenations of these symbols. In general, there are infinitely many non-equivalent sequences of boxes and diamonds, depending on the axioms of your modal logic.

In the axiom system S4, every non-empty sequence of boxes and diamonds is equivalent to one of six possibilities:

  • □◇
  • ◇□
  • □◇□
  • ◇□◇

An arbitrary sequence of boxes and diamonds can be reduced to one of the forms above by applying the following rules:

  • □ □ → □
  • ◇ ◇ → ◇
  • □◇□◇ → □◇
  • ◇□◇□ → ◇□

Regular expressions

We can apply the reduction rules above using regular expressions with the following Perl code.

    use utf8;

    $_ = "□□◇□◇◇◇◇□□";

    s/□+/□/g;
    s/◇+/◇/g; 
    s/(□◇)+/□◇/g; 
    s/(◇□)+/◇□/g;

    print;

The directive use utf8; tells Perl to be prepared for non-ASCII characters, namely boxes and diamonds. In Perl, $_ is the implicit variable; all the following substitution commands will modify this variable, and the print statement will output the final value of this variable.

The first substitution replaces one or more consecutive boxes with one box and the second does the analogous substitution for consecutive diamonds. The third and fourth substitution commands replace repetitions of □◇ or ◇□ with a single instance.

The script above outputs

□◇□

meaning that

□□◇□◇◇◇◇□□p ⟷ □◇□p

is a theorem in S4.

Word problems can’t always be solved using regular expressions, or any other programming technique, but this one could.

Related posts

Corner quotes in Unicode

In his book Mastering Regular Expressions, Jeffrey Friedl uses corner quotes to delimit regular expressions. Here’s an example I found by opening his book a random:

    ⌜(\.\d\d[1-9]?)\d*⌟

The upper-left corner at the beginning and the lower-right corner at the end are not part of the regular expression. This particularly comes in handy if a regular expression begins or ends with white space.

(It wouldn’t do to, say, use quotation marks because this would invite confusion between the regular expression itself and a quoted string used to express that regular expression in a programming language.)

I’ve thought about using Friedl’s convention but I didn’t think it could be done with plain text. It can, using Unicode character U+231C at the beginning and U+231D at the end.

There are four corner quotes:

    |------+--------+---------------------|
    | Char | Code   | Name                |
    |------+--------+---------------------|
    | ⌜    | U+231C | TOP LEFT CORNER     |
    | ⌝    | U+231D | TOP RIGHT CORNER    |
    | ⌞    | U+231E | BOTTOM LEFT CORNER  |
    | ⌟    | U+231F | BOTTOM RIGHT CORNER |
    |------+--------+---------------------|

Corner quotes are also used in logic to denote Gödel numbers, e.g. ⌜φ⌝ denotes the Gödel number for φ.

Corner quotes are also known as Quine quotes. They usually come in the pair top left and top right, rather than top left and bottom right as in Friedl’s usage.

Update: As Rob Wells points out in the comments, it seems Friedl used CJK quote marks 「 (U+300C) and 」 (U+300D) rather than the corner quotes, which makes sense given that Friedl speaks Japanese.

Related posts

Is fast grep faster?

The grep utility searches text files for regular expressions, but it can search for ordinary strings since these strings are a special case of regular expressions. However, if your regular expressions are in fact simply text strings, fgrep may be much faster than grep. Or so I’ve heard. I did some benchmarks to see.

Strictly speaking I used grep -F rather than fgrep. On Linux, if you ask for the man (manual) page for fgrep you’ll be taken to the man page for grep which says

In addition, the variant programs egrep, fgrep and rgrep are the same as grep -E, grep -F, and grep -r, respectively. These variants are deprecated, but are provided for backward compatibility.

I was working on a project for a client where I had to search for a long list of words in a long list of files [1]. This is the kind of task where fgrep (“fast grep”) is supposed to be much faster than grep. It was a tiny bit faster, not enough to notice. When I timed it the difference was on the order of 1%.

I ran an analogous search on my own computer with different data and got similar results [2]. There may be instances where fgrep is much faster than grep, but I haven’t seen one first hand.

I suspect that the performance difference between fgrep and grep used to be larger, but the latter has gotten more efficient. Now grep is smart enough to search for strings quickly without having to be told explicitly via -F that the regular expressions are in fact strings. Maybe it scans the regular expression(s) before searching and effectively sets the -F flag itself if appropriate.

Related posts

[1] I used the -f to tell grep the name of a file containing the terms to search for, not to be confused with the additional flag -F to tell grep that the search terms are simply strings.

[2] I got similar results when I was using Linux (WSL) on my computer. When I used grep from GOW the -F flag made the search 24 times faster. Because the GOW project provides light-weight ports of Gnu tools to Windows, it’s understandable that it would not include some of the optimizations in Gnu’s implementation of grep.

tcgrep: grep rewritten in Perl

In The Perl Cookbook, Tom Christiansen gives his rewrite of the Unix utility grep that he calls tcgrep. You don’t have to know Perl to use tcgrep, but you can send it Perl regular expressions.

Why not grep with PCRE?

You can get basically the same functionality as tcgrep by using grep with its PCRE option -P. Since tcgrep searches directories recursively, a more direct comparison would be

    grep -R -P

However, your version of grep might not support -P. And if it does, its Perl-compatible regular expressions might not be completely Perl-compatible. The man page for grep on my machine says

    -P, --perl-regexp
        Interpret the pattern as a Perl-compatible regular 
        expression (PCRE). This is experimental and grep -P 
        may warn of unimplemented features.

The one implementation of regular expressions guaranteed to be fully Perl-compatible is Perl.

If the version of grep on your system supports the -P option and is adequately Perl-compatible, it will run faster than tcgrep. But if you find yourself on a computer that has Perl but not a recent version of grep, you may find tcgrep handy.

Installation

tcgrep is included as part of the Unicode::Tussle Perl module; since tcgrep is a wrapper around Perl, it is as Unicode-compliant as Perl is. So you could install tcgrep (and several more utilities) with

    cpan Unicode::Tussle

This worked for me on Linux without any issues but the install failed on Windows.

I installed tcgrep on Windows by simply copying the source code. (I don’t recall now where I found the source code. I didn’t see it this morning when I searched for it, but I imagine I could have found it if I’d been more persistent.) I commented out the definition of %Compress to disable searching inside compressed files since this feature required Unix utilities not available on Windows.

Consistency

Another reason to use tcgrep is consistency. Perl is criticized for being inconsistent. The Camel book itself says

In general, Perl functions do exactly what you want—unless you want consistency.

But Perl’s inconsistencies are different, and in my opinion less annoying, than the inconsistencies of Unix tools.

Perl is inconsistent in the sense that functions behave differently in different contexts, such as a scalar context or a list context.

Unix utilities are inconsistent across platforms and across tools. For example, a tool like sed will have different features on different platforms, and it will not support the same regular expressions as another tool such as awk.

Perl was written to be a “portable distillation of Unix culture.” As inconsistent as Perl is, it’s more consistent that Unix.

Related posts

Finding pi in pi with Perl

Here’s a frivolous problem whose solution illustrates three features of Perl:

  1. Arbitrary precision floating point
  2. Lazy quantifiers in regular expressions
  3. Returning the positions of matched groups.

Our problem is to look for the digits 3, 1, 4, and 1 in the decimal part of π.

First, we get the first 100 digits of π after the decimal as a string. (It turns out 100 is enough, but if it weren’t we could try again with more digits.)

    use Math::BigFloat "bpi";

    $x = substr bpi(101)->bstr(), 2;

This loads Perl’s extended precision library Math::BigFloat, gets π to 101 significant figures, converts the result to a string, then lops off the first two characters “3.” at the beginning leaving “141592…”.

Next, we want to search our string for a 3, followed by some number of digits, followed by a 1, followed by some number of digits, followed by a 4, followed by some number of digits, and finally another 1.

A naive way to search the string would be to use the regex /3.*1.*4.*1/. But the star operator is greedy: it matches as much as possible. So the .* after the 3 would match as many characters as possible before backtracking to look for a 1. But we’d like to find the first 1 after a 3 etc.

The solution is simple: add a ? after each star to make the match lazy rather than greedy. So the regular expression we want is

   /3.*?1.*?4.*?1/

This will tell us whether our string contains the pattern we’re after, but we’d like to also know where the string contains the pattern. So we make each segment a captured group.

   /(3.*?)(1.*?)(4.*?)(1)/

Perl automatically populates an array @- with the positions of the matches, so it has the information we’re looking for. Element 0 of the array is the position of the entire match, so it is redundant with element 1. The advantage of this bit of redundancy is that the starting position of group $1 is in the element with index 1, the starting position of $2 is at index 2, etc.

We use the shift operator to remove the redundant first element of the array. Since shift modifies its argument, we can’t apply it directly to the constant array @-, so we apply it to a copy.

    if ($x =~ /(3.*?)(1.*?)(4.*?)(1)/) {
        @positions = @-;
        shift  @positions;
        print "@positions\n";
    }

This says that our pattern appears at positions 8, 36, 56, and 67. Note that these are array indices, and so they are zero-based. So if you count from 1, the first 3 appears in the 9th digit etc.

To verify that the digits at these indices are 3, 1, 4, and 1 respectively, we make the digits into an array, and slice the array by the positions found above.

    @digits = split(//, $x);
    print "@digits[@positions]\n";

This prints 3 1 4 1 as expected.

Python triple quote strings and regular expressions

There are several ways to quote strings in Python. Triple quotes let strings span multiple lines. Line breaks in your source file become line break characters in your string. A triple-quoted string in Python acts something like “here doc” in other languages.

However, Python’s indentation rules complicate matters because the indentation becomes part of the quoted string. For example, suppose you have the following code outside of a function.

x = """\
abc
def
ghi
"""

Then you move this into a function foo and change its name to y.

def foo():
    y = """\
    abc
    def
    ghi
    """

Now x and y are different strings! The former begins with a and the latter begins with four spaces. (The backslash after the opening triple quote prevents the following newline from being part of the quoted string. Otherwise x and y would begin with a newline.) The string y also has four spaces in front of def and four spaces in front of ghi. You can’t push the string contents to the left margin because that would violate Python’s formatting rules. (Update: Oh yes you can! See Aaron Meurer’s comment below.)

We now give three solutions to this problem.

Solution 1: textwrap.dedent

There is a function in the Python standard library that will strip the unwanted space out of the string y.

import textwrap 

def foo():
    y = """\
    abc
    def
    ghi
    """
    y = textwrap.dedent(y)

This works, but in my opinion a better approach is to use regular expressions [1].

Solution 2: Regular expression with a flag

We want to remove white space, and the regular expression for a white space character is \s. We want to remove one or more white spaces so we add a + on the end. But in general we don’t want to remove all white space, just white space at the beginning of a line, so we stick ^ on the front to say we want to match white space at the beginning of a line.

import re 

def foo():
    y = """\
    abc
    def
    ghi
    """
    y = re.sub("^\s+", "", y)

Unfortunately this doesn’t work. By default ^ only matches the beginning of a string, not the beginning of a line. So it will only remove the white space in front of the first line; there will still be white space in front of the following lines.

One solution is to add the flag re.MULTILINE to the substitution function. This will signal that we want ^ to match the beginning of every line in our multi-line string.

    y = re.sub("^\s+", "", y, re.MULTILINE)

Unfortunately that doesn’t quite work either! The fourth positional argument to re.sub is a count of how many substitutions to make. It defaults to 0, which actually means infinity, i.e. replace all occurrences. You could set count to 1 to replace only the first occurrence, for example. If we’re not going to specify count we have to set flags by name rather than by position, i.e. the line above should be

    y = re.sub("^\s+", "", y, flags=re.MULTILINE)

That works.

You could also abbreviate re.MULTILINE to re.M. The former is more explicit and the latter is more compact. To each his own. There’s more than one way to do it. [2]

Solution 3: Regular expression with a modifier

In my opinion, it is better to modify the regular expression itself than to pass in a flag. The modifier (?m) specifies that in the rest of the regular the ^ character should match the beginning of each line.

    y = re.sub("(?m)^\s+", "", y)

One reason I believe this is better is that moves information from a language-specific implementation of regular expressions into a regular expression syntax that is supported in many programming languages.

For example, the regular expression

    (?m)^\s+

would have the same meaning in Perl and Python. The two languages have the same way of expressing modifiers [3], but different ways of expressing flags. In Perl you paste an m on the end of a match operator to accomplish what Python does with setting flasgs=re.MULTILINE.

One of the most commonly used modifiers is (?i) to indicate that a regular expression should match in a case-insensitive manner. Perl and Python (and other languages) accept (?i) in a regular expression, but each language has its own way of adding modifiers. Perl adds an i after the match operator, and Python uses

    flags=re.IGNORECASE

or

    flags=re.I

as a function argument.

More on regular expressions

[1] Yes, I’ve heard the quip about two problems. It’s funny, but it’s not a universal law.

[2] “There’s more than one way to do it” is a mantra of Perl and contradicts The Zen of Python. I use the line here as a good-natured jab at Python. Despite its stated ideals, Python has more in common with Perl than it would like to admit and continues to adopt ideas from Perl.

[3] Python’s re module doesn’t support every regular expression modifier that Perl supports. I don’t know about Python’s regex module.

Searching Greek and Hebrew with regular expressions

According to the Python Cookbook, “Mixing Unicode and regular expressions is often a good way to make your head explode.” It is thus with fear and trembling that I dip my toe into using Unicode with Greek and Hebrew.

I heard recently that there are anomalies in the Hebrew Bible where the final form of a letter is deliberately used in the middle of a word. That made me think about searching for such anomalies with regular expressions. I’ll come back to that shortly, but I’ll start by looking at Greek where things are a little simpler.

Greek

Only one letter in Greek has a variant form at the end of a word. Sigma is written ς at the end of a word and σ everywhere else. The following Python code shows we can search for sigmas and final sigmas in the first sentence from Matthew’s gospel.

    import re

    matthew = "Κατάλογος του γένους του Ιησού Χριστού, γιου του Δαυείδ, γιου του Αβραάμ"
    matches = re.findall(r"\w+ς\b", matthew)
    print("Words ending in sigma:", matches)

    matches = re.findall(r"\w*σ\w+", matthew)
    print("Words with non-final sigmas:", matches)

This prints

    Words ending in sigma: ['Κατάλογος', 'γένους']
    Words with non-final sigmas: ['Ιησού', 'Χριστού']

This shows that we can use non-ASCII characters in regular expressions, at least if they’re Greek letters, and that the metacharacters \w (word character) and \b (word boundary) work as we would hope.

Hebrew

Now for the motivating example. Isaiah 9:6 contains the word “לםרבה” with the second letter (second from right) being the final form of Mem, ם, sometimes called “closed Mem” because the form of the letter ordinarily used in the middle of a word, מ, is not a closed curve [1].

Here’s code to show we could find this anomaly with regular expressions.

    # There are five Hebrew letters with special final forms.
    finalforms = "\u05da\u05dd\u05df\u05e3\u05e5" # ךםןףץ

    lemarbeh = "\u05dc\u05dd\u05e8\u05d1\u05d4" # לםרבה
    m = re.search(r"\w*[" + finalforms + "\w+", lemarbeh)
    if m:
        print("Anomaly found:", m.group(0))

As far as I know, the instance discussed above is the only one where a final letter appears in the middle of a word. And as far as I know, there are no instances of a non-final form being used at the end of a word. For the next example, we will put non-final forms at the end of words so we can find them.

We’ll need a list of non-final forms. Notice that the Unicode value for each non-final form is 1 greater than the corresponding final form. It’s surprising that the final forms come before the non-final forms in numerical (Unicode) order, but that’s how it is. The same is true in Greek: final sigma is U+03c2 and sigma is U+03c3.

    finalforms = "\u05da\u05dd\u05df\u05e3\u05e5" # ךםןףץ
    nonfinal   = "\u05db\u05de\u05e0\u05e4\u05e6" # כמנפצ

We’ll start by taking the first line of Genesis

    genesis = "בראשית ברא אלהים את השמים ואת הארץ"

and introduce errors by replacing each final form with its corresponding non-final form. The code uses a somewhat obscure feature of Python and is analogous to the shell utility tr.

    genesis_wrong = genesis.translate(str.maketrans(finalforms, nonfinal))

(The string method translate does what you would expect, except that it takes a translation table rather than a pair of strings as its argument. The maketrans method creates a translation table from two strings.)

Now we can find our errors.

    anomalies = re.findall(r"\w+[" + nonfinal + r"]\b", genesis_wrong)
    print("Anomalies:", anomalies)

This produced

    Anomalies: ['אלהימ', 'השמימ', 'הארצ']

Note that Python printed the letters left-to-right.

Vowel points

The text above did not contain vowel points. If the text does contain vowel points, these are treated as letters between the consonants.

For example, the regular expression “בר” will not match against “בְּרֵאשִׁית” because the regex engine sees two characters between ב and ר, the dot (“dagesh”) inside ב and the two dots (“sheva”) underneath ב. But the regular expression “ב..ר” does match against “בְּרֵאשִׁית”.

See footnote [1].

Python

I started this post with a warning from the Python Cookbook that mixing regular expressions and Unicode can cause your head to explode. So far my head intact, but I’ve only tested the waters. I’m sure there are dragons in the deeper end.

I should include the rest of the book’s warning. I used the default library re that ships with Python, but the authors recommend if you’re serious about using regular expressions with Unicode,

… you should consider installing the third-party regex library which provides full support for Unicode case folding, as well as a variety of other interesting features, including approximate matching.

Related posts

[1] When I refer to “בר” as a regular expression, I mean the character sequence ב followed by ר, which your browser will display from right to left because it detects it as characters from a language written from right to left. Written in terms of Unicode code points this would be “\u05d1\u05e8”, the code point for ב followed by the code point for ר.

Similarly, the second regular expression could be written “\u05d1..\u05e8”. The two periods in the middle are just two instances of the regular expression symbol that matches any character. It’s a coincidence that dots (periods) are being used to match dots (the dagesh and the sheva).