LLMs and regular expressions

Posted on 15 December 2024 by John

Yesterday I needed to write a regular expression as part of a client report. Later I was curious whether an LLM could have generated an equivalent expression.

When I started writing the prompt, I realized it wasn’t trivial to tell the LLM what I wanted. I needed some way to describe the pattern that the expression should match.

“Hmm, what’s the easiest way to describe a text pattern? I know: use a regular expression! Oh wait, …”

Prompt engineering and results

I described the pattern in words, which was more difficult than writing the regular expression. The LLM came up with a valid regular expression, along with sample code demonstrating its use, but the expression wasn’t quite right. After a couple more nudges I managed to get it to produce a correct regex.

I had asked for a Perl regular expression, and the LLM did generate syntactically correct Perl [1], both for the regex and the sample code. When I asked it to convert the regex to POSIX form it did so, and when I asked it to convert the regex to Python it did that as well, replete with valid test code.

I repeated my experiment using three different LLMs and got similar results. In all cases, the hardest part was specifying what I wanted. Sometimes the output was correct given what I asked for but not what I intended, a common experience since the dawn of computers. It was easier to translate a regex from one syntax flavor to another than to generate a correct regex, easier for both me and the computer: it was easier for me to generate a prompt and the LLM did a better job.

Quality assurance for LLMs

Regular expressions and LLMs are complementary. The downside of regular expressions is that you have to specify exactly what you want. The upside is that you can specify exactly what you want. We’ve had several projects lately in which we tested the output of a client’s model using regular expressions and found problems. Sometimes it takes a low-tech tool to find problems with a high-tech tool.

We’ve also tested LLMs using a different LLM. That has been useful because there’s some degree of independence. But we’ve gotten better results using regular expressions since there is a greater degree of independence.

[1] Admittedly that’s a low bar. There’s an old joke that Perl was created by banging on a keyboard then hacking on the compiler until the input compiled.

One-liner to troubleshoot LaTeX references

Posted on 27 May 2024 by John

In LaTeX, sections are labeled with commands like \label{foo} and referenced like \ref{foo}. Referring to sections by labels rather than hard-coded numbers allows references to automatically update when sections are inserted, deleted, or rearranged.

For every reference there ought to be a label. A label without a corresponding reference is fine, though it might be a mistake. If you have a reference with no corresponding label, and one label without a reference, there’s a good chance the reference is a typo variation on the unreferenced label.

We’ll build up a one-liner for comparing labels and references. We’ll use grep to find patterns that look like labels by searching for label{ followed by any string of characters up to but not including a closing brace. We don’t want the label{ part, just what follows it, so we’ll use look-behind syntax to exclude it from the match.

Here’s our regular expression:

    (?<=label{)[^}]+

We’re using Perl-style look-behind syntax, so we’ll need to give grep the -P option. Also, we only want the match itself, not matching lines, so we’ll also use the -o option. This will print all the labels:

    grep -oP '(?<=label{)[^}]+' foo.tex

The regex for finding references is the same with label replaced with ref.

To compare the list of labels and the list of references, we’ll use the comm command. For more on comm, see Set theory at the command line.

We could save the labels to a file, save the references to a file, and run comm on the two files. But we’re more interested in the differences between the two lists than the two lists, so we could pass both as streams to comm using the <(...) syntax. Finally, comm assumes its inputs are sorted so we pipe the output of both grep commands to sort.

Here’s our one-liner:

    comm <(grep -oP '(?<=label{)[^}]+' foo.tex | sort) \
         <(grep -oP '(?<=ref{)[^}]+' foo.tex | sort)

This will produce three sections of output: labels which are not references, references which are not labels, and labels that are also references.

If you just want to see references that don’t refer to a label, give comm the option -13. This suppresses the first and third sections of output, leaving only the second section, references that are not labels.

You can also add a -u option (u for unique) to the calls to sort to suppress multiple instances of the same label or same reference.
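If grep -P isn’t available, the same comparison can be sketched in Python. This is only a rough equivalent of the one-liner above: the function name and the sample LaTeX are made up for illustration, and sets stand in for sort and comm.

```python
import re

def compare_labels_and_refs(tex):
    """Return labels and references found in LaTeX source, split three ways."""
    # Same look-behind patterns as the grep commands, with the brace escaped.
    labels = set(re.findall(r"(?<=label\{)[^}]+", tex))
    refs = set(re.findall(r"(?<=ref\{)[^}]+", tex))
    return {
        "labels_only": sorted(labels - refs),  # labels never referenced
        "refs_only": sorted(refs - labels),    # references with no label
        "both": sorted(labels & refs),         # labels that are referenced
    }

# Hypothetical sample input; the second reference contains a typo.
sample = r"""
\section{Introduction}\label{sec:intro}
\section{Methods}\label{sec:methods}
See Section \ref{sec:intro} and Section \ref{sec:method}.
"""

print(compare_labels_and_refs(sample))
```

Using sets removes duplicates automatically, which plays the role of the -u option to sort.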

Regex to match SWIFT-BIC codes

Posted on 29 January 2024 by John

A SWIFT-BIC number identifies a bank, not a particular bank account. The BIC part stands for Bank Identifier Code.

I had to look up the structure of SWIFT-BIC codes recently, and here it is:

Four letters to identify the bank
Two letters to identify the country
Two letters or digits to identify the location
Optionally, three letters or digits to identify a branch

Further details are given in the ISO 9362 standard.

We can use this as an example to illustrate several regular expression features, and how regular expressions are used in practice.

Regular expressions

If your regular expression flavor supports listing a number of repetitions in braces, you could write the above format as

    [A-Z]{6}[A-Z0-9]{2,5}

This would work, for example, with egrep but not with grep. YMMV.

That’s concise, but a little too permissive. It allows anywhere from 2 to 5 alphanumeric characters on the end. But the standard says 2 or 5 alphanumeric characters after the country code, not between 2 and 5. For example, 3 characters after the country code would not be valid. So we could reduce our false positive rate a little by changing the regex to

    [A-Z]{6}[A-Z0-9]{2}([A-Z0-9]{3})?$

Without the dollar sign on the end, ABCDEF12X would still match because the part of the regex up to the optional ([A-Z0-9]{3})? at the end would match at the beginning of the string. The dollar sign marks the end of the string, so it says the code has to end either after 8 or 11 characters and stop.
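Here’s a small Python sketch of the effect of the dollar sign. The test strings are made up, and I’m using match, which only looks for a match starting at the beginning of the string; that choice is mine and not part of the regex itself.

```python
import re

# The same expression without and with the trailing dollar sign.
without_dollar = re.compile(r"[A-Z]{6}[A-Z0-9]{2}([A-Z0-9]{3})?")
with_dollar = re.compile(r"[A-Z]{6}[A-Z0-9]{2}([A-Z0-9]{3})?$")

def matches(pattern, s):
    """True if the pattern matches starting at the beginning of s."""
    return pattern.match(s) is not None

# Made-up strings with 8, 9, 10, and 11 characters.
for s in ["ABCDEF12", "ABCDEF12X", "ABCDEF12XY", "ABCDEF12XYZ"]:
    print(s, matches(without_dollar, s), matches(with_dollar, s))
```

Without the dollar sign all four strings match, because the first eight characters are enough. With it, only the 8- and 11-character strings match.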

If your regex flavor does not support counts in braces, you could spell everything out:

    [A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][A-Z0-9][A-Z0-9]([A-Z0-9][A-Z0-9][A-Z0-9])?$

Convenience versus accuracy

If you want to match only valid SWIFT-BIC codes, you can get perfect accuracy by checking against an exhaustive list of SWIFT-BIC codes. You could even write a regular expression that matches codes on this list and only codes on the list, but what would the point be? Regular expressions usually trade accuracy for convenience.

I don’t have a list of all valid SWIFT-BIC codes. If I did, it might be out of date by the time I download it. But if I’m trying to pull bank codes out of a text file, the regex

    [A-Z]{6}[A-Z0-9]{2}([A-Z0-9]{3})?$

is likely to do a pretty good job. Regular expressions are usually used in a context where there’s some tolerance for error. Maybe you use a regular expression to do a first pass, then weed out the mismatches with a manual review.

Capturing parts

Maybe you want to do more than just find SWIFT codes. Maybe you want to look at their pieces.

For example, the fifth and sixth characters of a SWIFT code are the ISO 3166 two-letter abbreviation for the country the bank is in. (With one exception: XK represents Kosovo, which does not have an ISO 3166 code.)

You could replace

    [A-Z]{6}

at the front of the regular expression with

    [A-Z]{4}([A-Z]{2})

which will not change which strings match, but it will store the fifth and sixth characters as the first captured group. How you access captured groups varies between regular expression implementations.
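In Python, for example, captured groups are available through the group method of a match object. The following is a minimal sketch; the helper name country_code is hypothetical, and the sample strings were chosen only because they have the right shape.

```python
import re

# Country code captured as group 1, optional branch code as group 2.
bic_re = re.compile(r"[A-Z]{4}([A-Z]{2})[A-Z0-9]{2}([A-Z0-9]{3})?$")

def country_code(code):
    """Return the two-letter country part of a code, or None if no match."""
    m = bic_re.match(code)
    return m.group(1) if m else None

print(country_code("DEUTDEFF"))     # fifth and sixth characters
print(country_code("DEUTDEFF500"))  # same country, with a branch suffix
print(country_code("ABC"))          # too short to match
```

In Perl the same captured text would be available as $1 after a successful match.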

Legibility

The first proposed regular expression

    [A-Z]{6}[A-Z0-9]{2,5}

is easy to read, at least in my opinion. It has grown over the course of this post to

    [A-Z]{4}([A-Z]{2})[A-Z0-9]{2}([A-Z0-9]{3})?$

which is not as easy to read. This is typical: you start with a quick-and-dirty regular expression, then refine it until it meets your needs. Regular expressions tend to get uglier as they become more precise.

There are ways to make regular expressions more readable by using something like the /x modifier in Perl, which lets you insert white space and comments inside a regular expression.

That’s nice, but it’s also a little odd. If you’re going to use a complicated regular expression in production code, then you should format it nicely and add comments. But then you have to ask why you’re using a complicated regular expression in production code. I’m not saying this is never appropriate, but it’s not the most common use case.

I could imagine using a simple regular expression when you want quick and dirty, and using an exhaustive list of SWIFT codes in production. A complex, well-commented regular expression seems to fall into a sort of no man’s land in between.
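For what it’s worth, here is what the final expression might look like with white space and comments, using Python’s re.VERBOSE flag as a stand-in for Perl’s /x modifier. The comments are my reading of the format described above, not text from the standard.

```python
import re

bic_verbose = re.compile(r"""
    [A-Z]{4}          # four letters identifying the bank
    ([A-Z]{2})        # two-letter country code, captured
    [A-Z0-9]{2}       # two letters or digits for the location
    ([A-Z0-9]{3})?    # optional three-character branch code
    $                 # nothing may follow
    """, re.VERBOSE)

# The compact form, for comparison.
bic_compact = re.compile(r"[A-Z]{4}([A-Z]{2})[A-Z0-9]{2}([A-Z0-9]{3})?$")

# Made-up test strings; both forms should agree on each one.
for s in ["ABCDEF12", "ABCDEF12XYZ", "ABCDEF12X", "abcdef12"]:
    print(s, bool(bic_verbose.match(s)), bool(bic_compact.match(s)))
```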

Bringing regex modifiers into the regex

Posted on 8 April 2023 by John

Suppose you’re using a program that takes a regular expression as an argument. You didn’t get the match you expected, then you realize you’d like your search to be case-insensitive.

If you were using grep you’d go back and add a -i flag.

If you were writing a Perl script, you could add a /i at the end of the regex.

If you were using Python, you could add re.IGNORECASE as a function argument.

But the premise of the post isn’t that you’re using grep or Perl or Python. The premise is that you are using a program that takes a regular expression as an argument, and regular expression modifiers are not regular expressions per se.

However, you can incorporate regular expression modifiers into a regular expression, if your regular expression implementation supports it. In particular, you can add (?i) to a regex to indicate that the remainder of the search pattern is to be interpreted as case-insensitive. You can also use (?-i) to turn case sensitivity back on.

For example, the regex

    foo(?i)baz(?-i)qux

will make the baz portion of the expression case insensitive but the rest is case sensitive. For example, the expression will match fooBaZqux but not foobazQux.

You can also use things like (?s) and (?m) where you would use /s and /m in Perl, or re.S and re.M in Python.

These scoped pattern modifiers are not supported everywhere. They were introduced in Perl and have been adopted in other languages and applications.
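Python is one place where the syntax differs. As far as I know, its re module does not accept (?i) in the middle of a pattern or the (?-i) toggle, but since version 3.6 it does accept a modifier scoped to a group, written (?i:...). Here is a sketch of the example above in that form.

```python
import re

# Only the group (?i:baz) is case-insensitive; foo and qux are not.
pattern = re.compile(r"foo(?i:baz)qux")

for s in ["fooBaZqux", "foobazqux", "foobazQux", "FOObazqux"]:
    print(s, pattern.fullmatch(s) is not None)

# A modifier at the very start of the pattern applies to the whole expression.
whole = re.compile(r"(?i)foobazqux")
print(whole.fullmatch("FOObazQUX") is not None)
```

The first two strings match the scoped pattern and the next two do not, since the capital letters fall outside the case-insensitive group.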

Regex to match ICD-11 code

Posted on 6 October 2022 by John

ICD codes are diagnostic codes created by the WHO. (Three TLAs in just the opening paragraph!)

The latest version, ICD-11, went into effect in January of this year. A few countries are using ICD-11 now; it’s expected to be at least a couple years before the US moves from ICD-10 to ICD-11. (I still see ICD-9 data even though ICD-10 came out in 1994.)

One way that ICD-11 codes differ from ICD-10 codes is that the new codes do not use the letters I or O in order to prevent possible confusion with the digits 1 and 0. In the code below, “alphabetic” and “alphanumeric” implicitly exclude the letters I and O.

Another way the codes differ is that the second character in an ICD-10 code is a digit whereas the second character in an ICD-11 code is a letter.

What follows is a heavily-commented regular expression for matching ICD-11 codes, along with a few tests to show that the regex matches things it should and does not match things it should not.

Of course you could verify an ICD-11 code by searching against an exhaustive list of such codes, but the following is much simpler, though it may match some false positives. However, it is future-proof against false negatives: ICD-11 codes added in the future will conform to the pattern in the regular expression.

import re

icd11_re = re.compile(r"""
    ^                  # beginning of string
    [A-HJ-NP-Z0-9]     # alphanumeric
    [A-HJ-NP-Z]        # alphabetic
    [0-9]              # digit
    [A-HJ-NP-Z0-9]     # alphanumeric
    ((\.               # optional starting with .
    [A-HJ-NP-Z0-9])    # alphanumeric
    [A-HJ-NP-Z0-9]?)?  # optional further refinement
    $                  # end of string
    """, re.VERBOSE)

good = [
    "ND52",   # fracture of arm, level unspecified
    "9D00.3", # presbyopia
    "8B60.Y", # other specified increased intracranial pressure
    "DB98.7Z" # portal hypertension, unspecified
]

bad = [
    "ABCD",    # third character must be digit
    "AB3D.",   # dot must be followed by alphanumeric
    "9D0O.3",  # letter 'O' should be number 0
    "DB9872",  # missing dot
    "AB3",     # too short
    "DB90.123" # too long
]

for g in good:
    assert icd11_re.match(g)
for b in bad:
    assert icd11_re.match(b) is None

Regular expressions and successive approximation

Posted on 7 April 2022 by John

Regular expressions can do a lot of tasks in practice that they cannot do in theory. That’s because a particular application of regular expressions comes with context and with error tolerance.

For example, much has been said about how regular expressions cannot parse HTML. This is strictly true, but it says nothing about how well your particular regular expression might work for your task.

Maybe you’re searching HTML generated by a particular person or program, not all potential HTML documents, and a regular expression can find what you’re looking for in your idiomatic subset of HTML.

Types of errors

If you’re searching for a pattern, you can err by finding something that isn’t a match or by failing to find something that is a match, false positives and false negatives.

False positives are often not a problem. A very common use for regular expressions is to filter a huge amount of text down to a small amount of text to visually inspect. Maybe your first attempt at using a regular expression to find a needle in a haystack returns a smaller haystack. Then you refine your regular expression until it returns a small number of sharp pointy things in addition to the needle you’re after.

False negatives are more of a problem. If you can’t find what you’re looking for, you generalize your regular expression until you then have a false positive problem. If you still can’t find what you’re looking for, then you have to wonder whether the thing you’re after exists. Proving a negative is hard in general.

But even false negatives may not be a problem. Maybe there are a dozen needles in your haystack, but you don’t need to find all of them; you just need to find one of them.

In my work in data privacy I’ve needed to test whether personal information is somewhere it’s not supposed to be. If my regular expression finds any personal information, then I can tell my client they have a problem. If I don’t find any, I have more work to do.

Example: call signs

The FCC has fairly complicated rules for assigning amateur radio call signs. For starters,

Each call sign has a one letter prefix (K, N, W) or a two letter prefix (AA-AL, KA-KZ, NA-NZ, WA-WZ) and a one, two, or three letter suffix separated by a numeral (0-9) indicating the geographic region.

This is enough information to craft a regular expression that has no false negatives, but may have some false positives. For example, you might start with

    [A-Z]+[0-9][A-Z]+

which matches a string of capital letters, followed by a digit, followed by another string of capital letters. Maybe that’s good enough. But you could get more specific, such as

    [AKNW][A-Z]?[0-9][A-Z]{1,3}

which comes closer to duplicating the quotation above from the FCC rules: one of A, K, N, or W, optionally followed by another letter, followed by a digit, followed by between one and three letters. But you could do better:

    \b(A[A-L]|[KNW][A-Z]?)[0-9][A-Z]{1,3}\b

This adds the restriction that our pattern appear on word boundaries, and captures the restriction on letters that can follow ‘A’.

You could keep going, crafting ever more complicated regular expressions that reduce your false positive rate. The FCC has a lot more restrictions than the ones quoted above, and you could incorporate some of these into your expression.

But false positives are inevitable. The ultimate determinant of what is a valid call sign is the FCC database of call signs, which changes continually. You refine your expression until you hit diminishing returns. If you need more precision than that, don’t use regular expressions.
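To see the successive approximation at work, here is a rough Python sketch that runs three call sign patterns of increasing strictness over a few strings. The names loose, medium, and tight are mine, and the strings were chosen only for their shape. Note that the tight pattern makes the second letter after K, N, or W optional so that one-letter prefixes still match.

```python
import re

patterns = {
    "loose":  r"[A-Z]+[0-9][A-Z]+",
    "medium": r"[AKNW][A-Z]?[0-9][A-Z]{1,3}",
    "tight":  r"\b(A[A-L]|[KNW][A-Z]?)[0-9][A-Z]{1,3}\b",
}

def matching_patterns(candidate):
    """Names of the patterns that match the whole candidate string."""
    return [name for name, pat in patterns.items()
            if re.fullmatch(pat, candidate)]

# Strings chosen for their shape only, not looked up in any database.
for candidate in ["W1AW", "KA2XYZ", "AM4XYZ", "B2CD"]:
    print(candidate, matching_patterns(candidate))
```

Each refinement rejects strings that the previous pattern accepted: B2CD only gets past the loose pattern, and AM4XYZ gets past the first two but not the third.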

Prototype vs Perfect

I get pushback every time I write about regular expressions. I try to make it clear that I’m talking about quick and useful approximations, but perhaps I should do more to emphasize that. I suppose critics of regular expressions are reacting to enthusiasts who want to use regular expressions for everything.

It’s often possible to create a 99% solution in 1/1000 of the time it would take to create a 100% solution. Whether or not 99% is good enough depends on context.

Word problems, logic, and regular expressions

Posted on 23 January 2022 by John

Word problems

Suppose you have a sequence of symbols and a set of rewriting rules for replacing some patterns of symbols with others. Now you’re given two such sequences. Can you tell whether there’s a way to turn one of them into the other?

This is known as the word problem, and in general it’s undecidable. In general the problem cannot be solved by a program, but some instances can. We’ll look at a word problem that can be solved with a few regular expressions.

Modal logic

Basic modal logic has two symbols, □ (“box”) and ◇ (“diamond”), and concatenations of these symbols. In general, there are infinitely many non-equivalent sequences of boxes and diamonds, depending on the axioms of your modal logic.

In the axiom system S4, every non-empty sequence of boxes and diamonds is equivalent to one of six possibilities:

□
◇
□◇
◇□
□◇□
◇□◇

An arbitrary sequence of boxes and diamonds can be reduced to one of the forms above by applying the following rules:

□ □ → □
◇ ◇ → ◇
□◇□◇ → □◇
◇□◇□ → ◇□

Regular expressions

We can apply the reduction rules above using regular expressions with the following Perl code.

    use utf8;

    $_ = "□□◇□◇◇◇◇□□";

    s/□+/□/g;
    s/◇+/◇/g; 
    s/(□◇)+/□◇/g; 
    s/(◇□)+/◇□/g;

    print;

The directive use utf8; tells Perl to be prepared for non-ASCII characters, namely boxes and diamonds. In Perl, $_ is the implicit variable; all the following substitution commands will modify this variable, and the print statement will output the final value of this variable.

The first substitution replaces one or more consecutive boxes with one box and the second does the analogous substitution for consecutive diamonds. The third and fourth substitution commands replace repetitions of □◇ or ◇□ with a single instance.

The script above outputs

□◇□

meaning that

□□◇□◇◇◇◇□□p ⟷ □◇□p

is a theorem in S4.

Word problems can’t always be solved using regular expressions, or any other programming technique, but this one could.
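For readers who would rather not run Perl, the same four substitutions can be sketched in Python. This is a direct translation, applying each rule once in the same order; the function name reduce_s4 is just a label I made up.

```python
import re

def reduce_s4(modalities):
    """Apply the four reduction rules once each, in the order given above."""
    s = re.sub("□+", "□", modalities)  # collapse runs of boxes
    s = re.sub("◇+", "◇", s)           # collapse runs of diamonds
    s = re.sub("(□◇)+", "□◇", s)       # collapse repeated box-diamond pairs
    s = re.sub("(◇□)+", "◇□", s)       # collapse repeated diamond-box pairs
    return s

print(reduce_s4("□□◇□◇◇◇◇□□"))
```

Python 3 strings are Unicode by default, so nothing analogous to use utf8 is needed.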

Corner quotes in Unicode

Posted on 8 January 2022 by John

In his book Mastering Regular Expressions, Jeffrey Friedl uses corner quotes to delimit regular expressions. Here’s an example I found by opening his book at random:

    ⌜(\.\d\d[1-9]?)\d*⌟

The upper-left corner at the beginning and the lower-right corner at the end are not part of the regular expression. This particularly comes in handy if a regular expression begins or ends with white space.

(It wouldn’t do to, say, use quotation marks because this would invite confusion between the regular expression itself and a quoted string used to express that regular expression in a programming language.)

I’ve thought about using Friedl’s convention but I didn’t think it could be done with plain text. It can, using Unicode character U+231C at the beginning and U+231F at the end.

There are four corner quotes:

    |------+--------+---------------------|
    | Char | Code   | Name                |
    |------+--------+---------------------|
    | ⌜    | U+231C | TOP LEFT CORNER     |
    | ⌝    | U+231D | TOP RIGHT CORNER    |
    | ⌞    | U+231E | BOTTOM LEFT CORNER  |
    | ⌟    | U+231F | BOTTOM RIGHT CORNER |
    |------+--------+---------------------|
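The table can be reproduced with Python’s unicodedata module. The sketch below also includes a small helper, with the made-up name corner_quote, that wraps a regular expression in a top left corner and a bottom right corner for display.

```python
import unicodedata

def corner_quote(regex):
    """Wrap a regex in U+231C and U+231F for display purposes only."""
    return "\u231c" + regex + "\u231f"

# Print code point, character, and Unicode name for the four corners.
for cp in range(0x231C, 0x2320):
    print(f"U+{cp:04X} {chr(cp)} {unicodedata.name(chr(cp))}")

print(corner_quote(r"(\.\d\d[1-9]?)\d*"))
```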

Corner quotes are also used in logic to denote Gödel numbers, e.g. ⌜φ⌝ denotes the Gödel number for φ.

Corner quotes are also known as Quine quotes. They usually come in the pair top left and top right, rather than top left and bottom right as in Friedl’s usage.

Update: As Rob Wells points out in the comments, it seems Friedl used CJK quote marks 「 (U+300C) and 」 (U+300D) rather than the corner quotes, which makes sense given that Friedl speaks Japanese.

Is fast grep faster?

Posted on 26 April 2021 by John

The grep utility searches text files for regular expressions, but it can search for ordinary strings since these strings are a special case of regular expressions. However, if your regular expressions are in fact simply text strings, fgrep may be much faster than grep. Or so I’ve heard. I did some benchmarks to see.

Strictly speaking I used grep -F rather than fgrep. On Linux, if you ask for the man (manual) page for fgrep you’ll be taken to the man page for grep which says

In addition, the variant programs egrep, fgrep and rgrep are the same as grep -E, grep -F, and grep -r, respectively. These variants are deprecated, but are provided for backward compatibility.

I was working on a project for a client where I had to search for a long list of words in a long list of files [1]. This is the kind of task where fgrep (“fast grep”) is supposed to be much faster than grep. It was a tiny bit faster, not enough to notice. When I timed it the difference was on the order of 1%.

I ran an analogous search on my own computer with different data and got similar results [2]. There may be instances where fgrep is much faster than grep, but I haven’t seen one first hand.

I suspect that the performance difference between fgrep and grep used to be larger, but the latter has gotten more efficient. Now grep is smart enough to search for strings quickly without having to be told explicitly via -F that the regular expressions are in fact strings. Maybe it scans the regular expression(s) before searching and effectively sets the -F flag itself if appropriate.

[1] I used the -f flag to tell grep the name of a file containing the terms to search for, not to be confused with the additional flag -F to tell grep that the search terms are simply strings.

[2] I got similar results when I was using Linux (WSL) on my computer. When I used grep from GOW the -F flag made the search 24 times faster. Because the GOW project provides light-weight ports of Gnu tools to Windows, it’s understandable that it would not include some of the optimizations in Gnu’s implementation of grep.

tcgrep: grep rewritten in Perl

Posted on 3 March 2021 by John

In The Perl Cookbook, Tom Christiansen gives his rewrite of the Unix utility grep that he calls tcgrep. You don’t have to know Perl to use tcgrep, but you can send it Perl regular expressions.

Why not grep with PCRE?

You can get basically the same functionality as tcgrep by using grep with its PCRE option -P. Since tcgrep searches directories recursively, a more direct comparison would be

    grep -R -P

However, your version of grep might not support -P. And if it does, its Perl-compatible regular expressions might not be completely Perl-compatible. The man page for grep on my machine says

    -P, --perl-regexp
        Interpret the pattern as a Perl-compatible regular 
        expression (PCRE). This is experimental and grep -P 
        may warn of unimplemented features.

The one implementation of regular expressions guaranteed to be fully Perl-compatible is Perl.

If the version of grep on your system supports the -P option and is adequately Perl-compatible, it will run faster than tcgrep. But if you find yourself on a computer that has Perl but not a recent version of grep, you may find tcgrep handy.

Installation

tcgrep is included as part of the Unicode::Tussle Perl module; since tcgrep is a wrapper around Perl, it is as Unicode-compliant as Perl is. So you could install tcgrep (and several more utilities) with

    cpan Unicode::Tussle

This worked for me on Linux without any issues but the install failed on Windows.

I installed tcgrep on Windows by simply copying the source code. (I don’t recall now where I found the source code. I didn’t see it this morning when I searched for it, but I imagine I could have found it if I’d been more persistent.) I commented out the definition of %Compress to disable searching inside compressed files since this feature required Unix utilities not available on Windows.

Consistency

Another reason to use tcgrep is consistency. Perl is criticized for being inconsistent. The Camel book itself says

In general, Perl functions do exactly what you want—unless you want consistency.

But Perl’s inconsistencies are different, and in my opinion less annoying, than the inconsistencies of Unix tools.

Perl is inconsistent in the sense that functions behave differently in different contexts, such as a scalar context or a list context.

Unix utilities are inconsistent across platforms and across tools. For example, a tool like sed will have different features on different platforms, and it will not support the same regular expressions as another tool such as awk.

Perl was written to be a “portable distillation of Unix culture.” As inconsistent as Perl is, it’s more consistent than Unix.
