Efficiency of regular expressions

I’ve never optimized a regular expression.  I typically use regular expressions in scripts where efficiency doesn’t matter. And sometimes I do some regular expression processing as part of a larger program in which the bottleneck is somewhere else. But I’ve never worried about the efficiency of a regular expression.

Regular expression efficiency can matter. There are some regular expressions that can be astonishingly slow to match with some regular expression implementations. Russ Cox gives an example of a regular expression that takes Perl a minute to match against a string that’s only 29 characters long. Another regular expression implementation does the same match six orders of magnitude faster.

The example is contrived to show the difference between the approaches. I’m not sure whether I’ve ever run into a horribly inefficient regular expression on accident. Maybe once. I imagine people run into efficiency issues in practical applications, though I don’t.

I’d suggest worrying about regex efficiency if and when it becomes a problem. If your experience matches mine, that need may never come.

But if you do have a regex efficiency problem, or simply find regex implementation details interesting, I’d recommend Russ Cox’s article.

Related links

Notes on using regular expressions

Good old regular expressions

Here are two examples that persuaded me long ago that regular expressions could be powerful. Both come from The Unix Programming Environment by Kernighan and Pike (1984).

The first problem is to produce a list of all English words that contain all five vowels exactly once and in alphabetical order.

The book creates a regular expression aphavowels

^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$

then uses it to filter a dictionary file

egrep -f alphavowels /usr/dict/web2

This produced 16 words ranging from abstemious to majestious.

The second problem is to produce a list of all English words of at least six letters with letters appearing in increasing alphabetical order.

The book creates a regular expression named monotonic

^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$

then uses it to filter a dictionary file as before, except there is an additional filter stage.

egrep -f monotonic /usr/dict/web2 | grep '......'

This produced 17 words including common words such as almost and ghosty. Some of the more interesting results were bijoux, chintz, and egilops. Kernighan and Pike explain that egilops is a disease that attacks wheat.

Explanation

The regular expressions above are fairly long, but shorter and more transparent than a procedural program to solve the same problem. The solutions may look mysterious at first sight, but they are entirely straightforward once you know the most basic features of regular expressions.

In the first problem, the pattern [^aeiou] says to look for anything that isn’t a vowel, i.e. is a consonant (assuming entries in the dictionary file contain only letters). So the regular expression says to start at the beginning of each line and look for zero or more consonants, followed by an ‘a’, followed by zero or more consonants, followed by an ‘e’, and so on down to a ‘u’ optionally followed by consonants at the end of the line.

In the second problem, the question mark matches zero or one instances of a character, i.e. the character is optional. The regular expression says to start at the beginning of each line, look for an optional ‘a’, followed by an optional ‘b’, and so forth to the end of the line. Then the output is filtered by another regular expression ....... Since a period matches any character, a sequence of six periods says to select only words that contain six characters.

Related: Tips for learning regular expressions

Daily tip Twitter account FAQ

This post answers some frequently asked questions regarding my daily tip accounts on Twitter.

How many followers do you have?

About 2800 people are following at least one of these accounts at the time of writing, each following between 2 and 3 accounts on average for a total of about 5900 follows combining all accounts.

How do you schedule your posts?

I schedule the tips a week to a month in advance using HootSuite. Each account posts at the same time each week day starting with SansMouse at 9:30 AM and ending with AlgebraFact at 1:30 PM (US Central Time). Once in a while a tip will go out late due to a Twitter API failure. I occasionally sprinkle in a few unscheduled tweets but I keep the volume low.

What if I don’t use Twitter?

You can subscribe to a Twitter feed via RSS just as you would a blog. For example, you could follow Twitter accounts via Google Reader.

How advanced are these tips?

The SansMouse and RegexTip are in a cycle that starts with the most familiar tips and then moves on to less familiar ones. I mix elementary and advanced material in the mathematical accounts, though there’s a greater range in some accounts. If a tip is more elementary or more advanced than you’d like, you may find something more to your liking in a day or two.

Do you take suggestions? Questions?

I welcome corrections as well as suggestions for new tips. General suggestions are helpful, but I especially appreciate specific tweets.

I like to answer questions when I can, though I can’t respond to every question.

Any plans for new accounts?

I have a couple ideas I’m considering. If you have a suggestion for another daily tip account, let me know.

What about questions about specific accounts?

The following isn’t Q&A format, but it answers some common questions.

SansMouse icon SansMouse is my oldest daily tip account. It gives one Windows keyboard shortcut each day. Most tips apply across all applications, but some tips are specific to popular software packages. Ben Jaffe has an analogous Twitter account commandtab for Mac OS.

RegexTip icon RegexTip is currently the second most popular daily tip account with 1025 followers at the time of writing. This account gives tips for writing regular expressions as well as tips for how regular expressions are used in different environments. I mostly stick to the regular expression syntax supported by Perl 5, Microsoft’s .NET languages, JavaScript, etc.

TeXtip icon TeXtip gives tips for typesetting in TeX and LaTeX. Topics include TeX commands, software for working with LaTeX, tips on typography, etc. I basically stick to LaTeX, though much applies to plain TeX. Also, I don’t say much about add-on packages but stick to the heart of LaTeX.

ProbFact icon ProbFact is currently the most popular daily tip account with 1100 followers. This account gives one fact per day from probability. Probability and statistics are intimately related, but this account primarily sticks to probability proper, though some posts are more statistical. People have suggested I start a statistical counterpart to ProbFact, but I have no plans to do so.

AlgebraFact icon AlgebraFact gives facts from linear algebra, number theory, group theory, etc. A few of the facts are advanced but most are not. I’ve gotten a few complaints for including number theory, but that’s been part of the charter from the beginning. The name AlgebraAndNumberTheoryFact would be more accurate, but too long.

TopologyFact icon TopologyFact gives theorems from topology (point-set topology, algebraic topology, etc.) and geometry. Point-set topology has a lot of theorems that can be condensed into 140 characters but I find other areas harder to tweet about. I could use some help here. I’d like to include more geometry: Euclidean geometry, differential geometry, etc. But I don’t want to include material so esoteric that not many will understand it.

AnalysisFact icon AnalysisFact is the most advanced account on average. Some of the posts are elementary, but some fairly advanced. Topics include real and complex analysis, functional analysis, special functions, differential equations, etc.

I’m doing a drawing to give away either a coffee mug or T-shirt to someone who mentions these tips on Twitter. Tomorrow is the last day to enter.

JohnDCook icon I also have a personal Twitter account. I use this account to post links, interact with friends, etc. I try to keep the signal to noise ratio fairly high, though not as high as the tip accounts.

Twitter daily tip news

I have five Twitter accounts that send out one tip per day, including a new one I just added last week.

Regular expressions

@RegexTip started over today. It’s a cycle of tips for learning regular expressions. It sticks to the regular expression features common to Python, Perl, C#, and many other programming languages. This account posts Monday through Friday.

Keyboard shortcuts

@SansMouse gives one tip a day on using Windows without a mouse. By practicing one keyboard shortcut a day, you can get into the habit of using your mouse less and your keyboard more. This cycle of tips started over January 29 with the most common and most widely useful shortcuts. I’m also sprinkling in a few extra tips that are less well known. This account also posts Monday through Friday.

Math

I have three mathematical accounts. These post seven days a week.

@AlgebraFact, just started February 2. It will be a mixture of linear algebra, number theory, group theory, etc.

@ProbFact gives one fact per day from probability. Usually these facts are theorems, but sometimes they include a note on history or applications.

@AnalysisFact gives facts from real and complex analysis. The topics range from elementary to advanced.

What if I don’t use Twitter?

You can visit the page for a Twitter account just like any other web page. And every Twitter account has an RSS feed link allowing you to subscribe just as you would subscribe to a blog.

How do you write these?

I write up content for these accounts in bulk. I may sit down on a Saturday and come up with several weeks worth of tips. Then I use HootSuite to schedule the tips weeks in advance. Sometimes I’ll post something spontaneously, such as link to something relevant, but most of the work is done in advance. I use my personal Twitter account for live interaction.

Related links:

Using Windows without a mouse

Regular expressions in

Chart of probability distribution relationships

Regular expressions in Mathematica

Regular expressions are fairly portable. There are two main flavors of regular expressions—POSIX and Perl—and more languages these days use the Perl flavor. There are some minor differences in what it means to be “like Perl” but for the most part languages that say they follow Perl’s lead specify regular expressions the same way. The differences lie in how you use regular expressions: how you form matches, how you replace strings, etc.

Mathematica uses Perl’s regular expression flavor. But how do you use regular expressions in Mathematica? I’ll give a few tips here and give more details in the notes Regular expressions in Mathematica.

First of all, unlike Perl, Mathematica specifies regular expressions with ordinary strings. This means that metacharacters have to be doubly escaped. For example, to represent the regular expression d{4} you must use the string "\d{4}".

The function StringCases returns a list of all matches of a regular expression in a string. If you simply want to know whether there was a match, you can use the function StringFreeQ. However, note the you probably want the opposite of the return value from StringFreeQ because it returns whether a string does not contain a match.

By default, the function StringReplace replaces all matches of a regular expression with a given replacement pattern. You can limit the number of replacements it makes by specifying an addition argument.

Related links:

New daily tip feeds: RegexTip and ProbFact

A few weeks ago I started a Twitter account @SansMouse with daily tips on Windows keyboard shortcuts. That’s gone well, so I decided start two more daily tip accounts: @RegexTip and @ProbFact. If you don’t use Twitter, you can follow these tip via your blog reader. Here are the RSS feeds for RegexTip and ProbFact.

RegexTip will give one tip per day on using regular expressions. I’ll sometimes post more than once a day, but I’ll only give one tip. Other posts might be housekeeping notes etc. The idea is that people who have intended to learn regular expressions but don’t have the time can make time to absorb one tip per day.

ProbFact will give one fact per day from probability. I’ll often have a link with each fact for more details. Many of these facts will be theorems, but some will be statements about applications or history.

The SansMouse and RegexTip posts are loosely arranged in order of familiarity, starting with the most basic and most widely used material. ProbFact posts will be more random. Some will be elementary, some more advanced.

I have scheduled tips to come out at regular times starting tomorrow. SansMouse will post at 9 AM, RegexTip at 10 AM, and ProbFact at 11 AM. These times are Central Standard Time (UTC-6).

Related links:

Table-driven text munging in PowerShell

In my previous post, I mentioned formatting C++ code as HTML by doing some regular expression substitutions. I often need to write something that carries out a list of pattern substitutions, so I decided to rewrite the previous script to read a list of patterns from a file. Another advantage of putting the list of substitutions in an external file is that the same file could be used from scripts written in other languages.

Here’s the code:

param($regex_file)

$lines = get-content $regex_file

$a = get-clipboard

foreach( $line in $lines )
{
    $line = ($line.trim() -replace "s+", " ")
    $pair = $line.split(" ", [StringSplitOptions]::RemoveEmptyEntries)
    $a = $a -replace $pair
}

out-clipboard $a

The part of the script that is unique to formatting C++ as HTML is moved to a separate file, say cpp2html.txt, that is pass in as an argument to the script.

&  &
<  &lt;
>  &gt;
"  &quot;
'  &#39;

Now I could use the same PowerShell script for any sort of task that boils down to a list of pattern replacements. (Often this kind of rough translation does not have to be done perfectly. It only has to be done well enough to reduce the amount of left over manual work to an acceptable level. You start with a small list of patterns and add more patterns until it’s less work to do the remaining work by hand than to make the script smarter.)

Note that the order of the lines in the file can be important. Substitutions are done from the top of the list down. In the example above, we want to first convert & to &amp; then convert < to &lt;. Otherwise, < would first become &lt; and then become &amp;lt;.

API symmetry

Symmetric APIs are easier to use. I was reminded of this when doing some regular expression programming in Python and comparing it to Perl. Perl’s regular expression operators for search and replace are symmetric in a way that their Python counterparts are not.

Perl uses m/pattern/ for matching and s/pattern/replacement/ for substitution. Both apply to the first instance of a pattern in a string by default. The g option following a match or substitute operator causes the command to apply to all instances of the pattern. The i option after either a match or substitute command causes the pattern to apply in a case-insensitive manner. Matching and substitution are symmetric.

Python uses re.search() for matching and re.sub() for substitution. The search function can only apply to the first instance of a pattern; to match all instances of a pattern, use re.findall(). The function re.sub() applies to all instances by default, but it has a max parameter that can be set to limit the number of instances it applies to. To make a search pattern case-insensitive, pass in re.IGNORECASE flag. To make a substitution case-insensitive, modify the regular expression itself by adding (?i).

In general, I find Python syntax much cleaner than Perl, but regular expressions are implemented more elegantly in Perl.