Good old regular expressions

Here are two examples that persuaded me long ago that regular expressions could be powerful. Both come from The Unix Programming Environment by Kernighan and Pike (1984).

The first problem is to produce a list of all English words that contain all five vowels exactly once and in alphabetical order.

The book creates a regular expression aphavowels

^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$

then uses it to filter a dictionary file

egrep -f alphavowels /usr/dict/web2

This produced 16 words ranging from abstemious to majestious.

The second problem is to produce a list of all English words of at least six letters with letters appearing in increasing alphabetical order.

The book creates a regular expression named monotonic

^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$

then uses it to filter a dictionary file as before, except there is an additional filter stage.

egrep -f monotonic /usr/dict/web2 | grep '......'

This produced 17 words including common words such as almost and ghosty. Some of the more interesting results were bijoux, chintz, and egilops. Kernighan and Pike explain that egilops is a disease that attacks wheat.

Explanation

The regular expressions above are fairly long, but shorter and more transparent than a procedural program to solve the same problem. The solutions may look mysterious at first sight, but they are entirely straight-forward once you know the most basic features of regular expressions.

In the first problem, the pattern [^aeiou] says to look for anything that isn’t a vowel, i.e. is a consonant (assuming entries in the dictionary file contain only letters). So the regular expression says to start at the beginning of each line and look for zero or more consonants, followed by an ‘a’, followed by zero or more consonants, followed by an ‘e’, and so on down to a ‘u’ optionally followed by consonants at the end of the line.

In the second problem, the question mark matches zero or one instances of a character, i.e. the character is optional. The regular expression says to start at the beginning of each line, look for an optional ‘a’, followed by an optional ‘b’, and so forth to the end of the line. Then the output is filtered by another regular expression ....... Since a period matches any character, a sequence of six periods says to select only words that contain six characters.

Related links:

Tips for learning regular expressions
@RegexTip: One regular expression tip per day

11 thoughts on “Good old regular expressions

  1. I may have misunderstood but should ‘ghostly’ be included since the ‘L’ does not follow o, s or t alphabetically?

  2. 1) “optionally followed by a consonant at the end of the line” should be “followed by zero or more consonants…”

    2) I don’t see why the first regex used a rather than a+, and the second regex used ? rather than *, since there was no restriction in the problem description that said that each vowel/letter should appear exactly once. For instance, with the asterisk version of the second regex, several more words are returned, such as “accent” and “chilly”.

    3) it seems your last <code> tag didn’t get properly replaced;

    4) For those who might want to try this on Ubuntu, here are quick cut-and-past lines that works on my machine:
    * Problem 1: egrep "^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$" /usr/share/dict/words, or egrep "^[^aeiou]*a+[^aeiou]*e+[^aeiou]*i+[^aeiou]*o+[^aeiou]*u+[^aeiou]*$" /usr/share/dict/words
    * Problem 2: egrep "^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$" /usr/share/dict/words | grep "......" or egrep "^a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*$" /usr/share/dict/words | grep "......"

    (I hope this comes out properly formatted…)

  3. Cool. I think your phrasing (“optionally followed by consonants”) is better :)

    Also, regarding my comment about repeating letters: I see you updated the description of problem 1, but not problem 2, and figured out why: it indeed explicitly leaves out repeated letters, since in a sequence like “…cc…” the second c does not “appear in increasing alphabetical order” relative to the previous “c”. (This might be obvious to you or others, but I’m adding it here for the record.)

  4. Those problems were well chosen. They’ll even work relatively efficiently for regexes.

    Try writing the regex that lets you find a word that contains all the vowels exactly once in any order.

    Then try comparing how slow the regex approach is compared to a direct implementation in C of either of K&H’s problems.

    I think regexes are fine for one-off scripts, by the way, and even in production environments where they won’t cause a bottleneck.

  5. You could with a pipeline of 5 greps of the form:
    egrep "^[^a]*a[^a]*$"

    A direct C implementation could be faster, but it might still take you more time in total anyway.

Comments are closed.