Good old regular expressions

Here are two examples that persuaded me long ago that regular expressions could be powerful. Both come from The Unix Programming Environment by Kernighan and Pike (1984).

The first problem is to produce a list of all English words that contain all five vowels exactly once and in alphabetical order.

The book creates a regular expression aphavowels

^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$

then uses it to filter a dictionary file

egrep -f alphavowels /usr/dict/web2

This produced 16 words ranging from abstemious to majestious.

The second problem is to produce a list of all English words of at least six letters with letters appearing in increasing alphabetical order.

The book creates a regular expression named monotonic

^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$

then uses it to filter a dictionary file as before, except there is an additional filter stage.

egrep -f monotonic /usr/dict/web2 | grep '......'

This produced 17 words including common words such as almost and ghosty. Some of the more interesting results were bijoux, chintz, and egilops. Kernighan and Pike explain that egilops is a disease that attacks wheat.

Explanation

The regular expressions above are fairly long, but shorter and more transparent than a procedural program to solve the same problem. The solutions may look mysterious at first sight, but they are entirely straightforward once you know the most basic features of regular expressions.

In the first problem, the pattern [^aeiou] says to look for anything that isn’t a vowel, i.e. is a consonant (assuming entries in the dictionary file contain only letters). So the regular expression says to start at the beginning of each line and look for zero or more consonants, followed by an ‘a’, followed by zero or more consonants, followed by an ‘e’, and so on down to a ‘u’ optionally followed by consonants at the end of the line.

In the second problem, the question mark matches zero or one instances of a character, i.e. the character is optional. The regular expression says to start at the beginning of each line, look for an optional ‘a’, followed by an optional ‘b’, and so forth to the end of the line. Then the output is filtered by another regular expression ....... Since a period matches any character, a sequence of six periods says to select only words that contain six characters.

12 thoughts on “Good old regular expressions”

Tom B

20 October 2010 at 06:57

I may have misunderstood but should ‘ghostly’ be included since the ‘L’ does not follow o, s or t alphabetically?

John

20 October 2010 at 06:59

Good catch. I had a typo. “ghostly” should have been “ghosty.”

Waldir

20 October 2010 at 13:26

1) “optionally followed by a consonant at the end of the line” should be “followed by zero or more consonants…”

2) I don’t see why the first regex used a rather than a+, and the second regex used ? rather than *, since there was no restriction in the problem description that said that each vowel/letter should appear exactly once. For instance, with the asterisk version of the second regex, several more words are returned, such as “accent” and “chilly”.

3) it seems your last <code> tag didn’t get properly replaced;

4) For those who might want to try this on Ubuntu, here are quick cut-and-past lines that works on my machine:
* Problem 1: egrep "^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$" /usr/share/dict/words, or egrep "^[^aeiou]*a+[^aeiou]*e+[^aeiou]*i+[^aeiou]*o+[^aeiou]*u+[^aeiou]*$" /usr/share/dict/words
* Problem 2: egrep "^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$" /usr/share/dict/words | grep "......" or egrep "^a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*$" /usr/share/dict/words | grep "......"

(I hope this comes out properly formatted…)

Waldir

20 October 2010 at 13:27

*that work on my machine

John

20 October 2010 at 13:40

Waldir: Thanks. I updated the post in response to your comment.

Waldir

21 October 2010 at 01:23

Cool. I think your phrasing (“optionally followed by consonants”) is better :)

Also, regarding my comment about repeating letters: I see you updated the description of problem 1, but not problem 2, and figured out why: it indeed explicitly leaves out repeated letters, since in a sequence like “…cc…” the second c does not “appear in increasing alphabetical order” relative to the previous “c”. (This might be obvious to you or others, but I’m adding it here for the record.)

Bob Carpenter

21 October 2010 at 13:15

Those problems were well chosen. They’ll even work relatively efficiently for regexes.

Try writing the regex that lets you find a word that contains all the vowels exactly once in any order.

Then try comparing how slow the regex approach is compared to a direct implementation in C of either of K&H’s problems.

I think regexes are fine for one-off scripts, by the way, and even in production environments where they won’t cause a bottleneck.

Yaroslav Bulatov

22 October 2010 at 11:26

(all vowels in any order) — You could do that with a context-free grammar, right?

Leon Planken

8 August 2011 at 12:55

You could with a pipeline of 5 greps of the form:
egrep "^[^a]*a[^a]*$"

A direct C implementation could be faster, but it might still take you more time in total anyway.

Sten-Åke Strid

28 January 2015 at 12:56

In the first problem, you state in the problem text “all five vowels”. Isn’t y also a vowel in English? Just because it can be used as a consonant doesn’t mean it can’t be a vowel. I’m not a native English speaker though.

Ted Morin

4 January 2016 at 12:40

Sten, you are right that y can act as both a vowel and a consonant. The problem with adding y to the problem is that you can’t know whether the y is being used as a vowel or a consonant with a regex solution. If you don’t care about its usage, though, you can just add on y:

^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*y[^aeiou]*$

George Vaccaro

6 January 2016 at 11:31

Since a period matches any character, a sequence of six periods says to select XonlyX words that contain six characters. *Since grep matches parts of an input line rather than doing an exact match, this will return words of 6 characters or more, like “egilops”.*

Comments are closed.