Learn one Perl command

A while back I wrote a post Learn one sed command. In a nutshell, I said it’s worth learning sed just do commands of the form sed s/foo/bar/ to replace “foo” with “bar.”

Dan Haskin and Will Fitzgerald suggested in their comments that instead of sed use perl -pe with the same command. The advantage is that you could use Perl’s more powerful regular expression syntax. Will said he uses Perl like this:

cat file | perl -pe "s/old/new/g" > newfile

I think they’re right. Except for the simplest regular expressions, sed’s regular expression syntax is too restrictive. For example, I recently needed to remove commas that immediately follow a digit and this did the trick:

cat file | perl -pe "s/(?<=d),//g" > newfile

Since sed does not have the look-behind feature or d for digits, the corresponding sed code would be more complicated.

I quit writing Perl years ago. I don’t miss Perl as a whole, but I do miss Perl’s regular expression support.

Learning Perl is a big commitment, but just learning Perl regular expressions is not. Perl is the leader in regular expression support, and many programming languages implement a subset of Perl’s regex features. You could just use a subset of Perl features you already know, but you’d have the option of using more features.

Related post: Perl 6 as the anti-JavaScript

For daily tips on regular expressions, follow @RegexTip on Twitter.

Regex tip icon

Just what do you mean by "number"?

Tom Christiansen gave an awesome answer to the question of how to match a number with a regular expression. He begins by clarifying what the reader means by “number”, then gives answers for each.

  • Is −0 a number?
  • How do you feel about √−1?
  • Is or a number?
  • Is 186,282.42±0.02 miles/second one number — or is it two or three of them?
  • Is 6.02e23 a number?
  • Is 3.141_592_653_589 a number? How about π, or ? And −2π⁻³ ͥ?
  • How many numbers in 0.083̄?
  • How many numbers in
  • What number does hold? How about ⚂⚃?
  • Does 10,5 mm have one number in it — or does it have two?
  • Is ∛8³ a number — or is it three of them?
  • What number does ↀↀⅮⅭⅭⅬⅫ AUC represent, 2762 or 2009?
  • Are ४५६७ and ৭৮৯৮ numbers?
  • What about 0377, 0xDEADBEEF, and 0b111101101?
  • Is Inf a number? Is NaN?
  • Is ④② a number? What about ?
  • How do you feel about ?
  • What do ℵ₀ and ℵ₁ have to do with numbers? Or , , and ?

See his full response here. Thanks to Bill the Lizard for pointing this out.

For daily tips on regular expressions, follow @RegexTip on Twitter.

Regex tip icon

Roman numeral puzzle

I noticed an ad for Super Bowl XLVI on a pizza box this morning. The Roman numeral XLVI does not repeat any character. This brought up a couple questions.

  • How many Roman numerals are possible if you’re not allowed to repeat a character?
  • Could you write a (reasonably short) regular expression to find all such numbers?

You can post your solutions to either question in the comments.

There has never been universal agreement on the rules for constructing Roman numerals, so your solution would depend on your choice of rules. For our purposes here, assume the valid characters are I, V, X, L, C, D, and M. Also, assume any character can be subtracted from a larger character. For example, you can assume IL is a valid representation of 49.

For a more challenging problem, you can use the more restrictive subtraction rules.

  1. I can be subtracted from V and X only.
  2. X can be subtracted from L and C only.
  3. C can be subtracted from D and M only.
  4. V, L, and D can never be subtracted.

Other puzzle posts:

Learn one sed command

You may have seen sed programs even if you didn’t know that’s what they were. In online discussions it’s common to hear someone say


as a shorthand to mean “replace foo with bar.” The line s/foo/bar/ is a complete sed program to do such a replacement.

sed comes with every Unix-like operating system and is available for Windows here. It has a range of features for editing files, but sed is worth using even if you only know how to do one thing with it:

sed "s/pattern1/pattern2/g" file.txt > newfile.txt

This will replace every instance of pattern1 with pattern2 in the file file.txt and will write the result to newfile.txt. The original file file.txt is unchanged.

I used to think there was no reason to use sed when other languages like Python will do everything sed does and much more. Suppose you agree with that. Now suppose you find you often have to make global search-and-replace operations and so you write a script to do this, say a Python script. You’ve got to call your script something, remember what you called it, and put it in your path. How about calling it sed? Or better, don’t write your script, but pretend that you did. If you’re on Linux, it’s already in your path. One advantage of the real sed over your script named sed is that the former can do a lot more, should you ever need it to.

Now for a few details regarding the sed command above. The “s” on the front stands for “substitute” and the “g” on the end stands for “global.” Without the “g” on the end, sed would only replace the first instance of the pattern on each line. If that’s what you want, then remove the “g.”

The patterns inside a sed command are regular expressions, so it’s best to get in the habit of always quoting sed commands. This isn’t necessary for simple string substitutions, but regular expressions often contain characters that you’ll need to prevent the shell from interpreting.

You may find the default regular expression support in sed odd or restrictive. If you’re used to regular expressions in Perl, Python, JavaScript, etc. and you’re using a Gnu implementation of sed, you can add the -r option for more familiar regular expression syntax.

I got the idea for this post from Greg Grouthaus’ post Why you should learn just a little Awk. He makes a good case that you can benefit from learning just a few commands of a language like Awk with no intention to learn more of the language.

Related posts:

For daily tips on regular expressions, follow @RegexTip on Twitter.

Regex tip icon

Efficiency of regular expressions

I’ve never optimized a regular expression.  I typically use regular expressions in scripts where efficiency doesn’t matter. And sometimes I do some regular expression processing as part of a larger program in which the bottleneck is somewhere else. But I’ve never worried about the efficiency of a regular expression.

Regular expression efficiency can matter. There are some regular expressions that can be astonishingly slow to match with some regular expression implementations. Russ Cox gives an example of a regular expression that takes Perl a minute to match against a string that’s only 29 characters long. Another regular expression implementation does the same match six orders of magnitude faster.

The example is contrived to show the difference between the approaches. I’m not sure whether I’ve ever run into a horribly inefficient regular expression on accident. Maybe once. I imagine people run into efficiency issues in practical applications, though I don’t.

I’d suggest worrying about regex efficiency if and when it becomes a problem. If your experience matches mine, that need may never come.

But if you do have a regex efficiency problem, or simply find regex implementation details interesting, I’d recommend Russ Cox’s article.

Related links:

Notes on using regular expressions in:

For daily tips on regular expressions, follow @RegexTip on Twitter.

Regex tip icon

Good old regular expressions

Here are two examples that persuaded me long ago that regular expressions could be powerful. Both come from The Unix Programming Environment by Kernighan and Pike (1984).

The first problem is to produce a list of all English words that contain all five vowels exactly once and in alphabetical order.

The book creates a regular expression aphavowels


then uses it to filter a dictionary file

egrep -f alphavowels /usr/dict/web2

This produced 16 words ranging from abstemious to majestious.

The second problem is to produce a list of all English words of at least six letters with letters appearing in increasing alphabetical order.

The book creates a regular expression named monotonic


then uses it to filter a dictionary file as before, except there is an additional filter stage.

egrep -f monotonic /usr/dict/web2 | grep '......'

This produced 17 words including common words such as almost and ghosty. Some of the more interesting results were bijoux, chintz, and egilops. Kernighan and Pike explain that egilops is a disease that attacks wheat.


The regular expressions above are fairly long, but shorter and more transparent than a procedural program to solve the same problem. The solutions may look mysterious at first sight, but they are entirely straight-forward once you know the most basic features of regular expressions.

In the first problem, the pattern [^aeiou] says to look for anything that isn’t a vowel, i.e. is a consonant (assuming entries in the dictionary file contain only letters). So the regular expression says to start at the beginning of each line and look for zero or more consonants, followed by an ‘a’, followed by zero or more consonants, followed by an ‘e’, and so on down to a ‘u’ optionally followed by consonants at the end of the line.

In the second problem, the question mark matches zero or one instances of a character, i.e. the character is optional. The regular expression says to start at the beginning of each line, look for an optional ‘a’, followed by an optional ‘b’, and so forth to the end of the line. Then the output is filtered by another regular expression ....... Since a period matches any character, a sequence of six periods says to select only words that contain six characters.

Related links:

Tips for learning regular expressions

For daily tips on regular expressions, follow @RegexTip on Twitter.

Regex tip icon

Daily tip Twitter account FAQ

This post answers some frequently asked questions regarding my daily tip accounts on Twitter.

How many followers do you have?

About 2800 people are following at least one of these accounts at the time of writing, each following between 2 and 3 accounts on average for a total of about 5900 follows combining all accounts.

How do you schedule your posts?

I schedule the tips a week to a month in advance using HootSuite. Each account posts at the same time each week day starting with SansMouse at 9:30 AM and ending with AlgebraFact at 1:30 PM (US Central Time). Once in a while a tip will go out late due to a Twitter API failure. I occasionally sprinkle in a few unscheduled tweets but I keep the volume low.

What if I don’t use Twitter?

You can subscribe to a Twitter feed via RSS just as you would a blog. For example, you could follow Twitter accounts via Google Reader.

How advanced are these tips?

The SansMouse and RegexTip are in a cycle that starts with the most familiar tips and then moves on to less familiar ones. I mix elementary and advanced material in the mathematical accounts, though there’s a greater range in some accounts. If a tip is more elementary or more advanced than you’d like, you may find something more to your liking in a day or two.

Do you take suggestions? Questions?

I welcome corrections as well as suggestions for new tips. General suggestions are helpful, but I especially appreciate specific tweets.

I like to answer questions when I can, though I can’t respond to every question.

Any plans for new accounts?

I have a couple ideas I’m considering. If you have a suggestion for another daily tip account, let me know.

What about questions about specific accounts?

The following isn’t Q&A format, but it answers some common questions.

SansMouse icon SansMouse is my oldest daily tip account. It gives one Windows keyboard shortcut each day. Most tips apply across all applications, but some tips are specific to popular software packages. Ben Jaffe has an analogous Twitter account commandtab for Mac OS.

RegexTip icon RegexTip is currently the second most popular daily tip account with 1025 followers at the time of writing. This account gives tips for writing regular expressions as well as tips for how regular expressions are used in different environments. I mostly stick to the regular expression syntax supported by Perl 5, Microsoft’s .NET languages, JavaScript, etc.

TeXtip icon TeXtip gives tips for typesetting in TeX and LaTeX. Topics include TeX commands, software for working with LaTeX, tips on typography, etc. I basically stick to LaTeX, though much applies to plain TeX. Also, I don’t say much about add-on packages but stick to the heart of LaTeX.

ProbFact icon ProbFact is currently the most popular daily tip account with 1100 followers. This account gives one fact per day from probability. Probability and statistics are intimately related, but this account primarily sticks to probability proper, though some posts are more statistical. People have suggested I start a statistical counterpart to ProbFact, but I have no plans to do so.

AlgebraFact icon AlgebraFact gives facts from linear algebra, number theory, group theory, etc. A few of the facts are advanced but most are not. I’ve gotten a few complaints for including number theory, but that’s been part of the charter from the beginning. The name AlgebraAndNumberTheoryFact would be more accurate, but too long.

TopologyFact icon TopologyFact gives theorems from topology (point-set topology, algebraic topology, etc.) and geometry. Point-set topology has a lot of theorems that can be condensed into 140 characters but I find other areas harder to tweet about. I could use some help here. I’d like to include more geometry: Euclidean geometry, differential geometry, etc. But I don’t want to include material so esoteric that not many will understand it.

AnalysisFact icon AnalysisFact is the most advanced account on average. Some of the posts are elementary, but some fairly advanced. Topics include real and complex analysis, functional analysis, special functions, differential equations, etc.

I’m doing a drawing to give away either a coffee mug or T-shirt to someone who mentions these tips on Twitter. Tomorrow is the last day to enter.

JohnDCook icon I also have a personal Twitter account. I use this account to post links, interact with friends, etc. I try to keep the signal to noise ratio fairly high, though not as high as the tip accounts.

Twitter daily tip news

I have five Twitter accounts that send out one tip per day, including a new one I just added last week.

Regular expressions

@RegexTip started over today. It’s a cycle of tips for learning regular expressions. It sticks to the regular expression features common to Python, Perl, C#, and many other programming languages. This account posts Monday through Friday.

Keyboard shortcuts

@SansMouse gives one tip a day on using Windows without a mouse. By practicing one keyboard shortcut a day, you can get into the habit of using your mouse less and your keyboard more. This cycle of tips started over January 29 with the most common and most widely useful shortcuts. I’m also sprinkling in a few extra tips that are less well known. This account also posts Monday through Friday.


I have three mathematical accounts. These post seven days a week.

@AlgebraFact, just started February 2. It will be a mixture of linear algebra, number theory, group theory, etc.

@ProbFact gives one fact per day from probability. Usually these facts are theorems, but sometimes they include a note on history or applications.

@AnalysisFact gives facts from real and complex analysis. The topics range from elementary to advanced.

What if I don’t use Twitter?

You can visit the page for a Twitter account just like any other web page. And every Twitter account has an RSS feed link allowing you to subscribe just as you would subscribe to a blog.

How do you write these?

I write up content for these accounts in bulk. I may sit down on a Saturday and come up with several weeks worth of tips. Then I use HootSuite to schedule the tips weeks in advance. Sometimes I’ll post something spontaneously, such as link to something relevant, but most of the work is done in advance. I use my personal Twitter account for live interaction.

Related links:

Using Windows without a mouse

Regular expressions in

Chart of probability distribution relationships

Regular expressions in Mathematica

Regular expressions are fairly portable. There are two main flavors of regular expressions — POSIX and Perl — and more languages these days use the Perl flavor. There are some minor differences in what it means to be “like Perl” but for the most part languages that say they follow Perl’s lead specify regular expressions the same way. The differences lie in how you use regular expressions: how you form matches, how you replace strings, etc.

Mathematica uses Perl’s regular expression flavor. But how do you use regular expressions in Mathematica? I’ll give a few tips here and give more details in the notes Regular expressions in Mathematica.

First of all, unlike Perl, Mathematica specifies regular expressions with ordinary strings. This means that metacharacters have to be doubly escaped. For example, to represent the regular expression d{4} you must use the string "\d{4}".

The function StringCases returns a list of all matches of a regular expression in a string. If you simply want to know whether there was a match, you can use the function StringFreeQ. However, note the you probably want the opposite of the return value from StringFreeQ because it returns whether a string does not contain a match.

By default, the function StringReplace replaces all matches of a regular expression with a given replacement pattern. You can limit the number of replacements it makes by specifying an addition argument.

Related links:

For daily tips on regular expressions, follow @RegexTip on Twitter.

Regex tip icon