Look-behind regex

Look-behind is one of those advanced/obscure regular expression features that I don’t use frequently enough to remember the syntax, but just frequently enough that I wish I could remember it.

Look-behind can be positive or negative. Look-behind says “match this position only if the preceding text matches (does not match) the following  pattern.”

The syntax in Perl and similar regular expression implementations is (?<= … ) for positive look-behind and (?<! … ) for negative look-behind. For the longest time I couldn’t remember whether the next symbol after ? was the direction (i.e. < for behind) or the polarity (= for positive, ! for negative). I was more likely to guess wrong unless I’d used the syntax recently.

The reason I was tempted to get these wrong is that I thought “positive look-behind” and “negative look-behind.” That’s how these patterns are described. But this means the words and symbols come in a different order. If you think look-behind positive and look-behind negative then the words and the symbols come in the same order:

look (?
behind <
positive =
negative !

Maybe this syntax comes more naturally to people who speak French and other languages where adjectives follow the thing they describe. English word order was tripping me up.

By the way, the syntax for look-ahead patterns is simpler: just leave out the <. The default direction for look-around patterns is forward. You don’t have to remember whether the symbol for direction or parity comes first because there is no symbol for direction.

Read More

Can regular expressions parse HTML or not?

Can regular expressions parse HTML? There are several answers to that question, both theoretical and practical.

First, let’s look at theoretical answers.

When programmers first learn about regular expressions, they often try to use them on HTML. Then someone wise will tell them “You can’t do that. There’s a computer science theorem that says regular expressions are not powerful enough.” And that’s true, if you stick to the original meaning of “regular expression.”

But if you interpret “regular expression” the way it is commonly used today, then regular expressions can indeed parse HTML. This post by Nikita Popov explains that what programmers commonly call regular expressions, such as PCRE (Perl compatible regular expressions), can match context-free languages.

Well-formed HTML is context-free. So you can match it using regular expressions, contrary to popular opinion.

So according to computer science theory, can regular expressions parse HTML? Not by the original meaning of regular expression, but yes, PCRE can.

Now on to the practical answers. The next lines in Nikita Popov’s post say

But don’t forget two things: Firstly, most HTML you see in the wild is not well-formed (usually not even close to it). And secondly, just because you can, doesn’t mean that you should.

HTML in the wild can be rather wild. On the other hand, it can also be simpler than the HTML grammar allows. In practice, you may be able to parse a particular bit of HTML with regular expressions, even old fashioned regular expressions. It depends entirely on context, your particular piece of (possibly malformed) HTML and what you’re trying to do with it. I’m not advocating regular expressions for HTML parsing, just saying that the question of whether they work is complicated.

This opens up an interesting line of inquiry. Instead of asking whether strict regular expressions can parse strict HTML, you could ask what is the probability that a regular expression will succeed at a particular task for an HTML file in the wild. If you define “HTML” as actual web pages rather than files conforming to a particular grammar, every technique will fail with some probability. The question is whether that probability is acceptable in context, whether using regular expressions or any other technique.

Related post: Coming full circle

Read More

Learn one Perl command

A while back I wrote a post Learn one sed command. In a nutshell, I said it’s worth learning sed just do commands of the form sed s/foo/bar/ to replace “foo” with “bar.”

Dan Haskin and Will Fitzgerald suggested in their comments that instead of sed use perl -pe with the same command. The advantage is that you could use Perl’s more powerful regular expression syntax. Will said he uses Perl like this:

cat file | perl -pe "s/old/new/g" > newfile

I think they’re right. Except for the simplest regular expressions, sed’s regular expression syntax is too restrictive. For example, I recently needed to remove commas that immediately follow a digit and this did the trick:

cat file | perl -pe "s/(?<=d),//g" > newfile

Since sed does not have the look-behind feature or d for digits, the corresponding sed code would be more complicated.

I quit writing Perl years ago. I don’t miss Perl as a whole, but I do miss Perl’s regular expression support.

Learning Perl is a big commitment, but just learning Perl regular expressions is not. Perl is the leader in regular expression support, and many programming languages implement a subset of Perl’s regex features. You could just use a subset of Perl features you already know, but you’d have the option of using more features.

Related post: Perl 6 as the anti-JavaScript

Read More

Just what do you mean by "number"?

Tom Christiansen gave an awesome answer to the question of how to match a number with a regular expression. He begins by clarifying what the reader means by “number”, then gives answers for each.

  • Is −0 a number?
  • How do you feel about √−1?
  • Is or a number?
  • Is 186,282.42±0.02 miles/second one number — or is it two or three of them?
  • Is 6.02e23 a number?
  • Is 3.141_592_653_589 a number? How about π, or ? And −2π⁻³ ͥ?
  • How many numbers in 0.083̄?
  • How many numbers in 128.0.0.1?
  • What number does hold? How about ⚂⚃?
  • Does 10,5 mm have one number in it — or does it have two?
  • Is ∛8³ a number — or is it three of them?
  • What number does ↀↀⅮⅭⅭⅬⅫ AUC represent, 2762 or 2009?
  • Are ४५६७ and ৭৮৯৮ numbers?
  • What about 0377, 0xDEADBEEF, and 0b111101101?
  • Is Inf a number? Is NaN?
  • Is ④② a number? What about ?
  • How do you feel about ?
  • What do ℵ₀ and ℵ₁ have to do with numbers? Or , , and ?

See his full response here. Thanks to Bill the Lizard for pointing this out.

Read More

Roman numeral puzzle

I noticed an ad for Super Bowl XLVI on a pizza box this morning. The Roman numeral XLVI does not repeat any character. This brought up a couple questions.

  • How many Roman numerals are possible if you’re not allowed to repeat a character?
  • Could you write a (reasonably short) regular expression to find all such numbers?

You can post your solutions to either question in the comments.

There has never been universal agreement on the rules for constructing Roman numerals, so your solution would depend on your choice of rules. For our purposes here, assume the valid characters are I, V, X, L, C, D, and M. Also, assume any character can be subtracted from a larger character. For example, you can assume IL is a valid representation of 49.

For a more challenging problem, you can use the more restrictive subtraction rules.

  1. I can be subtracted from V and X only.
  2. X can be subtracted from L and C only.
  3. C can be subtracted from D and M only.
  4. V, L, and D can never be subtracted.

Other puzzle posts:

A Renaissance math puzzle
Technology history quiz
A log puzzle

Read More

Learn one sed command

You may have seen sed programs even if you didn’t know that’s what they were. In online discussions it’s common to hear someone say

s/foo/bar/

as a shorthand to mean “replace foo with bar.” The line s/foo/bar/ is a complete sed program to do such a replacement.

sed comes with every Unix-like operating system and is available for Windows here. It has a range of features for editing files, but sed is worth using even if you only know how to do one thing with it:

sed "s/pattern1/pattern2/g" file.txt > newfile.txt

This will replace every instance of pattern1 with pattern2 in the file file.txt and will write the result to newfile.txt. The original file file.txt is unchanged.

I used to think there was no reason to use sed when other languages like Python will do everything sed does and much more. Suppose you agree with that. Now suppose you find you often have to make global search-and-replace operations and so you write a script to do this, say a Python script. You’ve got to call your script something, remember what you called it, and put it in your path. How about calling it sed? Or better, don’t write your script, but pretend that you did. If you’re on Linux, it’s already in your path. One advantage of the real sed over your script named sed is that the former can do a lot more, should you ever need it to.

Now for a few details regarding the sed command above. The “s” on the front stands for “substitute” and the “g” on the end stands for “global.” Without the “g” on the end, sed would only replace the first instance of the pattern on each line. If that’s what you want, then remove the “g.”

The patterns inside a sed command are regular expressions, so it’s best to get in the habit of always quoting sed commands. This isn’t necessary for simple string substitutions, but regular expressions often contain characters that you’ll need to prevent the shell from interpreting.

You may find the default regular expression support in sed odd or restrictive. If you’re used to regular expressions in Perl, Python, JavaScript, etc. and you’re using a Gnu implementation of sed, you can add the -r option for more familiar regular expression syntax.

I got the idea for this post from Greg Grouthaus’ post Why you should learn just a little Awk. He makes a good case that you can benefit from learning just a few commands of a language like Awk with no intention to learn more of the language.

Related posts:

Good old regular expressions
Tips for learning regular expressions
A little awk

Read More

Efficiency of regular expressions

I’ve never optimized a regular expression.  I typically use regular expressions in scripts where efficiency doesn’t matter. And sometimes I do some regular expression processing as part of a larger program in which the bottleneck is somewhere else. But I’ve never worried about the efficiency of a regular expression.

Regular expression efficiency can matter. There are some regular expressions that can be astonishingly slow to match with some regular expression implementations. Russ Cox gives an example of a regular expression that takes Perl a minute to match against a string that’s only 29 characters long. Another regular expression implementation does the same match six orders of magnitude faster.

The example is contrived to show the difference between the approaches. I’m not sure whether I’ve ever run into a horribly inefficient regular expression on accident. Maybe once. I imagine people run into efficiency issues in practical applications, though I don’t.

I’d suggest worrying about regex efficiency if and when it becomes a problem. If your experience matches mine, that need may never come.

But if you do have a regex efficiency problem, or simply find regex implementation details interesting, I’d recommend Russ Cox’s article.

Related links:

Good old regular expressions
RegexTip Twitter feed
Tips for learning regular expressions

Notes on using regular expressions in

Read More

Good old regular expressions

Here are two examples that persuaded me long ago that regular expressions could be powerful. Both come from The Unix Programming Environment by Kernighan and Pike (1984).

The first problem is to produce a list of all English words that contain all five vowels exactly once and in alphabetical order.

The book creates a regular expression aphavowels

^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$

then uses it to filter a dictionary file

egrep -f alphavowels /usr/dict/web2

This produced 16 words ranging from abstemious to majestious.

The second problem is to produce a list of all English words of at least six letters with letters appearing in increasing alphabetical order.

The book creates a regular expression named monotonic

^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$

then uses it to filter a dictionary file as before, except there is an additional filter stage.

egrep -f monotonic /usr/dict/web2 | grep '......'

This produced 17 words including common words such as almost and ghosty. Some of the more interesting results were bijoux, chintz, and egilops. Kernighan and Pike explain that egilops is a disease that attacks wheat.

Explanation

The regular expressions above are fairly long, but shorter and more transparent than a procedural program to solve the same problem. The solutions may look mysterious at first sight, but they are entirely straight-forward once you know the most basic features of regular expressions.

In the first problem, the pattern [^aeiou] says to look for anything that isn’t a vowel, i.e. is a consonant (assuming entries in the dictionary file contain only letters). So the regular expression says to start at the beginning of each line and look for zero or more consonants, followed by an ‘a’, followed by zero or more consonants, followed by an ‘e’, and so on down to a ‘u’ optionally followed by consonants at the end of the line.

In the second problem, the question mark matches zero or one instances of a character, i.e. the character is optional. The regular expression says to start at the beginning of each line, look for an optional ‘a’, followed by an optional ‘b’, and so forth to the end of the line. Then the output is filtered by another regular expression ....... Since a period matches any character, a sequence of six periods says to select only words that contain six characters.

Related links:

Tips for learning regular expressions
@RegexTip: One regular expression tip per day

Read More

Daily tip Twitter account FAQ

This post answers some frequently asked questions regarding my daily tip accounts on Twitter.

How many followers do you have?

About 2800 people are following at least one of these accounts at the time of writing, each following between 2 and 3 accounts on average for a total of about 5900 follows combining all accounts.

How do you schedule your posts?

I schedule the tips a week to a month in advance using HootSuite. Each account posts at the same time each week day starting with SansMouse at 9:30 AM and ending with AlgebraFact at 1:30 PM (US Central Time). Once in a while a tip will go out late due to a Twitter API failure. I occasionally sprinkle in a few unscheduled tweets but I keep the volume low.

What if I don’t use Twitter?

You can subscribe to a Twitter feed via RSS just as you would a blog. For example, you could follow Twitter accounts via Google Reader.

How advanced are these tips?

The SansMouse and RegexTip are in a cycle that starts with the most familiar tips and then moves on to less familiar ones. I mix elementary and advanced material in the mathematical accounts, though there’s a greater range in some accounts. If a tip is more elementary or more advanced than you’d like, you may find something more to your liking in a day or two.

Do you take suggestions? Questions?

I welcome corrections as well as suggestions for new tips. General suggestions are helpful, but I especially appreciate specific tweets.

I like to answer questions when I can, though I can’t respond to every question.

Any plans for new accounts?

I have a couple ideas I’m considering. If you have a suggestion for another daily tip account, let me know.

What about questions about specific accounts?

The following isn’t Q&A format, but it answers some common questions.

SansMouse icon SansMouse is my oldest daily tip account. It gives one Windows keyboard shortcut each day. Most tips apply across all applications, but some tips are specific to popular software packages. Ben Jaffe has an analogous Twitter account commandtab for Mac OS.

RegexTip icon RegexTip is currently the second most popular daily tip account with 1025 followers at the time of writing. This account gives tips for writing regular expressions as well as tips for how regular expressions are used in different environments. I mostly stick to the regular expression syntax supported by Perl 5, Microsoft’s .NET languages, JavaScript, etc.

TeXtip icon TeXtip gives tips for typesetting in TeX and LaTeX. Topics include TeX commands, software for working with LaTeX, tips on typography, etc. I basically stick to LaTeX, though much applies to plain TeX. Also, I don’t say much about add-on packages but stick to the heart of LaTeX.

ProbFact icon ProbFact is currently the most popular daily tip account with 1100 followers. This account gives one fact per day from probability. Probability and statistics are intimately related, but this account primarily sticks to probability proper, though some posts are more statistical. People have suggested I start a statistical counterpart to ProbFact, but I have no plans to do so.

AlgebraFact icon AlgebraFact gives facts from linear algebra, number theory, group theory, etc. A few of the facts are advanced but most are not. I’ve gotten a few complaints for including number theory, but that’s been part of the charter from the beginning. The name AlgebraAndNumberTheoryFact would be more accurate, but too long.

TopologyFact icon TopologyFact gives theorems from topology (point-set topology, algebraic topology, etc.) and geometry. Point-set topology has a lot of theorems that can be condensed into 140 characters but I find other areas harder to tweet about. I could use some help here. I’d like to include more geometry: Euclidean geometry, differential geometry, etc. But I don’t want to include material so esoteric that not many will understand it.

AnalysisFact icon AnalysisFact is the most advanced account on average. Some of the posts are elementary, but some fairly advanced. Topics include real and complex analysis, functional analysis, special functions, differential equations, etc.

I’m doing a drawing to give away either a coffee mug or T-shirt to someone who mentions these tips on Twitter. Tomorrow is the last day to enter.

JohnDCook icon I also have a personal Twitter account. I use this account to post links, interact with friends, etc. I try to keep the signal to noise ratio fairly high, though not as high as the tip accounts.

Read More

Twitter daily tip news

I have five Twitter accounts that send out one tip per day, including a new one I just added last week.

Regular expressions

@RegexTip started over today. It’s a cycle of tips for learning regular expressions. It sticks to the regular expression features common to Python, Perl, C#, and many other programming languages. This account posts Monday through Friday.

Keyboard shortcuts

@SansMouse gives one tip a day on using Windows without a mouse. By practicing one keyboard shortcut a day, you can get into the habit of using your mouse less and your keyboard more. This cycle of tips started over January 29 with the most common and most widely useful shortcuts. I’m also sprinkling in a few extra tips that are less well known. This account also posts Monday through Friday.

Math

I have three mathematical accounts. These post seven days a week.

@AlgebraFact, just started February 2. It will be a mixture of linear algebra, number theory, group theory, etc.

@ProbFact gives one fact per day from probability. Usually these facts are theorems, but sometimes they include a note on history or applications.

@AnalysisFact gives facts from real and complex analysis. The topics range from elementary to advanced.

What if I don’t use Twitter?

You can visit the page for a Twitter account just like any other web page. And every Twitter account has an RSS feed link allowing you to subscribe just as you would subscribe to a blog.

How do you write these?

I write up content for these accounts in bulk. I may sit down on a Saturday and come up with several weeks worth of tips. Then I use HootSuite to schedule the tips weeks in advance. Sometimes I’ll post something spontaneously, such as link to something relevant, but most of the work is done in advance. I use my personal Twitter account for live interaction.

Related links:

Using Windows without a mouse

Regular expressions in

Chart of probability distribution relationships

Read More

Regular expressions in Mathematica

Regular expressions are fairly portable. There are two main flavors of regular expressions — POSIX and Perl — and more languages these days use the Perl flavor. There are some minor differences in what it means to be “like Perl” but for the most part languages that say they follow Perl’s lead specify regular expressions the same way. The differences lie in how you use regular expressions: how you form matches, how you replace strings, etc.

Mathematica uses Perl’s regular expression flavor. But how do you use regular expressions in Mathematica? I’ll give a few tips here and give more details in the notes Regular expressions in Mathematica.

First of all, unlike Perl, Mathematica specifies regular expressions with ordinary strings. This means that metacharacters have to be doubly escaped. For example, to represent the regular expression d{4} you must use the string "\d{4}".

The function StringCases returns a list of all matches of a regular expression in a string. If you simply want to know whether there was a match, you can use the function StringFreeQ. However, note the you probably want the opposite of the return value from StringFreeQ because it returns whether a string does not contain a match.

By default, the function StringReplace replaces all matches of a regular expression with a given replacement pattern. You can limit the number of replacements it makes by specifying an addition argument.

Related links:

Regular expressions in Mathematica
Tips for getting started with regular expressions
Languages that are easy to pick back up
Regular expressions in C++, Python, R, PowerShell

Read More

New daily tip feeds: RegexTip and ProbFact

A few weeks ago I started a Twitter account @SansMouse with daily tips on Windows keyboard shortcuts. That’s gone well, so I decided start two more daily tip accounts: @RegexTip and @ProbFact. If you don’t use Twitter, you can follow these tip via your blog reader. Here are the RSS feeds for RegexTip and ProbFact.

RegexTip will give one tip per day on using regular expressions. I’ll sometimes post more than once a day, but I’ll only give one tip. Other posts might be housekeeping notes etc. The idea is that people who have intended to learn regular expressions but don’t have the time can make time to absorb one tip per day.

ProbFact will give one fact per day from probability. I’ll often have a link with each fact for more details. Many of these facts will be theorems, but some will be statements about applications or history.

The SansMouse and RegexTip posts are loosely arranged in order of familiarity, starting with the most basic and most widely used material. ProbFact posts will be more random. Some will be elementary, some more advanced.

I have scheduled tips to come out at regular times starting tomorrow. SansMouse will post at 9 AM, RegexTip at 10 AM, and ProbFact at 11 AM. These times are Central Standard Time (UTC-6).

Related links:

Diagram of probability distribution relationships
Regular expressions in PowerShell, Python, and C++

Read More

Table-driven text munging in PowerShell

In my previous post, I mentioned formatting C++ code as HTML by doing some regular expression substitutions. I often need to write something that carries out a list of pattern substitutions, so I decided to rewrite the previous script to read a list of patterns from a file. Another advantage of putting the list of substitutions in an external file is that the same file could be used from scripts written in other languages.

Here’s the code:

param($regex_file)

$lines = get-content $regex_file

$a = get-clipboard

foreach( $line in $lines )
{
    $line = ($line.trim() -replace "s+", " ")
    $pair = $line.split(" ", [StringSplitOptions]::RemoveEmptyEntries)
    $a = $a -replace $pair
}

out-clipboard $a

The part of the script that is unique to formatting C++ as HTML is moved to a separate file, say cpp2html.txt, that is pass in as an argument to the script.

&  &amp;
<  &lt;
>  &gt;
"  &quot;
'  &#39;

Now I could use the same PowerShell script for any sort of task that boils down to a list of pattern replacements. (Often this kind of rough translation does not have to be done perfectly. It only has to be done well enough to reduce the amount of left over manual work to an acceptable level. You start with a small list of patterns and add more patterns until it’s less work to do the remaining work by hand than to make the script smarter.)

Note that the order of the lines in the file can be important. Substitutions are done from the top of the list down. In the example above, we want to first convert & to &amp; then convert < to &lt;. Otherwise, < would first become &lt; and then become &amp;lt;.

Read More

API symmetry

Symmetric APIs are easier to use. I was reminded of this when doing some regular expression programming in Python and comparing it to Perl. Perl’s regular expression operators for search and replace are symmetric in a way that their Python counterparts are not.

Perl uses m/pattern/ for matching and s/pattern/replacement/ for substitution. Both apply to the first instance of a pattern in a string by default. The g option following a match or substitute operator causes the command to apply to all instances of the pattern. The i option after either a match or substitute command causes the pattern to apply in a case-insensitive manner. Matching and substitution are symmetric.

Python uses re.search() for matching and re.sub() for substitution. The search function can only apply to the first instance of a pattern; to match all instances of a pattern, use re.findall(). The function re.sub() applies to all instances by default, but it has a max parameter that can be set to limit the number of instances it applies to. To make a search pattern case-insensitive, pass in re.IGNORECASE flag. To make a substitution case-insensitive, modify the regular expression itself by adding (?i).

In general, I find Python syntax much cleaner than Perl, but regular expressions are implemented more elegantly in Perl.

Read More