Regular expression to match any chemical element

Here’s a frivolous exercise in regular expressions: Write a regex to match any chemical element symbol.

Here’s one solution.

A[cglmrstu]|B[aehikr]?|C[adeflmnorsu]?|D[bsy]|E[rsu]|F[elmr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airuv]|M[dgnot]|N[abdeiop]?|Os?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilm]|U(u[opst])?|V|W|Xe|Yb?|Z[nr]

Update: When this post was written, elements 113, 115, 117, and 118 had placeholder names. For example, 115 was “ununpentium.” Now these elements are nihonium, moscovium, tennessine, and oganesson. The updated regular expression is the following.

A[cglmrstu]|B[aehikr]?|C[adeflmnorsu]?|D[bsy]|E[rsu]|F[elmr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airuv]|M[cdgnot]|N[abdehiop]?|O[gs]?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilms]|U|V|W|Xe|Yb?|Z[nr]

Making it more readable

Here’s the original expression in more readable form:

/
A[cglmrstu]     | 
B[aehikr]?      | 
C[adeflmnorsu]? | 
D[bsy]          | 
E[rsu]          | 
F[elmr]?        | 
G[ade]          | 
H[efgos]?       | 
I[nr]?          | 
Kr?             | 
L[airuv]        | 
M[dgnot]        | 
N[abdeiop]?     | 
Os?             | 
P[abdmortu]?    | 
R[abefghnu]     | 
S[bcegimnr]?    | 
T[abcehilm]     | 
U(u[opst])?     | 
V               | 
W               | 
Xe              | 
Yb?             | 
Z[nr]
/x

The /x option in Perl says to ignore white space. Other regular expression implementations have something similar. Python has two such options, X for similarity with Perl, and VERBOSE for readability. Both have the same behavior.

Here’s the updated expression in more readable form:

/
A[cglmrstu]     | 
B[aehikr]?      | 
C[adeflmnorsu]? | 
D[bsy]          | 
E[rsu]          | 
F[elmr]?        | 
G[ade]          | 
H[efgos]?       | 
I[nr]?          | 
Kr?             | 
L[airuv]        | 
M[cdgnot]       | 
N[abdehiop]?    | 
O[gs]?          | 
P[abdmortu]?    | 
R[abefghnu]     | 
S[bcegimnr]?    | 
T[abcehilms]    | 
U               | 
V               | 
W               | 
Xe              | 
Yb?             | 
Z[nr]
/x

Regex syntax

The regular expression says that a chemical element symbol may start with A, followed by c, g, l, m, r, s, t, or u; or a B, optionally followed by a, e, h, i, k, or r; or …

The most complicated part of the regex was the part for symbols starting with U. There’s uranium whose symbols is simply U, and there are the elements that had temporary names based on their atomic numbers: ununtrium, ununpentium, ununseptium, and ununoctium. These are just Latin for one-one-three, one-one-five, one-one-seven, and one-one-eight. The symbols were U, Uut, Uup, Uus, and Uuo. The regex U(u[opst])? can be read “U, optionally followed by u and one of o, p, s, or t.” Now that the temporary names are gone, uranium is the only element whose abbreviation starts with U.

Note that the regex will match any string that contains a chemical element symbol, but it could match more. For example, it would match “I’ve never been to Boston in the fall” because that string contains B, the symbol for boron. Exercise: Modify the regex to only match chemical element symbols.

Regex golf

There may be clever ways to use fewer characters at the cost of being more obfuscated. But this is for fun anyway, so we’ll indulge in a little regex golf.

There are five elements whose symbols start with I or Z: I, In, Ir, Zn, and Zr. You could write [IZ][nr] to match four of these. The regex I|[IZ][nr] would represent all five with 10 characters, while I[nr]?|Z[nr] uses 12. Two characters saved! Can you cut out any more?

Regex resources

Notes on regular expressions in Python, PowerShell, R, Mathematica, and C++

Graphemes

Here’s something amusing I ran across in the glossary of Programming Perl:

grapheme A graphene is an allotrope of carbon arranged in a hexagonal crystal lattice one atom thick. Grapheme, or more fully, a grapheme cluster string is a single user-visible character, which in turn may be several characters (codepoints) long. For example … a “ȫ” is a single grapheme but one, two, or even three characters, depending on normalization.

In case the character ȫ doesn’t display correctly for you, here it is:

Unicode character U_022B

First, graphene has little to do with grapheme, but it’s geeky fun to include it anyway. (Both are related to writing. A grapheme has to do with how characters are written, and the word graphene comes from graphite, the “lead” in pencils. The origin of grapheme has nothing to do with graphene but was an analogy to phoneme.)

Second, the example shows how complicated the details of Unicode can get. The Perl code below expands on the details of the comment about ways to represent ȫ.

This demonstrates that the character . in regular expressions matches any single character, but \X matches any single grapheme. (Well, almost. The character . usually matches any character except a newline, though this can be modified via optional switches. But \X matches any grapheme including newline characters.)

   
# U+0226, o with diaeresis and macron 
my $a = "\x{22B}"; 

# U+00F6 U+0304, (o with diaeresis) + macron 
my $b = "\x{F6}\x{304}";    
     
# o U+0308 U+0304, o + diaeresis + macron   
my $c = "o\x{308}\x{304}"; 

my @versions = ($a, $b, $c);

# All versions display the same.
say @versions;

# The versions have length 1, 2, and 3.
# Only $a contains one character and so matches .
say map {length $_ if /^.$/} @versions;

# All versions consist of one grapheme.
say map {length $_ if /^\X$/} @versions;

Perl regex twitter account

I’ve started a new Twitter account @PerlRegex for Perl regular expressions. My original account, @RegexTip, is for regular expressions in general and doesn’t go into much detail regarding any particular implementation. @PerlRegex goes into the specifics of regular expressions in Perl.

Why specifically Perl regular expressions? Because Perl has the most powerful support for regular expressions (strictly speaking, “pattern matching.”) Other languages offer “Perl compatible” regular expressions, though the degree of compatibility varies and is always less than complete.

I imagine more people have ruled England than have mastered the whole of the Perl language. But it’s possible to use Perl for regular expression processing without learning too much of the wider language.

PerlRegex icon

Update: I’ve stopped posting to this account. Here’s a list of my current accounts.

Regular expression resources

Continuing the series of resource posts each Wednesday, this week we have notes on regular expressions:

See also blog posts tagged regular expressions.

Last week: Probability resources

Next week: Numerical computing resources

Look-behind regex

Look-behind is one of those advanced/obscure regular expression features that I don’t use frequently enough to remember the syntax, but just frequently enough that I wish I could remember it.

Look-behind can be positive or negative. Look-behind says “match this position only if the preceding text matches (does not match) the following  pattern.”

The syntax in Perl and similar regular expression implementations is (?<= … ) for positive look-behind and (?<! … ) for negative look-behind. For the longest time I couldn’t remember whether the next symbol after ? was the direction (i.e. < for behind) or the polarity (= for positive, ! for negative). I was more likely to guess wrong unless I’d used the syntax recently.

The reason I was tempted to get these wrong is that I thought “positive look-behind” and “negative look-behind.” That’s how these patterns are described. But this means the words and symbols come in a different order. If you think look-behind positive and look-behind negative then the words and the symbols come in the same order:

look (?
behind <
positive =
negative !

Maybe this syntax comes more naturally to people who speak French and other languages where adjectives follow the thing they describe. English word order was tripping me up.

By the way, the syntax for look-ahead patterns is simpler: just leave out the <. The default direction for look-around patterns is forward. You don’t have to remember whether the symbol for direction or parity comes first because there is no symbol for direction.

Can regular expressions parse HTML or not?

Can regular expressions parse HTML? There are several answers to that question, both theoretical and practical.

First, let’s look at theoretical answers.

When programmers first learn about regular expressions, they often try to use them on HTML. Then someone wise will tell them “You can’t do that. There’s a computer science theorem that says regular expressions are not powerful enough.” And that’s true, if you stick to the original meaning of “regular expression.”

But if you interpret “regular expression” the way it is commonly used today, then regular expressions can indeed parse HTML. This post [Update: link went away] by Nikita Popov explains that what programmers commonly call regular expressions, such as PCRE (Perl compatible regular expressions), can match context-free languages.

Well-formed HTML is context-free. So you can match it using regular expressions, contrary to popular opinion.

So according to computer science theory, can regular expressions parse HTML? Not by the original meaning of regular expression, but yes, PCRE can.

Now on to the practical answers. The next lines in Nikita Popov’s post say

But don’t forget two things: Firstly, most HTML you see in the wild is not well-formed (usually not even close to it). And secondly, just because you can, doesn’t mean that you should.

HTML in the wild can be rather wild. On the other hand, it can also be simpler than the HTML grammar allows. In practice, you may be able to parse a particular bit of HTML with regular expressions, even old fashioned regular expressions. It depends entirely on context, your particular piece of (possibly malformed) HTML and what you’re trying to do with it. I’m not advocating regular expressions for HTML parsing, just saying that the question of whether they work is complicated.

This opens up an interesting line of inquiry. Instead of asking whether strict regular expressions can parse strict HTML, you could ask what is the probability that a regular expression will succeed at a particular task for an HTML file in the wild. If you define “HTML” as actual web pages rather than files conforming to a particular grammar, every technique will fail with some probability. The question is whether that probability is acceptable in context, whether using regular expressions or any other technique.

Related post: Coming full circle

Learn one Perl command

A while back I wrote a post Learn one sed command. In a nutshell, I said it’s worth learning sed just do commands of the form sed s/foo/bar/ to replace “foo” with “bar.”

Dan Haskin and Will Fitzgerald suggested in their comments that instead of sed use perl -pe with the same command. The advantage is that you could use Perl’s more powerful regular expression syntax. Will said he uses Perl like this:

    cat file | perl -lpe "s/old/new/g" > newfile

I think they’re right. Except for the simplest regular expressions, sed’s regular expression syntax is too restrictive. For example, I recently needed to remove commas that immediately follow a digit and this did the trick:

    cat file | perl -lpe "s/(?<=\d),//g" > newfile

Since sed does not have the look-behind feature or d for digits, the corresponding sed code would be more complicated.

I quit writing Perl years ago. I don’t miss Perl as a whole, but I do miss Perl’s regular expression support.

Learning Perl is a big commitment, but just learning Perl regular expressions is not. Perl is the leader in regular expression support, and many programming languages implement a subset of Perl’s regex features. You could just use a subset of Perl features you already know, but you’d have the option of using more features.

Just what do you mean by "number"?

Tom Christiansen gave an awesome answer to the question of how to match a number with a regular expression. He begins by clarifying what the reader means by “number”, then gives answers for each.

  • Is −0 a number?
  • How do you feel about √−1?
  • Is or a number?
  • Is 186,282.42±0.02 miles/second one number — or is it two or three of them?
  • Is 6.02e23 a number?
  • Is 3.141_592_653_589 a number? How about π, or ? And −2π⁻³ ͥ?
  • How many numbers in 0.083̄?
  • How many numbers in 128.0.0.1?
  • What number does hold? How about ⚂⚃?
  • Does 10,5 mm have one number in it — or does it have two?
  • Is ∛8³ a number — or is it three of them?
  • What number does ↀↀⅮⅭⅭⅬⅫ AUC represent, 2762 or 2009?
  • Are ४५६७ and ৭৮৯৮ numbers?
  • What about 0377, 0xDEADBEEF, and 0b111101101?
  • Is Inf a number? Is NaN?
  • Is ④② a number? What about ?
  • How do you feel about ?
  • What do ℵ₀ and ℵ₁ have to do with numbers? Or , , and ?

See his full response here. Thanks to Bill the Lizard for pointing this out.

Roman numeral puzzle

I noticed an ad for Super Bowl XLVI on a pizza box this morning. The Roman numeral XLVI does not repeat any character. This brought up a couple questions.

  • How many Roman numerals are possible if you’re not allowed to repeat a character?
  • Could you write a (reasonably short) regular expression to find all such numbers?

You can post your solutions to either question in the comments.

There has never been universal agreement on the rules for constructing Roman numerals, so your solution would depend on your choice of rules. For our purposes here, assume the valid characters are I, V, X, L, C, D, and M. Also, assume any character can be subtracted from a larger character. For example, you can assume IL is a valid representation of 49.

For a more challenging problem, you can use the more restrictive subtraction rules.

  1. I can be subtracted from V and X only.
  2. X can be subtracted from L and C only.
  3. C can be subtracted from D and M only.
  4. V, L, and D can never be subtracted.

Other puzzle posts

Learn one sed command

You may have seen sed programs even if you didn’t know that’s what they were. In online discussions it’s common to hear someone say

s/foo/bar/

as a shorthand to mean “replace foo with bar.” The line s/foo/bar/ is a complete sed program to do such a replacement.

sed comes with every Unix-like operating system and is available for Windows here. It has a range of features for editing files, but sed is worth using even if you only know how to do one thing with it:

sed "s/pattern1/pattern2/g" file.txt > newfile.txt

This will replace every instance of pattern1 with pattern2 in the file file.txt and will write the result to newfile.txt. The original file file.txt is unchanged.

I used to think there was no reason to use sed when other languages like Python will do everything sed does and much more. Suppose you agree with that. Now suppose you find you often have to make global search-and-replace operations and so you write a script to do this, say a Python script. You’ve got to call your script something, remember what you called it, and put it in your path. How about calling it sed? Or better, don’t write your script, but pretend that you did. If you’re on Linux, it’s already in your path. One advantage of the real sed over your script named sed is that the former can do a lot more, should you ever need it to.

Now for a few details regarding the sed command above. The “s” on the front stands for “substitute” and the “g” on the end stands for “global.” Without the “g” on the end, sed would only replace the first instance of the pattern on each line. If that’s what you want, then remove the “g.”

The patterns inside a sed command are regular expressions, so it’s best to get in the habit of always quoting sed commands. This isn’t necessary for simple string substitutions, but regular expressions often contain characters that you’ll need to prevent the shell from interpreting.

You may find the default regular expression support in sed odd or restrictive. If you’re used to regular expressions in Perl, Python, JavaScript, etc. and you’re using a Gnu implementation of sed, you can add the -r option for more familiar regular expression syntax.

I got the idea for this post from Greg Grouthaus’ post Why you should learn just a little Awk. He makes a good case that you can benefit from learning just a few commands of a language like Awk with no intention to learn more of the language.

More regular expression posts