Regular expression to match a chemical element

Here’s a frivolous exercise in regular expressions: Write a regex to match any chemical element symbol.

Here’s one solution.

A[cglmrstu]|B[aehikr]?|C[adeflmnorsu]?|D[bsy]|E[rsu]|F[elmr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airuv]|M[dgnot]|N[abdeiop]?|Os?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilm]|U(u[opst])?|V|W|Xe|Yb?|Z[nr]

Update: When this post was written, elements 113, 115, 117, and 118 had placeholder names. For example, 115 was “ununpentium.” Now these elements are nihonium, moscovium, tennessine, and oganesson. The updated regular expression is the following.

A[cglmrstu]|B[aehikr]?|C[adeflmnorsu]?|D[bsy]|E[rsu]|F[elmr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airuv]|M[cdgnot]|N[abdehiop]?|O[gs]?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilms]|U|V|W|Xe|Yb?|Z[nr]
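As a sanity check, the updated expression can be tested against the full list of 118 symbols. The following is a minimal Python sketch, not part of the original post; the names ELEMENT_RE, SYMBOLS, missed, and extras are chosen here for illustration.

```python
import re
import string

# The updated expression from the post, split across lines only for width.
ELEMENT_RE = re.compile(
    r"A[cglmrstu]|B[aehikr]?|C[adeflmnorsu]?|D[bsy]|E[rsu]|F[elmr]?|G[ade]|"
    r"H[efgos]?|I[nr]?|Kr?|L[airuv]|M[cdgnot]|N[abdehiop]?|O[gs]?|"
    r"P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilms]|U|V|W|Xe|Yb?|Z[nr]"
)

# All 118 element symbols in order of atomic number.
SYMBOLS = """
H  He Li Be B  C  N  O  F  Ne
Na Mg Al Si P  S  Cl Ar K  Ca
Sc Ti V  Cr Mn Fe Co Ni Cu Zn
Ga Ge As Se Br Kr Rb Sr Y  Zr
Nb Mo Tc Ru Rh Pd Ag Cd In Sn
Sb Te I  Xe Cs Ba La Ce Pr Nd
Pm Sm Eu Gd Tb Dy Ho Er Tm Yb
Lu Hf Ta W  Re Os Ir Pt Au Hg
Tl Pb Bi Po At Rn Fr Ra Ac Th
Pa U  Np Pu Am Cm Bk Cf Es Fm
Md No Lr Rf Db Sg Bh Hs Mt Ds
Rg Cn Nh Fl Mc Lv Ts Og
""".split()

# Symbols the expression fails to match in full (should be empty).
missed = [s for s in SYMBOLS if not ELEMENT_RE.fullmatch(s)]

# Every string of the form "X" or "Xy", to look for false positives.
candidates = [
    upper + lower
    for upper in string.ascii_uppercase
    for lower in [""] + list(string.ascii_lowercase)
]
extras = [c for c in candidates
          if ELEMENT_RE.fullmatch(c) and c not in SYMBOLS]

print(len(SYMBOLS), missed, extras)
```

Using fullmatch rather than search matters here: it asks whether the whole candidate string is a symbol, not whether it merely contains one.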

Making it more readable

Here’s the original expression in more readable form:

/
A[cglmrstu]     | 
B[aehikr]?      | 
C[adeflmnorsu]? | 
D[bsy]          | 
E[rsu]          | 
F[elmr]?        | 
G[ade]          | 
H[efgos]?       | 
I[nr]?          | 
Kr?             | 
L[airuv]        | 
M[dgnot]        | 
N[abdeiop]?     | 
Os?             | 
P[abdmortu]?    | 
R[abefghnu]     | 
S[bcegimnr]?    | 
T[abcehilm]     | 
U(u[opst])?     | 
V               | 
W               | 
Xe              | 
Yb?             | 
Z[nr]
/x

The /x option in Perl says to ignore white space in the pattern. Other regular expression implementations have something similar. Python has two names for the option: re.X for similarity with Perl, and re.VERBOSE for readability. Both have the same behavior.
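For Python readers, here is a rough equivalent of the Perl layout above, given as a sketch rather than anything from the original post. The variable names are made up for the example. It compiles the original expression in verbose form and checks that stripping the white space gives back the one-line version.

```python
import re

# The original expression, laid out one alternative per line.
ORIGINAL_VERBOSE = r"""
    A[cglmrstu]     |
    B[aehikr]?      |
    C[adeflmnorsu]? |
    D[bsy]          |
    E[rsu]          |
    F[elmr]?        |
    G[ade]          |
    H[efgos]?       |
    I[nr]?          |
    Kr?             |
    L[airuv]        |
    M[dgnot]        |
    N[abdeiop]?     |
    Os?             |
    P[abdmortu]?    |
    R[abefghnu]     |
    S[bcegimnr]?    |
    T[abcehilm]     |
    U(u[opst])?     |
    V               |
    W               |
    Xe              |
    Yb?             |
    Z[nr]
"""

# The same expression on one line, as in the post.
ORIGINAL_COMPACT = (
    r"A[cglmrstu]|B[aehikr]?|C[adeflmnorsu]?|D[bsy]|E[rsu]|F[elmr]?|G[ade]|"
    r"H[efgos]?|I[nr]?|Kr?|L[airuv]|M[dgnot]|N[abdeiop]?|Os?|P[abdmortu]?|"
    r"R[abefghnu]|S[bcegimnr]?|T[abcehilm]|U(u[opst])?|V|W|Xe|Yb?|Z[nr]"
)

# re.VERBOSE (or re.X) makes the compiler skip the layout white space.
verbose_re = re.compile(ORIGINAL_VERBOSE, re.VERBOSE)

# Removing all white space by hand recovers the compact form.
same_text = "".join(ORIGINAL_VERBOSE.split()) == ORIGINAL_COMPACT

print(same_text, bool(verbose_re.fullmatch("Fe")))
```

One caveat when adapting this: in verbose mode an unescaped # starts a comment and white space inside a character class is still significant, so patterns that need either must be written with care.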

Here’s the updated expression in more readable form:

/
A[cglmrstu]     | 
B[aehikr]?      | 
C[adeflmnorsu]? | 
D[bsy]          | 
E[rsu]          | 
F[elmr]?        | 
G[ade]          | 
H[efgos]?       | 
I[nr]?          | 
Kr?             | 
L[airuv]        | 
M[cdgnot]       | 
N[abdehiop]?    | 
O[gs]?          | 
P[abdmortu]?    | 
R[abefghnu]     | 
S[bcegimnr]?    | 
T[abcehilms]    | 
U               | 
V               | 
W               | 
Xe              | 
Yb?             | 
Z[nr]
/x

Regex syntax

The regular expression says that a chemical element symbol may start with A, followed by c, g, l, m, r, s, t, or u; or a B, optionally followed by a, e, h, i, k, or r; or …

The most complicated part of the regex was the part for symbols starting with U. There’s uranium, whose symbol is simply U, and there are the elements that had temporary names based on their atomic numbers: ununtrium, ununpentium, ununseptium, and ununoctium. These names spell out the digits one-one-three, one-one-five, one-one-seven, and one-one-eight using Latin and Greek numerical roots. The symbols starting with U were then U, Uut, Uup, Uus, and Uuo. The regex U(u[opst])? can be read “U, optionally followed by u and one of o, p, s, or t.” Now that the temporary names are gone, uranium is the only element whose abbreviation starts with U.
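The reading of U(u[opst])? given above can be checked directly. This is a small Python sketch with illustrative variable names, not code from the original post.

```python
import re

# The U branch of the original expression, taken on its own.
u_branch = re.compile(r"U(u[opst])?")

# Uranium plus the four placeholder symbols should all match in full.
old_symbols = ["U", "Uut", "Uup", "Uus", "Uuo"]
all_match = all(u_branch.fullmatch(s) for s in old_symbols)

# "Uu" alone is not a symbol: the optional group needs both the u
# and one of o, p, s, t, or it must be absent entirely.
partial = u_branch.fullmatch("Uu")

print(all_match, partial)
```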

Note that the regex will match any string that contains a chemical element symbol, so it matches much more than the symbols themselves. For example, it would match “I’ve never been to Boston in the fall” because that string contains B, the symbol for boron. Exercise: Modify the regex to only match chemical element symbols.
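One possible answer to the exercise, sketched in Python under the assumption that "only match" means the whole string must be a symbol: wrap the alternation in a non-capturing group and anchor both ends. The names ELEMENT, loose, and strict are invented for this example.

```python
import re

# The updated expression from the post.
ELEMENT = (
    r"A[cglmrstu]|B[aehikr]?|C[adeflmnorsu]?|D[bsy]|E[rsu]|F[elmr]?|G[ade]|"
    r"H[efgos]?|I[nr]?|Kr?|L[airuv]|M[cdgnot]|N[abdehiop]?|O[gs]?|"
    r"P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilms]|U|V|W|Xe|Yb?|Z[nr]"
)

# Unanchored: finds a symbol anywhere inside the string.
loose = re.compile(ELEMENT)

# Anchored: the group is needed so that \A and \Z apply to every
# alternative, not just the first and last ones.
strict = re.compile(r"\A(?:" + ELEMENT + r")\Z")

sentence = "I've never been to Boston in the fall"

found = loose.findall(sentence)       # symbols found inside the sentence
rejected = strict.search(sentence)    # None: the sentence is not a symbol
accepted = strict.search("B")         # a match: "B" alone is boron

print(found, rejected, bool(accepted))
```

In Python, calling fullmatch on the unanchored pattern has the same effect without editing the regex. Without the group, the anchors would bind only to the A and Z branches, which is an easy mistake to make.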

Regex golf

There may be clever ways to use fewer characters at the cost of being more obfuscated. But this is for fun anyway, so we’ll indulge in a little regex golf.

There are five elements whose symbols start with I or Z: I, In, Ir, Zn, and Zr. You could write [IZ][nr] to match four of these. The regex I|[IZ][nr] would represent all five with 10 characters, while I[nr]?|Z[nr] uses 12. Two characters saved! Can you cut out any more?
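The claim is easy to check mechanically. The Python sketch below is an illustration with made-up names; it enumerates every one- and two-letter candidate and compares what each form accepts when the whole string must match.

```python
import re
import string

long_form = r"I[nr]?|Z[nr]"    # 12 characters
short_form = r"I|[IZ][nr]"     # 10 characters

# Every string of the form "X" or "Xy".
candidates = [
    upper + lower
    for upper in string.ascii_uppercase
    for lower in [""] + list(string.ascii_lowercase)
]

def accepted(pattern):
    """Return the set of candidates that the pattern matches in full."""
    return {c for c in candidates if re.fullmatch(pattern, c)}

long_set = accepted(long_form)
short_set = accepted(short_form)

print(len(long_form), len(short_form), sorted(short_set))
```

The two forms agree when the whole string must match, but not necessarily in an unanchored search: alternatives are tried left to right, so searching "In" with the shorter form stops at "I", while the longer form consumes "In". Golfed versions may need their alternatives reordered if they are used for searching rather than full matching.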

Regex resources

Notes on regular expressions in Python, PowerShell, R, Mathematica, and C++

12 thoughts on “Regular expression to match any chemical element”

Mike G.

4 February 2016 at 09:01

Using a 118 element name list from
http://chemistry.about.com/od/elementfacts/a/elementlist.htm
and the Perl Regex::PreSuf module, I get:

wc tells me that’s 201 characters, so that only ties your initial regex.

I’ll have to find another way to cheat :)

John

4 February 2016 at 09:18

Mike: One advantage to your expression is that it makes it obvious which letters can stand alone as an element symbol: the ones at the end of the regex.

Mike G.

4 February 2016 at 09:20

True, although it’s just serendipitous.

That could be made more obvious by promoting the single-character set to the front:

Arthur David Olson

4 February 2016 at 11:16

Two cheap optimizations:
s/efgh/e-h/
s/lmno/l-o/
And one (failed) alternate approach: group by symbol ends rather than symbol starts.

Mike G.

5 February 2016 at 07:31

New Olson-optimized version, thanks:

Too bad the ‘-’ characters make the blog think those are line breaks…

8 February 2016 at 00:54

I got A(c|g|l|m|r|s|t|u)|B(a|e|h|i|k|r)|C(a|d|e|f|l|m|n|o|r|s|u)|D(b|s|y)|E(r|s|u)|F(e|l|m|r)|G(a|d|e)|H(e|f|g|o|s)|I(n|r)|Kr|L(a|i|r|u|v)|M(d|g|n|o|t)|N(a|b|d|e|i|o|p)|Os|P(a|b|d|m|o|r|t|u)|R(a|b|e|f|g|h|n|u)|S(b|c|e|g|i|m|n|r)|T(a|b|c|e|h|i|l|m)|Uu(o|p|s|t)|V|W|Xe|Yb|Z(n|r)

Arthur David Olson

8 February 2016 at 07:05

The appearance of I[nr] and Z[nr] cries out for the optimization [IZ][nr].
Noticing then that Sn and Sr are chemical symbols, we can add one
character to that optimization yielding [ISZ][nr]; this lets us drop two characters from S[bcegimnr].

Another pair of right-hand letters–r and u–pays off for grouping purposes, because if we can eliminate them from E[rsu] we end up with Es, saving the two bracket characters.

The result:

–ado

Mike G.

8 February 2016 at 07:39

‘E’: your regex won’t match any of the single-letter elements, like ‘C’. That’s why the original post had ‘?’ after the alternation blocks to make them optional.

‘ado’: nice!

Hermann

9 March 2016 at 13:50

Ado’s regexp is nice, and only 195 characters long:

Although “Perl” is named 2 times in the original posting, the task was to find (any) regexp.

I tried to bring in character class subtraction:
http://www.regular-expressions.info/charclasssubtract.html

But the alternative character class subtraction formulation for the 1 character elements is one character longer than Ado’s regexp:

$ echo -n "[BCFHIKNOPSUVWY]" | wc --bytes
16
$ echo -n "[B-Y-[DEGJLMQRT]]" | wc --bytes
17
$

Hermann

10 March 2016 at 10:16

2nd regexp is wrong on “X”, below is corrected version, can you please change?

…
But the alternative character class subtraction formulation for the 1 character elements is two characters longer than Ado’s regexp:

$ echo -n "[BCFHIKNOPSUVWY]" | wc --bytes
16
$ echo -n "[B-Y-[DEGJLMQRTX]]" | wc --bytes
18
$

Karsten Theis

7 March 2017 at 07:51

mosh

31 December 2017 at 03:54

A Perl program that uses the regexp above to convert a string into periodic table element symbols: https://stackoverflow.com/questions/48041494/how-to-convert-a-string-into-element-symbols-from-periodic-table-in-perl/48041495#48041495 Any improvements welcome.
