Here’s a frivolous exercise in regular expressions: Write a regex to match any chemical element symbol.
Here’s one solution.
A[cglmrstu]|B[aehikr]?|C[adeflmnorsu]?|D[bsy]|E[rsu]|F[elmr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airuv]|M[dgnot]|N[abdeiop]?|Os?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilm]|U(u[opst])?|V|W|Xe|Yb?|Z[nr]
Update: When this post was written, elements 113, 115, 117, and 118 had placeholder names. For example, 115 was “ununpentium.” Now these elements are nihonium, moscovium, tennessine, and oganesson. The updated regular expression is the following.
A[cglmrstu]|B[aehikr]?|C[adeflmnorsu]?|D[bsy]|E[rsu]|F[elmr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airuv]|M[cdgnot]|N[abdehiop]?|O[gs]?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilms]|U|V|W|Xe|Yb?|Z[nr]
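One way to gain confidence in the updated expression is to test it against a list of all 118 symbols. The following is a minimal Python sketch, not part of the original post: the names `ELEMENT`, `SYMBOLS`, `missed`, and `accepted` are made up for illustration, and the symbol list was typed in by hand. It checks that every symbol matches in full, and that no other one- or two-letter string does.

```python
import re
from itertools import product
from string import ascii_lowercase, ascii_uppercase

# The updated expression from above, split across lines only for width.
ELEMENT = re.compile(
    r"A[cglmrstu]|B[aehikr]?|C[adeflmnorsu]?|D[bsy]|E[rsu]|F[elmr]?|G[ade]"
    r"|H[efgos]?|I[nr]?|Kr?|L[airuv]|M[cdgnot]|N[abdehiop]?|O[gs]?"
    r"|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilms]|U|V|W|Xe|Yb?|Z[nr]"
)

# All 118 symbols in order of atomic number (hand-typed for this sketch).
SYMBOLS = """
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar
K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge As Se Br Kr
Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe
Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu
Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po At Rn
Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr
Rf Db Sg Bh Hs Mt Ds Rg Cn Nh Fl Mc Lv Ts Og
""".split()

# Symbols the regex fails to match in full (should be empty).
missed = [s for s in SYMBOLS if not ELEMENT.fullmatch(s)]

# Every string of the form "X" or "Xy", and the subset the regex accepts.
candidates = list(ascii_uppercase) + [
    upper + lower for upper, lower in product(ascii_uppercase, ascii_lowercase)
]
accepted = {s for s in candidates if ELEMENT.fullmatch(s)}

print(len(SYMBOLS), missed, accepted == set(SYMBOLS))
```

`fullmatch` is used rather than `search` so that a string like "Uut" is rejected instead of being accepted on the strength of its leading U.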
Making it more readable
Here’s the original expression in more readable form:
/ A[cglmrstu] | B[aehikr]? | C[adeflmnorsu]? | D[bsy] | E[rsu] | F[elmr]? | G[ade] | H[efgos]? | I[nr]? | Kr? | L[airuv] | M[dgnot] | N[abdeiop]? | Os? | P[abdmortu]? | R[abefghnu] | S[bcegimnr]? | T[abcehilm] | U(u[opst])? | V | W | Xe | Yb? | Z[nr] /x
The /x option in Perl says to ignore white space. Other regular expression implementations have something similar. Python has two such options, X for similarity with Perl, and VERBOSE for readability. Both have the same behavior.
Here’s the updated expression in more readable form:
/ A[cglmrstu] | B[aehikr]? | C[adeflmnorsu]? | D[bsy] | E[rsu] | F[elmr]? | G[ade] | H[efgos]? | I[nr]? | Kr? | L[airuv] | M[cdgnot] | N[abdehiop]? | O[gs]? | P[abdmortu]? | R[abefghnu] | S[bcegimnr]? | T[abcehilms] | U | V | W | Xe | Yb? | Z[nr] /x
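In Python, the spaced-out form might look like the sketch below. The variable names and the sample string are illustrative assumptions, not from the original post.

```python
import re

# The updated expression with free-form whitespace. re.VERBOSE tells the
# engine to ignore whitespace in the pattern and allows # comments in it.
ELEMENT_VERBOSE = re.compile(
    r"""
      A[cglmrstu] | B[aehikr]? | C[adeflmnorsu]? | D[bsy] | E[rsu]
    | F[elmr]?    | G[ade]     | H[efgos]?       | I[nr]? | Kr?
    | L[airuv]    | M[cdgnot]  | N[abdehiop]?    | O[gs]?
    | P[abdmortu]? | R[abefghnu] | S[bcegimnr]?  | T[abcehilms]
    | U | V | W | Xe | Yb? | Z[nr]
    """,
    re.VERBOSE,
)

# re.X is the short spelling of the same flag.
same_flag = re.X == re.VERBOSE

# Pull the element symbols out of a made-up sample string.
symbols = [m.group() for m in ELEMENT_VERBOSE.finditer("NaCl H2O Og")]
print(same_flag, symbols)
```

Because the flag makes whitespace insignificant, a literal space in a verbose pattern would have to be escaped or placed in a character class. This pattern contains none, so nothing changes.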
Regex syntax
The regular expression says that a chemical element symbol may start with A, followed by c, g, l, m, r, s, t, or u; or a B, optionally followed by a, e, h, i, k, or r; or …
The most complicated part of the original regex was the part for symbols starting with U. There’s uranium, whose symbol is simply U, and there are the elements that had temporary names based on their atomic numbers: ununtrium, ununpentium, ununseptium, and ununoctium. These names simply spell out the digits one-one-three, one-one-five, one-one-seven, and one-one-eight using Latin and Greek numerical roots. The symbols were U, Uut, Uup, Uus, and Uuo. The regex U(u[opst])? can be read “U, optionally followed by u and one of o, p, s, or t.” Now that the temporary names are gone, uranium is the only element whose symbol starts with U.
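The reading of the U branch can be checked directly. This is a small Python sketch; `u_branch` and the test strings are chosen here for illustration.

```python
import re

# The U branch of the original expression, on its own.
u_branch = re.compile(r"U(u[opst])?")

# Uranium plus the four placeholder symbols the branch was written for.
old_symbols = ["U", "Uut", "Uup", "Uus", "Uuo"]
all_accepted = all(u_branch.fullmatch(s) for s in old_symbols)

# "Uu" lacks the third letter, and "Uuq" ends in a letter outside [opst],
# so neither is accepted as a whole string.
rejected = [s for s in ["Uu", "Uuq", "Ux"] if not u_branch.fullmatch(s)]

print(all_accepted, rejected)
```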
Note that the regex matches any string that contains a chemical element symbol, so it accepts much more than the symbols themselves. For example, it would match “I’ve never been to Boston in the fall” because that string contains B, the symbol for boron. Exercise: Modify the regex to match only chemical element symbols.
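For readers who want to check their answer to the exercise, here is one possible approach in Python, offered as a sketch rather than the intended solution. It either asks for a full match or wraps the pattern in explicit anchors. The helper name `is_element_symbol` and the constant names are hypothetical.

```python
import re

# The updated expression, split across lines only for width.
PATTERN = (
    r"A[cglmrstu]|B[aehikr]?|C[adeflmnorsu]?|D[bsy]|E[rsu]|F[elmr]?|G[ade]"
    r"|H[efgos]?|I[nr]?|Kr?|L[airuv]|M[cdgnot]|N[abdehiop]?|O[gs]?"
    r"|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilms]|U|V|W|Xe|Yb?|Z[nr]"
)

sentence = "I've never been to Boston in the fall"

# Unanchored: every symbol-like fragment in the sentence is reported.
loose_hits = re.findall(PATTERN, sentence)


def is_element_symbol(text: str) -> bool:
    """Return True only if the whole string is an element symbol."""
    return re.fullmatch(PATTERN, text) is not None


# Equivalent idea for engines without fullmatch: anchor both ends.
# The (?:...) group keeps the anchors from binding to only the first
# and last alternatives.
ANCHORED = re.compile(r"\A(?:" + PATTERN + r")\Z")

print(loose_hits, is_element_symbol("B"), is_element_symbol("Boston"))
```

The non-capturing group matters: writing `\A` and `\Z` directly around the bare alternation would anchor only `A[cglmrstu]` at the start and `Z[nr]` at the end, leaving every alternative in between unanchored.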
Regex golf
There may be clever ways to use fewer characters at the cost of being more obfuscated. But this is for fun anyway, so we’ll indulge in a little regex golf.
There are five elements whose symbols start with I or Z: I, In, Ir, Zn, and Zr. You could write [IZ][nr] to match four of these. The regex I|[IZ][nr] would represent all five with 10 characters, while I[nr]?|Z[nr] uses 12. Two characters saved! Can you cut out any more?
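The character counts and the equivalence of the two forms can be checked with a short Python sketch. The variable names and the list of test strings are assumptions made for this example.

```python
import re

golfed = r"I|[IZ][nr]"
plain = r"I[nr]?|Z[nr]"

lengths = (len(golfed), len(plain))

# Whole-string matching: both forms should accept exactly I, In, Ir, Zn, Zr.
tests = ["I", "In", "Ir", "Zn", "Zr", "Z", "Ia", "Zz", "n", ""]
accepted_golfed = [s for s in tests if re.fullmatch(golfed, s)]
accepted_plain = [s for s in tests if re.fullmatch(plain, s)]

# Caveat: when searching inside a longer string, a backtracking engine
# takes the first alternative that succeeds, so the golfed form stops at
# "I" where the plain form takes "In".
search_golfed = re.search(golfed, "In").group()
search_plain = re.search(plain, "In").group()

print(lengths, accepted_golfed == accepted_plain, search_golfed, search_plain)
```

If the search behavior matters, putting the longer alternative first, as in `[IZ][nr]|I`, keeps the count at 10 characters and lets a search pick up "In" in full.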
Regex resources
Notes on regular expressions in Python, PowerShell, R, Mathematica, and C++