Special characters make text processing more complicated because you have to pay close attention to context. If you’re looking at Python code containing a regular expression, you have to think about what you see, what Python sees, and what the regular expression engine sees. A character may be special to Python but not to regular expressions, or vice versa.
This post goes through an example in detail that shows how to manage special characters in several different contexts.
Escaping special TeX characters
I recently needed to write a regular expression [1] to escape TeX special characters. I’m reading in text like ICD9_CODE and need to make that ICD9\_CODE so that TeX will understand the underscore to be a literal underscore, and not a subscript instruction.
Underscore isn’t the only special character in TeX. It has ten special characters:
\ { } $ & # ^ _ % ~
The two that people most commonly stumble over are probably $ and % because these are fairly common in ordinary prose. Since % begins a comment in TeX, importing a percent sign without escaping it will fail silently. The result is syntactically valid. It just effectively cuts off the remainder of the line.
So whenever my script sees a TeX special character that isn’t already escaped, I’d like it to escape it.
Raw strings
First I need to tell Python what the special characters are for TeX:
special = r"\\{}$&#^_%~"
There’s something interesting going on here. Most of the characters that are special to TeX are not special to Python. But backslash is special to both. Backslash is also special to regular expressions. The r prefix in front of the quotes tells Python this is a “raw” string and that it should not interpret backslashes as special. It’s saying “I literally want a string that begins with two backslashes.”
Why two backslashes? Wouldn’t one do? We’re about to use this string inside a regular expression, and backslashes are special there too. More on that shortly.
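A quick check of what Python actually stores may make this concrete. This is only a minimal sketch; the names raw and cooked are made up for illustration and are not part of the script.

```python
# What Python stores for raw and non-raw string literals.
raw = r"\\"      # raw string: both backslashes are kept
cooked = "\\"    # ordinary string: Python collapses the pair into one backslash

print(len(raw))     # 2
print(len(cooked))  # 1

special = r"\\{}$&#^_%~"
print(len(special))  # 11: two backslashes plus the nine other characters
```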
Lookbehind
Here’s my regular expression:
re.sub(r"(?<!\\)([" + special + "])", r"\\\1", line)
I want special characters that have not already been escaped, so I’m using a negative lookbehind pattern. Negative lookbehind expressions begin with (?<! and end with ). So if, for example, I wanted to look for the string “ball” but only if it’s not preceded by “charity” I could use the regular expression
(?<!charity )ball
This expression would match “foot ball” or “foosball” but not “charity ball”.
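The three cases just mentioned can be checked directly. A small sketch using Python's re module, with the sample strings taken from the sentence above:

```python
import re

# "ball", but only when it is not immediately preceded by "charity ".
pattern = re.compile(r"(?<!charity )ball")

for text in ["foot ball", "foosball", "charity ball"]:
    found = pattern.search(text) is not None
    print(f"{text!r}: {found}")
# 'foot ball': True
# 'foosball': True
# 'charity ball': False
```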
Our lookbehind expression is complicated by the fact that the thing we’re looking back for is a special character. We’re looking for a backslash, which is a special character for regular expressions [2].
After looking behind for a backslash and making sure there isn’t one, we look for our special characters. The reason we used two backslashes in defining the variable special is so the regular expression engine would see two backslashes and interpret that as one literal backslash.
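It can help to print the assembled pattern, since that is the text the regular expression engine actually receives. The sketch below also confirms that the character class matches each of the ten TeX special characters; the names tex_specials, char_class, and matched are illustrative only.

```python
import re

special = r"\\{}$&#^_%~"
pattern = r"(?<!\\)([" + special + "])"

# This is what the regular expression engine is handed.
print(pattern)  # (?<!\\)([\\{}$&#^_%~])

# The ten TeX special characters as actual text. This is a non-raw
# string, so "\\" here is a single backslash.
tex_specials = "\\{}$&#^_%~"

char_class = re.compile("[" + special + "]")
matched = [c for c in tex_specials if char_class.fullmatch(c)]
print(len(matched))  # 10
```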
Captures
The second argument to re.sub tells it what to replace its match with. We put parentheses around the character class listing TeX special characters because we want to capture it to refer to later. Captures are referred to by position, so the first capture is \1, the second is \2, etc.
We want to tell re.sub to put a backslash in front of the first capture. Since backslashes are special to the regular expression engine, we send it \\ to represent a literal backslash. When we follow this with \1 for the first capture, the result is \\\1 as above.
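Here is the replacement template in isolation, on a simplified pattern that captures only an underscore. The second call is an alternative I'm adding for comparison, not something the script uses: re.sub also accepts a function as its replacement, which sidesteps the template's own escaping rules.

```python
import re

# \1 in the replacement stands for whatever the first group matched.
with_template = re.sub(r"(_)", r"\\\1", "ICD9_CODE")
print(with_template)  # ICD9\_CODE

# Same result with a replacement function instead of a template string.
with_function = re.sub(r"(_)", lambda m: "\\" + m.group(1), "ICD9_CODE")
print(with_function)  # ICD9\_CODE
```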
Testing
We can test the code above by running the substitution on the following line.
line = r"a_b $200 {x} %5 x\y"
and get
a\_b \$200 \{x\} \%5 x\\y
which would cause TeX to produce output that looks like
a_b $200 {x} %5 x\y.
Note that we used a raw string for our test case. That was only necessary for the backslash near the end of the string. Without that we could have dropped the r in front of the opening quote.
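Putting the pieces together, the test can be run end to end as follows. The wrapper function escape_tex is a name I'm introducing for convenience; the post itself calls re.sub directly.

```python
import re

special = r"\\{}$&#^_%~"

def escape_tex(line):
    """Put a backslash before each TeX special character that is not
    already preceded by one."""
    return re.sub(r"(?<!\\)([" + special + "])", r"\\\1", line)

line = r"a_b $200 {x} %5 x\y"
print(escape_tex(line))  # a\_b \$200 \{x\} \%5 x\\y
```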
P.S. on raw strings
Note that you don’t have to use raw strings. You could just escape your special characters with backslashes. But we’ve already got a lot of backslashes here. Without raw strings we’d need even more. Without raw strings we’d have to say
special = "\\\\{}$&#^_%~"
starting with four backslashes to send Python two to send the regular expression engine one.
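As a sanity check, the raw and non-raw spellings can be compared directly. This sketch assumes nothing beyond the strings already shown; the variable names are illustrative.

```python
raw_version = r"\\{}$&#^_%~"
plain_version = "\\\\{}$&#^_%~"
print(raw_version == plain_version)  # True

# The same holds for the full pattern and for the replacement template.
pattern_raw = r"(?<!\\)([" + raw_version + "])"
pattern_plain = "(?<!\\\\)([" + plain_version + "])"
print(pattern_raw == pattern_plain)  # True

print(r"\\\1" == "\\\\\\1")  # True
```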
[1] Whenever I write about using regular expressions someone will complain that my solution isn’t completely general and that they can create input that will break my code. I understand that, but it works for me in my circumstances. I’m just writing scripts to get my work done, not claiming to have written hardened production software for anyone else to use.
[2] Keep context in mind. We have three languages in play: TeX, Python, and regular expressions. One of the keys to understanding regular expressions is to see them as a small language embedded inside other languages like Python. So whenever you hear a character is special, ask yourself “Special to whom?”. It’s especially confusing here because backslash is special to all three languages.
I think the presented solution would fail on not so uncommon “\\%” sequence in TeX/LaTeX: end of line or row, then comment (to not introduce whitespace).
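The case described in this comment is easy to reproduce. A minimal sketch that reuses the post's substitution; the wrapper name escape_tex and the sample text are made up for illustration.

```python
import re

special = r"\\{}$&#^_%~"

def escape_tex(line):
    # Same substitution as in the post.
    return re.sub(r"(?<!\\)([" + special + "])", r"\\\1", line)

# A row ending in \\ followed directly by a % comment.
source = r"end of row\\% comment"
result = escape_tex(source)
print(result)  # end of row\\\% comment
```

The first backslash gets escaped, while the second backslash and the percent sign are skipped because each is preceded by a backslash. The output now has three backslashes before the percent sign, so TeX would read it as \\ followed by \% and the comment would no longer be a comment.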
Heh. I often get lost in regex backslash-hell, where I know I need some number of backslashes at some number of locations, but I don’t recall the rules well enough to get it right the first time. So I just permute my way through the possibilities, which for me is much faster than digging into the details of the context and rules.
Well, OK. It’s not always faster. But it is always easier, from the laziness perspective.
This is a beautiful description of regexp hell. I once needed 8 (or maybe it was 16) backslashes to make something work. I had troff spitting out a shell script which it then exec’d; the script grepped through some files and produced a small output file which then was sourced back into troff. (Automated index and cross-ref generation…). I’ve always been sort of proud of that, in a masochistic way. But I couldn’t say it was fun; and the only way to deal with the backslashes was trial and error.
Languages like sed, awk, and perl have regular expressions built in. An advantage of this is that you don’t have the confusion between the host language and the language of regular expressions.
My RSS reader cut off everything after (? in your charity ball lookbehind examples, so it became a practical lesson as well.
I like the cleanup that Perl6 did to regular expressions.
Rather than treating a regex as a special string, as Perl5 does, or as a string that has to be separately compiled, as in almost all other languages that have regexes, Perl6 makes it a language all its own that lives at the same level as “regular” Perl6 code. (It actually borrows some syntax from regular Perl6 code.)
So a direct translation of your code is relatively clear.
Note that <()> causes it to “capture” the point before the character it matches, which makes the replacement easy as it doesn’t have to refer to the capture at all.
For a more full-featured version, the following should work. (Not thoroughly tested.)
Note that captures nest in Perl6, and $0 is just sugar for $/[0]. So the (「\」*) capture is actually accessed on the outside as $/[0][0] or $0[0].
Since Perl6 treats regexes as code, you can just embed Perl6 code for more difficult parsing needs.