Finding pi in pi with Perl

Posted on 12 February 2021 by John

Here’s a frivolous problem whose solution illustrates three features of Perl:

Arbitrary precision floating point
Lazy quantifiers in regular expressions
Returning the positions of matched groups.

Our problem is to look for the digits 3, 1, 4, and 1 in the decimal part of π.

First, we get the first 100 digits of π after the decimal as a string. (It turns out 100 is enough, but if it weren’t we could try again with more digits.)

    use Math::BigFloat "bpi";

    $x = substr bpi(101)->bstr(), 2;

This loads Perl’s extended precision library Math::BigFloat, gets π to 101 significant figures, converts the result to a string, then lops off the first two characters “3.” at the beginning leaving “141592…”.

Next, we want to search our string for a 3, followed by some number of digits, followed by a 1, followed by some number of digits, followed by a 4, followed by some number of digits, and finally another 1.

A naive way to search the string would be to use the regex /3.*1.*4.*1/. But the star operator is greedy: it matches as much as possible. So the .* after the 3 would match as many characters as possible before backtracking to look for a 1. But we’d like to find the first 1 after a 3 etc.

The solution is simple: add a ? after each star to make the match lazy rather than greedy. So the regular expression we want is

   /3.*?1.*?4.*?1/

This will tell us whether our string contains the pattern we’re after, but we’d like to also know where the string contains the pattern. So we make each segment a captured group.

   /(3.*?)(1.*?)(4.*?)(1)/

Perl automatically populates an array @- with the positions of the matches, so it has the information we’re looking for. Element 0 of the array is the position of the entire match, so it is redundant with element 1. The advantage of this bit of redundancy is that the starting position of group $1 is in the element with index 1, the starting position of $2 is at index 2, etc.

We use the shift operator to remove the redundant first element of the array. Since shift modifies its argument, we can’t apply it directly to the constant array @-, so we apply it to a copy.

    if ($x =~ /(3.*?)(1.*?)(4.*?)(1)/) {
        @positions = @-;
        shift  @positions;
        print "@positions\n";
    }

This says that our pattern appears at positions 8, 36, 56, and 67. Note that these are array indices, and so they are zero-based. So if you count from 1, the first 3 appears in the 9th digit etc.

To verify that the digits at these indices are 3, 1, 4, and 1 respectively, we make the digits into an array, and slice the array by the positions found above.

    @digits = split(//, $x);
    print "@digits[@positions]\n";

This prints 3 1 4 1 as expected.

Python triple quote strings and regular expressions

Posted on 30 January 2021 by John

There are several ways to quote strings in Python. Triple quotes let strings span multiple lines. Line breaks in your source file become line break characters in your string. A triple-quoted string in Python acts something like “here doc” in other languages.

However, Python’s indentation rules complicate matters because the indentation becomes part of the quoted string. For example, suppose you have the following code outside of a function.

x = """\
abc
def
ghi
"""

Then you move this into a function foo and change its name to y.

def foo():
    y = """\
    abc
    def
    ghi
    """

Now x and y are different strings! The former begins with a and the latter begins with four spaces. (The backslash after the opening triple quote prevents the following newline from being part of the quoted string. Otherwise x and y would begin with a newline.) The string y also has four spaces in front of def and four spaces in front of ghi. You can’t push the string contents to the left margin because that would violate Python’s formatting rules. (Update: Oh yes you can! See Aaron Meurer’s comment below.)

We now give three solutions to this problem.

Solution 1: textwrap.dedent

There is a function in the Python standard library that will strip the unwanted space out of the string y.

import textwrap 

def foo():
    y = """\
    abc
    def
    ghi
    """
    y = textwrap.dedent(y)

This works, but in my opinion a better approach is to use regular expressions [1].

Solution 2: Regular expression with a flag

We want to remove white space, and the regular expression for a white space character is \s. We want to remove one or more white spaces so we add a + on the end. But in general we don’t want to remove all white space, just white space at the beginning of a line, so we stick ^ on the front to say we want to match white space at the beginning of a line.

import re 

def foo():
    y = """\
    abc
    def
    ghi
    """
    y = re.sub("^\s+", "", y)

Unfortunately this doesn’t work. By default ^ only matches the beginning of a string, not the beginning of a line. So it will only remove the white space in front of the first line; there will still be white space in front of the following lines.

One solution is to add the flag re.MULTILINE to the substitution function. This will signal that we want ^ to match the beginning of every line in our multi-line string.

    y = re.sub("^\s+", "", y, re.MULTILINE)

Unfortunately that doesn’t quite work either! The fourth positional argument to re.sub is a count of how many substitutions to make. It defaults to 0, which actually means infinity, i.e. replace all occurrences. You could set count to 1 to replace only the first occurrence, for example. If we’re not going to specify count we have to set flags by name rather than by position, i.e. the line above should be

    y = re.sub("^\s+", "", y, flags=re.MULTILINE)

That works.

You could also abbreviate re.MULTILINE to re.M. The former is more explicit and the latter is more compact. To each his own. There’s more than one way to do it. [2]

Solution 3: Regular expression with a modifier

In my opinion, it is better to modify the regular expression itself than to pass in a flag. The modifier (?m) specifies that in the rest of the regular the ^ character should match the beginning of each line.

    y = re.sub("(?m)^\s+", "", y)

One reason I believe this is better is that moves information from a language-specific implementation of regular expressions into a regular expression syntax that is supported in many programming languages.

For example, the regular expression

    (?m)^\s+

would have the same meaning in Perl and Python. The two languages have the same way of expressing modifiers [3], but different ways of expressing flags. In Perl you paste an m on the end of a match operator to accomplish what Python does with setting flasgs=re.MULTILINE.

One of the most commonly used modifiers is (?i) to indicate that a regular expression should match in a case-insensitive manner. Perl and Python (and other languages) accept (?i) in a regular expression, but each language has its own way of adding modifiers. Perl adds an i after the match operator, and Python uses

    flags=re.IGNORECASE

    flags=re.I

as a function argument.

More on regular expressions

[1] Yes, I’ve heard the quip about two problems. It’s funny, but it’s not a universal law.

[2] “There’s more than one way to do it” is a mantra of Perl and contradicts The Zen of Python. I use the line here as a good-natured jab at Python. Despite its stated ideals, Python has more in common with Perl than it would like to admit and continues to adopt ideas from Perl.

[3] Python’s re module doesn’t support every regular expression modifier that Perl supports. I don’t know about Python’s regex module.

Searching Greek and Hebrew with regular expressions

Posted on 20 September 2020 by John

According to the Python Cookbook, “Mixing Unicode and regular expressions is often a good way to make your head explode.” It is thus with fear and trembling that I dip my toe into using Unicode with Greek and Hebrew.

I heard recently that there are anomalies in the Hebrew Bible where the final form of a letter is deliberately used in the middle of a word. That made me think about searching for such anomalies with regular expressions. I’ll come back to that shortly, but I’ll start by looking at Greek where things are a little simpler.

Greek

Only one letter in Greek has a variant form at the end of a word. Sigma is written ς at the end of a word and σ everywhere else. The following Python code shows we can search for sigmas and final sigmas in the first sentence from Matthew’s gospel.

    import re

    matthew = "Κατάλογος του γένους του Ιησού Χριστού, γιου του Δαυείδ, γιου του Αβραάμ"
    matches = re.findall(r"\w+ς\b", matthew)
    print("Words ending in sigma:", matches)

    matches = re.findall(r"\w*σ\w+", matthew)
    print("Words with non-final sigmas:", matches)

This prints

    Words ending in sigma: ['Κατάλογος', 'γένους']
    Words with non-final sigmas: ['Ιησού', 'Χριστού']

This shows that we can use non-ASCII characters in regular expressions, at least if they’re Greek letters, and that the metacharacters \w (word character) and \b (word boundary) work as we would hope.

Hebrew

Now for the motivating example. Isaiah 9:6 contains the word “לםרבה” with the second letter (second from right) being the final form of Mem, ם, sometimes called “closed Mem” because the form of the letter ordinarily used in the middle of a word, מ, is not a closed curve [1].

Here’s code to show we could find this anomaly with regular expressions.

    # There are five Hebrew letters with special final forms.
    finalforms = "\u05da\u05dd\u05df\u05e3\u05e5" # ךםןףץ

    lemarbeh = "\u05dc\u05dd\u05e8\u05d1\u05d4" # לםרבה
    m = re.search(r"\w*[" + finalforms + "\w+", lemarbeh)
    if m:
        print("Anomaly found:", m.group(0))

As far as I know, the instance discussed above is the only one where a final letter appears in the middle of a word. And as far as I know, there are no instances of a non-final form being used at the end of a word. For the next example, we will put non-final forms at the end of words so we can find them.

We’ll need a list of non-final forms. Notice that the Unicode value for each non-final form is 1 greater than the corresponding final form. It’s surprising that the final forms come before the non-final forms in numerical (Unicode) order, but that’s how it is. The same is true in Greek: final sigma is U+03c2 and sigma is U+03c3.

    finalforms = "\u05da\u05dd\u05df\u05e3\u05e5" # ךםןףץ
    nonfinal   = "\u05db\u05de\u05e0\u05e4\u05e6" # כמנפצ

We’ll start by taking the first line of Genesis

    genesis = "בראשית ברא אלהים את השמים ואת הארץ"

and introduce errors by replacing each final form with its corresponding non-final form. The code uses a somewhat obscure feature of Python and is analogous to the shell utility tr.

    genesis_wrong = genesis.translate(str.maketrans(finalforms, nonfinal))

(The string method translate does what you would expect, except that it takes a translation table rather than a pair of strings as its argument. The maketrans method creates a translation table from two strings.)

Now we can find our errors.

    anomalies = re.findall(r"\w+[" + nonfinal + r"]\b", genesis_wrong)
    print("Anomalies:", anomalies)

This produced

    Anomalies: ['אלהימ', 'השמימ', 'הארצ']

Note that Python printed the letters left-to-right.

Vowel points

The text above did not contain vowel points. If the text does contain vowel points, these are treated as letters between the consonants.

For example, the regular expression “בר” will not match against “בְּרֵאשִׁית” because the regex engine sees two characters between ב and ר, the dot (“dagesh”) inside ב and the two dots (“sheva”) underneath ב. But the regular expression “ב..ר” does match against “בְּרֵאשִׁית”.

See footnote [1].

Python

I started this post with a warning from the Python Cookbook that mixing regular expressions and Unicode can cause your head to explode. So far my head intact, but I’ve only tested the waters. I’m sure there are dragons in the deeper end.

I should include the rest of the book’s warning. I used the default library re that ships with Python, but the authors recommend if you’re serious about using regular expressions with Unicode,

… you should consider installing the third-party regex library which provides full support for Unicode case folding, as well as a variety of other interesting features, including approximate matching.

[1] When I refer to “בר” as a regular expression, I mean the character sequence ב followed by ר, which your browser will display from right to left because it detects it as characters from a language written from right to left. Written in terms of Unicode code points this would be “\u05d1\u05e8”, the code point for ב followed by the code point for ר.

Similarly, the second regular expression could be written “\u05d1..\u05e8”. The two periods in the middle are just two instances of the regular expression symbol that matches any character. It’s a coincidence that dots (periods) are being used to match dots (the dagesh and the sheva).

A shell one-liner to search directories

Posted on 19 April 2020 by John

I started this post by wanting to look at the frequency of LaTeX commands, but then thought that some people mind find the code to find the frequencies more interesting than the frequencies themselves.

So I’m splitting this into two posts. This post will look at the shell one-liner to find command frequencies, and the next post will look at the actual frequencies.

I want to explore LaTeX files, so I’ll start by using find to find such files.

    find . -name "*.tex"

This searches for files ending in .tex, starting with the current directory (hence .) and searching recursively into subdirectories. The find command explores subdirectories by default; you have to tell it not to if that’s not what you want.

Next, I want to use grep to search the LaTeX files. If I pipe the output of find to grep it will search the file names, but I want it to search the file contents. The xargs command takes care of this, receiving the file names and passing them along as file names, i.e. not as text input.

    find . -name "*.tex" | xargs grep ...

LaTeX commands have the form of a backslash followed by letters, so the regular expression I’ll pass is \\[a-z]+. This says to look for a literal backslash followed by one or more letters.

I’ll give grep four option flags. I’ll use -i to ask it to use case-insensitive matching, because LaTeX commands can begin contain capital letters. I’ll use -E to tell it I want to use extended regular expressions [1].

I’m after just the commands, not the lines containing commands, and so I use the -o option to tell grep to return just the commands, one per line. But that’s not enough. It would be enough if we were only search one file, but since we’re searching multiple files, the default behavior is for grep to return the file name as well. The -h option tells it to only return the matches, no file names.

So now we’re up to this:

    find . -name "*.tex" | xargs grep -oihE '\\[a-z]+'

Next I want to count how many times each command occurs, and I need to sort the output first so that uniq will count correctly.

    find . -name "*.tex" | xargs grep -oihE '\\[a-z]+' | sort | uniq -c

And finally I want to sort the output by frequency, in descending order. The -n option tells sort to sort numerically, and -r says to sort in descending order than the default ascending order. This produces a lot of output, so I pipe everything to less to view it one screen at a time.

    find . -name "*.tex" | xargs grep -oihE '\\[a-z]+' | sort | uniq -c | sort -rn | less

That’s my one-liner. In the next post I’ll look at the results.

Why can’t grep find negative numbers?

Posted on 7 December 2019 by John

Suppose you’re looking for instances of -42 in a file foo.txt. The command

    grep -42 foo.txt

won’t work. Instead you’ll get a warning message like the following.

    Usage: grep [OPTION]... PATTERN [FILE]...
    Try 'grep --help' for more information.

Putting single or double quotes around -42 won’t help. The problem is that grep interprets 42 as a command line option, and <codegrep doesn’t have such an option. This is a problem if you’re searching for negative numbers, or any pattern that beings with a dash, such as -able or --version.

The solution is to put -e in front of a regular expression containing a dash. That tells grep that the next token at the command line is a regular expression, not a command line option. So

    grep -e -42 foo.txt

will work.

You can also use -e several times to give grep several regular expressions to search for. For example,

    grep -e cat -e dog foo.txt

will search for “cat” or “dog.”

See the previous post for another example of where grep doesn’t seem to work. By default grep supports a restricted regular expression syntax and may need to be told to use “extended” regular expressions.

Why doesn’t grep work?

Posted on 5 December 2019 by John

If you learned regular expressions by using a programming language like Perl or Python, you may be surprised when tools like grep seem broken. That’s because what you think of as simply regular expressions, these tools consider extended regular expressions. Tell them to search on extended regular expressions and some of your frustration will go away.

As an example, we’ll revisit a post I wrote a while back about searching for ICD-9 and ICD-10 codes with regular expressions. From that post:

Most ICD-9 diagnosis codes are just numbers, but they may also start with E or V. Numeric ICD-9 codes are at least three digits. Optionally there may be a decimal followed by one of two more digits. … Sometimes the decimals are left out.

Let’s start with the following regular expression.

    [0-9]{3}\.?[0-9]{0,2}

This says to look for three instances of the digits 0 through 9, optionally followed by a literal period, followed by zero, one, or two more digits. (Since . is a special character in regular expressions, we have to use a backslash to literally match a period.)

The regular expression above will work with Perl or Python, but not with grep or sed by default. That’s because it uses two features of extended regular expressions (ERE), but programs like grep and sed support basic regular expressions (BRE) by default.

Basic regular expressions would use \{3\} rather than {3} to match a pattern three times. So, for example,

   echo 123 | grep "[0-9]\{3\}"

would return 123, but

   echo 123 | grep "[0-9]{3}"

would return nothing.

Similarly,

    echo 123 | sed -n "/[0-9]\{3\}/p"

would return 123 but

    echo 123 | sed -n "/[0-9]{3}/p"

returns nothing.

(The -n option to sed tells it not to print every line by default. The p following the regular expression tells sed to print those lines that match the pattern. Here there’s only one line, the output of echo, but typically grep and sed would be use on files with multiple lines.)

Turning on ERE support

You can tell grep and sed that you want to use extended regular expressions by giving either one the -E option. So, for example, both

   echo 123 | grep -E "[0-9]{3}"

and

    echo 123 | sed -E -n "/[0-9]{3}/p"

will print 123.

You can use egrep as a synonym for grep -E, at least with Gnu implementations.

Incidentally, awk uses extended regular expressions, and so

    echo 123 | awk "/[0-9]{3}/"

will also print 123.

Going back to our full regular expression, using \.? for an optional period works with grep and sed if we ask for ERE support. The following commands all print 123.4.

    echo 123.4 | grep -E "[0-9]{3}\.?[0-9]{0,2}"
    echo 123.4 | sed -E -n "/[0-9]{3}\.?[0-9]{0,2}/p"
    echo 123.4 | awk "/[0-9]{3}\.[0-9]{0,2}/"

Without the -E option, grep and sed will not return a match.

This doesn’t fix everything

At the top of the post I said that if you tell tools you want extended regular expression support “some of your frustration will go away.” The regular expression from my ICD code post was actually

    \d{3}\.?\d{0,2}

rather than

    [0-9]{3}\.?[0-9]{0,2}

I used the shortcut \d to denote a digit. Python, Perl, and Awk will understand this, but grep will not, even with the -E option.

grep will understand \d if instead you use the -P option, telling it you want to use Perl-compatible regular expressions (PCRE). The Gnu version of grep supports this option, but the man page says “This is experimental and grep -P may warn of unimplemented features.” I don’t know whether other implementations of grep support PCRE. And sed does not have an option to support PCRE.

Regular expressions and special characters

Posted on 31 August 2019 by John

Special characters make text processing more complicated because you have to pay close attention to context. If you’re looking at Python code containing a regular expression, you have to think about what you see, what Python sees, and what the regular expression engine sees. A character may be special to Python but not to regular expressions, or vice versa.

This post goes through an example in detail that shows how to manage special characters in several different contexts.

Escaping special TeX characters

I recently needed to write a regular expression [1] to escape TeX special characters. I’m reading in text like ICD9_CODE and need to make that ICD9\_CODE so that TeX will understand the underscore to be a literal underscore, and a subscript instruction.

Underscore isn’t the only special character in TeX. It has ten special characters:

    \ { } $ & # ^ _ % ~

The two that people most commonly stumble over are probably $ and % because these are fairly common in ordinary prose. Since % begins a comment in TeX, importing a percent sign without escaping it will fail silently. The result is syntactically valid. It just effectively cuts off the remainder of the line.

So whenever my script sees a TeX special character that isn’t already escaped, I’d like it to escape it.

Raw strings

First I need to tell Python what the special characters are for TeX:

    special = r"\\{}$&#^_%~"

There’s something interesting going on here. Most of the characters that are special to TeX are not special to Python. But backslash is special to both. Backslash is also special to regular expressions. The r prefix in front of the quotes tells Python this is a “raw” string and that it should not interpret backslashes as special. It’s saying “I literally want a string that begins with two backslashes.”

Why two backslashes? Wouldn’t one do? We’re about to use this string inside a regular expression, and backslashes are special there too. More on that shortly.

Lookbehind

Here’s my regular expression:

    re.sub(r"(?<!\\)([" + special + "])", r"\\\1", line)

I want special characters that have not already been escaped, so I’m using a negative lookbehind pattern. Negative lookbehind expressions begin with (?<! and end with ). So if, for example, I wanted to look for the string “ball” but only if it’s not preceded by “charity” I could use the regular expression

    (?<!charity )ball

This expression would match “foot ball” or “foosball” but not “charity ball”.

Our lookbehind expression is complicated by the fact that the thing we’re looking back for is a special character. We’re looking for a backslash, which is a special character for regular expressions [2].

After looking behind for a backslash and making sure there isn’t one, we look for our special characters. The reason we used two backslashes in defining the variable special is so the regular expression engine would see two backslashes and interpret that as one literal backslash.

Captures

The second argument to re.sub tells it what to replace its match with. We put parentheses around the character class listing TeX special characters because we want to capture it to refer to later. Captures are referred to by position, so the first capture is \1, the second is \2, etc.

We want to tell re.sub to put a backslash in front of the first capture. Since backslashes are special to the regular expression engine, we send it \\ to represent a literal backslash. When we follow this with \1 for the first capture, the result is \\\1 as above.

Testing

We can test our code above on with the following.

    line = r"a_b $200 {x} %5 x\y"

and get

    a\_b \$200 \{x\} \%5 x\\y

which would cause TeX to produce output that looks like

a_b $200 {x} %5 x\y.

Note that we used a raw string for our test case. That was only necessary for the backslash near the end of the string. Without that we could have dropped the r in front of the opening quote.

P.S. on raw strings

Note that you don’t have to use raw strings. You could just escape your special characters with backslashes. But we’ve already got a lot of backslashes here. Without raw strings we’d need even more. Without raw strings we’d have to say

    special = "\\\\{}$&#^_%~"

starting with four backslashes to send Python two to send the regular expression engine one.

[1] Whenever I write about using regular expressions someone will complain that my solution isn’t completely general and that they can create input that will break my code. I understand that, but it works for me in my circumstances. I’m just writing scripts to get my work done, not claiming to have written hardened production software for anyone else to use.

[2] Keep context in mind. We have three languages in play: TeX, Python, and regular expressions. One of the keys to understanding regular expressions is to see them as a small language embedded inside other languages like Python. So whenever you hear a character is special, ask yourself “Special to whom?”. It’s especially confusing here because backslash is special to all three languages.

Why are regular expressions difficult?

Posted on 19 June 2019 by John

Regular expressions are challenging, but not for the reasons commonly given.

Non-reasons

Here are some reasons given for the difficulty of regular expressions that I don’t agree with.

Cryptic syntax

I think complaints about cryptic syntax miss the mark. Some people say that Greek is hard to learn because it uses a different alphabet. If that were the only difficulty, you could easily learn Greek in a couple days. No, Greek is difficult for English speakers to learn because it is a very different language than English. The differences go much deeper than the alphabet, and in fact that alphabets are not entirely different.

The basic symbol choices for regular expressions — . to match any character, ? to denote that something is optional, etc. — were arbitrary, but any choice would be. As it is, the chosen symbols are sometimes mnemonic, or at least consistent with notation conventions in other areas.

Density

Regular expressions are dense. This makes them hard to read, but not in proportion to the information they carry. Certainly 100 characters of regular expression syntax is harder to read than 100 consecutive characters of ordinary prose or 100 characters of C code. But if a typical pattern described by 100 characters of regular expression were expanded into English prose or C code, the result would be hard to read as well, not because it is dense but because it is verbose.

Crafting expressions

The heart of using regular expressions is looking at a pattern and crafting a regular expression to match that pattern. I don’t think this is difficult, especially when you have some error tolerance. Very often in applications of regular expressions it’s OK to have a few false positives, as long as a human scans the output. For example, see this post on looking for ICD-10 codes.

I suspect that many people who think that writing regular expressions is difficult actually find some peripheral issue difficult, not the core activity of describing patterns per se.

Reasons

Now for what I believe are reasons why regular expressions are.

Overloaded syntax and context

Regular expressions use a small set of symbols, and so some of these symbols to double duty. For example, symbols take on different meanings inside and outside of character classes. (See point #4 here.) Extensions to the basic syntax are worse. People wanting to add new features to regular expressions ‐ look-behind, comments, named matches, etc. — had to come up with syntax that wouldn’t conflict with earlier usage, which meant strange combinations of symbols that would have previously been illegal.

Dialects

If you use regular expressions in multiple programming languages, you’ll run into numerous slight variations. Can I write \d for a digit, or do I need to write [0-9]? If I want to group a subexpression, do I need to put a backslash in front of the parentheses? Can I write non-capturing groups?

These variations are difficult to remember, not because they’re completely different, but because they’re so similar. It reminds me of my French teacher saying “Does literature have a double t in English and one t in French, or the other way around? I can never remember.”

Use

It’s difficult to remember the variations on expression syntax in various programming languages, but I find it even more difficult to remember how to use the expressions. If you want to replace all instances of some regular expression with a string, the way to do that could be completely different in two languages, even if the languages use the exact same dialect of regular expressions.

Resources

Here are notes on regular expressions I’ve written over the years, largely for my own reference.

Regular expression for ICD-9 and ICD-10 codes

Posted on 5 May 2019 by John

Suppose you’re searching for medical diagnosis codes in the middle of free text. One way to go about this would be to search for each of the roughly 14,000 ICD-9 codes and each of the roughly 70,000 ICD-10 codes. A simpler approach would be to use regular expressions, though that may not be as precise.

In practice regular expressions may have some false positives or false negatives. The expressions given here have only false positives. That is, no valid ICD-9 or ICD-10 codes will go unmatched, but the regular expressions may match things that are not diagnosis codes. The latter is inevitable anyway since a string of characters could coincide with a diagnosis code but not be used as a diagnosis code. For example 1234 is a valid ICD-9 code, but 1234 in a document could refer to other things, such as a street address.

ICD-9 diagnosis code format

Most ICD-9 diagnosis codes are just numbers, but they may also start with E or V.

Numeric ICD-9 codes are at least three digits. Optionally there may be a decimal followed by one of two more digits.

An E code begins with E and three digits. These may be followed by a decimal and one more digit.

A V code begins with a V followed by two digits. These may be followed by a decimal and one or two more digits.

Sometimes the decimals are left out.

Here are regular expressions that summarize the discussion above.

    N = "\d{3}\.?\d{0,2}"
    E = "E\d{3}\.?\d?"
    V = "V\d{2}\.?\d{0,2}"
    icd9_regex = "|".join([N, E, V])

Usually E and V are capitalized, but they don’t have to be, so it would be best to do a case-insensitive match.

ICD-10 diagnosis code format

ICD-10 diagnosis codes always begin with a letter (except U) followed by a digit. The third character is usually a digit, but could be an A or B [1]. After the first three characters, there may be a decimal point, and up to three more alphanumeric characters. These alphanumeric characters are never U. Sometimes the decimal is left out.

So the following regular expression would match any ICD-10 diagnosis code.

    [A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4}

As with ICD-9 codes, the letters are usually capitalized, but not always, so it’s best to do a case-insensitive search. In addition to the pattern above, “special codes” may begin with U.

Testing the regular expressions

As mentioned at the beginning, the regular expressions here may have false positives. However, they don’t let any valid codes slip by. I downloaded lists of ICD-9 and ICD-10 codes from the CDC and tested to make sure the regular expressions here matched every code.

Regular expression features used

Character ranges are supported everywhere, such as [A-TV-Z] for the letters A through T and V through Z.

Not every regular expression implementation supports \d to represent a digit. In Emacs, for example, you would have to use[0-9] instead since it doesn’t support \d.

I’ve used \.? for an optional decimal point. (The . is a special character in regular expressions, so it needs to be escaped to represent a literal period.) Some people wold write [.]? instead on the grounds that it may be more readable. (Periods are not special characters in the context of a character classes.)

I’ve used {m} for a pattern that is repeated exactly m times, and {m,n} for a pattern that is repeated between m and n times. This is supported in Perl and Python, for example, but not everywhere. You could write \d\d\d instead of \d{3} and \d?\d? instead of \d{0,2}.

[1] The only ICD-10 codes with a non-digit in the third position are those beginning with C4A, C7A, C7B, D3A, M1A, O9A, and Z3A.

Perl as a better grep

Posted on 12 June 2018 by John

I like Perl’s pattern matching features more than Perl as a programming language. I’d like to take advantage of the former without having to go any deeper than necessary into the latter.

The book Minimal Perl is useful in this regard. It has chapters on Perl as a better grep, a better awk, a better sed, and a better find. While Perl is not easy to learn, it might be easier to learn a minimal subset of Perl than to learn each of the separate utilities it could potentially replace. I wrote about this a few years ago and have been thinking about it again recently.

Here I want to zoom in on Perl as a better grep. What’s the minimum Perl you need to know in order to use Perl to search files the way grep would?

By using Perl as your grep, you get to use Perl’s more extensive pattern matching features. Also, you get to use one regex syntax rather than wondering about the specifics of numerous regex dialects supported across various programs.

Let RE stand for a generic regular expression. To search a file foo.txt for lines containing the pattern RE, you could type

    perl -ln -e "/RE/ and print;" foo.txt

The Perl one-liner above requires more typing than using grep would, but you could wrap this code in a shell script if you’d like.

If you’d like to print lines that don’t match a regex, change the and to or:

    perl -ln -e "/RE/ or print;" foo.txt

By learning just a little Perl you can customize your search results. For example, if you’d like to just print the part of the line that matched the regex, not the entire line, you could modify the code above to

    perl -ln -e "/RE/ and print $&;" foo.txt

because $& is a special variable that holds the result of the latest match.

Update: If you’d like to use Perl regular expressions but you’d rather not write Perl code, you might like tcgrep. It uses Perl regular expressions but has an interface like grep.

Solution 1: textwrap.dedent

Solution 2: Regular expression with a flag

Solution 3: Regular expression with a modifier

More on regular expressions

Greek

Hebrew

Vowel points

Python

Related posts

More command line posts

Turning on ERE support

This doesn’t fix everything

Related

Escaping special TeX characters

Raw strings

Lookbehind

Captures

Testing

P.S. on raw strings

Related posts

Non-reasons

Cryptic syntax

Density

Crafting expressions

Reasons

Overloaded syntax and context

Dialects

Use

Resources

ICD-9 diagnosis code format

ICD-10 diagnosis code format

Testing the regular expressions

Regular expression features used

Related posts