Regular expression for ICD-9 and ICD-10 codes

Posted on 5 May 2019 by John

Suppose you’re searching for medical diagnosis codes in the middle of free text. One way to go about this would be to search for each of the roughly 14,000 ICD-9 codes and each of the roughly 70,000 ICD-10 codes. A simpler approach would be to use regular expressions, though that may not be as precise.

In practice regular expressions may have some false positives or false negatives. The expressions given here have only false positives. That is, no valid ICD-9 or ICD-10 codes will go unmatched, but the regular expressions may match things that are not diagnosis codes. The latter is inevitable anyway since a string of characters could coincide with a diagnosis code but not be used as a diagnosis code. For example 1234 is a valid ICD-9 code, but 1234 in a document could refer to other things, such as a street address.

ICD-9 diagnosis code format

Most ICD-9 diagnosis codes are just numbers, but they may also start with E or V.

Numeric ICD-9 codes are at least three digits. Optionally there may be a decimal followed by one or two more digits.

An E code begins with E and three digits. These may be followed by a decimal and one more digit.

A V code begins with a V followed by two digits. These may be followed by a decimal and one or two more digits.

Sometimes the decimals are left out.

Here are regular expressions that summarize the discussion above.

    # Raw strings keep Python from interpreting the backslashes.
    N = r"\d{3}\.?\d{0,2}"
    E = r"E\d{3}\.?\d?"
    V = r"V\d{2}\.?\d{0,2}"
    icd9_regex = "|".join([N, E, V])

Usually E and V are capitalized, but they don’t have to be, so it would be best to do a case-insensitive match.
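Here’s a minimal sketch of how the ICD-9 patterns above might be applied to free text with a case-insensitive search. The function name and the sample note are made up for illustration.

```python
import re

# Patterns from the discussion above, written as raw strings.
N = r"\d{3}\.?\d{0,2}"
E = r"E\d{3}\.?\d?"
V = r"V\d{2}\.?\d{0,2}"
ICD9_RE = re.compile("|".join([N, E, V]), re.IGNORECASE)

def find_icd9_candidates(text):
    """Return every substring of text that looks like an ICD-9 code."""
    return [m.group(0) for m in ICD9_RE.finditer(text)]

# Made-up note, for illustration only.
note = "Dx 250.00 with history V10.2; injury coded e812.0 at 1234 Main St"
candidates = find_icd9_candidates(note)
print(candidates)
```

The street number 1234 shows up among the candidates, which is exactly the kind of false positive described at the top of the post.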

ICD-10 diagnosis code format

ICD-10 diagnosis codes always begin with a letter (except U) followed by a digit. The third character is usually a digit, but could be an A or B [1]. After the first three characters, there may be a decimal point, and up to four more alphanumeric characters. These alphanumeric characters are never U. Sometimes the decimal is left out.

So the following regular expression would match any ICD-10 diagnosis code.

    [A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4}

As with ICD-9 codes, the letters are usually capitalized, but not always, so it’s best to do a case-insensitive search. In addition to the pattern above, “special codes” may begin with U.
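As a rough sketch, the ICD-10 pattern could be wrapped in a small helper that checks whether an entire string has the right shape. The helper name and the sample strings are my own; note that strings starting with U are rejected by this pattern, so the special codes would need separate handling.

```python
import re

# The ICD-10 pattern from above, compiled for case-insensitive use.
ICD10_RE = re.compile(r"[A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4}", re.IGNORECASE)

def looks_like_icd10(code):
    """True if the whole string fits the ICD-10 pattern."""
    return ICD10_RE.fullmatch(code) is not None

# Sample strings chosen to exercise the pattern, not taken from a code list.
samples = ["E11.9", "e119", "S72.001A", "C7A.0", "U07.1", "12345"]
results = {s: looks_like_icd10(s) for s in samples}
print(results)
```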

Testing the regular expressions

As mentioned at the beginning, the regular expressions here may have false positives. However, they don’t let any valid codes slip by. I downloaded lists of ICD-9 and ICD-10 codes from the CDC and tested to make sure the regular expressions here matched every code.
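A test along these lines could be sketched as follows. This is not the script I ran; it assumes the codes have already been read into Python lists, one code per entry, and the short lists here are stand-ins for the downloaded files.

```python
import re

ICD9_RE = re.compile(r"\d{3}\.?\d{0,2}|E\d{3}\.?\d?|V\d{2}\.?\d{0,2}", re.IGNORECASE)
ICD10_RE = re.compile(r"[A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4}", re.IGNORECASE)

def unmatched(pattern, codes):
    """Return the codes that the pattern fails to match in full."""
    return [c for c in codes if pattern.fullmatch(c.strip()) is None]

# Stand-in lists; a real test would load every code from the downloaded files.
icd9_sample = ["25000", "4280", "E8120", "V1021"]
icd10_sample = ["E119", "S72001A", "C7A0", "Z3A38"]

missed9 = unmatched(ICD9_RE, icd9_sample)
missed10 = unmatched(ICD10_RE, icd10_sample)
print(missed9, missed10)  # empty lists mean no code slipped by
```

Using fullmatch rather than search matters here: search would report success if any part of a code matched, which would hide codes that are longer than the pattern allows.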

Regular expression features used

Character ranges are supported everywhere, such as [A-TV-Z] for the letters A through T and V through Z.

Not every regular expression implementation supports \d to represent a digit. In Emacs, for example, you would have to use [0-9] instead since it doesn’t support \d.

I’ve used \.? for an optional decimal point. (The . is a special character in regular expressions, so it needs to be escaped to represent a literal period.) Some people would write [.]? instead on the grounds that it may be more readable. (Periods are not special characters in the context of a character class.)

I’ve used {m} for a pattern that is repeated exactly m times, and {m,n} for a pattern that is repeated between m and n times. This is supported in Perl and Python, for example, but not everywhere. You could write \d\d\d instead of \d{3} and \d?\d? instead of \d{0,2}.
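As a quick sanity check, here’s a sketch comparing the compact numeric ICD-9 pattern with a spelled-out version that avoids \d, \. and the {m,n} syntax. The test strings are arbitrary.

```python
import re

compact = re.compile(r"\d{3}\.?\d{0,2}")
# Same pattern using only character ranges, [.] and ? for repetition.
expanded = re.compile(r"[0-9][0-9][0-9][.]?[0-9]?[0-9]?")

samples = ["123", "123.4", "12345", "123.45", "12", "123.456"]

# Both forms should accept or reject each ASCII sample together.
# (In Python 3, \d also matches non-ASCII digits unless re.ASCII is set.)
agree = all(
    (compact.fullmatch(s) is None) == (expanded.fullmatch(s) is None)
    for s in samples
)
accepted = [s for s in samples if expanded.fullmatch(s)]
print(agree, accepted)
```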

[1] The only ICD-10 codes with a non-digit in the third position are those beginning with C4A, C7A, C7B, D3A, M1A, O9A, and Z3A.

Perl as a better grep

Posted on 12 June 2018 by John

I like Perl’s pattern matching features more than Perl as a programming language. I’d like to take advantage of the former without having to go any deeper than necessary into the latter.

The book Minimal Perl is useful in this regard. It has chapters on Perl as a better grep, a better awk, a better sed, and a better find. While Perl is not easy to learn, it might be easier to learn a minimal subset of Perl than to learn each of the separate utilities it could potentially replace. I wrote about this a few years ago and have been thinking about it again recently.

Here I want to zoom in on Perl as a better grep. What’s the minimum Perl you need to know in order to use Perl to search files the way grep would?

By using Perl as your grep, you get to use Perl’s more extensive pattern matching features. Also, you get to use one regex syntax rather than wondering about the specifics of numerous regex dialects supported across various programs.

Let RE stand for a generic regular expression. To search a file foo.txt for lines containing the pattern RE, you could type

    perl -ln -e "/RE/ and print;" foo.txt

The Perl one-liner above requires more typing than using grep would, but you could wrap this code in a shell script if you’d like.

If you’d like to print lines that don’t match a regex, change the and to or:

    perl -ln -e "/RE/ or print;" foo.txt

By learning just a little Perl you can customize your search results. For example, if you’d like to just print the part of the line that matched the regex, not the entire line, you could modify the code above to

    perl -ln -e "/RE/ and print $&;" foo.txt

because $& is a special variable that holds the result of the latest match.

Update: If you’d like to use Perl regular expressions but you’d rather not write Perl code, you might like tcgrep. It uses Perl regular expressions but has an interface like grep.

Emacs features that use regular expressions

Posted on 27 January 2018 by John

The syntax of regular expressions in Emacs is a little disappointing, but the ways you can use regular expressions in Emacs are impressive.

I’ve written before about the syntax of Emacs regular expressions. It’s a pretty conservative subset of the features you may be used to from other environments.

But there are many, many ways to use regular expressions in Emacs. I did a quick search and found that about 15% of the pages in the massive Emacs manual contain at least one reference to regular expressions. Exhaustively listing the uses of regular expressions would not be practical or very interesting. Instead, I’ll highlight a few uses that I find helpful.

Searching and replacing

One of the most frequently used features in Emacs is incremental search. You can search forward or backward for a string, searching as you type, with the commands C-s (isearch-forward) and C-r (isearch-backward). The regular expression counterparts of these commands are C-M-s (isearch-forward-regexp) and C-M-r (isearch-backward-regexp).

Note that the regular expression commands add the Alt (meta) key to their string counterparts. Also, note that Emacs consistently refers to regular expressions as regexp and never, as far as I know, as regex. (Emacs relies heavily on conventions like this to keep the code base manageable.)

A common task in any editor is to search and replace text. In Emacs you can replace all occurrences of a regular expression with replace-regexp or interactively choose which instances to replace with query-replace-regexp.

Purging lines

You can delete all lines in a file that contain a given regular expression with flush-lines. You can also invert this command, specifying which lines not to delete with keep-lines.

Aligning code

One lesser-known but handy feature is align-regexp. This command will insert white space as needed so that all instances of a regular expression in a region align vertically. For example, if you have a sequence of assignment statements in a programming language you could have all the equal signs line up by using align-regexp with the regular expression consisting simply of an equal sign. Of course you could also align based on a much more complex pattern.

Although I imagine this feature is primarily used when editing source code, you could also use it in other contexts such as aligning poetry or ASCII art diagrams.

Directory editing

The Emacs directory editor dired is something like the Windows File Explorer or the OSX Finder, but text-based. dired has many features that use regular expressions. Here are a few of the more common ones.

You can mark files based on the file names with % m (dired-mark-files-regexp) or based on the contents of the files with % g (dired-mark-files-containing-regexp). You can also mark files for deletion with % d (dired-flag-files-regexp).

Inside dired you can search across a specified set of files by typing A (dired-do-find-regexp), and you can interactively search and replace across a set of files by typing Q (dired-do-find-regexp-and-replace).

Miscellaneous

The help apropos command (C-h a) can take a string or a regular expression.

The command to list available faces (list-faces-display) can take a string or regular expression.

Interactive highlighting commands (highlight-regexp, unhighlight-regexp, highlight-lines-matching-regexp) take a regular expression argument.

You can use a regular expression to specify which buffers to close with kill-matching-buffers.

Maybe the largest class of uses for regular expressions in Emacs is configuration. Many customizations in Emacs, such as giving Emacs hints to determine the right editing mode for a file or how to recognize comments in different languages, use regular expressions as arguments.

Resources

You can find more posts on regular expressions and on Emacs by going to my technical notes page. Note that the outline at the top has links for regular expressions and for Emacs.

For daily tips on regular expressions or Unix-native tools like Emacs, follow @RegexTip and @UnixToolTip on Twitter.

Searching files on Windows

Posted on 27 February 2016 by John

Searching files on Windows is a pain. The built-in search features don’t find everything. There may be ways to make them work, but I haven’t persisted long enough to find them.

On Linux, the combination of find, xargs, and grep works well, and sometimes it works on Windows using the GOW or GnuWin port of these tools. Again there may be a way to make the ported utilities work more as expected, though I haven’t found it. I suspect the problem isn’t with the tools per se but their interaction with the command line. I also tried Emacs features like rgrep, but these features use the ported find and grep utilities, so you run into the same problems with Emacs as you do running them directly, and more.

It looks like ack is the way to go. I heard about it a long time ago and kept meaning to try it out. Now I finally did. It’s fast, convenient, etc. But here are the two things I most like about it:

Ack works the same across platforms.
Ack uses Perl regular expression syntax.

While the alternatives above are supposed to work the same across platforms, they don’t in my experience. But ack does because it’s a pure Perl program. All the portability has been delegated to Perl, where it is well handled. I imagine once I become more familiar with ack I’ll prefer it on Linux as well.

Because it’s a Perl program, ack uses Perl regex syntax. Perl has the most powerful regex implementation out there, though I seldom need any features unique to Perl. More important for me is that the Perl regular expression dialect is the one I remember most easily.

Regular expression to match any chemical element

Posted on 4 February 2016 by John

Here’s a frivolous exercise in regular expressions: Write a regex to match any chemical element symbol.

Here’s one solution.

A[cglmrstu]|B[aehikr]?|C[adeflmnorsu]?|D[bsy]|E[rsu]|F[elmr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airuv]|M[dgnot]|N[abdeiop]?|Os?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilm]|U(u[opst])?|V|W|Xe|Yb?|Z[nr]

Update: When this post was written, elements 113, 115, 117, and 118 had placeholder names. For example, 115 was “ununpentium.” Now these elements are nihonium, moscovium, tennessine, and oganesson. The updated regular expression is the following.

A[cglmrstu]|B[aehikr]?|C[adeflmnorsu]?|D[bsy]|E[rsu]|F[elmr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airuv]|M[cdgnot]|N[abdehiop]?|O[gs]?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilms]|U|V|W|Xe|Yb?|Z[nr]

Making it more readable

Here’s the original expression in more readable form:

/
A[cglmrstu]     | 
B[aehikr]?      | 
C[adeflmnorsu]? | 
D[bsy]          | 
E[rsu]          | 
F[elmr]?        | 
G[ade]          | 
H[efgos]?       | 
I[nr]?          | 
Kr?             | 
L[airuv]        | 
M[dgnot]        | 
N[abdeiop]?     | 
Os?             | 
P[abdmortu]?    | 
R[abefghnu]     | 
S[bcegimnr]?    | 
T[abcehilm]     | 
U(u[opst])?     | 
V               | 
W               | 
Xe              | 
Yb?             | 
Z[nr]
/x

The /x option in Perl says to ignore white space. Other regular expression implementations have something similar. Python has two such options, X for similarity with Perl, and VERBOSE for readability. Both have the same behavior.

Here’s the updated expression in more readable form:

/
A[cglmrstu]     | 
B[aehikr]?      | 
C[adeflmnorsu]? | 
D[bsy]          | 
E[rsu]          | 
F[elmr]?        | 
G[ade]          | 
H[efgos]?       | 
I[nr]?          | 
Kr?             | 
L[airuv]        | 
M[cdgnot]       | 
N[abdehiop]?    | 
O[gs]?          | 
P[abdmortu]?    | 
R[abefghnu]     | 
S[bcegimnr]?    | 
T[abcehilms]    | 
U               | 
V               | 
W               | 
Xe              | 
Yb?             | 
Z[nr]
/x
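For comparison, here’s a sketch of the updated expression in Python, laid out the same way and compiled with re.VERBOSE. The variable name is my own.

```python
import re

# Updated element regex, one leading letter per line; whitespace is ignored.
ELEMENT_RE = re.compile(r"""
    A[cglmrstu]     |
    B[aehikr]?      |
    C[adeflmnorsu]? |
    D[bsy]          |
    E[rsu]          |
    F[elmr]?        |
    G[ade]          |
    H[efgos]?       |
    I[nr]?          |
    Kr?             |
    L[airuv]        |
    M[cdgnot]       |
    N[abdehiop]?    |
    O[gs]?          |
    P[abdmortu]?    |
    R[abefghnu]     |
    S[bcegimnr]?    |
    T[abcehilms]    |
    U               |
    V               |
    W               |
    Xe              |
    Yb?             |
    Z[nr]
""", re.VERBOSE)

new_names = ["Nh", "Mc", "Ts", "Og"]
old_placeholders = ["Uut", "Uup", "Uus", "Uuo"]
print([bool(ELEMENT_RE.fullmatch(s)) for s in new_names])
print([bool(ELEMENT_RE.fullmatch(s)) for s in old_placeholders])
```

The new symbols match in full and the old placeholder symbols no longer do.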

Regex syntax

The regular expression says that a chemical element symbol may start with A, followed by c, g, l, m, r, s, t, or u; or a B, optionally followed by a, e, h, i, k, or r; or …

The most complicated part of the regex was the part for symbols starting with U. There’s uranium whose symbol is simply U, and there are the elements that had temporary names based on their atomic numbers: ununtrium, ununpentium, ununseptium, and ununoctium. These are just Latin for one-one-three, one-one-five, one-one-seven, and one-one-eight. The symbols were U, Uut, Uup, Uus, and Uuo. The regex U(u[opst])? can be read “U, optionally followed by u and one of o, p, s, or t.” Now that the temporary names are gone, uranium is the only element whose abbreviation starts with U.

Note that the regex will match any string that contains a chemical element symbol, but it could match more. For example, it would match “I’ve never been to Boston in the fall” because that string contains B, the symbol for boron. Exercise: Modify the regex to only match chemical element symbols.
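One possible approach to the exercise, sketched in Python: wrap the alternatives in a non-capturing group and anchor both ends. The group matters because alternation has the lowest precedence, so anchors placed directly around the expression would bind only to the first and last alternatives. The names LOOSE and STRICT are my own.

```python
import re

SYMBOLS = (
    r"A[cglmrstu]|B[aehikr]?|C[adeflmnorsu]?|D[bsy]|E[rsu]|F[elmr]?|G[ade]|"
    r"H[efgos]?|I[nr]?|Kr?|L[airuv]|M[cdgnot]|N[abdehiop]?|O[gs]?|"
    r"P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilms]|U|V|W|Xe|Yb?|Z[nr]"
)

LOOSE = re.compile(SYMBOLS)                        # matches inside longer text
STRICT = re.compile(r"\A(?:" + SYMBOLS + r")\Z")   # the whole string must be a symbol

loose_hit = LOOSE.search("Boston")
strict_hit = STRICT.search("Boston")
print(loose_hit.group(0), strict_hit)
```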

Regex golf

There may be clever ways to use fewer characters at the cost of being more obfuscated. But this is for fun anyway, so we’ll indulge in a little regex golf.

There are five elements whose symbols start with I or Z: I, In, Ir, Zn, and Zr. You could write [IZ][nr] to match four of these. The regex I|[IZ][nr] would represent all five with 10 characters, while I[nr]?|Z[nr] uses 12. Two characters saved! Can you cut out any more?
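Here’s a small sketch that checks the golfed version against the longer one: both should accept exactly the five symbols and nothing nearby. The list of non-symbols is arbitrary.

```python
import re

long_form = "I[nr]?|Z[nr]"
short_form = "I|[IZ][nr]"

symbols = ["I", "In", "Ir", "Zn", "Zr"]
non_symbols = ["Z", "Ii", "Zz", "N"]

def accepts(pattern, s):
    """True if the pattern matches the whole string."""
    return re.fullmatch(pattern, s) is not None

same_language = all(
    accepts(long_form, s) == accepts(short_form, s)
    for s in symbols + non_symbols
)
print(len(long_form), len(short_form), same_language)
```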

Regex resources

Notes on regular expressions in Python, PowerShell, R, Mathematica, and C++

Graphemes

Posted on 1 March 2015 by John

Here’s something amusing I ran across in the glossary of Programming Perl:

grapheme A graphene is an allotrope of carbon arranged in a hexagonal crystal lattice one atom thick. Grapheme, or more fully, a grapheme cluster string is a single user-visible character, which in turn may be several characters (codepoints) long. For example … a “ȫ” is a single grapheme but one, two, or even three characters, depending on normalization.

In case the character ȫ doesn’t display correctly for you, here it is:

[Image: Unicode character U+022B]

First, graphene has little to do with grapheme, but it’s geeky fun to include it anyway. (Both are related to writing. A grapheme has to do with how characters are written, and the word graphene comes from graphite, the “lead” in pencils. The origin of grapheme has nothing to do with graphene but was an analogy to phoneme.)

Second, the example shows how complicated the details of Unicode can get. The Perl code below expands on the details of the comment about ways to represent ȫ.

This demonstrates that the character . in regular expressions matches any single character, but \X matches any single grapheme. (Well, almost. The character . usually matches any character except a newline, though this can be modified via optional switches. But \X matches any grapheme including newline characters.)

   
use feature 'say';                     # enable say
binmode(STDOUT, ':encoding(UTF-8)');   # print the characters as UTF-8

# U+022B, o with diaeresis and macron
my $a = "\x{22B}";

# U+00F6 U+0304, (o with diaeresis) + macron
my $b = "\x{F6}\x{304}";

# o U+0308 U+0304, o + diaeresis + macron
my $c = "o\x{308}\x{304}";

my @versions = ($a, $b, $c);

# All versions display the same.
say @versions;

# The versions have length 1, 2, and 3.
# Only $a contains one character and so matches .
say map {length $_ if /^.$/} @versions;

# All versions consist of one grapheme.
say map {length $_ if /^\X$/} @versions;
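For readers more at home in Python, here’s a rough parallel using only the standard library. Python’s built-in re module has no \X, so this sketch uses Unicode normalization to show that the three strings are the same grapheme; the third-party regex module does support \X if you need it.

```python
import re
import unicodedata

a = "\u022b"          # o with diaeresis and macron, one code point
b = "\u00f6\u0304"    # (o with diaeresis) + combining macron
c = "o\u0308\u0304"   # o + combining diaeresis + combining macron
versions = [a, b, c]

# Lengths are counted in code points, not graphemes.
lengths = [len(s) for s in versions]

# Only the one-code-point version is matched in full by a single dot.
single_char = [re.fullmatch(r".", s) is not None for s in versions]

# Normalizing to NFC or NFD brings all three to a common form.
nfc = [unicodedata.normalize("NFC", s) for s in versions]
nfd = [unicodedata.normalize("NFD", s) for s in versions]
print(lengths, single_char)
```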

Perl regex twitter account

Posted on 10 February 2015 by John

I’ve started a new Twitter account @PerlRegex for Perl regular expressions. My original account, @RegexTip, is for regular expressions in general and doesn’t go into much detail regarding any particular implementation. @PerlRegex goes into the specifics of regular expressions in Perl.

Why specifically Perl regular expressions? Because Perl has the most powerful support for regular expressions (strictly speaking, “pattern matching.”) Other languages offer “Perl compatible” regular expressions, though the degree of compatibility varies and is always less than complete.

I imagine more people have ruled England than have mastered the whole of the Perl language. But it’s possible to use Perl for regular expression processing without learning too much of the wider language.

Update: I’ve stopped posting to this account. Here’s a list of my current accounts.

Regular expression resources

Posted on 14 January 2015 by John

Continuing the series of resource posts each Wednesday, this week we have notes on regular expressions:

Last week: Probability resources

Next week: Numerical computing resources

Look-behind regex

Posted on 1 May 2014 by John

Look-behind is one of those advanced/obscure regular expression features that I don’t use frequently enough to remember the syntax, but just frequently enough that I wish I could remember it.

Look-behind can be positive or negative. Look-behind says “match this position only if the preceding text matches (does not match) the following pattern.”

The syntax in Perl and similar regular expression implementations is (?<= … ) for positive look-behind and (?<! … ) for negative look-behind. For the longest time I couldn’t remember whether the next symbol after ? was the direction (i.e. < for behind) or the polarity (= for positive, ! for negative). I was more likely to guess wrong unless I’d used the syntax recently.

The reason I was tempted to get these wrong is that I thought “positive look-behind” and “negative look-behind.” That’s how these patterns are described. But this means the words and symbols come in a different order. If you think look-behind positive and look-behind negative then the words and the symbols come in the same order:

look	`(?`
behind	`<`
positive	`=`
negative	`!`

Maybe this syntax comes more naturally to people who speak French and other languages where adjectives follow the thing they describe. English word order was tripping me up.

By the way, the syntax for look-ahead patterns is simpler: just leave out the <. The default direction for look-around patterns is forward. You don’t have to remember whether the symbol for direction or polarity comes first because there is no symbol for direction.
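Python uses the same look-around syntax as Perl, so here’s a short sketch of all four forms in Python. The sample sentence and variable names are made up.

```python
import re

text = "cost $25 and 30 units"

# Positive look-behind: digits immediately preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", text)

# Negative look-behind: whole numbers not preceded by a dollar sign.
not_after_dollar = re.findall(r"(?<!\$)\b\d+", text)

# Positive look-ahead: digits immediately followed by " units".
before_units = re.findall(r"\d+(?= units)", text)

# Negative look-ahead: whole numbers not followed by " units".
not_before_units = re.findall(r"\b\d+\b(?! units)", text)

print(after_dollar, not_after_dollar, before_units, not_before_units)
```

In each case the look-around text is checked but is not part of the match, so the dollar sign and the word “units” never appear in the results.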

Can regular expressions parse HTML or not?

Posted on 21 February 2013 by John

Can regular expressions parse HTML? There are several answers to that question, both theoretical and practical.

First, let’s look at theoretical answers.

When programmers first learn about regular expressions, they often try to use them on HTML. Then someone wise will tell them “You can’t do that. There’s a computer science theorem that says regular expressions are not powerful enough.” And that’s true, if you stick to the original meaning of “regular expression.”

But if you interpret “regular expression” the way it is commonly used today, then regular expressions can indeed parse HTML. This post [Update: link went away] by Nikita Popov explains that what programmers commonly call regular expressions, such as PCRE (Perl compatible regular expressions), can match context-free languages.

Well-formed HTML is context-free. So you can match it using regular expressions, contrary to popular opinion.

So according to computer science theory, can regular expressions parse HTML? Not by the original meaning of regular expression, but yes, PCRE can.

Now on to the practical answers. The next lines in Nikita Popov’s post say

But don’t forget two things: Firstly, most HTML you see in the wild is not well-formed (usually not even close to it). And secondly, just because you can, doesn’t mean that you should.

HTML in the wild can be rather wild. On the other hand, it can also be simpler than the HTML grammar allows. In practice, you may be able to parse a particular bit of HTML with regular expressions, even old fashioned regular expressions. It depends entirely on context, your particular piece of (possibly malformed) HTML and what you’re trying to do with it. I’m not advocating regular expressions for HTML parsing, just saying that the question of whether they work is complicated.
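To make that concrete, here’s a small sketch comparing a simple regex with Python’s standard html.parser on the task of pulling link targets out of a fragment. The regex, the class name, and both HTML fragments are invented for illustration; the point is only that the regex does fine on tidy markup and misses links in sloppier markup.

```python
import re
from html.parser import HTMLParser

# Naive pattern: assumes a lowercase-or-uppercase <a ...> tag with a double-quoted href.
HREF_RE = re.compile(r'<a\s+[^>]*?href="([^"]*)"', re.IGNORECASE)

class LinkCollector(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

def links_by_parser(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links

tidy = '<p><a href="a.html">A</a> and <a class="x" href="b.html">B</a></p>'
messy = "<p><A HREF='c.html'>C</a> <a href=d.html>D</p>"

print(HREF_RE.findall(tidy), links_by_parser(tidy))
print(HREF_RE.findall(messy), links_by_parser(messy))
```

You could patch the regex to handle single quotes and unquoted values, and for a known set of pages that might be good enough. Whether it is depends on the context, as discussed above.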

This opens up an interesting line of inquiry. Instead of asking whether strict regular expressions can parse strict HTML, you could ask what is the probability that a regular expression will succeed at a particular task for an HTML file in the wild. If you define “HTML” as actual web pages rather than files conforming to a particular grammar, every technique will fail with some probability. The question is whether that probability is acceptable in context, whether using regular expressions or any other technique.

Related post: Coming full circle

Regular expressions