ICD codes are diagnostic codes created by the WHO. (Three TLAs in just the opening paragraph!)
The latest version, ICD-11, went into effect in January of this year. A few countries are using ICD-11 now; it’s expected to be at least a couple years before the US moves from ICD-10 to ICD-11. (I still see ICD-9 data even though ICD-10 came out in 1994.)
One way that ICD-11 codes differ from ICD-10 codes is that the new codes do not use the letters I or O in order to prevent possible confusion with the digits 1 and 0. In the code below, “alphabetic” and “alphanumeric” implicitly exclude the letters I and O.
Another way the codes differ is that the second character in an ICD-10 code is a digit, whereas the second character in an ICD-11 code is a letter.
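The two differences above are enough for a rough guess at which version a code belongs to. Here is a minimal sketch that looks only at the second character. The helper name `guess_icd_version` and the constant `ICD11_LETTERS` are made up for illustration, and the function does not validate the rest of the code.

```python
# Letters permitted in ICD-11 codes: A-Z without I and O.
ICD11_LETTERS = "ABCDEFGHJKLMNPQRSTUVWXYZ"
DIGITS = "0123456789"


def guess_icd_version(code: str) -> str:
    """Guess the ICD version from the second character alone.

    Hypothetical helper: it checks nothing beyond that one character.
    """
    if len(code) < 2:
        return "unknown"
    second = code[1]
    if second in DIGITS:
        return "ICD-10"   # ICD-10: second character is a digit
    if second in ICD11_LETTERS:
        return "ICD-11"   # ICD-11: second character is a letter (not I or O)
    return "unknown"


print(guess_icd_version("I10"))   # second character is a digit
print(guess_icd_version("ND52"))  # second character is a letter
```

This is only a first-pass filter; a string that passes it can still fail the full pattern.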
What follows is a heavily-commented regular expression for matching ICD-11 codes, along with a few tests to show that the regex matches things it should and does not match things it should not.
Of course you could verify an ICD-11 code by searching against an exhaustive list of such codes. The following is much simpler, at the cost of possible false positives: a string can have the right shape without being an assigned code. However, it is future-proof against false negatives: ICD-11 codes added in the future will conform to the pattern in the regular expression.
import re
icd11_re = re.compile(r"""
^ # beginning of string
[A-HJ-NP-Z0-9] # alphanumeric
[A-HJ-NP-Z] # alphabetic
[0-9] # digit
[A-HJ-NP-Z0-9] # alphanumeric
((\. # optional starting with .
[A-HJ-NP-Z0-9]) # alphanumeric
[A-HJ-NP-Z0-9]?)? # optional further refinement
$ # end of string
""", re.VERBOSE)
good = [
"ND52", # fracture of arm, level unspecified
"9D00.3", # presbyopia
"8B60.Y", # other specified increased intracranial pressure
"DB98.7Z" # portal hypertension, unspecified
]
bad = [
"ABCD", # third character must be digit
"AB3D.", # dot must be followed by alphanumeric
"9D0O.3", # letter 'O' should be number 0
"DB9872", # missing dot
"AB3", # too short
"DB90.123" # too long
]
for g in good:
    assert icd11_re.match(g)
for b in bad:
    assert icd11_re.match(b) is None
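One caveat: in Python, `$` also matches just before a trailing newline, so the pattern above accepts "ND52\n". If that matters for your data, here is a minimal sketch of a stricter check using `re.fullmatch`. The names `ALNUM`, `ALPHA`, `ICD11_PATTERN`, and `is_icd11` are my own, not part of any library.

```python
import re

# Character classes that exclude the letters I and O.
ALNUM = "[A-HJ-NP-Z0-9]"
ALPHA = "[A-HJ-NP-Z]"

# Same shape as the verbose regex, written compactly and without anchors;
# fullmatch requires the entire string to match.
ICD11_PATTERN = re.compile(rf"{ALNUM}{ALPHA}[0-9]{ALNUM}(\.{ALNUM}{ALNUM}?)?")


def is_icd11(code: str) -> bool:
    """Return True if the whole string has the shape of an ICD-11 code."""
    return ICD11_PATTERN.fullmatch(code) is not None


print(is_icd11("ND52"))    # well-formed code
print(is_icd11("ND52\n"))  # trailing newline is rejected by fullmatch
```

Replacing `$` with `\Z` in the original expression would have the same effect.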