ICD codes are diagnostic codes created by the WHO. (Three TLAs in just the opening paragraph!)
The latest version, ICD-11, went into effect in January of this year. A few countries are using ICD-11 now; it’s expected to be at least a couple years before the US moves from ICD-10 to ICD-11. (I still see ICD-9 data even though ICD-10 came out in 1994.)
One way that ICD-11 codes differ from ICD-10 codes is that the new codes do not use the letters I or O in order to prevent possible confusion with the digits 1 and 0. In the code below, “alphabetic” and “alphanumeric” implicitly exclude the letters I and O.
Another way the codes differ is that the second character in an ICD-10 code is a digit, whereas the second character in an ICD-11 code is a letter.
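To make these two differences concrete, here is a minimal sketch. The helper name `guess_icd_version` and the constants are hypothetical, introduced only for illustration. It looks at the second character alone, so it is a coarse guess at which scheme a code belongs to, not a validity check.

```python
import string

# Letters permitted in ICD-11 codes: the uppercase alphabet minus I and O.
ICD11_LETTERS = "".join(c for c in string.ascii_uppercase if c not in "IO")
DIGITS = "0123456789"

def guess_icd_version(code):
    """Guess the ICD version from the second character alone (hypothetical helper)."""
    if len(code) < 2:
        return None
    second = code[1]
    if second in DIGITS:
        return "ICD-10"   # second character is a digit
    if second in ICD11_LETTERS:
        return "ICD-11"   # second character is a letter other than I or O
    return None           # e.g. the letter I or O, or punctuation

print(ICD11_LETTERS)               # ABCDEFGHJKLMNPQRSTUVWXYZ
print(guess_icd_version("I10"))    # ICD-10
print(guess_icd_version("ND52"))   # ICD-11
print(guess_icd_version("AO12"))   # None
```

Note that the ICD-10 code I10 begins with the letter I, which an ICD-11 code would never contain.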
What follows is a heavily commented regular expression for matching ICD-11 codes, along with a few tests to show that the regex matches things it should and does not match things it should not.
Of course you could verify an ICD-11 code by searching against an exhaustive list of such codes, but the following is much simpler, at the cost of possibly matching some false positives. However, it is future-proof against false negatives: ICD-11 codes added in the future will conform to the pattern in the regular expression.
    import re

    icd11_re = re.compile(r"""
        ^                  # beginning of string
        [A-HJ-NP-Z0-9]     # alphanumeric
        [A-HJ-NP-Z]        # alphabetic
        [0-9]              # digit
        [A-HJ-NP-Z0-9]     # alphanumeric
        ((\.               # optional starting with .
        [A-HJ-NP-Z0-9])    # alphanumeric
        [A-HJ-NP-Z0-9]?)?  # optional further refinement
        $                  # end of string
        """, re.VERBOSE)

    good = [
        "ND52",     # fracture of arm, level unspecified
        "9D00.3",   # presbyopia
        "8B60.Y",   # other specified increased intracranial pressure
        "DB98.7Z",  # portal hypertension, unspecified
    ]

    bad = [
        "ABCD",      # third character must be digit
        "AB3D.",     # dot must be followed by alphanumeric
        "9D0O.3",    # letter 'O' should be number 0
        "DB9872",    # missing dot
        "AB3",       # too short
        "DB90.123",  # too long
    ]

    for g in good:
        assert icd11_re.match(g)

    for b in bad:
        assert icd11_re.match(b) is None
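The trade-off between the pattern check and an exhaustive lookup can be combined into a two-stage check, sketched below. The names `ICD11_SHAPE`, `KNOWN_CODES`, and `check_icd11` are hypothetical, and `KNOWN_CODES` is only a stand-in built from the four example codes above, not a real code list. The pattern is the same one written compactly, and the sketch uses `fullmatch` rather than `^...$`.

```python
import re

# Same shape as the verbose regex, written compactly.
ICD11_SHAPE = re.compile(
    r"[A-HJ-NP-Z0-9][A-HJ-NP-Z][0-9][A-HJ-NP-Z0-9](\.[A-HJ-NP-Z0-9]{1,2})?"
)

# Hypothetical stand-in for an official list of codes.
KNOWN_CODES = {"ND52", "9D00.3", "8B60.Y", "DB98.7Z"}

def check_icd11(code, known_codes=None):
    """Classify a string by shape, then optionally by list membership."""
    if ICD11_SHAPE.fullmatch(code) is None:
        return "malformed"
    if known_codes is None:
        return "well-formed"
    return "known" if code in known_codes else "well-formed but unlisted"

print(check_icd11("ND52"))                 # well-formed
print(check_icd11("ND52", KNOWN_CODES))    # known
print(check_icd11("AB3D.1", KNOWN_CODES))  # well-formed but unlisted
print(check_icd11("9D0O.3", KNOWN_CODES))  # malformed
print(check_icd11("ND52\n"))               # malformed
```

The last line shows one difference from the original: in Python, `$` also matches just before a trailing newline, so the `^...$` version accepts "ND52\n", whereas `fullmatch` rejects it. Which behavior you want depends on how the input has been cleaned.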