A VIN (vehicle identification number) is a string of 17 characters that uniquely identifies a car or motorcycle. These numbers are used around the world and have three standardized formats: one for North America, one for the EU, and one for the rest of the world.

## Letters that resemble digits

The characters used in a VIN are digits and capital letters. The letters I, O, and Q are not used, to avoid confusion with the numerals 1, 0, and 9. So if you’re not sure whether a character is a digit or a letter, it’s probably a digit.

It would have been better to exclude S than Q. A lower case q looks sorta like a 9, but VINs use capital letters, and an S looks like a 5.

## Check sum

The various parts of a VIN have particular meanings, as documented in the Wikipedia article on VINs. I want to focus on just the check sum, a character whose purpose is to help detect errors in the other characters.

Of the three standards for VINs, only the North American one requires a check sum. The check sum is in the middle of the VIN, the 9th character.

## Algorithm

The scheme for computing the check sum is both complicated and weak. The end result is either a digit or an X. There are 33 possibilities for each character (10 digits + 23 letters) and 11 possibilities for a check sum, so the check sum cannot possibly detect all changes to even a single character.

The check sum is computed by first converting all letters to digits, computing a weighted sum of the 17 digits, and taking the remainder by 11. A remainder of 10 is represented by the letter X. The weights for the 17 characters are

8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2

I don’t see any reason for these weights other than that adjacent weights are different, which is enough to detect transposition of consecutive characters (once letters have been converted to digits). Maybe the process was deliberately complicated in an attempt to provide a little security by obscurity.
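Here is a minimal Python sketch of the scheme described above. The letter-to-digit table is the standard transliteration, and a remainder of 10 maps to X.

```python
# Sketch of the North American VIN check digit algorithm.
# Letters I, O, and Q never appear in a VIN.

WEIGHTS = [8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2]

# Standard letter-to-digit transliteration. Note the jumps after
# H (I skipped), N (O skipped), and P (Q skipped).
LETTER_VALUES = {
    "A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "H": 8,
    "J": 1, "K": 2, "L": 3, "M": 4, "N": 5, "P": 7, "R": 9,
    "S": 2, "T": 3, "U": 4, "V": 5, "W": 6, "X": 7, "Y": 8, "Z": 9,
}

def check_digit(vin: str) -> str:
    """Compute the check digit (9th character) of a 17-character VIN."""
    values = [int(c) if c.isdigit() else LETTER_VALUES[c] for c in vin]
    total = sum(v * w for v, w in zip(values, WEIGHTS))
    r = total % 11
    return "X" if r == 10 else str(r)
```

For example, `check_digit("1M8GDM9AXKP042788")` returns `"X"`, matching the check digit already in position 9 of that VIN. The weakness is easy to see here too: replacing the A with a J leaves the check digit unchanged, since both letters transliterate to 1.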

## Historical quirk

There’s an interesting historical quirk in how the letters are converted to digits: each letter is mapped to a digit in a somewhat complicated way. More on that here. It has to do with gaps in the **EBCDIC** values of letters. Letters are not mapped to their EBCDIC values per se, but there are jumps that are explained by corresponding jumps in EBCDIC.
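The connection can be checked directly: in EBCDIC the letters sit at A–I = 0xC1–0xC9, J–R = 0xD1–0xD9, S–Z = 0xE2–0xE9, and each letter’s VIN value is just the low four bits of its code point. Python ships cp500, an EBCDIC code page, so this is easy to verify:

```python
# Each VIN letter value equals the low nibble of its EBCDIC code point.
# cp500 is an EBCDIC code page in Python's standard library.
VIN_VALUES = {
    "A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "H": 8,
    "J": 1, "K": 2, "L": 3, "M": 4, "N": 5, "P": 7, "R": 9,
    "S": 2, "T": 3, "U": 4, "V": 5, "W": 6, "X": 7, "Y": 8, "Z": 9,
}

for letter, value in VIN_VALUES.items():
    ebcdic = letter.encode("cp500")[0]   # e.g. "A" -> 0xC1, "S" -> 0xE2
    assert value == ebcdic & 0x0F
```

The jump from R (0xD9, value 9) to S (0xE2, value 2) is exactly the gap in the EBCDIC layout.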

EBCDIC?! Why not ASCII? Both EBCDIC and ASCII go back to 1963. VINs date back to 1954 in the US but were standardized in 1981. Presumably the check sum algorithm using EBCDIC digits became a de facto standard before ASCII was ubiquitous.

## A better check sum

Any error detection scheme that uses 11 possible check sum values to detect changes in characters drawn from a 33-character alphabet is necessarily weak.

A much better approach would be to use a slight variation on the check sum algorithm Douglas Crockford recommended for base 32 encoding described here. Crockford says to take a string of characters from an alphabet of 32 characters, interpret it as a base 32 number, and take the remainder by 37 as the check sum. The same algorithm would work for an alphabet of 33 characters. All that matters is that the number of possible characters is less than 37.

Since the check sum is a number between 0 and 36 inclusive, you need 37 characters to represent it. Crockford recommended the symbols *, ~, $, =, and U as extra symbols in his base 32 system. His system didn’t use the letter U, but VINs do. We only need four more characters, though, so we could use *, ~, $, and =.

The drawback to this system is that it requires four new symbols. The advantage is that any change to a single character would be detected, as would any transposition of adjacent characters. This is proved here.
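The Crockford-style scheme above can be sketched in a few lines. The ordering of the alphabet and of the four extra check symbols here is my own assumption, not part of any standard:

```python
# Crockford-style check symbol for the 33-character VIN alphabet:
# interpret the non-check characters as a base-33 number, reduce mod 37.
VIN_ALPHABET = "0123456789ABCDEFGHJKLMNPRSTUVWXYZ"   # 33 chars; I, O, Q excluded
CHECK_ALPHABET = VIN_ALPHABET + "*~$="               # 37 check symbols

def mod37_check(body: str) -> str:
    """Check symbol for a string over the 33-character VIN alphabet."""
    n = 0
    for c in body:
        n = n * 33 + VIN_ALPHABET.index(c)
    return CHECK_ALPHABET[n % 37]
```

Because 37 is prime and larger than 33, changing any single character, or swapping two adjacent ones, always changes the remainder mod 37, so every such error is caught.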

EBCDIC may go back to 1963, but BCDIC (which EBCDIC extends) goes back all the way to the 12-row punched card in 1928, and the numeric values of the letters A–Z were exactly the same in 1928 as what they ended up being in EBCDIC. So when Wikipedia says that the VIN checksum is based on the EBCDIC values, it might be committing a bit of an anachronism.

Andrew beat me to this. I’ll add that the E in EBCDIC is for extended. Given the date of adoption, they probably wanted to be able to compute and verify the checksum using a tabulator. It is conceivable that the coefficients were chosen to minimize wire clutter on the patch board used to program the algorithm.

That kind of artifact is common. CDs sample at 44.1 kHz because that lets the bits fit onto a VHS master tape nicely, frame by frame. 600+ MB disk drives were expensive and rare back then, but VHS tapes were cheap and common, so they’d produce a VHS tape that could drive the CD burners.