In many offices, you can dial a three digit number to reach someone else in the office. In such offices, you usually have to dial 9 to to reach an outside number. There’s no ambiguity because no one can have an extension that begins with 9. After you’ve entered three digits, the phone system knows whether you’ve dialed an in-office extension or the first three digits of an outside call.
This is an example of a prefix code: No valid phone number is a prefix of another valid phone number. We’ll look at a few more examples of prefix codes in the context of phone numbers, then look at Roman numerals, Morse code, Unicode encoding, data compression, and RPN calculators.
It used to be true in the US that you could dial four or five digits for a local call, seven digits for a call within the area code, and ten digits for a long distance phone call. This didn’t cause any ambiguity because no local number would begin with the digits of your area code. You had to dial a 1 before dialing a long distance number, and no local or area code numbers begin with 1.
There are still parts of the US where you can dial either a seven digit number or a ten digit number. In most of the US you always enter 10 digits. This is a trivial form of prefix coding because all fixed-length codes are prefix codes.
A final example of prefix codes related to telephony are country calling codes. These codes have varying lengths, but phone exchanges can know when the country code stops and when the number within a country starts because no prefix of a country code is a valid country code.
Prefix codes are sometimes called self-punctuating codes. This is because you don’t need an additional symbol, a form of punctuation, to mark the end of codes.
We usually think of Morse code as a system of two symbols: dots and dashes. But it’s really a system of four symbols: dots, dashes, short space between letters, and longer space between words. As a system of only dots and dashes, Morse code is not self-punctuating. Without spaces, you couldn’t tell, for example, whether dot dash represented an A (dot dash) , or an E (dot) followed by a T (dash). It’s possible to design a prefix code with just two symbols, but that’s not what Morse did.
Q codes are not part of Morse code per se but are often used in Morse code. These are three letter codes beginning with Q. For example, QRP is an abbreviation for the question “Should I reduce power?”. Q codes are prefix codes because they have fixed length. If QR were a valid Q code by itself, that would ruin the prefix property; a recipient would not know whether to interpret QR as a complete code until listening for the next letter.
The letter components of Roman numerals are not a prefix code because you can’t tell whether a letter stands for a positive or negative amount until you read the next letter. For example, in CLX the X represents 10 but in CXL the X represents -10. If you wrote Roman numerals backward, the letters would form a prefix code.
In the previous post, I discussed UTF-16 encoding of Unicode. The way UTF-16 encodes characters outside the Basic Multilingual Plane makes it a prefix code; the meaning of a surrogate doesn’t depend on any down-stream information. UTF-8 is also a prefix code, which I discuss in detail here.
You may have seen Huffman coding, a form of data compression that uses a prefix code.
Reverse Polish Notation is an example of prefix coding. An RPN calculator doesn’t need parentheses for punctuation. You can enter calculations unambiguously with just digits and arithmetic operators because the meaning of a computation does not depend on any future input. Prefix codes are sometimes called instantaneous codes because of this feature.