Varicode is a way of encoding text and control characters into binary using code words of variable length. It was developed as part of the PSK31 protocol for digital communication over amateur radio.
In the spirit of Morse code, it uses short code words for common characters and longer code words for less common characters in the expectation that this will result in shorter encodings.
If you use variable length words, you’ve got to have some way of knowing when one word ends and the next begins. Varicode solves this problem by using only keywords that begin and end with 1 and that do not contain two consecutive zeros. Then 00 is inserted between code words. Since 00 cannot appear inside a code word, these bits unambiguously mark the space between code words.
Varicode is self-synchronizing in the sense that if you jump into a stream of bits produced by Varicode, as soon as you see two zeros, you know you’re at a code word boundary and can start reading from there. You’ve lost any bits that were transmitted before you jumped in, but you can parse everything going forward.
ASCII doesn’t have this problem or this robustness. It doesn’t have the problem of determining code word boundaries because every code word is eight bits long. But then to read a stream of ASCII bits you need to know your position mod 8. If you jump into a stream of bits, you don’t know where the next code word boundary will be, though you may be able to infer it by trying 8 possibilities and seeing which produces the most intelligible results.
Another way to state the rules for forming Varicode code words is to say that 1 is a valid code (the code for a space, ASCII 0x20) and that you can form new codes by prefixing 1 or 10 to a code. In terms of regular expressions, this says a Varicode code word matches
How many code words are there of length n? Well, there are two ways to make a code word that long: you either put a 1 in front of a code word of length n-1, or you put a 10 in front of a code word of length n-2. So the number of code words of length n equals the number of code words of length n-1 plus the number of code words of length n-2. That is, the number of code words satisfies the same recurrence relation as the Fibonacci numbers.
It’s easy to see that there’s only one code word of length one, and only one code word of length two, so the number of code words of each length satisfies the initial conditions for the Fibonacci sequence as well, and so they are the Fibonacci numbers.
Varicode encodes a lot more than lower case letters—it encodes most ASCII characters— and so it would take some work to discover the relative frequencies of the characters, and this frequency would depend on where Varicode is used. As far as I know Varicode is use primarily (only?) in PSK31, and so the relevant frequency would be the frequencies in messages sent over PSK31, not English more generally.
You can find the code words of each letter here.
To make things easier, let’s suppose messages are limited to lower case letters and spaces, and that the letters follow the same distribution as in English in general.
We can use the letter frequencies here, except these don’t take spaces into account. If we assume words are about 5 letters long, then the probability of a character being a space is 1/6 and the probabilities of the other characters conditional on not being a space are given by the table. This means the letter probabilities need to be multiplied by 5/6.
This gives us a an expected length of 3.89 bits per letter, which is effectively 5.89 bits when you consider the 00 pattern we have to add between letters.
You could represent the 26 letters and a few more characters using 5 bits, but the result would not be self-synchronizing. One way of looking at this to say that the compression provided by variable length encoding nearly pays for the overhead required to make the code self-synchronizing.
2 thoughts on “Varicode”
Thanks, nice post!
Small nit: ASCII is 7 bits. (And an extra 8th bit per character presumably can be used to make such a fixed code self-synchronizing, although I don’t see any _efficient_ way to do it.)
Technically my references to ASCII should be *extended* ASCII that makes use of the higher 128 bits as well as the lower 128 of strict ASCII.