Index of coincidence is a statistic developed by William Friedman for use in cryptanalysis. It measures how unevenly symbols are distributed in a message.
It’s a kind of signature that could be used, for example, to infer the language of a text, even if the text has been encrypted with a simple substitution cipher. It also has more sophisticated uses, such as inferring key length in text that has been encrypted with a Vigenère cipher.
If a discrete random variable has n possible values, each occurring with probability pi, then its index of coincidence is
In cryptanalysis, the probabilities are the letter frequencies in the language of the message. The sum above is often multiplied by some constant, but this makes no difference in application because you would compare index of coincidence values, not interpret the index in the abstract.
Often the sum above is multiplied by n. This amounts to dividing the sum above by the corresponding sum for a uniform distribution.
Index of coincidence is occasionally use in applications, though not in cryptography anymore.
The letter frequencies in Google’s English language corpus are given here. In this corpus the probabilities range from 0.1249 for e down to 0.0009 for z. The index of coincidence would be
0.1249² + 0.0928² + … + 0.0009² = 0.06612
You’re more likely to see this value multiplied by 26, which gives 1.719.
Relation to entropy
Index of coincidence seems a lot like entropy, and in fact it is a simple transformation of an entropy, though not Shannon entropy. Index of coincidence is closely related to Rényi entropy.
Rényi entropy of order α is defined for a random variable X to be
and so Rényi entropy of order 2 is the negative log of the index of coincidence. That is, if H is the Rényi entropy of order 2 and I is the index of coincidence, then
Index of coicidence and Rényi entropy are sometimes defined with multiplicative constants that would slightly change the equations above.
Since exp(-x) is a decreasing function, increasing index of coincidence corresponds to decreasing Rényi entropy. Sorting in increasing order by index of coincidence is the same as sorting in decreasing order by Rényi entropy.