Index of coincidence is a statistic developed by William Friedman for use in cryptanalysis. It measures how unevenly symbols are distributed in a message.
It’s a kind of signature that could be used, for example, to infer the language of a text, even if the text has been encrypted with a simple substitution cipher. It also has more sophisticated uses, such as inferring key length in text that has been encrypted with a Vigenère cipher.
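To make the key-length use concrete, here is a rough sketch (mine, not from the post, with hypothetical helper names) of the standard approach: under a candidate key length k, every kth letter of a Vigenère ciphertext was shifted by the same amount, so each such coset should have an index of coincidence near that of plain English (about 0.066), while wrong candidate lengths give cosets whose index of coincidence is closer to the uniform value 1/26 ≈ 0.038.

```python
# Sketch: use index of coincidence to suggest Vigenère key lengths.
from collections import Counter

def ic(text):
    """Index of coincidence of a string: sum of squared letter frequencies."""
    counts = Counter(text)
    n = len(text)
    return sum((v / n) ** 2 for v in counts.values()) if n > 1 else 0.0

def coset_ic_profile(ciphertext, max_len=12):
    """Average IC of the k cosets (every kth letter) for each candidate key length k."""
    letters = "".join(c for c in ciphertext.upper() if c.isalpha())
    return {k: sum(ic(letters[i::k]) for i in range(k)) / k
            for k in range(1, max_len + 1)}

# Key lengths (and their multiples) where the average IC jumps toward 0.066
# are good candidates for the true key length.
```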
If a discrete random variable has n possible values, the ith occurring with probability pᵢ, then its index of coincidence is

I = p₁² + p₂² + … + pₙ²

In cryptanalysis, the probabilities are the letter frequencies in the language of the message. The sum above is often multiplied by some constant, but this makes no difference in practice because you compare index of coincidence values rather than interpreting a single value in the abstract.

Most often the multiplier is n. Since a uniform distribution over n values has index of coincidence 1/n, multiplying by n amounts to dividing by the corresponding value for a uniform distribution.
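As a quick illustration, here is a minimal Python sketch (mine, not from the post) of the definition, with and without the factor of n:

```python
def index_of_coincidence(probs):
    """I = sum of p_i squared over a list of probabilities."""
    return sum(p * p for p in probs)

def normalized_ic(probs):
    """IC multiplied by n, i.e. divided by the IC of a uniform distribution."""
    return len(probs) * index_of_coincidence(probs)

# For a uniform distribution over 26 values, I = 1/26 and the normalized IC is 1.
uniform = [1 / 26] * 26
print(index_of_coincidence(uniform))  # ≈ 0.0385
print(normalized_ic(uniform))         # ≈ 1.0
```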
Index of coincidence still comes up occasionally in applications, though no longer in cryptography, since modern ciphers do not preserve letter frequencies.
Example
The letter frequencies in Google’s English language corpus are given here. In this corpus the probabilities range from 0.1249 for e down to 0.0009 for z. The index of coincidence would be
0.1249² + 0.0928² + … + 0.0009² = 0.06612
You’re more likely to see this value multiplied by 26, which gives 1.719.
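Here is a small Python sketch (mine, not from the post) that estimates the index of coincidence from the letter counts of an actual piece of text. For a reasonably long English sample the result should land near the 0.066 (or 1.719 after multiplying by 26) computed above; short samples give noisy estimates.

```python
from collections import Counter

def text_ic(text):
    """Estimate the index of coincidence from observed letter frequencies."""
    counts = Counter(c for c in text.upper() if "A" <= c <= "Z")
    n = sum(counts.values())
    return sum((v / n) ** 2 for v in counts.values())

sample = "This is only a short sample, so its IC will be a noisy estimate."
print(text_ic(sample), 26 * text_ic(sample))
```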
Relation to entropy
Index of coincidence seems a lot like entropy, and in fact it is a simple transformation of an entropy, though not Shannon entropy. Index of coincidence is closely related to Rényi entropy.
Rényi entropy of order α is defined for a random variable X with probabilities p₁, …, pₙ to be

Hα(X) = (1/(1 − α)) log(p₁^α + p₂^α + … + pₙ^α)

and so Rényi entropy of order 2 is the negative log of the index of coincidence. That is, if H is the Rényi entropy of order 2 and I is the index of coincidence, then

H = −log I

or equivalently

I = exp(−H).
Index of coincidence and Rényi entropy are sometimes defined with multiplicative constants that would slightly change the equations above.
Since exp(-x) is a decreasing function, increasing index of coincidence corresponds to decreasing Rényi entropy. Sorting in increasing order by index of coincidence is the same as sorting in decreasing order by Rényi entropy.
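As a quick numerical check (my own sketch, not from the post, using natural logarithms) that I = exp(−H) for the Rényi entropy of order 2:

```python
from math import exp, log

probs = [0.5, 0.25, 0.125, 0.125]        # an arbitrary example distribution
I = sum(p * p for p in probs)            # index of coincidence
H2 = -log(sum(p ** 2 for p in probs))    # Rényi entropy of order 2
print(I, exp(-H2))                       # both 0.34375
```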
Good post but it’s Friedman not Freedman.
Index of coincidence looks like a rough “fingerprint” for a distribution.
Might there be second-order measures here, such as how often certain pairs of letters appear? That second index, an index of pair coincidence, might be another fingerprint for the source language.
The lexicographic version of Benford’s law might provide yet a third fingerprint.
I constructed something like this years ago for a specific problem; it’s nice to have a name for it.
Just to give an interpretation, if we write the definition of I as

I = ⟨p⟩

then I can be thought of as the average probability of being one of its n values. It is a measure of the concentration of the distribution. The reciprocal of I is (very loosely) the number of bins the distribution is spread over. All very hand-wavy, of course.
Oops, the formatting got goofed up in my post where I wrote the definition of I, probably because it got confused with HTML tags. How about:
I = < p >
If that fails, maybe:
I = E[p]
That formula is also used in economics to measure market concentration, where it is known as the Herfindahl–Hirschman index.