Index of coincidence

Index of coincidence is a statistic developed by William Friedman for use in cryptanalysis. It measures how unevenly symbols are distributed in a message; specifically, it is the probability that two symbols chosen at random from the message are the same.

It’s a kind of signature that could be used, for example, to infer the language of a text, even if the text has been encrypted with a simple substitution cipher. It also has more sophisticated uses, such as inferring key length in text that has been encrypted with a Vigenère cipher.

If a discrete random variable has n possible values, each occurring with probability p_i, then its index of coincidence is

I = \sum_{i=1}^n p_i^2

In cryptanalysis, the probabilities are the letter frequencies in the language of the message. The sum above is often multiplied by some constant, but this makes no difference in application because you would compare index of coincidence values, not interpret the index in the abstract.

Often the sum above is multiplied by n. This amounts to dividing it by the corresponding sum for a uniform distribution, which is n(1/n)² = 1/n.
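
To make the definition concrete, here is a minimal Python sketch of both versions, the raw sum and the sum multiplied by n; the function names are my own.

    def index_of_coincidence(probs):
        """Raw index of coincidence: the sum of squared probabilities."""
        return sum(p * p for p in probs)

    def normalized_ic(probs):
        """Index of coincidence multiplied by n, i.e. divided by the value
        a uniform distribution over the same n values would have."""
        return len(probs) * index_of_coincidence(probs)

    # For a uniform distribution the raw index is 1/n and the normalized index is 1;
    # the more concentrated the distribution, the larger both become.
    uniform = [1 / 26] * 26
    print(index_of_coincidence(uniform))  # 0.0384...
    print(normalized_ic(uniform))         # 1.0, up to rounding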

Index of coincidence is occasionally used in applications, though not in cryptography anymore.

Example

The letter frequencies in Google’s English language corpus are given here. In this corpus the probabilities range from 0.1249 for e down to 0.0009 for z. The index of coincidence would be

0.1249² + 0.0928² + … + 0.0009² = 0.06612

You’re more likely to see this value multiplied by 26, which gives 1.719.
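
In practice the index is usually estimated from letter counts in a sample of text rather than taken from a published table. The sketch below, with function names of my own choosing, shows that estimator and the key-length idea mentioned in the introduction: split a Vigenère ciphertext into columns by a candidate key length and check whether the columns' average index looks like English (about 0.066) or like uniformly random letters (1/26 ≈ 0.038).

    import string
    from collections import Counter

    def estimate_ic(text):
        """Estimate the index of coincidence from letter counts,
        using the sum of n_i(n_i - 1) divided by N(N - 1)."""
        letters = [c for c in text.lower() if c in string.ascii_lowercase]
        counts = Counter(letters)
        n = len(letters)
        return sum(k * (k - 1) for k in counts.values()) / (n * (n - 1))

    def average_column_ic(ciphertext, key_length):
        """Average index of coincidence over the columns formed by taking
        every key_length-th letter. If key_length is the true Vigenère key
        length, each column is a simple substitution of the plaintext, so
        its index should be near the plaintext language's value."""
        letters = [c for c in ciphertext.lower() if c in string.ascii_lowercase]
        columns = ["".join(letters[i::key_length]) for i in range(key_length)]
        return sum(estimate_ic(col) for col in columns) / key_length

    # Scan candidate key lengths and look for a jump toward the English value:
    # for m in range(1, 11):
    #     print(m, average_column_ic(ciphertext, m))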

Relation to entropy

Index of coincidence seems a lot like entropy, and in fact it is a simple transformation of an entropy, though not Shannon entropy. Index of coincidence is closely related to Rényi entropy.

Rényi entropy of order α is defined for a random variable X to be

H_\alpha(X) = \frac{1}{1 - \alpha} \log_2 \left(\sum_{i=1}^n p_i^\alpha \right)

and so Rényi entropy of order 2 is the negative base-2 logarithm of the index of coincidence. That is, if H is the Rényi entropy of order 2 and I is the index of coincidence, then

\begin{align*} H &= -\log_2 I \\ I &= 2^{-H} \end{align*}

Index of coincidence and Rényi entropy are sometimes defined with multiplicative constants that would slightly change the equations above.

Since 2^{-x} is a decreasing function, increasing index of coincidence corresponds to decreasing Rényi entropy. Sorting in increasing order by index of coincidence is the same as sorting in decreasing order by Rényi entropy.
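
A quick numerical check of these identities, using the unnormalized definitions above and an arbitrary distribution made up for the purpose:

    import math

    def index_of_coincidence(probs):
        """Sum of squared probabilities."""
        return sum(p * p for p in probs)

    def renyi_entropy_2(probs):
        """Rényi entropy of order 2, in bits."""
        return -math.log2(sum(p * p for p in probs))

    # An arbitrary example distribution.
    probs = [0.5, 0.25, 0.125, 0.125]

    I = index_of_coincidence(probs)   # 0.34375
    H = renyi_entropy_2(probs)        # about 1.54 bits

    print(I, 2 ** -H)        # these agree
    print(H, -math.log2(I))  # and so do these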

5 thoughts on “Index of coincidence”

  1. Index of coincidence looks like a rough “fingerprint” for a distribution.

    Might there be second order measures here, such as how often certain pairs of symbols appear? That second index, an index of pair coincidence, might be another fingerprint for the source language.

    The lexicographic version of Benford’s law might provide yet a third fingerprint.

  2. I constructed something like this years ago for a specific problem;
    it’s nice to have a name for it.
    Just to give an interpretation, if we write the definition of I as
    I = E[p]
    then I can be thought of as the average probability of being one of its n
    values. It is a measure of the concentration of the distribution. The
    reciprocal of I is (very loosely) the number of bins the distribution is
    spread over. All very hand-wavy, of course.

