A few days ago I wrote about Jaccard distance, a way of defining a distance between sets. The Ruzsa distance is similar, except it defines the distance between two subsets of an Abelian group.
Subset difference
Let A and B be two subsets of an Abelian (commutative) group G. Then the difference A − B is defined as the set

A − B = {a − b : a ∈ A, b ∈ B}.
As is customary with Abelian groups, we denote the group operation by + and a − b means the group operation applied to a and the inverse of b.
For example, let G be the group of integers mod 10. Let A = {1, 5} and B = {3, 7}. Then A − B is the set {2, 4, 8}. There are only three elements because 1 − 3 and 5 − 7 are both congruent to 8.
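Here's a quick Python sketch to make the example concrete; the function name diff_set and the mod-10 default are just for illustration.

```python
def diff_set(A, B, n=10):
    """Difference set A - B in the integers mod n."""
    return {(a - b) % n for a in A for b in B}

print(diff_set({1, 5}, {3, 7}))  # {2, 4, 8}, in some order
```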
Ruzsa distance
The Ruzsa distance between two subsets A and B of an Abelian group is defined by

d(A, B) = log( |A − B| / (|A|^½ |B|^½) )
where |S| denotes the number of elements in a set S.
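The definition translates directly into a Python sketch for the integers mod n, reusing diff_set from above. The base of the logarithm isn't specified; I use the natural log here, and any fixed base would do.

```python
from math import log

def ruzsa_distance(A, B, n=10):
    """Ruzsa distance between subsets A, B of the integers mod n."""
    d = diff_set(A, B, n)
    return log(len(d) / (len(A) * len(B)) ** 0.5)
```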
The Ruzsa distance is not a metric, but it fails to be one in an unusual way. The four axioms of a metric are
- d(x, x) = 0
- d(x, y) > 0 unless x = y
- d(x, y) = d(y, x)
- d(x, z) ≤ d(x, y) + d(y, z)
The first axiom is usually the trivial one to verify, but it's the only one that fails for the Ruzsa distance. Notably, the last axiom, the triangle inequality, does hold.
To see that the first axiom does not always hold, let's again take G to be the integers mod 10 and let A = {1, 3}. Then A − A is the set {0, 2, 8} and d(A, A) = log(3/2) > 0.
Sometimes d(A, A) does equal zero. If A = G then A − A = G = A, and so d(A, A) = log 1 = 0.
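Both computations check out numerically with the sketch above:

```python
print(ruzsa_distance({1, 3}, {1, 3}))  # log(3/2) ≈ 0.4055
G = set(range(10))
print(ruzsa_distance(G, G))            # 0.0
```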
Entropic Ruzsa distance
If we switch from subsets of G to random variables taking values in G, we can define an analog of the Ruzsa distance between random variables X and Y, the entropic Ruzsa distance

d(X, Y) = H(X′ + Y′) − H(X)/2 − H(Y)/2
where X′ and Y′ are independent copies of X and Y and H is Shannon entropy. For more on entropic Ruzsa distance, see this post by Terence Tao.
Note that if A and B are subsets of G, and X and Y are uniform random variables supported on A and B respectively, then the negative terms above correspond to the log of 1/(|A|^½ |B|^½). The H(X′ + Y′) term isn't the log of |A − B|, though. For one thing, it involves a sum rather than a difference. For another, a sum of uniform random variables need not be uniform: there may be more than one way to land on a particular sum, and so some sums will have higher probability than others.
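Here is a sketch of this quantity for uniform random variables on subsets of the integers mod n; the function names entropy and entropic_ruzsa are mine, and natural log is assumed as before. For A = {1, 5} and B = {3, 7}, the sum X′ + Y′ takes the value 8 with probability 1/2 and the values 2 and 4 with probability 1/4 each, so H(X′ + Y′) = (3/2) log 2, which is less than log 3 even though the support has three elements.

```python
from math import log

def entropy(p):
    """Shannon entropy (natural log) of a distribution given as a dict."""
    return -sum(q * log(q) for q in p.values() if q > 0)

def entropic_ruzsa(A, B, n=10):
    """Entropic Ruzsa distance between independent random variables
    uniform on A and uniform on B, in the integers mod n."""
    pX = {a: 1 / len(A) for a in A}
    pY = {b: 1 / len(B) for b in B}
    pS = {}  # distribution of X' + Y' mod n, via convolution
    for a, pa in pX.items():
        for b, pb in pY.items():
            s = (a + b) % n
            pS[s] = pS.get(s, 0) + pa * pb
    return entropy(pS) - entropy(pX) / 2 - entropy(pY) / 2

print(entropic_ruzsa({1, 5}, {3, 7}))  # 0.5 * log(2) ≈ 0.3466
```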