The Jaccard index is a way of measuring the similarity of sets. The Jaccard index, or Jaccard similarity coefficient, of two sets A and B is the number of elements in their intersection, A ∩ B, divided by the number of elements in their union, A ∪ B.
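The definition translates almost directly into code. The following is a minimal sketch rather than a canonical implementation; the function name `jaccard_index` and the choice to return 1.0 for two empty sets are assumptions made here, since the definition above leaves the 0/0 case open.

```python
def jaccard_index(a, b):
    """Return |A ∩ B| / |A ∪ B| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Both sets are empty, so the ratio is 0/0. Returning 1.0 is a
        # convention assumed for this sketch, not something the text specifies.
        return 1.0
    return len(a & b) / len(union)


# Two elements in common (2 and 3) out of four in total (1, 2, 3, 4).
print(jaccard_index({1, 2, 3}, {2, 3, 4}))  # 0.5
```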
Jaccard similarity is a robust way to compare things in machine learning, for example in clustering algorithms. It is less sensitive to outliers than other similarity measures such as cosine similarity.
Miles Davis albums
Here we’ll illustrate Jaccard similarity by looking at the personnel on albums by Miles Davis. Specifically, which pair of albums had more similar personnel: Kind of Blue and Round About Midnight, or Bitches Brew and In a Silent Way?
There were four musicians who played on both Kind of Blue and Round About Midnight: Miles Davis, Cannonball Adderley, John Coltrane, and Paul Chambers.
There were six musicians who played on both Bitches Brew and In a Silent Way: Miles Davis, Wayne Shorter, Chick Corea, Dave Holland, John McLaughlin, and Joe Zawinul.
The latter pair of albums had more personnel in common, but they also had more personnel in total.
There were 9 musicians who performed on either Kind of Blue or Round About Midnight. Since 4 played on both albums, the Jaccard index comparing the personnel on the two albums is 4/9.
In a Silent Way and especially Bitches Brew used more musicians. A total of 17 musicians performed on at least one of these albums, including 6 who were on both. So the Jaccard index is 6/17.
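To see which pair comes out more similar, the two ratios can be compared exactly. This sketch uses only the counts quoted above (shared musicians and total musicians); the variable names are illustrative.

```python
from fractions import Fraction

# Counts as stated in the text: musicians on both albums / musicians on either.
kind_of_blue_vs_midnight = Fraction(4, 9)
bitches_brew_vs_silent_way = Fraction(6, 17)

print(float(kind_of_blue_vs_midnight))    # about 0.444
print(float(bitches_brew_vs_silent_way))  # about 0.353

# Exact comparison: 4/9 > 6/17 because 4 * 17 = 68 > 6 * 9 = 54.
print(kind_of_blue_vs_midnight > bitches_brew_vs_silent_way)  # True
```

So although the later pair shares more musicians in absolute terms, the earlier pair has the higher Jaccard index.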
Jaccard distance
Jaccard distance is the complement of Jaccard similarity, i.e.
d(A, B) = 1 − J(A, B) = 1 − |A ∩ B| / |A ∪ B|,
where J(A, B) denotes the Jaccard index of A and B.
In our example, the Jaccard distance between Kind of Blue and Round About Midnight is 1 − 4/9 ≈ 0.556. The Jaccard distance between Bitches Brew and In a Silent Way is 1 − 6/17 ≈ 0.647.
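The arithmetic can be checked with a few lines of code. This is a small sketch working from the counts in the example; the helper `jaccard_distance` and its parameter names are made up for illustration, and the decimals are rounded to three places.

```python
from fractions import Fraction


def jaccard_distance(common, total):
    """Jaccard distance from the intersection size and the union size."""
    return 1 - Fraction(common, total)


d_early = jaccard_distance(4, 9)   # Kind of Blue vs. Round About Midnight
d_late = jaccard_distance(6, 17)   # Bitches Brew vs. In a Silent Way

print(d_early, round(float(d_early), 3))  # 5/9 0.556
print(d_late, round(float(d_late), 3))    # 11/17 0.647
```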
Jaccard distance really is a distance. It is clearly a symmetric function of its arguments, unlike Kullback-Leibler divergence.
The difficulty in establishing that Jaccard distance is a distance function, i.e. a metric, is the triangle inequality. The triangle inequality does hold, though this is not simple to prove.
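A brute-force check is no substitute for a proof, but it can build some confidence. The sketch below tests the triangle inequality for every triple of subsets of a four-element set, using exact fractions to avoid rounding issues. Treating two empty sets as being at distance 0 is a convention assumed here.

```python
from fractions import Fraction
from itertools import combinations, product


def jaccard_distance(a, b):
    union = a | b
    if not union:
        return Fraction(0)  # assumed convention: two empty sets are at distance 0
    return 1 - Fraction(len(a & b), len(union))


def all_subsets(universe):
    items = list(universe)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]


subsets = all_subsets(range(4))  # the 16 subsets of {0, 1, 2, 3}

# Collect every triple (a, b, c) with d(a, c) > d(a, b) + d(b, c).
violations = [
    (a, b, c)
    for a, b, c in product(subsets, repeat=3)
    if jaccard_distance(a, c) > jaccard_distance(a, b) + jaccard_distance(b, c)
]
print(len(subsets), len(violations))  # 16 0
```

All 4096 triples pass, which is consistent with the general result but of course only covers this small universe.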
The Jaccard distance is a great and accessible tool for statistics students studying multivariate analysis. My students have used it successfully to cluster themselves and their friends based on shopping habits, dining habits, or favorite music. Just survey the subjects, calculate the distances, and do a bit of multidimensional scaling, and bang, there are your clusters.
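The "calculate the distances" step might look something like the sketch below. The survey data is invented purely for illustration, and the names are hypothetical. The resulting pairwise distances are what one would then hand to a multidimensional scaling routine, which is not shown here.

```python
from itertools import combinations

# Hypothetical survey answers (invented for illustration): favorite genres per person.
survey = {
    "Ana": {"jazz", "blues", "funk"},
    "Ben": {"jazz", "blues"},
    "Cy": {"metal", "punk"},
}


def jaccard_distance(a, b):
    union = a | b
    # Assumed convention: two empty answer sets are at distance 0.
    return 1 - len(a & b) / len(union) if union else 0.0


# One distance per unordered pair of respondents.
distances = {
    (p, q): jaccard_distance(survey[p], survey[q])
    for p, q in combinations(survey, 2)
}
for pair, d in distances.items():
    print(pair, round(d, 3))
```

In this toy data Ana and Ben share two of three genres, so they sit close together, while Cy shares nothing with either and is at the maximum distance of 1 from both.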