If you have two sequences, and one is a permutation of the other, how do you measure how far apart the two sequences are?
This post was motivated by the previous post on anagram statistics. For a certain dictionary, the code used in that post found that the longest anagram pairs were the following:
certifications, rectifications
impressiveness, permissiveness
teaspoonsful, teaspoonfuls
Anagrams are more amusing when the two words are more dissimilar, at least in my opinion.
There are several ways to measure (dis)similarity. The words in the first two pairs above are dissimilar in meaning, but the words in the last pair are synonyms [1]. The pairs of words above are not that dissimilar as character strings.
It would be hard to quantify how dissimilar two words are in meaning, but easier to quantify how dissimilar they are as sequences of characters. We’ll look at two approaches: Hamming distance and transposition distance.
Hamming distance
The Hamming distance between two words is the number of positions in which they differ. We can see the Hamming distances between the words above by viewing them in a fixed-width font.
certifications rectifications impressiveness permissiveness teaspoonsful teaspoonfuls
The Hamming distances above are 2, 5, and 4.
The anagrams with maximum Hamming distance in my dictionary are “imperfections” and “perfectionism.” (There’s something poetic about these words being anagrams.) The Hamming distance is 13 because these 13-letter words differ in every position.
imperfections perfectionism
Transposition distance
Another way to measure the distance between two permutations is how many elements would have to be swapped to turn one list into the other. This is called the transposition distance. For example, the transposition distance between “certifications” and “rectifications” is 1, because you only need to swap the first and third characters to turn one into the other.
It’s not so obvious what the transposition distance is between “impressiveness” and “permissiveness”. We can find an upper bound on the distance by experimentation. The two words only differ in the first five letters, so the distance between “impressiveness” and “permissiveness” is no larger than the distance between “impre” and “permi”. The latter pair of words are no more than 4 transpositions apart as the following sequence shows:
impre pmire peirm perim permi
But is there a shorter sequence of transpositions? Would it help if we used the rest of the letters “ssiveness” as working space?
Computing transposition distance is hard, as in NP-hard. This is unfortunate since transposition has more practical applications than just word games. It is used, for example, in genetic research. It may be practical to compute for short sequences, but in general one must rely on bounds and approximations.
Note that transposition distance allows swapping any two elements. If we only allowed swapping consecutive elements, the problem is much easier, but the results are not the same. When restricted to consecutive swaps, the distance between “certifications” and “rectifications” is 3 rather than 1. We can swap the “c” and the “r” to turn “cer” into “rec” so we have to do something like
cer ecr erc rec
I don’t know a name for the distance where you allow rotations. The words “teaspoonsful” and “teaspoonfuls” differ by one rotation, turning “sful” into “fuls.” While this can be done in one rotation, it would take several swaps.
Related posts
[1] Although “teaspoonfuls” is far more common, I remember a schoolmarm teaching us that this is incorrect.
And I still recoil at “brother in laws” and say “brothers in law” instead; the majority is on my side on this one.
Mark Dominus wrote (https://blog.plover.com/lang/anagram-scoring.html) about his search for the best anagram a couple of years ago, using a distance that he thought was a better match for our intuitive idea of what makes a good anagram. Instead of looking at individual letters, it breaks a word into chunks that can be rearranged to form the other word. The score is the minimum number of chunks.
So teaspoonsful teaspoonfuls is a disappointing anagram, because despite the length of the words, we need only three pieces: teaspoon, s, ful.
A follow-up post https://blog.plover.com/lang/anagram-scoring-2.html discusses the clever algorithm for finding the best division.
Thanks. Nice article by MJD! And nice result: cinematographer and megachiropteran, a movie maker and a giant bat! Semantically and textually pretty far apart.
I had a need to score how different two permutations of a set (with no duplicates) were. My approach was based on the sum of squared distances between the indices at which each element appeared:
”’
I think that a possible refinement of Mark’s idea could be something like sum(sqrt(chunk size)). This would give a higher score to anagrams with many chunks of length 2 (such as notaries/senorita, which Mark mentioned should score higher), but less to anagrams with very long chunks like teaspoonfuls/teaspoonsful. Or perhaps sqrt could be replaced with a function that drops off even more quickly.