This morning I wrote about the frequencies of names for popes and kings. This involved sorting strings with Roman numerals since it’s common for popes and kings to have Roman numerals after their names.
Something that surprised me was that sorting Roman numerals alphabetically roughly sorts them in numerical order, especially for small numbers. It’s not perfect. For example, IX comes before V in alphabetical order.
Everyone who has done much work with data will have run into the problem of a column of numbers being sorted alphabetically rather than numerically. For example, “10” comes between “1” and “2” even though 10 comes after 1 and 2.
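The effect is easy to reproduce. Here is a minimal Python sketch; the list `column` is made-up data for illustration. Sorting it as strings compares character by character, while passing `key=int` compares the numeric values.

```python
# Hypothetical column of numbers stored as text
column = ["1", "2", "10"]

# The default sort on strings is lexicographic: "10" lands between "1" and "2"
as_strings = sorted(column)

# Converting each entry to an integer for comparison gives numerical order
as_numbers = sorted(column, key=int)

print(as_strings)  # ['1', '10', '2']
print(as_numbers)  # ['1', '2', '10']
```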
So you can’t sort numerals, Roman or Arabic, as strings and expect them to appear in numerical order. But Roman numerals come close when you’re sorting small numbers, such as I through XXIII for popes named John or I through VIII for kings of England named Henry.
To illustrate this, I plotted how well string sort order correlates with numeric order for Roman and Arabic numerals, for the sequence 1 … n for increasing values of n. I measured correlation using Spearman’s rank-order correlation. I tried Kendall’s tau as well and got similar results.
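For anyone who wants to reproduce the comparison, here is a rough sketch in plain Python. It is not necessarily how the plot was made: the function names `to_roman` and `sort_order_correlation` are my own for illustration, and rather than calling a statistics library it uses the textbook formula for Spearman’s correlation, which applies when there are no ties (true here, since every numeral is distinct).

```python
def to_roman(n):
    """Standard subtractive Roman numeral for 1 <= n <= 3999."""
    table = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
             (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
             (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in table:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


def sort_order_correlation(n, to_label):
    """Spearman correlation between numeric order and string sort order
    for the numbers 1..n (n >= 2), written out with to_label."""
    labels = [to_label(k) for k in range(1, n + 1)]
    # Indices of the numbers in the order their labels sort as strings
    by_string = sorted(range(n), key=lambda i: labels[i])
    # rank[i] is the position of number i + 1 in the string-sorted list
    rank = [0] * n
    for position, i in enumerate(by_string):
        rank[i] = position
    # No-ties formula: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    d2 = sum((rank[i] - i) ** 2 for i in range(n))
    return 1 - 6 * d2 / (n * (n * n - 1))


for n in (8, 10, 23):
    print(n, sort_order_correlation(n, to_roman), sort_order_correlation(n, str))
```

For n = 8, the kings named Henry, both systems agree perfectly with numerical order. The two diverge starting at n = 9 for Roman numerals, when IX appears, and at n = 10 for Arabic numerals.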
Alphabetical order and numerical order for Roman numerals agree pretty well up to XXXVIII, with just a few numbers out of place, namely IX, XIX, and XXIX. But alphabetical order and numerical order diverge quite a bit for Arabic numerals when all the numbers between 10 and 19 come before 2.
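The claim about I through XXXVIII can be checked directly. In this sketch, `to_roman_small` is a hypothetical helper that only handles numbers below 40, which is all that is needed here. The code sorts the numerals as strings, drops IX, XIX, and XXIX, and tests whether what remains is in numerical order.

```python
def to_roman_small(n):
    """Roman numeral for 1 <= n <= 39 only (no L, C, D, or M)."""
    out = []
    for value, symbol in [(10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


# Map each numeral I..XXXVIII back to its value
values = {to_roman_small(k): k for k in range(1, 39)}

# Values listed in the alphabetical order of their numerals
alphabetical = [values[s] for s in sorted(values)]

# Drop the three numerals named above and see whether the rest is sorted
out_of_place = {9, 19, 29}
rest = [k for k in alphabetical if k not in out_of_place]
print(rest == sorted(rest))  # True
```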
As you go further out, alphabetical order and numerical order diverge for both writing systems, but especially for Roman numerals.
Got a lot of time on your hands?
Some. I just finished doing a deposition and writing a client report, so I have some time to do something a little frivolous.
Of course, the correlation for Roman numerals drops off because the symbols cease to be alphabetically ordered after X.
If we used, say, A = 1, B = 5, C = 10… then the correlation would be much higher. We could even make them exactly alphabetical if we had a system in which each symbol could be subtracted only from the next one.
This would not, however, solve the other problems with Roman numerals.
The same holds if we use the Unicode characters for Roman numerals. The characters U+2160 through U+216F represent I through XII, L, C, D, and M. Then we get really good results sorting alphabetically.
(If sorting alphabetically means sorting by Unicode code point value. It doesn’t always. Life’s complicated. :)