Suppose a test asks you to place 10 events in chronological order. Label these events A through J so that chronological order is also alphabetical order.

If a student answers BACDEFGHIJ, then did they make two mistakes or just one? Two events are in the wrong position, but they made one transposition error. The simplest way to grade such a test would be to count the number of events that are in the correct position. Is this the most fair way to grade?

If you decide to count how many transpositions are needed to correct a student’s answer, do you count any transposition or only adjacent transpositions? For example, if someone answered JBCDEFGHIA, then transposing the A and the J is enough to put the results in order. But reversing the first and last event seems like a bigger mistake than reversing the first two events. Counting only adjacent transpositions would penalize this mistake more. You would have to swap the J with each of the eight letters between J and A. But it hardly seems that answering JBCDEFGHIA is eight times worse than answering BACDEFGHIJ.

Maybe counting transpositions is too much work. So we just go back to counting how many events are in the right place. But then suppose someone answers JABCDEFGHI. This is completely wrong since every event is in the wrong position. But the student obviously knows something, since the relative order of nearly all of the events is correct. From one perspective there was only one mistake: J comes last, not first.

What is the worst possible answer? Maybe getting the order exactly backward? If you have an odd number of events, then getting the order backward means one event is in the right place, and so that doesn’t receive the lowest possible score.

This is an interesting problem beyond grading exams. (As for grading exams, I’d suggest simply not using questions of this type on an exam.) In manufacturing, how serious a mistake is it to reverse two consecutive components versus two distant components? You could also ask the same question when comparing DNA sequences or other digital signals. The best way to assign a distance between the actual and desired sequence would depend entirely on context.

It also depends on the nature of the thing being measured. Can you define the correct index number and the guessed index number as ordinals? If so, you can quantify the deviation per item.

I would have to double check later, but I recall double cut and join (DCJ) for genome rearrangement should take into account the ordering subtleties you’re suggesting. I am sure there are some poor edge cases but I don’t recall off the top of my head.

Spearman’s rank correlation coefficient is meant for exactly this, although its appropriateness in any particular situation is debatable. https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient

In high school, I once got a 40% on an exam because I (consistently) misapplied the ideal gas law. Instead of using the fact that 1 mole of an ideal gas will occupy a volume of 22.4 liters at STP, I used 1 liter as the volume. Because that constant (22.4 liters) was used in nearly every problem, all my answers were numerically way off.

I got everything else right in the exam — numerically and conceptually. I argued with the teacher that I had only made ONE mistake, and that 60% of the exam score was too harsh a penalty.

She disagreed. The 40% score stood. 🙁

Instead of asking to rank the order of events, why not ask to assign a year to each event? You can still extract an explicit ordering, but you also get an objective loss function.

You are describing a well known concept related to adaptive sorting algorithm: “measures of presortedness”.

There are many of them, each one targeting a specific aspect of what we consider “presortedness”.

My favorite is REM(s), which is the minimum number of elements that you must remove from “s” to obtain a sorted subsequence (= length(s) – LIS(s)).

Another useful measure of presortedness is RUNS(s)-1, which is the number of increasing runs in the sequence (minus one, to obtain 0 if the sequence is sorted).

Seems like one could write down a short list of desirable properties and prove that no measure satisfies them all. Reminds me of the Alabama Paradox in that way.

My instinct is to score them based on pairwise comparisons correct. So BACDEFGHIJ has 1 comparison (A < B) wrong out of 55, while JABCDEFGHI has 9 wrong. JBCDEFGHIA would score 67% for 37 out of 55.

True, that immediately says that JBCDEFGHIA is 18 times worse than BACDEFGHIJ, but that's definitely true in _some_ applications. For example, if I asked someone to put the following events in chronological order:

– domestication of wheat

– first man on the moon

– development of iron tools

– the internet

– the crusades

– the fall of Rome

– the Great Pyramid

I'd be pretty shocked to see the domestication of wheat last, and the internet first. The nature of this problem is that that error shows the kind of colossal misunderstanding I'd be perfectly comfortable failing a student over.

As Bill points out, you can remove this issue entirely by asking for the date on which each event happened, trading your ordinal data for cardinal data.

correction: JBCDEFGHIA has only 36/55 correct.

re-correction, for real this time:

ABCDEFGHIJ has 45, not 55, pairwise comparisons (in contrast, 55 is the sum of the integers 1 to 10).

JABCDEFGHI really does have 9 comparisons wrong out of 45 (but not 9 wrong out of 55).

JBCDEFGHIA has 28 correct, or 17 wrong, out of 45.

I think implicit in the “How many mistakes are there?” question is our intuition of a utility function that measures how costly the mistakes are. If that utility function were made explicit, that would be the measure of “wrongness” in any given sequence.

The clear answer for exams is, as you suggested, not to use questions of

this kind as they are potentially a grading nightmare.

If you wanted to go crazy, you could compute the Levenshtein edit distance or a variant thereof (say allowing transposed elements) on the student answer string and the correct answer. Of course this is not very practical for grading exams. Also, In general, complicating things is not a good idea, so just sticking with the Hamming distance (or Hamming distance divided by 2) is probably a good choice.

The edit distance discussed recently on Lorens matlab blog sounds nice as a scoring rule here : http://blogs.mathworks.com/loren/2015/10/14/40-year-old-algorithm-that-cannot-be-improved/

The worst possible answer if the dissimilarity measure is spearman’s footrule is defined here: https://www.ibm.com/developerworks/community/blogs/jfp/entry/un_peu_de_math?lang=en