In my previous post I illustrated the Levenshtein edit distance by comparing the opening paragraphs of Finnegans Wake by James Joyce and a parody by Adam Roberts.
In this post I’ll show how to align two sequences using the sequence alignment algorithms of Needleman-Wunsch and Hirschberg. These algorithms can be used to compare any sequences, though they are more often used to compare DNA sequences than impenetrable novels and parodies.
I’ll be using Gang Li’s implementation of these algorithms, available on github. I believe the two algorithms are supposed to produce the same results, that Hirschberg’s algorithm is a more space-efficient implementation of the Needleman-Wunsch algorithm, though the two algorithms below produce slightly different results. I’ll give the output of Hirschberg’s algorithm.
Li’s alignment code uses lists of characters for input and output. I wrote a simple wrapper to take in strings and output strings.
from alignment import Needleman, Hirschberg def compare(str1, str2): seq1 = list(str1) seq2 = list(str2) for algorithm in [Needleman(), Hirschberg()]: a, b = algorithm.align(seq1, seq2) print("".join(a)) print("".join(b)) print()
The code inserts vertical bars to indicate spaces added for alignment. Here’s the result of using the Needleman-Wunsch algorithm on the opening paragraphs of Finnegans Wake and the parody Finnegans Ewok.
|||riverrun, past Ev|e| and Adam'||||s, mov|i|er|un, past ||new and |||||hopes, from swe|rv||e of shore|||| to bend of from s||tr|ike of |||||back to bend of b|||ay, brings us by a commodius |jeday, brings us by a commodius vic|u||s of recirculation back to |||lucas of recirculation back to H|owth Ca||stle|||| and E|nvi||r|ons. |fo||||||rest||moon and |en||dor.||||
I mentioned in my previous post that I could compare the first four paragraphs easily, but I had some trouble aligning the fifth paragraphs. The fifth paragraphs of each version start out quite similar:
Bygme||ster Fi|nnega||n, of the Bygm|onster ||Ann||akin, of the Stutte||r|||||||ing Hand, f|re|emen'|s ||||||Throatchokin| Hand, for|cemen|’s mau-rer, lived in the broadest way mau-rer, lived in the broadest way immarginable in his rushlit immarginable in his rushlit toofar-|||back for messuages before toofar| — back for messuages before
but then Roberts takes the liberty of skipping over a large section of the original. This is what I suspected by looking at the two texts, but Hirschberg’s algorithm makes the edits obvious by showing two long sequences of vertical bars, one about 600 characters long and another about 90 characters long.
One thought on “Sequence alignment”
Both the Needleman-Wunsch algorithm and Hirschberg’s algorithm produce *an* optimal solution — the difference you’re seeing is a consequence of the fact that many optimal solutions can exist, and often do for natural-language texts (and programs, and DNA…).
Quick example: When aligning XX to X, we could either delete the first or the second X from the first string — both lead to optimal, single-deletion alignments.