Reconstructing edit distance

In "Counting point mutations1" the calculation of Hamming distance gave us a clear way to model the sequence of point mutations transforming one genetic string into another. By simply writing one string directly over the other, we could count each mismatched symbol as a substitution.

However, in the calculation of edit distance (see "Edit distance2"), the two strings can have different lengths. Thus, simply superimposing one string over the other does us no good when it comes to visualizing a sequence of edit operations transforming one string into the other. To remedy this, we will introduce a new symbol to serve as a placeholder representing an inserted or deleted symbol. Furthermore, this placeholder will allow us to align two strings of differing lengths.

Assignment

An alignment of two strings $$s$$ and $$t$$ is defined by two strings $$s'$$ and $$t'$$ satisfying the following three conditions:

We say that $$s'$$ and $$t'$$ augment $$s$$ and $$t$$. Writing $$s'$$ directly over $$t'$$ so that symbols are aligned provides us with a scenario for transforming $$s$$ into $$t$$. Mismatched symbols from $$s$$ and $$t$$ correspond to symbol substitutions. A gap symbol $$s'[j]$$ aligned with a non-gap symbol $$t'[j]$$ implies the insertion of this symbol into $$t$$. A gap symbol $$t'[j]$$ aligned with a non-gap symbol $$s'[j]$$ implies the deletion of this symbol from $$s$$.

Thus, an alignment represents a transformation of $$s$$ into $$t$$ via edit operations. We define the corresponding edit alignment score of $$s'$$ and $$t'$$ as $$d_{\textrm{H}}(s', t')$$. Hamming distance is used because the gap symbol has been introduced for insertions and deletions. It follows that $$d_{\textrm{E}}(s, t) = min_{s', t'}d_{\textrm{H}}(s', t')$$, where the minimum is taken over all alignments of $$s$$ and $$t$$. We call such a minimum score alignment an optimal alignment (with respect to edit distance).

Your task:

Example

In the following interactive session, we assume the FASTA file data.faa3 to be located in the current directory.

>>> editDistance('PRETTY', 'PRTTEIN')
4
>>> from Bio import SeqIO
>>> editDistance(*SeqIO.parse('data.faa', 'fasta'))
314

>>> editDistanceAlignment('PRETTY', 'PRTTEIN')
('PRET-TY', 'PRTTEIN')
>>> from Bio import SeqIO
>>> editDistanceAlignment(*SeqIO.parse('data.faa', 'fasta'))
('RCASGLISQRSEH----ENHTMLYWK----Y--LLVMGIHPHVVHGKF-I---TEDWDVDGLKTNREGTDPAVSWFQQKIFPT-VQFPDYFKLTILDRCGRYNIPFTMPYEF---RR-A--RQQRFKVFRYIYEMQRQWHKTAWPWAYKTAMQKRPWKAHSMNGKFVLEMEGIAESDYTMGKLSGNFDINKGFDSGQNSPFKKIAWQMGPV--L-----FVNH-TKTTSMQISTWYRTKGAKMSQNYAQSFKTNYISDCC-------YIDYYMNSMAFCFVDHFLTCVHIHMLDRVPCSYPGERAWEQEHLGVVTMAELHPKWYGTTVGRTQSVPPKNC--NK-GRMMGKVNSIEHMSRTSDSSIEDEAGILYFREIFLKDGQFKAEKWWLNVFDRCS--PFQRFGMLHVDSEILQRLFTMFCYIWVRCEDWDELFMTENWADVANGCNWDGNTEFWVPYRKWN---SRTICQQWNRNDCHGCCVPSADYEKLNGREGRCRIIALIDWMNHGDKCSNNQEWCGPFENPTSSGPANWHRLGLCKETADTLTLIGHCRRQIWQRTL-----HFPSSWSRKCNIWEMHPNKYA--H----QIRETN-------YTNPEDCNNQRCMPYKPFIPKAPQIDMQDEMGESHDYVGPDAANSVWWALQRAGMFQAWKAVKQGCEDCAQALWHYLPLSTQFRVKPELAACSTMVNFWTKSAHCSNQVWANATFLQSVWFFAYAC-K-------PFHEIALGDPRGKSDVDSGK-VMQKVENKFIGARALPADQWFSGFPACPPG-YDHHKNVHIKCDTMRPDPADCIKLRFIIHAQWTQYVCFMFYCAITQVKWSP-----TLNMRRKYTMETPAFNLGVTFIGWFAETKMW-------EHGILGLIEFD', 'RCASGTISQRSEHYNTGANHTMLYWKTQQDYCPVLVVVLH-QIAEIKFIIHYLDEDWDVDGAKKNREGTDPAVSWFQPKIVPTRKQFHDYFWLTILDRCGRYNIPFTMDYEFRQRRRVAENYHQRFKVFRQIYE-AYEVGITAWPWAYKTAMQKRPWKAHSAEYDTQIAGEGIAE--------SGTFDINKGFDSGQNSPQKKIAWQMGAVRQLNERRPFVNHTTNKSSMQI-T--ENKGAKMSKNYAQSVYTNYISDCCGVEDSPGIIDYCMNSMAFCWVDHFLECVHIAMLDRVPCSEPGERAWEQEH------WELHPKWYGTTVGRTQS-RLQNCTPGKRARMMGKV-S--GTSDPEAFQKMDEAGILYFVEIFLKDGQFKAEK-W----DRCSFNYFQRFGMLHVDSEILQRLFAMWCYIWVR------------------HHNWDGNTEFWVPYRKWNPIYHRTICQQWNRNDCHGCCSPSARYEKLN---G--R-IASI--INHGD-AS-GSPRCGPFENPTSS-----HRLGLCKETADTITLIYHIRRQIWDPTLHQVAEHAPSSNIWRCPWLFMHPNVYAWWHRTYEEKMETNWEMHEMWYTNPEDCNNQRCMPY-----K-P---MQDEMGESHDYVGPDAANSVWVALFRAYMFQAWKAMKQGCEDCAQALWH--GLKFQFEVKP-DDMNRCMVNFW---------VWANATFLQSVWFFAYACKKYWFISACLFHEIAYGDPRGKSDVDSGKAAMRKVENKFIGARALPADQW--GFEACPPGKPDHHKNVHIKCDTMRPDPADYIRLRFIFHAQWTQYV--AWWSWEFQVKWSPYCHALTLNMRWKYTMETVA--LGVTFLGWFAEWKMWCQVMVAYFPGILGLICFD')