In "Counting point mutations1" we saw that Hamming distance gave us a preliminary notion of the evolutionary distance between two DNA strings by counting the minimum number of single nucleotide substitutions that could have occurred on the evolutionary path between the two strands.
However, in practice, homologous strands of DNA or protein are rarely the same length because point mutations also include the insertion or deletion of a single nucleotide (and single amino acids can be inserted or deleted from peptides). Thus, we need to incorporate these insertions and deletions into the calculation of the minimum number of point mutations between two strings. One of the simplest models charges a unit "cost" to any single-symbol insertion/deletion, then (in keeping with parsimony) requests the minimum cost over all transformations of one genetic string into another by point substitutions, insertions, and deletions.
Given two strings $$s$$ and $$t$$ (of possibly different lengths), the edit distance $$d_{\textrm{E}}(s, t)$$ is the minimum number of edit operations needed to transform $$s$$ into $$t$$, where an edit operation is defined as the substitution, insertion, or deletion of a single symbol.
The latter two operations incorporate the case in which a contiguous
interval is inserted into or deleted from a string. Such an interval is
called a gap. For the purposes of this problem, the insertion or
deletion of a gap of length $$k$$ still counts as $$k$$ distinct edit
operations.
Write a function editDistance that takes two protein strings $$s$$ and $$t$$. The function must return the edit distance $$d_{\textrm{E}}(s, t)$$.
In the following interactive session, we assume the FASTA file data.faa2 to be located in the current directory.
>>> editDistance('PLEASANTLY', 'MEANLY') 5 >>> from Bio import SeqIO >>> editDistance(*SeqIO.parse('data.faa', 'fasta')) 349