When we assembled genomes, we discussed how to use overlapping reads to assemble a genome, a problem that was complicated by errors in reads. We would like to find overlaps between error-prone reads as well.
An overlap alignment of strings $$v = v_1\ldots v_n$$ and $$w = w_1\ldots w_m$$ is a global alignment of a suffix of $$v$$ with a prefix of $$w$$. An optimal overlap alignment of strings $$v$$ and $$w$$ maximizes the global alignment score between an $$i$$-suffix of $$v$$ and a $$j$$-prefix of $$w$$ (i.e., between $$v_i\ldots v_n$$ and $$w_1\ldots w_j$$) among all $$i$$ and $$j$$.
In this assignment we will construct a highest-scoring overlap alignment between two strings. To score alignments, we use the simple scoring method in which matches count +1 and both the mismatch and indel penalties are equal to 2. Your task:
Write a function overlap_alignment_score that takes the location of a FASTA file containing two amino acid sequences $$v$$ and $$w$$. The function must return the maximum score of an overlap alignment of $$v$$ and $$w$$.
Write a function overlap_alignment that takes the location of a FASTA file containing two amino acid sequences $$v$$ and $$w$$. The function must return an overlap alignment of $$v$$ and $$w$$, represented as a tuple of two strings with indels represented by hyphens (-). If multiple overlap alignments achieving the maximum score exist, the function may return any one.
In the following interactive session, we assume the FASTA file data01.faa1 to be located in the current directory.
>>> overlap_alignment_score('data01.faa') 1 >>> overlap_alignment('data01.faa') ('HEAE', 'HEA-')