When we assembled genomes, we discussed how to use overlapping reads to assemble a genome, a problem that was complicated by errors in reads. We would like to find overlaps between error-prone reads as well.

An overlap alignment of strings $$v = v_1\ldots v_n$$ and $$w = w_1\ldots w_m$$ is a global alignment of a suffix of $$v$$ with a prefix of $$w$$. An optimal overlap alignment of strings $$v$$ and $$w$$ maximizes the global alignment score between an $$i$$-suffix of $$v$$ and a $$j$$-prefix of $$w$$ (i.e., between $$v_i\ldots v_n$$ and $$w_1\ldots w_j$$) among all $$i$$ and $$j$$.

Assignment

In this assignment we will construct a highest-scoring overlap alignment between two strings. To score alignments, we use the simple scoring method in which matches count +1 and both the mismatch and indel penalties are equal to 2. Your task:

Write a function overlap_alignment_score that takes the location of a FASTA file containing two amino acid sequences $$v$$ and $$w$$. The function must return the maximum score of an overlap alignment of $$v$$ and $$w$$.
Write a function overlap_alignment that takes the location of a FASTA file containing two amino acid sequences $$v$$ and $$w$$. The function must return an overlap alignment of $$v$$ and $$w$$, represented as a tuple of two strings with indels represented by hyphens (-). If multiple overlap alignments achieving the maximum score exist, the function may return any one.

Example

In the following interactive session, we assume the FASTA file data01.faa¹ to be located in the current directory.

        >>> overlap_alignment_score('data01.faa')
1

>>> overlap_alignment('data01.faa')
('HEAE', 'HEA-')