When we assembled genomes, we discussed how to use overlapping reads to assemble a genome, a problem that was complicated by errors in reads. We would like to find overlaps between error-prone reads as well.

An overlap alignment of strings $$v = v_1\ldots v_n$$ and $$w = w_1\ldots w_m$$ is a global alignment of a suffix of $$v$$ with a prefix of $$w$$. An optimal overlap alignment of strings $$v$$ and $$w$$ maximizes the global alignment score between an $$i$$-suffix of $$v$$ and a $$j$$-prefix of $$w$$ (i.e., between $$v_i\ldots v_n$$ and $$w_1\ldots w_j$$) among all $$i$$ and $$j$$.

Assignment

In this assignment we will construct a highest-scoring overlap alignment between two strings. To score alignments, we use the simple scoring method in which matches count +1 and both the mismatch and indel penalties are equal to 2. Your task:

Example

In the following interactive session, we assume the FASTA file data01.faa1 to be located in the current directory.

>>> overlap_alignment_score('data01.faa')
1

>>> overlap_alignment('data01.faa')
('HEAE', 'HEA-')