Overlapping reads with errors

As also mentioned in "Error correction in reads", the sequencing machines that identify reads can make errors. However, the problem that we considered in "Genome assembly as shortest superstring1" assumed that all reads are error-free.

Thus, rather than trying to overlap reads exactly, we will instead do so approximately. The key to do this is to move toward methods that incorporate alignments. Yet neither global nor local alignment is appropriate for this task. Global alignment will attempt to align the entire reads, when we know that only the overlapping parts of the reads are relevant. For that matter, we may identify an optimal local alignment that does not correspond to an overlap.

As a result, we need a specific type of local alignment that aligns only the overlapping parts of two strings.

Assignment

An overlap alignment between two strings $$s$$ and $$t$$ is a local alignment of a suffix of $$s$$ with a prefix of $$t$$. An optimal overlap alignment will therefore maximize an alignment score over all such substrings of $$s$$ and $$t$$. Your task:

Example

In the following interactive session, we assume the FASTA file data.fna2 to be located in the current directory.

>>> from Bio import SeqIO

>>> overlapAlignmentScore('CTAAGGGATTCCGGTAATTAGACAG', 'ATAGACCATATGTCAGTGACTGTGTAA')
1
>>> overlapAlignmentScore(*SeqIO.parse('data.fna', 'fasta'))
4

>>> overlapAlignment('CTAAGGGATTCCGGTAATTAGACAG', 'ATAGACCATATGTCAGTGACTGTGTAA')
('ACAG', 'ATAG')
>>> from Bio import SeqIO
>>> overlapAlignment(*SeqIO.parse('data.fna', 'fasta'))
('ATTAGCCTTTAAA-TAATCTCGGACGACTAT', 'ATTAGC-T--AAACTAA-CTGGG-CG-CT-T')