As also mentioned in "Error correction in reads", the sequencing machines that identify reads can make errors. However, the problem that we considered in "Genome assembly as shortest superstring1" assumed that all reads are error-free.
Thus, rather than trying to overlap reads exactly, we will instead do so approximately. The key to do this is to move toward methods that incorporate alignments. Yet neither global nor local alignment is appropriate for this task. Global alignment will attempt to align the entire reads, when we know that only the overlapping parts of the reads are relevant. For that matter, we may identify an optimal local alignment that does not correspond to an overlap.
As a result, we need a specific type of local alignment that aligns only
the overlapping parts of two strings.
An overlap alignment between two strings $$s$$ and $$t$$ is a local alignment of a suffix of $$s$$ with a prefix of $$t$$. An optimal overlap alignment will therefore maximize an alignment score over all such substrings of $$s$$ and $$t$$. Your task:
Write a function overlapAlignmentScore that takes two DNA strings $$s$$ and $$t$$. The function must return the score of an optimal overlap alignment of $$s$$ and $$t$$. To align the two given DNA strings, the function must use an alignment score in which matching symbols count +1, substitutions count -2, and there is a linear gap penalty of 2.
Write a function overlapAlignment that takes two DNA strings $$s$$ and $$t$$. The function must return a tuple containing an alignment of a suffix $$s'$$ of $$s$$ and a prefix $$t'$$ of $$t$$ achieving an optimal alignment score. To align the two given DNA strings, the function must use an alignment score in which matching symbols count +1, substitutions count -2, and there is a linear gap penalty of 2. If multiple optimal alignments exist, the function may return any one.
In the following interactive session, we assume the FASTA file data.fna2 to be located in the current directory.
>>> from Bio import SeqIO >>> overlapAlignmentScore('CTAAGGGATTCCGGTAATTAGACAG', 'ATAGACCATATGTCAGTGACTGTGTAA') 1 >>> overlapAlignmentScore(*SeqIO.parse('data.fna', 'fasta')) 4 >>> overlapAlignment('CTAAGGGATTCCGGTAATTAGACAG', 'ATAGACCATATGTCAGTGACTGTGTAA') ('ACAG', 'ATAG') >>> from Bio import SeqIO >>> overlapAlignment(*SeqIO.parse('data.fna', 'fasta')) ('ATTAGCCTTTAAA-TAATCTCGGACGACTAT', 'ATTAGC-T--AAACTAA-CTGGG-CG-CT-T')