Aligning similar substrings

Whereas global alignment (see "Global alignment with scoring matrix1") can be helpful for comparing genetic strings of similar length that resemble each other, often we will be presented with strings that are mostly dissimilar except for some unknown region of the strings, which may represent a shared gene. To find such genes, we need to modify global alignment to instead search for shared motifs in the form of locally similar regions (recall "Finding a shared motif2" and "Finding a shared spliced motif3").

Using global alignment often fails to find shared motifs hidden in larger strings because (especially if the similar region is found on different ends of the string) aligning the strings causes gap penalties to rack up.

If we are only interested in comparing the regions of similarity, then we would like to have some way of disregarding the parts of the strings that don't resemble each other. The way to do this is to produce alignment scores for all possible pairs of substrings.

Assignment

A local alignment of two strings $$s$$ and $$t$$ is an alignment of substrings $$r$$ and $$u$$ of $$s$$ and $$t$$, respectively. Let $$\textrm{opt}(r, u)$$ denote the score of an optimal alignment of $$r$$ and $$u$$ with respect to some predetermined alignment score. Your task:

Example

In the following interactive session, we assume the FASTA file data.faa4 to be located in the current directory.

>>> from Bio import SeqIO

>>> localAlignmentScore('MEANLYPRTEINSTRING', 'PLEASANTLYEINSTEIN')
23
>>> localAlignmentScore(*SeqIO.parse('data.faa', 'fasta'))
1281

>>> localAlignment('MEANLYPRTEINSTRING', 'PLEASANTLYEINSTEIN')
('LYPRTEINSTRIN', 'LY---EINSTEIN')
>>> localAlignment(*SeqIO.parse('data.faa', 'fasta'))
('HMTWVFRSIDMSW-MCG-PQ-YAY-WFVITSIFFHRHN-SCKLAKCCIP-EQ-Y-EKWGNQHYNDSDNRVHIPVYYAYHRGGGDDVQNHQAQFLAKFEMGAFKRWPRKAYDIFWF-MRWHKFPVAQIMSKVCIYDRGNYMMPPAALLHMIMDC-ESISQCQVASCEGVCWAGYNRSCITTKPNLTRSTL--WMDILPYFFHPYLQDYH-IWIHEMY--I-PVPVMSDRPLYRV-QTA--CD-IEVPSQFIR-GEFVLL-PWQIKVQYEQQAEFFQPTAGFRYD-YCIQGFNMYQIMHIHMFCAYKLFVEGTNVIKVTGFWFNGGENYNWDTNPTDPNNNLHHGNNGRHRAKTDTQDQMQRAGKFCSGCLGYIQHFKRMMVIDP---LC-----YHSH-------L-VPPKWNTVQTTFEVEHKHDDLGIQMDGGKWEQVRVYKLHWTIEGITYPEQTVQ-TG--MWLMIRVFIC-ETFEPSCQYLHAEYFNREKYGYYAYQTYLSKYQAWIKQHINIKNKTMVRASWSVACY-WSVVVDW-VD-EQRKINMCRVNGLSSRYGYWEMNGRKGWGPTWLPH---LWLDAS-PHHNFKGTKHW-----P--DPNVYCEHEWDNCVRWSNIV---R-M--ESCVDLDIWCYQGKQDF-M--W-GDRCEPFCGCDIWTEMQ-GGKCTTVYHYAPQVVVCSSYFKMLCVGKHWKLDHKPAVADFPPHLITTYMCHHSECQNSE-DTRSKYMSEIKHCS---WEM-WHQTLTVSHIWTHKKRANPVVYTDAEIRITPNRHFLSRNMEYVHFHFMFNSPDAVV-WEEFFGICWPMFG-CWHFDVIKHSKQAGCNNKQSLGPMFKC-KLI-WHMRVVFV-PLYEEFTE--WNPTCKPKHEWMEYGFCVCMSF-KE-Q-RIPTYRMWMQLLKRKIQV', 'QTRWR-PKLPL-WNLCQQPESITNTWGTVQCMK-QTYNWPCDF--CIMAWKPAYSQKW-W-HSRVCLYMVHLTI--SYTMCYSDQVCTR-VPVI-HW-MNGL---P-AVCIMFECGLRWC-YHVMK-RNEPEI-QMPEFKLYPS--LRL---CSEAYWHRNVGPEHDHS-NTMDQP-VGIEHTIRISNMQHWFCI-EW-FSK-LMGMNSVWSQQMLKLVFPMPQNGDW-VFELLEAKHFCMYMSLHVDLKRHGNQTQSKTRKHK-RYAALSAL---TFGYVLQIYCIQGFNMYQIMHIHMFCAYKLFVEGTN---V--Y-M-----Y-RCTN-MQ-NNNLHHGYNGRHRAKTDTQ-Q-Q--GKCCSGCLGYIQHFKRMMVIDPYSKMNRWEEEYHSHTGGSRQSMMVSPKWNTVQTTFEVEHKHDDLGIQMDGG--EPVRVYKLHWTQAQATYITEGITYPEQCVWLMISVFIGHETFE----W--AQ--NREKYG-Y-YQTVLSKYQA-IKQHKNIGNKTMVRASWSVACYDHNVFFGYAVHGEEW---MCRVNGLSSRYGYWHMNGRK--GPTWLPPFGYLWLDAAKPHHNFKGTKHWEYIDYHLQDPNVYEEHEWDNCVRWSNIVRIIRMMFWESCVDLDYWCYSGKQDFSRYHWHPDR-DIICLLNM-KPIQFPHMCMSL-K-NGNMRTDDIYWDD-GVMPAYN-NCRR-IVNVMPKWYEQLMIVQVMCMNSGPSTRCKF-SQSAHCRFIFWGKFWP-FLRVSCFF-QIATVVMVKYL-PTVLCQQQKHS-ANLMAH-HMQVQ-PNPEYMPYWIT--SI--QLVASC-H-ESVHYPWRNSITDSFAVAEKYPVDNVVKWS-DCIFICNMVQTVTKKSWNQQ-RAW-EWHEHEVNRCFDYLAAGAMRVDHTNSWNDWIK-AFQV')