Aligning similar substrings

Whereas global alignment (see "Global alignment with scoring matrix¹") can be helpful for comparing genetic strings of similar length that resemble each other, often we will be presented with strings that are mostly dissimilar except for some unknown region of the strings, which may represent a shared gene. To find such genes, we need to modify global alignment to instead search for shared motifs in the form of locally similar regions (recall "Finding a shared motif²" and "Finding a shared spliced motif³").

Using global alignment often fails to find shared motifs hidden in larger strings because (especially if the similar region is found on different ends of the string) aligning the strings causes gap penalties to rack up.

If we are only interested in comparing the regions of similarity, then we would like to have some way of disregarding the parts of the strings that don't resemble each other. The way to do this is to produce alignment scores for all possible pairs of substrings.

Assignment

A local alignment of two strings $$s$$ and $$t$$ is an alignment of substrings $$r$$ and $$u$$ of $$s$$ and $$t$$, respectively. Let $$\textrm{opt}(r, u)$$ denote the score of an optimal alignment of $$r$$ and $$u$$ with respect to some predetermined alignment score. Your task:

Write a function localAlignmentScore that takes two protein strings $$s$$ and $$t$$. The function must return $$\textrm{opt}(r, u)$$, using the PAM250 scoring matrix and a linear gap penalty equal to 5.
Write a function localAlignment that takes two protein strings $$s$$ and $$t$$. The function must return a tuple containing two strings $$r'$$ and $$u'$$ that represent an optimal local alignment of $$s$$ and $$t$$, using the PAM250 scoring matrix and a linear gap penalty equal to 5.

Example

In the following interactive session, we assume the FASTA file data.faa⁴ to be located in the current directory.

        >>> from Bio import SeqIO

>>> localAlignmentScore('MEANLYPRTEINSTRING', 'PLEASANTLYEINSTEIN')
23
>>> localAlignmentScore(*SeqIO.parse('data.faa', 'fasta'))
1281

>>> localAlignment('MEANLYPRTEINSTRING', 'PLEASANTLYEINSTEIN')
('LYPRTEINSTRIN', 'LY---EINSTEIN')
>>> localAlignment(*SeqIO.parse('data.faa', 'fasta'))
('HMTWVFRSIDMSW-MCG-PQ-YAY-WFVITSIFFHRHN-SCKLAKCCIP-EQ-Y-EKWGNQHYNDSDNRVHIPVYYAYHRGGGDDVQNHQAQFLAKFEMGAFKRWPRKAYDIFWF-MRWHKFPVAQIMSKVCIYDRGNYMMPPAALLHMIMDC-ESISQCQVASCEGVCWAGYNRSCITTKPNLTRSTL--WMDILPYFFHPYLQDYH-IWIHEMY--I-PVPVMSDRPLYRV-QTA--CD-IEVPSQFIR-GEFVLL-PWQIKVQYEQQAEFFQPTAGFRYD-YCIQGFNMYQIMHIHMFCAYKLFVEGTNVIKVTGFWFNGGENYNWDTNPTDPNNNLHHGNNGRHRAKTDTQDQMQRAGKFCSGCLGYIQHFKRMMVIDP---LC-----YHSH-------L-VPPKWNTVQTTFEVEHKHDDLGIQMDGGKWEQVRVYKLHWTIEGITYPEQTVQ-TG--MWLMIRVFIC-ETFEPSCQYLHAEYFNREKYGYYAYQTYLSKYQAWIKQHINIKNKTMVRASWSVACY-WSVVVDW-VD-EQRKINMCRVNGLSSRYGYWEMNGRKGWGPTWLPH---LWLDAS-PHHNFKGTKHW-----P--DPNVYCEHEWDNCVRWSNIV---R-M--ESCVDLDIWCYQGKQDF-M--W-GDRCEPFCGCDIWTEMQ-GGKCTTVYHYAPQVVVCSSYFKMLCVGKHWKLDHKPAVADFPPHLITTYMCHHSECQNSE-DTRSKYMSEIKHCS---WEM-WHQTLTVSHIWTHKKRANPVVYTDAEIRITPNRHFLSRNMEYVHFHFMFNSPDAVV-WEEFFGICWPMFG-CWHFDVIKHSKQAGCNNKQSLGPMFKC-KLI-WHMRVVFV-PLYEEFTE--WNPTCKPKHEWMEYGFCVCMSF-KE-Q-RIPTYRMWMQLLKRKIQV', 'QTRWR-PKLPL-WNLCQQPESITNTWGTVQCMK-QTYNWPCDF--CIMAWKPAYSQKW-W-HSRVCLYMVHLTI--SYTMCYSDQVCTR-VPVI-HW-MNGL---P-AVCIMFECGLRWC-YHVMK-RNEPEI-QMPEFKLYPS--LRL---CSEAYWHRNVGPEHDHS-NTMDQP-VGIEHTIRISNMQHWFCI-EW-FSK-LMGMNSVWSQQMLKLVFPMPQNGDW-VFELLEAKHFCMYMSLHVDLKRHGNQTQSKTRKHK-RYAALSAL---TFGYVLQIYCIQGFNMYQIMHIHMFCAYKLFVEGTN---V--Y-M-----Y-RCTN-MQ-NNNLHHGYNGRHRAKTDTQ-Q-Q--GKCCSGCLGYIQHFKRMMVIDPYSKMNRWEEEYHSHTGGSRQSMMVSPKWNTVQTTFEVEHKHDDLGIQMDGG--EPVRVYKLHWTQAQATYITEGITYPEQCVWLMISVFIGHETFE----W--AQ--NREKYG-Y-YQTVLSKYQA-IKQHKNIGNKTMVRASWSVACYDHNVFFGYAVHGEEW---MCRVNGLSSRYGYWHMNGRK--GPTWLPPFGYLWLDAAKPHHNFKGTKHWEYIDYHLQDPNVYEEHEWDNCVRWSNIVRIIRMMFWESCVDLDYWCYSGKQDFSRYHWHPDR-DIICLLNM-KPIQFPHMCMSL-K-NGNMRTDDIYWDD-GVMPAYN-NCRR-IVNVMPKWYEQLMIVQVMCMNSGPSTRCKF-SQSAHCRFIFWGKFWP-FLRVSCFF-QIATVVMVKYL-PTVLCQQQKHS-ANLMAH-HMQVQ-PNPEYMPYWIT--SI--QLVASC-H-ESVHYPWRNSITDSFAVAEKYPVDNVVKWS-DCIFICNMVQTVTKKSWNQQ-RAW-EWHEHEVNRCFDYLAAGAMRVDHTNSWNDWIK-AFQV')