Whereas global alignment (see "Global alignment with scoring matrix1") can be helpful for comparing genetic strings of similar length that resemble each other, often we will be presented with strings that are mostly dissimilar except for some unknown region of the strings, which may represent a shared gene. To find such genes, we need to modify global alignment to instead search for shared motifs in the form of locally similar regions (recall "Finding a shared motif2" and "Finding a shared spliced motif3").
Using global alignment often fails to find shared motifs hidden in larger strings because (especially if the similar region is found on different ends of the string) aligning the strings causes gap penalties to rack up.
If we are only interested in comparing the regions of similarity, then we
would like to have some way of disregarding the parts of the strings that
don't resemble each other. The way to do this is to produce alignment
scores for all possible pairs of substrings.
A local alignment of two strings $$s$$ and $$t$$ is an alignment of substrings $$r$$ and $$u$$ of $$s$$ and $$t$$, respectively. Let $$\textrm{opt}(r, u)$$ denote the score of an optimal alignment of $$r$$ and $$u$$ with respect to some predetermined alignment score. Your task:
Write a function localAlignmentScore that takes two protein strings $$s$$ and $$t$$. The function must return $$\textrm{opt}(r, u)$$, using the PAM250 scoring matrix and a linear gap penalty equal to 5.
Write a function localAlignment that takes two protein strings $$s$$ and $$t$$. The function must return a tuple containing two strings $$r'$$ and $$u'$$ that represent an optimal local alignment of $$s$$ and $$t$$, using the PAM250 scoring matrix and a linear gap penalty equal to 5.
In the following interactive session, we assume the FASTA file data.faa4 to be located in the current directory.
>>> from Bio import SeqIO >>> localAlignmentScore('MEANLYPRTEINSTRING', 'PLEASANTLYEINSTEIN') 23 >>> localAlignmentScore(*SeqIO.parse('data.faa', 'fasta')) 1281 >>> localAlignment('MEANLYPRTEINSTRING', 'PLEASANTLYEINSTEIN') ('LYPRTEINSTRIN', 'LY---EINSTEIN') >>> localAlignment(*SeqIO.parse('data.faa', 'fasta')) ('HMTWVFRSIDMSW-MCG-PQ-YAY-WFVITSIFFHRHN-SCKLAKCCIP-EQ-Y-EKWGNQHYNDSDNRVHIPVYYAYHRGGGDDVQNHQAQFLAKFEMGAFKRWPRKAYDIFWF-MRWHKFPVAQIMSKVCIYDRGNYMMPPAALLHMIMDC-ESISQCQVASCEGVCWAGYNRSCITTKPNLTRSTL--WMDILPYFFHPYLQDYH-IWIHEMY--I-PVPVMSDRPLYRV-QTA--CD-IEVPSQFIR-GEFVLL-PWQIKVQYEQQAEFFQPTAGFRYD-YCIQGFNMYQIMHIHMFCAYKLFVEGTNVIKVTGFWFNGGENYNWDTNPTDPNNNLHHGNNGRHRAKTDTQDQMQRAGKFCSGCLGYIQHFKRMMVIDP---LC-----YHSH-------L-VPPKWNTVQTTFEVEHKHDDLGIQMDGGKWEQVRVYKLHWTIEGITYPEQTVQ-TG--MWLMIRVFIC-ETFEPSCQYLHAEYFNREKYGYYAYQTYLSKYQAWIKQHINIKNKTMVRASWSVACY-WSVVVDW-VD-EQRKINMCRVNGLSSRYGYWEMNGRKGWGPTWLPH---LWLDAS-PHHNFKGTKHW-----P--DPNVYCEHEWDNCVRWSNIV---R-M--ESCVDLDIWCYQGKQDF-M--W-GDRCEPFCGCDIWTEMQ-GGKCTTVYHYAPQVVVCSSYFKMLCVGKHWKLDHKPAVADFPPHLITTYMCHHSECQNSE-DTRSKYMSEIKHCS---WEM-WHQTLTVSHIWTHKKRANPVVYTDAEIRITPNRHFLSRNMEYVHFHFMFNSPDAVV-WEEFFGICWPMFG-CWHFDVIKHSKQAGCNNKQSLGPMFKC-KLI-WHMRVVFV-PLYEEFTE--WNPTCKPKHEWMEYGFCVCMSF-KE-Q-RIPTYRMWMQLLKRKIQV', 'QTRWR-PKLPL-WNLCQQPESITNTWGTVQCMK-QTYNWPCDF--CIMAWKPAYSQKW-W-HSRVCLYMVHLTI--SYTMCYSDQVCTR-VPVI-HW-MNGL---P-AVCIMFECGLRWC-YHVMK-RNEPEI-QMPEFKLYPS--LRL---CSEAYWHRNVGPEHDHS-NTMDQP-VGIEHTIRISNMQHWFCI-EW-FSK-LMGMNSVWSQQMLKLVFPMPQNGDW-VFELLEAKHFCMYMSLHVDLKRHGNQTQSKTRKHK-RYAALSAL---TFGYVLQIYCIQGFNMYQIMHIHMFCAYKLFVEGTN---V--Y-M-----Y-RCTN-MQ-NNNLHHGYNGRHRAKTDTQ-Q-Q--GKCCSGCLGYIQHFKRMMVIDPYSKMNRWEEEYHSHTGGSRQSMMVSPKWNTVQTTFEVEHKHDDLGIQMDGG--EPVRVYKLHWTQAQATYITEGITYPEQCVWLMISVFIGHETFE----W--AQ--NREKYG-Y-YQTVLSKYQA-IKQHKNIGNKTMVRASWSVACYDHNVFFGYAVHGEEW---MCRVNGLSSRYGYWHMNGRK--GPTWLPPFGYLWLDAAKPHHNFKGTKHWEYIDYHLQDPNVYEEHEWDNCVRWSNIVRIIRMMFWESCVDLDYWCYSGKQDFSRYHWHPDR-DIICLLNM-KPIQFPHMCMSL-K-NGNMRTDDIYWDD-GVMPAYN-NCRR-IVNVMPKWYEQLMIVQVMCMNSGPSTRCKF-SQSAHCRFIFWGKFWP-FLRVSCFF-QIATVVMVKYL-PTVLCQQQKHS-ANLMAH-HMQVQ-PNPEYMPYWIT--SI--QLVASC-H-ESVHYPWRNSITDSFAVAEKYPVDNVVKWS-DCIFICNMVQTVTKKSWNQQ-RAW-EWHEHEVNRCFDYLAAGAMRVDHTNSWNDWIK-AFQV')