In "Consensus and profile" we generalized the notion of Hamming distance to find an average case for a collection of nucleic acids or peptides. However, this method only works if the polymers have the same length. As we have already noted in "Edit distance", homologous strands of DNA have varying lengths because mutations insert and delete intervals of genetic material. As a result, we need to generalize the notion of alignment to cover multiple strings.
A multiple alignment of a collection of three or more strings is formed by adding gap symbols to the strings to produce a collection of augmented strings all having the same length.
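The definition above can be turned into a small check. The sketch below is only illustrative: the helper name `is_multiple_alignment` and the choice of `-` as the gap symbol are assumptions (the gap symbol matches the sample output of this assignment, the helper is not part of it).

```python
def is_multiple_alignment(strings, augmented, gap='-'):
    """Check that `augmented` is a multiple alignment of `strings`.

    Hypothetical helper: it only verifies the two properties stated in the
    definition, namely that all augmented strings have the same length and
    that removing the gap symbols gives back the original strings.
    """
    if len(strings) != len(augmented):
        return False
    # all augmented strings must share one common length
    if len({len(row) for row in augmented}) != 1:
        return False
    # stripping the gaps must restore each original string, in order
    return all(row.replace(gap, '') == original
               for original, row in zip(strings, augmented))
```

For instance, `('ATAT-CCG', '-T---CCG', 'ATGTACTG', 'ATGT-CTG')` passes this check for the strings `'ATATCCG'`, `'TCCG'`, `'ATGTACTG'` and `'ATGTCTG'`.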
A multiple alignment score is obtained by taking the sum of an alignment score over all possible pairs of augmented strings. The only difference from scoring an alignment of just two strings is that two gap symbols may be aligned for a given pair, which requires us to specify a score for matched gap symbols.
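As a minimal sketch of this sum-of-pairs scoring, the function below scores an alignment that is already given. The name `sum_of_pairs_score` and its `match` and `mismatch` parameters are assumptions made for illustration; the default values follow the scoring used later in this assignment (0 for matched symbols, including matched gaps, and -1 for all mismatches).

```python
from itertools import combinations

def sum_of_pairs_score(alignment, match=0, mismatch=-1):
    """Score a multiple alignment as the sum over all pairs of rows.

    Hypothetical helper: `alignment` is a sequence of augmented strings of
    equal length. A gap aligned with a gap counts as a match, a gap aligned
    with a nucleotide counts as a mismatch.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError('augmented strings must all have the same length')
    total = 0
    # every unordered pair of augmented strings contributes a pairwise score
    for first, second in combinations(alignment, 2):
        for a, b in zip(first, second):
            total += match if a == b else mismatch
    return total
```

Applied to the alignment `('ATAT-CCG', '-T---CCG', 'ATGTACTG', 'ATGT-CTG')`, the columns contribute 3, 0, 5, 3, 3, 0, 4 and 0 mismatching pairs, so the score is -18.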
Your task:
Write a function multipleAlignmentScore that takes four DNA strings. The function must return the maximal score obtained by a multiple alignment of the strings, where we score matched symbols 0 (including matched gap symbols) and all mismatched symbols -1 (thus incorporating a linear gap penalty of 1).
Write a function multipleAlignment that takes four DNA strings. The function must return a multiple alignment of the strings having maximum score, where we score matched symbols 0 (including matched gap symbols) and all mismatched symbols -1 (thus incorporating a linear gap penalty of 1). The multiple alignment must be returned as a tuple containing four augmented DNA strings.
In the following interactive session, we assume the FASTA file data.fna to be located in the current directory.
>>> from Bio import SeqIO
>>> multipleAlignmentScore('ATATCCG', 'TCCG', 'ATGTACTG', 'ATGTCTG')
-18
>>> multipleAlignmentScore(*SeqIO.parse('data.fna', 'fasta'))
-35
>>> multipleAlignment('ATATCCG', 'TCCG', 'ATGTACTG', 'ATGTCTG')
('ATAT-CCG', '-T---CCG', 'ATGTACTG', 'ATGT-CTG')
>>> multipleAlignment(*SeqIO.parse('data.fna', 'fasta'))
('-CGTCCATG-', 'GAATAGG-GT', 'ACATAGGGG-', 'CCAGCTG-G-')
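One possible way to compute the maximal score is to extend the dynamic programming approach for pairwise alignment to one dimension per string. The sketch below is not a reference solution: the name `max_alignment_score_sketch` is hypothetical, it only computes the score (no traceback), it expects plain Python strings, and its running time grows with the product of the string lengths, so it is only practical for short strings.

```python
from itertools import combinations, product

def max_alignment_score_sketch(*strings, gap='-'):
    """Maximal sum-of-pairs score (match 0, mismatch -1) over all alignments.

    Hypothetical sketch: best[idx] is the best score for aligning the
    prefixes strings[j][:idx[j]]. Each column of an alignment consumes one
    symbol from a non-empty subset of the strings and a gap from the others.
    """
    k = len(strings)
    lengths = tuple(len(s) for s in strings)
    # a move says which strings contribute a symbol (1) or a gap (0)
    moves = [m for m in product((0, 1), repeat=k) if any(m)]
    best = {(0,) * k: 0}
    # lexicographic order guarantees all predecessors are computed first
    for idx in product(*(range(n + 1) for n in lengths)):
        if not any(idx):
            continue
        candidates = []
        for move in moves:
            prev = tuple(i - d for i, d in zip(idx, move))
            if min(prev) < 0:
                continue
            column = [s[i - 1] if d else gap
                      for s, i, d in zip(strings, idx, move)]
            # number of mismatching pairs in this column
            cost = sum(a != b for a, b in combinations(column, 2))
            candidates.append(best[prev] - cost)
        best[idx] = max(candidates)
    return best[lengths]
```

With the four strings of the first example in the session above, this sketch should agree with the score of -18 reported there. Recovering an actual alignment would additionally require remembering which move produced each maximum.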