A mutation is simply a mistake that occurs during the creation or copying of a nucleic acid, in particular DNA. Because nucleic acids are vital to cellular functions, mutations tend to cause a ripple effect throughout the cell. Although mutations are technically mistakes, a very rare mutation may equip the cell with a beneficial attribute. In fact, the macro effects of evolution are attributable to the accumulated result of beneficial microscopic mutations over many generations.
DNA strands taken from different organism or species genomes are homologous if they share a recent ancestor. In comparing several homologous DNA strands, it might be helpful to compute their consensus sequence. After all, according to the biological principle of parsimony1 — which demands that evolutionary histories should be as simply explained as possible — this sequence represents the most likely ancestor of the given DNA strands.
A matrix is a rectangular table of values divided into rows and columns. An $$m \times n$$ matrix has $$m$$ rows and $$n$$ columns. Given a matrix $$A$$, we write $$A_{i,j}$$ ($$0 \leq i < m; 0 \leq j < n$$) to indicate the value at the intersection of row $$i$$ and column $$j$$.
Say that we have a series of DNA sequences, all having the same length $$n$$. Their profile matrix is a $$4 \times n$$ matrix $$P$$ in which $$P_{0, j}$$ represents the number of times the base A occurs in the $$j$$-th position of the given sequences, $$P_{1, j}$$ represents the number of times the base C occurs in the $$j$$-th position of the given sequences, and so on (see table below).
The consensus sequence $$c$$ is a string of length $$n$$ formed from the series of DNA sequences by taking the most common base at each position. The $$j$$-th character of $$c$$ therefore corresponds to the base having the maximal value in the $$j$$-th column of the profile matrix of the DNA sequences. If there is more than one maximal value in the $$j$$-th column of the profile matrix, the letter N is used as the $$j$$-th character of $$c$$.
G C A A A A C G | |
G C G A A A C T | |
T A C C T T C A | |
sequences | T A T G T T C A |
G C C T T A G G | |
G A C T T A T A | |
T C G G A T C C | |
|
|
A 0 3 1 2 3 4 0 3 | |
profile | C 0 4 3 1 0 0 5 1 |
G 4 0 2 2 0 0 1 2 | |
T 3 0 1 2 4 3 1 1 | |
|
|
consensus | G C C N T A C A |
Your task:
Write a function profile that takes a series (a list or a tuple) of DNA sequences as its argument. In this assignment, DNA sequences are represented as strings that only contain the uppercase letters A, C, G and T. In case not all DNA sequences have equal length, the function must raise an AssertionError with the message sequences should have equal length. In case all DNA sequences have equal length $$n$$, the function must return the profile matrix of the sequences. The profile matrix is represented as a dictionary that maps each of the bases A, C, G and T onto a list of $$n$$ integers, where the integer at position $$j$$ indicates how often that base occurs in the $$j$$-th position of the given DNA sequences.
Write a function consensus that takes a profile matrix as its argument. This profile matrix must be a dictionary that is formatted as the return values of the function profile. The function consensus must return the consensus sequence that corresponds to the given profile matrix.
>>> seqs = ['GCAAAACG', 'GCGAAACT', 'TACCTTCA', 'TATGTTCA', 'GCCTTAGG', 'GACTTATA', 'TCGGATCC']
>>> profile(seqs)
{'A': [0, 3, 1, 2, 3, 4, 0, 3], 'C': [0, 4, 3, 1, 0, 0, 5, 1], 'T': [3, 0, 1, 2, 4, 3, 1, 1], 'G': [4, 0, 2, 2, 0, 0, 1, 2]}
>>> consensus(profile(seqs))
'GCCNTACA'
>>> seqs = ['GGTATCTTTA', 'TTGTCGTCTTAGA', 'GGATCCAGAC', 'ATTCAATCGA', 'TGATCTGGAA', 'AGAGTCATGC']
>>> profile(seqs)
Traceback (most recent call last):
AssertionError: sequences should have equal length