A mutation is simply a mistake that occurs during the creation or copying of a nucleic acid, in particular DNA. Because nucleic acids are vital to cellular functions, mutations tend to cause a ripple effect throughout the cell. Although mutations are technically mistakes, a very rare mutation may equip the cell with a beneficial attribute. In fact, the macro effects of evolution are attributable to the accumulated result of beneficial microscopic mutations over many generations.


DNA strands taken from different organism or species genomes are homologous if they share a recent ancestor. In comparing several homologous DNA strands, it might be helpful to compute their consensus sequence. After all, according to the biological principle of parsimony — which demands that evolutionary histories should be as simply explained as possible — this sequence represents the most likely ancestor of the given DNA strands.

Assignment

A matrix is a rectangular table of values divided into rows and columns. An $$m \times n$$ matrix has $$m$$ rows and $$n$$ columns. Given a matrix $$A$$, we write $$A_{i,j}$$ ($$0 \leq i < m; 0 \leq j < n$$) to indicate the value at the intersection of row $$i$$ and column $$j$$.
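As a small illustration of this notation, a matrix can be stored in Python as a list of rows, so that `A[i][j]` plays the role of $$A_{i,j}$$. The nested-list representation and the values below are assumptions chosen for illustration only; the assignment does not prescribe them.

```python
# A 2 x 3 matrix stored as a list of rows (illustrative values only).
A = [
    [1, 2, 3],
    [4, 5, 6],
]

m = len(A)     # number of rows
n = len(A[0])  # number of columns

# A[i][j] corresponds to A_{i,j}, with 0 <= i < m and 0 <= j < n.
value = A[1][2]  # the value at row 1, column 2
```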

Say that we have a series of DNA sequences, all having the same length $$n$$. Their profile matrix is a $$4 \times n$$ matrix $$P$$ in which $$P_{0, j}$$ represents the number of times the base A occurs in the $$j$$-th position of the given sequences, $$P_{1, j}$$ represents the number of times the base C occurs in the $$j$$-th position of the given sequences, and so on (see table below).
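The counting described above could be sketched as follows. This is one possible approach, not a reference solution: it assumes the profile is stored as a dictionary that maps each base to a list of counts (the shape shown in the example session), and that the sequences contain only the bases A, C, G and T.

```python
def profile(sequences):
    """Return the profile of equal-length DNA sequences as a dict of count lists."""
    # Assumption: an empty list of sequences yields empty count lists.
    length = len(sequences[0]) if sequences else 0
    assert all(len(seq) == length for seq in sequences), \
        'sequences should have equal length'

    # One row of n zeros per base; rows correspond to A, C, G and T.
    counts = {base: [0] * length for base in 'ACGT'}
    for seq in sequences:
        for j, base in enumerate(seq.upper()):
            # Assumption: only A, C, G and T occur (anything else raises KeyError).
            counts[base][j] += 1
    return counts
```

Checking the lengths before counting means a malformed input fails immediately instead of producing a partially filled profile.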

The consensus sequence $$c$$ is a string of length $$n$$ formed from the series of DNA sequences by taking the most common base at each position. The $$j$$-th character of $$c$$ therefore corresponds to the base having the maximal value in the $$j$$-th column of the profile matrix of the DNA sequences. If there is more than one maximal value in the $$j$$-th column of the profile matrix, the letter N is used as the $$j$$-th character of $$c$$.
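The rule above, including the tie case, might be implemented along these lines. This is a minimal sketch that assumes the profile is a non-empty dictionary mapping each base to a list of counts of equal length; the local names are illustrative.

```python
def consensus(profile):
    """Return the consensus string for a profile given as a dict of count lists."""
    # Assumption: the profile is non-empty and all count lists have length n.
    length = len(next(iter(profile.values())))
    result = []
    for j in range(length):
        # Collect the j-th column of the profile matrix.
        column = {base: counts[j] for base, counts in profile.items()}
        highest = max(column.values())
        winners = [base for base, count in column.items() if count == highest]
        # A unique maximum gives that base; a tie gives the letter N.
        result.append(winners[0] if len(winners) == 1 else 'N')
    return ''.join(result)
```

Collecting every base that reaches the maximum, rather than calling `max` on the bases directly, is what makes ties detectable.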


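| | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|---|
| sequences | | G | C | A | A | A | A | C | G |
| | | G | C | G | A | A | A | C | T |
| | | T | A | C | C | T | T | C | A |
| | | T | A | T | G | T | T | C | A |
| | | G | C | C | T | T | A | G | G |
| | | G | A | C | T | T | A | T | A |
| | | T | C | G | G | A | T | C | C |
| profile | A | 0 | 3 | 1 | 2 | 3 | 4 | 0 | 3 |
| | C | 0 | 4 | 3 | 1 | 0 | 0 | 5 | 1 |
| | G | 4 | 0 | 2 | 2 | 0 | 0 | 1 | 2 |
| | T | 3 | 0 | 1 | 2 | 4 | 3 | 1 | 1 |
| consensus | | G | C | C | N | T | A | C | A |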

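Your task:

Write a function profile that takes a list of DNA sequences and returns their profile matrix. As the example below shows, the profile is a dictionary that maps each of the bases A, C, G and T to a list of $$n$$ counts, and an AssertionError with the message "sequences should have equal length" is raised if the given sequences do not all have the same length.

Write a function consensus that takes such a profile and returns the corresponding consensus sequence as a string.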

Example

>>> seqs = ['GCAAAACG', 'GCGAAACT', 'TACCTTCA', 'TATGTTCA', 'GCCTTAGG', 'GACTTATA', 'TCGGATCC']
>>> profile(seqs)
{'A': [0, 3, 1, 2, 3, 4, 0, 3], 'C': [0, 4, 3, 1, 0, 0, 5, 1], 'T': [3, 0, 1, 2, 4, 3, 1, 1], 'G': [4, 0, 2, 2, 0, 0, 1, 2]}
>>> consensus(profile(seqs))
'GCCNTACA'

>>> seqs = ['GGTATCTTTA', 'TTGTCGTCTTAGA', 'GGATCCAGAC', 'ATTCAATCGA', 'TGATCTGGAA', 'AGAGTCATGC']
>>> profile(seqs)
Traceback (most recent call last):
AssertionError: sequences should have equal length