A mutation is simply a mistake that occurs during the creation or copying of a nucleic acid, in particular DNA. Because nucleic acids are vital to cellular functions, mutations tend to cause a ripple effect throughout the cell. Although mutations are technically mistakes, a very rare mutation may equip the cell with a beneficial attribute. In fact, the macro effects of evolution are attributable to the accumulated result of beneficial microscopic mutations over many generations.

code 39
The simplest and most common type of nucleic acid mutation is a point mutation, which replaces one base with another at a single nucleotide. In the case of DNA a point mutation must change the complementary base accordingly, as in the figure where a C-G pair is changed into an A-T pair.

DNA strands taken from different organism or species genomes are homologous if they share a recent ancestor. In comparing several homologous DNA strands, it might be helpful to compute their consensus sequence. After all, according to the biological principle of parsimony1 — which demands that evolutionary histories should be as simply explained as possible — this sequence represents the most likely ancestor of the given DNA strands.

Assignment

A matrix is a rectangular table of values divided into rows and columns. An $$m \times n$$ matrix has $$m$$ rows and $$n$$ columns. Given a matrix $$A$$, we write $$A_{i,j}$$ ($$0 \leq i < m; 0 \leq j < n$$) to indicate the value at the intersection of row $$i$$ and column $$j$$.

Say that we have a series of DNA sequences, all having the same length $$n$$. Their profile matrix is a $$4 \times n$$ matrix $$P$$ in which $$P_{0, j}$$ represents the number of times the base A occurs in the $$j$$-th position of the given sequences, $$P_{1, j}$$ represents the number of times the base C occurs in the $$j$$-th position of the given sequences, and so on (see table below).

The consensus sequence $$c$$ is a string of length $$n$$ formed from the series of DNA sequences by taking the most common base at each position. The $$j$$-th character of $$c$$ therefore corresponds to the base having the maximal value in the $$j$$-th column of the profile matrix of the DNA sequences. If there is more than one maximal value in the $$j$$-th column of the profile matrix, the letter N is used as the $$j$$-th character of $$c$$.


G C A A A A C G

G C G A A A C T

T A C C T T C A
sequences T A T G T T C A

G C C T T A G G

G A C T T A T A

T C G G A T C C


A   0 3 1 2 3 4 0 3
profile C   0 4 3 1 0 0 5 1

G   4 0 2 2 0 0 1 2

T   3 0 1 2 4 3 1 1

consensus G C C N T A C A

Your task:

Example

>>> seqs = ['GCAAAACG', 'GCGAAACT', 'TACCTTCA', 'TATGTTCA', 'GCCTTAGG', 'GACTTATA', 'TCGGATCC']
>>> profile(seqs)
{'A': [0, 3, 1, 2, 3, 4, 0, 3], 'C': [0, 4, 3, 1, 0, 0, 5, 1], 'T': [3, 0, 1, 2, 4, 3, 1, 1], 'G': [4, 0, 2, 2, 0, 0, 1, 2]}
>>> consensus(profile(seqs))
'GCCNTACA'

>>> seqs = ['GGTATCTTTA', 'TTGTCGTCTTAGA', 'GGATCCAGAC', 'ATTCAATCGA', 'TGATCTGGAA', 'AGAGTCATGC']
>>> profile(seqs)
Traceback (most recent call last):
AssertionError: sequences should have equal length