In "Counting point mutations1" we calculated the minimum number of symbol mismatches between two strings of equal length, to model the problem of finding the minimum number of point mutations occurring on the evolutionary path between two homologous strands of DNA. If we instead have several homologous strands that we wish to analyze simultaneously, then the natural problem is to find an average-case strand to represent the most likely common ancestor of the given strands.

Assignment

A matrix is a rectangular table of values divided into rows and columns. An $$m \times n$$ matrix has $$m$$ rows and $$n$$ columns. Given a matrix $$A$$, we write $$A_{i,j}$$ ($$1 \leq i \leq m; 1 \leq j \leq n$$) to indicate the value at the intersection of row $$i$$ and column $$j$$.

Say that we have a collection of DNA strings, all having the same length $$n$$. Their profile matrix is a $$4 \times n$$ matrix $$P$$ in which $$P_{1, j}$$ represents the number of times the base A occurs in the $$j$$-th position of the given sequences, $$P_{2, j}$$ represents the number of times the base C occurs in the $$j$$-th position of the given sequences, and so on (see table below).

The consensus sequence $$c$$ is a string of length $$n$$ formed from our collections of DNA strings by taking the most common base at each position. The $$j$$-th character of $$c$$ therefore corresponds to the base having the maximal value in the $$j$$-th column of the profile matrix. If there is more than one maximal value in the $$j$$-th column of the profile matrix, the letter N is used as the $$j$$-th character of $$c$$.


G C A A A A C G

G C G A A A C T

T A C C T T C A
DNA strings
T A T G T T C A

G C C T T A G G

G A C T T A T A

T C G G A T C C


A   0 3 1 2 3 4 0 3
profile C   0 4 3 1 0 0 5 1

G   4 0 2 2 0 0 1 2

T   3 0 1 2 4 3 1 1

consensus G C C N T A C A

Your task:

Example

In the following interactive session, we assume the text files data01.fna2 and data02.fna3 to be located in the current directory.

>>> profile('data01.fna')
{'A': [0, 3, 1, 2, 3, 4, 0, 3], 'C': [0, 4, 3, 1, 0, 0, 5, 1], 'T': [3, 0, 1, 2, 4, 3, 1, 1], 'G': [4, 0, 2, 2, 0, 0, 1, 2]}
>>> profile('data02.fna')
Traceback (most recent call last):
AssertionError: all strings should have equal length

>>> consensus(profile('data01.fna'))
'GCCNTACA'