In "Counting point mutations1" we calculated the minimum number of symbol mismatches between two strings of equal length, to model the problem of finding the minimum number of point mutations occurring on the evolutionary path between two homologous strands of DNA. If we instead have several homologous strands that we wish to analyze simultaneously, then the natural problem is to find an average-case strand to represent the most likely common ancestor of the given strands.
A matrix is a rectangular table of values divided into rows and columns. An $$m \times n$$ matrix has $$m$$ rows and $$n$$ columns. Given a matrix $$A$$, we write $$A_{i,j}$$ ($$1 \leq i \leq m; 1 \leq j \leq n$$) to indicate the value at the intersection of row $$i$$ and column $$j$$.
Say that we have a collection of DNA strings, all having the same length $$n$$. Their profile matrix is a $$4 \times n$$ matrix $$P$$ in which $$P_{1, j}$$ represents the number of times the base A occurs in the $$j$$-th position of the given sequences, $$P_{2, j}$$ represents the number of times the base C occurs in the $$j$$-th position of the given sequences, and so on (see table below).
The consensus sequence $$c$$ is a string of length $$n$$ formed from our collections of DNA strings by taking the most common base at each position. The $$j$$-th character of $$c$$ therefore corresponds to the base having the maximal value in the $$j$$-th column of the profile matrix. If there is more than one maximal value in the $$j$$-th column of the profile matrix, the letter N is used as the $$j$$-th character of $$c$$.
G C A A A A C G | |
G C G A A A C T | |
T A C C T T C A | |
DNA
strings |
T A T G T T C A |
G C C T T A G G | |
G A C T T A T A | |
T C G G A T C C | |
|
|
A 0 3 1 2 3 4 0 3 | |
profile | C 0 4 3 1 0 0 5 1 |
G 4 0 2 2 0 0 1 2 | |
T 3 0 1 2 4 3 1 1 | |
|
|
consensus | G C C N T A C A |
Your task:
Write a function profile that takes the location of a FASTA file containing DNA strings of equal length. The function must return the profile matrix of the DNA strings. The profile matrix is represented as a dictionary that maps each of the bases A, C, G and T onto a list of $$n$$ integers, where the integer at position $$j$$ indicates how often that base occurs in the $$j$$-th position of the given DNA strings. In case the DNA strings in the given FASTA file do not have equal length, an AssertionError must be raised with the message all strings should have equal length.
Write a function consensus that takes a profile matrix. This profile matrix must be a dictionary that is formatted as the return values of the function profile. The function consensus must return the consensus sequence that corresponds to the given profile matrix.
In the following interactive session, we assume the text files data01.fna2 and data02.fna3 to be located in the current directory.
>>> profile('data01.fna') {'A': [0, 3, 1, 2, 3, 4, 0, 3], 'C': [0, 4, 3, 1, 0, 0, 5, 1], 'T': [3, 0, 1, 2, 4, 3, 1, 1], 'G': [4, 0, 2, 2, 0, 0, 1, 2]} >>> profile('data02.fna') Traceback (most recent call last): AssertionError: all strings should have equal length >>> consensus(profile('data01.fna')) 'GCCNTACA'