A number of different approaches are used to build phylogenies, each with its own computational strengths and weaknesses. One of these approaches is distance-based phylogeny, which constructs a tree from evolutionary distances calculated between pairs of taxa.
A wide assortment of measures exists for quantifying this evolutionary distance. Once we have selected a distance function and used it to calculate the distance between every pair of taxa, we place these pairwise distances into a table.
In this problem, we will consider an evolutionary distance function based on Hamming distance. Recall that the Hamming distance compares two homologous strands of DNA by counting the minimum possible number of point mutations that could have occurred on the evolutionary path between the two strands.
For two strings $$s_1$$ and $$s_2$$ of equal length, the p-distance between them, denoted $$d_p(s_1, s_2)$$, is the proportion of corresponding symbols that differ between $$s_1$$ and $$s_2$$.
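The definition above amounts to dividing the Hamming distance by the common string length. The following is a minimal sketch of one way to compute it, not a reference solution. Returning 0.0 for two empty strings is an assumption made here to avoid a division by zero; the problem statement does not cover that case.

```python
def pDistance(s1, s2):
    """Return the proportion of positions at which s1 and s2 differ."""
    assert len(s1) == len(s2), 'strings must have equal length'
    if not s1:
        # Assumption: two empty strings are treated as identical.
        return 0.0
    # Number of mismatching positions, i.e. the Hamming distance.
    mismatches = sum(a != b for a, b in zip(s1, s2))
    return mismatches / len(s1)
```

For example, 'TTTCCATTTA' and 'GATTCATTTC' differ at 4 of their 10 positions, so their p-distance is 4/10.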
For a general distance function $$d$$ on $$n$$ taxa $$s_1, s_2, \ldots, s_n$$ (taxa are often represented by genetic strings), we may encode the distances between pairs of taxa via a pairwise distance matrix $$D$$ in which $$D_{i, j} = d(s_i, s_j)$$.
Your task:
Write a function pDistance that takes two DNA strings $$s_1$$ and $$s_2$$ of equal length. The function must return the p-distance $$d_p(s_1, s_2)$$. If the strings $$s_1$$ and $$s_2$$ do not have equal length, the function must raise an AssertionError with the message "strings must have equal length".
Write a function pairwiseComparison that takes two arguments: i) the location of a FASTA file containing DNA strings of equal length and ii) a distance function $$d$$ defined on pairs of DNA strings. The function must return the pairwise distance matrix $$D$$ computed using the distance $$d$$ on the DNA strings contained in the given FASTA file. If not all strings in the given FASTA file have equal length, the function must raise an AssertionError with the message "all sequences must have equal length".
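The second task can be sketched as follows: read the sequences, check their lengths, then apply the distance function to every ordered pair. This is only one possible approach. The helper `read_fasta` is a hypothetical name introduced for illustration, and it assumes a simple FASTA layout in which header lines start with `>` and a sequence may span several lines.

```python
def read_fasta(path):
    """Return the sequences of a FASTA file in file order (hypothetical helper)."""
    sequences = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith('>'):
                sequences.append('')  # a header line starts a new record
            elif sequences:
                sequences[-1] += line  # a sequence may span several lines
    return sequences


def pairwiseComparison(path, distance):
    """Return the matrix D with D[i][j] = distance(s_i, s_j)."""
    sequences = read_fasta(path)
    # All sequences share one length exactly when the set of lengths has at most one element.
    assert len({len(s) for s in sequences}) <= 1, 'all sequences must have equal length'
    return [[distance(a, b) for b in sequences] for a in sequences]
```

This sketch evaluates the distance for every ordered pair, so it makes no assumption that the distance function is symmetric or that it returns zero for identical strings.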
In the following interactive session, we assume the FASTA files data01.fna and data02.fna to be located in the current directory.
>>> pDistance('TTTCCATTTA', 'TTTCCATTTA')
0.0
>>> pDistance('TTTCCATTTA', 'GATTCATTTC')
0.4
>>> pDistance('TTTCCATTTA', 'TTTCCATTTT')
0.1
>>> pDistance('TTTCCATTTA', 'GTTCCATTTA')
0.1
>>> pDistance('TTTCCATTTA', 'ACGT')
Traceback (most recent call last):
AssertionError: strings must have equal length
>>> pairwiseComparison('data01.fna', distance=pDistance)
[[0.0, 0.4, 0.1, 0.1], [0.4, 0.0, 0.4, 0.3], [0.1, 0.4, 0.0, 0.2], [0.1, 0.3, 0.2, 0.0]]
>>> pairwiseComparison('data02.fna', distance=pDistance)
Traceback (most recent call last):
AssertionError: all sequences must have equal length