A number of different approaches are used to build phylogenies, each with its own computational strengths and weaknesses. One of these approaches is distance-based phylogeny, which constructs a tree from evolutionary distances calculated between pairs of taxa.

A wide assortment of measures exists for quantifying this evolutionary distance. Once we have selected a distance function and used it to calculate the distance between every pair of taxa, we place these pairwise distances into a table.

In this problem, we will consider an evolutionary distance function based on Hamming distance. Recall that this function compares two homologous strands of DNA of equal length by counting the minimum possible number of point mutations that could have occurred on the evolutionary path between the two strands, that is, the number of positions at which the strands differ.
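
As a reminder of how Hamming distance works, here is a minimal sketch in Python. The function name `hamming_distance` is a hypothetical helper chosen for illustration and is not part of the assignment; the sketch assumes both strands are given as plain strings of equal length.

```python
def hamming_distance(s1, s2):
    """Count the positions at which two equal-length strings differ.

    Hypothetical helper for illustration only; not part of the assignment.
    """
    if len(s1) != len(s2):
        raise ValueError("strings must have equal length")
    # zip pairs up corresponding symbols; each mismatch contributes 1
    return sum(a != b for a, b in zip(s1, s2))


print(hamming_distance("TTTCCATTTA", "GATTCATTTC"))  # 4
```

The two strands above differ at positions 1, 2, 4 and 10, so at least four point mutations separate them.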

Assignment

For two strings $$s_1$$ and $$s_2$$ of equal length, the p-distance between them, denoted $$d_p(s_1, s_2)$$, is the proportion of corresponding symbols that differ between $$s_1$$ and $$s_2$$.
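
The definition above amounts to dividing the number of mismatched positions by the string length. The following is one possible sketch, not a reference solution: the name `p_distance` is hypothetical, and returning `0.0` for two empty strings is an assumption made here to avoid a division by zero, since the definition does not cover that case.

```python
def p_distance(s1, s2):
    """Proportion of corresponding symbols that differ between s1 and s2.

    Illustrative sketch; the name and the empty-string behaviour are assumptions.
    """
    assert len(s1) == len(s2), "strings must have equal length"
    if not s1:
        # Assumption: two empty strings are treated as identical.
        return 0.0
    mismatches = sum(a != b for a, b in zip(s1, s2))
    return mismatches / len(s1)


print(p_distance("TTTCCATTTA", "GATTCATTTC"))  # 0.4
```

Because the mismatch count is divided by the length, the result always lies between 0.0 (identical strings) and 1.0 (every position differs).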

For a general distance function $$d$$ on $$n$$ taxa $$s_1, s_2, \ldots, s_n$$ (taxa are often represented by genetic strings), we may encode the distances between pairs of taxa via a pairwise distance matrix $$D$$ in which $$D_{i, j} = d(s_i, s_j)$$. Your task:

Example

In the following interactive session, we assume the FASTA files data01.fna and data02.fna to be located in the current directory.

>>> pDistance('TTTCCATTTA', 'TTTCCATTTA')
0.0
>>> pDistance('TTTCCATTTA', 'GATTCATTTC')
0.4
>>> pDistance('TTTCCATTTA', 'TTTCCATTTT')
0.1
>>> pDistance('TTTCCATTTA', 'GTTCCATTTA')
0.1
>>> pDistance('TTTCCATTTA', 'ACGT')
Traceback (most recent call last):
AssertionError: strings must have equal length

>>> pairwiseComparison('data01.fna', distance=pDistance)
[[0.0, 0.4, 0.1, 0.1], [0.4, 0.0, 0.4, 0.3], [0.1, 0.4, 0.0, 0.2], [0.1, 0.3, 0.2, 0.0]]
>>> pairwiseComparison('data02.fna', distance=pDistance)
Traceback (most recent call last):
AssertionError: all sequences must have equal length
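
The pairwise distance matrix $$D$$ from the assignment can be sketched for sequences that are already in memory. This is an illustration under stated assumptions, not the expected solution: `distance_matrix` and `p_distance` are hypothetical names, FASTA parsing is left out, and the four strings below were chosen to be consistent with the example output for data01.fna, whose actual contents are not shown here.

```python
def p_distance(s1, s2):
    """Proportion of positions at which two equal-length strings differ."""
    assert len(s1) == len(s2), "strings must have equal length"
    # Assumption: two empty strings have distance 0.0
    return sum(a != b for a, b in zip(s1, s2)) / len(s1) if s1 else 0.0


def distance_matrix(sequences, distance):
    """Build D with D[i][j] = distance(sequences[i], sequences[j])."""
    # All sequences must share one length, otherwise the distance is undefined
    assert len({len(s) for s in sequences}) <= 1, "all sequences must have equal length"
    return [[distance(a, b) for b in sequences] for a in sequences]


# Assumed sample data, consistent with the example session above
taxa = ["TTTCCATTTA", "GATTCATTTC", "TTTCCATTTT", "GTTCCATTTA"]
for row in distance_matrix(taxa, p_distance):
    print(row)
```

Passing the distance function as an argument mirrors the `distance=pDistance` parameter in the session above, so another distance function could be substituted without changing the matrix construction. Since the p-distance is symmetric and zero for identical strings, the resulting matrix is symmetric with zeros on the diagonal.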