HierarchicalClustering, whose pseudocode is shown below, progressively generates n different partitions of the underlying data into clusters, all represented by a tree in which each node is labeled by a cluster of genes. The first partition has n single-element clusters represented by the leaves of the tree, with each element forming its own cluster. The second partition merges the two “closest” clusters into a single cluster consisting of two elements. In general, the i-th partition merges the two closest clusters from the (i - 1)-th partition and has n - i + 1 clusters. We hope this algorithm looks familiar — it is UPGMA (from “Implement UPGMA”) in disguise.
HierarchicalClustering(D, n)
    Clusters ← n single-element clusters labeled 1, ..., n
    construct a graph T with n isolated nodes labeled by single elements 1, ..., n
    while there is more than one cluster
        find the two closest clusters Ci and Cj
        merge Ci and Cj into a new cluster Cnew with |Ci| + |Cj| elements
        add a new node labeled by cluster Cnew to T
        connect node Cnew to Ci and Cj by directed edges
        remove the rows and columns of D corresponding to Ci and Cj
        remove Ci and Cj from Clusters
        add a row/column to D for Cnew by computing D(Cnew, C) for each C in Clusters
        add Cnew to Clusters
    assign root in T as a node with no incoming edges
    return T
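To make the bookkeeping concrete, here is a minimal Python sketch of the pseudocode above. It is our own rendering, not code from the text: clusters are frozensets of 1-based element labels, the tree T maps each node to its pair of children, and instead of deleting and adding rows of D, cluster distances are recomputed directly from the original element distances using the average-distance rule Davg defined below (for Davg, this yields the same merges as the row updates).

```python
def hierarchical_clustering_sketch(D, n):
    """Sketch of HierarchicalClustering. D is an n x n matrix (list of lists)
    of pairwise distances between elements 1, ..., n; D[i-1][j-1] holds Di,j."""
    clusters = [frozenset([i]) for i in range(1, n + 1)]  # n single-element clusters
    T = {c: () for c in clusters}                         # node -> (child, child)

    def d_avg(c1, c2):
        # Average pairwise distance between the elements of the two clusters.
        return sum(D[i - 1][j - 1] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the two closest clusters Ci and Cj.
        ci, cj = min(
            ((a, b) for k, a in enumerate(clusters) for b in clusters[k + 1:]),
            key=lambda pair: d_avg(*pair),
        )
        c_new = ci | cj              # merge Ci and Cj into Cnew
        T[c_new] = (ci, cj)          # connect Cnew to Ci and Cj
        clusters.remove(ci)
        clusters.remove(cj)
        clusters.append(c_new)

    return T                         # the last cluster formed is the root
```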
Note that we have not yet defined how HierarchicalClustering computes the distance D(Cnew, C) between a newly formed cluster Cnew and each old cluster C. In practice, clustering algorithms differ in how they compute these distances, and the resulting clusterings can differ greatly. One commonly used approach defines the distance between clusters C1 and C2 as the smallest distance between any pair of elements from these clusters,
$$ D_\text{min}(C_1, C_2) = \min_{\text{all points }i\text{ in cluster }C_1,~\text{all points }j\text{ in cluster }C_2} D_{i,j} $$
The distance function that we encountered with UPGMA uses the average distance between elements in two clusters,
$$ D_\text{avg}(C_1, C_2) = \dfrac{\sum_{\text{all points }i\text{ in cluster }C_1} ~\sum_{\text{all points }j\text{ in cluster }C_2} D_{i,j}}{|C_1| \cdot |C_2|} $$
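Both rules are one-line translations of these formulas; a sketch, assuming the same 1-based labeling and list-of-lists matrix as in the sketch above:

```python
def d_min(D, c1, c2):
    # Dmin: smallest distance between any pair of elements from C1 and C2.
    return min(D[i - 1][j - 1] for i in c1 for j in c2)

def d_avg(D, c1, c2):
    # Davg: average distance over all pairs of elements from C1 and C2.
    return sum(D[i - 1][j - 1] for i in c1 for j in c2) / (len(c1) * len(c2))
```

Dmin corresponds to what is usually called single-linkage clustering, and Davg to average-linkage clustering (UPGMA).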
Add a method hierarchical_clustering that takes a file containing a distance matrix in the form of a tuple of $$n$$ $$n$$-dimensional tuples (in Python notation), and an integer $$n$$, the dimension of the ($$n \times n$$) matrix. The function returns the hierarchical clusters of the data points, i.e., all clusters formed during merging, as a tuple of sets (see the session below).
In the following interactive session, we assume the TXT file data01.txt is located in the current directory.
>>> hierarchical_clustering('data01.txt', 3)
({1, 3}, {1, 2, 3})
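A possible implementation sketch for this exercise; reading the file with ast.literal_eval and using the average-distance rule Davg are our assumptions (the exercise does not pin down the cluster-distance rule, and for n = 3 Dmin produces the same output):

```python
import ast

def hierarchical_clustering(filename, n):
    # Read the distance matrix: a tuple of n n-dimensional tuples (Python notation).
    with open(filename) as f:
        D = ast.literal_eval(f.read())

    clusters = [frozenset([i]) for i in range(1, n + 1)]
    merged = []  # newly formed clusters, in merge order

    def d_avg(c1, c2):
        # Average pairwise distance between the elements of the two clusters.
        return sum(D[i - 1][j - 1] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find and merge the two closest clusters.
        ci, cj = min(
            ((a, b) for k, a in enumerate(clusters) for b in clusters[k + 1:]),
            key=lambda pair: d_avg(*pair),
        )
        c_new = ci | cj
        clusters.remove(ci)
        clusters.remove(cj)
        clusters.append(c_new)
        merged.append(set(c_new))  # record Cnew as a plain set

    return tuple(merged)
```

On a 3 x 3 matrix in which points 1 and 3 are closest, this returns ({1, 3}, {1, 2, 3}), matching the session above.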