HierarchicalClustering, whose pseudocode is shown below, progressively generates n different partitions of the underlying data into clusters, all represented by a tree in which each node is labeled by a cluster of genes. The first partition has n single-element clusters represented by the leaves of the tree, with each element forming its own cluster. The second partition merges the two “closest” clusters into a single cluster consisting of two elements. In general, the i-th partition merges the two closest clusters from the (i - 1)-th partition and has n - i + 1 clusters. We hope this algorithm looks familiar — it is UPGMA (from “Implement UPGMA”) in disguise.
HierarchicalClustering(D, n)
    Clusters ← n single-element clusters labeled 1, ..., n
    construct a graph T with n isolated nodes labeled by single elements 1, ..., n
    while there is more than one cluster
        find the two closest clusters Ci and Cj
        merge Ci and Cj into a new cluster Cnew with |Ci| + |Cj| elements
        add a new node labeled by cluster Cnew to T
        connect node Cnew to Ci and Cj by directed edges
        remove the rows and columns of D corresponding to Ci and Cj
        remove Ci and Cj from Clusters
        add a row/column to D for Cnew by computing D(Cnew, C) for each C in Clusters
        add Cnew to Clusters
    assign root in T as a node with no incoming edges
    return T
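To make the bookkeeping concrete, here is a minimal Python sketch of the pseudocode above. It is our own rendering, not code from the text: clusters are frozensets of 1-based element labels, the tree T maps each node to its pair of children, and instead of deleting and adding rows of D, cluster distances are recomputed directly from the original element distances using the average-distance rule Davg defined below (for Davg, this yields the same merges as the row updates).

```python
def hierarchical_clustering_sketch(D, n):
    """Sketch of HierarchicalClustering. D is an n x n matrix (list of lists)
    of pairwise distances between elements 1, ..., n; D[i-1][j-1] holds Di,j."""
    clusters = [frozenset([i]) for i in range(1, n + 1)]  # n single-element clusters
    T = {c: () for c in clusters}                         # node -> (child, child)

    def d_avg(c1, c2):
        # Average pairwise distance between the elements of the two clusters.
        return sum(D[i - 1][j - 1] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the two closest clusters Ci and Cj.
        ci, cj = min(
            ((a, b) for k, a in enumerate(clusters) for b in clusters[k + 1:]),
            key=lambda pair: d_avg(*pair),
        )
        c_new = ci | cj              # merge Ci and Cj into Cnew
        T[c_new] = (ci, cj)          # connect Cnew to Ci and Cj
        clusters.remove(ci)
        clusters.remove(cj)
        clusters.append(c_new)

    return T                         # the last cluster formed is the root
```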
Note that we have not yet defined how HierarchicalClustering computes the distance D(Cnew, C) between a newly formed cluster Cnew and each old cluster C. In practice, clustering algorithms differ in how they compute these distances, and the resulting clusterings can differ greatly. One commonly used approach defines the distance between clusters C1 and C2 as the smallest distance between any pair of elements from these clusters,
$$ D_\text{min}(C_1, C_2) = \min_{\text{all points }i\text{ in cluster }C_1,~\text{all points }j\text{ in cluster }C_2} D_{i,j} $$
The distance function that we encountered with UPGMA uses the average distance between elements in two clusters,
$$ D_\text{avg}(C_1, C_2) = \dfrac{\sum_{\text{all points }i\text{ in cluster }C_1} ~\sum_{\text{all points }j\text{ in cluster }C_2} D_{i,j}}{|C_1| \cdot |C_2|} $$
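Both rules are one-line translations of these formulas; a sketch, assuming the same 1-based labeling and list-of-lists matrix as in the sketch above:

```python
def d_min(D, c1, c2):
    # Dmin: smallest distance between any pair of elements from C1 and C2.
    return min(D[i - 1][j - 1] for i in c1 for j in c2)

def d_avg(D, c1, c2):
    # Davg: average distance over all pairs of elements from C1 and C2.
    return sum(D[i - 1][j - 1] for i in c1 for j in c2) / (len(c1) * len(c2))
```

Dmin corresponds to what is usually called single-linkage clustering, and Davg to average-linkage clustering (UPGMA).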
Add a method hierarchical_clustering that takes a file containing a distance matrix in the form of a tuple of $$n$$ $$n$$-dimensional tuples (in Python notation), and an integer $$n$$, the dimension of the ($$n \times n$$) matrix. The function returns the hierarchical clusters of the data points, i.e., all clusters formed during merging, as a tuple of sets (see the session below).
In the following interactive session, we assume the TXT file data01.txt is located in the current directory.
>>> hierarchical_clustering('data01.txt', 3)
({1, 3}, {1, 2, 3})
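A possible implementation sketch for this exercise; reading the file with ast.literal_eval and using the average-distance rule Davg are our assumptions (the exercise does not pin down the cluster-distance rule, and for n = 3 Dmin produces the same output):

```python
import ast

def hierarchical_clustering(filename, n):
    # Read the distance matrix: a tuple of n n-dimensional tuples (Python notation).
    with open(filename) as f:
        D = ast.literal_eval(f.read())

    clusters = [frozenset([i]) for i in range(1, n + 1)]
    merged = []  # newly formed clusters, in merge order

    def d_avg(c1, c2):
        # Average pairwise distance between the elements of the two clusters.
        return sum(D[i - 1][j - 1] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find and merge the two closest clusters.
        ci, cj = min(
            ((a, b) for k, a in enumerate(clusters) for b in clusters[k + 1:]),
            key=lambda pair: d_avg(*pair),
        )
        c_new = ci | cj
        clusters.remove(ci)
        clusters.remove(cj)
        clusters.append(c_new)
        merged.append(set(c_new))  # record Cnew as a plain set

    return tuple(merged)
```

On a 3 x 3 matrix in which points 1 and 3 are closest, this returns ({1, 3}, {1, 2, 3}), matching the session above.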