Here's the pseudocode for the Unweighted Pair Group Method with Arithmetic Mean (UPGMA):
UPGMA(D) Clusters ← n single-element clusters labeled 1, …, n construct a graph T with n isolated nodes labeled by single elements 1, …, n for every node v in T Age(v) ← 0 while there is more than one cluster find the two closest clusters Ci and Cj merge Ci and Cj into a new cluster Cnew with |Ci| + |Cj| elements add a new node labeled by cluster Cnew to T connect node Cnew to Ci and Cj by directed edges remove the rows and columns of D corresponding to Ci and Cj remove Ci and Cj from Clusters add a row/column to D for Cnew by computing D(Cnew, C) for each C in Clusters add Cnew to Clusters root ← the node in T corresponding to the remaining cluster for each edge (v, w) in T length of (v, w) ← Age(v) - Age(w) return tree T
Define a class Tree that can be used to represent rooted trees whose nodes are uniquely labeled and whose edges are weighted with a floating point number. The initialisation method takes the label (an integer or a string) of the root node $$r$$ and zero or more subtrees that are connected to the root node. Each connected subtree is represented as a tuple $$(s, w)$$, with $$s$$ the subtree (an object of the class Tree) and $$w$$ (a floating point number) the weight of the edge connecting the subtree $$s$$ to the root node $$r$$. Make sure the built-in function repr returns string representations of objects of the class Tree that are formatted according to the example given below.
Add a method UPGMA to the class DistanceMatrix
that takes no arguments and returns the rooted tree (an object of the
class Tree) that results from applying the UPGMA algorithm
on the distance matrix.
The n leaves must be labeled 0, 1, …, n - 1 in order of their appearance in the distance matrix. Internal nodes must be uniquely labeled, but the actual labeling can be freely chosen (integers, strings, …).
In the following example we assume the text file distances.txt1 to be located in the current directory.
>>> tree = Tree('A', (Tree('B'), 6), (Tree('C'), 5), (Tree('D'), 11)) >>> tree Tree('A', (Tree('B'), 6), (Tree('C'), 5), (Tree('D'), 11)) >>> D = DistanceMatrix.loadtxt('distances.txt') >>> D.UPGMA() Tree(6, (Tree(5, (Tree(0), 7.0), (Tree(4, (Tree(2), 5.0), (Tree(3), 5.0)), 2.0)), 1.833333333333334), (Tree(1), 8.833333333333334))