FarthestFirstTraversal (which we introduced in “FarthestFirstTraversal”1) is fast, and its solution approximates the optimal solution of the k-Center Clustering Problem; however, this algorithm is rarely used for gene expression analysis. In k-Center Clustering, we selected Centers so that these points would minimize MaxDistance(Data, Centers), the maximum distance between any point in Data and its nearest center. But biologists are usually interested in analyzing typical rather than maximum deviations, since the latter may correspond to outliers representing experimental errors.

To address limitations of MaxDistance, we will introduce a new scoring function. Given a set Data of n data points and a set Centers of k centers, the squared error distortion of Data and Centers, denoted Distortion(Data, Centers), is defined as the mean squared distance from each data point to its nearest center,

Distortion(Data,Centers) = (1/n) $$\sum\nolimits_{DataPoint_i \in Data}$$all points DataPoint in Datad(DataPoint, Centers)2.

Assignment

Add a method squared_error_distortion that takes a file that contains a set of centers, which are n-dimensional tuples and a list data points which are also n-dimensional tuples. (in python notation)

The function returns the squared error distortion of the centers and data points as a float.

Example

In the following interactive session, we assume the TXT file data02.txt2 to be located in the current directory.

>>> squared_error_distortion('data02.txt'3)
88.21795899901