To think about clustering as dividing data points Data into k clusters, we will try to select a set Centers of k points that will serve as the centers of these clusters. We would like to choose Centers so that they minimize some distance function between Centers and Data over all possible choices of centers. But how should this distance function be defined?
First, we define the Euclidean distance between points v = (v1, ... , vm) and w = (w1, ... , wm) in m-dimensional space, denoted d(v, w), as the length of the line segment connecting these points,
$$d(v, w) = \sqrt{\sum_{i=1}^m (v_i - w_i)^2}$$
Next, given a point DataPoint in multi-dimensional space and a set of k points Centers, we define the distance from DataPoint to Centers, denoted d(DataPoint, Centers), as the Euclidean distance from DataPoint to its closest center,
d(DataPoint,Centers) = minall points x from Centersd(DataPoint, x).
The length of the segments in the figure below correspond to d(DataPoint, Centers) for each point DataPoint.
We now define the distance between all data points Data and centers Centers. This distance, denoted MaxDistance(Data, Centers), is the maximum of d(DataPoint, Centers) among all data points DataPoint,
MaxDistance(Data, Centers) = maxall points DataPoint from Data d(DataPoints,Centers).
We can now formulate a well-defined clustering problem.
k-Center Clustering Problem: Given a set of data points, find k centers minimizing the maximum distance between these data points and centers.
Input: A set of points Data and an integer k.
Output: A set Centers of k centers that minimize the distance MaxDistance(DataPoints, Centers)
over all possible choices of k centers.
Although the k-Center Clustering Problem is easy to state, it is NP-Hard. The Farthest First Traversal heuristic, whose pseudocode is shown below, selects centers from the points in Data (instead of from all possible points in m-dimensional space). It begins by selecting an arbitrary point in Data as the first center and iteratively adds a new center as the point in Data that is farthest from the centers chosen so far, with ties broken arbitrarily.
FarthestFirstTraversal(Data, k)
Centers ← the set consisting of a single randomly chosen point from Data
while |Centers| < k
DataPoint ← the point in Data maximizing d(DataPoint, Centers)
add DataPoint to Centers
return Centers
Add a method farthest_first_traversal
that takes an amount of centers $$k$$ (integer) and a file which contains a list of n-dimensional tuples (in
python notation).
The function returns a set of $$k$$ centers, which are n-dimensional tuples from the given matrix.
In the following interactive session, we assume the TXT file data01.txt1 to be located in the current directory.
>>> farthest_first_traversal('data01.txt'2, 1) {(78, 25)}