The soft k-means clustering algorithm starts from randomly chosen centers and iterates the following two steps:
We begin with the "Centers to Soft Clusters" step. If we think about the centers as stars and the data points as planets, then the closer a point is to a center, the stronger that center's "pull" should be on the point. Given $$k$$ centers $$\textit{Centers} = (x_1, \dots, x_k)$$ and $$n$$ points $$\textit{Data} = (\textit{Data}_1, \dots, \textit{Data}_n)$$, we therefore need to construct a $$k \times n$$ responsibility matrix $$\textit{HiddenMatrix}$$ for which $$\textit{HiddenMatrix}_{i,j}$$ is the pull of center $$i$$ on data point $$j$$. This pull can be computed according to the Newtonian inverse-square law of gravitation,
$$ \textit{HiddenMatrix}_{i, j} = \dfrac{1/d(\textit{Data}_j, x_i)^2}{\sum_{\text{all centers } x} 1/d(\textit{Data}_j, x)^2} $$
Unfortunately for Newton fans, the following partition function from statistical physics often works better in practice:
$$ \textit{HiddenMatrix}_{i,j} = \dfrac{e^{-\beta \cdot d(\textit{Data}_j,\, x_i)}}{\sum_{\text{all centers } x} e^{-\beta \cdot d(\textit{Data}_j,\, x)}} $$
In this formula, $$e$$ is the base of the natural logarithm ($$e\approx2.72$$), and $$\beta$$ is a parameter reflecting the amount of flexibility in our soft assignment; it is called, appropriately enough, the stiffness parameter.
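To make the "Centers to Soft Clusters" step concrete, here is a minimal Python sketch that builds HiddenMatrix with the partition function above. It is not part of the assignment; the helper names euclidean_distance and centers_to_soft_clusters are our own, and Euclidean distance is assumed for d.

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two n-dimensional points (assumed metric)."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def centers_to_soft_clusters(data, centers, beta):
    """Build the k x n responsibility matrix using the partition function.

    hidden_matrix[i][j] is the responsibility (pull) of center i on point j.
    """
    hidden_matrix = [
        [math.exp(-beta * euclidean_distance(point, center)) for point in data]
        for center in centers
    ]
    # Normalize each column so the responsibilities for every point sum to 1.
    for j in range(len(data)):
        column_total = sum(row[j] for row in hidden_matrix)
        for row in hidden_matrix:
            row[j] /= column_total
    return hidden_matrix
```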
In soft k-means clustering, if we let $$\textit{HiddenMatrix}_i$$ denote the $$i$$-th row of $$\textit{HiddenMatrix}$$, then we can update center $$x_i$$ using an analogue of the above formulas. Specifically, we will define the $$j$$-th coordinate of center $$x_i$$, denoted $$x_{i,j}$$, as
$$ x_{i, j} = \dfrac{\textit{HiddenMatrix}_i \cdot \textit{Data}^j}{\textit{HiddenMatrix}_i\cdot\overrightarrow{1}} $$
Here, $$\textit{Data}^j$$ is the $$n$$-dimensional vector holding the $$j$$-th coordinates of the $$n$$ points in $$\textit{Data}$$.
The updated center $$x_i$$ is called a weighted center of gravity of the points in $$\textit{Data}$$.
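A matching sketch for the "Soft Clusters to Centers" update step, computing each new center as a weighted center of gravity; the function name soft_clusters_to_centers is again our own choice.

```python
def soft_clusters_to_centers(data, hidden_matrix):
    """Recompute every center as a weighted center of gravity.

    Implements x_{i,j} = (HiddenMatrix_i . Data^j) / (HiddenMatrix_i . 1),
    where Data^j is the vector of j-th coordinates of all points.
    """
    m = len(data[0])                      # dimensionality of the points
    centers = []
    for row in hidden_matrix:             # row = HiddenMatrix_i
        total_weight = sum(row)           # HiddenMatrix_i . 1
        center = tuple(
            sum(w * point[j] for w, point in zip(row, data)) / total_weight
            for j in range(m)
        )
        centers.append(center)
    return centers
```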
Add a method soft_kmeans that takes the path to a file containing a list of $$n$$-dimensional tuples, an integer $$k$$ giving the required number of centers, and a stiffness parameter $$\beta$$. The function returns the $$k$$ centers of the data points as a set of tuples of floats.
In the following interactive session, we assume that the TXT file data01.txt is located in the current directory.
```
>>> soft_kmeans('data01.txt', 1, 4.478143076533713)
{(38.893358437857046,)}
```
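One possible way to assemble the two steps into soft_kmeans is sketched below, reusing the helpers from the sketches above. The file format (one whitespace-separated point per line), the initialization from the first $$k$$ points, and the fixed iteration count steps=100 are assumptions rather than requirements stated in the text, so the returned coordinates may differ depending on those choices.

```python
def soft_kmeans(filename, k, beta, steps=100):
    """Sketch of soft k-means clustering.

    Relies on centers_to_soft_clusters and soft_clusters_to_centers defined above.
    Assumptions: whitespace-separated coordinates, one point per line; the first
    k points serve as initial centers; a fixed number of iterations is run
    instead of testing for convergence.
    """
    with open(filename) as handle:
        data = [tuple(map(float, line.split())) for line in handle if line.strip()]
    centers = data[:k]
    for _ in range(steps):
        hidden_matrix = centers_to_soft_clusters(data, centers, beta)
        centers = soft_clusters_to_centers(data, hidden_matrix)
    return set(centers)
```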