The soft k-means clustering algorithm starts from randomly chosen centers and iterates the following two steps:
We begin with the "Centers to Soft Clusters" step. If we think about the centers as stars and the data points as planets, then the closer a point is to a center, the stronger that center's "pull" should be on the point. Given $$k$$ centers $$\textit{Centers} = (x_1, \dots, x_k)$$ and $$n$$ points $$\textit{Data} = (\textit{Data}_1, \dots, \textit{Data}_n)$$, we therefore need to construct a $$k \times n$$ responsibility matrix $$\textit{HiddenMatrix}$$ for which $$\textit{HiddenMatrix}_{i,j}$$ is the pull of center $$i$$ on data point $$j$$. This pull can be computed according to the Newtonian inverse-square law of gravitation,
$$ \textit{HiddenMatrix}_{i, j} = \dfrac{1/d(\textit{Data}_j, x_i)^2}{\sum_{\text{all centers } x} 1/d(\textit{Data}_j, x)^2} $$
Unfortunately for Newton fans, the following partition function from statistical physics often works better in practice:
$$ \textit{HiddenMatrix}_{i,j} = \dfrac{e^{-\beta \cdot d(\textit{Data}_j,\, x_i)}}{\sum_{\text{all centers } x} e^{-\beta \cdot d(\textit{Data}_j,\, x)}} $$
In this formula, $$e$$ is the base of the natural logarithm ($$e\approx2.72$$), and $$\beta$$ is a parameter reflecting the amount of flexibility in our soft assignment; it is called, appropriately enough, the stiffness parameter.
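To make the "Centers to Soft Clusters" step concrete, here is a minimal Python sketch that builds HiddenMatrix with the partition function above. It is not part of the assignment; the helper names euclidean_distance and centers_to_soft_clusters are our own, and Euclidean distance is assumed for d.

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two n-dimensional points (assumed metric)."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def centers_to_soft_clusters(data, centers, beta):
    """Build the k x n responsibility matrix using the partition function.

    hidden_matrix[i][j] is the responsibility (pull) of center i on point j.
    """
    hidden_matrix = [
        [math.exp(-beta * euclidean_distance(point, center)) for point in data]
        for center in centers
    ]
    # Normalize each column so the responsibilities for every point sum to 1.
    for j in range(len(data)):
        column_total = sum(row[j] for row in hidden_matrix)
        for row in hidden_matrix:
            row[j] /= column_total
    return hidden_matrix
```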
In soft k-means clustering, if we let $$\textit{HiddenMatrix}_i$$ denote the $$i$$-th row of $$\textit{HiddenMatrix}$$, then we can update center $$x_i$$ using an analogue of the above formulas. Specifically, we will define the $$j$$-th coordinate of center $$x_i$$, denoted $$x_{i,j}$$, as
$$ x_{i, j} = \dfrac{\textit{HiddenMatrix}_i \cdot \textit{Data}^j}{\textit{HiddenMatrix}_i\cdot\overrightarrow{1}} $$
Here, $$\textit{Data}^j$$ is the $$n$$-dimensional vector holding the $$j$$-th coordinates of the $$n$$ points in $$\textit{Data}$$.
The updated center $$x_i$$ is called a weighted center of gravity of the points in $$\textit{Data}$$.
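A matching sketch for the "Soft Clusters to Centers" update step, computing each new center as a weighted center of gravity; the function name soft_clusters_to_centers is again our own choice.

```python
def soft_clusters_to_centers(data, hidden_matrix):
    """Recompute every center as a weighted center of gravity.

    Implements x_{i,j} = (HiddenMatrix_i . Data^j) / (HiddenMatrix_i . 1),
    where Data^j is the vector of j-th coordinates of all points.
    """
    m = len(data[0])                      # dimensionality of the points
    centers = []
    for row in hidden_matrix:             # row = HiddenMatrix_i
        total_weight = sum(row)           # HiddenMatrix_i . 1
        center = tuple(
            sum(w * point[j] for w, point in zip(row, data)) / total_weight
            for j in range(m)
        )
        centers.append(center)
    return centers
```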
Add a method soft_kmeans that takes the path to a file containing a list of $$n$$-dimensional tuples, an integer $$k$$ giving the required number of centers, and a stiffness parameter $$\beta$$. The function returns the $$k$$ centers of the data points as a set of tuples of floats.
In the following interactive session, we assume that the TXT file data01.txt is located in the current directory.
```
>>> soft_kmeans('data01.txt', 1, 4.478143076533713)
{(38.893358437857046,)}
```
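One possible way to assemble the two steps into soft_kmeans is sketched below, reusing the helpers from the sketches above. The file format (one whitespace-separated point per line), the initialization from the first $$k$$ points, and the fixed iteration count steps=100 are assumptions rather than requirements stated in the text, so the returned coordinates may differ depending on those choices.

```python
def soft_kmeans(filename, k, beta, steps=100):
    """Sketch of soft k-means clustering.

    Relies on centers_to_soft_clusters and soft_clusters_to_centers defined above.
    Assumptions: whitespace-separated coordinates, one point per line; the first
    k points serve as initial centers; a fixed number of iterations is run
    instead of testing for convergence.
    """
    with open(filename) as handle:
        data = [tuple(map(float, line.split())) for line in handle if line.strip()]
    centers = data[:k]
    for _ in range(steps):
        hidden_matrix = centers_to_soft_clusters(data, centers, beta)
        centers = soft_clusters_to_centers(data, hidden_matrix)
    return set(centers)
```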