To run the kmeans() function in R with multiple initial cluster assignments, we use the nstart argument. If a value of nstart greater than one is used, then K-means clustering will be performed using multiple random assignments in Step 1 of the K-means algorithm, shown here:

  1. Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations.
  2. Iterate until the cluster assignments stop changing:
    2.1. For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster.
    2.2. Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).
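
The steps above can be sketched directly in R. This is a minimal illustration, not the method used internally by kmeans(): the helper simple_kmeans(), the simulated matrix x_sim, and the rule for handling an empty cluster (keep its previous centroid) are all assumptions introduced here.

```r
# Hypothetical helper that follows Steps 1, 2.1 and 2.2 literally.
simple_kmeans <- function(x, K, max_iter = 100) {
  n <- nrow(x)
  # Step 1: random initial assignments. Shuffling a balanced label vector
  # guarantees that every cluster starts non-empty (assumes n >= K).
  cluster <- sample(rep(seq_len(K), length.out = n))
  centroids <- NULL
  for (iter in seq_len(max_iter)) {
    # Step 2.1: each centroid is the vector of p feature means.
    centroids <- do.call(rbind, lapply(seq_len(K), function(k) {
      members <- x[cluster == k, , drop = FALSE]
      # Assumption: an emptied cluster keeps its previous centroid.
      if (nrow(members) == 0) centroids[k, ] else colMeans(members)
    }))
    # Step 2.2: squared Euclidean distance from each observation
    # to each centroid, stored as an n x K matrix.
    d2 <- sapply(seq_len(K), function(k) {
      rowSums(sweep(x, 2, centroids[k, ])^2)
    })
    new_cluster <- max.col(-d2, ties.method = "first")
    # Stop once the assignments no longer change.
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centroids)
}

# Simulated data with three shifted groups (an assumption for illustration).
set.seed(1)
x_sim <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
               matrix(rnorm(40, mean = 5), ncol = 2),
               matrix(rnorm(40, mean = 10), ncol = 2))
fit <- simple_kmeans(x_sim, K = 3)
table(fit$cluster)
```

Because Step 1 is random, different seeds can lead this sketch to different final clusterings, which is the reason for trying several starts.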

The kmeans() function reports only the best result, that is, the run with the lowest total within-cluster sum of squares. Here we compare using nstart=1 to nstart=20.

> set.seed(125)
> km.out <- kmeans(x, 3, nstart = 1)
> km.out$tot.withinss
[1] 98.16736
> km.out <- kmeans(x, 3, nstart = 20)
> km.out$tot.withinss
[1] 97.97927

Note that km.out$tot.withinss is the total within-cluster sum of squares, measured as squared distances from each observation to its cluster centroid. Up to a constant factor of 2, this is the quantity that K-means clustering seeks to minimize:

\[\min_{C_1,\ldots,C_K} \sum_{k=1}^{K} \frac{1}{\left| C_k \right|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij}-x_{i'j})^2\]

The individual within-cluster sums of squares are contained in the vector km.out$withinss.
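
The relationship between these quantities can be checked numerically. The sketch below uses a simulated matrix x_sim as a stand-in for x (an assumption, since x is not defined in this section). It confirms that the entries of withinss add up to tot.withinss, and that the pairwise form of the objective above equals twice the centroid-based value that kmeans() reports for each cluster.

```r
set.seed(1)
x_sim <- matrix(rnorm(60 * 2), ncol = 2)  # simulated stand-in for x (assumption)
km_sim <- kmeans(x_sim, 3, nstart = 20)

# The per-cluster values add up to the total.
all.equal(sum(km_sim$withinss), km_sim$tot.withinss)

# Pairwise form for cluster k: (1 / |C_k|) times the sum, over all ordered
# pairs (i, i') in the cluster, of the squared Euclidean distance.
pairwise_ss <- sapply(1:3, function(k) {
  xk <- x_sim[km_sim$cluster == k, , drop = FALSE]
  sum(as.matrix(dist(xk))^2) / nrow(xk)
})

# Each pairwise value is twice the corresponding centroid-based value.
all.equal(pairwise_ss, 2 * km_sim$withinss)
```

Since the two forms differ only by a constant factor, minimizing one is equivalent to minimizing the other.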

We strongly recommend always running K-means clustering with a large value of nstart, such as 20 or 50, since otherwise an undesirable local optimum may be obtained.
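
One way to see why a single start is risky is to look at how much the objective varies across single-start runs. The following sketch uses simulated, loosely separated data (x_sim is an assumption, not the x used above), so the numbers it prints will differ from those reported in this section.

```r
set.seed(1)
# Three overlapping groups, simulated for illustration only.
x_sim <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
               matrix(rnorm(50, mean = 2), ncol = 2),
               matrix(rnorm(50, mean = 4), ncol = 2))

# 50 independent single-start runs, each with its own seed.
single_runs <- sapply(1:50, function(s) {
  set.seed(s)
  kmeans(x_sim, 3, nstart = 1)$tot.withinss
})

# One call that tries 50 starts and keeps the best.
set.seed(1)
multi_run <- kmeans(x_sim, 3, nstart = 50)$tot.withinss

range(single_runs)  # spread of the objective across single starts
multi_run           # objective for the best of 50 starts
```

If the range of single_runs is wide, some single starts ended in a poorer local optimum than others.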

When performing K-means clustering, in addition to using multiple initial cluster assignments, it is also important to set a random seed using the set.seed() function. This way, the initial cluster assignments in Step 1 can be replicated, and the K-means output will be fully reproducible.
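
As a quick check of reproducibility, the sketch below fits the same model twice with the same seed and compares the results. The matrix x_sim and the seed values are arbitrary choices made for this illustration.

```r
set.seed(10)
x_sim <- matrix(rnorm(50 * 2), ncol = 2)  # simulated data (assumption)

set.seed(42)
fit_a <- kmeans(x_sim, 3, nstart = 20)

set.seed(42)  # same seed, so the same sequence of random starts
fit_b <- kmeans(x_sim, 3, nstart = 20)

identical(fit_a$cluster, fit_b$cluster)
identical(fit_a$tot.withinss, fit_b$tot.withinss)
```

Without the second set.seed() call, the two fits would draw different random starts and could, in principle, differ.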

Comparing nstart=1 to nstart=20, we see that, in this case, the larger value of nstart gives a better clustering: the total within-cluster sum of squares is lower with nstart=20 (97.97927) than with nstart=1 (98.16736).

