To run the kmeans() function in R with multiple initial cluster assignments,
we use the nstart argument. If a value of nstart greater than one
is used, then K-means clustering will be performed using multiple random
assignments in Step 1 of the K-means algorithm, and the kmeans() function
will report only the best results. Here we compare using nstart=1 to nstart=20.
> set.seed(125)
> km.out <- kmeans(x, 3, nstart = 1)
> km.out$tot.withinss
[1] 98.16736
> km.out <- kmeans(x, 3, nstart = 20)
> km.out$tot.withinss
[1] 97.97927
Note that km.out$tot.withinss is the total within-cluster sum of squares,
which we seek to minimize by performing K-means clustering.
The individual within-cluster sums of squares are contained in the vector km.out$withinss.
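To make the relationship between these two quantities concrete, the following minimal sketch recomputes them by hand. It assumes that R reports each within-cluster sum of squares as the sum of squared Euclidean distances from the observations in a cluster to that cluster's centroid. The matrix x below is simulated purely for illustration and is not the data used in the output above; the name wss.by.hand is likewise our own.

```r
# Hypothetical data: simulated for illustration only, not the x used above.
set.seed(1)
x <- matrix(rnorm(60 * 2), ncol = 2)
x[1:20, 1] <- x[1:20, 1] + 3    # shift the first 20 points to the right
x[21:40, 2] <- x[21:40, 2] - 4  # shift the next 20 points downward

km.out <- kmeans(x, 3, nstart = 20)

# For each cluster k, subtract the cluster centroid from every observation
# assigned to k and sum the squared differences.
wss.by.hand <- sapply(1:3, function(k) {
  pts <- x[km.out$cluster == k, , drop = FALSE]
  sum(sweep(pts, 2, km.out$centers[k, ])^2)
})

# Both comparisons should return TRUE.
isTRUE(all.equal(wss.by.hand, km.out$withinss, check.attributes = FALSE))
isTRUE(all.equal(sum(km.out$withinss), km.out$tot.withinss))
```

If both comparisons return TRUE, then km.out$tot.withinss is simply the sum of the entries of km.out$withinss.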
We strongly recommend always running K-means clustering with a large
value of nstart, such as 20 or 50, since otherwise an undesirable local
optimum may be obtained.
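One way to see why a single start can be risky is to repeat a single-start fit under many different seeds and look at how much the objective varies. The sketch below does this on simulated data (again hypothetical, not the x used above). The names single.start, one and many are ours, and the number of repetitions is arbitrary.

```r
# Hypothetical data with no strong cluster structure, where different
# random starts are more likely to end in different local optima.
set.seed(2)
x <- matrix(rnorm(150 * 2), ncol = 2)

# Fit K-means 30 times, each time from a single random start.
single.start <- sapply(1:30, function(s) {
  set.seed(s)
  kmeans(x, 3, nstart = 1)$tot.withinss
})
range(single.start)  # spread of the objective across single starts

# Compare one start against 50 starts, resetting the seed before each call.
set.seed(1)
one <- kmeans(x, 3, nstart = 1)$tot.withinss
set.seed(1)
many <- kmeans(x, 3, nstart = 50)$tot.withinss
c(one = one, many = many)
```

A wide range in single.start suggests that individual runs are ending in different local optima, which is the situation a large nstart is meant to guard against.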
When performing K-means clustering, in addition to using multiple initial
cluster assignments, it is also important to set a random seed using the
set.seed() function. This way, the initial cluster assignments in Step 1 can
be replicated, and the K-means output will be fully reproducible.
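The effect of the seed can be checked directly. In this small sketch, on simulated data that is hypothetical and not the x used above, we fit the same model twice after setting the same seed and confirm that the two fits agree. The names fit.a and fit.b and the seed values are arbitrary choices.

```r
# Hypothetical data, simulated for illustration only.
set.seed(10)
x <- matrix(rnorm(100 * 2), ncol = 2)

# Same seed before each call, so the random starts are the same.
set.seed(42)
fit.a <- kmeans(x, 3, nstart = 20)
set.seed(42)
fit.b <- kmeans(x, 3, nstart = 20)

# Both should be TRUE: identical cluster labels and identical objective.
identical(fit.a$cluster, fit.b$cluster)
identical(fit.a$tot.withinss, fit.b$tot.withinss)
```

Without the calls to set.seed(), the two fits may still agree on easy data, but there is no assurance that they will.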
MC1:
When comparing nstart=1 to nstart=20,
we can conclude that, in this case,
the larger value of nstart gives a better clustering: the total within-cluster sum of squares falls from 98.16736 with nstart=1 to 97.97927 with nstart=20.