To run the kmeans() function in R with multiple initial cluster
assignments, we use the nstart argument. If a value of nstart greater than
one is used, then K-means clustering will be performed using multiple random
assignments in Step 1 of the algorithm, and the kmeans() function will
report only the best results. Here we compare using nstart=1 to nstart=20.
> set.seed(125)
> km.out <- kmeans(x, 3, nstart = 1)
> km.out$tot.withinss
[1] 98.16736
> km.out <- kmeans(x, 3, nstart = 20)
> km.out$tot.withinss
[1] 97.97927
Note that km.out$tot.withinss is the total within-cluster sum of squares,
which we seek to minimize by performing K-means clustering. The individual
within-cluster sums of squares are contained in the vector km.out$withinss.
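The relationship between these two quantities can be checked directly. The
following is a minimal sketch, assuming a small simulated matrix x (a
hypothetical stand-in, not the data used above): it recomputes each
within-cluster sum of squares from the fitted centers and confirms that the
values add up to the reported total.

```r
# Hypothetical data: 50 observations in 2 dimensions, first 25 shifted
set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3
x[1:25, 2] <- x[1:25, 2] - 4

km.out <- kmeans(x, 3, nstart = 20)

# Recompute each cluster's sum of squared distances to its own center
manual <- sapply(1:3, function(k) {
  pts <- x[km.out$cluster == k, , drop = FALSE]
  sum(sweep(pts, 2, km.out$centers[k, ])^2)
})

# Compare with the values stored in the fitted object
all.equal(manual, unname(km.out$withinss))
all.equal(sum(km.out$withinss), km.out$tot.withinss)
```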
We strongly recommend always running K-means clustering with a large value
of nstart, such as 20 or 50, since otherwise an undesirable local optimum
may be obtained.
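One way to see how much a single random start can vary is to repeat the
nstart=1 fit under several seeds and compare the results with one
multi-start fit. The sketch below assumes a simulated matrix x and an
arbitrary choice of seeds; both are illustrative, not taken from the
analysis above.

```r
# Hypothetical data: 50 observations in 2 dimensions, first 25 shifted
set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3
x[1:25, 2] <- x[1:25, 2] - 4

# Twenty separate single-start fits, each under its own seed
single <- sapply(1:20, function(s) {
  set.seed(s)
  kmeans(x, 3, nstart = 1)$tot.withinss
})

# One fit that keeps the best of 50 random starts
set.seed(1)
multi <- kmeans(x, 3, nstart = 50)$tot.withinss

range(single)  # spread of the objective across single starts
multi          # objective for the multi-start fit
```

If the two values returned by range(single) differ, at least one of the
single-start runs stopped at a worse local optimum than another.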
When performing K-means clustering, in addition to using multiple initial
cluster assignments, it is also important to set a random seed using the
set.seed() function. This way, the initial cluster assignments in Step 1 can
be replicated, and the K-means output will be fully reproducible.
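As a small illustration of this point, the sketch below fits the same model
twice after resetting the seed to the same value and checks that the two
fits agree. The matrix x and the seed values are assumptions chosen for the
example.

```r
# Hypothetical data: 50 observations in 2 dimensions, first 25 shifted
set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3
x[1:25, 2] <- x[1:25, 2] - 4

# Reset the seed to the same value before each fit
set.seed(4)
fit.a <- kmeans(x, 3, nstart = 20)
set.seed(4)
fit.b <- kmeans(x, 3, nstart = 20)

# Same seed, same random starts, same result
identical(fit.a$cluster, fit.b$cluster)
identical(fit.a$tot.withinss, fit.b$tot.withinss)
```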
MC1:
When comparing nstart=1 to nstart=20, we can conclude that, in this case,
using a higher value of nstart results in a better clustering, since the
total within-cluster sum of squares with nstart=20 (97.97927) is lower than
with nstart=1 (98.16736).