In this lab, we perform PCA on the USArrests data set, which ships with base R as part of the datasets package. For each of the 50 US states in 1973, the data set records arrests per 100,000 residents for murder, assault, and rape, along with the percentage of the population living in urban areas.
The rows of the data set contain the 50 states, in alphabetical order.
> states <- row.names(USArrests)
> states
The columns of the data set contain the four variables.
> names(USArrests)
[1] "Murder" "Assault" "UrbanPop" "Rape"
We first briefly examine the data. We notice that the variables have vastly different means.
> apply(USArrests, 2, mean)
  Murder  Assault UrbanPop     Rape
   7.788  170.760   65.540   21.232
Note that the apply() function allows us to apply a function (in this case, the mean() function) to each row or column of the data set. The second input denotes whether we wish to compute the mean of the rows, for which we would use 1, or of the columns, for which we would use 2. We see that there are on average roughly three times as many arrests for rape as for murder, and more than eight times as many arrests for assault as for rape.
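To make the role of the second input concrete, here is a minimal sketch on a small made-up matrix. The object m and its row and column labels are hypothetical and are not part of the lab data.

```r
# A small hypothetical matrix, used only for illustration
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

row_means <- apply(m, 1, mean)  # second input 1: one mean per row
col_means <- apply(m, 2, mean)  # second input 2: one mean per column

row_means  # r1 = 3, r2 = 4
col_means  # a = 1.5, b = 3.5, c = 5.5
```

The result always has one entry per row when the second input is 1 and one entry per column when it is 2, which is why the call on USArrests above returns four values.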
We can also examine the variances of the four variables using the apply() function. Store the result in an object called variances.
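One possible way to do this, sketched under the assumption that the same apply() pattern used for the means carries over with var() in place of mean():

```r
# Apply var() to each column (second input 2) of USArrests
variances <- apply(USArrests, 2, var)
variances

# Which variable has the largest variance?
names(variances)[which.max(variances)]
```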
Not surprisingly, you will see that the variables also have vastly different variances. They are also measured in different units: the UrbanPop variable is the percentage of each state's population living in an urban area, which is not comparable to the number of arrests for rape per 100,000 individuals. If we failed to scale the variables before performing PCA, then most of the principal components that we observed would be driven by the Assault variable, since it has by far the largest mean and variance. Thus, it is important to standardize the variables to have mean zero and standard deviation one before performing PCA.
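The effect of standardizing can be checked directly. The following is a rough sketch, not necessarily the approach the lab goes on to use: it standardizes the columns with scale(), then compares the first principal component loadings from prcomp() with and without scaling. The object names scaled, pr_raw, and pr_scaled are chosen here for illustration.

```r
# Standardize each column to mean zero and standard deviation one
scaled <- scale(USArrests)

# Column means are zero up to floating-point error; standard deviations are one
round(colMeans(scaled), 10)
apply(scaled, 2, sd)

# First principal component loadings without and with scaling
pr_raw    <- prcomp(USArrests, scale. = FALSE)
pr_scaled <- prcomp(USArrests, scale. = TRUE)
round(pr_raw$rotation[, 1], 3)     # dominated by Assault
round(pr_scaled$rotation[, 1], 3)  # weight spread across the variables
```

Note that prcomp() centers the variables by default, so the scale. argument only controls whether each variable is also divided by its standard deviation. The signs of the loadings are arbitrary; only their relative magnitudes matter for this comparison.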