In this lab, we perform PCA on the USArrests data set, which ships with base R (in the datasets package). For each of the 50 US states in 1973, the data set records the number of arrests per 100,000 residents for assault, murder, and rape, along with the percentage of the population living in urban areas. The rows of the data set correspond to the 50 states, in alphabetical order.

> states <- row.names(USArrests)
> states

The columns of the data set contain the four variables.

> names(USArrests)
[1] "Murder"   "Assault"  "UrbanPop" "Rape"   

We first briefly examine the data. We notice that the variables have vastly different means.

> apply(USArrests, 2, mean)
  Murder  Assault UrbanPop     Rape 
   7.788  170.760   65.540   21.232 

Note that the apply() function allows us to apply a function (in this case, mean()) to each row or each column of the data set. The second argument specifies the dimension to work over: 1 for rows, 2 for columns. We see that there are on average nearly three times as many rapes as murders (21.2 versus 7.8 arrests per 100,000), and about eight times as many assaults as rapes (170.8 versus 21.2). We can also examine the variances of the four variables using the apply() function. Store the result in an object called variances.
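One possible way to carry out this step is sketched below. It reuses the same apply() pattern as the means, swapping in var(); the object name variances comes from the text, and nothing beyond base R is assumed.

```r
# Apply var() to each column (margin 2) of USArrests.
# The result is a named numeric vector with one entry per variable.
variances <- apply(USArrests, 2, var)
variances

# For contrast, margin 1 would apply the function to each row (state);
# that is rarely meaningful here because the columns have different units.
head(apply(USArrests, 1, mean), 3)
```

Printing variances should make the difference in scale between Assault and the other three variables easy to see.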

Not surprisingly, you will see that the variables also have vastly different variances. They are also measured in different units: UrbanPop is the percentage of each state's population living in an urban area, which is not comparable to the number of rapes per 100,000 residents. If we failed to scale the variables before performing PCA, the first principal component would be driven almost entirely by the Assault variable, since it has by far the largest variance. (The difference in means matters less, because PCA is normally computed on centered data.) Thus, it is important to standardize the variables to have mean zero and standard deviation one before performing PCA.
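The effect of standardizing can be checked directly. The sketch below is one way to do so, assuming base R's scale() and prcomp() functions; the object names scaled, pr_raw, and pr_scaled are illustrative and not part of the lab.

```r
# Standardize each column to mean 0 and standard deviation 1.
scaled <- scale(USArrests)
round(apply(scaled, 2, mean), 10)  # means are zero up to rounding error
apply(scaled, 2, sd)               # standard deviations are all 1

# Fit PCA without and with scaling (prcomp() centers by default).
pr_raw    <- prcomp(USArrests, scale. = FALSE)
pr_scaled <- prcomp(USArrests, scale. = TRUE)

# Compare the first-component loadings of the two fits.
round(pr_raw$rotation[, 1], 3)
round(pr_scaled$rotation[, 1], 3)
```

In the unscaled fit, the Assault loading should dominate the first component, while the scaled fit spreads the weight more evenly across the variables. Note that the sign of a loading vector is arbitrary, so only the magnitudes are worth comparing.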