We now re-create the analysis carried out on the USArrests data in Section 12.3 of the book. We turn the data frame into a matrix, after centering and scaling each column to have mean zero and variance one.
> X <- data.matrix(scale(USArrests))
> pcob <- prcomp(X)
> summary(pcob)
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.5749 0.9949 0.59713 0.41645
Proportion of Variance 0.6201 0.2474 0.08914 0.04336
Cumulative Proportion  0.6201 0.8675 0.95664 1.00000
We see that the first principal component explains 62% of the variance.
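The proportions in the summary can be reproduced by hand from the standard deviations stored in the prcomp() output. The following is a minimal sketch; the names pr_var and pve are illustrative and do not come from the lab text.

```r
# Rebuild the standardized data matrix and the PCA fit
X <- data.matrix(scale(USArrests))
pcob <- prcomp(X)

# Variance of each principal component is the squared standard deviation
pr_var <- pcob$sdev^2

# Proportion of variance explained (PVE) and its running total
pve <- pr_var / sum(pr_var)
round(pve, 4)
round(cumsum(pve), 4)
```

Because every column was scaled to variance one, the component variances sum to the number of columns, which is why dividing by sum(pr_var) gives proportions.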
We saw in Section 12.2.2 of the book that solving the optimization problem
\[\underset{\mathbf{A} \in \mathbb{R}^{n \times M},\, \mathbf{B} \in \mathbb{R}^{p \times M}}{\operatorname{minimize}}\left\{\sum_{j=1}^{p} \sum_{i=1}^{n}\left(x_{i j}-\sum_{m=1}^{M} a_{i m} b_{j m}\right)^{2}\right\}\]
on a centered data matrix \(\mathbf{X}\) is equivalent to computing the first \(M\) principal components of the data. The singular value decomposition (SVD) is a general algorithm for solving this optimization problem.
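One way to see the connection numerically is to build A and B from a truncated SVD and evaluate the objective. The sketch below assumes M = 2 and uses illustrative names (sv, A, B, obj, resid_sv) that are not part of the lab. It relies on the standard result that the minimum of the objective equals the sum of the squared singular values that were left out.

```r
# Standardized data matrix, as in the lab
X <- data.matrix(scale(USArrests))
sv <- svd(X)

M <- 2  # number of components kept (an arbitrary choice for illustration)

# A: first M columns of U scaled by the singular values (n x M)
A <- sv$u[, 1:M, drop = FALSE] %*% diag(sv$d[1:M], nrow = M)
# B: first M columns of V (p x M)
B <- sv$v[, 1:M, drop = FALSE]

# Value of the objective at this (A, B)
obj <- sum((X - A %*% t(B))^2)

# Sum of the squared discarded singular values
resid_sv <- sum(sv$d[(M + 1):length(sv$d)]^2)

all.equal(obj, resid_sv)
```

The check confirms that the residual of the rank-M approximation matches the discarded singular values. It does not by itself prove that no other choice of A and B does better.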
> sX <- svd(X)
> names(sX)
[1] "d" "u" "v"
> round(sX$v, 3)
       [,1]   [,2]   [,3]   [,4]
[1,] -0.536  0.418 -0.341  0.649
[2,] -0.583  0.188 -0.268 -0.743
[3,] -0.278 -0.873 -0.378  0.134
[4,] -0.543 -0.167  0.818  0.089
The svd() function returns three components, u, d, and v. The matrix v is equivalent to the loading matrix from principal components (up to an unimportant sign flip).
> pcob$rotation
            PC1    PC2    PC3    PC4
Murder   -0.536  0.418 -0.341  0.649
Assault  -0.583  0.188 -0.268 -0.743
UrbanPop -0.278 -0.873 -0.378  0.134
Rape     -0.543 -0.167  0.818  0.089
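Rather than comparing the two printouts by eye, the agreement can be checked programmatically in a way that tolerates a sign flip in any column. This is a sketch only; signs and V_aligned are illustrative names, not objects from the lab.

```r
# Recreate the objects used in the lab
X <- data.matrix(scale(USArrests))
pcob <- prcomp(X)
sX <- svd(X)

# For each column, +1 if v and the loading vector point the same way, -1 if flipped
signs <- sign(colSums(sX$v * pcob$rotation))

# Flip the columns of v where needed, then compare with the loadings
V_aligned <- sweep(sX$v, 2, signs, `*`)
all.equal(V_aligned, unname(pcob$rotation))
```

The call to unname() drops the row and column names of the loading matrix so that only the numeric entries are compared.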
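The matrix u is equivalent to the matrix of standardized scores, and the vector d contains the singular values, which are proportional to the standard deviations of the principal components (dividing d by the square root of n - 1 gives the standard deviations reported by prcomp()). We can recover the score vectors using the output of svd(). They are identical to the score vectors output by prcomp(), up to sign.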
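To compute the scores, scale each column of \(U\) by the corresponding element of \(D\) using elementwise multiplication (*, not %*%). The matrices \(D\) and \(U\) can be found as the components d and u of the sX variable. Store the result in score_svd, then compare it with the scores from the prcomp() function: pcob$x.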
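The recovery of the scores from the SVD output can be sketched as follows. The name score_svd comes from the text; sdev_svd is an illustrative name introduced here. The comparison uses absolute values, on the assumption that the two results may differ by a sign flip in some columns.

```r
# Recreate the objects used in the lab
X <- data.matrix(scale(USArrests))
pcob <- prcomp(X)
sX <- svd(X)

# Multiply column m of U by d[m]. Transposing first lets the
# length-p vector d recycle correctly under elementwise `*`.
score_svd <- t(sX$d * t(sX$u))

# Scores agree with prcomp() up to the sign of each column
all.equal(abs(score_svd), abs(unname(pcob$x)))

# Singular values rescaled by sqrt(n - 1) give prcomp()'s standard deviations
sdev_svd <- sX$d / sqrt(nrow(X) - 1)
all.equal(sdev_svd, pcob$sdev)
```

Writing sX$u * sX$d without the transposes would recycle d down the rows of U instead of across its columns, which is why the double transpose is used.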