Principal components regression (PCR) can be performed using the pcr() function, which is part of the pls library. We now apply PCR to the Hitters data, in order to predict Salary. Again, ensure that the missing values have been removed from the data, as described in Best Subset Selection 1¹.
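As a reminder of what that clean-up step looks like, here is a minimal sketch on a small made-up data frame. The names toy and toy.clean and all of the values are illustrative assumptions, not the Hitters data. Base R's na.omit() drops every row that contains at least one missing value.

```r
# Toy data frame standing in for Hitters (values are made up for illustration)
toy <- data.frame(
  Hits   = c(81, 130, NA, 141),
  Salary = c(NA, 475, 480, 500)
)

sum(is.na(toy))            # total number of missing entries
toy.clean <- na.omit(toy)  # keep only the rows with no missing values
dim(toy.clean)             # rows and columns that remain
sum(is.na(toy.clean))      # no missing entries are left
```

Applied to the real data, the same call would be Hitters <- na.omit(Hitters).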

library(pls)
set.seed(2)
pcr.fit <- pcr(Salary ~ ., data = Hitters, scale = TRUE, validation = "CV")

The syntax for the pcr() function is similar to that for lm(), with a few additional options. Setting scale=TRUE has the effect of standardizing each predictor, using

\[\tilde x_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_{ij}-\bar x_j)^2}},\]

prior to generating the principal components, so that the scale on which each variable is measured will not have an effect. Setting validation="CV" causes pcr() to compute the ten-fold cross-validation error for each possible value of \(M\), the number of principal components used. The resulting fit can be examined using summary().
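To make the standardization formula concrete, the following sketch applies it by hand to a small made-up matrix. The names X, sd.n, and X.tilde and the values are illustrative assumptions. The code only mirrors the formula as written above, with its 1/n divisor; it is not a claim about how pcr() carries out the scaling internally.

```r
# Made-up predictor matrix: 4 observations, 2 predictors on very different scales
X <- cbind(x1 = c(1, 2, 3, 4),
           x2 = c(100, 300, 200, 400))

# Standard deviation with a 1/n divisor, as in the formula above
sd.n <- function(x) sqrt(mean((x - mean(x))^2))

# Divide each column by its own standard deviation
X.tilde <- sweep(X, 2, apply(X, 2, sd.n), "/")

apply(X.tilde, 2, sd.n)  # each column now has standard deviation one
```

After this rescaling both columns vary on a comparable scale, which is why the units of the original measurements no longer influence the principal components.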

> summary(pcr.fit)
Data: 	X dimension: 263 19 
	Y dimension: 263 1
Fit method: svdpc
Number of components considered: 19

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps
CV             452    351.9    353.2    355.0    352.8
adjCV          452    351.6    352.7    354.4    352.1
...
TRAINING: % variance explained
        1 comps  2 comps  3 comps  4 comps  5 comps
X         38.31    60.16    70.84    79.03    84.29
Salary    40.63    41.58    42.17    43.22    44.90
...

The CV score is provided for each possible number of components, ranging from \(M = 0\) onwards. (We have printed the CV output only up to \(M = 4\).) Note that pcr() reports the root mean squared error; in order to obtain the usual \(MSE\), we must square this quantity. For instance, a root mean squared error of 352.8 corresponds to an \(MSE\) of 352.8² = 124,468.
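The conversion can be checked directly. The sketch below copies the CV row printed above into a vector by hand (the names cv.rmse and cv.mse are illustrative) and squares it.

```r
# CV root mean squared errors copied from the summary output, for M = 0, ..., 4
cv.rmse <- c(452, 351.9, 353.2, 355.0, 352.8)
names(cv.rmse) <- paste0("M=", 0:4)

cv.mse <- cv.rmse^2   # squaring the RMSE gives the MSE
round(cv.mse["M=4"])  # 124468
```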

Using the Boston dataset, try fitting a principal components regression with medv as the response and all other variables as the predictors, and store the result in pcr.fit.

Assume that:

The MASS and pls libraries have been loaded
The Boston dataset has been loaded and attached