Principal components regression (PCR) can be performed using the pcr()
function, which is part of the pls library. We now apply PCR to the Hitters
data, in order to predict Salary. Again, ensure that the missing values have
been removed from the data, as described in Best Subset Selection 11.
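The missing-value step mentioned above might look like the following minimal sketch. It assumes that the Hitters data comes from the ISLR package (an assumption; this section does not say where the data is loaded from) and uses base R's na.omit() to drop incomplete rows.

```r
# Assumption: Hitters is provided by the ISLR package
library(ISLR)

# Count the missing Salary values before cleaning
sum(is.na(Hitters$Salary))

# Drop every row that contains at least one missing value
Hitters <- na.omit(Hitters)

# No missing values should remain
sum(is.na(Hitters))
```

After this step the data should contain the 263 observations reported in the summary() output later in this section.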
library(pls)
set.seed(2)
pcr.fit <- pcr(Salary ~ ., data = Hitters, scale = TRUE, validation = "CV")
The syntax for the pcr() function is similar to that for lm(), with a few
additional options. Setting scale=TRUE has the effect of standardizing each
predictor prior to generating the principal components, so that the scale on
which each variable is measured will not have an effect. Setting
validation="CV" causes pcr() to compute the ten-fold cross-validation error
for each possible value of \(M\), the number of principal components used. The
resulting fit can be examined using summary().
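To make the effect of standardizing concrete, here is a rough base-R sketch of the idea behind PCR. It is not how pcr() is implemented internally. The data is simulated, and the names X, y, and M are hypothetical choices made only for this illustration.

```r
set.seed(1)
n <- 50

# Simulated predictors measured on very different scales
X <- cbind(a = rnorm(n, sd = 1),
           b = rnorm(n, sd = 100),
           c = rnorm(n, sd = 0.01))
y <- X[, "a"] + rnorm(n)

# Standardize: center each column and divide by its sample standard deviation
X_std <- scale(X, center = TRUE, scale = TRUE)
apply(X_std, 2, sd)  # each column now has standard deviation 1

# Principal components of the standardized predictors
pc <- prcomp(X_std)

# Regress the response on the scores of the first M components
M <- 2
scores <- pc$x[, 1:M, drop = FALSE]
fit <- lm(y ~ scores)
coef(fit)
```

Without the standardization step, column b would dominate the leading components simply because it is measured on a larger scale.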
> summary(pcr.fit)
Data: X dimension: 263 19
Y dimension: 263 1
Fit method: svdpc
Number of components considered: 19
VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps
CV             452    351.9    353.2    355.0    352.8
adjCV          452    351.6    352.7    354.4    352.1
...
TRAINING: % variance explained
        1 comps  2 comps  3 comps  4 comps  5 comps
X         38.31    60.16    70.84    79.03    84.29
Salary    40.63    41.58    42.17    43.22    44.90
...
The CV score is provided for each possible number of components, ranging
from \(M = 0\) onwards. (We have printed the CV output only up to \(M = 4\).)
Note that pcr()
reports the root mean squared error; in order to obtain
the usual \(MSE\), we must square this quantity. For instance, a root mean
squared error of 352.8 corresponds to an \(MSE\) of 352.8² = 124,468.
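The conversion from root mean squared error to MSE can be checked with a few lines of base R. This sketch simply copies the CV values printed in the summary() output above; the vector names rmsep_cv and mse_cv are hypothetical.

```r
# RMSEP values copied from the CV row of the summary() output above
rmsep_cv <- c("(Intercept)" = 452, "1 comps" = 351.9, "2 comps" = 353.2,
              "3 comps" = 355.0, "4 comps" = 352.8)

# Square the root mean squared error to get the MSE
mse_cv <- rmsep_cv^2
round(mse_cv["4 comps"])  # 124468, matching the figure in the text

# Among the values printed here, which model has the smallest CV error?
names(which.min(mse_cv))
```

The pls package also provides MSEP() and RMSEP() functions that extract these quantities from a fitted model, which avoids copying numbers by hand.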
Using the Boston dataset, try creating a principal components regression with
medv as the response and all other variables as the predictors, and store it
in pcr.fit.
Assume that:
The MASS and pls libraries have been loaded.
The Boston dataset has been loaded and attached.
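One possible way to approach the exercise is sketched below. It is only a sketch under the stated assumptions (MASS supplies the Boston data and pls supplies pcr()), and it mirrors the formula syntax used for the Hitters example rather than prescribing a single correct answer.

```r
library(MASS)  # provides the Boston data
library(pls)   # provides pcr()

# medv is the response; "." means all other columns are used as predictors
pcr.fit <- pcr(medv ~ ., data = Boston)
summary(pcr.fit)
```

The options scale = TRUE and validation = "CV" could be added to the call, as in the Hitters example, to standardize the predictors and obtain cross-validated errors.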