We just saw that it is possible to choose among a set of models of different sizes using \(C_p\), \(BIC\), and adjusted \(R^2\). We will now consider how to do this using the validation set and cross-validation approaches. In order for these approaches to yield accurate estimates of the test error, we must use only the training observations to perform all aspects of model-fitting—including variable selection. Therefore, the determination of which model of a given size is best must be made using only the training observations. This point is subtle but important. If the full data set is used to perform the best subset selection step, the validation set errors and cross-validation errors that we obtain will not be accurate estimates of the test error.
In order to use the validation set approach, we begin by splitting the observations into a training set and a test set. We do this by creating a random vector, train, of elements equal to TRUE if the corresponding observation is in the training set, and FALSE otherwise. The vector test has a TRUE if the observation is in the test set, and a FALSE otherwise. Note that the ! in the command to create test causes TRUEs to be switched to FALSEs and vice versa. We also set a random seed so that the user will obtain the same training set/test set split.
set.seed(1)  # fix the seed so the split is reproducible
train <- sample(c(TRUE, FALSE), nrow(Hitters), rep = TRUE)  # TRUE = training observation
test <- (!train)  # ! flips TRUE/FALSE, so test is the complement of train
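Because sample() draws TRUE and FALSE with equal probability, roughly half of the observations should land in each set. As a quick sanity check (an aside, not part of the original lab), base R's table() tallies the split:
table(train)  # counts of FALSE (test) and TRUE (training) observations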
Now, we apply regsubsets() to the training set in order to perform best subset selection.
regfit.best <- regsubsets(Salary ~ ., data = Hitters[train, ], nvmax = 19)
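With the models fit on the training data only, the validation set approach proceeds by computing each model's mean squared error on the held-out observations. Because regsubsets() does not provide a predict() method, a minimal sketch is to build the test design matrix with model.matrix() and form the predictions by hand (this assumes, as earlier in the lab, that rows of Hitters with a missing Salary have been removed):
test.mat <- model.matrix(Salary ~ ., data = Hitters[test, ])  # design matrix for the test set
val.errors <- rep(NA, 19)
for (i in 1:19) {
  coefi <- coef(regfit.best, id = i)  # coefficients of the best i-variable model
  pred <- test.mat[, names(coefi)] %*% coefi  # predictions on the test set
  val.errors[i] <- mean((Hitters$Salary[test] - pred)^2)  # validation set MSE
}
which.min(val.errors)  # model size with the smallest validation error
The model size returned by which.min() is the one the validation set approach selects; a model of that size would then be refit using the full data set.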
Try creating a training and test set for the Boston dataset, storing them in train and test respectively. Use the training set in the regsubsets() function with medv as the response and all other variables as predictors, and store the result in regfit.best. Use 13 for the nvmax parameter. A sketch of one possible solution follows the assumptions below.
Assume that:
- the MASS and leaps libraries have been loaded
- the Boston dataset has been loaded and attached
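For reference, a sketch of one possible solution (the seed value is an arbitrary illustrative choice; Boston has 13 predictors besides medv, which is why nvmax = 13 covers every subset size):
set.seed(1)  # illustrative seed; any reproducible choice works
train <- sample(c(TRUE, FALSE), nrow(Boston), rep = TRUE)  # random TRUE/FALSE split
test <- (!train)
# Best subset selection on the training observations only,
# with medv as the response and all other variables as candidate predictors
regfit.best <- regsubsets(medv ~ ., data = Boston[train, ], nvmax = 13)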