We just saw that it is possible to choose among a set of models of different sizes using \(C_p\), \(BIC\), and adjusted \(R^2\). We will now consider how to do this using the validation set and cross-validation approaches. In order for these approaches to yield accurate estimates of the test error, we must use only the training observations to perform all aspects of model-fitting—including variable selection. Therefore, the determination of which model of a given size is best must be made using only the training observations. This point is subtle but important. If the full data set is used to perform the best subset selection step, the validation set errors and cross-validation errors that we obtain will not be accurate estimates of the test error.

In order to use the validation set approach, we begin by splitting the observations into a training set and a test set. We do this by creating a random vector, train, whose elements equal TRUE if the corresponding observation is in the training set, and FALSE otherwise. The vector test has a TRUE if the observation is in the test set, and a FALSE otherwise. Note that the ! in the command used to create test causes TRUEs to be switched to FALSEs, and vice versa. We also set a random seed so that the user will obtain the same training set/test set split.

set.seed(1)  # for a reproducible split
train <- sample(c(TRUE, FALSE), nrow(Hitters), replace = TRUE)
test <- (!train)  # ! flips each TRUE to FALSE and vice versa
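
Since R treats TRUE as 1 and FALSE as 0 when summing a logical vector, a quick check (a small illustrative aside, not required for the analysis) confirms that the two vectors partition the observations:

sum(train)                                # number of training observations
sum(test)                                 # number of test observations
sum(train) + sum(test) == nrow(Hitters)   # should be TRUE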

Now, we apply regsubsets() to the training set in order to perform best subset selection.

regfit.best <- regsubsets(Salary ~ ., data = Hitters[train, ], nvmax = 19)

Try creating a training set and a test set for the Boston dataset and store them in train and test, respectively. Use this training set in the regsubsets() function with medv as the response and all other variables as predictors, storing the result in regfit.best. Set the nvmax parameter to 13. A sketch of one possible solution appears below.
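For reference, here is one possible solution sketch. It assumes the Boston data set comes from the MASS package and that the leaps package (which provides regsubsets()) is available; the variable names simply mirror the prompt, and Boston has 13 predictors, which is why nvmax = 13 covers every model size.

library(MASS)   # Boston data set: medv plus 13 predictors
library(leaps)  # regsubsets()

set.seed(1)
train <- sample(c(TRUE, FALSE), nrow(Boston), replace = TRUE)
test <- (!train)

regfit.best <- regsubsets(medv ~ ., data = Boston[train, ], nvmax = 13)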


Assume that: