Notice that we subset the Hitters data frame directly in the call to regsubsets() in order to access only the training subset of the data, using the expression Hitters[train,]. We now compute the validation set error for the best model of each model size. We first make a model matrix from the test data.

test.mat <- model.matrix(Salary ~ ., data = Hitters[test,])

The model.matrix() function is used in many regression packages for building an “X” matrix from data. Now we run a loop, and for each size i, we extract the coefficients from regfit.best for the best model of that size, multiply them into the appropriate columns of the test model matrix to form the predictions, and compute the test \(MSE\).

val.errors <- rep(NA, 19)
for (i in 1:19) {
  # Coefficients of the best model with i variables
  coefi <- coef(regfit.best, id = i)
  # Multiply the matching columns of the test model matrix by the coefficients
  pred <- test.mat[, names(coefi)] %*% coefi
  # Validation MSE for the i-variable model
  val.errors[i] <- mean((Hitters$Salary[test] - pred)^2)
}

We find that the best model is the one that contains seven variables.

> val.errors
 [1] 164377.3 144405.5 152175.7 145198.4 137902.1 139175.7
 [7] 126849.0 136191.4 132889.6 135434.9 136963.3 140694.9
[13] 140690.9 141951.2 141508.2 142164.4 141767.4 142339.6
[19] 142238.2
> which.min(val.errors)
[1] 7
> coef(regfit.best, 7)
 (Intercept)        AtBat         Hits        Walks        CRuns       CWalks    DivisionW      PutOuts 
  67.1085369   -2.1462987    7.0149547    8.0716640    1.2425113   -0.8337844 -118.4364998    0.2526925 
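
This was somewhat tedious, partly because there is no built-in predict() method for regsubsets() objects. The steps above can be captured in a small helper function. The sketch below is a minimal, illustrative version, not part of the leaps package: the name predict.regsubsets is our own choice (so that R's S3 dispatch lets us call predict() on the fitted object), and it assumes the formula was the first argument in the original regsubsets() call, so that it can be recovered from object$call[[2]].

predict.regsubsets <- function(object, newdata, id, ...) {
  # Recover the formula from the regsubsets() call (assumes it was the first argument)
  form <- as.formula(object$call[[2]])
  # Build the "X" matrix for the new data
  mat <- model.matrix(form, newdata)
  # Coefficients of the best model with id variables
  coefi <- coef(object, id = id)
  # Predictions: matching columns of the model matrix times the coefficients
  mat[, names(coefi)] %*% coefi
}

For example, predict(regfit.best, Hitters[test,], id = 7) should reproduce the predictions for the seven-variable model above.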

Try calculating the validation errors for the regfit.best model you created in the previous exercise. Store the model matrix in test.mat and the validation errors in val.errors:


Assume that: