Notice that we subset the Hitters data frame directly in the call in order to access only the training subset of the data, using the expression Hitters[train, ]. We now compute the validation set error for the best model of each model size. We first make a model matrix from the test data.
test.mat <- model.matrix(Salary ~ ., data = Hitters[test,])
The model.matrix() function is used in many regression packages for building an "X" matrix from data. Now we run a loop, and, for each size i, we extract the coefficients from regfit.best for the best model of that size, multiply them into the appropriate columns of the test model matrix to form the predictions, and compute the test MSE.
val.errors <- rep(NA, 19)
for (i in 1:19) {
  # coefficients of the best i-variable model
  coefi <- coef(regfit.best, id = i)
  # multiply the matching columns of the test matrix by the coefficients
  pred <- test.mat[, names(coefi)] %*% coefi
  val.errors[i] <- mean((Hitters$Salary[test] - pred)^2)
}
We find that the best model is the one that contains seven variables.
> val.errors
[1] 164377.3 144405.5 152175.7 145198.4 137902.1 139175.7
[7] 126849.0 136191.4 132889.6 135434.9 136963.3 140694.9
[13] 140690.9 141951.2 141508.2 142164.4 141767.4 142339.6
[19] 142238.2
> which.min(val.errors)
[1] 7
> coef(regfit.best, 7)
 (Intercept)        AtBat         Hits        Walks
  67.1085369   -2.1462987    7.0149547    8.0716640
       CRuns       CWalks    DivisionW      PutOuts
   1.2425113   -0.8337844 -118.4364998    0.2526925
Try calculating the validation errors for the regfit.best you created in the previous exercise. Store the model matrix in test.mat and the validation errors in val.errors.

Assume that:
- the MASS and leaps libraries have been loaded
- the Boston dataset has been loaded and attached
- train, test, and regfit.best from the previous exercise are already loaded
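The Hitters code above carries over to the Boston exercise along these lines. This is a sketch under stated assumptions: it takes medv as the response used to fit regfit.best, and 12 as the number of model sizes; match the formula and nvmax from your own previous call.

```r
# Sketch: validation errors for best subset models on the Boston test set.
# Assumes regfit.best was fit with medv ~ . and nvmax = 12.
test.mat <- model.matrix(medv ~ ., data = Boston[test, ])
val.errors <- rep(NA, 12)
for (i in 1:12) {
  coefi <- coef(regfit.best, id = i)            # best i-variable model
  pred <- test.mat[, names(coefi)] %*% coefi    # predictions on the test set
  val.errors[i] <- mean((Boston$medv[test] - pred)^2)
}
which.min(val.errors)  # size of the best model by validation MSE
```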