We now split the samples into a training set and a test set in order to estimate the test error of ridge regression and the lasso. There are two common ways to randomly split a data set. The first is to produce a random vector of TRUE, FALSE elements and select the observations corresponding to TRUE for the training data. The second is to randomly choose a subset of numbers between 1 and n; these can then be used as the indices for the training observations. The two approaches work equally well. We used the former method in Section 6.5.3. Here we demonstrate the latter approach. We first set a random seed so that the results obtained will be reproducible.

set.seed(1)
train <- sample(1:nrow(x), nrow(x) / 2)
test <- (-train)
y.test <- y[test]
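
For comparison, the former approach uses a random logical vector of length n; a minimal sketch (the names train.logical and test.logical are illustrative, not part of the lab code):

train.logical <- sample(c(TRUE, FALSE), nrow(x), replace = TRUE)  # roughly half TRUE
test.logical <- !train.logical

Unlike the index-based split above, this produces a training set of approximately, rather than exactly, half of the observations.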

Next we fit a ridge regression model on the training set, and evaluate its \(MSE\) on the test set, using \(\lambda = 4\). Note the use of the predict() function again. This time we get predictions for a test set by replacing type = "coefficients" with the newx argument.

> ridge.mod <- glmnet(x[train,], y[train], alpha = 0, lambda = grid, thresh = 1e-12)
> ridge.pred <- predict(ridge.mod, s = 4, newx = x[test,])
> mean((ridge.pred - y.test)^2)
[1] 142199.2
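
For contrast, predict() can instead be called with type = "coefficients" to extract the ridge coefficient estimates at \(\lambda = 4\) rather than test-set predictions; a minimal sketch (ridge.coef is an illustrative name):

ridge.coef <- predict(ridge.mod, s = 4, type = "coefficients")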

The test \(MSE\) is 142199. Note that if we had instead simply fit a model with just an intercept, we would have predicted each test observation using the mean of the training observations. In that case, we could compute the test set \(MSE\) like this:

> mean((mean(y[train]) - y.test)^2)
[1] 224669.9
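
Essentially the same result can be obtained by fitting a ridge regression model with a very large value of \(\lambda\), which shrinks all of the coefficient estimates toward zero and leaves only the intercept; a sketch (ridge.pred.null is an illustrative name, output not shown):

ridge.pred.null <- predict(ridge.mod, s = 1e10, newx = x[test,])
mean((ridge.pred.null - y.test)^2)

The resulting test \(MSE\) should be very close to the value above obtained by predicting with the training mean.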

Using the Boston data set, try fitting a ridge regression model (\(\lambda = 4\)) on a training set, with medv as the response and all other variables as the predictors, and store it in ridge.mod. With this model, calculate the predictions for the test set and store them in ridge.pred. Finally, calculate the test \(MSE\) and store it in ridge.mse (one possible approach is sketched below).


Assume that:
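
One possible sketch, under the assumption that the Boston data frame (from the MASS package) is available and that the same \(\lambda\) grid and random half split as above are used; the setup lines below other than ridge.mod, ridge.pred, and ridge.mse are illustrative:

library(glmnet)
library(MASS)                              # assumed source of the Boston data

x <- model.matrix(medv ~ ., Boston)[, -1]  # predictors (drop intercept column)
y <- Boston$medv

grid <- 10^seq(10, -2, length = 100)       # assumed lambda grid, as earlier in the lab
set.seed(1)
train <- sample(1:nrow(x), nrow(x) / 2)
test <- (-train)

ridge.mod <- glmnet(x[train,], y[train], alpha = 0, lambda = grid, thresh = 1e-12)
ridge.pred <- predict(ridge.mod, s = 4, newx = x[test,])
ridge.mse <- mean((ridge.pred - y[test])^2)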