We now split the samples into a training set and a test set in order to estimate the test error of ridge regression and the lasso. There are two common ways to randomly split a data set. The first is to produce a random vector of TRUE, FALSE elements and select the observations corresponding to TRUE for the training data. The second is to randomly choose a subset of numbers between 1 and n; these can then be used as the indices for the training observations. The two approaches work equally well. We used the former method in Section 6.5.3. Here we demonstrate the latter approach.
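For comparison, a minimal sketch of the first approach might look like the following; it uses sample() to draw a logical vector whose TRUE entries mark the training observations. This block is purely illustrative and is not used in what follows.
> train <- sample(c(TRUE, FALSE), nrow(x), replace = TRUE) # logical mask over the rows
> test <- (!train) # the complement marks the test observations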
We first set a random seed so that the results obtained will be reproducible.
> set.seed(1)
> train <- sample(1:nrow(x), nrow(x) / 2)
> test <- (-train)
> y.test <- y[test]
Next we fit a ridge regression model on the training set, and evaluate its \(MSE\) on the test set, using \(\lambda = 4\). Note the use of the predict() function again. This time we get predictions for a test set, by replacing type="coefficients" with the newx argument.
> ridge.mod <- glmnet(x[train,], y[train], alpha = 0, lambda = grid, thresh = 1e-12)
> ridge.pred <- predict(ridge.mod, s = 4, newx = x[test,])
> mean((ridge.pred - y.test)^2)
[1] 142199.2
The test \(MSE\) is 142199. Note that if we had instead simply fit a model with just an intercept, we would have predicted each test observation using the mean of the training observations. In that case, we could compute the test set \(MSE\) like this:
> mean((mean(y[train]) - y.test)^2)
[1] 224669.9
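So fitting a ridge regression model with \(\lambda = 4\) leads to a much lower test \(MSE\) (142199) than fitting a model with just an intercept (224670).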
Using the Boston dataset, try fitting a ridge regression model on a training set (\(\lambda = 4\)) and store it in ridge.mod, with medv as the response and all other variables as the predictors. With this model, calculate the predictions for the test set and store them in ridge.pred. Additionally, try calculating the test \(MSE\) and store it in ridge.mse.
Assume that:
- the MASS and glmnet libraries have been loaded
- the Boston dataset has been loaded and attached
- x, y, and ridge.mod created in exercise Ridge Regression 11 are already loaded
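One possible sketch follows the same steps demonstrated above. It assumes that a grid of \(\lambda\) values (grid) and a train/test split like the one created earlier are available; the index vectors train and test are assumptions here, not given by the exercise.
> ridge.mod <- glmnet(x[train, ], y[train], alpha = 0, lambda = grid, thresh = 1e-12) # refit on the training rows (assumes train exists)
> ridge.pred <- predict(ridge.mod, s = 4, newx = x[test, ]) # predictions at lambda = 4 for the test rows
> ridge.mse <- mean((ridge.pred - y[test])^2) # test MSE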