We could also get the same result as fitting a model with just an intercept by fitting a ridge regression model with a very large value of \(\lambda\). Note that 1e10 means \(10^{10}\).
> ridge.pred <- predict(ridge.mod, s = 1e10, newx = x[test,])
> mean((ridge.pred - y.test)^2)
[1] 224669.8
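As a quick check, we can compute that intercept-only test \(MSE\) directly by predicting every test observation with the mean of the training responses. This is only a minimal sketch, reusing the y, train, and y.test objects created earlier in this lab:
# Intercept-only model: predict each test observation with the training mean (sketch).
null.pred <- mean(y[train])
mean((null.pred - y.test)^2)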
So fitting a ridge regression model with \(\lambda = 4\) leads to a much lower test \(MSE\) (142199) than fitting a model with just an intercept. We now check whether there is any benefit to performing ridge regression with \(\lambda = 4\) instead of just performing least squares regression. Recall that least squares is simply ridge regression with \(\lambda = 0\).
> ridge.pred <- predict(ridge.mod, s = 0, newx = x[test,], exact = T, x = x[train,], y = y[train])
> mean((ridge.pred - y.test)^2)
[1] 168588.6
> lm(y ~ x, subset = train)
> predict(ridge.mod, s = 0, exact = T, type = "coefficients", x = x[train,], y = y[train])[1:20,]
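To make the comparison easier to read, the two sets of coefficient estimates can be placed side by side. This is only a sketch, reusing the ridge.mod, x, y, and train objects from above; note that the lm() coefficient names carry an x prefix because the predictors are supplied as a matrix.
# Side-by-side comparison of least squares and glmnet (lambda = 0) coefficients (sketch).
lm.coef <- coef(lm(y ~ x, subset = train))
glmnet.coef <- predict(ridge.mod, s = 0, exact = T, type = "coefficients",
                       x = x[train,], y = y[train])[1:20,]
cbind(lm = lm.coef, glmnet = glmnet.coef)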
In general, if we want to fit an (unpenalized) least squares model, then we should use the lm() function, since that function provides more useful outputs, such as standard errors and p-values for the coefficients.
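For example, the coefficient table with standard errors, t-statistics, and p-values can be pulled out of the summary() of the fitted model. A minimal sketch, assuming the same x, y, and train objects as above:
# Least squares fit with standard errors and p-values for each coefficient (sketch).
summary(lm(y ~ x, subset = train))$coefficients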
Note: In order for glmnet() to yield the exact least squares coefficients when \(\lambda = 0\), we use the argument exact = T when calling the predict() function. Otherwise, the predict() function will interpolate over the grid of \(\lambda\) values used in fitting the glmnet() model, yielding approximate results. When we use exact = T, there remains a slight discrepancy in the third decimal place between the output of glmnet() when \(\lambda = 0\) and the output of lm(); this is due to numerical approximation on the part of glmnet().
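The difference the interpolation makes can be seen by requesting the coefficients at \(\lambda = 0\) both with and without exact = T. Again just a sketch, using the ridge.mod, x, y, and train objects from above:
# Approximate coefficients, interpolated over the lambda grid used to fit ridge.mod (sketch).
predict(ridge.mod, s = 0, type = "coefficients")[1:20,]
# Exact coefficients, refit at lambda = 0 (needs the original training data).
predict(ridge.mod, s = 0, exact = T, type = "coefficients", x = x[train,], y = y[train])[1:20,]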
MC1:
A ridge regression model with \(\lambda = 4\) leads to a lower test \(MSE\) than a regular least squares regression model.
Therefore we can conclude that introducing a small amount of bias into how the model is fit on the training data reduces the variance of the model.
In other words, by accepting a slightly worse fit on the training data, ridge regression provides better predictions on unseen data.