We could also get the same result by fitting a ridge regression model with a very large value of \(\lambda\). Note that 1e10 means \(10^{10}\).

> ridge.pred <- predict(ridge.mod, s = 1e10, newx = x[test,])
> mean((ridge.pred - y.test)^2)
[1] 224669.8
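
To see why this is essentially the intercept-only model, one could also inspect the coefficient estimates at this very large value of \(\lambda\); they should be shrunk almost all the way to zero. A quick check (assuming ridge.mod was fit on the grid of \(\lambda\) values used earlier in this lab, which includes \(10^{10}\)):

> predict(ridge.mod, s = 1e10, type = "coefficients")[1:20,]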

So fitting a ridge regression model with \(\lambda = 4\) leads to a much lower test \(MSE\) (142199) than fitting a model with just an intercept. We now check whether there is any benefit to performing ridge regression with \(\lambda = 4\) instead of just performing least squares regression. Recall that least squares is simply ridge regression with \(\lambda = 0\).

> ridge.pred <- predict(ridge.mod, s = 0, newx = x[test,], exact = T, x = x[train,], y = y[train])
> mean((ridge.pred - y.test)^2)
[1] 168588.6
> lm(y ~ x, subset = train)
> predict(ridge.mod, s = 0, exact = T, type = "coefficients", x = x[train,], y = y[train])[1:20,]

In general, if we want to fit an (unpenalized) least squares model, then we should use the lm() function, since that function provides more useful outputs, such as standard errors and p-values for the coefficients.
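
For example, assuming the same x, y, and train objects created earlier in this lab, the least squares fit and its full coefficient table could be obtained as follows (a minimal sketch):

> lm.fit <- lm(y ~ x, subset = train)
> summary(lm.fit)   # estimates, standard errors, t-statistics, and p-values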

Note: In order for glmnet() to yield the exact least squares coefficients when \(\lambda = 0\), we use the argument exact=T when calling the predict() function. Otherwise, the predict() function will interpolate over the grid of \(\lambda\) values used in fitting the glmnet() model, yielding approximate results. When we use exact=T, there remains a slight discrepancy in the third decimal place between the output of glmnet() when \(\lambda = 0\) and the output of lm(); this is due to numerical approximation on the part of glmnet().
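
To inspect that discrepancy directly, one could place the two sets of coefficient estimates side by side; the sketch below reuses the objects created above and assumes both fits use the same training data:

> lm.coef <- coef(lm(y ~ x, subset = train))
> ridge.coef <- predict(ridge.mod, s = 0, exact = T, type = "coefficients", x = x[train,], y = y[train])[1:20,]
> max(abs(lm.coef - ridge.coef))   # difference should be very small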

MC1:
A ridge regression model with \(\lambda = 4\) leads to a lower test \(MSE\) than a regular least squares regression model. Therefore we can conclude that introducing a small amount of bias in how the model is fit to the training data reduces the variance of the model. In other words, by accepting a slightly worse fit on the training data, ridge regression provides better predictions on unseen data.