In this exercise, we will predict the number of applications received (Apps) using the other variables in the College data set, which is available in the ISLR2 package.

Note: this exercise is primarily based on the lab of chapter 6 (Linear Model Selection and Regularization), which builds on the labs of chapters 3 and 5.

Questions



  1. Split the data set into a training set and a test set using a 60/40 split. Set a seed value of 42 and use the sample() function. Store the training and test data frames in college.train and college.test, respectively.

  2. Fit a linear regression model using least squares on the training set. Use Apps as the dependent variable and all other variables as predictors. Store the model in lm.fit, the test-set predictions in pred.lm and the test error (MSE) in lm.error.

  3. Prepare the data to be used with a lasso or ridge regression model from the glmnet package. That is, convert the data frames to matrices using the model.matrix() function. Note that these matrices hold only the independent variables. Store the matrices in train.x and test.x. Also, store the dependent variable for each set in train.y and test.y.

  4. Fit a lasso regression model on the training set, with \(\lambda\) chosen by 5-fold cross-validation (note that the default is 10-fold CV). Use the following grid: grid <- 10 ^ seq(4, -2, length = 100). Make sure the predictors are standardized and again set a seed value of 42. Use the function cv.glmnet() from the glmnet package. Store the lasso model in cv.lasso and the optimal \(\lambda\) value in bestlam.lasso.
    Hint: in cv.glmnet(), supply the training data (train.x and train.y), and specify the values for arguments alpha, lambda, nfolds.
    Hint: the optimal \(\lambda\) value is stored as a component of the cv.lasso object.

  5. Make predictions on the test set with the optimal \(\lambda\) value. Store the predictions in pred.lasso, the test MSE in lasso.error and the coefficient estimates in coef.lasso.
    Hint: in predict() for pred.lasso, supply the CV object, and specify the values for arguments s and newx.
    Hint: in predict() for coef.lasso, supply the CV object, and specify the values for arguments s and type.
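
For reference, the two quantities the questions above rely on can be written out as follows. This is a sketch in notation that does not appear in the exercise itself: \(n_{\text{test}}\) is the number of test observations, \(\hat{y}_i\) is the predicted value of Apps for observation \(i\), \(n\) is the number of training observations and \(p\) is the number of predictors. The test MSE stored in lm.error and lasso.error is

\[
\mathrm{MSE}_{\text{test}} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \left( y_i - \hat{y}_i \right)^2 .
\]

For a Gaussian response and alpha = 1, the glmnet documentation states the lasso objective with a \(1/(2n)\) scaling on the residual sum of squares:

\[
\min_{\beta_0,\, \beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top} \beta \right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert .
\]

Because of this scaling, a given \(\lambda\) in glmnet is not directly comparable to \(\lambda\) in formulations that omit the factor \(1/(2n)\).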


Assume that: