We have seen that as the number of features used in a model increases, the training error necessarily decreases, but the test error may not. We will now explore this in a simulated data set.

Questions

  1. We first generate a data set with \(p = 20\) features, \(n = 1000\) observations, and an associated quantitative response vector generated according to the model \(Y = X\beta + \epsilon\), where \(\beta\) has some elements that are exactly equal to zero. The code for generating this simulated data set is already provided in the hand-in section below.

  2. Next, we split the data set into a training set containing 100 observations and a test set containing the remaining 900 observations. Again, this step is already pre-coded.

  3. Now, perform best subset selection on the training set, and plot the training set MSE associated with the best model of each size. Store the training errors (training MSEs) in a vector called train.errors. (You may want to use a for loop; one possible approach is sketched after this list.)

  4. Calculate the test set MSE associated with the best model of each size and store these values in the vector test.errors.

  5. For which model size do the training and test set MSEs take on their minimum values? Store these model sizes (numbers of independent variables) in modelsize.train and modelsize.test.

  6. Store the coefficients of the model at which the test set MSE is minimized in coef.bestmodel. Compare this to the true model used to generate the data. Note that the best model drops (almost) all of the zeroed-out coefficients.
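For reference, the data-generating and splitting steps in items 1 and 2 correspond to code of roughly the following form. This is only a minimal sketch; the exact pre-coded version, including the seed and the choice of which elements of \(\beta\) are zero, is the one provided in the hand-in section below.

```r
# Sketch only -- the provided code in the hand-in section is authoritative.
set.seed(1)                                      # assumed seed, for reproducibility
n <- 1000
p <- 20
X <- matrix(rnorm(n * p), nrow = n, ncol = p)    # n x p design matrix
beta <- rnorm(p)
beta[c(3, 6, 10, 14, 19)] <- 0                   # some coefficients exactly zero (illustrative choice)
y <- X %*% beta + rnorm(n)                       # Y = X beta + epsilon

# Split: 100 training observations, 900 test observations
train <- sample(seq_len(n), 100)
x.train <- X[train, ];  y.train <- y[train]
x.test  <- X[-train, ]; y.test  <- y[-train]
```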

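The sketch below shows one way to carry out items 3 and 4, assuming the leaps package and the training/test objects defined above. Because regsubsets() has no predict method, predictions for each model size are formed directly from the test model matrix and the selected coefficients.

```r
library(leaps)

train.df <- data.frame(y = y.train, x.train)     # columns become y, X1, ..., X20
test.df  <- data.frame(y = y.test,  x.test)

regfit <- regsubsets(y ~ ., data = train.df, nvmax = p)

train.mat <- model.matrix(y ~ ., data = train.df)
test.mat  <- model.matrix(y ~ ., data = test.df)

train.errors <- rep(NA, p)
test.errors  <- rep(NA, p)
for (i in 1:p) {
  coefi <- coef(regfit, id = i)                  # coefficients of the best model of size i
  train.errors[i] <- mean((y.train - train.mat[, names(coefi)] %*% coefi)^2)
  test.errors[i]  <- mean((y.test  - test.mat[,  names(coefi)] %*% coefi)^2)
}

plot(train.errors, type = "b",
     xlab = "Number of predictors", ylab = "Training MSE")
```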

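Items 5 and 6 can then be completed with which.min() and coef(), along the lines of the following sketch (variable names as above). Typically the training MSE is minimized by the largest model, while the test MSE is minimized at a size close to the number of non-zero elements of \(\beta\).

```r
modelsize.train <- which.min(train.errors)       # size minimizing the training MSE
modelsize.test  <- which.min(test.errors)        # size minimizing the test MSE

coef.bestmodel <- coef(regfit, id = modelsize.test)
coef.bestmodel                                   # compare against the true beta used to generate the data
```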
Assume that: