In this exercise, we will generate simulated data and then use it to perform best subset selection. Hedged code sketches illustrating one possible approach to the questions appear after the list below.

Questions

  1. Set a seed value of 1, then use the rnorm() function to generate a predictor x of length \(n = 100\) as well as a noise vector eps of length \(n = 100\).

  2. Generate a response vector y of length \(n = 100\) according to the model \(Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \varepsilon\) where \(\beta_0 = 2\), \(\beta_1 = 3\), \(\beta_2 = -1\) and \(\beta_3 = 0.5\).

  3. Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors \(X,X^2,\cdots,X^{10}\) (don’t forget to load the correct package). Note that you will need to use the data.frame() function to create a single data set containing both \(X\) and \(Y\). What is the best model obtained according to \(C_p\), \(BIC\), and adjusted \(R^2\)? Store the \(C_p\), \(BIC\), and adjusted \(R^2\) values for the best model according to each measure in min_cp, min_bic, and max_adjR2 respectively. Have a look at some plots in RStudio to understand the behaviour of the three metrics. Store the coefficients of the best model according to the adjusted \(R^2\) in coef_bestmodel.

  4. Repeat question 3, this time using forward stepwise selection and backward stepwise selection. How do your answers compare to the results in question 3? Store the coefficients of the forward and backward models with the highest adjusted \(R^2\) in coef.fwd and coef.bwd.

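For questions 1 and 2, a minimal sketch of one possible way to simulate the data is shown below. The variable names x, eps, and y follow the question wording; the only fixed requirement is that the seed is set before any random numbers are drawn.

    # Questions 1-2: simulate predictor, noise, and response
    set.seed(1)                      # set the seed before generating data
    x   <- rnorm(100)                # predictor of length n = 100
    eps <- rnorm(100)                # noise vector of length n = 100

    # Response from the cubic model Y = 2 + 3X - X^2 + 0.5X^3 + eps
    y <- 2 + 3 * x - x^2 + 0.5 * x^3 + eps
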
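For question 3, a sketch using the leaps package is given below. Building the ten polynomial terms with poly(x, 10, raw = TRUE) is one convenient choice rather than the required one, and the object names df, regfit_full, and reg_summary are illustrative.

    # Question 3: best subset selection over X, X^2, ..., X^10
    library(leaps)

    # Single data set containing Y and the powers of X
    df <- data.frame(y = y, poly(x, 10, raw = TRUE))
    regfit_full <- regsubsets(y ~ ., data = df, nvmax = 10)
    reg_summary <- summary(regfit_full)

    # Value of each criterion for its best model
    min_cp    <- min(reg_summary$cp)
    min_bic   <- min(reg_summary$bic)
    max_adjR2 <- max(reg_summary$adjr2)

    # Plots to compare the behaviour of the three metrics
    par(mfrow = c(1, 3))
    plot(reg_summary$cp,    type = "b", xlab = "Number of predictors", ylab = "Cp")
    plot(reg_summary$bic,   type = "b", xlab = "Number of predictors", ylab = "BIC")
    plot(reg_summary$adjr2, type = "b", xlab = "Number of predictors", ylab = "Adjusted R^2")

    # Coefficients of the best model according to adjusted R^2
    coef_bestmodel <- coef(regfit_full, which.max(reg_summary$adjr2))
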
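For question 4, the same regsubsets() call can be reused with a different method argument; the sketch below assumes the data frame df created above.

    # Question 4: forward and backward stepwise selection
    regfit_fwd <- regsubsets(y ~ ., data = df, nvmax = 10, method = "forward")
    regfit_bwd <- regsubsets(y ~ ., data = df, nvmax = 10, method = "backward")

    # Coefficients of the models with the highest adjusted R^2
    coef.fwd <- coef(regfit_fwd, which.max(summary(regfit_fwd)$adjr2))
    coef.bwd <- coef(regfit_bwd, which.max(summary(regfit_bwd)$adjr2))
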

Assume that: