We now use boosting to predict Salary in the Hitters data set.

Questions

Some of the exercises are not tested by Dodona (for example the plots), but it is still useful to try them.

  1. Use the is.na() function to remove the observations with missing values from the Hitters data.frame, and then log-transform the salaries.

  2. Create a training set (Hitters.train) consisting of the first 200 observations, and a test set (Hitters.test) consisting of the remaining observations.

  3. Perform boosting on the training set with 1000 trees for a range of values of the shrinkage parameter \(\lambda\) defined as follows.
    pows <- seq(-10, -0.2, by = 0.5)
    lambdas <- 10^pows
    

    Use "gaussian" for the distribution argument and set the seed to 1. Using a for-loop, store the training MSE for each lambda in train.err and the test MSE in test.err. Produce a plot with the shrinkage values on the \(x\)-axis and the corresponding training set MSE on the \(y\)-axis. Make the same plot for the test MSE.

  4. From the test MSE above, determine the best value for lambda, that is, the one with the lowest test MSE. Store this value in lambda.boost.

  5. Fit a boosted model on the training set with the same settings as before, using lambda.boost for \(\lambda\).
    • MC5:
      Which variables appear to be the most important predictors in this model?
      • 1: CAtBat, CRuns and CRBI.
      • 2: League, Division and NewLeague.
      • 3: It is not possible to calculate variable importances for a boosting model.
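
The steps above could be sketched roughly as follows. This is only one possible outline, not the reference solution. It assumes that Hitters is taken from the ISLR2 package and that boosting is done with gbm() from the gbm package, neither of which is named in the exercise. It also assumes that the seed is set once before the loop and again before the final fit. The name boost.fit is made up for illustration.

```r
# Assumptions: Hitters from ISLR2, boosting via the gbm package.
library(ISLR2)
library(gbm)

# Step 1: keep only rows without missing values, then log-transform Salary.
Hitters <- Hitters[rowSums(is.na(Hitters)) == 0, ]
Hitters$Salary <- log(Hitters$Salary)

# Step 2: first 200 observations for training, the rest for testing.
Hitters.train <- Hitters[1:200, ]
Hitters.test <- Hitters[-(1:200), ]

# Step 3: grid of shrinkage values as given in the exercise.
pows <- seq(-10, -0.2, by = 0.5)
lambdas <- 10^pows

train.err <- rep(NA, length(lambdas))
test.err <- rep(NA, length(lambdas))

set.seed(1)  # assumption: seed set once, before the loop
for (i in seq_along(lambdas)) {
  fit <- gbm(Salary ~ ., data = Hitters.train, distribution = "gaussian",
             n.trees = 1000, shrinkage = lambdas[i])
  pred.train <- predict(fit, Hitters.train, n.trees = 1000)
  pred.test <- predict(fit, Hitters.test, n.trees = 1000)
  train.err[i] <- mean((pred.train - Hitters.train$Salary)^2)
  test.err[i] <- mean((pred.test - Hitters.test$Salary)^2)
}

plot(lambdas, train.err, type = "b", xlab = "Shrinkage", ylab = "Training MSE")
plot(lambdas, test.err, type = "b", xlab = "Shrinkage", ylab = "Test MSE")

# Step 4: shrinkage value with the lowest test MSE.
lambda.boost <- lambdas[which.min(test.err)]

# Step 5: refit with the selected shrinkage and inspect relative influence.
set.seed(1)  # assumption: seed reset before the final fit
boost.fit <- gbm(Salary ~ ., data = Hitters.train, distribution = "gaussian",
                 n.trees = 1000, shrinkage = lambda.boost)
summary(boost.fit)
```

For a gbm fit, summary() reports the relative influence of each predictor, which is the quantity MC5 asks about. Because gbm subsamples the training data by default, the exact errors depend on where the seed is set.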


Assume that: