In this exercise, you will further analyze the Wage data set considered throughout this chapter, available in the ISLR2 package.

Questions

Some of the exercises are not tested by Dodona (for example the plots), but it is still useful to try them.

  1. Perform polynomial regression to predict wage using age and use 10-fold cross-validation to select the optimal degree \(d\) for the polynomial.
    1. Set a seed value of 4.
    2. Initialize an empty vector to store the CV errors. Store it in deltas.poly.
    3. In each iteration i of a for loop, fit a linear model with a polynomial of degree i, using glm() so that the fit can be passed to cv.glm(). Check degrees 1 to 10.
    4. With the cv.glm() function, store the CV error in the i-th element of the vector deltas.poly. Set the correct value for \(K\) and use delta[1], the first element of the delta component of the result, to extract the CV error.
    5. Inspect the vector deltas.poly and make a line plot.
    6. Which polynomial degree \(d\) has the lowest CV error (the estimate of the test MSE)? Store the answer in d.min.poly.
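
    The loop described in the steps above can be sketched as follows. This is a minimal illustration on simulated data rather than on Wage; the data frame sim, its columns x and y, and the names max.degree, cv.errors and best.degree are all hypothetical stand-ins, and only 5 degrees are tried to keep it short.

    ```r
    library(boot)  # provides cv.glm()

    # Simulated stand-in data (hypothetical): a quadratic signal plus noise.
    set.seed(4)
    n <- 300
    sim <- data.frame(x = runif(n, 18, 80))
    sim$y <- 50 + 3 * sim$x - 0.03 * sim$x^2 + rnorm(n, sd = 10)

    max.degree <- 5
    cv.errors <- rep(NA, max.degree)  # one slot per candidate degree

    for (i in 1:max.degree) {
      # glm() without a family argument fits ordinary least squares,
      # and its result can be passed to cv.glm().
      fit <- glm(y ~ poly(x, i), data = sim)
      # delta[1] is the raw K-fold CV estimate of the prediction error.
      cv.errors[i] <- cv.glm(sim, fit, K = 10)$delta[1]
    }

    # Index of the smallest CV error, i.e. the selected degree.
    best.degree <- which.min(cv.errors)
    ```

    A call such as plot(1:max.degree, cv.errors, type = "l") would then draw the CV error curve. Because the folds are assigned at random, the selected degree can change with the seed.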

    Verify your decision by comparing it to the results of hypothesis testing using ANOVA.

    1. Fit 5 models with polynomials of increasing degree, from fit1 (degree 1) up to fit5 (degree 5).
    2. Perform an ANOVA using the 5 models. Store it in anova.poly.

      • MC1:
        Interpret the output. What can you conclude based on ANOVA?
        • 1: A model of order-1 or order-2 is most appropriate
        • 2: A model of order-2 or order-3 is most appropriate
        • 3: A model of order-3 or order-4 is most appropriate
        • 4: A model of order-4 or order-5 is most appropriate

      • MC2:
        Does this confirm our findings of the 10-fold CV?
        • 1: No
        • 2: Yes
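
    The ANOVA comparison of nested polynomial fits described above can be sketched as follows. Again this uses simulated data, not Wage; sim, fit.a to fit.c, tab and p.values are hypothetical names, and only three degrees are compared.

    ```r
    # Simulated stand-in data (hypothetical): a quadratic signal plus noise.
    set.seed(1)
    n <- 300
    sim <- data.frame(x = runif(n, 18, 80))
    sim$y <- 50 + 3 * sim$x - 0.03 * sim$x^2 + rnorm(n, sd = 10)

    # Nested models: each one contains all terms of the previous one.
    fit.a <- lm(y ~ poly(x, 1), data = sim)
    fit.b <- lm(y ~ poly(x, 2), data = sim)
    fit.c <- lm(y ~ poly(x, 3), data = sim)

    # Each row tests whether the extra term improves on the model in the row above.
    tab <- anova(fit.a, fit.b, fit.c)
    print(tab)

    # The first entry is NA because the first model has nothing to be compared to.
    p.values <- tab[["Pr(>F)"]]
    ```

    A small p-value in a row suggests that the added polynomial term is worth keeping; a large one suggests that the simpler model in the row above is sufficient.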

  2. Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts.
    1. The procedure is similar to that of the previous question. Set a seed value of 1.
    2. However, this time, use a model with i cuts in the i-th iteration. Store the results in deltas.cut. Check cuts 2 to 10; deltas.cut should have NA for cut 1.
    3. Store the optimal number of cuts in d.min.cut.
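
    A sketch of the cross-validation loop for step functions, under the same assumptions as before: simulated data instead of Wage, and hypothetical names (sim, x.cut, max.cuts, cv.errors, best.cuts). One design choice here is to bin the predictor on the full data before fitting, so that every fold sees the same intervals.

    ```r
    library(boot)  # provides cv.glm()

    # Simulated stand-in data (hypothetical): a quadratic signal plus noise.
    set.seed(1)
    n <- 300
    sim <- data.frame(x = runif(n, 18, 80))
    sim$y <- 50 + 3 * sim$x - 0.03 * sim$x^2 + rnorm(n, sd = 10)

    max.cuts <- 6
    # Position 1 stays NA: cut() requires at least 2 intervals.
    cv.errors <- rep(NA, max.cuts)

    for (i in 2:max.cuts) {
      # Bin x once on the full data so all folds share the same factor levels.
      sim$x.cut <- cut(sim$x, i)
      fit <- glm(y ~ x.cut, data = sim)
      cv.errors[i] <- cv.glm(sim, fit, K = 10)$delta[1]
    }

    # which.min() skips the NA in position 1.
    best.cuts <- which.min(cv.errors)
    ```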

    Try to recreate the following plot of a step function with 8 cuts fitted to the training data.

    1. Create a scatterplot of wage vs age using all the data.
    2. Create a sequence age.grid of integer values ranging from the lowest age value in the data to the highest age value observed.
    3. Fit a step function with 8 cuts. Store the model in fit8.
    4. Using the model fit8, predict wage for the entire sequence. Store the result in preds.
    5. Add the predictions preds to the plot.
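
    The plotting steps above can be sketched as follows, once more on simulated data with hypothetical names (sim, x.grid, fit, preds) and with 4 intervals instead of 8. It is an illustration of the approach under those assumptions, not the reference solution.

    ```r
    # When run as a script, draw to a null device instead of writing Rplots.pdf.
    if (!interactive()) pdf(NULL)

    # Simulated stand-in data (hypothetical) with integer-valued x.
    set.seed(1)
    n <- 300
    sim <- data.frame(x = round(runif(n, 18, 80)))
    sim$y <- 50 + 3 * sim$x - 0.03 * sim$x^2 + rnorm(n, sd = 10)

    # Scatterplot of all observations.
    plot(sim$x, sim$y, col = "darkgrey", xlab = "x", ylab = "y")

    # Integer grid from the smallest to the largest observed x.
    x.grid <- seq(from = min(sim$x), to = max(sim$x))

    # Step function: a separate constant is fitted within each interval.
    fit <- lm(y ~ cut(x, 4), data = sim)

    # cut() recomputes its break points from whatever data it is given;
    # the predictions line up here only because x.grid spans exactly
    # the same range as sim$x.
    preds <- predict(fit, newdata = data.frame(x = x.grid))

    # Overlay the fitted step function on the scatterplot.
    lines(x.grid, preds, col = "red", lwd = 2)
    ```

    If predictions were needed on a grid with a different range, passing explicit break points to cut() via its breaks argument would be the safer choice.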

    [Plot: step function with 8 cuts fitted to wage versus age]


Assume that: