We will now perform cross-validation on a simulated data set.

Generate a simulated data set as follows.

set.seed(1)                    # make the simulation reproducible
x <- rnorm(100)                # 100 draws from a standard normal
y <- x - 2*x^2 + rnorm(100)    # quadratic signal in x plus normal noise

Questions

Some of the exercises are not tested by Dodona (for example the plots), but it is still useful to try them.

  1. In this data set, what is \(n\) and what is \(p\)? Store the results in data.n and data.p, respectively. Write out the model used to generate the data in equation form.

  2. Create a scatterplot of \(X\) against \(Y\). Reflect on what you find.

  3. Compute the LOOCV errors that result from fitting the following four models using least squares:

    \(\begin{align} Y &= \beta_0 + \beta_1X + \varepsilon \\ Y &= \beta_0 + \beta_1X + \beta_2X^2 + \varepsilon \\ Y &= \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \varepsilon\\ Y &= \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \beta_4X^4 + \varepsilon \\ \end{align}\)

    Implement this as follows:

    1. Convert y and x to a data frame, with y as the first column and x as the second. Store the data frame in data.
    2. Create an empty numeric vector loocv1 of length 4.
    3. Write a for loop over the polynomial degrees i = 1 to 4, and set a seed value of 1 at the start of each iteration.
    4. Then, fit a linear regression with glm(), using a polynomial of column x of degree i (with i the loop index) and y as the dependent variable.
    5. Then, apply the cv.glm() function to the fitted model and extract the LOOCV error from the appropriate component. Don’t forget to load the boot library in your R session.
    6. Store the LOOCV value for iteration i in the i-th element of loocv1. A sketch of this loop is given after the questions.

  4. Repeat question 3 using a seed value of 2 and reflect on your results. Store the result in loocv2. Are your results the same as what you got in question 3? Why?

    • MC1:
      Are your results the same as what you got in question 3? Why?
      • 1: Yes, the LOOCV error estimate is the same, because the LOOCV procedure does not use any random numbers.
      • 2: No, the LOOCV error estimate is different, because the LOOCV procedure relies on random numbers and we provide a different seed.

  5. Which of the models in question 3 had the smallest LOOCV error? Is this what you expected?

    • MC2:
      Which of the models in question 3 had the smallest LOOCV error?
      • 1: the order-1 polynomial
      • 2: the order-2 polynomial
      • 3: the order-3 polynomial
      • 4: the order-4 polynomial

  6. Check the statistical significance of the coefficient estimates that result from fitting each of the models in question 3 using least squares. Do these results agree with the conclusions drawn based on the cross-validation results?

    Hint: you can adapt the for loop of question 3 or 4 and print out the coefficients in each iteration; a sketch is given after the questions.

    • MC3:
      Which coefficients are significant?
      • 1: order-1
      • 2: order-1 + order-2
      • 3: order-1 + order-2 + order-3
      • 4: order-1 + order-2 + order-3 + order-4
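
The following is a minimal sketch of the loop described in question 3, not the graded solution; it assumes the data frame data and the vector loocv1 from the steps above, uses fit as an illustrative name, and requires the boot package.

library(boot)

data <- data.frame(y = y, x = x)    # y as the first column, x as the second
loocv1 <- rep(0, 4)                 # vector of length 4 to hold the LOOCV errors

for (i in 1:4) {
  set.seed(1)                               # seed set inside the loop, as the steps require
  fit <- glm(y ~ poly(x, i), data = data)   # linear fit via glm() so that cv.glm() can be used
  loocv1[i] <- cv.glm(data, fit)$delta[1]   # raw LOOCV estimate of the test error
}

For question 4, only the seed value and the storage vector change (set.seed(2) and loocv2).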

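For the hint in question 6, the same loop can be adapted to print each model's coefficient table; the Pr(>|t|) column gives the p-values used to judge significance. Again a sketch, reusing the data frame data and the illustrative name fit from above.

for (i in 1:4) {
  fit <- glm(y ~ poly(x, i), data = data)
  print(summary(fit)$coefficients)          # columns: Estimate, Std. Error, t value, Pr(>|t|)
}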

Assume that: