We will now perform cross-validation on a simulated data set.
Generate a simulated data set as follows.
```r
set.seed(1)
x <- rnorm(100)
y <- x - 2*x^2 + rnorm(100)
```
Some of the exercises are not tested by Dodona (for example the plots), but it is still useful to try them.
In this data set, what is \(n\) and what is \(p\)? Store the results in `data.n` and `data.p`, respectively. Write out the model used to generate the data in equation form.
Create a scatterplot of \(X\) against \(Y\). Reflect on what you find.
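The plot itself is not checked by Dodona, but a quick base-R sketch (repeating the simulation above so the snippet stands alone) could look like this:

```r
# Simulate the data as in the exercise
set.seed(1)
x <- rnorm(100)
y <- x - 2*x^2 + rnorm(100)

# Scatterplot of the simulated data; axis labels are a free choice
plot(x, y, xlab = "X", ylab = "Y")
```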
Compute the LOOCV errors that result from fitting the following four models using least squares:
\(\begin{align}
Y &= \beta_0 + \beta_1X + \varepsilon \\
Y &= \beta_0 + \beta_1X + \beta_2X^2 + \varepsilon \\
Y &= \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \varepsilon\\
Y &= \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \beta_4X^4 + \varepsilon \\
\end{align}\)
Implement this as follows:

- Combine `y` and `x` in a dataframe, with `y` as the first column and `x` as the second. Store the dataframe in `data`.
- Initialize a vector `loocv1` of length 4.
- For each model, fit a polynomial regression on `x` of degree `i`, with `i` the index; `y` is the dependent variable.
- Apply the `cv.glm()` function to the fitted regression and extract the LOOCV value with the appropriate attribute. Don't forget to load the library `boot` in your R session.
- Store the LOOCV value for degree `i` in the `i`'th element of `loocv1`.
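One way the loop could be sketched is shown below; the exact structure is up to you, and the names `data` and `loocv1` follow the steps above. Note that `glm()` without a `family` argument fits by least squares, and `cv.glm()` stores the cross-validation estimates in its `delta` component:

```r
library(boot)

# Simulate the data as in the exercise
set.seed(1)
x <- rnorm(100)
y <- x - 2*x^2 + rnorm(100)
data <- data.frame(y, x)

# LOOCV error for polynomial degrees 1 through 4
loocv1 <- rep(0, 4)
for (i in 1:4) {
  fit <- glm(y ~ poly(x, i), data = data)  # default gaussian family = least squares
  loocv1[i] <- cv.glm(data, fit)$delta[1]  # first delta entry is the raw LOOCV estimate
}
loocv1
```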
Repeat question 3 using seed value 2, and reflect on your results. Store the result in `loocv2`. Are your results the same as what you got in question 3? Why?
Which of the models in question 3 had the smallest LOOCV error? Is this what you expected?
Check the statistical significance of the coefficient estimates that result from fitting each of the models in question 3 using least squares. Do these results agree with the conclusions drawn based on the cross-validation results?
Hint: you can adapt the for loop of question 3 or 4 and print out the coefficients in each iteration.
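A minimal sketch of that adaptation, printing the coefficient table (estimates, standard errors, and p-values) for each degree:

```r
# Simulate the data as in the exercise
set.seed(1)
x <- rnorm(100)
y <- x - 2*x^2 + rnorm(100)
data <- data.frame(y, x)

# Fit each polynomial model and print its coefficient table
for (i in 1:4) {
  fit <- lm(y ~ poly(x, i), data = data)
  print(summary(fit)$coefficients)  # columns: Estimate, Std. Error, t value, Pr(>|t|)
}
```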
Assume that:

- the `boot` library has been loaded