The cv.glm()
function can also be used to implement k-fold CV. Below we
use k = 10, a common choice for k, on the Auto
data set. We once again set
a random seed and initialize a vector in which we will store the CV errors
corresponding to the polynomial fits of orders one to ten.
> set.seed(17)
> cv.error.10 <- rep(0, 10)
> for (i in 1:10) {
+ glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
+ cv.error.10[i] <- cv.glm(Auto, glm.fit, K = 10)$delta[1]
+ }
> cv.error.10
[1] 24.27207 19.26909 19.34805 19.29496 19.03198 18.89781 19.12061 19.14666
[9] 18.87013 20.95520
Notice that the computation time is much shorter than that of LOOCV. (In principle, the computation time for LOOCV for a least squares linear model should be faster than that of k-fold CV, due to the availability of the shortcut formula
\[CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1-h_i}\right)^2\]
for LOOCV, where \(\hat{y}_i\) is the \(i\)th fitted value and \(h_i\) is the leverage; however, unfortunately the cv.glm()
function
does not make use of this formula.) We still see little evidence that using
cubic or higher-order polynomial terms leads to lower test error than simply
using a quadratic fit.
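As an aside on the shortcut formula above, here is a minimal sketch of how the LOOCV estimate for a least squares fit can be computed directly from the leverage values; the helper name loocv.shortcut() is our own and is not part of the boot package.
> loocv.shortcut <- function(fit) {
+   h <- hatvalues(fit)                  # leverage values h_i
+   mean((residuals(fit) / (1 - h))^2)   # average squared leave-one-out residual
+ }
> lm.fit <- lm(mpg ~ poly(horsepower, 2), data = Auto)
> loocv.shortcut(lm.fit)   # should agree with the LOOCV estimate from cv.glm()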
We saw earlier, in the section on LOOCV, that the two numbers associated with delta
are essentially the same when LOOCV is performed. When we instead perform
k-fold CV, the two numbers associated with delta
differ slightly. The first is the standard k-fold CV estimate,
\[CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k}\mathrm{MSE}_i.\]
The second is a bias-corrected version. On this data set, the two estimates are very similar to each other.
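For instance, printing the full delta vector for a single 10-fold CV run on the quadratic fit shows the two numbers side by side (their exact values depend on the random seed):
> glm.fit <- glm(mpg ~ poly(horsepower, 2), data = Auto)
> cv.glm(Auto, glm.fit, K = 10)$delta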
Now, using the Boston data set with medv as the response and lstat as the predictor, store the CV errors corresponding to the polynomial fits of orders one to ten in cv.error.10, using k-fold cross-validation with K = 10. Assume that the MASS and boot libraries have been loaded, and that the Boston data set has been loaded and attached.
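One possible sketch of a solution follows; it mirrors the Auto example above. The seed value is our own choice, and the library() calls are shown only for completeness, since the exercise assumes the packages are already loaded.
> library(MASS)   # provides the Boston data set (assumed already loaded)
> library(boot)   # provides cv.glm() (assumed already loaded)
> set.seed(17)
> cv.error.10 <- rep(0, 10)
> for (i in 1:10) {
+   glm.fit <- glm(medv ~ poly(lstat, i), data = Boston)
+   cv.error.10[i] <- cv.glm(Boston, glm.fit, K = 10)$delta[1]
+ }
> cv.error.10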