If we choose a different training set instead, then we will obtain somewhat different errors on the validation set.
> set.seed(2)
> train <- sample(392, 196)
> lm.fit <- lm(mpg ~ horsepower, data = Auto, subset = train)
> mean((mpg - predict(lm.fit, Auto))[-train]^2)
[1] 25.72651
> lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
> mean((mpg - predict(lm.fit2, Auto))[-train]^2)
[1] 20.43036
> lm.fit3 <- lm(mpg ~ poly(horsepower, 3), data = Auto, subset = train)
> mean((mpg - predict(lm.fit3, Auto))[-train]^2)
[1] 20.38533
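The three fits above can equivalently be produced in a single loop. This is a sketch, assuming the Auto data frame is attached as in the commands above; since a degree-one polynomial fit gives the same predictions as the plain linear fit, the printed errors match those shown above.
> for (d in 1:3) {
+   fit <- lm(mpg ~ poly(horsepower, d), data = Auto, subset = train)
+   print(mean((mpg - predict(fit, Auto))[-train]^2))
+ }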
Using this split of the observations into a training set and a validation
set, we find that the validation set error rates for the models with linear,
quadratic, and cubic terms are 25.73, 20.43, and 20.39, respectively.
These results are consistent with our previous findings: a model that predicts mpg using a quadratic function of horsepower performs better than a model that involves only a linear function of horsepower, and there is little evidence in favor of a model that uses a cubic function of horsepower.
To see more clearly why a quadratic term results in the best fit, try plotting mpg against horsepower using the plot() function! (If you forgot how, take another look at the earlier exercise.)
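As a sketch of what that plot reveals, the fitted curves from the models above can be overlaid on the scatterplot; this reuses the lm.fit and lm.fit2 objects fitted earlier and again assumes Auto is attached.
> plot(horsepower, mpg)
> hp.grid <- seq(min(horsepower), max(horsepower), length.out = 100)
> lines(hp.grid, predict(lm.fit, data.frame(horsepower = hp.grid)), col = "blue")
> lines(hp.grid, predict(lm.fit2, data.frame(horsepower = hp.grid)), col = "red")
The red quadratic curve tracks the bend in the point cloud that the blue straight line misses.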