In performing a polynomial regression we must decide on the degree of
the polynomial to use. One way to do this is by using hypothesis tests. We
now fit models ranging from linear to a degree-5 polynomial and seek to
determine the simplest model which is sufficient to explain the relationship
between wage
and age
. We use the anova()
function, which performs an
analysis of variance (ANOVA, using an F-test) in order to test the null
hypothesis that a model \(\mathcal{M}_1\) is sufficient to explain the data against the
alternative hypothesis that a more complex model \(\mathcal{M}_2\) is required. In order
to use the anova()
function, \(\mathcal{M}_1\) and \(\mathcal{M}_2\) must be nested models: the
predictors in \(\mathcal{M}_1\) must be a subset of the predictors in \(\mathcal{M}_2\). In this case,
we fit five different models and sequentially compare the simpler model to
the more complex model.
fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
Analysis of Variance Table
Model 1: wage ~ age
Model 2: wage ~ poly(age, 2)
Model 3: wage ~ poly(age, 3)
Model 4: wage ~ poly(age, 4)
Model 5: wage ~ poly(age, 5)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2998 5022216
2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
3 2996 4777674 1 15756 9.8888 0.001679 **
4 2995 4771604 1 6070 3.8098 0.051046 .
5 2994 4770322 1 1283 0.8050 0.369682
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The p-value comparing the linear Model 1
to the quadratic Model 2
is
essentially zero (\(< 10^{- 15}\)), indicating that a linear fit is not sufficient. Similarly
the p-value comparing the quadratic Model 2
to the cubic Model 3
is very low (0.0017), so the quadratic fit is also insufficient. The p-value
comparing the cubic and degree-4 polynomials, Model 3
and Model 4
, is approximately
5% while the degree-5 polynomial Model 5
seems unnecessary
because its p-value is 0.37. Hence, either a cubic or a quartic polynomial
appear to provide a reasonable fit to the data, but lower- or higher-order
models are not justified.
In this case, instead of using the anova()
function, we could have obtained
these p-values more succinctly by exploiting the fact that poly()
creates
orthogonal polynomials.
coef(summary(fit.5))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 111.70361 0.7287647 153.2780243 0.000000e+00
poly(age, 5)1 447.06785 39.9160847 11.2001930 1.491111e-28
poly(age, 5)2 -478.31581 39.9160847 -11.9830341 2.367734e-32
poly(age, 5)3 125.52169 39.9160847 3.1446392 1.679213e-03
poly(age, 5)4 -77.91118 39.9160847 -1.9518743 5.104623e-02
poly(age, 5)5 -35.81289 39.9160847 -0.8972045 3.696820e-01
Notice that the p-values are the same, and in fact the square of the
t-statistics are equal to the F-statistics from the anova()
function; for
example:
(-11.983)^2
[1] 143.5923
medv
as dependent variable and crim
as independent variable.
Use the orthogonal polynomials.
The model of degree-1 should be stored in the variable fit.1
, model of degree-2 in fit.2
, etc.anova.medv
.Assume that: