In order to fit a step function, we use the cut() function.
table(cut(age, 4))
(17.9,33.5]   (33.5,49]   (49,64.5] (64.5,80.1]
        750        1399         779          72
fit <- lm(wage ~ cut(age, 4), data = Wage)
coef(summary(fit))
                        Estimate Std. Error   t value     Pr(>|t|)
(Intercept)            94.158392   1.476069 63.789970 0.000000e+00
cut(age, 4)(33.5,49]   24.053491   1.829431 13.148074 1.982315e-38
cut(age, 4)(49,64.5]   23.664559   2.067958 11.443444 1.040750e-29
cut(age, 4)(64.5,80.1]  7.640592   4.987424  1.531972 1.256350e-01
Here cut() automatically picked the cutpoints at 33.5, 49, and 64.5 years
of age. We could also have specified our own cutpoints directly using the
breaks option. The function cut() returns a categorical variable (a factor)
whose levels are the age intervals in increasing order; the lm() function
then creates a set of dummy variables for use in the regression. The
age < 33.5 category is left out, so the intercept coefficient of $94,160
(wage is recorded in thousands of dollars) can be interpreted as the average
salary for those under 33.5 years of age, and the other coefficients can be
interpreted as the average additional salary for those in the other age
groups. We can produce predictions and plots just as we did in the case of
the polynomial fit.
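As a rough sketch of what the breaks option and the prediction step might look like, the code below fits the step function with hand-picked cutpoints and predicts over a grid of ages. It assumes the Wage data comes from the ISLR package; the cutpoints in age.breaks and the names fit.step, age.grid, and preds are illustrative choices, not taken from the text above.

```r
library(ISLR)  # assumption: this package supplies the Wage data

# Illustrative cutpoints (not the ones cut() chose automatically).
# They span the observed ages so that no observation falls outside.
age.breaks <- c(17, 30, 45, 60, 81)

# Fixing the breaks in the formula means new data is binned the same way.
fit.step <- lm(wage ~ cut(age, breaks = age.breaks), data = Wage)

# Predict over a grid of integer ages, with standard errors.
age.grid <- seq(min(Wage$age), max(Wage$age))
preds <- predict(fit.step, newdata = data.frame(age = age.grid),
                 se.fit = TRUE)

# Plot the data and overlay the piecewise-constant fit.
with(Wage, plot(age, wage, col = "darkgrey"))
lines(age.grid, preds$fit, lwd = 2, col = "blue")
```

Passing explicit breaks, rather than a number of intervals, matters at prediction time: with cut(age, 4) the cutpoints would be recomputed from whatever ages appear in the new data, so the bins could differ from those used in the fit.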
Fit a model to predict medv using step functions of dis.
However, instead of giving the number of intervals, set 5 cut points by specifying a numeric vector in the breaks argument.
The cut points should be placed at 1, 3, 6, 9, and 12. Store the result in the variable fit.
Predicted medv for a neighbourhood with a dis value of 8:
19.73
25.15
25.07
22.45
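One way this exercise could be set up is sketched below. It assumes the Boston data (with columns medv and dis) is the one shipped with the MASS package, which the text does not state; dis.breaks and pred are hypothetical names, and no output is shown because it depends on the data actually loaded.

```r
library(MASS)  # assumption: Boston (with medv and dis) comes from MASS

# Five cut points give four intervals: (1,3], (3,6], (6,9], (9,12].
dis.breaks <- c(1, 3, 6, 9, 12)

# Observations whose dis lies outside (1, 12] become NA and are dropped by lm().
fit <- lm(medv ~ cut(dis, breaks = dis.breaks), data = Boston)

# A dis value of 8 falls in the (6,9] interval.
pred <- predict(fit, newdata = data.frame(dis = 8))
pred
```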
Assume that: