Here we fit a regression tree to the Boston
data set. First, we create a
training set and fit the tree to the training data.
library(tree)
library(MASS)
set.seed(8)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)
tree.boston <- tree(medv ~ ., Boston, subset = train)
summary(tree.boston)
Regression tree:
tree(formula = medv ~ ., data = Boston, subset = train)
Variables actually used in tree construction:
[1] "rm" "lstat" "dis"
Number of terminal nodes: 8
Residual mean deviance: 15.16 = 3713 / 245
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-18.3800 -2.3390 -0.1132 0.0000 2.2770 14.3600
Notice that the output of summary()
indicates that only three of the variables
have been used in constructing the tree. In the context of a regression
tree, the deviance is simply the sum of squared errors for the tree. We now
plot the tree.
plot(tree.boston)
text(tree.boston, pretty = 0)
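Since the deviance for a regression tree is just the training-set sum of squared errors, the figures reported by summary() can be reproduced by hand. A minimal sketch (the exact numbers depend on the seed set above):

```r
# Reproduce "Residual mean deviance: 15.16 = 3713 / 245" from summary():
# deviance = sum of squared residuals on the training data,
# divided by (number of training observations - number of terminal nodes).
yhat.train <- predict(tree.boston, newdata = Boston[train, ])
sse <- sum((Boston[train, "medv"] - yhat.train)^2)  # total deviance
sse / (length(train) - 8)                           # residual mean deviance
```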
The variable lstat
measures the percentage of individuals with lower
socioeconomic status. The tree indicates that lower values of lstat
correspond to more expensive houses. The tree predicts a median house price
of $35,640 for larger homes in suburbs in which residents have high socioeconomic
status (rm >= 6.924, rm < 7.3935, and lstat < 8.845).
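The fitted tree can also be used to predict medv for the held-out observations. A sketch of evaluating the tree on the test half (the resulting test MSE depends on the seed used above):

```r
# Predict on the observations not in the training set and
# compute the test-set mean squared error.
yhat <- predict(tree.boston, newdata = Boston[-train, ])
boston.test <- Boston[-train, "medv"]
mean((yhat - boston.test)^2)  # test MSE
```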
Exercise: repeat this analysis for the Hitters
data set. Use a seed value of 1.
Store the numeric vector with training data indices in train.idx.
Fit a regression tree with Salary
as dependent variable and all other variables as independent variables.
Store the result in the variable tree.hitters.
MC1: Create a plot of the tree and select the correct answer:
- RBI results in a higher expected value for Salary
- Errors leads to a higher expected value for Salary
Assume that:
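One possible setup for the steps above is sketched below. This is an illustration, not the official solution; it assumes the Hitters data come from the ISLR package, where Salary contains missing values that must be removed before fitting.

```r
library(ISLR)   # assumed source of the Hitters data
library(tree)
Hitters <- na.omit(Hitters)  # drop rows with missing Salary (assumption)
set.seed(1)
train.idx <- sample(1:nrow(Hitters), nrow(Hitters) / 2)
tree.hitters <- tree(Salary ~ ., Hitters, subset = train.idx)
plot(tree.hitters)
text(tree.hitters, pretty = 0)
```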