The tree library is used to construct classification and regression trees.
library(tree)
We first use classification trees to analyze the Carseats data set. In these data, Sales is a continuous variable, and so we begin by recoding it as a binary variable. We use the ifelse() function to create a variable, called High, which takes on a value of Yes if the Sales variable exceeds 8, and takes on a value of No otherwise.
library(ISLR2)
attach(Carseats)
High <- factor(ifelse(Sales <= 8, "No", "Yes"))
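To see what this recoding does, here is a minimal sketch on a small made-up vector. The values and the names toy_sales and toy_high are hypothetical and are not taken from Carseats; only base R is used.

```r
# Hypothetical sales figures; NOT taken from the Carseats data
toy_sales <- c(9.5, 7.4, 8.0, 11.2, 4.15)

# Same rule as in the lab: "Yes" if sales exceed 8, "No" otherwise
toy_high <- factor(ifelse(toy_sales <= 8, "No", "Yes"))

# Show the recoded factor and the count in each class
print(toy_high)
print(table(toy_high))
```

Note that a value of exactly 8 is coded as No, since only values that strictly exceed 8 count as Yes. The factor levels are ordered alphabetically, so No comes before Yes.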
Finally, we use the data.frame() function to merge High with the rest of the Carseats data.
Carseats <- data.frame(Carseats, High)
We now use the tree() function to fit a classification tree in order to predict High using all variables but Sales. The syntax of the tree() function is quite similar to that of the lm() function.
tree.carseats <- tree(High ~ . - Sales, Carseats)
The summary() function lists the variables that are used as internal nodes in the tree, the number of terminal nodes, and the (training) error rate.
summary(tree.carseats)
Classification tree:
tree(formula = High ~ . - Sales, data = Carseats)
Variables actually used in tree construction:
[1] "ShelveLoc" "Price" "Income" "CompPrice" "Population" "Advertising" "Age" "US"
Number of terminal nodes: 27
Residual mean deviance: 0.4575 = 170.7 / 373
Misclassification error rate: 0.09 = 36 / 400
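We see that the training error rate is 9%. For classification trees, the deviance reported in the output of summary() is given by
\[ -2 \sum_{m} \sum_{k} n_{mk} \log \hat{p}_{mk}, \]
where \(n_{mk}\) is the number of observations in the \(m\)th terminal node that belong to the \(k\)th class, and \(\hat{p}_{mk}\) is the proportion of observations in the \(m\)th terminal node that belong to the \(k\)th class. A small deviance indicates a tree that provides a good fit to the (training) data. The residual mean deviance reported is simply the deviance divided by \(n - |T_0|\), where \(n\) is the number of observations and \(|T_0|\) is the number of terminal nodes; in this case \(400 - 27 = 373\).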
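The deviance and residual mean deviance can also be worked out by hand. The sketch below does so for a made-up tree with two terminal nodes, taking the deviance to be minus twice the sum over nodes and classes of the class count times the log of the within-node class proportion. The count matrix and all variable names are hypothetical, and none of the numbers come from the Carseats fit.

```r
# Hypothetical class counts n_mk: rows are terminal nodes m, columns are classes k
n_mk <- matrix(c(8, 2,
                 1, 9),
               nrow = 2, byrow = TRUE,
               dimnames = list(node = c("1", "2"), class = c("No", "Yes")))

# Class proportions within each terminal node (each row divided by its row sum)
p_mk <- n_mk / rowSums(n_mk)

# Deviance: -2 * sum over m and k of n_mk * log(p_mk);
# cells with n_mk = 0 contribute nothing (0 * log 0 is taken as 0)
terms <- ifelse(n_mk > 0, n_mk * log(p_mk), 0)
deviance_toy <- -2 * sum(terms)

# Residual mean deviance: deviance / (n - number of terminal nodes)
n_obs <- sum(n_mk)
n_leaves <- nrow(n_mk)
mean_deviance_toy <- deviance_toy / (n_obs - n_leaves)

print(deviance_toy)
print(mean_deviance_toy)
```

A perfectly pure node (all observations in one class) would contribute zero to the deviance, which is why a small deviance signals a close fit to the training data.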
For this and the following exercises, we use the OJ dataset from the ISLR2 library. The dataset contains sales information where the customer either purchased Citrus Hill or Minute Maid Orange Juice.
Fit a classification tree with Purchase as the dependent variable and all other variables as independent variables. Store the model in the variable tree.oj.
Assume that:
- The ISLR2 library has been loaded.
- The tree library has been installed and loaded.
- The OJ dataset has been loaded and attached.