The tree library is used to construct classification and regression trees.

library(tree)

We first use classification trees to analyze the Carseats data set. In these data, Sales is a continuous variable, and so we begin by recoding it as a binary variable. We use the ifelse() function to create a variable, called High, which takes on a value of Yes if the Sales variable exceeds 8, and takes on a value of No otherwise.

library(ISLR2)
attach(Carseats)
High <- factor(ifelse(Sales <= 8, "No", "Yes"))

Finally, we use the data.frame() function to merge High with the rest of the Carseats data.

Carseats <- data.frame(Carseats, High)

We now use the tree() function to fit a classification tree in order to predict High using all variables but Sales. The syntax of the tree() function is quite similar to that of the lm() function.

tree.carseats <- tree(High ~ . - Sales, Carseats)

The summary() function lists the variables that are used as internal nodes in the tree, the number of terminal nodes, and the (training) error rate.

summary(tree.carseats)

Classification tree:
tree(formula = High ~ . - Sales, data = Carseats)
Variables actually used in tree construction:
[1] "ShelveLoc"   "Price"       "Income"      "CompPrice"   "Population"  "Advertising" "Age"         "US"         
Number of terminal nodes:  27 
Residual mean deviance:  0.4575 = 170.7 / 373 
Misclassification error rate: 0.09 = 36 / 400

We see that the training error rate is 9%. For classification trees, the deviance reported in the output of summary() is given by

\[-2 \sum_{m} \sum_{k} n_{mk} \log \hat{p}_{mk},\]

where \(n_{mk}\) is the number of observations in the \(m\)th terminal node that belong to the \(k\)th class. A small deviance indicates a tree that provides a good fit to the (training) data. The residual mean deviance reported is simply the deviance divided by \(n - | T_0 |\), which in this case is 400−27 = 373.

Questions

For this and the following exercises, we use the OJ dataset from the ISLR2 library. The dataset contains sales information where the customer either purchased Citrus Hill or Minute Maid Orange Juice.

Assume that: