Here we use the gbm package, and within it the gbm() function, to fit boosted regression trees to the Boston data set. We run gbm() with the option distribution="gaussian" since this is a regression problem; if it were a binary classification problem, we would use distribution="bernoulli". The argument n.trees=5000 indicates that we want 5000 trees, and the option interaction.depth=4 limits the depth of each tree.
library(gbm)
set.seed(1)
boost.boston <- gbm(medv ~ ., data = Boston[train,], distribution = "gaussian", n.trees = 5000, interaction.depth = 4)
The summary() function produces a relative influence plot and also outputs the relative influence statistics.
summary(boost.boston)
var rel.inf
rm rm 43.9919329
lstat lstat 33.1216941
crim crim 4.2604167
dis dis 4.0111090
nox nox 3.4353017
black black 2.8267554
age age 2.6113938
ptratio ptratio 2.5403035
tax tax 1.4565654
indus indus 0.8008740
rad rad 0.6546400
zn zn 0.1446149
chas chas 0.1443986
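As an aside (not part of the original lab), the influence table can also be retrieved without the plot, since summary.gbm() accepts a plotit argument; a minimal sketch:
rel.inf <- summary(boost.boston, plotit = FALSE)  # data frame of variables and rel.inf
head(rel.inf, 2)  # rm and lstat dominate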
We see that lstat and rm are by far the most important variables. We can also produce partial dependence plots for these two variables. These plots illustrate the marginal effect of the selected variables on the response after integrating out the other variables. In this case, as we might expect, median house prices are increasing with rm and decreasing with lstat.
plot(boost.boston, i = "rm")
plot(boost.boston, i = "lstat")
We now use the boosted model to predict medv on the test set:
yhat.boost <- predict(boost.boston, newdata = Boston[-train,], n.trees = 5000)
mean((yhat.boost - boston.test)^2)  # boston.test holds the test-set medv values, created earlier in the lab
[1] 18.84709
The test MSE obtained is 18.85, similar to the test MSE for random forests and superior to that for bagging. If we want to, we can perform boosting with a different value of the shrinkage parameter \(\lambda\) in \(\hat f(x) \leftarrow \hat f(x) + \lambda \hat f^b(x)\). The default value is 0.1, but this is easily modified. Here we take \(\lambda = 0.2\).
boost.boston <- gbm(medv ~ ., data = Boston[train,], distribution = "gaussian", n.trees = 5000, interaction.depth = 4, shrinkage = 0.2, verbose = FALSE)
yhat.boost <- predict(boost.boston, newdata = Boston[-train,], n.trees = 5000)
mean((yhat.boost - boston.test)^2)
[1] 18.33455
In this case, using \(\lambda = 0.2\) leads to a slightly lower test MSE than \(\lambda = 0.1\).
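As a further experiment (not part of the original lab), gbm() can also select the number of trees by cross-validation via its cv.folds argument, with gbm.perf() extracting the CV-optimal iteration count; a minimal sketch, reusing the train index and boston.test from above:
set.seed(1)
boost.cv <- gbm(medv ~ ., data = Boston[train,], distribution = "gaussian", n.trees = 5000, interaction.depth = 4, shrinkage = 0.2, cv.folds = 5)
best.iter <- gbm.perf(boost.cv, method = "cv")  # number of trees minimizing CV error
yhat.cv <- predict(boost.cv, newdata = Boston[-train,], n.trees = best.iter)
mean((yhat.cv - boston.test)^2)  # test MSE at the CV-chosen number of trees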
The Smarket data set contains the daily percentage returns for the S&P 500 stock index between 2001 and 2005. In this question we will try to predict whether a stock will go up or down (Direction) based on its activity over the previous 5 days (Lag1, Lag2, …).
library(ISLR2)
data(Smarket)
dim(Smarket)
[1] 1250 9
head(Smarket)
Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
1 2001 0.381 -0.192 -2.624 -1.055 5.010 1.1913 0.959 Up
2 2001 0.959 0.381 -0.192 -2.624 -1.055 1.2965 1.032 Up
3 2001 1.032 0.959 0.381 -0.192 -2.624 1.4112 -0.623 Down
4 2001 -0.623 1.032 0.959 0.381 -0.192 1.2760 0.614 Up
5 2001 0.614 -0.623 1.032 0.959 0.381 1.2057 0.213 Up
6 2001 0.213 0.614 -0.623 1.032 0.959 1.3491 1.392 Up
Create a variable Target in the Smarket dataframe which has the value 1 if the stock went up and 0 if the stock went down.
Create a training set of all observations before 2005. Store the dataframe in train.
Create a test set of all the observations in 2005. Store the dataframe in test.
Fit a regression model with Target as the dependent variable and Lag1 to Lag5 as the independent variables and store it in an object named fit.
Compute predictions for the test set and store them in preds.
Convert the predictions to class labels. (If the model predicts a value smaller than 0, output 0; if the model predicts a value greater than or equal to 0, output 1.)
MC1: Does the model perform better than random guessing (accuracy higher than 50%)? A sketch of one possible solution is given below.
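A minimal sketch of one possible solution, assuming a linear model fit with lm() and the variable names from the exercise (Target, train, test, fit, preds); the 0 threshold follows the exercise statement:
library(ISLR2)
data(Smarket)
Smarket$Target <- ifelse(Smarket$Direction == "Up", 1, 0)  # 1 = up, 0 = down
train <- Smarket[Smarket$Year < 2005, ]  # observations before 2005
test <- Smarket[Smarket$Year == 2005, ]  # observations in 2005
fit <- lm(Target ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5, data = train)
preds <- predict(fit, newdata = test)
labels <- ifelse(preds >= 0, 1, 0)  # thresholding rule as stated in the exercise
mean(labels == test$Target)  # accuracy; compare against 0.5 for MC1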
Assume that: