As we have seen previously, the training error rate is often overly optimistic—it tends to underestimate the test error rate. In order to better assess the accuracy of the logistic regression model in this setting, we can fit the model using part of the data, and then examine how well it predicts the held out data. This will yield a more realistic error rate, in the sense that in practice we will be interested in our model’s performance not on the data that we used to fit the model, but rather on days in the future for which the market’s movements are unknown.
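The hold-out idea can be sketched on simulated data before it is applied to the stock market data. This is only an illustration under stated assumptions: the data frame sim, the variables x and y, the 800/200 split, and the 0.5 threshold are all invented for this sketch and are not part of the lab.

```r
# Simulated data; every name and number here is an assumption made for
# illustration, not something taken from the Smarket lab.
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 * x))
sim <- data.frame(x = x, y = y)

# Boolean vector marking the first 800 rows as training data.
in_train <- seq_len(n) <= 800

# Fit the logistic regression on the training rows only.
fit <- glm(y ~ x, data = sim, family = binomial, subset = in_train)

# Predicted probabilities for the held-out rows, turned into class labels.
probs <- predict(fit, newdata = sim[!in_train, ], type = "response")
pred <- ifelse(probs > 0.5, 1, 0)

# Fraction of held-out observations that are misclassified.
test_error <- mean(pred != sim$y[!in_train])
test_error
```

Because the 200 held-out rows play no part in the fit, test_error estimates how the model would do on new observations, which the training error cannot do.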
To implement this strategy, we will first create a vector corresponding to the observations from 2001 through 2004. We will then use this vector to create a held out data set of observations from 2005.
> train <- (Year < 2005)
> Smarket.2005 <- Smarket[!train,]
> dim(Smarket.2005)
[1] 252 9
> Direction.2005 <- Direction[!train]
The object train is a vector of 1,250 elements, corresponding to the observations in our data set. The elements of the vector that correspond to observations that occurred before 2005 are set to TRUE, whereas those that correspond to observations in 2005 are set to FALSE. The object train is a Boolean vector, since its elements are TRUE and FALSE. Boolean vectors can be used to obtain a subset of the rows or columns of a matrix. For instance, the command Smarket[train,] would pick out a submatrix of the stock market data set, corresponding only to the dates before 2005, since those are the ones for which the elements of train are TRUE.
The ! symbol can be used to reverse all of the elements of a Boolean vector. That is, !train is a vector similar to train, except that the elements that are TRUE in train get swapped to FALSE in !train, and the elements that are FALSE in train get swapped to TRUE in !train. Therefore, Smarket[!train,] yields a submatrix of the stock market data containing only the observations for which train is FALSE, that is, the observations with dates in 2005. The output above indicates that there are 252 such observations.
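The same indexing can be tried on a small example where every row is visible. The data frame toy, its values, and the name is_train are made up for this sketch; only the indexing pattern mirrors the commands above.

```r
# Toy data frame; the rows and values are invented for illustration.
toy <- data.frame(Year = c(2003, 2004, 2004, 2005, 2005),
                  Lag1 = c(0.4, -0.2, 1.1, 0.3, -0.9))

# Boolean vector: TRUE for rows before 2005, FALSE for rows in 2005.
is_train <- toy$Year < 2005
is_train
# [1]  TRUE  TRUE  TRUE FALSE FALSE

# Rows for which is_train is TRUE (years 2003 and 2004).
toy[is_train, ]

# ! flips every element, so this keeps only the 2005 rows.
toy[!is_train, ]

# Number of rows in each subset.
sum(is_train)   # 3
sum(!is_train)  # 2
```

Leaving the position after the comma empty, as in toy[is_train, ], keeps all columns and selects only rows.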
Try creating Smarket.2004 with all the observations from 2004 and 2005:
Assume that the ISLR2 library has been loaded and that the Smarket data set has been loaded and attached.