Here we apply the best subset selection approach to the Hitters data. We
wish to predict a baseball player's Salary on the basis of various statistics
associated with performance in the previous year.
First of all, we note that the Salary variable is missing for some of the
players. The is.na() function can be used to identify the missing
observations. It returns a vector of the same length as the input vector,
with a TRUE for any elements that are missing, and a FALSE for non-missing
elements. The sum() function can then be used to count all of the missing
elements.
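As a minimal illustration of how is.na() and sum() combine, here is a sketch on a made-up vector (the values are hypothetical and not taken from Hitters):

```r
# Toy vector with two missing entries (illustrative values only)
x <- c(10, NA, 3, NA)

# is.na() returns a logical vector of the same length as x
is_missing <- is.na(x)
print(is_missing)   # FALSE  TRUE FALSE  TRUE

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of NAs
n_missing <- sum(is_missing)
print(n_missing)    # 2
```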
> library(ISLR2)
> head(Hitters)
> names(Hitters)
[1] "AtBat" "Hits" "HmRun" "Runs"
[5] "RBI" "Walks" "Years" "CAtBat" "CHits" "CHmRun"
[11] "CRuns" "CRBI" "CWalks" "League" "Division" "PutOuts"
[17] "Assists" "Errors" "Salary" "NewLeague"
> dim(Hitters)
[1] 322 20
> sum(is.na(Hitters$Salary))
[1] 59
Hence we see that Salary is missing for 59 players. The na.omit() function
removes all of the rows that have missing values in any variable.
> Hitters <- na.omit(Hitters)
> dim(Hitters)
[1] 263 20
> sum(is.na(Hitters))
[1] 0
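The row-wise behaviour of na.omit() is perhaps easiest to see on a small data frame. The sketch below uses a hypothetical three-row data frame named toy (not Hitters). complete.cases() is a base R function that flags the rows na.omit() would keep.

```r
# Hypothetical data: row 1 lacks Salary, row 2 lacks Hits, row 3 is complete
toy <- data.frame(
  Hits   = c(81, NA, 141),
  Salary = c(NA, 475, 500)
)

# A row is dropped if ANY of its columns is missing
toy_clean <- na.omit(toy)
print(dim(toy_clean))        # 1 2

# complete.cases() marks the rows that survive na.omit()
print(complete.cases(toy))   # FALSE FALSE  TRUE
```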
The regsubsets() function (part of the leaps library) performs best subset
selection by identifying the best model that contains a given number
of predictors, where best is quantified using RSS. The syntax is the same
as for lm(). The summary() command outputs the best set of variables for
each model size.
> library(leaps)
> regfit.full <- regsubsets(Salary ~ ., Hitters)
> summary(regfit.full)
Subset selection object
Call: regsubsets.formula(Salary ~ ., Hitters)
19 Variables (and intercept)
...
1 subsets of each size up to 8
Selection Algorithm: exhaustive
        AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits
1 ( 1 ) " "   " "  " "   " "  " " " "   " "   " "    " "
2 ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "
3 ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "
4 ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "
5 ( 1 ) "*"   "*"  " "   " "  " " " "   " "   " "    " "
6 ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "
7 ( 1 ) " "   "*"  " "   " "  " " "*"   " "   "*"    "*"
8 ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "
        CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts
1 ( 1 ) " "    " "   "*"  " "    " "     " "       " "
2 ( 1 ) " "    " "   "*"  " "    " "     " "       " "
3 ( 1 ) " "    " "   "*"  " "    " "     " "       "*"
4 ( 1 ) " "    " "   "*"  " "    " "     "*"       "*"
5 ( 1 ) " "    " "   "*"  " "    " "     "*"       "*"
6 ( 1 ) " "    " "   "*"  " "    " "     "*"       "*"
7 ( 1 ) "*"    " "   " "  " "    " "     "*"       "*"
8 ( 1 ) "*"    "*"   " "  "*"    " "     "*"       "*"
        Assists Errors NewLeagueN
1 ( 1 ) " "     " "    " "
2 ( 1 ) " "     " "    " "
3 ( 1 ) " "     " "    " "
4 ( 1 ) " "     " "    " "
5 ( 1 ) " "     " "    " "
6 ( 1 ) " "     " "    " "
7 ( 1 ) " "     " "    " "
8 ( 1 ) " "     " "    " "
An asterisk indicates that a given variable is included in the corresponding
model. For instance, this output indicates that the best two-variable model
contains only Hits and CRBI.
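The variables in a given model can also be read off programmatically rather than by scanning the asterisk table. The sketch below assumes the ISLR2 and leaps packages are installed and refits the same model so that it runs on its own. coef() with a model size and the which matrix of the summary object are both provided by leaps; the names best2, reg.summary and in_model are chosen here purely for illustration.

```r
library(ISLR2)   # provides the Hitters data
library(leaps)   # provides regsubsets()

# Same preparation and fit as in the text
Hitters <- na.omit(Hitters)
regfit.full <- regsubsets(Salary ~ ., Hitters)

# Coefficient estimates of the best two-variable model
best2 <- coef(regfit.full, 2)
print(names(best2))

# summary()$which is a logical matrix: one row per model size,
# one column per term (including the intercept)
reg.summary <- summary(regfit.full)
in_model <- reg.summary$which[2, ]
print(names(in_model)[in_model])
```

Both printed name vectors should list the intercept together with Hits and CRBI, matching the second row of the table.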
Try using the regsubsets() function in the same fashion with the Boston
dataset (medv as the response and all other variables as predictors) and
store the result in regfit.full.
MC1: The three-variable model contains only age, ptratio and lstat.
Store the number of your answer in an object named MC1.
Assume that:
The MASS and leaps libraries have been loaded.
The Boston dataset has been loaded and attached.