Here we apply the best subset selection approach to the Hitters data. We wish to predict a baseball player’s Salary on the basis of various statistics associated with performance in the previous year. First of all, we note that the Salary variable is missing for some of the players. The is.na() function can be used to identify the missing observations. It returns a vector of the same length as the input vector, with a TRUE for any elements that are missing, and a FALSE for non-missing elements. The sum() function can then be used to count all of the missing elements.

> library(ISLR2)
> head(Hitters)
> names(Hitters)
 [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"      
 [6] "Walks"     "Years"     "CAtBat"    "CHits"     "CHmRun"   
[11] "CRuns"     "CRBI"      "CWalks"    "League"    "Division" 
[16] "PutOuts"   "Assists"   "Errors"    "Salary"    "NewLeague"
> dim(Hitters)
[1] 322  20
> sum(is.na(Hitters$Salary))
[1] 59

Hence we see that Salary is missing for 59 players. The na.omit() function removes all of the rows that have missing values in any variable.

> Hitters <- na.omit(Hitters)
> dim(Hitters)
[1] 263  20
> sum(is.na(Hitters))
[1] 0
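
The same three steps can be tried on a small made-up data frame, where the result is easy to check by eye. This is only an illustrative sketch: the `toy` data frame and its values are hypothetical and are not part of the Hitters data.

```r
# Hypothetical toy data: two of the five salaries are missing
toy <- data.frame(
  player = c("A", "B", "C", "D", "E"),
  salary = c(475, NA, 500, NA, 91.5)
)

missing <- is.na(toy$salary)   # logical vector, TRUE where salary is NA
print(missing)

n_missing <- sum(missing)      # TRUE counts as 1 and FALSE as 0
print(n_missing)

toy_complete <- na.omit(toy)   # drops every row that contains an NA
print(dim(toy_complete))
```

Because sum() treats TRUE as 1 and FALSE as 0, summing the output of is.na() gives the number of missing values directly.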

The regsubsets() function (part of the leaps library) performs best subset selection by identifying the best model that contains a given number of predictors, where best is quantified using RSS. The syntax is the same as for lm(). The summary() command outputs the best set of variables for each model size.

> library(leaps)
> regfit.full <- regsubsets(Salary ~ ., Hitters)
> summary(regfit.full)
Subset selection object
Call: regsubsets.formula(Salary ~ ., Hitters)
19 Variables  (and intercept)
...
1 subsets of each size up to 8
Selection Algorithm: exhaustive
         AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits
1  ( 1 ) " "   " "  " "   " "  " " " "   " "   " "    " "  
2  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "  
3  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "  
4  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "  
5  ( 1 ) "*"   "*"  " "   " "  " " " "   " "   " "    " "  
6  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "  
7  ( 1 ) " "   "*"  " "   " "  " " "*"   " "   "*"    "*"  
8  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "  
         CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts
1  ( 1 ) " "    " "   "*"  " "    " "     " "       " "    
2  ( 1 ) " "    " "   "*"  " "    " "     " "       " "    
3  ( 1 ) " "    " "   "*"  " "    " "     " "       "*"    
4  ( 1 ) " "    " "   "*"  " "    " "     "*"       "*"    
5  ( 1 ) " "    " "   "*"  " "    " "     "*"       "*"    
6  ( 1 ) " "    " "   "*"  " "    " "     "*"       "*"    
7  ( 1 ) "*"    " "   " "  " "    " "     "*"       "*"    
8  ( 1 ) "*"    "*"   " "  "*"    " "     "*"       "*"    
         Assists Errors NewLeagueN
1  ( 1 ) " "     " "    " "       
2  ( 1 ) " "     " "    " "       
3  ( 1 ) " "     " "    " "       
4  ( 1 ) " "     " "    " "       
5  ( 1 ) " "     " "    " "       
6  ( 1 ) " "     " "    " "       
7  ( 1 ) " "     " "    " "       
8  ( 1 ) " "     " "    " "       

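An asterisk indicates that a given variable is included in the corresponding model. For instance, this output indicates that the best two-variable model contains only Hits and CRBI. By default, regsubsets() reports results only up to the best eight-variable model, which is why the table stops at eight rows even though there are 19 predictors; the nvmax argument can be used to raise this limit.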
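
To make the phrase "best is quantified using RSS" concrete, the search can be imitated by brute force in base R. The following is only a sketch under stated assumptions: it uses the built-in mtcars data with an arbitrarily chosen set of five predictors (neither is part of this lab), the helper name `best_subset` is hypothetical, and it fits every subset with lm() rather than using the more efficient search implemented in leaps.

```r
# Brute-force best subset selection on a built-in data set (mtcars).
# The predictors below are an arbitrary, illustrative choice that keeps
# the number of candidate models small (2^5 - 1 = 31 fits).
predictors <- c("cyl", "disp", "hp", "wt", "qsec")
response   <- "mpg"

# Hypothetical helper: for each model size k, fit every combination of
# k predictors and keep the one with the smallest RSS.
best_subset <- function(data, response, predictors) {
  lapply(seq_along(predictors), function(k) {
    combos <- combn(predictors, k, simplify = FALSE)
    rss <- vapply(combos, function(vars) {
      fit <- lm(reformulate(vars, response = response), data = data)
      sum(resid(fit)^2)                 # residual sum of squares
    }, numeric(1))
    list(size = k, vars = combos[[which.min(rss)]], rss = min(rss))
  })
}

best <- best_subset(mtcars, response, predictors)

# One line per model size, analogous to one row of the asterisk table
for (b in best) {
  cat(b$size, "variable(s):", paste(b$vars, collapse = ", "),
      " RSS =", round(b$rss, 1), "\n")
}
```

Each printed line plays the role of one row in the regsubsets() summary: the variables listed are those that would receive an asterisk for that model size. The best RSS can only decrease or stay the same as the size grows, which is why RSS alone cannot be used to choose among models of different sizes.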

Try using the regsubsets() function in the same fashion on the Boston data set (with medv as the response and all other variables as predictors), and store the result in regfit.full:

MC1: The three-variable model contains only age, ptratio and lstat

Store the number of your answer in an object named MC1.


Assume that: