The Default data set of the ISLR2 package contains data about ten thousand customers.
We know the balance of their bank account, their annual income, and whether they are a student.
> head(Default)
default student balance income
1 No No 729.5265 44361.625
2 No Yes 817.1804 12106.135
3 No No 1073.5492 31767.139
4 No No 529.2506 35704.494
5 No No 785.6559 38463.496
6 No Yes 919.5885 7491.559
The aim is to predict which customers will default on their credit card debt. We will estimate the test error of a logistic regression model using the validation set approach.
First, we convert the dependent variable to a numeric variable:
Default$default <- as.numeric(Default$default == "Yes")
Notice how the dependent variable default is now numeric:
> head(Default)
default student balance income
1 0 No 729.5265 44361.625
2 0 Yes 817.1804 12106.135
3 0 No 1073.5492 31767.139
4 0 No 529.2506 35704.494
5 0 No 785.6559 38463.496
6 0 Yes 919.5885 7491.559
Some of the exercises are not tested by Dodona (for example, the plots), but it is still useful to try them.
Fit a logistic regression model on the entire dataset that uses income and balance to predict default.
Store the model in the variable glm.fit1.
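A minimal sketch of this fit (assuming the Default data set is loaded and attached as described above):

```r
# Logistic regression of default on income and balance,
# fit on the full Default data set.
# family = binomial gives the logit link.
glm.fit1 <- glm(default ~ income + balance,
                data = Default, family = binomial)
summary(glm.fit1)
```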
Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:
Split the data into a training set and a validation set.
Put 50% of the data (5000 rows) in the training set and the other 50% in the validation set.
Use a seed value of 1. Store the indices of the training set in train.
Fit a multiple logistic regression model using only the training observations.
Use income and balance to predict default. Store the model in the variable glm.fit2.
Obtain a prediction of default status for each individual in the validation dataset.
In other words, compute the posterior probability of default for that individual,
and classify the individual to the "default" category if the posterior probability is greater than 0.5.
Store the result in glm.pred2. Make sure that the predicted class label is numeric (1 or 0), not boolean (TRUE or FALSE).
Compute the validation set error. Store the result in val.error2.
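The steps above can be sketched as follows (one possible arrangement; the intermediate name glm.probs is illustrative):

```r
# Validation set approach: one 50/50 split with seed 1.
set.seed(1)
train <- sample(nrow(Default), nrow(Default) / 2)  # 5000 training indices

# Fit the model on the training observations only.
glm.fit2 <- glm(default ~ income + balance,
                data = Default, family = binomial, subset = train)

# Posterior probability of default for each validation observation.
glm.probs <- predict(glm.fit2, Default[-train, ], type = "response")

# Classify as 1 when the probability exceeds 0.5;
# as.numeric() converts the logical vector to 0/1.
glm.pred2 <- as.numeric(glm.probs > 0.5)

# Validation set error: fraction of misclassified observations.
val.error2 <- mean(glm.pred2 != Default$default[-train])
val.error2
```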
Repeat the process from question 2 three times with a for loop, using three different splits of the observations into a training set and a validation set.
Outside the for loop, create an empty numeric vector of length three, stored in the variable val.error3.
You can use the function rep() to create a vector by repeating 0 three times.
Inside the for loop, use the loop index (1 through 3) to set the seed.
Next, use the code from question 2 to compute the validation error for each seed.
Change the number "2" in the variable names from question 2 to "3" so you do not overwrite your solution to question 2.
Store the validation error of the i-th iteration in the i-th index of the variable val.error3 you created.
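One way to arrange the loop (the names train3, glm.fit3, and glm.pred3 are illustrative; only val.error3 is required by the exercise):

```r
# Repeat the split/fit/predict cycle for three different seeds.
val.error3 <- rep(0, 3)  # numeric vector of length three
for (i in 1:3) {
  set.seed(i)
  train3 <- sample(nrow(Default), nrow(Default) / 2)
  glm.fit3 <- glm(default ~ income + balance,
                  data = Default, family = binomial, subset = train3)
  glm.pred3 <- as.numeric(
    predict(glm.fit3, Default[-train3, ], type = "response") > 0.5)
  # Store the error of the i-th split in the i-th position.
  val.error3[i] <- mean(glm.pred3 != Default$default[-train3])
}
val.error3
```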
Inspect the variable val.error3. Even though we use the same model, the results differ.
This happens because we split the dataset randomly each time, so the validation set is different on every iteration of our loop.
Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student.
Estimate the test error for this model using the validation set approach.
Reuse the train/validation split code from question 2 to create indices train4 with seed 1.
Store the model in glm.fit4, the predictions in glm.pred4, and the validation error in val.error4.
Make sure that the predicted class label is numeric (1 or 0), not boolean (TRUE or FALSE).
Inspect val.error4 and compare it with val.error2.
Does including a dummy variable for student lead to a reduction in the validation error rate?
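A sketch of this last model, reusing the seed-1 split from question 2 (student enters the formula directly; R creates the dummy variable for the factor automatically):

```r
# Same 50/50 split as in question 2 (seed 1).
set.seed(1)
train4 <- sample(nrow(Default), nrow(Default) / 2)

# Logistic regression with income, balance, and the student dummy.
glm.fit4 <- glm(default ~ income + balance + student,
                data = Default, family = binomial, subset = train4)

# Predicted 0/1 class labels on the validation set.
glm.pred4 <- as.numeric(
  predict(glm.fit4, Default[-train4, ], type = "response") > 0.5)

# Validation set error; compare with val.error2.
val.error4 <- mean(glm.pred4 != Default$default[-train4])
val.error4
```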
Assume that:
the ISLR2 library has been loaded
the Default dataset has been loaded and attached