The Default data set from the ISLR2 package contains data on ten thousand credit card customers. For each customer we know their credit card balance, their annual income, whether they are a student, and whether they defaulted on their debt.

> head(Default)
  default student   balance    income
1      No      No  729.5265 44361.625
2      No     Yes  817.1804 12106.135
3      No      No 1073.5492 31767.139
4      No      No  529.2506 35704.494
5      No      No  785.6559 38463.496
6      No     Yes  919.5885  7491.559

The aim is to predict which customers will default on their credit card debt. We will estimate the test error of a logistic regression model using the validation set approach.

First, we convert the dependent variable to a numeric variable:

Default$default <- as.numeric(Default$default == "Yes")

Notice how the dependent variable default is now numeric:

> head(Default)
  default student   balance    income
1       0      No  729.5265 44361.625
2       0     Yes  817.1804 12106.135
3       0      No 1073.5492 31767.139
4       0      No  529.2506 35704.494
5       0      No  785.6559 38463.496
6       0     Yes  919.5885  7491.559
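If the ISLR2 package is not yet loaded in your R session, a minimal sketch (assuming the package is installed):

library(ISLR2)   # makes the Default data set available
head(Default)    # first six rows, as shown above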

Questions

Some of the exercises are not tested by Dodona (for example the plots), but it is still useful to try them.

  1. Fit a logistic regression model on the entire data set that uses income and balance to predict default. Store the model in the variable glm.fit1. (A sketch covering this and the next question follows the steps of question 2.)

  2. Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:

    1. Split the data into a training set and a validation set: put 50% of the observations (5000 rows) in the training set and the other 50% in the validation set. Use a seed value of 1. Store the indices of the training observations in train.

    2. Fit a multiple logistic regression model using only the training observations. Use income and balance to predict default. Store the model in the variable glm.fit2.

    3. Obtain a prediction of default status for each individual in the validation set. In other words, compute the posterior probability of default for each individual, and classify the individual as “default” if the posterior probability is greater than 0.5. Store the result in glm.pred2. Make sure that the predicted class label is numeric (1 or 0), not boolean (TRUE or FALSE).

    4. Compute the validation set error, i.e. the fraction of misclassified observations in the validation set. Store the result in val.error2. These four steps are sketched below.
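    A minimal sketch of question 1 and steps 1 to 4 above, assuming default has been recoded as shown earlier; the use of sample() for the split is one possible choice:

    # Question 1: logistic regression on the full data set
    glm.fit1 <- glm(default ~ income + balance, data = Default, family = binomial)

    # Step 1: 50/50 split with seed 1
    set.seed(1)
    train <- sample(nrow(Default), nrow(Default) / 2)

    # Step 2: refit using only the training observations
    glm.fit2 <- glm(default ~ income + balance, data = Default,
                    family = binomial, subset = train)

    # Step 3: posterior probabilities and numeric class labels (1 = default)
    glm.pred2 <- as.numeric(predict(glm.fit2, Default[-train, ], type = "response") > 0.5)

    # Step 4: fraction of misclassified validation observations
    val.error2 <- mean(glm.pred2 != Default$default[-train])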

  3. Repeat the process in question 2 three times with a for loop, using three different splits of the observations into a training set and a validation set.

    1. Outside the for loop, create a numeric vector of length three filled with zeros and store it in the variable val.error3. You can use the function rep() to repeat 0 three times.

    2. Inside the for loop, use the loop index (running from 1 to 3) to set the seed.

    3. Next, use the code of question 2 to compute the validation error for each seed. Change the number “2” in the variable names of question 2 to “3” so you do not overwrite your solution to question 2. Store the validation error of the i-th iteration in the i-th element of the vector val.error3 you created.

    4. Inspect the variable val.error3. Even though we fit the same model each time, the three error estimates differ, because every seed produces a different random split and therefore a different validation set. A sketch of the loop follows.
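    A minimal sketch of the loop; storing the training indices in train3 is an assumption, the assignment only fixes the names glm.fit3, glm.pred3 and val.error3:

    val.error3 <- rep(0, 3)
    for (i in 1:3) {
      set.seed(i)                                          # seed = loop index
      train3 <- sample(nrow(Default), nrow(Default) / 2)   # new split each iteration
      glm.fit3 <- glm(default ~ income + balance, data = Default,
                      family = binomial, subset = train3)
      glm.pred3 <- as.numeric(predict(glm.fit3, Default[-train3, ],
                                      type = "response") > 0.5)
      val.error3[i] <- mean(glm.pred3 != Default$default[-train3])
    }
    val.error3   # three different estimates of the test error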


  4. Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Reuse the train/validation split code of question 2 to create indices train4 with seed 1. Store the model in glm.fit4, the predictions in glm.pred4, and the validation error in val.error4. Make sure that the predicted class label is numeric (1 or 0), not boolean (TRUE or FALSE). A sketch follows the multiple-choice question below.

    • MC1:
      Inspect val.error4 and compare with val.error2. Does including a dummy variable for student lead to a reduction in the validation error rate?
      • 1: Yes, the validation error rate is lower
      • 2: No, the validation error rate is higher
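    A minimal sketch for this question; because student is a factor, glm() creates the dummy variable automatically:

    set.seed(1)
    train4 <- sample(nrow(Default), nrow(Default) / 2)
    glm.fit4 <- glm(default ~ income + balance + student, data = Default,
                    family = binomial, subset = train4)
    glm.pred4 <- as.numeric(predict(glm.fit4, Default[-train4, ],
                                    type = "response") > 0.5)
    val.error4 <- mean(glm.pred4 != Default$default[-train4])

    Comparing val.error4 with val.error2 then answers MC1.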


Assume that: