The class distribution in a classification problem is often highly skewed (e.g. 99/1 in fraud detection).

We again consider the Titanic dataset, which is a binary classification dataset. Its class distribution is roughly 60/40 (more precisely, about 62/38).

> table(train_y)
train_y
  0   1 
549 342 
> table(train_y)/length(train_y)
train_y
        0         1 
0.6161616 0.3838384 

The class imbalance is small here and it is not necessary to correct it. Nonetheless, for educational purposes, we show how to create a more even distribution. Note that we only change the training set; we do not touch the test set.

The 2 main re-sampling techniques are:

A. Undersampling

Undersampling: randomly select examples from the majority class and remove them.

First, we store the indices of survivors (1) and non-survivors (0).

survivors <- which(train_y == 1)
nonsurvivors <- which(train_y == 0)

In undersampling, we want to bring the number of non-survivors (549) down to the number of survivors (342). This yields a 50/50 distribution. We keep all survivors and only a random subset of the non-survivors.

n_desired <- length(survivors)
set.seed(42)
resampled_nonsurvivors <- sample(x = nonsurvivors, size = n_desired, replace = FALSE)
train_X_undersample <- train_X[c(resampled_nonsurvivors, survivors), ]
train_y_undersample <- train_y[c(resampled_nonsurvivors, survivors)]

Now, we can check the class distribution:

> table(train_y_undersample)
train_y_undersample
  0   1 
342 342 
> table(train_y_undersample)/length(train_y_undersample)
train_y_undersample
  0   1 
0.5 0.5 
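
The undersampling steps above can be wrapped in a small function. The following is a minimal sketch under stated assumptions: undersample(), toy_X and toy_y are hypothetical names that do not appear elsewhere in this material, and the toy data merely stands in for train_X and train_y. The sketch draws positions with sample.int() rather than passing the index vector to sample(), because sample(x, ...) treats a single number x as the range 1:x, which would give wrong indices if a class had only one member.

```r
# Toy data standing in for train_X / train_y (hypothetical, not the Titanic data)
set.seed(42)
toy_y <- c(rep(0, 90), rep(1, 10))
toy_X <- data.frame(feature = rnorm(100))

# Hypothetical helper: shrink the majority class to the size of the minority class
undersample <- function(X, y, majority = 0, minority = 1) {
  maj_idx <- which(y == majority)
  min_idx <- which(y == minority)
  # Draw positions within maj_idx, without replacement
  keep_maj <- maj_idx[sample.int(length(maj_idx), size = length(min_idx))]
  idx <- c(keep_maj, min_idx)
  list(X = X[idx, , drop = FALSE], y = y[idx])
}

toy_under <- undersample(toy_X, toy_y)
table(toy_under$y)  # both classes now have 10 rows
```

The drop = FALSE argument keeps the result a data frame even when X has a single column.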



B. Oversampling

Oversampling: randomly select examples from the minority class, with replacement, and add the selected copies to the training set.

Similarly, we first store the indices of survivors (1) and non-survivors (0).

survivors <- which(train_y == 1)
nonsurvivors <- which(train_y == 0)

In oversampling, we want to bring the number of survivors (342) up to the number of non-survivors (549). This yields a 50/50 distribution. Here, we retain the original survivors and add the resampled survivors.

n_desired <- length(nonsurvivors)
set.seed(42)
resampled_survivors <- sample(x = survivors, size = (n_desired - length(survivors)), replace = TRUE)
train_X_oversample <- train_X[c(survivors, resampled_survivors, nonsurvivors), ]
train_y_oversample <- train_y[c(survivors, resampled_survivors, nonsurvivors)]

Now, we can check the class distribution:

> table(train_y_oversample)
train_y_oversample
  0   1 
549 549 
> table(train_y_oversample)/length(train_y_oversample)
train_y_oversample
  0   1 
0.5 0.5 

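To get a ratio other than 50/50, adapt the variable n_desired. For example, in oversampling, the line below targets a 2/1 ratio (i.e. 2 parts majority, 1 part minority). The result is rounded so that the sample size is a whole number. Note that a 2/1 ratio is more unbalanced than the Titanic data (549/342), so it does not make much sense here: n_desired (about 274) is smaller than the current number of survivors (342), the sample size n_desired - length(survivors) becomes negative, and sample() fails. The line is only useful when the data is more unbalanced than the target ratio.

n_desired <- round(length(nonsurvivors)*(1/2))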
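
Since the 2/1 target cannot be demonstrated on the Titanic data, here is a minimal sketch on hypothetical toy labels (900 majority, 100 minority). All names starting with toy_, as well as majority, minority and n_extra, are assumptions made for illustration. The guard on n_extra makes the "target already exceeded" case explicit instead of letting sample.int() fail on a negative size.

```r
# Toy labels (hypothetical): 900 majority (0) and 100 minority (1)
toy_y <- c(rep(0, 900), rep(1, 100))
majority <- which(toy_y == 0)
minority <- which(toy_y == 1)

# Target 2/1: the minority class should have half as many rows as the majority class
n_desired <- round(length(majority) * (1/2))  # 450
n_extra   <- n_desired - length(minority)     # 350 extra copies needed
stopifnot(n_extra >= 0)                       # otherwise the target is already exceeded

set.seed(42)
# Draw positions within minority, with replacement
resampled_minority <- minority[sample.int(length(minority), size = n_extra, replace = TRUE)]

# Keep every original row and add the resampled minority copies
toy_y_oversample <- toy_y[c(minority, resampled_minority, majority)]
table(toy_y_oversample)  # 900 zeros and 450 ones
```

The same index vector c(minority, resampled_minority, majority) would be used to subset the rows of the feature matrix, as in the Titanic example.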



Questions

We cannot use the House Prices dataset because that is a regression problem. Therefore, we use the Default dataset from the ISLR2 package.

  1. Imitate the Kaggle data structure (see boilerplate).
  2. Inspect the class distribution. Store the proportion table in prop_class.
  3. Store the indices of the defaulters in defaulters, and the indices of the non-defaulters in nondefaulters.
  4. Bring the current distribution to an 80/20 distribution using oversampling.
    1. Store the intermediary steps in n_desired and resampled_defaulters.
    2. Store the final oversampled data in train_X_oversample and train_y_oversample.
  5. Inspect the class distribution. Store the proportion table in prop_class_oversample. Do you get the 80/20 distribution?

Assume that: