The class distribution in a classification problem is often highly skewed (e.g. 99/1 in fraud detection).

We again consider the Titanic dataset, which is a binary classification dataset. Its class distribution is roughly 60/40 (non-survivors/survivors).

> table(train_y)
train_y
  0   1 
549 342 
> table(train_y)/length(train_y)
train_y
        0         1 
0.6161616 0.3838384 
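
The same proportions can also be computed with base R's prop.table(). The snippet below is a minimal sketch: train_y_demo and prop_demo are hypothetical names, and the label vector is a stand-in rebuilt from the counts shown above, not the real train_y.

```r
# Stand-in for train_y, rebuilt from the counts above (549 zeros, 342 ones)
train_y_demo <- rep(c(0, 1), times = c(549, 342))

# prop.table() converts a table of counts into proportions
prop_demo <- prop.table(table(train_y_demo))
print(round(prop_demo, 3))
```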

The class imbalance is small here and it is not necessary to correct it. Nonetheless, for educational purposes, we show how to create a more even distribution. Note that we only change the training set; we do not touch the test set.

The 2 main re-sampling techniques are:

A. Undersampling

Undersampling: randomly select examples from the majority class and remove them.

First, we store the indices of survivors (1) and non-survivors (0).

survivors <- which(train_y == 1)
nonsurvivors <- which(train_y == 0)

In undersampling, we want to bring the number of non-survivors (549) down to the number of survivors (342). This yields a 50/50 distribution.

n_desired <- length(survivors)
set.seed(42)
resampled_nonsurvivors <- sample(x = nonsurvivors, size = n_desired, replace = FALSE)
train_X_undersample <- train_X[c(resampled_nonsurvivors, survivors), ]
train_y_undersample <- train_y[c(resampled_nonsurvivors, survivors)]

Now, we can check the class distribution:

> table(train_y_undersample)
train_y_undersample
  0   1 
342 342 
> table(train_y_undersample)/length(train_y_undersample)
train_y_undersample
  0   1 
0.5 0.5 
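
The undersampling steps above could be wrapped in a small helper. This is only a sketch under stated assumptions: undersample_idx, y_toy and X_toy are hypothetical names, the labels are assumed to be coded 0/1 with 0 as the majority class, and the toy data is made up for illustration (it is not the Titanic set).

```r
# Hypothetical helper: returns the row indices of a balanced, undersampled set.
# Assumes y is a 0/1 vector and that class 0 is the majority class.
undersample_idx <- function(y, seed = 42) {
  minority <- which(y == 1)
  majority <- which(y == 0)
  set.seed(seed)
  # Keep only as many majority rows as there are minority rows
  kept_majority <- sample(x = majority, size = length(minority), replace = FALSE)
  c(kept_majority, minority)
}

# Toy data (not the Titanic set): 8 majority rows, 3 minority rows
y_toy <- c(0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0)
X_toy <- data.frame(id = seq_along(y_toy), feature = y_toy * 10)

idx <- undersample_idx(y_toy)
X_toy_under <- X_toy[idx, ]
y_toy_under <- y_toy[idx]
print(table(y_toy_under))
```

Returning indices rather than data lets the same selection be applied to both the features and the labels, so the rows stay aligned.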

B. Oversampling

Oversampling: randomly select examples from the minority class, with replacement, and add these duplicates to the training set.

Similarly, we first store the indices of survivors (1) and non-survivors (0).

survivors <- which(train_y == 1)
nonsurvivors <- which(train_y == 0)

In oversampling, we want to bring the number of survivors (342) up to the number of non-survivors (549). This yields a 50/50 distribution. Here, we retain the original survivors and add the resampled survivors.

n_desired <- length(nonsurvivors)
set.seed(42)
resampled_survivors <- sample(x = survivors, size = (n_desired - length(survivors)), replace = TRUE)
train_X_oversample <- train_X[c(survivors, resampled_survivors, nonsurvivors), ]
train_y_oversample <- train_y[c(survivors, resampled_survivors, nonsurvivors)]

Now, we can check the class distribution:

> table(train_y_oversample)
train_y_oversample
  0   1 
549 549 
> table(train_y_oversample)/length(train_y_oversample)
train_y_oversample
  0   1 
0.5 0.5 
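
To check that oversampling only adds rows and never drops any, the procedure can be replayed on a tiny example. This is a sketch on made-up labels (not the Titanic set); y_toy, n_extra, extra and idx_over are hypothetical names.

```r
# Toy labels (not the Titanic set): 8 majority (0) and 3 minority (1)
y_toy <- c(0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0)
minority <- which(y_toy == 1)
majority <- which(y_toy == 0)

set.seed(42)
# Draw the 5 extra minority rows needed to reach 8, with replacement
n_extra <- length(majority) - length(minority)
extra <- sample(x = minority, size = n_extra, replace = TRUE)
idx_over <- c(minority, extra, majority)

y_toy_over <- y_toy[idx_over]
print(table(y_toy_over))                    # both classes now have 8 rows
print(all(seq_along(y_toy) %in% idx_over))  # TRUE: no original row was dropped
print(anyDuplicated(idx_over) > 0)          # TRUE: some minority rows repeat
```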

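To get a ratio other than 50/50, adapt the variable n_desired. For example, in oversampling, the snippet below targets a 2/1 ratio (i.e. 2 parts majority, 1 part minority). Note that a 2/1 ratio is more unbalanced than the Titanic data, so it does not make much sense here: n_desired would be lower than the 342 survivors already present, the size passed to sample() would be negative, and the call would fail. The snippet is only meaningful for a dataset that is more imbalanced than the target ratio. round() keeps n_desired a whole number.

n_desired <- round(length(nonsurvivors) * (1/2))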
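
The ratio adjustment can be tried on data where it does apply. The following is a sketch on made-up labels with a 90/10 distribution (not the Titanic set); y_toy, n_extra and extra are hypothetical names. For a target of a parts majority to b parts minority, the desired minority count is the majority count times b/a.

```r
# Toy labels (not the Titanic set): 900 majority (0) and 100 minority (1)
y_toy <- rep(c(0, 1), times = c(900, 100))
minority <- which(y_toy == 1)
majority <- which(y_toy == 0)

# Target 2 parts majority to 1 part minority: 900 * (1/2) = 450 minority rows
n_desired <- round(length(majority) * (1 / 2))
n_extra <- n_desired - length(minority)
# A negative value would mean oversampling cannot reach the target ratio
stopifnot(n_extra >= 0)

set.seed(42)
extra <- sample(x = minority, size = n_extra, replace = TRUE)
y_toy_over <- y_toy[c(minority, extra, majority)]
print(table(y_toy_over))  # 900 zeros, 450 ones
```

The stopifnot() guard makes the failure explicit when the data is already more balanced than the target, which is the case for the Titanic training set with a 2/1 target.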

Questions

We cannot use the House Prices dataset because that is a regression problem. Therefore, we use the Default dataset from the ISLR2 library.

Imitate the Kaggle data structure (see boilerplate).
Inspect the class distribution. Store the proportion table in prop_class.
Store the indices of the defaulters in defaulters, and the indices of the non-defaulters in nondefaulters.
Bring the current distribution to an 80/20 distribution using oversampling.
1. Store the intermediary steps in n_desired and resampled_defaulters.
2. Store the final oversampled data in train_X_oversample and train_y_oversample.
Inspect the class distribution. Store the proportion table in prop_class_oversample. Do you get the 80/20 distribution?

Assume that:

The ISLR2 library has been loaded
The Default dataset has been loaded