The class distribution in a classification problem is often highly skewed (e.g. 99/1 in fraud detection).
We again consider the Titanic dataset, which is a binary classification dataset. Its class distribution is roughly 60/40 (about 62% non-survivors and 38% survivors).
> table(train_y)
train_y
  0   1
549 342
> table(train_y)/length(train_y)
train_y
        0         1
0.6161616 0.3838384
The class imbalance is small here and it is not necessary to correct it. Nonetheless, for educational purposes, we show how to create a more even distribution. Note that we only change the training set; we do not touch the test set.
The 2 main re-sampling techniques are:
Undersampling: randomly select examples from the majority class and remove them.
First, we store the indices of survivors (1) and non-survivors (0).
survivors <- which(train_y == 1)
nonsurvivors <- which(train_y == 0)
In undersampling, we want to bring the number of non-survivors (549) down to the number of survivors (342). This yields a 50/50 distribution.
n_desired <- length(survivors)
set.seed(42)
resampled_nonsurvivors <- sample(x = nonsurvivors, size = n_desired, replace = FALSE)
train_X_undersample <- train_X[c(resampled_nonsurvivors, survivors), ]
train_y_undersample <- train_y[c(resampled_nonsurvivors, survivors)]
Now, we can check the class distribution:
> table(train_y_undersample)
train_y_undersample
  0   1
342 342
> table(train_y_undersample)/length(train_y_undersample)
train_y_undersample
  0   1
0.5 0.5
Oversampling: randomly select examples from the minority class, with replacement, and add these copies to the training set.
Similarly, we first store the indices of survivors (1) and non-survivors (0).
survivors <- which(train_y == 1)
nonsurvivors <- which(train_y == 0)
In oversampling, we want to bring the number of survivors (342) up to the number of non-survivors (549). This yields a 50/50 distribution. Here, we retain the original survivors and add the resampled survivors.
n_desired <- length(nonsurvivors)
set.seed(42)
resampled_survivors <- sample(x = survivors, size = (n_desired - length(survivors)), replace = TRUE)
train_X_oversample <- train_X[c(survivors, resampled_survivors, nonsurvivors), ]
train_y_oversample <- train_y[c(survivors, resampled_survivors, nonsurvivors)]
Now, we can check the class distribution:
> table(train_y_oversample)
train_y_oversample
  0   1
549 549
> table(train_y_oversample)/length(train_y_oversample)
train_y_oversample
  0   1
0.5 0.5
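The two techniques change the training set in different ways: undersampling discards majority rows, whereas oversampling repeats minority rows. The sketch below makes that visible on synthetic labels. Here `y_toy` is a hypothetical stand-in for `train_y` with the same 549/342 split, so this is an illustration under that assumption rather than part of the original analysis.

```r
# Hypothetical stand-in for train_y, with the same class counts as above
y_toy <- c(rep(0, 549), rep(1, 342))
pos <- which(y_toy == 1)  # minority indices
neg <- which(y_toy == 0)  # majority indices

set.seed(42)
# Undersampling: keep a random subset of the majority class, without repeats
idx_under <- c(sample(neg, size = length(pos), replace = FALSE), pos)
# Oversampling: keep every row and add repeated minority rows
idx_over <- c(pos, sample(pos, size = length(neg) - length(pos), replace = TRUE), neg)

length(idx_under)          # 684 rows: 207 majority rows were dropped
anyDuplicated(idx_under)   # 0: every retained row appears exactly once
length(idx_over)           # 1098 rows
sum(duplicated(idx_over))  # 207: the added rows are all copies of minority rows
```

Because the oversampled rows are exact copies, the resampling should happen after the train/test split; otherwise copies of the same row could end up on both sides of the split.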
To get a ratio other than 50/50, adapt the variable n_desired. For example, in oversampling, the following line targets a 2/1 ratio (i.e. 2 parts majority, 1 part minority):
n_desired <- length(nonsurvivors)*(1/2)
This target is more unbalanced than the Titanic data: n_desired would be 274.5, which is below the 342 survivors already present, so it does not make much sense here.
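The factor 1/2 follows from the target ratio: if the majority count stays fixed and the minority class should make up a share p of the training set, the required minority count is n_majority * p / (1 - p), and p = 1/3 gives 1/2. A minimal sketch on synthetic labels, where `n_minority_for` is a hypothetical helper (not from the original text) and the 900/100 data are made up for illustration:

```r
# Hypothetical helper: minority count needed so that the minority class
# makes up a share `p_minority` of the set, with the majority count fixed
n_minority_for <- function(n_majority, p_minority) {
  stopifnot(p_minority > 0, p_minority < 1)
  round(n_majority * p_minority / (1 - p_minority))
}

# Synthetic labels: 900 majority (0) and 100 minority (1) examples
y <- c(rep(0, 900), rep(1, 100))
minority <- which(y == 1)
majority <- which(y == 0)

n_desired <- n_minority_for(length(majority), p_minority = 0.30)
n_extra <- n_desired - length(minority)
stopifnot(n_extra >= 0)  # oversampling can only add examples

set.seed(42)
# sample.int avoids the pitfall that sample(x, ...) treats a length-one
# numeric x as the range 1:x
resampled <- minority[sample.int(length(minority), size = n_extra, replace = TRUE)]
y_over <- y[c(minority, resampled, majority)]
table(y_over) / length(y_over)  # close to 70/30
```

The check on `n_extra` catches targets such as the 2/1 example on the Titanic data, where the desired minority count is below the number of minority examples already present.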
We cannot use the House Prices dataset because that is a regression problem. Therefore, we use the Default dataset from the ISLR2 library.
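Store the class distribution in prop_class.
Store the indices of the defaulters in defaulters, and the indices of the non-defaulters in nondefaulters.
Oversample the defaulters towards an 80/20 distribution, using n_desired and resampled_defaulters.
Store the oversampled data in train_X_oversample and train_y_oversample.
Store the new class distribution in prop_class_oversample. Do you get the 80/20 distribution?
Assume that:
The ISLR2 library has been loaded.
The Default dataset has been loaded.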