This step should be performed on data without missing values.
For this exercise, we take a subset of the Titanic dataset: Fare. (To continue the exercise, we also quickly impute the missing values in the test set.)
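# keep Fare as a one-column data frame so binned columns can be added later
train_X_bin <- data.frame(Fare = train_X$Fare)
test_X_bin <- data.frame(Fare = test_X$Fare)
# impute missing test fares with the training mean
test_X_bin$Fare[is.na(test_X_bin$Fare)] <- mean(train_X$Fare, na.rm = TRUE)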
We will also use the toy example from the theory lecture.
toy_bin <- c(1000, 1200, 1300, 2000, 1800, 1400)
Binning refers to translating a numeric variable into a set of two or more discrete bins/categories.
There are two main binning methods: equal frequency binning and equal interval binning.
In equal frequency binning, each bin holds the same number of observations. The width of the bins varies accordingly.
The following R function implements this. Again, the bins in the training and test sets should be identical, so the breakpoints are always computed from the training data:
bin_data_frequency <- function(x_train, x_val, bins = 5) {
  cut(x_val, breaks = quantile(x_train, seq(0, 1, 1 / bins)), include.lowest = TRUE)
}
Applying this to the toy example gives the following result:
> toy_bin_freq <- bin_data_frequency(x_train = toy_bin, x_val = toy_bin, bins = 2)
> table(toy_bin_freq)
toy_bin_freq
[1e+03,1.35e+03] (1.35e+03,2e+03]
3 3
As expected, both bins have 3 elements.
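Because the breakpoints come from the training data only, a validation value that falls outside the training range gets no bin and becomes NA. The sketch below illustrates this on the toy example and shows one possible workaround: opening the outer breaks to -Inf and Inf. The function name bin_data_frequency_open and the vector new_values are hypothetical, introduced only for this illustration.

```r
# Hypothetical variant of bin_data_frequency: the outer breaks are replaced
# by -Inf / Inf so that unseen values outside the training range get a bin.
bin_data_frequency_open <- function(x_train, x_val, bins = 5) {
  breaks <- quantile(x_train, seq(0, 1, length.out = bins + 1), na.rm = TRUE)
  breaks[1] <- -Inf
  breaks[length(breaks)] <- Inf
  cut(x_val, breaks = breaks, include.lowest = TRUE)
}

toy_bin <- c(1000, 1200, 1300, 2000, 1800, 1400)
new_values <- c(900, 1350, 2500)  # 900 and 2500 lie outside the training range

# Breaks taken directly from the training quantiles: out-of-range values are NA
closed <- cut(new_values,
              breaks = quantile(toy_bin, seq(0, 1, length.out = 3)),
              include.lowest = TRUE)
is.na(closed)

# Open-ended variant: every value lands in a bin
open <- bin_data_frequency_open(toy_bin, new_values, bins = 2)
as.integer(open)
```

Whether open-ended outer bins are appropriate depends on the data. The point is only that out-of-range values need to be handled deliberately.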
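Applying this function to the Titanic Fare predictor, with 4 bins:
> train_X_bin$Fare_freq <- bin_data_frequency(x_train = train_X_bin$Fare, x_val = train_X_bin$Fare, bins = 4)
> test_X_bin$Fare_freq <- bin_data_frequency(x_train = train_X_bin$Fare, x_val = test_X_bin$Fare, bins = 4)
> table(train_X_bin$Fare_freq)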
[0,7.91] (7.91,14.5] (14.5,31] (31,512]
223 224 222 222
> table(test_X_bin$Fare_freq)
[0,7.91] (7.91,14.5] (14.5,31] (31,512]
114 96 99 109
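One edge case to keep in mind with equal frequency binning: when a predictor has many tied values, several quantiles can coincide, and cut() rejects duplicated breaks. The sketch below uses made-up data (x_tied) to show the problem. Dropping duplicates with unique() is one possible workaround, offered here as an assumption rather than as part of the original function. It yields fewer bins than requested.

```r
# Made-up data with many ties at zero (for illustration only)
x_tied <- c(0, 0, 0, 0, 1, 2, 3, 4)

# Quartile breakpoints: the 0% and 25% quantiles are both 0
q <- quantile(x_tied, seq(0, 1, length.out = 5))
has_duplicates <- any(duplicated(q))

# cut() stops with an error when breaks are not unique
attempt <- tryCatch(
  cut(x_tied, breaks = q, include.lowest = TRUE),
  error = function(e) "error"
)

# Possible workaround: keep only the distinct breakpoints
binned <- cut(x_tied, breaks = unique(q), include.lowest = TRUE)
nlevels(binned)  # fewer bins than the 4 requested
table(as.integer(binned))
```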
Next, we do integer encoding because the levels have a logical order between them:
> train_X_bin$Fare_freq <- as.numeric(train_X_bin$Fare_freq)
> test_X_bin$Fare_freq <- as.numeric(test_X_bin$Fare_freq)
> str(train_X_bin)
'data.frame': 891 obs. of 2 variables:
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Fare_freq: num 1 4 2 4 2 2 4 3 2 3 ...
> str(test_X_bin)
'data.frame': 418 obs. of 2 variables:
$ Fare : num 7.83 7 9.69 8.66 12.29 ...
$ Fare_freq: num 1 1 2 2 2 2 1 3 1 3 ...
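To make the integer encoding step concrete: cut() returns a factor whose levels are ordered from the lowest to the highest bin, and as.numeric() on a factor returns the position of each level. A minimal sketch on the toy example, with the breakpoints 1000, 1350 and 2000 written out by hand to match the two equal frequency bins shown earlier:

```r
toy_bin <- c(1000, 1200, 1300, 2000, 1800, 1400)

# Two bins with the breakpoints from the equal frequency toy example
toy_factor <- cut(toy_bin, breaks = c(1000, 1350, 2000), include.lowest = TRUE)
levels(toy_factor)  # levels are ordered from lowest to highest bin

# Integer encoding: each observation is replaced by the index of its bin
toy_codes <- as.numeric(toy_factor)
toy_codes
```

The resulting codes preserve the order of the bins, which is why this encoding suits binned numeric predictors.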
In equal interval binning, each bin has the same width. The number of observations in each bin varies accordingly.
We need a little helper function to compute the interval breakpoints:
breaks <- function(x, bins = 5) {
  range <- range(x)
  breaks <- seq(range[1], range[2], length.out = bins + 1)
  breaks[1] <- breaks[1] - diff(range) * .001
  breaks[bins + 1] <- breaks[bins + 1] + diff(range) * .001
  return(breaks)
}
The following R function uses the helper function and returns the bins of equal length:
bin_data_interval <- function(x_train, x_val, bins = 5) {
  cut(x_val, breaks(x_train, bins))
}
Applying this to the toy example gives the following result:
> breaks(toy_bin, bins = 2)
[1] 999 1500 2001
> toy_bin_inter <- bin_data_interval(x_train = toy_bin, x_val = toy_bin, bins = 2)
> table(toy_bin_inter)
toy_bin_inter
(999,1.5e+03] (1.5e+03,2e+03]
4 2
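As expected, both bins are equally wide: apart from the small padding at the outer edges, the first bin ranges from 1000 to 1500 and the second from 1500 to 2000. The number of elements per bin now differs (4 and 2).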
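The padding in the breaks helper matters because cut() by default builds intervals that are open on the left, so a breakpoint placed exactly at the minimum would leave the smallest observation without a bin. The sketch below compares unpadded and padded breakpoints on the toy example. The variable names (raw_breaks, padded_breaks, pad) are chosen here for illustration only.

```r
toy_bin <- c(1000, 1200, 1300, 2000, 1800, 1400)

# Equally spaced breakpoints without padding: 1000, 1500, 2000
raw_breaks <- seq(min(toy_bin), max(toy_bin), length.out = 3)
unpadded <- cut(toy_bin, raw_breaks)
n_missing_unpadded <- sum(is.na(unpadded))  # the minimum (1000) falls outside (1000, 1500]

# Same breakpoints with the outer edges widened by 0.1% of the range
pad <- diff(range(toy_bin)) * 0.001
padded_breaks <- raw_breaks
padded_breaks[1] <- padded_breaks[1] - pad
padded_breaks[3] <- padded_breaks[3] + pad
padded <- cut(toy_bin, padded_breaks)
n_missing_padded <- sum(is.na(padded))

n_missing_unpadded
n_missing_padded
table(as.integer(padded))
```

An alternative would be to pass include.lowest = TRUE to cut(), as the equal frequency function does. The padding achieves the same effect while keeping the call simple.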
Applying this function to the Titanic Fare predictor:
> train_X_bin$Fare_inter <- bin_data_interval(x_train = train_X_bin$Fare, x_val = train_X_bin$Fare, bins = 4)
> test_X_bin$Fare_inter <- bin_data_interval(x_train = train_X_bin$Fare, x_val = test_X_bin$Fare, bins = 4)
> table(train_X_bin$Fare_inter)
(-0.512,128] (128,256] (256,384] (384,513]
853 29 6 3
> table(test_X_bin$Fare_inter)
(-0.512,128] (128,256] (256,384] (384,513]
389 21 7 1
Next, we do integer encoding because the levels have a logical order between them:
> train_X_bin$Fare_inter <- as.numeric(train_X_bin$Fare_inter)
> test_X_bin$Fare_inter <- as.numeric(test_X_bin$Fare_inter)
> str(train_X_bin)
'data.frame': 891 obs. of 2 variables:
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Fare_inter: num 1 1 1 1 1 1 1 1 1 1 ...
> str(test_X_bin)
'data.frame': 418 obs. of 2 variables:
$ Fare : num 7.83 7 9.69 8.66 12.29 ...
$ Fare_inter: num 1 1 1 1 1 1 1 1 1 1 ...
For the exercise, use the predictors LotArea and MSSubClass (see boilerplate). Both are numerical.
Apply equal frequency binning to the LotArea predictor, with 3 bins. Add the binned predictor to both train_X_bin and test_X_bin as LotArea_freq.
Apply equal interval binning to the MSSubClass predictor, with 6 bins. Add the binned predictor to both train_X_bin and test_X_bin as MSSubClass_inter.
Assume that the train_X, train_y, and test_X datasets have been loaded.