This step should be performed on data without missing values.

For this exercise, we use a single predictor from the Titanic dataset: Fare. To be able to continue the exercise, we also quickly impute the missing values in the test set with the training mean.

# Keep Fare as a one-column data frame so that binned columns can be added later
train_X_bin <- train_X["Fare"]
test_X_bin <- test_X["Fare"]
test_X_bin$Fare[is.na(test_X_bin$Fare)] <- mean(train_X$Fare, na.rm = TRUE)

We will also use the toy example from the theory lecture.

toy_bin <- c(1000, 1200, 1300, 2000, 1800, 1400)

Binning refers to translating a numeric variable into a set of two or more discrete bins/categories.
There are 2 main binning methods:

A. Equal frequency binning

Each bin contains (approximately) the same number of observations. The bin widths vary accordingly.

The following R function implements this. Again, the bin boundaries are computed on the training set and applied unchanged to the test set, so that both sets share the same bins:

bin_data_frequency <- function(x_train, x_val, bins = 5) {
  cut(x_val, breaks = quantile(x_train, seq(0, 1, 1 / bins)), include.lowest = TRUE)
}

Applying this to the toy example gives the following result:

> toy_bin_freq <- bin_data_frequency(x_train = toy_bin, x_val = toy_bin, bins = 2)
> table(toy_bin_freq)
toy_bin_freq
[1e+03,1.35e+03] (1.35e+03,2e+03] 
               3                3 

As expected, both bins have 3 elements.
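
To see where the boundary of 1350 comes from, the breakpoints can be inspected directly. The following is a minimal sketch that repeats the toy example step by step; the names toy_breaks and toy_bins are only used for illustration here.

```r
# Toy data from the text
toy_bin <- c(1000, 1200, 1300, 2000, 1800, 1400)

# Breakpoints for 2 equal-frequency bins: the 0%, 50% and 100% quantiles
toy_breaks <- quantile(toy_bin, probs = seq(0, 1, 1 / 2))
print(toy_breaks)

# include.lowest = TRUE keeps the minimum (1000) inside the first bin
toy_bins <- cut(toy_bin, breaks = toy_breaks, include.lowest = TRUE)
print(table(toy_bins))
```

The middle breakpoint is the median of the six values, which splits them into two groups of three.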

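Applying this function to the Titanic Fare predictor, with 4 bins:

> train_X_bin$Fare_freq <- bin_data_frequency(x_train = train_X_bin$Fare, x_val = train_X_bin$Fare, bins = 4)
> test_X_bin$Fare_freq <- bin_data_frequency(x_train = train_X_bin$Fare, x_val = test_X_bin$Fare, bins = 4)
> table(train_X_bin$Fare_freq)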

   [0,7.91] (7.91,14.5]   (14.5,31]    (31,512] 
        223         224         222         222 
> table(test_X_bin$Fare_freq)

   [0,7.91] (7.91,14.5]   (14.5,31]    (31,512] 
        114          96          99         109 
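
Here all 418 test observations fall inside the training bins. That is not guaranteed in general: because the breakpoints come from the training data only, a new value below the training minimum or above the training maximum is not covered by any bin. The sketch below illustrates this with made-up values (x_train and x_new are hypothetical, not taken from the Titanic data).

```r
# Same function as defined above
bin_data_frequency <- function(x_train, x_val, bins = 5) {
  cut(x_val, breaks = quantile(x_train, seq(0, 1, 1 / bins)), include.lowest = TRUE)
}

x_train <- c(10, 20, 30, 40)  # hypothetical training values
x_new <- c(5, 25, 50)         # 5 and 50 lie outside the training range

binned <- bin_data_frequency(x_train, x_new, bins = 2)
print(binned)

# Values outside [min(x_train), max(x_train)] become NA
print(is.na(binned))
```

If such values can occur, one option is to check for NA after binning and decide how to treat them before modelling.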

Next, we apply integer encoding, because the bins have a natural order:

> train_X_bin$Fare_freq <- as.numeric(train_X_bin$Fare_freq)
> test_X_bin$Fare_freq <- as.numeric(test_X_bin$Fare_freq)
> str(train_X_bin)
'data.frame':	891 obs. of  2 variables:
 $ Fare     : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Fare_freq: num  1 4 2 4 2 2 4 3 2 3 ...
> str(test_X_bin)
'data.frame':	418 obs. of  2 variables:
 $ Fare     : num  7.83 7 9.69 8.66 12.29 ...
 $ Fare_freq: num  1 1 2 2 2 2 1 3 1 3 ...
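
The integer encoding works because cut() returns a factor whose levels are ordered from the lowest to the highest interval, and as.numeric() on a factor returns the position of each level. A small sketch on the toy data, using the breakpoints from the toy example (the names toy_factor and toy_codes are illustrative):

```r
toy_bin <- c(1000, 1200, 1300, 2000, 1800, 1400)

# Two bins with the toy breakpoints; levels are ordered low to high
toy_factor <- cut(toy_bin, breaks = c(1000, 1350, 2000), include.lowest = TRUE)
print(levels(toy_factor))

# as.numeric() on a factor gives the level index: 1 = lowest bin, 2 = next bin
toy_codes <- as.numeric(toy_factor)
print(toy_codes)
```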



B. Equal interval binning

Each bin spans an interval of the same width. The number of observations per bin varies accordingly.

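We need a small helper function to compute the interval breakpoints. It widens the two outer breakpoints by 0.1% of the range, so that the minimum and maximum values fall inside the bins: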

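breaks <- function(x, bins = 5) {
  # Minimum and maximum of the data
  rng <- range(x)
  # bins + 1 equally spaced breakpoints between min and max
  breaks <- seq(rng[1], rng[2], length.out = bins + 1)
  # Widen the outer breakpoints by 0.1% of the range
  breaks[1] <- breaks[1] - diff(rng) * .001
  breaks[bins + 1] <- breaks[bins + 1] + diff(rng) * .001
  return(breaks)
}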

The following R function uses the helper function and returns the bins of equal length:

bin_data_interval <- function(x_train, x_val, bins = 5) {
  cut(x_val, breaks(x_train, bins))
}

Applying this to the toy example gives the following result:

> breaks(toy_bin, bins = 2)
[1]  999 1500 2001
> toy_bin_inter <- bin_data_interval(x_train = toy_bin, x_val = toy_bin, bins = 2)
> table(toy_bin_inter)
toy_bin_inter
  (999,1.5e+03] (1.5e+03,2e+03] 
              4               2 

As expected, the first bin covers 1000 to 1500 and the second covers 1500 to 2000. The outer breakpoints (999 and 2001) are slightly wider so that the extreme values are included.
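
The reason for widening the outer breakpoints can be checked on the toy data. By default, cut() uses intervals that are open on the left and closed on the right, so a breakpoint placed exactly at the minimum leaves that value without a bin. The following sketch compares both variants; the names raw_breaks, wide_breaks and pad are illustrative only.

```r
toy_bin <- c(1000, 1200, 1300, 2000, 1800, 1400)

# Unadjusted breakpoints: exactly min, midpoint, max (1000, 1500, 2000)
raw_breaks <- seq(min(toy_bin), max(toy_bin), length.out = 3)

# The minimum (1000) is not in (1000, 1500], so it becomes NA
raw_binned <- cut(toy_bin, raw_breaks)
print(sum(is.na(raw_binned)))

# Widened breakpoints, following the same 0.1% rule as the helper function
pad <- diff(range(toy_bin)) * 0.001
wide_breaks <- raw_breaks
wide_breaks[1] <- wide_breaks[1] - pad
wide_breaks[3] <- wide_breaks[3] + pad

wide_binned <- cut(toy_bin, wide_breaks)
print(sum(is.na(wide_binned)))
print(table(wide_binned))
```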

Applying this function to the Titanic Fare predictor:

> train_X_bin$Fare_inter <- bin_data_interval(x_train = train_X_bin$Fare, x_val = train_X_bin$Fare, bins = 4)
> test_X_bin$Fare_inter <- bin_data_interval(x_train = train_X_bin$Fare, x_val = test_X_bin$Fare, bins = 4)
> table(train_X_bin$Fare_inter)

(-0.512,128]    (128,256]    (256,384]    (384,513] 
         853           29            6            3 
> table(test_X_bin$Fare_inter)

(-0.512,128]    (128,256]    (256,384]    (384,513] 
         389           21            7            1 

Next, we apply integer encoding, because the bins have a natural order:

> train_X_bin$Fare_inter <- as.numeric(train_X_bin$Fare_inter)
> test_X_bin$Fare_inter <- as.numeric(test_X_bin$Fare_inter)
> str(train_X_bin)
'data.frame':	891 obs. of  2 variables:
 $ Fare      : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Fare_inter: num  1 1 1 1 1 1 1 1 1 1 ...
> str(test_X_bin)
'data.frame':	418 obs. of  2 variables:
 $ Fare      : num  7.83 7 9.69 8.66 12.29 ...
 $ Fare_inter: num  1 1 1 1 1 1 1 1 1 1 ...



Questions

  1. Start from the House Prices dataset you imported in exercise 1 and apply the same preprocessing steps (see boilerplate).
  2. Take a subset of 2 predictors: LotArea and MSSubClass (see boilerplate). Both are numerical.
  3. Apply equal frequency binning on the LotArea predictor, with 3 bins. Add the binned predictor to both train_X_bin and test_X_bin as LotArea_freq.
  4. Apply equal interval binning on the MSSubClass predictor, with 6 bins. Add the binned predictor to both train_X_bin and test_X_bin as MSSubClass_inter.

Assume that: