This step should be performed on data without missing values. However, in this exercise, we don’t do that to avoid dependencies on previous exercises.
For this exercise, we take a subset of the Titanic dataset: 1 numeric predictor (SibSp), 1 ordinal predictor (Pclass), and 2 nominal predictors (Sex and Embarked).
> train_X_encode <- train_X[, c("SibSp", "Pclass", "Sex", "Embarked")]
> test_X_encode <- test_X[, c("SibSp", "Pclass", "Sex", "Embarked")]
> str(train_X_encode)
'data.frame': 891 obs. of 4 variables:
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Sex : chr "male" "female" "female" "female" ...
$ Embarked: chr "S" "C" "S" "S" ...
We are lucky because the ordinal predictor ticket class Pclass (1st, 2nd, or 3rd) is already coded as an integer.
To illustrate integer encoding, we consider the example from the theory lecture.
df_train <- data.frame(breakfast = c("every day", "never", "rarely", "most days", "never"))
df_test <- data.frame(breakfast = c("never", "rarely", "most days", "never", "twice a day"))
Note that the test set has a level "twice a day" that does not occur in the training set. We need to find all the levels and order them correctly. Next, we convert the character to a factor using the defined levels, and then convert to numeric.
> union(unique(df_train$breakfast), unique(df_test$breakfast))
[1] "every day" "never" "rarely" "most days" "twice a day"
> breakfast_levels <- c("never", "rarely", "most days", "every day", "twice a day") # in correct order!
> df_train$breakfast_ordinal <- as.numeric(factor(df_train$breakfast, levels = breakfast_levels))
> df_test$breakfast_ordinal <- as.numeric(factor(df_test$breakfast, levels = breakfast_levels))
This gives the following result:
> df_train
breakfast breakfast_ordinal
1 every day 4
2 never 1
3 rarely 2
4 most days 3
5 never 1
> df_test
breakfast breakfast_ordinal
1 never 1
2 rarely 2
3 most days 3
4 never 1
5 twice a day 5
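A pitfall worth noting (this toy check is our own addition, not part of the exercise): if a level is missing from the levels vector, factor() does not raise an error but silently maps the unseen value to NA.

```r
# If "twice a day" is omitted from the levels, the unseen test-set
# value is silently converted to NA rather than raising an error
bad_levels <- c("never", "rarely", "most days", "every day")
x <- as.numeric(factor("twice a day", levels = bad_levels))
is.na(x)  # TRUE
```

This is why we take the union of the train and test levels before converting.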
The subset has 2 nominal predictors: Sex ("male" and "female") and Embarked ("S", "C", "Q", "").
Some models can deal with the data type “factor” automatically, creating dummies when necessary; examples are linear/logistic regression, GAM, RF, and boosting.
In this case, one-hot encoding is not necessary: just convert the predictors to factors. Again, make sure the train and test sets have the same levels.
> sex_levels <- union(unique(train_X_encode$Sex), unique(test_X_encode$Sex))
> train_X_encode$Sex <- factor(train_X_encode$Sex, levels = sex_levels)
> test_X_encode$Sex <- factor(test_X_encode$Sex, levels = sex_levels)
>
> embarked_levels <- union(unique(train_X_encode$Embarked), unique(test_X_encode$Embarked))
> train_X_encode$Embarked <- factor(train_X_encode$Embarked, levels = embarked_levels)
> test_X_encode$Embarked <- factor(test_X_encode$Embarked, levels = embarked_levels)
Other models cannot deal with the data type “factor” automatically, e.g. lasso/ridge, SVM, NN. Here, we need to one-hot encode the predictors.
The easiest method is to use the model.matrix() function, as seen in the lab session. This function converts all non-numeric predictors to numeric. It assumes that the non-numeric predictors are nominal and creates one-hot encodings.
Again, we need to make sure that the train and test sets have the same columns. One way to accomplish this is to merge all separate data frames (train_X_encode, test_X_encode, train_y). Since we do not have test_y, we use a placeholder.
df_merge <- cbind(X = rbind(train_X_encode, test_X_encode),
y = c(train_y, rep(0, nrow(test_X_encode))))
train_X_encode <- model.matrix(y ~ . - 1, data = df_merge)[seq(1, nrow(train_X_encode)), ]
test_X_encode <- model.matrix(y ~ . - 1, data = df_merge)[seq(nrow(train_X_encode) + 1, nrow(df_merge)), ]
This returns:
> head(train_X_encode, 3)
X.SibSp X.Pclass X.Sexfemale X.Sexmale X.EmbarkedC X.EmbarkedQ X.EmbarkedS
1 1 3 0 1 0 0 1
2 1 1 1 0 1 0 0
3 0 3 1 0 0 0 1
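To see why both Sexfemale and Sexmale appear while one Embarked level is dropped: with the intercept removed (~ . - 1), model.matrix() gives the first factor a full set of dummies and applies treatment contrasts (dropping the reference level) to the remaining factors. A self-contained toy sketch, with hypothetical values rather than the actual Titanic data:

```r
# Toy data frame with two character predictors (hypothetical values)
df <- data.frame(y = c(0, 1, 0),
                 sex = c("male", "female", "female"),
                 port = c("S", "C", "S"))
# Without the intercept, the first factor (sex) keeps all its levels;
# the second (port) drops its reference level ("C", alphabetically first)
mm <- model.matrix(y ~ . - 1, data = df)
colnames(mm)  # "sexfemale" "sexmale" "portS"
```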
This approach is easy to use but does not allow for much customization. For example, what if we want to treat ordinal predictors differently? What if a predictor has a lot of levels and explodes the number of columns?
A more flexible approach is to use the dummy package. This goes as follows:
library(dummy)
# get categories and dummies
cats <- categories(train_X_encode[, c("Sex", "Embarked")])
# apply on train set (exclude reference categories)
dummies_train <- dummy(train_X_encode[, c("Sex", "Embarked")],
object = cats)
dummies_train <- subset(dummies_train, select = -c(Sex_female, Embarked_))
# apply on test set (exclude reference categories)
dummies_test <- dummy(test_X_encode[, c("Sex", "Embarked")], object = cats)
dummies_test <- subset(dummies_test, select = -c(Sex_female, Embarked_))
Then, we remove the original predictors and merge them with the other predictors:
## merge with overall training set
train_X_encode <- subset(train_X_encode, select = -c(Sex, Embarked))
train_X_encode <- cbind(train_X_encode, dummies_train)
## merge with overall test set
test_X_encode <- subset(test_X_encode, select = -c(Sex, Embarked))
test_X_encode <- cbind(test_X_encode, dummies_test)
This results in:
> head(train_X_encode, 3)
SibSp Pclass Sex_male Embarked_C Embarked_Q Embarked_S
1 1 3 1 0 0 1
2 1 1 0 1 0 0
3 0 3 0 0 0 1
If a predictor has high cardinality (many levels), we can select the subset (e.g. the top 10) with the highest frequency.
cats <- categories(train_X_encode[, c("Sex", "Embarked")], p = 10)
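Under the hood this amounts to ranking the levels by frequency and keeping the top p. A base-R sketch of the idea on a toy vector (our own illustration; categories() does this for you):

```r
# Toy high-cardinality vector (hypothetical values)
x <- c("a", "a", "b", "b", "b", "c", "d")
# Keep the 2 most frequent levels; the rest would get no dummy column
top2 <- names(sort(table(x), decreasing = TRUE))[1:2]
top2  # "b" "a"
```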
Now apply the dummy-package approach to the predictors LotArea, LandSlope, and Neighborhood (see boilerplate). LotArea is numerical; LandSlope and Neighborhood are nominal.
- Get the categories of LandSlope and Neighborhood and store them in cats.
- Create the dummies for the training set and store them in dummies_train. Don’t forget to remove the reference categories (remove the first category).
- Create the dummies for the test set and store them in dummies_test. Don’t forget to remove the reference categories (remove the first category).
- Remove the original predictors and merge the dummies into train_X_encode.
- Remove the original predictors and merge the dummies into test_X_encode.
- Inspect train_X_encode and test_X_encode. Are the columns as expected?
Assume that:
- the dummy library has been loaded
- the train_X, train_y, and test_X datasets have been loaded