This step should be performed on data without missing values. However, in this exercise, we don’t do that to avoid dependencies on previous exercises.
For this exercise, we take a subset of the Titanic dataset: 1 numeric predictor (SibSp), 1 ordinal predictor (Pclass), and 2 nominal predictors (Sex and Embarked).
> train_X_encode <- train_X[, c("SibSp", "Pclass", "Sex", "Embarked")]
> test_X_encode <- test_X[, c("SibSp", "Pclass", "Sex", "Embarked")]
> str(train_X_encode)
'data.frame': 891 obs. of 4 variables:
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Sex : chr "male" "female" "female" "female" ...
$ Embarked: chr "S" "C" "S" "S" ...
We are lucky because the ordinal predictor ticket class Pclass (1st, 2nd, or 3rd) is already coded as an integer.
To illustrate integer encoding, we consider the example from the theory lecture.
df_train <- data.frame(breakfast = c("every day", "never", "rarely", "most days", "never"))
df_test <- data.frame(breakfast = c("never", "rarely", "most days", "never", "twice a day"))
Note that the test set has a level "twice a day" that does not occur in the training set. We need to find all the levels and order them correctly. Next, we convert the character to a factor using the defined levels, and then convert to numeric.
> union(unique(df_train$breakfast), unique(df_test$breakfast))
[1] "every day" "never" "rarely" "most days" "twice a day"
> breakfast_levels <- c("never", "rarely", "most days", "every day", "twice a day") # in correct order!
> df_train$breakfast_ordinal <- as.numeric(factor(df_train$breakfast, levels = breakfast_levels))
> df_test$breakfast_ordinal <- as.numeric(factor(df_test$breakfast, levels = breakfast_levels))
This gives the following result:
> df_train
breakfast breakfast_ordinal
1 every day 4
2 never 1
3 rarely 2
4 most days 3
5 never 1
> df_test
breakfast breakfast_ordinal
1 never 1
2 rarely 2
3 most days 3
4 never 1
5 twice a day 5
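A pitfall worth noting (this toy check is our own addition, not part of the exercise): if a level is missing from the levels vector, factor() does not raise an error but silently maps the unseen value to NA.

```r
# If "twice a day" is omitted from the levels, the unseen test-set
# value is silently converted to NA rather than raising an error
bad_levels <- c("never", "rarely", "most days", "every day")
x <- as.numeric(factor("twice a day", levels = bad_levels))
is.na(x)  # TRUE
```

This is why we take the union of the train and test levels before converting.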
The subset has 2 nominal predictors: Sex ("male" and "female") and Embarked ("S", "C", "Q", "").
Some models can deal with the data type “factor” automatically, creating dummies when necessary; examples are linear/logistic regression, GAM, RF, and boosting.
In this case, one-hot encoding is not necessary: just convert the predictors to factors. Again, make sure the train and test sets have the same levels.
> sex_levels <- union(unique(train_X_encode$Sex), unique(test_X_encode$Sex))
> train_X_encode$Sex <- factor(train_X_encode$Sex, levels = sex_levels)
> test_X_encode$Sex <- factor(test_X_encode$Sex, levels = sex_levels)
>
> embarked_levels <- union(unique(train_X_encode$Embarked), unique(test_X_encode$Embarked))
> train_X_encode$Embarked <- factor(train_X_encode$Embarked, levels = embarked_levels)
> test_X_encode$Embarked <- factor(test_X_encode$Embarked, levels = embarked_levels)
Other models cannot deal with the data type “factor” automatically, e.g. lasso/ridge, SVM, NN. Here, we need to one-hot encode the predictors.
The easiest method is to use the model.matrix() function, as seen in the lab session. This function converts all non-numeric predictors to numeric. It assumes that the non-numeric predictors are nominal and creates one-hot encodings.
Again, we need to make sure that the train and test sets have the same columns. One way to accomplish this is to merge all separate data frames (train_X_encode, test_X_encode, train_y). Since we do not have test_y, we use a placeholder.
df_merge <- cbind(X = rbind(train_X_encode, test_X_encode),
y = c(train_y, rep(0, nrow(test_X_encode))))
train_X_encode <- model.matrix(y ~ . - 1, data = df_merge)[seq(1, nrow(train_X_encode)), ]
test_X_encode <- model.matrix(y ~ . - 1, data = df_merge)[seq(nrow(train_X_encode) + 1, nrow(df_merge)), ]
This returns:
> head(train_X_encode, 3)
X.SibSp X.Pclass X.Sexfemale X.Sexmale X.EmbarkedC X.EmbarkedQ X.EmbarkedS
1 1 3 0 1 0 0 1
2 1 1 1 0 1 0 0
3 0 3 1 0 0 0 1
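To see why both Sexfemale and Sexmale appear while one Embarked level is dropped: with the intercept removed (~ . - 1), model.matrix() gives the first factor a full set of dummies and applies treatment contrasts (dropping the reference level) to the remaining factors. A self-contained toy sketch, with hypothetical values rather than the actual Titanic data:

```r
# Toy data frame with two character predictors (hypothetical values)
df <- data.frame(y = c(0, 1, 0),
                 sex = c("male", "female", "female"),
                 port = c("S", "C", "S"))
# Without the intercept, the first factor (sex) keeps all its levels;
# the second (port) drops its reference level ("C", alphabetically first)
mm <- model.matrix(y ~ . - 1, data = df)
colnames(mm)  # "sexfemale" "sexmale" "portS"
```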
This approach is easy to use but does not allow for much customization. For example, what if we want to treat ordinal predictors differently? What if a predictor has a lot of levels and explodes the number of columns?
A more flexible approach is to use the dummy package. This goes as follows:
library(dummy)
# get categories and dummies
cats <- categories(train_X_encode[, c("Sex", "Embarked")])
# apply on train set (exclude reference categories)
dummies_train <- dummy(train_X_encode[, c("Sex", "Embarked")],
object = cats)
dummies_train <- subset(dummies_train, select = -c(Sex_female, Embarked_))
# apply on test set (exclude reference categories)
dummies_test <- dummy(test_X_encode[, c("Sex", "Embarked")], object = cats)
dummies_test <- subset(dummies_test, select = -c(Sex_female, Embarked_))
Then, we remove the original predictors and merge them with the other predictors:
## merge with overall training set
train_X_encode <- subset(train_X_encode, select = -c(Sex, Embarked))
train_X_encode <- cbind(train_X_encode, dummies_train)
## merge with overall test set
test_X_encode <- subset(test_X_encode, select = -c(Sex, Embarked))
test_X_encode <- cbind(test_X_encode, dummies_test)
This results in:
> head(train_X_encode, 3)
SibSp Pclass Sex_male Embarked_C Embarked_Q Embarked_S
1 1 3 1 0 0 1
2 1 1 0 1 0 0
3 0 3 0 0 0 1
If a predictor has high cardinality (many levels), we can select the subset (e.g. the top 10) with the highest frequency.
cats <- categories(train_X_encode[, c("Sex", "Embarked")], p = 10)
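Under the hood this amounts to ranking the levels by frequency and keeping the top p. A base-R sketch of the idea on a toy vector (our own illustration; categories() does this for you):

```r
# Toy high-cardinality vector (hypothetical values)
x <- c("a", "a", "b", "b", "b", "c", "d")
# Keep the 2 most frequent levels; the rest would get no dummy column
top2 <- names(sort(table(x), decreasing = TRUE))[1:2]
top2  # "b" "a"
```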
Now apply the dummy-package approach to the predictors LotArea, LandSlope, and Neighborhood (see boilerplate). LotArea is numerical; LandSlope and Neighborhood are nominal.
- Get the categories of LandSlope and Neighborhood and store them in cats.
- Create the dummies for the training set and store them in dummies_train. Don’t forget to remove the reference categories (remove the first category).
- Create the dummies for the test set and store them in dummies_test. Don’t forget to remove the reference categories (remove the first category).
- Remove the original predictors and merge the dummies into train_X_encode.
- Remove the original predictors and merge the dummies into test_X_encode.
- Inspect train_X_encode and test_X_encode. Are the columns as expected?
Assume that:
- the dummy library has been loaded
- the train_X, train_y, and test_X datasets have been loaded