This exercise demonstrates how to detect and handle missing values.
First, create new data frames so that the original data frames are not overwritten.
train_X_impute <- train_X
test_X_impute <- test_X
Missing values (NAs) can be detected by computing the percentage of missing values per column, for both the training and the test set. For the titanic data, the training set has missing values for Age, whereas the test set has missing values for Age and Fare.
> colMeans(is.na(train_X_impute))
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0.0000000 0.0000000 0.0000000 0.0000000 0.1986532 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
> colMeans(is.na(test_X_impute))
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0.000000000 0.000000000 0.000000000 0.000000000 0.205741627 0.000000000 0.000000000 0.000000000 0.002392344 0.000000000 0.000000000
If a predictor has a very high percentage of NAs, it might be better to remove it.
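As a sketch of that idea (the 50% threshold and the toy data frame are illustrative assumptions, not part of the demonstration), high-NA columns can be dropped based on the percentages from colMeans(is.na(...)):

```r
# toy data frame: column a is 75% missing, b and c are mostly complete
df <- data.frame(a = c(1, NA, NA, NA), b = 1:4, c = c(NA, 2, 3, 4))
na_frac <- colMeans(is.na(df))  # fraction of NAs per column

# keep only the columns with at most 50% missing values
threshold <- 0.5
df_reduced <- df[, na_frac <= threshold, drop = FALSE]
names(df_reduced)  # "b" "c"
```

The drop = FALSE is there so the result stays a data frame even if only one column survives.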
Overall, there are three ways to handle missing values. First, you could simply remove all observations that have a missing value for any of the predictors. Depending on the number of NAs, this can drastically reduce the size of the data. Furthermore, this is not an appropriate method for the group assignment, because we need to provide a prediction for each observation in the test set.
> nrow(train_X_impute)
[1] 891
> train_X_impute <- na.omit(train_X_impute)
> nrow(train_X_impute)
[1] 714
> test_X_impute <- na.omit(test_X_impute)
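Applying na.omit to the test set illustrates the problem: rows with NAs are silently dropped, so some observations end up without a prediction. A minimal sketch on made-up toy data:

```r
# toy "test set" with one missing Age; the ids are invented for illustration
test_toy <- data.frame(PassengerId = 1:4, Age = c(21, NA, 35, 40))
nrow(test_toy)           # 4
nrow(na.omit(test_toy))  # 3: one passenger dropped, so no prediction for it
```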
A very popular method is to impute the missing values with information from the training set (and only the training set!). For numeric predictors, this can be the mean or the median; for categorical predictors, this can be the mode.
Imputing the missing values of Age goes as follows. Note that we impute the NAs in the test set with the mean of the training set; otherwise, we would have data leakage.
train_X_impute$Age[is.na(train_X_impute$Age)] <- mean(train_X$Age, na.rm = TRUE)
test_X_impute$Age[is.na(test_X_impute$Age)] <- mean(train_X$Age, na.rm = TRUE)
This code can become quite cumbersome when applied to a large number of predictors. Therefore, we write a function that captures this logic.
impute <- function(x, method = mean, val = NULL) {
  if (is.null(val)) {
    val <- method(x, na.rm = TRUE)
  }
  x[is.na(x)] <- val
  return(x)
}
> impute(c(1, 2, NA, 6)) # by default impute mean
[1] 1 2 3 6
> impute(c(1, 2, NA, 6), method = median) # could also impute median
[1] 1 2 2 6
> impute(c(1, 2, NA, 6), val = 1) # when val is supplied, impute with val
[1] 1 2 1 6
Using the function impute, imputing Age goes as follows. For the training set, we compute the mean on the training set itself via the argument method. For the test set, we supply the mean of the training set via the argument val.
train_X_impute$Age <- impute(train_X_impute$Age, method = mean)
test_X_impute$Age <- impute(test_X_impute$Age, val = mean(train_X$Age, na.rm = TRUE))
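A quick sanity check (a sketch on toy vectors standing in for train_X$Age and test_X$Age, since the titanic data is not rebuilt here) is to confirm that no NAs remain and that the test set was filled with the training mean:

```r
# toy Age vectors; the values are invented for illustration
train_age <- c(22, 38, NA, 35)
test_age  <- c(NA, 30)

train_mean <- mean(train_age, na.rm = TRUE)   # computed on training data only
train_age[is.na(train_age)] <- train_mean
test_age[is.na(test_age)]   <- train_mean     # avoids leakage from the test set

anyNA(train_age)  # FALSE
anyNA(test_age)   # FALSE
```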
The function can also be used with categorical data. First, however, we write a function that returns the mode of a column.
modus <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  ux <- unique(x)
  return(ux[which.max(tabulate(match(x, ux)))])
}
> impute(c("2", "2", NA, "6"), modus)
[1] "2" "2" "2" "6"
Using the function impute, imputing Sex goes as follows (Sex does not have NAs, but we show the code nonetheless):
train_X_impute$Sex <- impute(train_X_impute$Sex, method = modus)
test_X_impute$Sex <- impute(test_X_impute$Sex, val = modus(train_X$Sex, na.rm = TRUE))
The function can be applied to multiple columns by using the apply family. The code below imputes all numeric columns with the mean and all categorical columns with the mode.
# impute all numeric variables
num.cols <- sapply(train_X_impute, is.numeric) # all numeric columns
train_X_impute[, num.cols] <- lapply(train_X_impute[, num.cols],
                                     FUN = impute,
                                     method = mean)
test_X_impute[, num.cols] <- mapply(test_X_impute[, num.cols],
                                    FUN = impute,
                                    val = colMeans(train_X[, num.cols], na.rm = TRUE))

# impute all categorical variables
cat.cols <- !num.cols
train_X_impute[, cat.cols] <- lapply(train_X_impute[, cat.cols],
                                     FUN = impute,
                                     method = modus)
test_X_impute[, cat.cols] <- mapply(test_X_impute[, cat.cols],
                                    FUN = impute,
                                    val = sapply(train_X[, cat.cols], modus, na.rm = TRUE))
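After the bulk imputation, it is worth verifying that no NAs slipped through. A per-column check with sapply(..., anyNA) (shown on invented toy data, since train_X_impute is not rebuilt here) flags any remaining missing values:

```r
# toy data: one fully imputed data frame, one with a leftover NA
clean    <- data.frame(Age = c(22, 30, 28), Sex = c("m", "f", "f"))
leftover <- data.frame(Age = c(22, NA, 28))

sapply(clean, anyNA)     # Age: FALSE, Sex: FALSE
sapply(leftover, anyNA)  # Age: TRUE
```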
For informative missing values (i.e., when the fact that a value is missing is itself predictive), adding NA indicators might increase model performance.
naFlag <- function(df, df_val = NULL) {
  if (is.null(df_val)) {
    df_val <- df
  }
  mask <- sapply(df_val, anyNA)
  out <- lapply(df[mask], function(x) as.numeric(is.na(x)))
  if (length(out) > 0) names(out) <- paste0(names(out), "_flag")
  return(as.data.frame(out))
}
Using the function naFlag, adding NA indicators goes as follows. It is important that the training and test sets end up with the same columns. Therefore, we check for NAs in the training set and only add NA indicators for columns that have NAs there. For the training set, we only need to supply train_X. For the test set, we supply both test_X and train_X, because we want the NA indicators to be based on the training set.
> str(naFlag(df = train_X))
'data.frame': 891 obs. of 1 variable:
$ Age_flag: num 0 0 0 0 0 1 0 0 0 0 ...
> str(naFlag(df = test_X, df_val = train_X))
'data.frame': 418 obs. of 1 variable:
$ Age_flag: num 0 0 0 0 0 0 0 0 0 0 ...
(If you incorrectly created the NA indicators based on the test set, you would get indicators for both Age and Fare.)
Finally, you can add the NA indicators to the other predictors.
train_X_impute <- cbind(train_X_impute,
                        naFlag(df = train_X))
test_X_impute <- cbind(test_X_impute,
                       naFlag(df = test_X, df_val = train_X))
Exercise:
1. Store train_X in train_X_impute and test_X in test_X_impute.
2. Compute the percentage of missing values per column for train_X_impute and test_X_impute.
3. Impute the LotFrontage predictor with the median (not mean!) method. Overwrite the column in train_X_impute and test_X_impute.
4. Impute the MSZoning and Utilities predictors with the mode. Overwrite the columns in train_X_impute and test_X_impute.
5. Check that there are no NAs left in the data (only Alley should have missing values left, but we ignore it in this exercise).
Assume that the train_X, train_y, and test_X datasets have been loaded.