This exercise demonstrates how to detect and handle missing values.
First, create new data frames so that the original data frames are not overwritten.
train_X_impute <- train_X
test_X_impute <- test_X
Missing values (NAs) can be detected by computing the percentage of missing values per column, for both the training and the test set. For the titanic data, the training set has missing values for Age, whereas the test set has missing values for Age and Fare.
> colMeans(is.na(train_X_impute))
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0.0000000 0.0000000 0.0000000 0.0000000 0.1986532 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
> colMeans(is.na(test_X_impute))
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0.000000000 0.000000000 0.000000000 0.000000000 0.205741627 0.000000000 0.000000000 0.000000000 0.002392344 0.000000000 0.000000000
If a predictor has a very high percentage of NAs, it might be better to remove it.
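As a sketch of that idea (the 50% threshold and the toy data frame are illustrative assumptions, not part of the demonstration), high-NA columns can be dropped based on the percentages from colMeans(is.na(...)):

```r
# toy data frame: column a is 75% missing, b and c are mostly complete
df <- data.frame(a = c(1, NA, NA, NA), b = 1:4, c = c(NA, 2, 3, 4))
na_frac <- colMeans(is.na(df))  # fraction of NAs per column

# keep only the columns with at most 50% missing values
threshold <- 0.5
df_reduced <- df[, na_frac <= threshold, drop = FALSE]
names(df_reduced)  # "b" "c"
```

The drop = FALSE is there so the result stays a data frame even if only one column survives.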
Overall, there are three ways to handle missing values. First, you could simply remove all observations that have a missing value for any of the predictors. Depending on the number of NAs, this can drastically reduce the size of the data. Furthermore, this is not an appropriate method for the group assignment, because we need to provide a prediction for each observation in the test set.
> nrow(train_X_impute)
[1] 891
> train_X_impute <- na.omit(train_X_impute)
> nrow(train_X_impute)
[1] 714
> test_X_impute <- na.omit(test_X_impute)
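Applying na.omit to the test set illustrates the problem: rows with NAs are silently dropped, so some observations end up without a prediction. A minimal sketch on made-up toy data:

```r
# toy "test set" with one missing Age; the ids are invented for illustration
test_toy <- data.frame(PassengerId = 1:4, Age = c(21, NA, 35, 40))
nrow(test_toy)           # 4
nrow(na.omit(test_toy))  # 3: one passenger dropped, so no prediction for it
```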
A very popular method is to impute the missing values with information from the training set (and only the training set!). For numeric predictors, this can be the mean or the median; for categorical predictors, this can be the mode.
Imputing the missing values of Age goes as follows. Note that we impute the NAs in the test set with the mean of the training set; otherwise, we would have data leakage.
train_X_impute$Age[is.na(train_X_impute$Age)] <- mean(train_X$Age, na.rm = TRUE)
test_X_impute$Age[is.na(test_X_impute$Age)] <- mean(train_X$Age, na.rm = TRUE)
This code can become quite cumbersome when applied to a large number of predictors. Therefore, we write a function that captures this logic.
impute <- function(x, method = mean, val = NULL) {
  if (is.null(val)) {
    val <- method(x, na.rm = TRUE)
  }
  x[is.na(x)] <- val
  return(x)
}
> impute(c(1, 2, NA, 6)) # by default impute mean
[1] 1 2 3 6
> impute(c(1, 2, NA, 6), method = median) # could also impute median
[1] 1 2 2 6
> impute(c(1, 2, NA, 6), val = 1) # when val is supplied, impute with val
[1] 1 2 1 6
Using the function impute, imputing Age goes as follows. For the training set, we compute the mean on the training set itself via the argument method. For the test set, we supply the mean of the training set via the argument val.
train_X_impute$Age <- impute(train_X_impute$Age, method = mean)
test_X_impute$Age <- impute(test_X_impute$Age, val = mean(train_X$Age, na.rm = TRUE))
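A quick sanity check (a sketch on toy vectors standing in for train_X$Age and test_X$Age, since the titanic data is not rebuilt here) is to confirm that no NAs remain and that the test set was filled with the training mean:

```r
# toy Age vectors; the values are invented for illustration
train_age <- c(22, 38, NA, 35)
test_age  <- c(NA, 30)

train_mean <- mean(train_age, na.rm = TRUE)   # computed on training data only
train_age[is.na(train_age)] <- train_mean
test_age[is.na(test_age)]   <- train_mean     # avoids leakage from the test set

anyNA(train_age)  # FALSE
anyNA(test_age)   # FALSE
```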
The function can also be used with categorical data. First, however, we write a function that returns the mode of a column.
modus <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  ux <- unique(x)
  return(ux[which.max(tabulate(match(x, ux)))])
}
> impute(c("2", "2", NA, "6"), modus)
[1] "2" "2" "2" "6"
Using the function impute, imputing Sex goes as follows (Sex does not have NAs, but we show the code nonetheless):
train_X_impute$Sex <- impute(train_X_impute$Sex, method = modus)
test_X_impute$Sex <- impute(test_X_impute$Sex, val = modus(train_X$Sex, na.rm = TRUE))
The function can be applied to multiple columns by using the apply family. The code below imputes all numeric columns with the mean and all categorical columns with the mode.
# impute all numeric variables
num.cols <- sapply(train_X_impute, is.numeric) # all numeric columns
train_X_impute[, num.cols] <- lapply(train_X_impute[, num.cols],
                                     FUN = impute,
                                     method = mean)
test_X_impute[, num.cols] <- mapply(test_X_impute[, num.cols],
                                    FUN = impute,
                                    val = colMeans(train_X[, num.cols], na.rm = TRUE))

# impute all categorical variables
cat.cols <- !num.cols
train_X_impute[, cat.cols] <- lapply(train_X_impute[, cat.cols],
                                     FUN = impute,
                                     method = modus)
test_X_impute[, cat.cols] <- mapply(test_X_impute[, cat.cols],
                                    FUN = impute,
                                    val = sapply(train_X[, cat.cols], modus, na.rm = TRUE))
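After the bulk imputation, it is worth verifying that no NAs slipped through. A per-column check with sapply(..., anyNA) (shown on invented toy data, since train_X_impute is not rebuilt here) flags any remaining missing values:

```r
# toy data: one fully imputed data frame, one with a leftover NA
clean    <- data.frame(Age = c(22, 30, 28), Sex = c("m", "f", "f"))
leftover <- data.frame(Age = c(22, NA, 28))

sapply(clean, anyNA)     # Age: FALSE, Sex: FALSE
sapply(leftover, anyNA)  # Age: TRUE
```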
For informative missing values (i.e., when the fact that a value is missing is itself predictive), adding NA indicators might increase model performance.
naFlag <- function(df, df_val = NULL) {
  if (is.null(df_val)) {
    df_val <- df
  }
  mask <- sapply(df_val, anyNA)
  out <- lapply(df[mask], function(x) as.numeric(is.na(x)))
  if (length(out) > 0) names(out) <- paste0(names(out), "_flag")
  return(as.data.frame(out))
}
Using the function naFlag, adding NA indicators goes as follows. It is important that the training and test sets end up with the same columns. Therefore, we check for NAs in the training set and only add NA indicators for columns that have NAs there. For the training set, we only need to supply train_X. For the test set, we supply both test_X and train_X, because we want the NA indicators to be based on the training set.
> str(naFlag(df = train_X))
'data.frame': 891 obs. of 1 variable:
$ Age_flag: num 0 0 0 0 0 1 0 0 0 0 ...
> str(naFlag(df = test_X, df_val = train_X))
'data.frame': 418 obs. of 1 variable:
$ Age_flag: num 0 0 0 0 0 0 0 0 0 0 ...
(If you incorrectly created the NA indicators based on the test set, you would get indicators for both Age and Fare.)
Finally, you can add the NA indicators to the other predictors.
train_X_impute <- cbind(train_X_impute,
                        naFlag(df = train_X))
test_X_impute <- cbind(test_X_impute,
                       naFlag(df = test_X, df_val = train_X))
Exercise:
1. Store train_X in train_X_impute and test_X in test_X_impute.
2. Compute the percentage of missing values per column for train_X_impute and test_X_impute.
3. Impute the LotFrontage predictor with the median (not mean!) method. Overwrite the column in train_X_impute and test_X_impute.
4. Impute the MSZoning and Utilities predictors with the mode. Overwrite the columns in train_X_impute and test_X_impute.
5. Check that there are no NAs left in the data (only Alley should have missing values left, but we ignore it in this exercise).
Assume that the train_X, train_y, and test_X datasets have been loaded.