This exercise uses the Titanic dataset.

Standardization

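This exercise shows how to standardize the data, i.e. rescale each column so that it has mean \(\mu=0\) and standard deviation \(\sigma=1\). Standardization only shifts and rescales the values; it does not change the shape of the distribution, so the result follows \(\mathcal{N}(\mu=0, \sigma=1)\) only if the original data were normally distributed.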

First, create new dataframes to avoid overwriting the existing dataframes.

train_X_scale <- train_X
test_X_scale <- test_X

Generally, you should scale all numerical columns and integer-encoded columns, but not the one-hot encoded columns. When selecting all numeric columns with is.numeric, be sure to exclude the one-hot encoded columns afterwards. Here, as an example, we remove the ID column from the selection (it should not be used in a model).

> num.cols <- sapply(train_X_scale, is.numeric)
> num.cols[names(num.cols) %in% c("PassengerId")] <- FALSE
> num.cols
PassengerId      Pclass        Name         Sex         Age       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
      FALSE        TRUE       FALSE       FALSE        TRUE        TRUE        TRUE       FALSE        TRUE       FALSE       FALSE
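The selection above only drops the ID column. As a minimal sketch of how one-hot encoded columns could be excluded as well, the example below uses a small hypothetical data frame (toy, with an invented Sex_male column) and assumes that any numeric column containing only 0 and 1 is one-hot encoded. This is a heuristic, not a rule: it also flags genuine binary features, and a column with missing values is not flagged.

```r
# Toy data frame (hypothetical), standing in for a training set after encoding
toy <- data.frame(
  Id       = 1:6,
  Age      = c(22, 38, 26, 35, 54, 2),
  Pclass   = c(3, 1, 3, 1, 2, 3),
  Sex_male = c(1, 0, 0, 0, 1, 1),   # one-hot encoded column
  Name     = c("a", "b", "c", "d", "e", "f"),
  stringsAsFactors = FALSE
)

# Start from all numeric columns
num_cols <- sapply(toy, is.numeric)

# Heuristic (assumption): a numeric column holding only 0/1 values is one-hot encoded
is_binary <- sapply(toy, function(x) is.numeric(x) && all(x %in% c(0, 1)))

# Drop the ID column and the binary columns from the selection
num_cols[names(num_cols) == "Id"] <- FALSE
num_cols[is_binary] <- FALSE

# Names of the columns that remain candidates for scaling
scale_candidates <- names(num_cols)[num_cols]
scale_candidates
```

In this toy example only Age and Pclass remain as candidates for scaling.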

We continue the example by only standardizing 2 predictors: Pclass and Parch.

scale_cols <- c("Pclass", "Parch")

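We get the mean \(\mu\) and standard deviation \(\sigma\) for both columns in the training set. Then, we standardize the training set and reuse the same training-set \(\mu\) and \(\sigma\) to standardize the test set, so that both sets undergo the identical transformation and no information from the test set is used.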
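Written out for a single column \(x\), the transformation is as follows, where the subscript "train" (notation introduced here for illustration) marks statistics computed on the training set:

\[
z_{\text{train}} = \frac{x_{\text{train}} - \mu_{\text{train}}}{\sigma_{\text{train}}},
\qquad
z_{\text{test}} = \frac{x_{\text{test}} - \mu_{\text{train}}}{\sigma_{\text{train}}}
\]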

# apply on training set
mean_train <- colMeans(train_X_scale[, scale_cols])
sd_train <- sapply(train_X_scale[, scale_cols], sd)
train_X_scale[, scale_cols] <- scale(train_X_scale[, scale_cols], center = TRUE, scale = TRUE)

# apply on test set
test_X_scale[, scale_cols] <- scale(test_X_scale[, scale_cols], center = mean_train, scale = sd_train)
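To make explicit what scale() does when center and scale are given as vectors, the following sketch reproduces the test-set transformation by hand on toy data. The data frames train_toy and test_toy and their values are hypothetical and only serve as an illustration.

```r
# Toy training and test sets (hypothetical values)
train_toy <- data.frame(a = c(1, 2, 3, 4, 5), b = c(10, 20, 20, 40, 60))
test_toy  <- data.frame(a = c(2, 6), b = c(30, 50))

# Statistics come from the training set only
mu    <- colMeans(train_toy)
sigma <- sapply(train_toy, sd)

# Manual standardization: subtract the column means, then divide by the column sds
manual <- sweep(sweep(as.matrix(test_toy), 2, mu, "-"), 2, sigma, "/")

# The same transformation through scale() with explicit center and scale vectors
via_scale <- scale(test_toy, center = mu, scale = sigma)

# Compare the values only; scale() attaches extra attributes to its result
same_result <- isTRUE(all.equal(manual, via_scale, check.attributes = FALSE))
same_result
```

Both approaches give the same values; scale() additionally stores the centers and scales it used in the attributes "scaled:center" and "scaled:scale" of its result.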

Now, we check the distributions:

> colMeans(train_X_scale[, scale_cols])
       Pclass         Parch 
-9.270549e-17  2.728831e-17 
> sapply(train_X_scale[, scale_cols], sd)
Pclass  Parch 
     1      1 
> 
> colMeans(test_X_scale[, scale_cols])
     Pclass       Parch 
-0.05154075  0.01333749 
> sapply(test_X_scale[, scale_cols], sd)
  Pclass    Parch 
1.006897 1.217567

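As expected, in the training set, the mean is 0 up to floating-point precision (values of the order of 1e-17) and the sd is 1. In the test set, the mean and sd are only approximately 0 and 1, because the test set was scaled with the training-set statistics; the deviation is largest for Parch, with an sd of about 1.22. Note that these checks only concern the mean and sd: they do not show that the columns are normally distributed.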
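The sketch below illustrates both points on simulated data (an exponential sample, not the Titanic columns): comparing with a tolerance via all.equal is more robust than testing for exact equality, and a skewed variable remains skewed after standardization. The seed and sample size are arbitrary choices.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated right-skewed variable (exponential), not taken from the Titanic data
x <- rexp(1000, rate = 1)

# Standardize; as.numeric() drops the matrix shape and attributes added by scale()
z <- as.numeric(scale(x))

# Tolerance-based checks instead of testing for exact equality
mean_is_zero <- isTRUE(all.equal(mean(z), 0))
sd_is_one    <- isTRUE(all.equal(sd(z), 1))

# Standardization is a linear transformation, so z is perfectly correlated with x
shape_kept <- isTRUE(all.equal(cor(x, z), 1))

# The median stays below the mean (which is now 0): the right skew is still there
median(z)
```

A histogram of z (for example with hist(z)) would show the same right-skewed shape as the original sample, only on a different axis.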

Questions

Start from the House Prices dataset you imported in exercise 1 and apply the same preprocessing steps (see boilerplate).
Copy train_X to train_X_scale and test_X to test_X_scale.
Standardize 2 columns: MSSubClass and LotArea. Store the column names in scale_cols.
Store the mean and sd for both columns in mean_train and sd_train.
Scale the columns and overwrite their values in train_X_scale and test_X_scale.
Check the distribution of the columns. Do they have (approximately) mean 0 and standard deviation 1?

Assume that:

The train_X, train_y, and test_X datasets have been loaded