This exercise uses the Titanic dataset.
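This exercise shows how to standardize the data, i.e. rescale each column so that it has mean 0 and standard deviation 1. Each value \(x\) is replaced by \((x - \mu)/\sigma\), where \(\mu\) and \(\sigma\) are the mean and standard deviation of the column. This changes only the location and spread of the data, not its shape, so the result follows \(\mathcal{N}(\mu=0, \sigma=1)\) only if the column was normally distributed to begin with.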
First, create new dataframes to avoid overwriting the existing dataframes.
train_X_scale <- train_X
test_X_scale <- test_X
Generally, you should scale all numerical columns and integer encoded columns. Do not scale the one-hot encoded columns!
When searching for all columns of type numeric with is.numeric, be sure to exclude the one-hot encoded columns. Here, as an example, we remove the ID column from the selection (this should not be used in a model).
> num.cols <- sapply(train_X_scale, is.numeric)
> num.cols[names(num.cols) %in% c("PassengerId")] <- FALSE
> num.cols
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE
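One possible way to keep one-hot encoded columns out of the selection is to flag numeric columns that contain only 0 and 1. The sketch below is a heuristic and is not part of the original exercise: the data frame `toy` and its one-hot columns `Sex_male` and `Embarked_S` are made up for illustration, and the check would also exclude a genuine numeric column that happens to contain only 0s and 1s.

```r
# Toy data frame standing in for train_X_scale (all values are made up).
toy <- data.frame(
  PassengerId = 1:6,
  Pclass      = c(3, 1, 3, 1, 2, 3),
  Age         = c(22, 38, 26, 35, 54, 2),
  Sex_male    = c(1, 0, 0, 0, 1, 1),   # hypothetical one-hot column
  Embarked_S  = c(1, 0, 1, 1, 1, 0)    # hypothetical one-hot column
)

# Step 1: select all numeric columns.
num.cols <- sapply(toy, is.numeric)

# Step 2: flag numeric columns whose values are all 0, 1, or missing.
is.binary <- sapply(toy, function(x) is.numeric(x) && all(x %in% c(0, 1, NA)))

# Step 3: drop the binary columns and the ID column from the selection.
num.cols[is.binary] <- FALSE
num.cols[names(num.cols) %in% c("PassengerId")] <- FALSE

# Names of the columns that remain candidates for scaling.
scale_candidates <- names(num.cols)[num.cols]
print(scale_candidates)
```

It is still worth inspecting the resulting selection by eye before scaling, since the heuristic cannot tell an indicator column from a count that never exceeds 1.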
We continue the example by standardizing only 2 predictors: Pclass and Parch.
scale_cols <- c("Pclass", "Parch")
We get the mean \(\mu\) and standard deviation \(\sigma\) for both columns in the training set. Then, we standardize the training set and reuse these training-set values of \(\mu\) and \(\sigma\) to standardize the test set, so that no information from the test set enters the preprocessing.
# apply on training set
mean_train <- colMeans(train_X_scale[, scale_cols])
sd_train <- sapply(train_X_scale[, scale_cols], sd)
train_X_scale[, scale_cols] <- scale(train_X_scale[, scale_cols], center = TRUE, scale = TRUE)
# apply on test set
test_X_scale[, scale_cols] <- scale(test_X_scale[, scale_cols], center = mean_train, scale = sd_train)
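To make explicit what scale() does when it is given the training statistics, the following sketch reproduces the test-set transformation by hand with sweep(). The data frames `toy_train` and `toy_test` are made-up stand-ins, not the Titanic data.

```r
# Made-up stand-ins for the training and test predictors.
toy_train <- data.frame(Pclass = c(3, 1, 3, 1, 2), Parch = c(0, 0, 1, 2, 0))
toy_test  <- data.frame(Pclass = c(2, 3, 1),       Parch = c(0, 1, 0))
scale_cols <- c("Pclass", "Parch")

# The statistics come from the training set only.
mean_train <- colMeans(toy_train[, scale_cols])
sd_train   <- sapply(toy_train[, scale_cols], sd)

# Built-in route: scale() subtracts `center` and divides by `scale`, per column.
test_scaled <- scale(toy_test[, scale_cols], center = mean_train, scale = sd_train)

# Manual route: (x - mean_train) / sd_train for every column.
centered    <- sweep(as.matrix(toy_test[, scale_cols]), 2, mean_train, "-")
test_manual <- sweep(centered, 2, sd_train, "/")

# Both routes should agree up to floating-point error.
same_result <- isTRUE(all.equal(as.vector(test_scaled), as.vector(test_manual)))
print(same_result)
```

Passing `mean_train` and `sd_train` explicitly is what keeps the test set on the same scale as the training set; calling scale() on the test set with `center = TRUE, scale = TRUE` would instead use the test set's own statistics.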
Now, we check the distributions:
> colMeans(train_X_scale[, scale_cols])
Pclass Parch
-9.270549e-17 2.728831e-17
> sapply(train_X_scale[, scale_cols], sd)
Pclass Parch
1 1
>
> colMeans(test_X_scale[, scale_cols])
Pclass Parch
-0.05154075 0.01333749
> sapply(test_X_scale[, scale_cols], sd)
Pclass Parch
1.006897 1.217567
As expected, in the training set, the mean is 0 (up to floating-point error) and the sd is 1. In the test set, the values are close to, but not exactly, 0 and 1, because the test set was standardized with the mean and sd of the training set.
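Because the training means are zero only up to floating-point error, a tolerance-based comparison is a safer check than `==`. A minimal sketch on made-up numbers (the vector `x` is purely illustrative):

```r
# Made-up values standing in for one training column.
x <- c(3, 1, 3, 1, 2, 3, 2)

# Standardize with the column's own mean and sd.
z <- as.vector(scale(x))

# Compare against 0 and 1 within all.equal()'s default numerical tolerance.
mean_is_zero <- isTRUE(all.equal(mean(z), 0))
sd_is_one    <- isTRUE(all.equal(sd(z), 1))

print(mean_is_zero)
print(sd_is_one)
```

The same check is not expected to pass on the test set, since its columns were centered and scaled with the training statistics.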
Copy train_X in train_X_scale and test_X in test_X_scale.
Standardize the columns MSSubClass and LotArea. Store the column names in scale_cols.
Store the mean and sd of the training set in mean_train and sd_train.
Store the standardized data in train_X_scale and test_X_scale.
Assume that: the train_X, train_y, and test_X datasets have been loaded.