This exercise demonstrates how to import data from a CSV file and how to perform Exploratory Data Analysis (EDA), which gives you a first understanding of the data.

For most of the exercises in this session, the code examples are applied to data from the Titanic competition, available here¹ on Kaggle. Afterwards, you will apply the code to the House Prices competition, also available on Kaggle. The data format is similar to what you will receive for the group assignment.

Data Import

The idea of the Titanic competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck (i.e. binary classification).

As always, Kaggle provides two CSV files: train.csv and test.csv. The training set contains both the dependent and the independent variables and is used to train the model. The test set contains only the independent variables, and you need to predict the dependent variable for each observation in the test set.

We read the csv files as follows:

# read training set
train <- read.csv("0X_dataPreprocessing/data/train.csv")
str(train)
# read test set
test_X <- read.csv("0X_dataPreprocessing/data/test.csv")
str(test_X)
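
In this data set, missing values in character columns such as Cabin are stored as empty strings rather than NA. The following is a minimal sketch of how the na.strings argument of read.csv can convert them to NA at import time. The small CSV file is hypothetical and is written to a temporary location only to make the example self-contained; its values are made up.

```r
# hypothetical mini CSV standing in for train.csv (values are made up)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "PassengerId,Age,Cabin,Embarked",
  "1,22,,S",
  "2,38,C85,C",
  "3,,,S"
), tmp)

# default import: empty fields in character columns stay ""
raw <- read.csv(tmp)
# treat both "" and "NA" as missing values
clean <- read.csv(tmp, na.strings = c("", "NA"))

str(raw)
str(clean)
colSums(is.na(clean))  # number of missing values per column
```

Whether to recode empty strings at import or later during preprocessing is a design choice; doing it at import keeps the counts of missing values consistent across numeric and character columns.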

Next, we split the independent & dependent variables in the training set.

# separate dep and indep vars
train_X <- subset(train, select = -c(Survived))
train_y <- train$Survived

The structure of train_X is as follows:

> str(train_X)
'data.frame':	891 obs. of  11 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ Sex        : chr  "male" "female" "female" "female" ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : chr  "" "C85" "" "C123" ...
 $ Embarked   : chr  "S" "C" "S" "S" ...

The structure of test_X is similar. The variable train_y is an integer vector with the values 0 and 1. Note that test_y is not available: to check the performance of our model on the test set, we need to submit our predictions to Kaggle.

> str(train_y)
 int [1:891] 0 1 1 1 0 0 0 0 1 1 ...
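
Since the model is trained on train_X and later applied to test_X, it can be useful to confirm that both contain the same columns and to look at the class balance of train_y. The sketch below uses small hypothetical data frames (made-up values) in place of the real files.

```r
# hypothetical toy data standing in for train and test_X
train <- data.frame(
  PassengerId = 1:6,
  Sex = c("male", "female", "female", "female", "male", "male"),
  Age = c(22, 38, 26, 35, 35, NA),
  Survived = c(0L, 1L, 1L, 1L, 0L, 0L)
)
test_X <- data.frame(
  PassengerId = 7:8,
  Sex = c("male", "female"),
  Age = c(34.5, 47)
)

# separate dep and indep vars, as in the text
train_X <- subset(train, select = -c(Survived))
train_y <- train$Survived

# both predictor sets should have the same columns in the same order
identical(names(train_X), names(test_X))
# columns present in train_X but not in test_X (empty if they match)
setdiff(names(train_X), names(test_X))

# class balance of the binary target
table(train_y)
prop.table(table(train_y))
```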

EDA

A. Visual data exploration

Different plots can be made depending on whether the variable is numerical or categorical.

# histogram (numerical)
hist(train_X$Age, breaks = 30)
# boxplot (numerical)
boxplot(train_X$Age)
# barplot (categorical)
barplot(table(train_X$Embarked))
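
Two small extensions of these plots may be helpful: showing missing values as their own bar, and comparing a numerical variable across the classes of the target. This is a sketch on hypothetical toy data (made-up values), not on the actual Titanic files.

```r
# hypothetical toy data (values are made up)
toy <- data.frame(
  Age = c(22, 38, 26, 35, NA, 54, 2, 27),
  Embarked = c("S", "C", "S", "", "Q", "S", "C", "S"),
  Survived = c(0, 1, 1, 1, 0, 0, 0, 1)
)

# table() does not treat empty strings as missing, so recode them first
toy$Embarked[toy$Embarked == ""] <- NA
counts <- table(toy$Embarked, useNA = "ifany")
# barplot (categorical), including a bar for the missing values
barplot(counts, main = "Embarked", ylab = "Count")

# boxplot (numerical by target): one box per class of Survived
boxplot(Age ~ Survived, data = toy, xlab = "Survived", ylab = "Age")
```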

B. Statistical data exploration

Again, different statistics can be computed depending on whether the variable is numerical or categorical. Statistics can either be univariate or bivariate.

# univariate (numerical)
mean(train_X$Age, na.rm = TRUE)
sd(train_X$Age, na.rm = TRUE)
median(train_X$Age, na.rm = TRUE)
quantile(train_X$Age, na.rm = TRUE)
summary(train_X)

# univariate (categorical)
unique(train_X$Embarked)
table(train_X$Embarked)
prop.table(table(train_X$Embarked))

# bivariate (numerical)
cor(train_X$SibSp, train_X$Fare) # correlation coefficient
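
The univariate statistics above use na.rm = TRUE because Age contains missing values; cor() has no such argument and relies on use instead. The sketch below, on hypothetical toy data with made-up values, shows this and adds two bivariate summaries that involve a categorical variable.

```r
# hypothetical toy data (values are made up)
toy <- data.frame(
  Age = c(22, 38, 26, 35, NA, 54, 2, 27),
  Fare = c(7.25, 71.28, 7.92, 53.1, 8.05, 51.86, 21.08, 11.13),
  Sex = c("male", "female", "female", "female",
          "male", "male", "male", "female"),
  Survived = c(0, 1, 1, 1, 0, 0, 0, 1)
)

# with the default settings, cor() returns NA if any value is missing
cor(toy$Age, toy$Fare)
# drop incomplete pairs explicitly to get a coefficient
cor(toy$Age, toy$Fare, use = "complete.obs")

# bivariate (categorical): cross table and row-wise proportions
tab <- table(toy$Sex, toy$Survived)
tab
prop.table(tab, margin = 1)  # each row sums to 1

# bivariate (numerical by categorical): mean Age per Sex
aggregate(Age ~ Sex, data = toy, FUN = mean)
```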

Questions

Some of the exercises are not tested by Dodona (for example the plots), but it is still useful to try them.

The following exercises will work with data from the House Prices competition, available here² on Kaggle. This is a regression problem where we predict the sales price of houses.

Read in the House Prices dataset. Store the data in train and test_X.

For the Dodona exercises, the data should be in your working directory. The easiest solution is to set your working directory to the location of the current file and to put the data in the same folder as your R script. (For the group assignment, you should store data and code in the dedicated folders data and src; see the theory lecture.)

For train, only retain the first 20 predictors & the target (last column) with train <- train[, -seq(21,ncol(train)-1)].
For test_X, only retain the first 20 predictors (there is no target) with test_X <- test_X[, -seq(21,ncol(test_X))].
Split the independent & dependent variables in the training set. Store the data in train_X and train_y. (note: identify the dependent variable based on the data description).
Inspect the structure of train_X, train_y, and test_X.
Visual data exploration:
1. Create a boxplot of the training data of LotFrontage.
2. Create a barplot of the training data of HouseStyle.
Statistical data exploration:
1. Create a summary of the training data of LotFrontage. Store the result in LotFrontage_summary.
2. Create a proportion table of the training data of HouseStyle. Store the result in HouseStyle_prop.
3. What is the correlation coefficient between the predictors OverallQual and LotArea in the training data? Store the result in QualArea_cor.
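
The column selection used in the steps above relies on negative indexing. As an illustration only, the sketch below applies the same expression to a hypothetical data frame with 25 columns (the names X1 to X24 and target are made up) and shows an operator-precedence pitfall to avoid when writing the index by hand.

```r
# hypothetical data frame: 24 predictors and a target in the last column
toy <- as.data.frame(matrix(1:50, nrow = 2, ncol = 25))
names(toy) <- c(paste0("X", 1:24), "target")

# columns 21 up to the second-to-last one are dropped
drop_idx <- seq(21, ncol(toy) - 1)
drop_idx
toy_small <- toy[, -drop_idx]
names(toy_small)  # X1 ... X20 and target

# the same selection written positively
same <- toy[, c(1:20, ncol(toy))]
identical(toy_small, same)

# pitfall: ":" binds tighter than "-", so this is (21:25) - 1
21:ncol(toy) - 1
```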