This exercise demonstrates how to import data from a csv file and how to do Exploratory Data Analysis (EDA). This allows you to get a first understanding of the data.
For most of the exercises in this session, the code examples are applied to data from the Titanic competition, available on Kaggle.
Afterwards, you will apply the code to the House Prices competition, also available on Kaggle.
These data formats are similar to what you receive for the group assignment.
The idea of the Titanic competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck (i.e. binary classification).
As always, Kaggle provides 2 csv files: train.csv & test.csv.
The training set contains both the dependent & independent variables and is used to train the model.
The test set contains only the independent variables, and you need to predict the dependent variable for each observation in the test set.
We read the csv files as follows:
# read training set
train <- read.csv("0X_dataPreprocessing/data/train.csv")
str(train)
# read test set
test_X <- read.csv("0X_dataPreprocessing/data/test.csv")
str(test_X)
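Before splitting the data, it can help to confirm which columns the two files have in common. The sketch below uses two small made-up data frames (train_toy and test_toy are hypothetical stand-ins, not the Kaggle data) and assumes the only difference between the files is the target column.

```r
# Toy stand-ins for the two Kaggle files (hypothetical values)
train_toy <- data.frame(
  PassengerId = 1:4,
  Survived    = c(0, 1, 1, 0),
  Pclass      = c(3, 1, 3, 1),
  Age         = c(22, 38, NA, 35)
)
test_toy <- data.frame(
  PassengerId = 5:6,
  Pclass      = c(3, 2),
  Age         = c(NA, 54)
)

# columns present in the training set but not in the test set
target_candidates <- setdiff(names(train_toy), names(test_toy))
print(target_candidates)

# columns present in the test set but missing from the training set
unexpected <- setdiff(names(test_toy), names(train_toy))
print(unexpected)
```

If the second result is not empty, the two files do not share the same predictors and the import should be checked.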
Next, we split the independent & dependent variables in the training set.
# separate dep and indep vars
train_X <- subset(train, select = -c(Survived))
train_y <- train$Survived
The structure of train_X is as follows:
> str(train_X)
'data.frame': 891 obs. of 11 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
$ Sex : chr "male" "female" "female" "female" ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : chr "" "C85" "" "C123" ...
$ Embarked : chr "S" "C" "S" "S" ...
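The output above shows two kinds of gaps: Age contains NA, while Cabin contains empty strings, which R does not count as missing. The following sketch, on a few made-up csv rows (csv_text is a hypothetical stand-in for train.csv), shows one possible way to count missing values per column and to import empty strings as NA through the na.strings argument of read.csv.

```r
# Toy csv text standing in for train.csv (hypothetical rows)
csv_text <- "PassengerId,Age,Cabin,Embarked
1,22,,S
2,38,C85,C
3,,,S
4,35,C123,"

# default import: empty strings in character columns stay ""
toy_default <- read.csv(text = csv_text)
print(colSums(is.na(toy_default)))

# treat empty strings as missing while importing
toy_na <- read.csv(text = csv_text, na.strings = c("", "NA"))
na_counts <- colSums(is.na(toy_na))
print(na_counts)
```

Whether empty strings should be treated as missing depends on the variable, so this is a choice to make per dataset rather than a fixed rule.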
The structure of test_X is similar to that of train_X.
The variable train_y is an integer vector. Note that we do not have test_y available.
In order to check the performance of our model, we need to submit our predictions to Kaggle.
> str(train_y)
int [1:891] 0 1 1 1 0 0 0 0 1 1 ...
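For a binary target, a quick look at the class balance is often useful. The sketch below reuses only the first ten values printed above as a toy vector (y_toy is a hypothetical name), so the numbers do not describe the full training set.

```r
# First ten values of the target as shown in the str() output (toy vector)
y_toy <- c(0, 1, 1, 1, 0, 0, 0, 0, 1, 1)

# absolute counts per class
class_counts <- table(y_toy)
print(class_counts)

# relative frequencies per class
class_props <- prop.table(class_counts)
print(class_props)

# for a 0/1 vector, the mean equals the share of the positive class
positive_share <- mean(y_toy)
print(positive_share)
```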
Different plots can be made depending on whether the variable is numerical or categorical.
# histogram (numerical)
hist(train_X$Age, breaks = 30)
# boxplot (numerical)
boxplot(train_X$Age)
# barplot (categorical)
barplot(table(train_X$Embarked))
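The plotting functions also accept titles and axis labels, and a boxplot can be drawn per group to compare a numerical variable across the levels of a categorical one. This is a minimal sketch on made-up values (the data frame toy is hypothetical), not part of the original exercise.

```r
# Hypothetical toy data standing in for train_X
toy <- data.frame(
  Age    = c(22, 38, 26, 35, NA, 54, 2, 27, 14, 4),
  Pclass = c(3, 1, 3, 1, 3, 1, 3, 3, 2, 3)
)

# histogram with a title and an axis label; NA values are dropped by hist()
h <- hist(toy$Age, breaks = 5, main = "Distribution of Age", xlab = "Age (years)")

# the returned object stores the number of observations per bin
n_binned <- sum(h$counts)
print(n_binned)

# one boxplot per passenger class (numerical versus categorical)
b <- boxplot(Age ~ Pclass, data = toy, xlab = "Pclass", ylab = "Age (years)")
print(b$names)
```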
Again, different statistics can be computed depending on whether the variable is numerical or categorical. Statistics can either be univariate or bivariate.
# univariate (numerical)
mean(train_X$Age, na.rm = TRUE)
sd(train_X$Age, na.rm = TRUE)
median(train_X$Age, na.rm = TRUE)
quantile(train_X$Age, na.rm = TRUE)
summary(train_X)
# univariate (categorical)
unique(train_X$Embarked)
table(train_X$Embarked)
prop.table(table(train_X$Embarked))
# bivariate (numerical)
cor(train_X$SibSp, train_X$Fare) # correlation coefficient
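Two situations are not covered by the statistics above: a correlation where one of the variables has missing values (such as Age), and a bivariate statistic for categorical variables. The sketch below illustrates both on made-up values (toy and survived_toy are hypothetical stand-ins for train_X and train_y).

```r
# Hypothetical toy data standing in for train_X and train_y
toy <- data.frame(
  Age  = c(22, 38, 26, 35, NA, 54),
  Fare = c(7.25, 71.28, 7.92, 53.10, 8.05, 51.86),
  Sex  = c("male", "female", "female", "female", "male", "male")
)
survived_toy <- c(0, 1, 1, 1, 0, 0)

# with the default settings, cor() returns NA when a value is missing
cor_default <- cor(toy$Age, toy$Fare)

# use = "complete.obs" drops the incomplete pairs first
cor_complete <- cor(toy$Age, toy$Fare, use = "complete.obs")

# bivariate (categorical): cross table and row-wise proportions
cross_tab <- table(toy$Sex, survived_toy)
row_props <- prop.table(cross_tab, margin = 1)
print(cross_tab)
print(row_props)
```

With margin = 1, each row of the proportion table sums to 1, so the rows can be read as the distribution of the target within each category.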
Some of the exercises are not tested by Dodona (for example the plots), but it is still useful to try them.
The following exercises will work with data from the House Prices competition, available on Kaggle. This is a regression problem where we predict the sale price of houses.
Read in the data as train and test_X.
For the Dodona exercises, the data should be in your working directory. The easy solution is to set your working directory to the location of the current file and to paste the data in the same folder as your R script. (For the group assignment, you should store data and code in dedicated folders, data and src; see the theory lecture.)
For train, only retain the first 20 predictors & the target (last column) with train <- train[, -seq(21, ncol(train) - 1)].
For test_X, only retain the first 20 predictors (there is no target) with test_X <- test_X[, -seq(21, ncol(test_X))].
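Split the training set into train_X and train_y (note: identify the dependent variable based on the data description).
Inspect train_X, train_y, and test_X.
Make a suitable plot for the numerical variable LotFrontage.
Make a suitable plot for the categorical variable HouseStyle.
Compute summary statistics for LotFrontage. Store the result in LotFrontage_summary.
Compute the proportions of the categories of HouseStyle. Store the result in HouseStyle_prop.
What is the correlation between OverallQual and LotArea? Store the result in QualArea_cor.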
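The negative-index expression used above to retain the first predictors and the target can be hard to read at first. The sketch below applies the same pattern to a small made-up data frame (toy, k, and the column names are hypothetical) and compares it with an equivalent selection written with positive indices.

```r
# Hypothetical toy data frame: 6 predictors followed by the target in the last column
toy <- data.frame(
  x1 = 1:3, x2 = 4:6, x3 = 7:9, x4 = 10:12, x5 = 13:15, x6 = 16:18,
  target = c(100, 200, 300)
)

k <- 2  # number of leading predictors to keep

# drop columns k + 1 up to the second-to-last one; the last column (target) stays
toy_small <- toy[, -seq(k + 1, ncol(toy) - 1)]
print(names(toy_small))

# equivalent selection written with positive indices
toy_small2 <- toy[, c(seq_len(k), ncol(toy))]
print(names(toy_small2))
```

Note that the negative-index version assumes there is at least one column to drop: if the data frame had exactly k + 1 columns, seq() would count downwards and remove the target as well, which the positive-index version avoids.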