Using the Boston data set, you will develop a model to predict whether a given suburb has a crime rate above or below the median.
Some of the exercises are not tested by Dodona (for example the plots), but it is still useful to try them.
Do the following preprocessing steps:
Create a binary variable, crime01, that contains a 1 if crim (the per capita crime rate column of Boston) contains a value above its median, and a 0 if crim contains a value below its median. You can compute the median using the median() function.
Use the data.frame() function to create a single data set containing both crime01 and the other Boston variables. Add crime01 as the last column in the new dataset. Store the result in the variable data.
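The two preprocessing steps above could be sketched as follows. This is a minimal sketch, not a reference solution: it assumes the crime rate column is named crim (as in MASS::Boston), and it assigns 0 to any value exactly equal to the median, a case the exercise does not specify.

```r
library(MASS)  # provides the Boston data frame

# 1 when the per capita crime rate is above its median, 0 otherwise
# (assumption: values equal to the median are coded as 0)
crime01 <- as.numeric(Boston$crim > median(Boston$crim))

# Combine the original variables with crime01 as the last column
data <- data.frame(Boston, crime01)

# Quick check of the class balance
table(data$crime01)
```

Because the threshold is the median, the two classes should come out roughly balanced.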
Explore the data graphically in order to investigate the association between crime01 and the other features.
Which of the other features seem most likely to be useful in predicting crime01?
For example, you can make pairwise scatterplots with pairs().
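One possible way to approach the exploration is sketched below. The correlation ranking and the boxplot are illustrative additions not requested by the exercise, and the names predictors and cors are hypothetical. The subset of variables passed to pairs() is only an example choice.

```r
library(MASS)  # provides the Boston data frame

crime01 <- as.numeric(Boston$crim > median(Boston$crim))
data <- data.frame(Boston, crime01)

# Candidate predictors: everything except crim (crime01 is derived from it)
# and crime01 itself
predictors <- setdiff(names(data), c("crim", "crime01"))

# Correlation of each predictor with crime01, sorted by absolute strength
cors <- cor(data[, predictors], data$crime01)[, 1]
print(sort(abs(cors), decreasing = TRUE))

# Pairwise scatterplots for a handful of variables
pairs(data[, c("nox", "rad", "dis", "crime01")])

# Distribution of a single predictor within each class
boxplot(nox ~ crime01, data = data, xlab = "crime01", ylab = "nox")
```

A correlation with a binary variable is only a rough screening tool, so the plots remain the main evidence for which features look useful.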
Do a train-test split:
Use the sample() function to split the data.
Take 70% of the data (354 rows) in the training set and the other 30% in the test set. Use a seed value of 1.
Store the indices of the training set in train.
Create data.test, which contains only the test observations (dependent + independent variables).
Create crime01.test, which contains only the crime01 values of the test observations.
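The split described above might look like the following sketch. The exact form of the sample() call expected by the grader is not stated in the exercise, so drawing 354 indices from 1 to nrow(data) is an assumption.

```r
library(MASS)  # provides the Boston data frame

crime01 <- as.numeric(Boston$crim > median(Boston$crim))
data <- data.frame(Boston, crime01)

set.seed(1)                        # seed value required by the exercise
train <- sample(nrow(data), 354)   # indices of the training rows (about 70%)

# Test set: every row whose index was not sampled into train
data.test <- data[-train, ]
crime01.test <- data$crime01[-train]

dim(data.test)
```

Negative indexing with -train keeps the test set as the exact complement of the training set, so no row appears in both.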
Perform logistic regression on the training data in order to predict crime01 using the variables nox, rad, and dis.
What is the test error (NOT accuracy), i.e. the fraction of misclassified test observations, of the model on the test data?
Store the model in glm.fit, the predictions in glm.pred and the test error in glm.error.
Use a threshold of 0.5 to classify predicted probabilities as 0 or 1 (numeric vector).
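A hedged sketch of the modelling step follows. It repeats the preprocessing and split so that it runs on its own; glm.probs is a hypothetical intermediate name that the exercise does not ask for, and the split itself rests on the same assumptions about sample() as before.

```r
library(MASS)  # provides the Boston data frame

crime01 <- as.numeric(Boston$crim > median(Boston$crim))
data <- data.frame(Boston, crime01)

set.seed(1)
train <- sample(nrow(data), 354)
data.test <- data[-train, ]
crime01.test <- data$crime01[-train]

# Logistic regression fitted on the training rows only
glm.fit <- glm(crime01 ~ nox + rad + dis, data = data[train, ],
               family = binomial)

# Predicted probabilities of crime01 == 1 for the test rows
glm.probs <- predict(glm.fit, newdata = data.test, type = "response")

# Classify with a 0.5 threshold; unname() leaves a plain numeric vector
glm.pred <- unname(ifelse(glm.probs > 0.5, 1, 0))

# Test error: share of test observations that are misclassified
glm.error <- mean(glm.pred != crime01.test)
glm.error
```

The test error is one minus the accuracy, which is why the comparison uses != rather than ==.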
Assume that:
The MASS library has been loaded.
The Boston dataset has been loaded and attached.