Using the Boston data set, you will develop a model to predict whether a given suburb has a crime rate above or below the median.


Questions

Some of the exercises are not tested by Dodona (for example the plots), but it is still useful to try them.

  1. Do the following preprocessing steps:

    1. Create a binary variable, crime01, that contains a 1 if crim (the per capita crime rate) is above its median, and a 0 if crim is below its median. You can compute the median using the median() function.

    2. Use the data.frame() function to create a single data set containing both crime01 and the other Boston variables. Add crime01 as the last column in the new dataset. Store the result in the variable data.

  2. Explore the data graphically in order to investigate the association between crime01 and the other features. Which of the other features seem most likely to be useful in predicting crime01? For example, you can make pairwise scatterplots with pairs().

  3. Do a train-test split:

    1. Split the data into a training set and a test set with the sample() function. Put 70% of the data (354 rows) in the training set and the remaining 30% in the test set. Use a seed value of 1. Store the indices of the training set in train.
    2. Create a hold-out dataset data.test that contains only the test observations (dependent + independent variables).
    3. Create a hold-out dependent variable crime01.test that contains only the test observations.

  4. Perform logistic regression on the training data in order to predict crime01 using the variables nox, rad, and dis. Use a threshold of 0.5 to classify the predicted probabilities as 0 or 1 (numeric vector). What is the test error (the fraction of misclassified test observations, NOT the accuracy) of the model on the test data? Store the model in glm.fit, the predictions in glm.pred and the test error in glm.error.
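
The steps above can be sketched roughly as follows. This is only one possible approach, not a reference solution. It assumes that Boston comes from the MASS package (the exercise environment may load it differently) and that the split is drawn with a single sample() call right after set.seed(1); the helper name glm.probs is introduced here purely for illustration.

```r
library(MASS)  # assumption: Boston is taken from the MASS package

# Step 1: binary response, appended as the last column of the new data set
crime01 <- as.numeric(Boston$crim > median(Boston$crim))
data <- data.frame(Boston, crime01)

# Step 2 (numeric complement to pairs()): correlation of every column with crime01
print(round(cor(data)[, "crime01"], 2))

# Step 3: 70/30 train-test split with seed 1
set.seed(1)
train <- sample(nrow(data), 354)       # row indices of the training set
data.test <- data[-train, ]            # all columns, test rows only
crime01.test <- data$crime01[-train]   # test response only

# Step 4: logistic regression fitted on the training rows
glm.fit <- glm(crime01 ~ nox + rad + dis, data = data,
               family = binomial, subset = train)

# Predicted probabilities on the test set (glm.probs is a hypothetical helper name)
glm.probs <- predict(glm.fit, newdata = data.test, type = "response")

# Classify with a 0.5 threshold; as.numeric() gives a plain 0/1 numeric vector
glm.pred <- as.numeric(glm.probs > 0.5)

# Test error: fraction of test observations that are misclassified
glm.error <- mean(glm.pred != crime01.test)
```

The test error is computed as the mean of a logical vector, so it equals 1 minus the accuracy. Comparing glm.pred with crime01.test through table(glm.pred, crime01.test) shows where the misclassifications occur.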


Assume that: