It is claimed that in the case of data that is just barely linearly separable, a support vector classifier with a small value of “cost” that misclassifies a couple of training observations may perform better on test data than one with a huge value of “cost” that does not misclassify any training observations. You will now investigate that claim.
First we generate two-class data with \(p = 2\) in such a way that the classes are just barely linearly separable. We randomly generate 1,000 points (500 per class) scattered on either side of the line \(x = y\) with a wide margin. We then add 100 noisy points (50 per class) along the line \(5x − 4y − 50 = 0\). These noisy points make the classes barely separable and also shift the maximum margin classifier.
set.seed(1)
# Class one: 500 points well above the line y = x
x.one <- runif(500, 0, 90)
y.one <- runif(500, x.one + 10, 100)
# 50 noisy class-one points just above the line 5x - 4y - 50 = 0
x.one.noise <- runif(50, 20, 80)
y.one.noise <- 5/4 * (x.one.noise - 10) + 0.1
# Class zero: 500 points well below the line y = x
x.zero <- runif(500, 10, 100)
y.zero <- runif(500, 0, x.zero - 10)
# 50 noisy class-zero points just below the same line
x.zero.noise <- runif(50, 20, 80)
y.zero.noise <- 5/4 * (x.zero.noise - 10) - 0.1
# The first 550 observations (500 clean + 50 noisy) belong to class one
class.one <- seq(1, 550)
x <- c(x.one, x.one.noise, x.zero, x.zero.noise)
y <- c(y.one, y.one.noise, y.zero, y.zero.noise)
plot(x[class.one], y[class.one], col = "blue", pch = "+", ylim = c(0, 100))
points(x[-class.one], y[-class.one], col = "red", pch = 4)
# Class labels: 1 for class one, 0 for class zero
z <- rep(0, 1100)
z[class.one] <- 1
data <- data.frame(x = x, y = y, z = as.factor(z))

The data set is stored in the data frame data; it has one dependent variable, z, and two independent variables, x and y.
We will use this data set as the training data for our classifiers.
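As a quick sanity check, str(data) should report 1,100 observations of three variables, with x and y numeric and z a factor with levels 0 and 1:
str(data)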
Compute the cross-validation error rates for support vector classifiers over a range of cost values using the tune() function from the e1071 package.
Use the following range: c(0.01, 0.1, 1, 5, 10, 100, 1000, 10000).
Store the output of the tune() function in tune.out, and call set.seed(2) before running the cross-validation.
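A minimal sketch of what that call might look like, assuming a linear kernel as in the rest of this exercise:
set.seed(2)
# Cross-validate a linear support vector classifier over the cost grid
tune.out <- tune(svm, z ~ ., data = data, kernel = "linear",
                 ranges = list(cost = c(0.01, 0.1, 1, 5, 10, 100, 1000, 10000)))
summary(tune.out)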
How many training observations are misclassified for each value of cost considered,
and how does this relate to the cross-validation errors obtained?
Create a data frame train.misclass with two columns:
the first column is named “cost” and contains all the values of cost tested in the cross-validation
(these are stored in tune.out$performances$cost),
the second column is named “n_misclassified” and contains the number of misclassified observations
(the mean error rate of each cross-validation run is stored in tune.out$performances$error).
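One way to assemble this data frame, sketched under the assumption that the count of misclassified observations is recovered by multiplying each mean CV error rate by the 1,100 training observations:
# Convert mean CV error rates into counts of misclassified observations
train.misclass <- data.frame(
  cost = tune.out$performances$cost,
  n_misclassified = tune.out$performances$error * nrow(data)
)
train.misclass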
Next we generate a test data set of 1,000 points scattered on either side of the line \(x = y\), without the wide margin of the training data. (Note that set.seed(3) must come before the random draws for the test set to be reproducible.)
set.seed(3)
x.test <- runif(1000, 0, 100)
class.one <- sample(1000, 500)
y.test <- rep(NA, 1000)
# Set y > x for class.one
for (i in class.one) {
  y.test[i] <- runif(1, x.test[i], 100)
}
# Set y < x for class.zero
for (i in setdiff(1:1000, class.one)) {
  y.test[i] <- runif(1, 0, x.test[i])
}
plot(x.test[class.one], y.test[class.one], col = "blue", pch = "+")
points(x.test[-class.one], y.test[-class.one], col = "red", pch = 4)
z.test <- rep(0, 1000)
z.test[class.one] <- 1
data.test <- data.frame(x = x.test, y = y.test, z = as.factor(z.test))

We also define the range of values we want to test for cost.
costs <- c(0.01, 0.1, 1, 5, 10, 100, 1000, 10000)
Compute the test errors corresponding to each of the values of cost considered.
Create a vector test.err with the same length as costs (length(costs)) and, in each iteration, store the number of test observations misclassified by a support vector classifier trained with that value of cost.
Then create a data frame test.misclass with two columns:
the first column is named “cost” and contains all the values of cost used for training,
the second column is named “n_misclassified” and contains the number of misclassified test observations for each cost value.
Assume that the e1071 library has been loaded.
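A sketch of one way to carry this out, assuming the training data frame data and test data frame data.test defined above:
# Fit a linear support vector classifier for each cost and
# count how many test observations it misclassifies
test.err <- rep(NA, length(costs))
for (i in seq_along(costs)) {
  fit <- svm(z ~ ., data = data, kernel = "linear", cost = costs[i])
  pred <- predict(fit, data.test)
  test.err[i] <- sum(pred != data.test$z)
}
test.misclass <- data.frame(cost = costs, n_misclassified = test.err)
test.misclass
Comparing test.misclass with train.misclass should reveal whether the smaller cost values, which tolerate a few training misclassifications, yield fewer test misclassifications than the very large ones, as the claim at the start of this exercise suggests.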