We now examine the Khan data set, which consists of a number of tissue samples corresponding to four distinct types of small round blue cell tumors. For each tissue sample, gene expression measurements are available. The data set consists of training data, xtrain and ytrain, and testing data, xtest and ytest.

We examine the dimension of the data:

library(ISLR2)
names(Khan)
[1] "xtrain" "xtest"  "ytrain" "ytest" 

dim(Khan$xtrain)
[1]   63 2308

dim(Khan$xtest)
[1]   20 2308

length(Khan$ytrain)
[1] 63

length(Khan$ytest)
[1] 20

This data set consists of expression measurements on 2,308 genes. The training and test sets consist of 63 and 20 observations, respectively.

table(Khan$ytrain)
 1  2  3  4 
 8 23 12 20 

table(Khan$ytest)
 1  2  3  4 
 3  6  6  5 

These tables give the number of observations in each cancer subtype. There is some class imbalance in the training set: class 2 has 23 observations, while class 1 has only 8.
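To quantify the imbalance, the counts can be converted to proportions with base R's prop.table(). This check is not part of the original analysis, just a quick sketch:

```r
# Class proportions in the training and test sets
# (Khan comes from the ISLR2 package, as above)
library(ISLR2)
round(prop.table(table(Khan$ytrain)), 2)
#    1    2    3    4 
# 0.13 0.37 0.19 0.32 
round(prop.table(table(Khan$ytest)), 2)
#    1    2    3    4 
# 0.15 0.30 0.30 0.25 
```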

We will use a support vector approach to predict cancer subtype from the gene expression measurements. In this data set, the number of features (2,308) is very large relative to the number of observations (63). This suggests that we should use a linear kernel, because the additional flexibility that would result from a polynomial or radial kernel is unnecessary.

library(e1071)
dat <- data.frame(x = Khan$xtrain, y = as.factor(Khan$ytrain))
out <- svm(y ~ ., data = dat, kernel = "linear", cost = 10)
summary(out)

Call:
svm(formula = y ~ ., data = dat, kernel = "linear", 
    cost = 10)
Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  10 
Number of Support Vectors:  58
 ( 20 20 11 7 )
Number of Classes:  4 
Levels: 
 1 2 3 4

table(out$fitted, dat$y)
   
     1  2  3  4
  1  8  0  0  0
  2  0 23  0  0
  3  0  0 12  0
  4  0  0  0 20

We see that there are no training errors. This is not surprising: when the number of variables greatly exceeds the number of observations, it is easy to find hyperplanes that fully separate the classes. We are most interested not in the support vector classifier’s performance on the training observations, but rather in its performance on the test observations.
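The confusion matrix above can be summarized as a single error rate. A one-line check, using the out and dat objects already in the session:

```r
# Training error rate: fraction of fitted labels that disagree with the truth
# (uses out and dat from the svm() fit above)
mean(out$fitted != dat$y)
# [1] 0
```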

dat.test <- data.frame(x = Khan$xtest, y = as.factor(Khan$ytest))
pred.test <- predict(out, newdata = dat.test)
table(pred.test, dat.test$y)
       
pred.test 1 2 3 4
        1 3 0 0 0
        2 0 6 2 0
        3 0 0 4 0
        4 0 0 0 5

We see that using cost = 10 yields two test set errors on this data: two class 3 observations are misclassified as class 2, for a test error rate of 2/20 = 10%.
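As with the training set, the test confusion matrix can be summarized as an error rate, using the objects already in the session:

```r
# Test error rate corresponding to the confusion matrix above
# (uses pred.test and dat.test from the prediction step above)
mean(pred.test != dat.test$y)
# [1] 0.1
```

The value cost = 10 was fixed here for illustration; in practice one could instead select it by cross-validation, for example with e1071's tune() function.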