We now examine the Khan data set, which consists of a number of tissue samples corresponding to four distinct types of small round blue cell tumors. For each tissue sample, gene expression measurements are available. The data set consists of training data, xtrain and ytrain, and testing data, xtest and ytest.
We examine the dimension of the data:
library(ISLR2)
names(Khan)
[1] "xtrain" "xtest" "ytrain" "ytest"
dim(Khan$xtrain)
[1] 63 2308
dim(Khan$xtest)
[1] 20 2308
length(Khan$ytrain)
[1] 63
length(Khan$ytest)
[1] 20
This data set consists of expression measurements for 2,308 genes. The training and test sets consist of 63 and 20 observations, respectively.
table(Khan$ytrain)
1 2 3 4
8 23 12 20
table(Khan$ytest)
1 2 3 4
3 6 6 5
Here we see the number of observations per cancer subtype. There appears to be some class imbalance: in the training set, for instance, subtype 1 has only 8 observations while subtype 2 has 23.
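The imbalance can be made explicit as class proportions, a quick check using base R's prop.table():

```r
# Fraction of training observations falling in each tumor subtype
round(prop.table(table(Khan$ytrain)), 2)
# Subtype 2 accounts for roughly a third of the training set,
# while subtype 1 accounts for about an eighth.
```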
We will use a support vector approach to predict cancer subtype using gene expression measurements. In this data set, there are a very large number of features relative to the number of observations. This suggests that we should use a linear kernel, because the additional flexibility that will result from using a polynomial or radial kernel is unnecessary.
library(e1071)
dat <- data.frame(x = Khan$xtrain, y = as.factor(Khan$ytrain))
out <- svm(y ~ ., data = dat, kernel = "linear", cost = 10)
summary(out)
Call:
svm(formula = y ~ ., data = dat, kernel = "linear",
cost = 10)
Parameters:
SVM-Type: C-classification
SVM-Kernel: linear
cost: 10
Number of Support Vectors: 58
( 20 20 11 7 )
Number of Classes: 4
Levels:
1 2 3 4
table(out$fitted, dat$y)
1 2 3 4
1 8 0 0 0
2 0 23 0 0
3 0 0 12 0
4 0 0 0 20
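The diagonal confusion matrix above can also be summarized as a single error rate, a small sketch using the out and dat objects created earlier:

```r
# Overall training error rate: fraction of fitted labels that
# disagree with the true labels
mean(out$fitted != dat$y)
# returns 0, matching the all-diagonal confusion matrix
```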
We see that there are no training errors. In fact, this is not surprising, because the large number of variables relative to the number of observations implies that it is easy to find hyperplanes that fully separate the classes. We are most interested not in the support vector classifier’s performance on the training observations, but rather in its performance on the test observations.
dat.test <- data.frame(x = Khan$xtest, y = as.factor(Khan$ytest))
pred.test <- predict(out, newdata = dat.test)
table(pred.test, dat.test$y)
pred.test 1 2 3 4
1 3 0 0 0
2 0 6 2 0
3 0 0 4 0
4 0 0 0 5
We see that using cost = 10 yields two test set errors on these data: two observations from class 3 are misclassified as class 2.
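The corresponding test error rate can be computed directly from the predictions, using the pred.test and dat.test objects defined above:

```r
# Test error rate: 2 misclassified out of 20 test observations
mean(pred.test != dat.test$y)
# 2/20 = 0.1
```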