Here, we implement the re-sampling approach to hypothesis testing using
the Khan
dataset, which we investigated in Section 13.5. First, we merge
the training and testing data, which results in observations on 83 patients
for 2,308 genes.
> attach(Khan)
> x <- rbind(xtrain, xtest)
> y <- c(as.numeric(ytrain), as.numeric(ytest))
> dim(x)
[1] 83 2308
> table(y)
y
1 2 3 4
11 29 18 25
There are four classes of cancer. For each gene, we compare the mean expression in the second class (rhabdomyosarcoma) to the mean expression in the fourth class (Burkitt’s lymphoma). Performing a standard two-sample t-test on the 11th gene produces a test-statistic of \(−2.09\) and an associated p-value of \(0.0412\), suggesting modest evidence of a difference in mean expression levels between the two cancer types.
> x <- as.matrix(x)
> x1 <- x[which(y == 2),]
> x2 <- x[which(y == 4),]
> n1 <- nrow(x1)
> n2 <- nrow(x2)
> t.out <- t.test(x1[, 11], x2[, 11], var.equal = TRUE)
> TT <- t.out$statistic
> TT
t
-2.093633
> t.out$p.value
[1] 0.04118644
However, this p-value relies on the assumption that under the null hypothesis of no difference between the two groups, the test statistic follows a t-distribution with \(29 + 25 - 2 = 52\) degrees of freedom. Instead of using this theoretical null distribution, we can randomly split the \(54\) patients into two groups of \(29\) and \(25\), and compute a new test statistic. Under the null hypothesis of no difference between the groups, this new test statistic should have the same distribution as our original one. Repeating this process 10,000 times allows us to approximate the null distribution of the test statistic. We compute the fraction of the time that our observed test statistic exceeds the test statistics obtained via re-sampling.
> set.seed(1)
> B <- 10000
> Tbs <- rep(NA, B)
> for (b in 1:B) {
+ dat <- sample(c(x1[, 11], x2[, 11]))
+ Tbs[b] <- t.test(dat[1:n1], dat[(n1 + 1):(n1 + n2)], var.equal = TRUE)$statistic
+ }
> mean((abs(Tbs) >= abs(TT)))
[1] 0.0416
This fraction, \(0.0416\), is our re-sampling-based p-value. It is almost identical to the p-value of \(0.0412\) obtained using the theoretical null distribution.
We can plot a histogram of the re-sampling-based test statistics.
hist(Tbs, breaks = 100, xlim = c(-4.2, 4.2), main = "",
xlab = "Null Distribution of Test Statistic", col = 7)
lines(seq(-4.2, 4.2, len = 1000), dt(seq(-4.2, 4.2, len = 1000), df = (n1 + n2 - 2)) * 1000, col = 2, lwd = 3)
abline(v = TT, col = 4, lwd = 2)
text(TT + 0.5, 350, paste("T = ", round(TT, 4), sep = ""), col = 4)
TT
and p-value in tt.pvalue
.rs.pvalue
. Use a seed value of 1.Assume that:
ISLR2
library has been loadedKhan
dataset has been loaded and attached