A gene expression data set consists of 40 tissue samples with measurements on 1000 genes. The first 20 samples are from healthy patients, while the second 20 are from a diseased group.

Questions

  1. Load the data using read.csv() into the variable genes. You will need to pass header = F as one of the arguments.

  2. Look at the dimensions of the data set. Are they in the correct format for the cor() function? Store the dimensions in genes.dim.

  3. Apply hierarchical clustering to the samples using correlation-based distance, and plot the dendrogram. Do the genes separate the samples into two groups? Do your results depend on the type of linkage used? Store the correlation matrix in genes.cor and the correlation-based distance matrix in dist.cor. Store the models in hc.complete, hc.single and hc.average.
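
The steps in the questions above can be sketched as follows. This is only an illustration under stated assumptions, not a reference solution. The real file is replaced by simulated data, and the layout (1000 genes in rows, 40 samples in columns) is assumed rather than given in the exercise. With that layout cor() correlates the samples directly; if the samples were in rows, the data would need to be transposed with t() first. The distance 1 - correlation is one common choice of correlation-based distance.

```r
# Simulated stand-in for the real data (assumption): genes in rows, samples in columns.
set.seed(1)
genes <- matrix(rnorm(1000 * 40), nrow = 1000, ncol = 40)

# Shift 100 genes in the last 20 samples so the two simulated groups differ.
genes[1:100, 21:40] <- genes[1:100, 21:40] + 2
genes <- as.data.frame(genes)

# Dimensions as (rows, columns).
genes.dim <- dim(genes)

# cor() computes correlations between columns, so this is a 40 x 40 sample matrix.
genes.cor <- cor(genes)

# Correlation-based distance: highly correlated samples end up close together.
dist.cor <- as.dist(1 - genes.cor)

# Hierarchical clustering with three linkage types.
hc.complete <- hclust(dist.cor, method = "complete")
hc.single   <- hclust(dist.cor, method = "single")
hc.average  <- hclust(dist.cor, method = "average")

# Dendrogram for one linkage; repeat with hc.single and hc.average to compare.
plot(hc.complete, main = "Complete linkage", xlab = "", sub = "")

# Cross-tabulate a two-cluster cut against the known group labels.
table(cutree(hc.complete, k = 2), rep(c("healthy", "diseased"), each = 20))
```

Comparing the three dendrograms, and the corresponding cutree() tables, is one way to judge whether the split into two groups depends on the linkage.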

Do not include the setwd() statement in your submission on Dodona.