Term Document Matrices

Splitting the data

SentimentReal$label <- as.factor(SentimentReal$label)

Next, we create the training and test sets. We shuffle the row indices with the sample function and set the seed so that the same subsets are drawn on every run.

set.seed(1000) 

ind <- sample(x = nrow(SentimentReal), size = nrow(SentimentReal), replace = FALSE)
# Use the shuffled indices; indexing with 1:n alone would ignore the shuffle
n_train <- floor(length(ind) * .60)
train <- SentimentReal[ind[1:n_train], ]
test <- SentimentReal[ind[(n_train + 1):length(ind)], ]
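
To make the splitting logic concrete, here is a minimal sketch on a toy data frame. The object names toy, ind_toy, train_toy, and test_toy are hypothetical stand-ins and not part of the original data.

```r
# Toy stand-in for SentimentReal (hypothetical data, for illustration only)
toy <- data.frame(message = paste("message", 1:10),
                  label   = factor(rep(c("neg", "pos"), 5)),
                  stringsAsFactors = FALSE)

set.seed(1000)                       # fixes the shuffle so the split is reproducible
ind_toy <- sample(nrow(toy))         # random permutation of the row indices
n_train <- floor(length(ind_toy) * 0.60)

train_toy <- toy[ind_toy[1:n_train], ]                      # first 60% of the shuffled rows
test_toy  <- toy[ind_toy[(n_train + 1):length(ind_toy)], ]  # remaining 40%

nrow(train_toy)  # 6
nrow(test_toy)   # 4
```

Because every row index appears exactly once in the permutation, no observation ends up in both subsets.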

Corpora

First, we convert the training and test sets to corpora.

corpus_train <- Corpus(VectorSource(train$message))
corpus_test <- Corpus(VectorSource(test$message))

N-grams

Next, we create n-grams for the training and test sets. We restrict ourselves to unigrams here, but the code can be adapted to higher-order n-grams. We use a new function, NGramTokenizer from the RWeka package, which is wrapped in the custom Tokenizer function. The mindegree and maxdegree variables specify which degrees of n-gram to include: mindegree = 1 and maxdegree = 3 would include unigrams, bigrams, and trigrams. We then pass this function to the control argument of DocumentTermMatrix.

mindegree <- 1; maxdegree <- 1  # unigrams only; raise maxdegree for higher-order n-grams
Tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = mindegree, max = maxdegree))
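
To illustrate what the minimum and maximum degree control, the following base-R sketch builds n-grams by hand. The helper ngrams_base is hypothetical and only mimics the idea; it is not the RWeka implementation and may tokenize differently.

```r
# Hypothetical helper: build all n-grams of degree min..max from one string.
# Base-R illustration only; it is not the RWeka implementation.
ngrams_base <- function(text, min = 1, max = 1) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  words <- words[nzchar(words)]          # drop empty tokens
  out <- character(0)
  for (n in min:max) {
    if (n > length(words)) next          # no n-grams longer than the text
    starts <- seq_len(length(words) - n + 1)
    out <- c(out, vapply(starts,
                         function(s) paste(words[s:(s + n - 1)], collapse = " "),
                         character(1)))
  }
  out
}

uni <- ngrams_base("I love this movie", min = 1, max = 1)
uni     # "i" "love" "this" "movie"
upto3 <- ngrams_base("I love this movie", min = 1, max = 3)
upto3   # 4 unigrams, 3 bigrams, and 2 trigrams
```

Raising the maximum degree quickly increases the number of distinct terms, and with it the number of columns of the resulting matrix.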

Training set

Now we can create the document-term matrix (DTM) of the training set. Note that we create a plain term-frequency (TF) matrix and not a TF-IDF matrix.

dtm_train <- DocumentTermMatrix(corpus_train, control = list(tokenize = Tokenizer,
                                                          weighting = function(x) weightTf(x),
                                                          removeNumbers=TRUE,
                                                          removePunctuation=TRUE,
                                                          stripWhitespace= TRUE))
dtm_train
<<DocumentTermMatrix (documents: 57, terms: 794)>>
Non-/sparse entries: 1343/43915
Sparsity           : 97%
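
The difference between TF and TF-IDF weighting can be sketched on a toy matrix in base R. The sketch assumes one common definition of the inverse document frequency, log2(N / df); weighting functions in packages may differ in details such as normalization.

```r
# Toy term-frequency matrix: rows are documents, columns are terms (hypothetical data)
tf <- matrix(c(2, 1, 0,
               1, 0, 1,
               1, 0, 0),
             nrow = 3, byrow = TRUE,
             dimnames = list(paste0("doc", 1:3), c("good", "bad", "film")))

df    <- colSums(tf > 0)           # number of documents containing each term
idf   <- log2(nrow(tf) / df)       # assumed IDF definition: rarer terms weigh more
tfidf <- sweep(tf, 2, idf, `*`)    # multiply each column by its IDF

tf                # raw counts, as in the TF matrix above
round(tfidf, 2)   # "good" occurs in every document, so its weight drops to 0
```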

Test set

The test set has to be prepared in exactly the same way as the training set.

dtm_test <- DocumentTermMatrix(corpus_test, control = list(tokenize = Tokenizer,
                                                         weighting = function(x) weightTf(x),
                                                         removeNumbers=TRUE,
                                                         removePunctuation=TRUE,
                                                         stripWhitespace= TRUE))
dtm_test
<<DocumentTermMatrix (documents: 38, terms: 718)>>
Non-/sparse entries: 1122/26162
Sparsity           : 96%

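We then reshape the test DTM so that it has exactly the same terms, in the same order, as the training DTM: terms that occur only in the test set are dropped, and training terms missing from the test set are added as all-zero columns. A model fitted on the training terms can only be applied to a test set with the same columns.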

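prepareTest <- function (train, test) {
  # Terms that occur in both the test and the training DTM
  Intersect <- test[,intersect(colnames(test), colnames(train))]
  # Training terms missing from the test DTM (use the train argument, not the global dtm_train)
  diffCol <- train[,setdiff(colnames(train),colnames(test))]
  # Add those terms to the test DTM as all-zero columns
  newCols <- as.simple_triplet_matrix(matrix(0,nrow=test$nrow,ncol=diffCol$ncol))
  newCols$dimnames <- diffCol$dimnames
  testNew<-cbind(Intersect,newCols)
  # Put the columns in the same order as in the training DTM and return the result
  testNew[,colnames(train)]
}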

dtm_test <- prepareTest(dtm_train, dtm_test)
dtm_test
A 38x794 simple triplet matrix.
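
The same alignment idea can be sketched with small dense matrices in base R. The objects train_m and test_m are hypothetical toy stand-ins for the two DTMs.

```r
# Toy dense matrices standing in for the training and test DTMs (hypothetical data)
train_m <- matrix(1, nrow = 2, ncol = 3,
                  dimnames = list(NULL, c("bad", "film", "good")))
test_m  <- matrix(1, nrow = 2, ncol = 2,
                  dimnames = list(NULL, c("good", "new")))

shared  <- intersect(colnames(test_m), colnames(train_m))  # terms seen in both sets
missing <- setdiff(colnames(train_m), colnames(test_m))    # training terms absent from test

# Zero-filled columns for the missing training terms
zeros <- matrix(0, nrow = nrow(test_m), ncol = length(missing),
                dimnames = list(NULL, missing))

# Keep the shared terms, add the zero columns, then order the columns as in training
aligned <- cbind(test_m[, shared, drop = FALSE], zeros)[, colnames(train_m), drop = FALSE]
aligned
```

The term "new" exists only in the toy test set and is dropped, while "bad" and "film" are added with zero counts.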

Finally, we convert the document-term matrices to sparse matrices from the Matrix package so that the SVD algorithm in the next exercise can be applied efficiently. In the triplet representation, i holds the row indices, j the column indices, and v the values.

dtm.to.sm <- function(dtm) {sparseMatrix(i=dtm$i, j=dtm$j, x=dtm$v,dims=c(dtm$nrow, dtm$ncol))}

sm_train <- dtm.to.sm(dtm_train)
sm_test <- dtm.to.sm(dtm_test)
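
As a small illustration of the triplet representation, the sketch below builds a sparse matrix directly from hand-written i, j, and v vectors. The values are toy data chosen for the example.

```r
library(Matrix)

# Triplets for a 3 x 4 matrix with three non-zero cells (toy values)
i <- c(1, 2, 3)      # row indices (documents)
j <- c(1, 3, 4)      # column indices (terms)
v <- c(2, 1, 5)      # counts stored at the (i, j) positions

sm <- sparseMatrix(i = i, j = j, x = v, dims = c(3, 4))
sm
as.matrix(sm)        # dense view, only sensible for small matrices
```

Only the three non-zero cells are stored, which is what makes this format efficient for DTMs with a sparsity above 95%.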

Exercise

Create a sparse matrix for both the training and the test set of the oxfam dataset and store them as sm_train_oxfam and sm_test_oxfam, respectively. Note that the oxfam data has the same structure and the same column names as the SentimentReal dataset. Do not forget to first create document-term matrices for both the training and the test set.

To download the SentimentReal dataset click here¹.

To download the oxfam dataset click here².

Assume that:

The train_oxfam and test_oxfam variables are given.
The functions Tokenizer, prepareTest, and dtm.to.sm are given.
The SnowballC, slam, tm, and Matrix packages are loaded.