SentimentReal$label <- as.factor(SentimentReal$label)
Next, we create the training and test sets. We do this with the sample function and set the seed so that the same subsets are drawn on every run.
set.seed(1000)
ind <- sample(x = nrow(SentimentReal), size = nrow(SentimentReal), replace = FALSE)
train <- SentimentReal[ind[1:floor(length(ind)*.60)],]
test <- SentimentReal[ind[(floor(length(ind)*.60)+1):(length(ind))],]
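As a quick sanity check on the splitting logic, here is a minimal sketch of a shuffled 60/40 split on a toy data frame. The data frame toy and its values are made up for illustration; only the column names mirror SentimentReal.

```r
# Toy data frame standing in for SentimentReal (hypothetical values)
toy <- data.frame(message = paste("msg", 1:10),
                  label = factor(rep(c("neg", "pos"), 5)))

set.seed(1000)
ind <- sample(x = nrow(toy), size = nrow(toy), replace = FALSE)

# The first 60% of the shuffled indices go to training, the rest to test
n_train <- floor(length(ind) * .60)
train_toy <- toy[ind[1:n_train], ]
test_toy <- toy[ind[(n_train + 1):length(ind)], ]

# Every row lands in exactly one of the two sets
nrow(train_toy)
nrow(test_toy)
length(intersect(rownames(train_toy), rownames(test_toy)))
```

Because the row selection goes through ind, rerunning with the same seed reproduces the same subsets, while a different seed gives a different random split.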
First, we convert the training and test sets to corpora.
corpus_train <- Corpus(VectorSource(train$message))
corpus_test <- Corpus(VectorSource(test$message))
Next, we create N-grams for the training and test sets. We restrict ourselves to unigrams here, but the code can be adapted to higher-order N-grams. We use a new function, NGramTokenizer, which is wrapped in the custom Tokenizer function. Here, you can specify which degrees of N-gram to include: mindegree = 1 and maxdegree = 3 will include unigrams, bigrams, and trigrams. We then pass this function to the control argument of DocumentTermMatrix.
mindegree <- 1; maxdegree <- 1  # unigrams only; raise maxdegree to include bigrams, trigrams, ...
Tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = mindegree, max = maxdegree))
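To see what the minimum and maximum degree control, the following base R sketch builds N-grams by hand. The helper simple_ngrams is hypothetical and written only for illustration; it is not how NGramTokenizer is implemented, and the order in which it returns the N-grams may differ from that tokenizer's output.

```r
# Hypothetical helper: all N-grams of degree min_n..max_n (assumes min_n <= max_n)
simple_ngrams <- function(text, min_n = 1, max_n = 3) {
  # Lowercase and split on anything that is not a letter, digit, or apostrophe
  words <- strsplit(tolower(text), "[^a-z0-9']+")[[1]]
  words <- words[nzchar(words)]
  out <- character(0)
  for (n in min_n:max_n) {
    if (length(words) < n) next
    starts <- seq_len(length(words) - n + 1)
    grams <- vapply(starts,
                    function(i) paste(words[i:(i + n - 1)], collapse = " "),
                    character(1))
    out <- c(out, grams)
  }
  out
}

# Unigrams and bigrams of a short message
simple_ngrams("I love this phone", min_n = 1, max_n = 2)
```

With min_n = 1 and max_n = 2 the four words yield four unigrams and three bigrams, which shows how quickly the number of terms grows once higher degrees are included.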
Now we can create the document-term matrix (DTM) of the training set. Note that we use plain term-frequency (TF) weighting, not TF-IDF.
dtm_train <- DocumentTermMatrix(corpus_train, control = list(tokenize = Tokenizer,
    weighting = function(x) weightTf(x),
    removeNumbers = TRUE,
    removePunctuation = TRUE,
    stripWhitespace = TRUE))
dtm_train
<<DocumentTermMatrix (documents: 57, terms: 794)>>
Non-/sparse entries: 1343/43915
Sparsity : 97%
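To make the difference between TF and TF-IDF weighting concrete, here is a small base R sketch on a made-up count matrix. It uses one common TF-IDF variant (raw counts multiplied by log2 of the number of documents over the document frequency); this is an assumption for illustration and may differ in its details from the weighting functions of the text mining package used above.

```r
# Made-up term-frequency (TF) matrix: 2 documents, 3 terms
tf <- matrix(c(2, 1, 0,
               0, 1, 3),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("doc1", "doc2"), c("good", "phone", "bad")))

# Document frequency: in how many documents does each term occur?
df <- colSums(tf > 0)

# Inverse document frequency: terms that occur everywhere get weight 0
idf <- log2(nrow(tf) / df)

# TF-IDF: scale each term column by its IDF weight
tfidf <- sweep(tf, 2, idf, `*`)
tfidf
```

The term "phone" occurs in both documents, so its TF-IDF weight drops to zero, whereas the plain TF matrix keeps its raw counts.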
The training and test sets have to be prepared in the same way.
dtm_test <- DocumentTermMatrix(corpus_test, control = list(tokenize = Tokenizer,
    weighting = function(x) weightTf(x),
    removeNumbers = TRUE,
    removePunctuation = TRUE,
    stripWhitespace = TRUE))
dtm_test
<<DocumentTermMatrix (documents: 38, terms: 718)>>
Non-/sparse entries: 1122/26162
Sparsity : 96%
We reshape the test DTM so that it has exactly the same terms (columns) as the training DTM. Terms that occur only in the test set are dropped, and terms that occur only in the training set are added to the test DTM as all-zero columns.
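prepareTest <- function (train, test) {
  # Keep the test columns whose terms also occur in the training DTM
  Intersect <- test[, intersect(colnames(test), colnames(train))]
  # Terms that occur only in the training DTM become all-zero columns in the test DTM
  diffCol <- train[, setdiff(colnames(train), colnames(test))]
  newCols <- as.simple_triplet_matrix(matrix(0, nrow = test$nrow, ncol = diffCol$ncol))
  newCols$dimnames <- diffCol$dimnames
  testNew <- cbind(Intersect, newCols)
  # Reorder the columns to match the training DTM and return the result
  testNew[, colnames(train)]
}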
dtm_test <- prepareTest(dtm_train, dtm_test)
dtm_test
A 38x794 simple triplet matrix.
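The column alignment performed by prepareTest can be sketched on ordinary dense matrices. The helper align_columns and the toy matrices below are hypothetical and only illustrate the idea; they do not operate on simple triplet matrices.

```r
# Hypothetical helper: give test_mat exactly the columns of train_mat
align_columns <- function(train_mat, test_mat) {
  # Start from an all-zero matrix with the training columns
  aligned <- matrix(0, nrow = nrow(test_mat), ncol = ncol(train_mat),
                    dimnames = list(rownames(test_mat), colnames(train_mat)))
  # Copy over the counts of the terms that both matrices share
  shared <- intersect(colnames(train_mat), colnames(test_mat))
  aligned[, shared] <- test_mat[, shared, drop = FALSE]
  aligned
}

train_mat <- matrix(c(1, 0, 2,
                      0, 1, 1),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(NULL, c("good", "bad", "phone")))
test_mat <- matrix(c(1, 3,
                     0, 1),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(NULL, c("phone", "great")))

aligned <- align_columns(train_mat, test_mat)
aligned
```

Here "great" is dropped because it never appeared in training, while "good" and "bad" are added as zero columns, so a model fitted on the training columns can be applied to the test matrix directly.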
Finally, we convert the document-term matrices to generic sparse matrices so that the SVD algorithm can be applied efficiently in the next exercise. In the triplet representation, i holds the row indices, j the column indices, and v the values of the non-zero entries.
dtm.to.sm <- function(dtm) {sparseMatrix(i=dtm$i, j=dtm$j, x=dtm$v,dims=c(dtm$nrow, dtm$ncol))}
sm_train <- dtm.to.sm(dtm_train)
sm_test <- dtm.to.sm(dtm_test)
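The triplet representation can be illustrated with a few hand-written entries. This is a minimal sketch that assumes the Matrix package is installed; the index and value vectors are made up.

```r
library(Matrix)

# Three parallel vectors describe the non-zero cells: (row, column, value)
i <- c(1, 1, 2, 3)
j <- c(1, 3, 2, 3)
v <- c(2, 1, 1, 4)

# Build a 3 x 3 sparse matrix; every cell not listed is implicitly zero
sm <- sparseMatrix(i = i, j = j, x = v, dims = c(3, 3))

# Convert to a dense matrix only to inspect the result
as.matrix(sm)
```

Only the four listed cells are stored, which is why this format suits document-term matrices with sparsity above 95%.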
Create a sparse matrix for both the training and the test set of the oxfam dataset and store them as sm_train_oxfam and sm_test_oxfam, respectively.
Note that the oxfam data has the same structure and the same column names as the SentimentReal dataset. However, do not forget to first create document-term matrices for both the training and the test set.
To download the SentimentReal dataset, click here1.
To download the oxfam dataset, click here2.
Assume that the train_oxfam and test_oxfam variables are given, and that Tokenizer, prepareTest, and dtm.to.sm are given.