After the extraction of the tweets and the mining of the data, it is time to estimate the importance of the words using random forest.

p_load(randomForest)
rf <- randomForest(x,y, importance=TRUE)

The following code block extracts the feature importance.

imp <- importance(rf) %>% 
  as_tibble(., rownames = 'Features') %>% 
  arrange(desc(`%IncMSE`))
imp
    Features     `%IncMSE`  IncNodePurity
 1  report       0.9        1.75 
 2  appleevent   8.39       1.46 
 3  indie        6.79       0.393
 4  gamedev      6.64       5.85 
 5  hosted       6.05       0.335
 6  indiegame    5.83       4.53 
 7  amp          5.42       4.38 
 8  iphonepro    5.42       0.963
 9  due          4.46       0.773
10  day          3.38       0.718

The variable importance is plotted.

varImpPlot(rf,type=1)

plot

It can be interesting to use scatterplots or partial plots to determine which important words should or should not be used.

partialPlot(rf,as.data.frame(x),imp$Features[1],
            main = paste('PDP on the word:',imp$Features[1]))

plot

Exercise

Consider the tweets about #PS5 from the exercise before. Build a random forest model and store this in rf. The random forest model needs to have 50 trees. Then, extract the feature importance and store this in feature_imp.

To download the document term matrix as a tibble click: here1

To download the retweet count click: here2


Assume that: