Before we start with the retweet analysis, note that both variable importances and partial plots will be explained in more detail in Lecture 6. Partial plots let you uncover the relationship between a predictor and the response. For the variable importances, we will use the mean decrease in mean squared error (MSE).

if (!require("pacman")) install.packages("pacman") ; require("pacman")
p_load(httr, rtweet, tidyverse, wordcloud2, tm, tidytext)

We have extracted several tweets¹ that contain the hashtag “#apple”; they can be loaded as follows.

load('tweets.Rdata')

Let’s have a look at what tweets were obtained.

tweets_data(tweets)

   user_id  status_id  created_at           screen_name  text   source  display_text_wi~  reply_to_status~         
 1 100145~  14566168~  2021-11-05 13:38:08  Vittorino1~  "Ogg~  Twitt~  180               NA              
 2 100145~  14566158~  2021-11-05 13:34:10  Vittorino1~  "Ogg~  Twitt~  226               NA              
 3 100145~  14566155~  2021-11-05 13:33:15  Vittorino1~  "Da ~  Twitt~  179               NA              
 4 100145~  14566166~  2021-11-05 13:37:22  Vittorino1~  "Sec~  Twitt~  259               NA              
 5 100145~  14566153~  2021-11-05 13:32:13  Vittorino1~  "Da ~  Twitt~  169               NA              
 6 100145~  14566161~  2021-11-05 13:35:20  Vittorino1~  "Da ~  Twitt~  186               NA              
 7 100145~  14566149~  2021-11-05 13:30:47  Vittorino1~  "Da ~  Twitt~  152               NA              
 8 915373~  14566168~  2021-11-05 13:38:04  akhiljoseph  "Xia~  Hoots~  279               NA              
 9 994676~  14566167~  2021-11-05 13:38:01  DroidTrack~  "FrA~  Droid~  140               NA              
10 734619~  14566165~  2021-11-05 13:36:54  Technobugg~  "Xia~  Hoots~  279               NA

Next, the tweet texts are extracted. More tweets were scraped, but here we keep only the first 200.

text <- tweets[1:200,] %>% pull(text)

Now that the scraping is done, it is time to mine the obtained data.

First, convert the text to ASCII, dropping any characters that cannot be converted, and store the retweet count.

text <- iconv(text,'latin1', 'ascii', sub = '')
retweetCount <- tweets[1:200,] %>% pull(retweet_count)
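
To see what the iconv() call above does, here is a minimal sketch on a few made-up strings (the vector toy is hypothetical and not part of the tweets data). With sub = '', every byte that has no ASCII equivalent is dropped, so accented letters and emoji disappear while plain ASCII text is left untouched.

```r
# Toy strings (hypothetical, for illustration only)
toy <- c("caf\u00e9 #apple", "plain ascii", "iPhone \U0001F4F1 launch")

# Same call as in the section: read the bytes as latin1, convert to ASCII,
# and replace anything that cannot be converted with the empty string
clean <- iconv(toy, "latin1", "ascii", sub = "")
print(clean)

# Every remaining character is plain ASCII (code point below 128)
is_ascii <- vapply(clean, function(s) all(utf8ToInt(s) < 128), logical(1))
print(unname(is_ascii))
```

Because characters are removed rather than transliterated, words in other languages can end up truncated, which is worth keeping in mind when reading the terms later on.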

Second, load text mining packages.

p_load(SnowballC, slam, tm)

Third, transform the data into the tm package format.

myCorpus <- Corpus(VectorSource(text))

Fourth, transform all values to lower case.

myCorpus <- tm_map(myCorpus, content_transformer(tolower))

Fifth, remove punctuation.

myCorpus <- tm_map(myCorpus, removePunctuation)

Sixth, remove numbers.

myCorpus <- tm_map(myCorpus, removeNumbers)

Seventh, remove stopwords.

myStopwords <- c(stopwords('english'))
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
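
The four cleaning steps above can be tried on a tiny corpus to see their combined effect. This is only a sketch: the two sentences in toy are invented, and toyCorpus is a hypothetical name used to avoid overwriting myCorpus.

```r
library(tm)

# Two made-up documents (not from the tweets data)
toy <- c("Apple unveils the NEW iPhone 13!!!",
         "Is the MacBook Pro worth it in 2021?")

toyCorpus <- Corpus(VectorSource(toy))
toyCorpus <- tm_map(toyCorpus, content_transformer(tolower))      # lower case
toyCorpus <- tm_map(toyCorpus, removePunctuation)                 # drop punctuation
toyCorpus <- tm_map(toyCorpus, removeNumbers)                     # drop digits
toyCorpus <- tm_map(toyCorpus, removeWords, stopwords("english")) # drop stopwords

# Show the cleaned text of each document
cleaned <- content(toyCorpus)
print(cleaned)
```

Note that removed words leave extra spaces behind; these do not matter for the document-term matrix, because the text is split on whitespace anyway.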

Eighth, create a document-term matrix. The wordLengths option keeps all terms of at least two characters.

(myDtm <- DocumentTermMatrix(myCorpus, 
                             control = list(wordLengths = c(2, Inf))))

Finally, inspect the document-term matrix (DTM).

inspect(myDtm)

      Terms
Docs  apple de ios ipad iphone macbook news newsfusionapps pro via
144       1  5   0    0      0       0    0              0   0   0
15        1  2   0    0      3       0    0              0   0   0
152       1  2   0    0      0       0    0              0   0   0
154       1  0   0    0      0       0    0              0   0   0
178       3  4   0    0      0       0    0              0   0   0
194       3  0   0    0      0       0    0              0   0   0
195       1  0   0    0      0       0    0              0   0   0
2         2  0   0    0      0       0    0              0   0   0
27        1  2   0    4      0       0    0              0   0   0
4         1  0   0    0      2       0    0              0   0   0

range(as_tibble(as.matrix(myDtm)))

[1] 0 5
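
To make the structure of a document-term matrix concrete, the sketch below builds one for two invented documents (toy, toyCorpus and toyDtm are hypothetical names). Each row is a document, each column a term, and each cell the number of times the term occurs. It also shows why wordLengths = c(2, Inf) is passed: tm's default lower bound is three characters, which would drop a two-letter term such as "tv". The col_sums() function from slam sums the counts per term without converting to a dense matrix.

```r
library(tm)
library(slam)

# Two made-up, already cleaned documents
toy <- c("apple iphone iphone", "apple tv pro")
toyCorpus <- Corpus(VectorSource(toy))

# Same control setting as in the section: keep terms of length >= 2
toyDtm <- DocumentTermMatrix(toyCorpus,
                             control = list(wordLengths = c(2, Inf)))
toyMat <- as.matrix(toyDtm)
print(toyMat)

# Total count per term, computed on the sparse representation
termTotals <- col_sums(toyDtm)
print(termTotals)

# With the default word length bounds the two-letter term is dropped
defaultTerms <- Terms(DocumentTermMatrix(toyCorpus))
print("tv" %in% defaultTerms)
```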

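# Predictors: one row per tweet, one column per term
x <- as_tibble(as.matrix(myDtm))
# Response: retweet count of the same 200 tweets
y <- retweetCount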

dim(x)

[1]  200 1763

length(y)

[1] 200
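
The objects x and y have the shape of a predictor matrix and a response, which is what variable importances and partial plots operate on. The section does not say which model will be used, so the following is only a sketch under an assumption: it fits a random forest with the randomForest package on simulated data (xToy and yToy are invented, not the tweets), then reports the permutation importance, which randomForest labels %IncMSE, and draws a partial plot for one predictor.

```r
library(randomForest)

set.seed(1)

# Simulated term counts (hypothetical stand-in for x)
xToy <- data.frame(apple  = rpois(100, 1),
                   iphone = rpois(100, 0.5),
                   news   = rpois(100, 0.3))
# Simulated response that depends only on the "iphone" count
yToy <- 3 * xToy$iphone + rnorm(100, sd = 0.5)

# importance = TRUE is needed to obtain the permutation-based measure
rf <- randomForest(x = xToy, y = yToy, ntree = 200, importance = TRUE)

# type = 1: increase in MSE when a predictor is permuted
imp <- importance(rf, type = 1)
print(imp)

# Partial plot: average predicted response as a function of one predictor
partialPlot(rf, pred.data = xToy, x.var = "iphone")
```

In this toy setting the predictor that actually drives the response should receive the largest importance, and its partial plot should rise with the count.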

Exercise

Extract the first 150 tweet texts about #PS5 and store them in tweet_text. Then store the texts and the retweet counts in text and re_count, respectively. Next, create a document-term matrix and store it in mydtm.

To download the tweets click: here²

To download the intermediate steps stored in myCorpus click: here³

Assume that:

The needed packages (tidyverse, rtweet…) have been loaded.
The tweets have been stored in tweets and have been loaded.
The preprocessing steps have been done and have been stored in myCorpus.