Before we start with the retweet analysis, note that both variable importances and partial plots will be explained in more detail in Lecture 6. Partial plots let you uncover the relationship between a predictor and the response. For the variable importances, we will use the mean decrease in mean squared error (MSE).
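To give a first intuition for both ideas ahead of Lecture 6, here is a minimal base R sketch on simulated data. It is an illustration under assumptions, not the procedure used later: the data frame toy, the linear model and the hold-out split are all made up for this example, and a random forest computes its importances on out-of-bag samples rather than on a separate test set.

```r
# Simulated data: y depends strongly on x1 and not at all on x2.
set.seed(1)
n <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 3 * toy$x1 + rnorm(n, sd = 0.5)

train <- toy[1:300, ]
test  <- toy[301:400, ]
fit   <- lm(y ~ x1 + x2, data = train)

# Mean squared error of a model on a data set.
mse <- function(model, data) {
  mean((data$y - predict(model, newdata = data))^2)
}
base_mse <- mse(fit, test)

# Permutation importance: shuffle one predictor, measure how much the MSE
# increases, and average that increase over 20 shuffles.
perm_importance <- sapply(c("x1", "x2"), function(v) {
  increases <- replicate(20, {
    shuffled <- test
    shuffled[[v]] <- sample(shuffled[[v]])
    mse(fit, shuffled) - base_mse
  })
  mean(increases)
})
print(perm_importance)

# Partial dependence of the prediction on x1: fix x1 at each grid value
# for every test row and average the predictions.
grid <- seq(-2, 2, by = 1)
pd <- sapply(grid, function(g) {
  fixed <- test
  fixed$x1 <- g
  mean(predict(fit, newdata = fixed))
})
print(data.frame(x1 = grid, avg_prediction = pd))
```

Shuffling x1 should hurt the MSE far more than shuffling x2, and the averaged prediction should rise with x1. Plotting grid against pd would give a simple partial plot.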
if (!require("pacman")) install.packages("pacman") ; require("pacman")
p_load(httr, rtweet, tidyverse, wordcloud2, tm, tidytext)
We have extracted several tweets that contain the hashtag “#apple”; they can be loaded as follows.
load('tweets.Rdata')
Let’s have a look at what tweets were obtained.
tweets_data(tweets)
user_id status_id created_at screen_name text source display_text_wi~ reply_to_status~
1 100145~ 14566168~ 2021-11-05 13:38:08 Vittorino1~ "Ogg~ Twitt~ 180 NA
2 100145~ 14566158~ 2021-11-05 13:34:10 Vittorino1~ "Ogg~ Twitt~ 226 NA
3 100145~ 14566155~ 2021-11-05 13:33:15 Vittorino1~ "Da ~ Twitt~ 179 NA
4 100145~ 14566166~ 2021-11-05 13:37:22 Vittorino1~ "Sec~ Twitt~ 259 NA
5 100145~ 14566153~ 2021-11-05 13:32:13 Vittorino1~ "Da ~ Twitt~ 169 NA
6 100145~ 14566161~ 2021-11-05 13:35:20 Vittorino1~ "Da ~ Twitt~ 186 NA
7 100145~ 14566149~ 2021-11-05 13:30:47 Vittorino1~ "Da ~ Twitt~ 152 NA
8 915373~ 14566168~ 2021-11-05 13:38:04 akhiljoseph "Xia~ Hoots~ 279 NA
9 994676~ 14566167~ 2021-11-05 13:38:01 DroidTrack~ "FrA~ Droid~ 140 NA
10 734619~ 14566165~ 2021-11-05 13:36:54 Technobugg~ "Xia~ Hoots~ 279 NA
Next, we extract the tweet texts. More tweets were scraped, but here we limit the analysis to the first 200.
text <- tweets[1:200,] %>% pull(text)
Now that the scraping is done, it is time to mine the obtained data.
First, store both the text and the retweet count. The iconv() call below drops every character that cannot be represented in ASCII.
text <- iconv(text,'latin1', 'ascii', sub = '')
retweetCount <- tweets[1:200,] %>% pull(retweet_count)
Second, load text mining packages.
p_load(SnowballC, slam, tm)
Third, transform the data into the tm package format.
myCorpus <- Corpus(VectorSource(text))
Fourth, transform all values to lower case.
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
Fifth, remove punctuation.
myCorpus <- tm_map(myCorpus, removePunctuation)
Sixth, remove numbers.
myCorpus <- tm_map(myCorpus, removeNumbers)
Seventh, remove stopwords.
myStopwords <- c(stopwords('english'))
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
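Steps four through seven can be mimicked in base R on a single made-up sentence, which makes it easier to see what each transformation does. This is only a rough sketch: the helper clean_text and the three-word stopword list are hypothetical, and the tm functions used above handle more cases than these regular expressions do.

```r
# Hypothetical helper that imitates the four cleaning steps with base R only.
clean_text <- function(x, stopwords) {
  x <- tolower(x)                                    # lower case
  x <- gsub("[[:punct:]]+", "", x)                   # remove punctuation
  x <- gsub("[[:digit:]]+", "", x)                   # remove numbers
  pattern <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)             # remove stopwords
  # Extra tidy-up that is not one of the steps above: collapse the
  # whitespace left behind by the removals.
  gsub("\\s+", " ", trimws(x))
}

toy_stopwords <- c("is", "the", "a")  # tiny illustrative list, not tm's list
cleaned <- clean_text("Apple's NEW iPhone 13 is the best!", toy_stopwords)
print(cleaned)
```

On this input the result is "apples new iphone best". Note that removing punctuation turns "apple's" into "apples", which then counts as a different term from "apple".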
Eighth, create a document term matrix.
(myDtm <- DocumentTermMatrix(myCorpus,
control = list(wordLengths = c(2, Inf))))
Finally, inspect the document term matrix.
inspect(myDtm)
Terms
Docs apple de ios ipad iphone macbook news newsfusionapps pro via
144 1 5 0 0 0 0 0 0 0 0
15 1 2 0 0 3 0 0 0 0 0
152 1 2 0 0 0 0 0 0 0 0
154 1 0 0 0 0 0 0 0 0 0
178 3 4 0 0 0 0 0 0 0 0
194 3 0 0 0 0 0 0 0 0 0
195 1 0 0 0 0 0 0 0 0 0
2 2 0 0 0 0 0 0 0 0 0
27 1 2 0 4 0 0 0 0 0 0
4 1 0 0 0 2 0 0 0 0 0
range(as_tibble(as.matrix(myDtm)))
[1] 0 5
x <- as_tibble(as.matrix(myDtm))
y <- retweetCount
dim(x)
[1] 200 1763
length(y)
[1] 200
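The dimensions above mean that x has one row per tweet and one column per distinct term, with each cell holding how often that term occurs in that tweet. The following base R sketch builds such a matrix by hand for three made-up documents. The names docs and toy_dtm are hypothetical, and the code only imitates the idea behind DocumentTermMatrix, including the wordLengths = c(2, Inf) filter that keeps terms of at least two characters.

```r
# Three toy "tweets" that are assumed to be cleaned already.
docs <- c("apple iphone apple", "ipad pro", "apple a ipad")

# Split each document into terms and keep terms of two or more characters.
tokens <- strsplit(docs, "\\s+")
tokens <- lapply(tokens, function(tk) tk[nchar(tk) >= 2])

# The vocabulary becomes the columns of the matrix.
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary term per document: one row per document.
counts <- vapply(
  tokens,
  function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
  integer(length(vocab))
)
toy_dtm <- t(counts)
dimnames(toy_dtm) <- list(Docs = as.character(seq_along(docs)), Terms = vocab)

print(toy_dtm)
print(dim(toy_dtm))    # documents x terms
print(range(toy_dtm))  # smallest and largest count
```

The one-letter term "a" is dropped by the length filter, so the third document only contributes counts for "apple" and "ipad".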
Extract the first 150 tweet texts about #PS5 and store them in tweet_text. Then, store both the texts and the retweet counts in text and re_count. Next, create a document term matrix and store it in mydtm.
To download the tweets, click: here
To download the intermediate steps stored in myCorpus, click: here
Assume that tweets and myCorpus have been loaded.