LDA

In this exercise, we’ll be exploring topic modeling using Latent Dirichlet Allocation (LDA). We’ll be working with a dataset of tweets from Joe Biden’s timeline.

Setting Up the Environment

First, let’s load the necessary libraries.

if (!require("pacman")) install.packages("pacman"); require("pacman")
p_load(rtweet, httr, tidyverse, reshape2)

Loading the Data

In this example, tweets from the timeline of Joe Biden have already been scraped and saved to an .Rdata file, which we load here.

load("joe_biden.Rdata")

Next, we’ll extract the tweet text from the loaded user object.

tweets <- user %>% select(text)

Preparing for Topic Modeling

We’ll also need to load the required packages for topic modeling.

p_load(wordcloud, tm, topicmodels, topicdoc, tidytext, textclean)

The input to LDA is a document-term matrix (DTM) with term-frequency (tf) weighting. Instead of building the DTM directly with the tm package, we’ll take a tidytext approach.

Cleaning the Tweets

First, we’ll clean the tweets.

text_clean <- tweets$text %>%  
  str_to_lower() %>%                      # to lower case 
  str_replace_all("[[:punct:]]", "") %>%  # remove punctuation 
  str_replace_all("[[:digit:]]", "") %>%  # remove numbers 
  str_squish()                            # collapse extra whitespace 
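As a quick sanity check, the same cleaning steps can be applied to a made-up tweet using base R equivalents of the stringr calls above (the sample text is invented for illustration):

```r
# Base R equivalents of the cleaning pipeline, applied to a made-up tweet.
sample_tweet <- "Build Back Better!! 100 days: https://t.co/abc"
cleaned <- tolower(sample_tweet)              # to lower case
cleaned <- gsub("[[:punct:]]", "", cleaned)   # remove punctuation
cleaned <- gsub("[[:digit:]]", "", cleaned)   # remove numbers
cleaned <- gsub("\\s+", " ", trimws(cleaned)) # collapse extra whitespace
cleaned
# [1] "build back better days httpstcoabc"
```

Note that removing punctuation also strips URL separators, which is why the link collapses into a single token; in practice you may want to remove URLs before this step.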

Creating a Document by Term Matrix

We start by putting the cleaned text into a tibble, with one row (document) per tweet.

text_df <- tibble(doc = 1:length(text_clean), text = text_clean)

Next, we’ll make a word frequency table.

freq <- text_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(doc,word, name = "freq", sort = TRUE)
Joining, by = "word"

Then, we’ll cast the DTM from this word-count table. Note that cast_dtm() returns a tm DocumentTermMatrix object.

dtm <- freq %>%
  cast_dtm(doc, word, freq)
dtm
<<DocumentTermMatrix (documents: 600, terms: 2296)>>
Non-/sparse entries: 7729/1369871
Sparsity           : 99%
Maximal term length: 21
Weighting          : term frequency (tf)

Determining the Optimal Number of Topics

An important modeling decision is the number of topics K. To choose it, we’ll fit models for a range of k values and use the AIC to select the best one (lower is better).

We’ll set a seed for the LDA algorithm so that the results are reproducible and comparable. By default, LDA() fits the model with the variational EM (VEM) algorithm; you can switch to Gibbs sampling via the method argument.

ldas <- list()
j <- 0
for (i in 2:10) {
  j <- j+1
  print(i)
  ldas[[j]] <- LDA(x = dtm, k = i, control = list(seed = 1234))
}
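For comparison, a single Gibbs-sampled model could be fit as sketched below (not run here; it assumes the dtm built above and a fixed k of 4 purely for illustration):

```r
# Sketch: fit one LDA model with Gibbs sampling instead of the default VEM.
lda_gibbs <- LDA(x = dtm, k = 4, method = "Gibbs",
                 control = list(seed = 1234))
```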

In the next code block, we’ll test the performance using the AIC.

(AICs <- data.frame(k = 2:10, aic = map_dbl(ldas, AIC)))
    k      aic
1   2  121569.0
2   3  126151.9
3   4  120927.5
4   5  123711.5
5   6  127166.0
6   7  130319.9
7   8  133704.0
8   9  137225.1
9  10  140595.9
(K <- AICs$k[which.min(AICs$aic)])
[1] 4
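The selection step in isolation, with made-up AIC values to show the mechanics of which.min():

```r
# Toy example of picking K by minimum AIC (values invented for illustration).
AICs_demo <- data.frame(k = 2:5, aic = c(1200, 1150, 1180, 1210))
K_demo <- AICs_demo$k[which.min(AICs_demo$aic)]
K_demo
# [1] 3
```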

Creating the LDA Model

Finally, we’ll create the LDA model.

topicmodel <- LDA(x = dtm, k = K, control = list(seed = 1234))
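Once fitted, the topics can be inspected; for example, the terms() and topics() accessors from the topicmodels package report the top terms per topic and the most likely topic per document (this assumes the topicmodel object created above):

```r
# Inspect the fitted model (assumes `topicmodel` from above).
terms(topicmodel, 10)    # top 10 terms for each topic
topics(topicmodel)[1:5]  # most likely topic for the first 5 documents
```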

Exercise

In this exercise, the 50 latest tweets of Amanda Gorman (@TheAmandaGorman) are inspected. Find the optimal value of K in the range 2 to 4 and store it in K. Store the AIC values in AICs. Next, create the final LDA model and store it in topicmodel. In the LDA function, use a seed value equal to 1234.

To download the document by term matrix, click: here


Assume that: