Topics and Documents

In this exercise, we’ll explore the relationship between topics and documents, which is captured by the per-document-per-topic matrix. We extract this matrix with the tidy() function, setting the matrix argument to "gamma". Note that this gamma matrix corresponds to the alpha matrix discussed in the lecture slides.

library(tidytext)  # provides the tidy() method for topic models
(doc_topic <- tidy(topicmodel, matrix = "gamma"))
# A tibble: 2,400 x 3
   document topic     gamma
   <chr>    <int>     <dbl>
 1 10           1   0.00161
 2 15           1   0.00245
 3 65           1   0.318  
 4 105          1   0.00385
 5 119          1   0.990  
 6 147          1   0.994  
 7 171          1   0.00331
 8 210          1   0.480  
 9 229          1   0.00290
10 256          1   0.00759
# ... with 2,390 more rows

Each row of the output pairs a document with a topic and gives gamma, the estimated proportion of that document’s words generated by that topic. For instance, an estimated 31.8 percent of the words in document 65 were generated by topic 1.
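Because gamma is a per-document probability distribution over topics, each document’s gamma values sum to one. A minimal base-R illustration with a toy gamma matrix (the values below are invented for illustration, not taken from the model above):

```r
# Toy per-document-per-topic matrix: 3 documents x 4 topics.
# Values are made up; each row is a probability distribution over topics.
gamma <- rbind(
  doc_65  = c(0.318, 0.050, 0.600, 0.032),
  doc_119 = c(0.990, 0.004, 0.003, 0.003),
  doc_147 = c(0.994, 0.002, 0.002, 0.002)
)

# Each document's topic proportions sum to 1
rowSums(gamma)

# For doc_65, topic 1 accounts for an estimated 31.8% of the words
gamma["doc_65", 1]
```

The tidy() output above is just this matrix reshaped into long format, one document-topic pair per row.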

Identifying Top Topics for Each Document

To identify the top topic for each document, we can use the following code:

library(dplyr)  # arrange(), group_by(), top_n()

(topics_gamma <- doc_topic %>% arrange(desc(gamma)))

(user_topic <- topics_gamma %>%
  group_by(document) %>%
  top_n(1, gamma))
# A tibble: 2,400 x 3
   document topic   gamma
   <chr>    <int>   <dbl>
 1 10           3   0.995
 2 415          3   0.995
 3 492          3   0.995
 4 362          2   0.995
 5 521          3   0.995
 6 503          3   0.995
 7 465          3   0.994
 8 43           4   0.994
 9 594          4   0.994
10 291          3   0.994
# ... with 2,390 more rows
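One caveat: top_n() keeps ties, so a document whose two highest gamma values are exactly equal would appear twice. The same per-group maximum can be computed in base R with which.max(), which always keeps exactly one row per document; a toy sketch (the document ids and gamma values below are invented):

```r
# Toy long-format doc_topic table: one row per document-topic pair
doc_topic_toy <- data.frame(
  document = c("10", "10", "65", "65"),
  topic    = c(1L, 3L, 1L, 3L),
  gamma    = c(0.005, 0.995, 0.318, 0.682)
)

# For each document, keep the row with the largest gamma.
# which.max() returns the first maximum, so ties never duplicate a document.
top_rows <- do.call(rbind, lapply(
  split(doc_topic_toy, doc_topic_toy$document),
  function(d) d[which.max(d$gamma), ]
))
top_rows
```

In dplyr, slice_max(gamma, n = 1, with_ties = FALSE) gives the same one-row-per-document guarantee.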

Adding Original Tweets for Context

By attaching the original tweets, we can get a better sense of what each topic is about.

user_topic_tweet <- user_topic %>%
  add_column(Tweets = tweets$text[as.numeric(user_topic$document)])  # add_column() is from the tibble package
user_topic_tweet %>% slice_head()  # one row per group, since the data are grouped by document
# A tibble: 600 x 4
# Groups:   document [600]
   document topic   gamma     Tweets                                                  
   <chr>    <int>   <dbl>     <chr>                                                   
 1 1            4   0.652     "An important part of our economic plan is promoting fa~
 2 10           3   0.995     "It’s time for Congress to pass the Junk Fee Prevention~
 3 100          4   0.993     "Earlier, I gathered my team for a briefing on the extr~
 4 101          2   0.966     "America welcomes you, Mr. President. https://t.co/emqW~
 5 102          2   0.861     "For military families who suffered the ultimate loss, ~
 6 103          2   0.990     "Yesterday, we lit the first-ever permanent White House~
 7 104          1   0.509     "Join Jill and me as we host a Hanukkah Reception at th~
 8 105          4   0.988     "Thanks to American ingenuity, American engineers, and ~
 9 106          1   0.987     "Representatives of the people took a vote on the floor~
10 107          3   0.987     "Thanks to the Inflation Reduction Act, out-of-pocket e~
# ... with 590 more rows
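Indexing tweets$text with as.numeric(document) works because the document ids here are the row positions of the original tweets. A base-R toy sketch of that lookup (the tweet texts are invented):

```r
# Toy tweet texts; position in the vector plays the role of the document id
tweets_text <- c("first tweet", "second tweet", "third tweet")

# Document ids come out of the model as character strings
doc_ids <- c("3", "1")

# Convert to numeric and index: returns the tweets in doc_ids order
tweets_text[as.numeric(doc_ids)]

# If the ids were arbitrary keys rather than positions, a named vector
# plus match() would be the safer lookup:
# tweets_text[match(doc_ids, names(tweets_text))]
```

This positional assumption is why the document column must line up with the original row order of tweets.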

Exercise

Now, look at the per-document-per-topic matrix and store it in document_topic. Then find the top topic for each document: store the intermediate sorted result in topic_gamma and the final top topics in top_topic. Finally, retrieve the original tweets to see what the topics are about, store the result in top_topic_tweets, and display only the head of this output.

To download the topicmodel, click here1.

To download the tweets, click here2.


Assume that: