In this exercise, we’ll explore the relationship between topics and documents.
This relationship can be understood by examining the per-document-per-topic matrix.
We’ll extract this matrix using the tidy function with the matrix option set to gamma.
Note that this gamma matrix corresponds to the alpha matrix discussed in the lecture slides.
(doc_topic <- tidy(topicmodel, matrix = 'gamma'))
# A tibble: 2,400 x 3
document topic gamma
<chr> <int> <dbl>
1 10 1 0.00161
2 15 1 0.00245
3 65 1 0.318
4 105 1 0.00385
5 119 1 0.990
6 147 1 0.994
7 171 1 0.00331
8 210 1 0.480
9 229 1 0.00290
10 256 1 0.00759
# ... with 2,390 more rows
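Each row in the output represents one document-topic pair, and gamma is the estimated proportion of that document's words generated by that topic. For instance, an estimated 31.8 percent of the words in document 65 are generated by topic 1.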
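Because gamma is a per-document proportion, the gamma values of one document should add up to 1 across topics. The sketch below checks this on a small made-up tibble; toy_gamma and its values are hypothetical stand-ins for the real doc_topic, which is not needed to run it. The 2,400 rows of the real output are consistent with 600 documents times 4 topics.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for tidy(topicmodel, matrix = "gamma"):
# 2 documents x 3 topics, values invented for illustration.
toy_gamma <- tribble(
  ~document, ~topic, ~gamma,
  "1",       1L,     0.70,
  "1",       2L,     0.20,
  "1",       3L,     0.10,
  "2",       1L,     0.05,
  "2",       2L,     0.15,
  "2",       3L,     0.80
)

# Sum gamma within each document and count the topics per document.
gamma_sums <- toy_gamma %>%
  group_by(document) %>%
  summarise(total = sum(gamma), n_topics = n())

gamma_sums
```

On the real data, the same group_by() and summarise() call on doc_topic is a quick sanity check that the matrix was extracted as expected.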
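To identify the top topic for each document, we can use the following code. The tibble printed below it is the sorted topics_gamma; user_topic then keeps only the row with the highest gamma for each document: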
(topics_gamma <- doc_topic %>% arrange(desc(gamma)))
user_topic <- topics_gamma %>%
group_by(document) %>%
top_n(1, gamma)
# A tibble: 2,400 x 3
document topic gamma
<chr> <int> <dbl>
1 10 3 0.995
2 415 3 0.995
3 492 3 0.995
4 362 2 0.995
5 521 3 0.995
6 503 3 0.995
7 465 3 0.994
8 43 4 0.994
9 594 4 0.994
10 291 3 0.994
# ... with 2,390 more rows
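As a possible alternative to top_n(), which the dplyr documentation marks as superseded, the same per-document selection can be sketched with slice_max(). The data here are made up (toy_gamma and top_topic_toy are hypothetical names), and with_ties = FALSE is an assumption that exactly one topic per document is wanted even when two topics share the highest gamma.

```r
library(dplyr)
library(tibble)

# Hypothetical gamma values: 2 documents x 3 topics.
toy_gamma <- tribble(
  ~document, ~topic, ~gamma,
  "1",       1L,     0.70,
  "1",       2L,     0.20,
  "1",       3L,     0.10,
  "2",       1L,     0.05,
  "2",       2L,     0.15,
  "2",       3L,     0.80
)

# Keep the single highest-gamma row per document, then drop the grouping
# so later verbs operate on the whole table again.
top_topic_toy <- toy_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()

top_topic_toy
```

Unlike top_n(1, gamma), this variant never returns more than one row per document; whether that is desirable depends on how ties should be treated.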
By adding the original tweets, we can gain a better understanding of what the topics are about.
user_topic_tweet <- user_topic %>%
add_column(Tweets = tweets$text[as.numeric(user_topic$document)])
user_topic_tweet %>% slice_head()
# A tibble: 600 x 4
# Groups: document [600]
document topic gamma Tweets
<chr> <int> <dbl> <chr>
1 1 4 0.652 "An important part of our economic plan is promoting fa~
2 10 3 0.995 "It’s time for Congress to pass the Junk Fee Prevention~
3 100 4 0.993 "Earlier, I gathered my team for a briefing on the extr~
4 101 2 0.966 "America welcomes you, Mr. President. https://t.co/emqW~
5 102 2 0.861 "For military families who suffered the ultimate loss, ~
6 103 2 0.990 "Yesterday, we lit the first-ever permanent White House~
7 104 1 0.509 "Join Jill and me as we host a Hanukkah Reception at th~
8 105 4 0.988 "Thanks to American ingenuity, American engineers, and ~
9 106 1 0.987 "Representatives of the people took a vote on the floor~
10 107 3 0.987 "Thanks to the Inflation Reduction Act, out-of-pocket e~
# ... with 590 more rows
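Note that slice_head() works per group on a grouped tibble, which is why the output above still has 600 rows (one per document). A minimal sketch of getting only the first few rows is to call ungroup() first. The names toy_tweets, toy_top and head_rows are hypothetical, and the lookup assumes, as the code above does, that each document ID is the row index of the corresponding tweet.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the tweets data frame.
toy_tweets <- tibble(text = c("first tweet", "second tweet", "third tweet"))

# Hypothetical top topic per document, still grouped by document.
toy_top <- tibble(
  document = c("3", "1", "2"),
  topic    = c(2L, 1L, 3L),
  gamma    = c(0.99, 0.95, 0.60)
) %>%
  group_by(document)

# Drop the grouping, then look up each tweet by its numeric document ID.
toy_top_tweets <- toy_top %>%
  ungroup() %>%
  mutate(Tweets = toy_tweets$text[as.numeric(document)])

# With no groups left, slice_head() returns the first rows of the whole table.
head_rows <- toy_top_tweets %>% slice_head(n = 2)

head_rows
```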
Now, look at the per-document-per-topic matrix and store it in document_topic.
After that, get the top topic for each document.
Store the intermediate result in topic_gamma and the final top topics in top_topic.
Finally, get the original tweets to see what the topics are about.
Store this in top_topic_tweets. Get only the head of this output.
To download the topicmodel click: here1
To download the tweets click: here2
Assume that:
topicmodel from the previous exercise has been loaded.
tweets have been loaded.