GloVe, unlike Word2Vec, requires additional preprocessing as it’s built on a co-occurrence matrix. We’ll be using the text2vec package to create this matrix.
First, let’s load the necessary libraries.
if (!require("pacman")) install.packages("pacman") ; require("pacman")
p_load(text2vec,Rtsne,scales,ggrepel,tidyverse,tm)
We’ll be using a dataset of product reviews to illustrate the topic.
reviews <- read_delim(
"reviews.csv",
col_names = FALSE,
delim = "\n"
)
Rows: 86 Columns: 1
Next, we’ll tokenize the reviews, breaking them down into individual words.
tokens <- space_tokenizer(
# reviews[[1]] extracts the column as a character vector; reviews[, 1] would stay a tibble
reviews[[1]] %>%
tolower() %>%
removePunctuation() %>%
removeWords(words = stopwords()) %>%
stripWhitespace()
)
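To see what each cleaning step contributes, here is a minimal sketch on two made-up reviews (the `toy_reviews` strings are hypothetical and not taken from reviews.csv). It assumes the tm and text2vec packages are installed, and it adds a `trimws()` call, which is not part of the pipeline above, so that a leading space left behind by a removed stopword does not turn into an empty token.

```r
library(tm)        # removePunctuation, removeWords, stopwords, stripWhitespace
library(text2vec)  # space_tokenizer

# Two made-up reviews, for illustration only
toy_reviews <- c("I LOVE my new iPad!", "The charger broke after a week.")

lowered  <- tolower(toy_reviews)                 # normalise case
no_punct <- removePunctuation(lowered)           # drop "!", "." and similar
no_stop  <- removeWords(no_punct, stopwords())   # drop English stopwords
squeezed <- stripWhitespace(no_stop)             # collapse runs of spaces
trimmed  <- trimws(squeezed)                     # assumption: trim leading/trailing spaces

# One character vector of tokens per review
toy_tokens <- space_tokenizer(trimmed)
print(toy_tokens)
```

Printing the intermediate objects (`lowered`, `no_punct`, and so on) is a quick way to check that the cleaning does what we expect before building the vocabulary.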
We’ll then build our vocabulary. The terms will be unigrams.
it <- itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)
We’ll remove sparse words (those that appear fewer than 3 times). This is done using term_count_min, the minimum number of times a word must occur in the entire corpus to be included in the model.
vocab <- prune_vocabulary(vocab, term_count_min = 3L)
We’ll use the pruned vocabulary for vectorization.
vectorizer <- vocab_vectorizer(vocab)
We’ll create the co-occurrence matrix next, using a skip-gram window (skip_grams_window) of 5 context words. skip_grams_window is the window size within which word-word co-occurrences are counted.
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)
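To get a feel for what the co-occurrence matrix contains, here is a small sketch on a hand-made corpus of two tokenized "reviews" (the `toy_*` names and the tokens are made up for illustration). It assumes text2vec and Matrix are installed. As far as we understand the package, co-occurrences are weighted by the inverse of the distance between the two words by default, so the entries are not necessarily whole counts.

```r
library(text2vec)
library(Matrix)  # provides as.matrix() for the sparse result

# Two made-up, already tokenized documents
toy_tokens <- list(
  c("ipad", "charger", "broke"),
  c("ipad", "screen", "broke")
)

toy_it    <- itoken(toy_tokens, progressbar = FALSE)
toy_vocab <- create_vocabulary(toy_it)

# A window of 2 means each word is paired with up to 2 neighbours on each side
toy_tcm <- create_tcm(toy_it, vocab_vectorizer(toy_vocab), skip_grams_window = 2L)

# Rows and columns are vocabulary terms; cells hold weighted co-occurrence counts
print(as.matrix(toy_tcm))
```

Changing `skip_grams_window` on a toy corpus like this makes it easy to see how a wider window adds more (and more distant) word pairs.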
We’ll set up the GloVe model next. Fitting it can take several minutes on larger corpora. There are several important parameters:
x_max
: The maximum number of co-occurrences to use in the weighting function (default: 100)
init
: The initial word embeddings and biases (default: random)
rank
: Word vector size
glove <- GloVe$new(
rank = 86,
x_max = 5
)
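To make the role of x_max concrete, here is a sketch of the weighting function from the GloVe paper, f(x) = (x / x_max)^alpha for x < x_max and 1 otherwise. The function name `glove_weight` is our own, and alpha = 0.75 is the value used in the paper; this illustrates the formula and is not text2vec's internal code.

```r
# Weight given to a word pair that co-occurs x times.
# Pairs below x_max are down-weighted; pairs at or above x_max get full weight.
glove_weight <- function(x, x_max = 100, alpha = 0.75) {
  ifelse(x < x_max, (x / x_max)^alpha, 1)
}

# With x_max = 5, as in the model above, any pair seen 5 or more times
# already counts with full weight
counts <- c(1, 2, 5, 50)
print(data.frame(count = counts, weight = glove_weight(counts, x_max = 5)))
```

A small x_max suits a small corpus like ours, where even the most frequent pairs co-occur only a handful of times.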
Once the model is initialized, we’ll fit it using the AdaGrad optimizer. The most important parameter is:
n_iter
: The number of training iterations to run (10-20 for small corpora). This can be performed in parallel (n_threads).
word_vectors_main <- glove$fit_transform(
tcm,
n_iter = 20
)
INFO [14:03:03.158] epoch 1, loss 0.2426
INFO [14:03:03.210] epoch 2, loss 0.0833
INFO [14:03:03.238] epoch 3, loss 0.0487
INFO [14:03:03.250] epoch 4, loss 0.0326
INFO [14:03:03.266] epoch 5, loss 0.0232
INFO [14:03:03.286] epoch 6, loss 0.0172
INFO [14:03:03.298] epoch 7, loss 0.0131
INFO [14:03:03.314] epoch 8, loss 0.0101
INFO [14:03:03.326] epoch 9, loss 0.0080
INFO [14:03:03.342] epoch 10, loss 0.0063
INFO [14:03:03.358] epoch 11, loss 0.0051
INFO [14:03:03.374] epoch 12, loss 0.0041
INFO [14:03:03.390] epoch 13, loss 0.0034
INFO [14:03:03.406] epoch 14, loss 0.0028
INFO [14:03:03.418] epoch 15, loss 0.0023
INFO [14:03:03.434] epoch 16, loss 0.0019
INFO [14:03:03.450] epoch 17, loss 0.0016
INFO [14:03:03.466] epoch 18, loss 0.0013
INFO [14:03:03.482] epoch 19, loss 0.0011
INFO [14:03:03.494] epoch 20, loss 0.0009
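We’ll obtain the word vectors next. Remember, GloVe learns two sets of embeddings: the main word vectors (returned by fit_transform) and the context word vectors (stored in glove$components).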
word_vectors_components <- glove$components
Typically, either the context or the main word vectors should work. However, the authors suggest using the sum/mean of both.
word_vectors <- word_vectors_main + t(word_vectors_components)
Let’s look at the words related to ‘ipad’.
ipad <- word_vectors["ipad", ,drop = FALSE]
We’ll calculate the cosine similarity between ‘ipad’ and every word in the vocabulary. Cosine similarity measures the angle between two vectors: it is bounded between -1 and 1, with higher values indicating more similar words.
cos_sim <- sim2(x = word_vectors, y = ipad,
method = "cosine", norm = "l2")
head(sort(cos_sim[,1], decreasing = TRUE), 10)
ipad iphone laptop apple charger games
1.0000000 0.5509735 0.5506747 0.5090965 0.5051790 0.4783950
computer dvds roku tirelessly
0.4704620 0.4696555 0.4532639 0.4466548
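For intuition, cosine similarity can also be computed by hand as the dot product of two vectors divided by the product of their lengths. The sketch below uses base R only; the helper name `cosine_similarity` and the example vectors are made up for illustration.

```r
# Cosine similarity between two numeric vectors of equal length
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

a <- c(1, 2, 3)

same_direction <- cosine_similarity(a, 2 * a)           # close to 1: length does not matter
opposite       <- cosine_similarity(a, -a)              # close to -1
orthogonal     <- cosine_similarity(c(1, 0), c(0, 1))   # 0: nothing in common

print(c(same = same_direction, opposite = opposite, orthogonal = orthogonal))
```

Because the measure ignores vector length, a frequent word with a large embedding norm is not automatically "closer" to everything else.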
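Let’s look at an analogy. The vector apple - ipad + samsung computed below answers: ‘ipad’ is to ‘apple’ as ‘samsung’ is to…? (For ‘apple’ is to ‘ipad’ as ‘samsung’ is to…?, swap the signs of the first two terms.) However, the analogy doesn’t work here: the closest words are the query words themselves, followed by unrelated terms. With only 86 reviews, we need more data.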
test <- word_vectors["apple", , drop = FALSE] -
word_vectors["ipad", , drop = FALSE] +
word_vectors["samsung", , drop = FALSE]
cos_sim_test <- sim2(x = word_vectors, y = test,
method = "cosine", norm = "l2")
head(sort(cos_sim_test[,1], decreasing = TRUE), 10)
samsung apple tratando tree
0.7242738 0.5202035 0.4438243 0.4005868
flatscreen vibrate dormhotel fullyfunctional
0.3909859 0.3830234 0.3819627 0.3811861
lemon housea
0.3807331 0.3790463
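Notice that the query words themselves (‘samsung’, ‘apple’) top the list. A common remedy is to drop the query words before ranking. Below is a minimal sketch of that idea on made-up 2-dimensional embeddings; the helper `nearest_words` and the `toy` matrix are hypothetical, and the example uses the usual convention that ‘a’ is to ‘b’ as ‘c’ is to…? is answered by the word nearest to b - a + c.

```r
# Rank words by cosine similarity to a query vector, skipping excluded words
nearest_words <- function(embeddings, query, exclude = character(0), n = 5) {
  query <- as.vector(query)  # also accepts a 1-row matrix
  row_norms <- sqrt(rowSums(embeddings^2))
  sims <- as.vector(embeddings %*% query) / (row_norms * sqrt(sum(query^2)))
  names(sims) <- rownames(embeddings)
  sims <- sims[!names(sims) %in% exclude]
  head(sort(sims, decreasing = TRUE), n)
}

# Made-up embeddings: dimension 1 separates the brands, dimension 2 marks devices
toy <- rbind(
  apple   = c( 1.0,  0),
  ipad    = c( 1.0,  1),
  samsung = c(-1.0,  0),
  galaxy  = c(-1.0,  1),
  lemon   = c( 0.2, -1)
)

# 'apple' is to 'ipad' as 'samsung' is to ...?
query <- toy["ipad", ] - toy["apple", ] + toy["samsung", ]
res <- nearest_words(toy, query, exclude = c("apple", "ipad", "samsung"), n = 2)
print(res)
```

On real embeddings the same helper could be called with `word_vectors` and the `test` vector from above, although with a corpus this small the remaining neighbours are unlikely to be meaningful.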