GloVe

Unlike Word2Vec, GloVe is trained on a word-word co-occurrence matrix, so it needs an extra preprocessing step to build that matrix. We’ll use the text2vec package both to create the matrix and to fit the model.

Setting Up the Environment

First, let’s load the necessary libraries.

if (!require("pacman")) install.packages("pacman") ; require("pacman")
p_load(text2vec, Rtsne, scales, ggrepel, tidyverse, tm)

Loading the Data

We’ll use a small dataset of product reviews, stored with one review per line.

reviews <- read_delim(
  "reviews.csv",
  col_names = FALSE,
  delim = "\n"
)

Rows: 86 Columns: 1

Tokenization

Next, we’ll tokenize the reviews: lowercase the text, remove punctuation and stop words, collapse extra whitespace, and split each review into individual words.

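tokens <- space_tokenizer(
  reviews[[1]] %>%                          # text column as a character vector, one element per review
    tolower() %>%                           # lowercase everything
    removePunctuation() %>%                 # strip punctuation
    removeWords(words = stopwords()) %>%    # drop English stop words
    stripWhitespace()                       # collapse runs of whitespace
)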

Building the Vocabulary

We’ll then build our vocabulary. The terms will be unigrams.

it <- itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)

Pruning the Vocabulary

We’ll remove rare words, those that appear fewer than three times. This is controlled by term_count_min, the minimum number of times a word must occur in the entire corpus to be kept in the vocabulary.

vocab <- prune_vocabulary(vocab, term_count_min = 3L)
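
To make the effect of term_count_min concrete, here is a minimal sketch on a made-up corpus of three tokenized reviews (the toy_* names and the data are hypothetical, not part of our dataset). It uses a threshold of 2 rather than 3 so that the tiny example still keeps some terms.

```r
library(text2vec)

# Toy corpus (hypothetical): three already-tokenized "reviews"
toy_tokens <- list(
  c("great", "ipad", "great", "screen"),
  c("ipad", "battery", "great"),
  c("slow", "charger")
)

toy_it    <- itoken(toy_tokens, progressbar = FALSE)
toy_vocab <- create_vocabulary(toy_it)

# Keep only terms that occur at least 2 times across the whole corpus
toy_pruned <- prune_vocabulary(toy_vocab, term_count_min = 2L)

# Corpus-wide counts before and after pruning
data.frame(term = toy_vocab$term,  count = toy_vocab$term_count)
data.frame(term = toy_pruned$term, count = toy_pruned$term_count)
```

Only ‘great’ and ‘ipad’ occur at least twice in this toy corpus, so they are the only terms that survive pruning.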

Vectorization

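We’ll use the pruned vocabulary for vectorization. The vectorizer maps each token to its index in the vocabulary, so words that were pruned are ignored.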

vectorizer <- vocab_vectorizer(vocab)

Creating the Co-occurrence Matrix

Next, we’ll create the term co-occurrence matrix (TCM). The skip_grams_window argument sets the window size used to search for word-word co-occurrences: two words co-occur if they appear within that many positions of each other. We use a window of 5.

tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)
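
The idea behind a windowed co-occurrence count can be sketched in a few lines of base R. This is an illustration on a made-up sentence, not text2vec’s implementation; it assumes inverse-distance weighting (closer words count more), which to our understanding matches the create_tcm default, and it stores both halves of the symmetric matrix for readability.

```r
# Toy token sequence (hypothetical) and a small window
tokens_toy <- c("the", "ipad", "screen", "is", "great")
window <- 2L

terms <- unique(tokens_toy)
tcm_toy <- matrix(0, length(terms), length(terms),
                  dimnames = list(terms, terms))

n <- length(tokens_toy)
for (i in seq_len(n - 1L)) {
  # Look at up to `window` words to the right of position i
  for (j in (i + 1L):min(i + window, n)) {
    w <- 1 / (j - i)                      # assumed weighting: 1 / distance
    a <- tokens_toy[i]
    b <- tokens_toy[j]
    tcm_toy[a, b] <- tcm_toy[a, b] + w
    tcm_toy[b, a] <- tcm_toy[b, a] + w    # keep the matrix symmetric
  }
}

tcm_toy["ipad", "screen"]  # adjacent words
tcm_toy["ipad", "is"]      # two positions apart
tcm_toy["the", "is"]       # three positions apart, outside the window
```

Adjacent words get a weight of 1, words two positions apart get 0.5, and pairs farther apart than the window contribute nothing.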

Fitting the GloVe Model

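We’ll fit the GloVe model next. On a large corpus this can take several minutes; on a corpus as small as ours it takes well under a second. There are several important parameters:

x_max : The maximum number of co-occurrences to use in the weighting function; pairs that co-occur more often than this all receive the same weight (the original GloVe paper uses 100)
init : Optional initial values for the word embeddings and biases (initialized randomly if not supplied)
rank : The word vector size, i.e. the number of embedding dimensions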

glove <- GloVe$new(
  rank = 86,
  x_max = 5
)
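
For reference, x_max enters the model through the weighting function in the objective of the original GloVe paper, reproduced below. The notation follows the paper rather than text2vec’s code: X_ij is the co-occurrence count of words i and j, w_i and w̃_j are the main and context vectors, b_i and b̃_j are their biases, and α is an exponent the paper sets to 0.75.

```latex
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
  1 & \text{otherwise}
\end{cases}
\qquad
J = \sum_{i,j} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```

With x_max = 5, any pair whose co-occurrence count reaches 5 gets the full weight of 1, while rarer pairs are down-weighted. A low threshold like this suits a small corpus, where few pairs co-occur often.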

Once the model is initialized, we’ll fit the model using the AdaGrad optimizer. The most important parameter is:

n_iter : The number of training iterations (epochs) to run; 10 to 20 is usually enough for a small corpus. Training can run in parallel by setting n_threads.

word_vectors_main <- glove$fit_transform(
  tcm,
  n_iter = 20
)

INFO  [14:03:03.158] epoch 1, loss 0.2426 
INFO  [14:03:03.210] epoch 2, loss 0.0833 
INFO  [14:03:03.238] epoch 3, loss 0.0487 
INFO  [14:03:03.250] epoch 4, loss 0.0326 
INFO  [14:03:03.266] epoch 5, loss 0.0232 
INFO  [14:03:03.286] epoch 6, loss 0.0172 
INFO  [14:03:03.298] epoch 7, loss 0.0131 
INFO  [14:03:03.314] epoch 8, loss 0.0101 
INFO  [14:03:03.326] epoch 9, loss 0.0080 
INFO  [14:03:03.342] epoch 10, loss 0.0063 
INFO  [14:03:03.358] epoch 11, loss 0.0051 
INFO  [14:03:03.374] epoch 12, loss 0.0041 
INFO  [14:03:03.390] epoch 13, loss 0.0034 
INFO  [14:03:03.406] epoch 14, loss 0.0028 
INFO  [14:03:03.418] epoch 15, loss 0.0023 
INFO  [14:03:03.434] epoch 16, loss 0.0019 
INFO  [14:03:03.450] epoch 17, loss 0.0016 
INFO  [14:03:03.466] epoch 18, loss 0.0013 
INFO  [14:03:03.482] epoch 19, loss 0.0011 
INFO  [14:03:03.494] epoch 20, loss 0.0009

Obtaining Word Vectors

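We’ll obtain the word vectors next. GloVe learns two sets of embeddings: the main word vectors, which fit_transform() returned above, and the context word vectors, which are stored in the model’s components field.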

word_vectors_components <- glove$components

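Typically, either the context or the main word vectors work on their own. However, the GloVe authors suggest using the sum (or mean) of both. The components matrix has one column per word rather than one row per word, so we transpose it before adding.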

word_vectors <- word_vectors_main + t(word_vectors_components)
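
The shapes involved can be checked with a minimal base R sketch. The matrices below are random stand-ins with hypothetical names, not trained embeddings; they only mimic the layout assumed here (main vectors as words × rank, context vectors as rank × words).

```r
set.seed(1)
vocab_toy <- c("ipad", "iphone", "charger")

# Stand-ins for the two embedding matrices (random numbers, for shape only)
main_toy       <- matrix(rnorm(6), nrow = 3, dimnames = list(vocab_toy, NULL))  # words x rank
components_toy <- matrix(rnorm(6), nrow = 2, dimnames = list(NULL, vocab_toy))  # rank x words

dim(main_toy)        # 3 rows (words), 2 columns (rank)
dim(components_toy)  # 2 rows (rank), 3 columns (words)

# Transpose the context vectors so both matrices are words x rank, then combine
combined_sum  <- main_toy + t(components_toy)
combined_mean <- combined_sum / 2
```

The sum and the mean differ only by a constant factor, so cosine similarities computed from either are identical.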

Analyzing Word Similarities

Let’s look at the words related to ‘ipad’.

ipad <- word_vectors["ipad", , drop = FALSE]

We’ll calculate the cosine similarity between ‘ipad’ and every word in the vocabulary. Cosine similarity measures the angle between two vectors, ignoring their lengths. It is bounded between -1 and 1, with higher values indicating greater similarity.

cos_sim <- sim2(x = word_vectors, y = ipad, 
               method = "cosine", norm = "l2")
head(sort(cos_sim[,1], decreasing = TRUE), 10)

ipad        iphone     laptop      apple     charger    games 
1.0000000   0.5509735  0.5506747  0.5090965  0.5051790  0.4783950 

computer    dvds       roku       tirelessly 
0.4704620   0.4696555  0.4532639  0.4466548
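
To see what the similarity values mean, here is a minimal base R version of cosine similarity applied to made-up vectors. The cosine() helper is our own illustration, not a text2vec function.

```r
# Cosine similarity: dot product divided by the product of the vector lengths
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine(c(1, 2), c(2, 4))    # same direction: 1
cosine(c(1, 0), c(0, 1))    # orthogonal: 0
cosine(c(1, 2), c(-1, -2))  # opposite direction: -1
```

The first pair differs only in length, which is why the similarity is still 1. This also explains why ‘ipad’ scores exactly 1 against itself in the output above.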

Word Analogies

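Let’s try an analogy using vector arithmetic. An analogy ‘a is to b as c is to…?’ is answered by the word closest to b - a + c, so the query vector below, apple - ipad + samsung, asks: ‘ipad’ is to ‘apple’ as ‘samsung’ is to…? The analogy doesn’t work here: apart from the query words themselves, the nearest neighbors are unrelated terms. We need more data.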

test <- word_vectors["apple", , drop = FALSE] -
  word_vectors["ipad", , drop = FALSE] +
  word_vectors["samsung", , drop = FALSE]

cos_sim_test <- sim2(x = word_vectors, y = test, 
                     method = "cosine", norm = "l2")
head(sort(cos_sim_test[,1], decreasing = TRUE), 10)

samsung         apple        tratando      tree 
0.7242738       0.5202035    0.4438243     0.4005868 

flatscreen      vibrate      dormhotel     fullyfunctional 
0.3909859       0.3830234    0.3819627     0.3811861 
          
lemon          housea 
0.3807331      0.3790463
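
For contrast, here is a minimal base R sketch of an analogy that does work, using hand-built two-dimensional vectors. Everything in it is hypothetical: the toy matrix, the word ‘galaxy’, and the cosine() helper are made up for illustration. It also drops the query words before ranking, since they tend to dominate the top of the list.

```r
# Hypothetical embeddings: dimension 1 ~ brand, dimension 2 ~ "is a tablet"
toy <- rbind(
  apple   = c( 1.0,  0),
  ipad    = c( 1.0,  1),
  samsung = c(-1.0,  0),
  galaxy  = c(-1.0,  1),
  charger = c( 0.2, -1)
)

# "apple is to ipad as samsung is to ...?"  ->  ipad - apple + samsung
query <- toy["ipad", ] - toy["apple", ] + toy["samsung", ]

cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
sims <- apply(toy, 1, cosine, b = query)

# Exclude the three input words, then pick the most similar remaining word
candidates <- sims[!names(sims) %in% c("apple", "ipad", "samsung")]
names(which.max(candidates))
```

With real embeddings the geometry is never this clean, which is why analogies typically need vectors trained on far more text than 86 reviews.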