Introduction to Word Embeddings

In this exercise, we will explore the concept of word embeddings. We will use the word2vec package, which allows us to train word embeddings using multiple threads on character data or data in a text file. The embeddings are then used to find relations between words.

# install pacman if necessary, then use it to load all required packages
if (!require("pacman")) install.packages("pacman") ; require("pacman")
p_load(word2vec, text2vec, Rtsne, scales, ggrepel, tidyverse, tm)

Data Preparation

We will use product reviews data to illustrate the topic of word embeddings. The data is loaded into the environment using the following code:

# the file contains one review per line: read it as a single column without a header
reviews <- read_delim("productreviews.csv", 
                    col_names=FALSE, 
                    delim="\n")
Rows: 86 Columns: 1

Preprocessing

The word2vec model does not require extensive preprocessing. A simple normalization to lower case often suffices.

reviews <- str_to_lower(reviews$X1)
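
Although word2vec() handles tokenization itself, punctuation can optionally be stripped as well. The following is a minimal sketch using stringr (loaded via tidyverse); the name reviews_clean is illustrative and this step is not applied in the remainder of the exercise.

reviews_clean <- reviews %>%
  str_replace_all("[[:punct:]]", " ") %>%  # replace punctuation by spaces
  str_squish()                             # collapse repeated whitespace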

Building the Model

After preprocessing, we can build the model. The model parameters are as follows:

set.seed(1234)
model <- word2vec(x = reviews, 
                  type = "skip-gram",  # skip-gram architecture (the alternative is "cbow")
                  dim = 15,            # dimensionality of the word vectors
                  iter = 50,           # number of training iterations
                  window = 5)          # context window size
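
Before inspecting individual vectors, it can be useful to list the vocabulary the model has learned. A short sketch, assuming summary() with type = "vocabulary" as exposed by the word2vec package:

vocab <- summary(model, type = "vocabulary")  # character vector with all words in the model
head(vocab)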

Studying the Embeddings

The embeddings can be studied using the following code:

embedding <- as.matrix(model)
head(embedding)
            [,1]        [,2]      [,3]         [,4]        [,5]       [,6]      [,7]
keep       -0.81864822  1.0854750 -2.67629504 -0.02170475 -0.4665074  0.3635763 0.7391611
horizontal -0.03751362 -0.7017022 -0.60359395  0.51619065 -0.1937486  1.0689538 1.6927575
few        -0.81527597  1.8935199 -1.07263422  0.39440507 -0.3323326 -0.4575967 0.2835988
built      -0.53252131  0.9398831 -0.90304583  0.01131631 -0.9355348  0.2979309 0.6236096
securely    1.16438735  0.3943901  0.28336731  0.72866017 -0.2707204  0.3751619 0.5749562
far         2.00076103  1.1405702  0.04167805  0.90028983  0.8213167  0.9303665 0.6135947
            [,8]        [,9]      [,10]        [,11]     [,12]        [,13]      [,14]
keep       -1.0751289 -0.09889881 0.72038585  1.5735097 -0.809497654 -0.26805308 -0.4004193
horizontal -1.1288217  1.14523578 0.32264298  0.6913526  0.329556882 -0.02390322 -2.5183733
few        -1.4347241 -0.81478894 0.59061855 -0.2143336 -0.032968488 -0.46669358 -2.0040767
built       0.6280455 -0.83900195 0.02309081  1.9162899 -0.232022464 -0.79194987 -2.3922527
securely   -0.9959249 -0.50447804 1.47701073  2.4078233  0.009302461 -0.98155111 -1.4548302
far        -1.0985045  0.05075052 0.12063552  1.6007921 -1.112638950 -0.49066821 -1.2755157
            [,15]
keep       -0.2105033
horizontal  0.4588534
few        -1.3004779
built      -0.6997061
securely    0.1649463
far        -0.2740761

The rows are the words and the columns are the embedding dimensions (15 in this case, as set by dim).
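
This can be checked by looking at the dimensions of the matrix: the number of rows equals the vocabulary size and the number of columns equals the dim parameter chosen above.

dim(embedding)  # vocabulary size x 15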

Predicting Embeddings

To get the embeddings of individual words, use the predict() function.

predict(model, c("solid", "ipad"), type = "embedding")
      [,1]       [,2]      [,3]       [,4]       [,5]       [,6]      [,7]       [,8]
solid 1.7472452  0.8801448 -0.6404629 -0.1135473 0.90142429 0.7708905 0.3954582  0.1154729
ipad  0.7736455 -0.1574108 -1.3182052 -0.3376513 0.04045169 0.9067312 0.3646089 -2.2979188
      [,9]       [,10]     [,11]     [,12]       [,13]        [,14]     [,15]
solid -1.3315314 0.4655769 1.670573 -0.89814436 -0.868761599 -1.663749  0.2506217
ipad   0.3507335 0.5662079 1.416128  0.06520224  0.006791097 -1.879313 -0.5501941

Finding Nearest Words

It is also possible to use the predict() function to find the nearest words to a given word. The "nearest" option makes use of cosine similarity.

lookslike <- predict(model, c("solid", "ipad"), type = "nearest", top_n = 5)
lookslike
$solid
  term1    term2      similarity   rank
1 solid    cover      0.9142381    1
2 solid    small      0.9098393    2
3 solid    then       0.9036511    3
4 solid    far        0.8976265    4
5 solid    portable   0.8936615    5

$ipad
   term1  term2   similarity   rank
1  ipad   the     0.9315773    1
2  ipad   and     0.9157794    2
3  ipad   to      0.9061200    3
4  ipad   it      0.9033046    4
5  ipad   then    0.8975880    5
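
To verify that the similarity column is indeed the cosine similarity, it can be recomputed from the raw embeddings. A minimal sketch for the pair solid and cover (the word cover is taken from the output above; the result should be close to 0.914, although exact values depend on the training run):

emb <- predict(model, c("solid", "cover"), type = "embedding")
# cosine similarity = dot product divided by the product of the vector norms
sum(emb["solid", ] * emb["cover", ]) /
  (sqrt(sum(emb["solid", ]^2)) * sqrt(sum(emb["cover", ]^2)))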

Removing Stopwords

It looks like the nearest words are dominated by stopwords, so we remove them using the stopwords() function from the tm package.
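
The stopwords() function simply returns a character vector of common English words; a quick peek (the exact list depends on the tm version):

head(stopwords())
# e.g. "i" "me" "my" "myself" "we" "our"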

model <- word2vec(x = reviews, 
                  type = "skip-gram", 
                  dim = 15, 
                  iter = 50, 
                  window = 5, 
                  stopwords = stopwords())

A second look at the nearest words gives:

lookslike <- predict(model, c("solid", "ipad"), type = "nearest", top_n = 5)
lookslike
$solid
   term1   term2   similarity   rank
1  solid   well    0.9856240    1
2  solid   works   0.9615405    2
3  solid   pretty  0.9569099    3
4  solid   set     0.9527774    4
5  solid   little  0.9491088    5

$ipad
   term1   term2   similarity   rank
1  ipad    will    0.9670351    1
2  ipad    take    0.9624067    2
3  ipad    air     0.9621316    3
4  ipad    doesnt  0.9558193    4
5  ipad    go      0.9483591    5

We can also search for some analogies: ipad is to air, as samsung is to … However, these analogies do not work well here; more data would be needed.

wv <- predict(model, newdata = c("ipad", "air", "samsung"), type = "embedding")
wv <- wv["ipad", ] - wv["air", ] + wv["samsung", ]
predict(model, newdata = wv, type = "nearest", top_n = 3)
   term     similarity   rank
1  10       0.9069780    1
2  highly   0.9064944    2
3  samsung  0.9056956    3

Note that the framework is compatible with the original word2vec implementation. To use external models that were not trained and saved with this R package, you need to set normalize = TRUE in read.word2vec().
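
For completeness, a model trained here can be written to disk in the binary word2vec format and read back later. A minimal sketch, assuming the file name model.bin; the commented line shows how an externally trained binary model would be loaded:

write.word2vec(model, file = "model.bin")      # save in the binary word2vec format
model_reloaded <- read.word2vec("model.bin")   # reload a model saved by this package
# read.word2vec("external.bin", normalize = TRUE)  # external models need normalize = TRUE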

Note

There is a fun game that plays with the word embeddings of the open-source LLaMA model. You start off with the four elements and, by combining their embeddings, you find the nearest words to the result. Play around with the website to get an intuition of how word embeddings work: Infinite Craft.

Exercise

A trip to Rome is coming up. You want to analyse the hotel reviews before you leave, specifically to check whether the location is good enough. Build the word2vec model and remove the stopwords. Use the type "skip-gram"; all other parameters stay the same. Then find the 3 nearest words for "location" and store them in lookalike.

To download the document-by-term matrix, click here.


Assume that: