In this exercise, we will explore the concept of word embeddings.
We will use the word2vec package, which allows us to train word embeddings using multiple threads on character data or data in a text file.
The embeddings are then used to find relations between words.
if (!require("pacman")) install.packages("pacman") ; require("pacman")
p_load(word2vec, text2vec, Rtsne, scales, ggrepel, tidyverse, tm)
We will use product reviews data to illustrate the topic of word embeddings. The data is loaded into the environment using the following code:
reviews <- read_delim("productreviews.csv",
col_names=FALSE,
delim="\n")
Rows: 86 Columns: 1
The word2vec model does not require extensive preprocessing. A simple normalization to lower case often suffices.
reviews <- str_to_lower(reviews$X1)
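If the text were noisier, a little extra cleaning could be added. The sketch below shows two optional steps; they are not applied in the rest of this exercise, and the name reviews_clean is only illustrative.
# Optional extra cleaning, for illustration only (not used in the rest of the exercise)
reviews_clean <- gsub("[[:punct:]]+", " ", reviews)        # strip punctuation
reviews_clean <- gsub("\\s+", " ", trimws(reviews_clean))  # collapse repeated whitespace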
After preprocessing, we can build the model. The model parameters are as follows:
skip-gram: trains each context word against the input word, i.e. the input word is used to predict its surrounding context.
window: how far, in terms of words behind and ahead of the input word, we go.
dim: the number of dimensions used for the embedding. The default value is 128.
iter: the number of training iterations that are performed.
set.seed(1234)
model <- word2vec(x = reviews,
type = "skip-gram",
dim = 15,
iter = 50,
window = 5)
The embeddings can be studied using the following code:
embedding <- as.matrix(model)
head(embedding)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
keep -0.81864822 1.0854750 -2.67629504 -0.02170475 -0.4665074 0.3635763 0.7391611
horizontal -0.03751362 -0.7017022 -0.60359395 0.51619065 -0.1937486 1.0689538 1.6927575
few -0.81527597 1.8935199 -1.07263422 0.39440507 -0.3323326 -0.4575967 0.2835988
built -0.53252131 0.9398831 -0.90304583 0.01131631 -0.9355348 0.2979309 0.6236096
securely 1.16438735 0.3943901 0.28336731 0.72866017 -0.2707204 0.3751619 0.5749562
far 2.00076103 1.1405702 0.04167805 0.90028983 0.8213167 0.9303665 0.6135947
[,8] [,9] [,10] [,11] [,12] [,13] [,14]
keep -1.0751289 -0.09889881 0.72038585 1.5735097 -0.809497654 -0.26805308 -0.4004193
horizontal -1.1288217 1.14523578 0.32264298 0.6913526 0.329556882 -0.02390322 -2.5183733
few -1.4347241 -0.81478894 0.59061855 -0.2143336 -0.032968488 -0.46669358 -2.0040767
built 0.6280455 -0.83900195 0.02309081 1.9162899 -0.232022464 -0.79194987 -2.3922527
securely -0.9959249 -0.50447804 1.47701073 2.4078233 0.009302461 -0.98155111 -1.4548302
far -1.0985045 0.05075052 0.12063552 1.6007921 -1.112638950 -0.49066821 -1.2755157
[,15]
keep -0.2105033
horizontal 0.4588534
few -1.3004779
built -0.6997061
securely 0.1649463
far -0.2740761
The rows are the words and the columns are the embedding dimensions.
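To confirm the shape of the embedding matrix (the number of rows depends on the vocabulary retained during training):
dim(embedding)  # rows = vocabulary terms, columns = embedding dimensions (15 here)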
To get the embedding of individual words, use the predict() function.
predict(model, c("solid", "ipad"), type = "embedding")
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
solid 1.7472452 0.8801448 -0.6404629 -0.1135473 0.90142429 0.7708905 0.3954582 0.1154729
ipad 0.7736455 -0.1574108 -1.3182052 -0.3376513 0.04045169 0.9067312 0.3646089 -2.2979188
[,9] [,10] [,11] [,12] [,13] [,14] [,15]
solid -1.3315314 0.4655769 1.670573 -0.89814436 -0.868761599 -1.663749 0.2506217
ipad 0.3507335 0.5662079 1.416128 0.06520224 0.006791097 -1.879313 -0.5501941
It is also possible to use the predict() function to find the nearest words. The nearest option makes use of cosine similarity.
lookslike <- predict(model, c("solid", "ipad"), type = "nearest", top_n = 5)
lookslike
$solid
term1 term2 similarity rank
1 solid cover 0.9142381 1
2 solid small 0.9098393 2
3 solid then 0.9036511 3
4 solid far 0.8976265 4
5 solid portable 0.8936615 5
$ipad
term1 term2 similarity rank
1 ipad the 0.9315773 1
2 ipad and 0.9157794 2
3 ipad to 0.9061200 3
4 ipad it 0.9033046 4
5 ipad then 0.8975880 5
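As a sanity check, these similarity values can be recomputed by hand from the embedding vectors. A minimal sketch, using the pair solid/cover from the output above (the result should roughly match the reported similarity):
e <- predict(model, c("solid", "cover"), type = "embedding")
# cosine similarity = dot product divided by the product of the vector norms
sum(e["solid", ] * e["cover", ]) / (sqrt(sum(e["solid", ]^2)) * sqrt(sum(e["cover", ]^2)))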
It looks like there are a lot of stopwords, so we delete them using the stopwords() function from the tm package.
model <- word2vec(x = reviews,
type = "skip-gram",
dim = 15,
iter = 50,
window = 5,
stopwords = stopwords())
A second look at the nearest words gives:
lookslike <- predict(model, c("solid", "ipad"), type = "nearest", top_n = 5)
lookslike
$solid
term1 term2 similarity rank
1 solid well 0.9856240 1
2 solid works 0.9615405 2
3 solid pretty 0.9569099 3
4 solid set 0.9527774 4
5 solid little 0.9491088 5
$ipad
term1 term2 similarity rank
1 ipad will 0.9670351 1
2 ipad take 0.9624067 2
3 ipad air 0.9621316 3
4 ipad doesnt 0.9558193 4
5 ipad go 0.9483591 5
Search for some analogies: ipad is to air as samsung is to … However, these analogies do not work well here; more data would be needed.
wv <- predict(model, newdata = c("ipad", "air", "samsung"), type = "embedding")
wv <- wv["ipad", ] - wv["air", ] + wv["samsung", ]
predict(model, newdata = wv, type = "nearest", top_n = 3)
term similarity rank
1 10 0.9069780 1
2 highly 0.9064944 2
3 samsung 0.9056956 3
Note that the framework is compatible with the original word2vec model implementation. In order to use external models which are not trained and saved with this R package, you need to set normalize = TRUE in read.word2vec().
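A minimal sketch of how such an external model could be loaded (the file name is purely illustrative):
# Read a pre-trained binary word2vec model that was not created with this R package
external_model <- read.word2vec(file = "pretrained-vectors.bin", normalize = TRUE)
external_embedding <- as.matrix(external_model)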
There is a fun game that plays with the word embeddings of the open-source LLaMA model. You start off with the four elements and, by combining their embeddings, you can find the nearest words to the result. Play around with the website to get an intuition of how word embeddings work: Infinite Craft1
A trip to Rome is coming up. You want to analyse the hotel reviews before you leave, specifically whether the location is good enough. Build the word2vec model and delete the stopwords. Use the type skip-gram; all other parameters stay the same. Then, find the 3 nearest words for location and store them in lookalike.
To download the document-by-term matrix, click: here2
Assume that the hotel reviews are already loaded into: reviews.
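One possible way to tackle this exercise (a sketch only; it assumes the hotel reviews are available as a character vector reviews, the name rome_model is illustrative, and the parameters mirror the ones used earlier in this exercise):
set.seed(1234)
rome_model <- word2vec(x = str_to_lower(reviews),
                       type = "skip-gram",
                       dim = 15,
                       iter = 50,
                       window = 5,
                       stopwords = stopwords())
# 3 nearest words to "location", ranked by cosine similarity
lookalike <- predict(rome_model, "location", type = "nearest", top_n = 3)
lookalike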