if (!require("pacman")) install.packages("pacman"); require("pacman")
p_load(tidyverse, word2vec)
The doc2vec() function in the word2vec package generalizes word vectors to document vectors: each document's embedding aggregates the vectors of the words it contains (a sketch verifying this appears after the embeddings are computed below). Because the document vectors are derived from word vectors, doc2vec() requires a trained Word2Vec model, so an appropriate Word2Vec model should be selected before building the document embeddings.
# Load product reviews data
reviews <- read_delim(
"productreviews.csv",
col_names = FALSE,
delim = "\n"
)
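Because the file is read with a newline delimiter, each line of productreviews.csv becomes one review, stored in column X1. A quick check confirms the shape of the data:
# Sanity check: one review per row, stored in column X1
nrow(reviews)   # 86 reviews
head(reviews$X1, 2)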
Instead of training word vectors ourselves, we will use pre-trained Word2Vec embeddings. Several good pre-trained models are available online. Download the model trained on the English Wikipedia dump of November 2021 and use the binary file with normalize = TRUE.
# Load the pre-trained word2vec model
model <- read.word2vec(file = 'model.bin', normalize = TRUE)
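As a sanity check, we can inspect the loaded model; the package's as.matrix() method returns the full embedding matrix, one row per vocabulary word.
# Inspect the pre-trained model: one row per vocabulary word, 300 columns
emb <- as.matrix(model)
dim(emb)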
# Prepare data for doc2vec
doc <- data.frame(doc_id = 1:nrow(reviews), text = reviews$X1)
# Preprocess the text data: lowercase, keep letters only, collapse whitespace
doc$text <- doc %>%
  pull(text) %>%
  str_to_lower() %>%
  str_replace_all("[^[:alpha:]]", " ") %>%  # removes punctuation and digits in one pass
  str_squish()
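To see what this pipeline does, here is its effect on a small made-up review (a hypothetical input, for illustration only):
# Hypothetical example input, not from the dataset
"Great keyboard!! 10/10, would buy again." %>%
  str_to_lower() %>%
  str_replace_all("[^[:alpha:]]", " ") %>%
  str_squish()
#> [1] "great keyboard would buy again"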
# Run the doc2vec model
doc_emb <- doc2vec(model, doc, type = "embedding")
The resulting doc_emb is a matrix with 86 rows (one per document) and 300 columns, matching the embedding size of the pre-trained word vectors.
# Display the first few rows and columns of doc_emb
doc_emb[1:5, 1:5]
[,1] [,2] [,3] [,4] [,5]
1 0.5521684 0.33229590 -1.732286 0.7612352 -1.326712
2 0.5623162 0.57552676 -2.047334 0.6136263 -1.408552
3 0.7643956 -0.21663195 -1.297652 0.1428017 -1.544830
4 0.6840651 -0.10747597 -1.551191 0.5610230 -1.600742
5 0.8023079 -0.04767234 -1.599701 -0.4178068 -1.973022
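To make the aggregation of word vectors concrete, here is a minimal sketch, assuming the document vector is (up to scaling) an average of the word vectors of the document; since the package may scale its result differently, we compare directions via cosine similarity rather than raw values.
# Sketch: compare a manual average of word vectors with the doc2vec() result
words    <- strsplit(doc$text[1], " ")[[1]]
word_emb <- predict(model, words, type = "embedding")  # NA rows for out-of-vocabulary words
manual   <- colMeans(word_emb, na.rm = TRUE)
word2vec_similarity(matrix(manual, nrow = 1),
                    doc_emb[1, , drop = FALSE], type = "cosine")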
These embeddings can be used as document-level features in any predictive model.
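For instance, a minimal sketch of using them in an unsupervised model, here k-means with an arbitrary (assumed, for illustration) choice of three clusters:
# Sketch: cluster reviews on their embeddings; k = 3 is an arbitrary choice
set.seed(42)
ok       <- complete.cases(doc_emb)   # drop documents with no known words, if any
clusters <- kmeans(doc_emb[ok, ], centers = 3)
table(clusters$cluster)               # cluster sizes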
The code below finds the review in the dataset that is most similar to a new sentence.
new <- doc2vec(model, "Apple products are the best")
doc[which.max(word2vec_similarity(doc_emb, new)), ]
doc_id text
44 44 this is a very nice product the keys are responsive and...
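The same similarities can also rank all reviews, for example to retrieve the three closest matches rather than only the best one:
# Sketch: rank all reviews by similarity to the query sentence
sims <- word2vec_similarity(doc_emb, new)   # default dot-product similarity
head(doc[order(sims, decreasing = TRUE), ], 3)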