Doc2Vec through Word2Vec in R

Load the packages

if (!require("pacman")) install.packages("pacman"); require("pacman")
p_load(tidyverse, word2vec)

Introduction

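The doc2vec function in the word2vec package builds a document vector by combining the vectors of the words that occur in the document; the package documentation describes this as the sum of the word vectors, standardised by the scale of the vector space. The result therefore depends entirely on the underlying trained Word2Vec model, so a suitable Word2Vec model should be selected before computing document embeddings.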
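To make the idea of combining word vectors concrete, here is a minimal base-R sketch on a made-up three-word vocabulary. The helper toy_doc_vector and the embedding values are hypothetical, and the root-mean-square rescaling is only one reading of the package documentation; the package's internal computation may differ.

```r
# Toy word-embedding matrix: one row per word, 3 dimensions (made-up values)
emb <- rbind(
  good    = c( 0.9, 0.1, 0.0),
  product = c( 0.2, 0.8, 0.1),
  bad     = c(-0.9, 0.1, 0.0)
)

# Hypothetical helper: combine word vectors into one document vector
toy_doc_vector <- function(text, emb) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[tokens %in% rownames(emb)]  # drop out-of-vocabulary words
  if (length(tokens) == 0) return(rep(NA_real_, ncol(emb)))
  v <- colSums(emb[tokens, , drop = FALSE])    # sum of the word vectors
  v / sqrt(mean(v^2))                          # rescale by root-mean-square (assumption)
}

doc_vec <- toy_doc_vector("Good product!", emb)
doc_vec
```

A document made up only of words missing from the vocabulary gets a vector of NA values in this sketch, so it is worth checking for such rows before using real embeddings downstream.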

Working with Product Reviews Data

# Load product reviews data
reviews <- read_delim(
  "productreviews.csv",
  col_names = FALSE,
  delim = "\n"
)

Using Word Embeddings to construct Document Embeddings

Instead of training word vectors ourselves, we will use pre-trained Word2Vec embeddings. Several good pre-trained models are available online; this example uses one trained on the English Wikipedia dump of November 2021. Download its binary file and load it with normalize = TRUE.

# Load the pre-trained word2vec model
model <- read.word2vec(file = 'model.bin', normalize = TRUE)

# Prepare data for doc2vec
doc <- data.frame(doc_id = seq_len(nrow(reviews)), text = reviews$X1)

# Preprocess the text data: lowercase, replace every non-letter character
# (punctuation, digits, symbols) with a space, then collapse repeated whitespace
doc$text <- doc %>%
  pull(text) %>%
  str_to_lower() %>%
  str_replace_all("[^[:alpha:]]", " ") %>%
  str_squish()

# Run the doc2vec model
doc_emb <- doc2vec(model, doc, type = "embedding")

The resulting doc_emb is a matrix with 86 rows (one per document) and 300 columns. The number of columns equals the embedding size of the word vectors.

# Display the first few rows and columns of doc_emb
doc_emb[1:5, 1:5]
       [,1]        [,2]      [,3]       [,4]      [,5]
1 0.5521684  0.33229590 -1.732286  0.7612352 -1.326712
2 0.5623162  0.57552676 -2.047334  0.6136263 -1.408552
3 0.7643956 -0.21663195 -1.297652  0.1428017 -1.544830
4 0.6840651 -0.10747597 -1.551191  0.5610230 -1.600742
5 0.8023079 -0.04767234 -1.599701 -0.4178068 -1.973022

These embeddings can be used as input features in a predictive model, with one numeric feature per embedding dimension.
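As an illustration of using document embeddings as features, the sketch below fits a logistic regression in base R. Everything in it is a stand-in: toy_emb is a random matrix playing the role of doc_emb, and the label column is a hypothetical binary outcome that the product reviews data does not contain.

```r
set.seed(42)

# Toy stand-in for doc_emb: 60 documents, 4 embedding dimensions (random values)
n <- 60
toy_emb <- matrix(
  rnorm(n * 4),
  nrow = n,
  dimnames = list(NULL, paste0("dim", 1:4))
)

# Hypothetical binary labels, loosely tied to the first dimension plus noise
label <- as.integer(toy_emb[, "dim1"] + rnorm(n) > 0)

# One row per document: the label followed by the embedding dimensions
train <- data.frame(label = label, toy_emb)

# Logistic regression with every embedding dimension as a predictor
fit <- glm(label ~ ., data = train, family = binomial())

# Predicted probability of label == 1 for each document
pred <- predict(fit, type = "response")
head(pred)
```

With real 300-dimensional embeddings and only 86 documents, an unregularised model like this one would likely overfit, so a regularised model or a dimensionality-reduction step is usually a better fit for small datasets.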

Finding Similar Documents

The code below embeds a new sentence with the same model, then returns the document in the dataset whose embedding is most similar to it.

new <- doc2vec(model, "Apple products are the best")
doc[word2vec_similarity(doc_emb, new) %>% which.max(),]
   doc_id   text
44     44   this is a very nice product the keys are responsive and...
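For readers who want to see what a similarity ranking involves, here is a base-R sketch that scores toy document vectors against a query vector with cosine similarity and sorts them. The matrix, the query, and the helper cosine_to_query are all made up for illustration; this is not the package's implementation, and cosine similarity is chosen here as one common option.

```r
# Toy document embeddings: 4 documents, 3 dimensions (made-up values)
toy_emb <- rbind(
  c( 1.0, 0.0, 0),
  c( 0.9, 0.1, 0),
  c( 0.0, 1.0, 0),
  c(-1.0, 0.0, 0)
)
rownames(toy_emb) <- paste0("doc", 1:4)

# Toy query embedding (made-up values)
query <- c(1, 0.2, 0)

# Hypothetical helper: cosine similarity between each row of `m` and vector `q`
cosine_to_query <- function(m, q) {
  as.vector(m %*% q) / (sqrt(rowSums(m^2)) * sqrt(sum(q^2)))
}

sims <- cosine_to_query(toy_emb, query)
names(sims) <- rownames(toy_emb)

# Most similar documents first
ranked <- sort(sims, decreasing = TRUE)
head(ranked, 2)
```

According to the package documentation, word2vec_similarity also takes type and top_n arguments, which would let the package do this ranking directly; check the help page for the defaults before relying on them.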