Building Paragraph Vectors with doc2vec Package

Instead of building a Doc2Vec model on top of Word2Vec yourself, you can also build paragraph vectors directly with the doc2vec package.

# Load the doc2vec package (p_load() from pacman installs it if needed)
p_load(doc2vec)

Model Initialization

The package requires a data frame with a doc_id and a text column as input. Make sure the text contains no newline characters. Since we already prepared doc in this format, we can use it directly. First, we build a PV-DM (Distributed Memory) model from scratch.
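Before training, the input requirements above can be checked with a few assertions. This is an optional sketch, assuming doc is the prepared data frame from earlier:

```r
# Sanity checks on the training input (assumes doc was prepared earlier)
stopifnot(is.data.frame(doc),
          all(c("doc_id", "text") %in% colnames(doc)))
# The text column must not contain newline characters
stopifnot(!any(grepl("\n", doc$text, fixed = TRUE)))
```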

# Build a PV-DM model from scratch
pvdm <- paragraph2vec(
  x = doc,
  type = "PV-DM",
  dim = 100,
  iter = 20,
  min_count = 5,
  lr = 0.05,
  threads = 1
)

Retrieving Embeddings

The embeddings for both the words and docs can be retrieved. The dimensions of the word and document embeddings are the same.

# Retrieve word embeddings
as.matrix(pvdm, which = "words") %>% .[1:5, 1:5]
            [,1]        [,2]        [,3]        [,4]        [,5]
</s>  0.14333282  0.15825513 -0.13715845 -0.11738408  0.04893599
      0.08825304 -0.08049262  0.15293245 -0.08376718 -0.10480364
the   0.05162577 -0.06025964  0.01978746 -0.04263477  0.09742799
it   -0.06643348  0.05399752  0.03811104  0.01577955 -0.02346892
a     0.02250784 -0.06664305  0.01829301 -0.12920336  0.11679579
# Retrieve document embeddings
as.matrix(pvdm, which = "docs") %>% .[1:5, 1:5]
        [,1]         [,2]       [,3]        [,4]        [,5]
1 0.08378936 -0.071290582 0.15677798 -0.10466643 -0.09493607
2 0.09323791 -0.071697682 0.10055041 -0.03386657  0.01626084
3 0.05738901 -0.089625038 0.04649137 -0.12280785  0.01003175
4 0.14255086 -0.053726304 0.03185777 -0.08643609 -0.11152207
5 0.12896720  0.002413007 0.06398371 -0.06594130  0.08091883
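The claim that word and document embeddings share the same dimensionality can be verified directly; both matrices have dim = 100 columns, matching the dim argument used during training:

```r
# Both embedding matrices share the dimensionality set at training time
emb_words <- as.matrix(pvdm, which = "words")
emb_docs  <- as.matrix(pvdm, which = "docs")
ncol(emb_words)  # 100
ncol(emb_docs)   # 100
```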

Summary of Included Documents and Words

The doc_ids and the vocabulary included in the model can also be retrieved with summary().

# Retrieve summary of included documents
summary(pvdm, which = "docs") %>% head()
[1] "1" "2" "3" "4" "5" "6"
# Retrieve summary of included words
summary(pvdm, which = "words") %>% head()
[1] "</s>" ""     "the"  "it"   "a"    "and" 

Examining Specific Embeddings

With predict() we can look at the embeddings of specific words and documents.

# Embeddings of specific words
predict(pvdm, newdata = c("apple", "ipad"),
        type = "embedding", which = "words")

# Embeddings of specific documents
predict(pvdm, newdata = c("1", "2", "10"),
        type = "embedding", which = "docs")
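The nearest-neighbour queries in the next section rank candidates by cosine similarity between these embedding vectors. A minimal sketch of that computation by hand, assuming the two words appear in the model's vocabulary:

```r
# Cosine similarity between two word vectors, computed by hand;
# this mirrors the similarity column reported by predict(type = "nearest")
emb <- as.matrix(pvdm, which = "words")
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))
cosine(emb["samsung", ], emb["galaxy", ])
```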

Finding Nearest Words and Documents

You can also find the nearest words and documents by setting which to "word2word", "word2doc", or "doc2doc".

# Nearest words
predict(pvdm, newdata = c("samsung", "galaxy"),
        type = "nearest", which = "word2word",
        top_n = 2)
[[1]]
    term1  term2 similarity rank
1 samsung highly  0.9154802    1
2 samsung    tab  0.9146598    2

[[2]]
   term1  term2 similarity rank
1 galaxy    tab  0.9073155    1
2 galaxy easily  0.8510080    2
# Nearest docs of a word
predict(pvdm, newdata = c("samsung", "galaxy"),
        type = "nearest", which = "word2doc",
        top_n = 2)
[[1]]
    term1 term2 similarity rank
1 samsung    55  0.8601123    1
2 samsung    74  0.8037568    2

[[2]]
   term1 term2 similarity rank
1 galaxy  <ff>  0.8357276    1
2 galaxy    23  0.7737346    2
# Nearest docs of a doc
predict(pvdm, newdata = "2",
        type = "nearest", which = "doc2doc",
        top_n = 2)
[[1]]
  term1 term2 similarity rank
1     2    68  0.9007084    1
2     2    52  0.8778872    2
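The same workflow applies to the other paragraph-vector architecture. As a sketch, a PV-DBOW (Distributed Bag of Words) model can be trained by changing only the type argument; the remaining hyperparameters below simply mirror the PV-DM call above:

```r
# Build a PV-DBOW model with the same hyperparameters as the PV-DM model
pvdbow <- paragraph2vec(
  x = doc,
  type = "PV-DBOW",
  dim = 100,
  iter = 20,
  min_count = 5,
  lr = 0.05,
  threads = 1
)

# Retrieval works exactly as before
as.matrix(pvdbow, which = "docs") %>% .[1:5, 1:5]
```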