Instead of constructing document vectors from Word2Vec word embeddings, you can also build paragraph vectors directly with the doc2vec
package.
# Load the doc2vec package
p_load(doc2vec)
The package requires a data frame with a doc_id
column and a text
column as input. Ensure that the text contains no newline symbols.
Since we already prepared doc in the correct format, we can use it here.
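To make the expected input shape concrete, here is a minimal sketch on toy data (not the chapter's dataset): one row per document, a doc_id column and a text column, with any newline symbols stripped out.

```r
# Toy illustration of the required input: doc_id plus text,
# with newline symbols removed from the text.
raw_texts <- c("I love my new tablet.\nThe screen is great.",
               "The battery life is disappointing.")
doc_toy <- data.frame(
  doc_id = as.character(seq_along(raw_texts)),
  text   = gsub("\n", " ", raw_texts, fixed = TRUE),  # strip newlines
  stringsAsFactors = FALSE
)
```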
First, we build a PV-DM (Distributed Memory) model from scratch.
# Build a PV-DM model from scratch
pvdm <- paragraph2vec(
x = doc,          # data frame with doc_id and text columns
type = "PV-DM",   # distributed memory architecture
dim = 100,        # dimensionality of the embeddings
iter = 20,        # number of training iterations
min_count = 5,    # discard words occurring fewer than 5 times
lr = 0.05,        # learning rate
threads = 1       # number of CPU threads
)
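The package also implements the PV-DBOW architecture (type = "PV-DBOW"), which predicts words directly from the paragraph vector instead of combining it with context word vectors. A minimal sketch on a toy corpus (the data and hyperparameters here are illustrative, not the chapter's; on the real data you would pass doc with the settings above):

```r
library(doc2vec)

# Toy corpus so the sketch runs standalone.
toy <- data.frame(
  doc_id = c("1", "2"),
  text   = c("the tablet has a great screen and the tablet is fast",
             "the battery of the tablet drains fast and the screen is dim"),
  stringsAsFactors = FALSE
)
pvdbow <- paragraph2vec(x = toy, type = "PV-DBOW",
                        dim = 10, iter = 20, min_count = 1,
                        lr = 0.05, threads = 1)
dim(as.matrix(pvdbow, which = "docs"))  # one row per document, dim columns
```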
The embeddings of both the words and the documents can be retrieved. Word and document embeddings have the same dimensionality, so they live in the same vector space.
# Retrieve word embeddings
as.matrix(pvdm, which = "words") %>% .[1:5, 1:5]
[,1] [,2] [,3] [,4] [,5]
</s> 0.14333282 0.15825513 -0.13715845 -0.11738408 0.04893599
0.08825304 -0.08049262 0.15293245 -0.08376718 -0.10480364
the 0.05162577 -0.06025964 0.01978746 -0.04263477 0.09742799
it -0.06643348 0.05399752 0.03811104 0.01577955 -0.02346892
a 0.02250784 -0.06664305 0.01829301 -0.12920336 0.11679579
# Retrieve document embeddings
as.matrix(pvdm, which = "docs") %>% .[1:5, 1:5]
[,1] [,2] [,3] [,4] [,5]
1 0.08378936 -0.071290582 0.15677798 -0.10466643 -0.09493607
2 0.09323791 -0.071697682 0.10055041 -0.03386657 0.01626084
3 0.05738901 -0.089625038 0.04649137 -0.12280785 0.01003175
4 0.14255086 -0.053726304 0.03185777 -0.08643609 -0.11152207
5 0.12896720 0.002413007 0.06398371 -0.06594130 0.08091883
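Because words and documents share one vector space, a word can be compared to a document directly, for example with cosine similarity. A self-contained sketch on a toy corpus (toy data and a small dimensionality, not the chapter's model):

```r
library(doc2vec)

# Toy corpus; in the chapter you would use the trained pvdm model.
toy <- data.frame(
  doc_id = c("1", "2"),
  text   = c("the tablet has a great screen and the tablet is fast",
             "the battery of the tablet drains fast and the screen is dim"),
  stringsAsFactors = FALSE
)
m <- paragraph2vec(x = toy, type = "PV-DM",
                   dim = 10, iter = 20, min_count = 1, threads = 1)

# Word and document vectors have the same length, so cosine
# similarity between them is well defined.
w <- as.matrix(m, which = "words")["tablet", ]
d <- as.matrix(m, which = "docs")["1", ]
cosine <- sum(w * d) / (sqrt(sum(w^2)) * sqrt(sum(d^2)))
```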
The included documents and words can also be retrieved.
# Retrieve summary of included documents
summary(pvdm, which = "docs") %>% head()
[1] "1" "2" "3" "4" "5" "6"
# Retrieve summary of included words
summary(pvdm, which = "words") %>% head()
[1] "</s>" "" "the" "it" "a" "and"
Look at the embeddings of specific words and documents.
# Embeddings of specific words
predict(pvdm, newdata = c("apple", "ipad"),
type = "embedding", which = "words")
# Embeddings of specific documents
predict(pvdm, newdata = c("1", "2", "10"),
type = "embedding", which = "docs")
You can also find the nearest words and documents.
# Nearest words
predict(pvdm, newdata = c("samsung", "galaxy"),
type = "nearest", which = "word2word",
top_n = 2)
[[1]]
term1 term2 similarity rank
1 samsung highly 0.9154802 1
2 samsung tab 0.9146598 2
[[2]]
term1 term2 similarity rank
1 galaxy tab 0.9073155 1
2 galaxy easily 0.8510080 2
# Nearest docs of a word
predict(pvdm, newdata = c("samsung", "galaxy"),
type = "nearest", which = "word2doc",
top_n = 2)
[[1]]
term1 term2 similarity rank
1 samsung 55 0.8601123 1
2 samsung 74 0.8037568 2
[[2]]
term1 term2 similarity rank
1 galaxy <ff> 0.8357276 1
2 galaxy 23 0.7737346 2
# Nearest docs of a doc
predict(pvdm, newdata = "2",
type = "nearest", which = "doc2doc",
top_n = 2)
[[1]]
term1 term2 similarity rank
1 2 68 0.9007084 1
2 2 52 0.8778872 2
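Finally, the model can also relate entirely new text to the training documents: per the package documentation, predict() accepts a list of tokenised sentences together with which = "sent2doc". A hedged sketch on a toy corpus (toy data; see ?predict.paragraph2vec for the exact interface):

```r
library(doc2vec)

# Toy corpus; in the chapter you would query the trained pvdm model.
toy <- data.frame(
  doc_id = c("1", "2"),
  text   = c("the tablet has a great screen and the tablet is fast",
             "the battery of the tablet drains fast and the screen is dim"),
  stringsAsFactors = FALSE
)
m <- paragraph2vec(x = toy, type = "PV-DM",
                   dim = 10, iter = 20, min_count = 1, threads = 1)

# Nearest training documents for a new, tokenised sentence
newsent <- list(s1 = c("great", "screen"))
res <- predict(m, newdata = newsent, type = "nearest",
               which = "sent2doc", top_n = 1)
```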