predict.paragraph2vec crashes with words longer than 103 characters
Ingolifs opened this issue
It took me a little while to hunt down the cause of this crash. It happens on my machine at the very least, running R 3.6.3.
library(doc2vec)
corpus <- data.frame(doc_id = 1, text = "here are some words for training the model")
model <- paragraph2vec(x = corpus, type = "PV-DM", dim = 10, iter = 20, min_count = 1)
# this text will successfully run
successtext <- "GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG"
nchar(successtext)
predict(model, newdata = list(a=successtext), type = "embedding", which = "docs")
# this text will cause a crash
failtext <- "GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG"
nchar(failtext)
predict(model, newdata = list(a=failtext), type = "embedding", which = "docs")
Thanks for the reproducible boom.
Unfortunately there is currently a hard limit built into the C++ side: words can be at most 100 characters long. This can be seen here: https://github.com/bnosac/doc2vec/blob/master/src/doc2vec/common_define.h#L11. Another hard limit set at the C++ side is that a document can contain at most 1000 words (https://github.com/bnosac/doc2vec/blob/master/src/doc2vec/common_define.h#L14).
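If you want to check your data against both limits up front, a minimal sketch in R, assuming simple whitespace tokenisation (check_limits is a hypothetical helper, not part of the package):
# Pre-flight check against both hard limits: at most 100 characters
# per word and at most 1000 words per document
check_limits <- function(texts, max_word_len = 100, max_doc_words = 1000) {
  words <- strsplit(texts, "\\s+")
  list(
    docs_with_long_words = which(vapply(words, function(w) any(nchar(w) > max_word_len), logical(1))),
    docs_with_too_many_words = which(vapply(words, length, integer(1)) > max_doc_words)
  )
}
check_limits(c(successtext, failtext))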
The boom probably occurs at src/doc2vec/TaggedBrownCorpus.cpp, lines 88 and 90 (commit 3e94756).
The package only allows words with a maximum length of 100 characters. The crash is caused by strcpy copying a longer string into a memory block that only holds 100 characters, namely at https://github.com/bnosac/doc2vec/blob/master/src/rcpp_doc2vec.cpp#L147 and https://github.com/bnosac/doc2vec/blob/master/src/rcpp_doc2vec.cpp#L254
I'll explicitly truncate the strings to 100 characters at the C++ side to avoid the crash, but keep in mind that your words should be shorter than 100 characters.
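In the meantime a user-side workaround is to truncate the words yourself before calling predict. A minimal sketch, assuming whitespace tokenisation (truncate_words is a hypothetical helper, not part of the package):
# Truncate every word to at most 100 characters before prediction
truncate_words <- function(text, max_len = 100) {
  words <- strsplit(text, "\\s+")[[1]]
  paste(substr(words, 1, max_len), collapse = " ")
}
predict(model, newdata = list(a = truncate_words(failtext)), type = "embedding", which = "docs")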