bnosac / doc2vec

Distributed Representations of Sentences and Documents

predict.paragraph2vec crashes with words greater than 103 chars long

Ingolifs opened this issue · comments

It took me a little while to hunt down the cause of this crash...

It does this on my machine at the very least. This is on R 3.6.3.


library(doc2vec)

corpus <- data.frame(doc_id = 1, text = "here are some words for training the model")
model <- paragraph2vec(x = corpus, type = "PV-DM", dim = 10, iter = 20, min_count = 1)

# this text will successfully run
successtext <- "GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG"
nchar(successtext)
predict(model, newdata = list(a=successtext), type = "embedding", which = "docs")

# this text will cause a crash
failtext <- "GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG"
nchar(failtext)
predict(model, newdata = list(a=failtext), type = "embedding", which = "docs")

Thanks for the reproducible boom.

Unfortunately, the C++ side currently has a hard limit built in: a word can be at most 100 characters long. This can be seen here: https://github.com/bnosac/doc2vec/blob/master/src/doc2vec/common_define.h#L11. Another hard limit set on the C++ side is the maximum number of words a document can have, which is 1000 (https://github.com/bnosac/doc2vec/blob/master/src/doc2vec/common_define.h#L14).
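To make the two limits concrete, here is a rough sketch of a pre-check that scans a document and reports whether it would run into either of them. This is not code from the package: `check_limits`, `LimitReport` and the two constants are hypothetical names, the values 100 and 1000 are simply the ones quoted above, and splitting on whitespace is an assumption about how words are tokenised.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Assumed values, copied from the limits quoted in this thread.
constexpr std::size_t kMaxWordLength = 100;    // stands in for MAX_STRING
constexpr std::size_t kMaxWordsPerDoc = 1000;  // stands in for MAX_SENTENCE_LENGTH

struct LimitReport {
    std::size_t word_count = 0;    // whitespace-separated tokens found
    std::size_t longest_word = 0;  // length in bytes of the longest token
    bool word_too_long = false;    // some token does not fit the word buffer
    bool too_many_words = false;   // more tokens than the per-document cap
};

// Hypothetical helper: tokenise on whitespace and compare against both limits.
LimitReport check_limits(const std::string& text) {
    LimitReport report;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        ++report.word_count;
        if (word.size() > report.longest_word) {
            report.longest_word = word.size();
        }
    }
    // A buffer of kMaxWordLength bytes holds at most kMaxWordLength - 1
    // characters plus the terminating '\0'.
    report.word_too_long = report.longest_word >= kMaxWordLength;
    report.too_many_words = report.word_count > kMaxWordsPerDoc;
    return report;
}
```

Calling `check_limits(std::string(104, 'G'))` would set `word_too_long`, while the short training sentence from the original report passes both checks. The comparison uses `>=` because a 100-byte buffer also has to store the string terminator.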

The boom probably occurs at

m_tag = (char *)calloc(MAX_STRING, sizeof(char));
or
for(int i = 0; i < MAX_SENTENCE_LENGTH; i++) m_words[i] = (char *)calloc(MAX_STRING, sizeof(char));

The package only allows words with a maximum length of 100 characters. The crash is caused by strcpy copying a longer string into a memory block that only has room for 100 characters, namely at https://github.com/bnosac/doc2vec/blob/master/src/rcpp_doc2vec.cpp#L147 and https://github.com/bnosac/doc2vec/blob/master/src/rcpp_doc2vec.cpp#L254.
I'll explicitly truncate the strings to 100 characters to avoid the crash on the C++ side, but keep in mind that your words should be shorter than 100 characters.
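For anyone curious what a bounded copy looks like on the C++ side, here is a minimal sketch. It is not the fix that went into the package (that one truncates with substr on the R side). `copy_truncated` and `kMaxString` are hypothetical names, and the buffer size of 100 is assumed from the limit quoted in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

constexpr std::size_t kMaxString = 100;  // assumed, mirrors MAX_STRING

// Hypothetical helper: copy at most capacity - 1 bytes of src into dst and
// always write a terminating '\0'. Returns true when src had to be truncated.
bool copy_truncated(char* dst, std::size_t capacity, const char* src) {
    assert(dst != nullptr && src != nullptr && capacity > 0);
    const std::size_t len = std::strlen(src);
    const std::size_t n = len < capacity - 1 ? len : capacity - 1;
    std::memcpy(dst, src, n);
    dst[n] = '\0';
    return len > n;
}
```

With a buffer from `calloc(kMaxString, sizeof(char))`, a 104-character word ends up as its first 99 bytes plus the terminator instead of writing past the end of the block. Note that this truncates by bytes, so a multi-byte UTF-8 character at the cut-off point could be split.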

Fixed using substr in commit 7f584f9.