paragraph vector trained by negative sampling
This project requires a template library for linear algebra: Eigen (http://eigen.tuxfamily.org/index.php?title=Main_Page)
An online demo is available at: http://www.logos.t.u-tokyo.ac.jp/~hassy/implementations/paragraph_vector/
- speed up the code
- make it possible to train unknown paragraphs, such as paragraphs in test data
- modify the line in Makefile to use Eigen
  EIGEN_LOCATION=$$HOME/local/eigen #Change this line
- run the command "make" or run the script "sample.sh"
- train a model using your corpus, which should have one paragraph (or document, or sentence) per line
  ./paragraph_vector -input input.txt -output result
  (run "./paragraph_vector -help" or see Utils.hpp for other options)
- use the resulting files for your purpose
  result.bin
  result.pv: each line has a paragraph ID and the real values of its vector representation
  result.wv: each line has a word and the real values of its vector representation
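
The text output files can be read back with a few lines of C++. The following is only a sketch: it assumes that the fields on each line of result.pv and result.wv are separated by whitespace, which is not stated above, and the names VectorEntry, parseVectorLine and loadVectors are hypothetical helpers that are not part of this project.

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// One parsed line: the leading token (a paragraph ID in result.pv,
// a word in result.wv) followed by the values of its vector.
struct VectorEntry {
  std::string key;
  std::vector<double> values;
};

// Parses a single line. ASSUMPTION: fields are whitespace-separated.
// Returns false if the line has no key or no numeric values.
bool parseVectorLine(const std::string& line, VectorEntry& entry) {
  std::istringstream iss(line);
  entry.values.clear();
  if (!(iss >> entry.key)) {
    return false;  // empty or blank line
  }
  double value;
  while (iss >> value) {
    entry.values.push_back(value);
  }
  return !entry.values.empty();
}

// Reads every parsable line of a text vector file such as result.pv.
// Returns an empty list if the file cannot be opened.
std::vector<VectorEntry> loadVectors(const std::string& path) {
  std::vector<VectorEntry> entries;
  std::ifstream in(path);
  std::string line;
  VectorEntry entry;
  while (std::getline(in, line)) {
    if (parseVectorLine(line, entry)) {
      entries.push_back(entry);
    }
  }
  return entries;
}
```

For example, loadVectors("result.pv") would return one entry per paragraph, and the same function could be pointed at result.wv to get one entry per word. If the actual separator differs, only parseVectorLine needs to change.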
Quoc Le, Tomas Mikolov. Distributed Representations of Sentences and Documents. 2014. Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1188--1196.