This repository contains Python scripts for training doc2vec (paragraph vector) models and for inferring vectors for test documents.
Requirements
- Gensim: if you need to load pre-trained word embeddings when training doc2vec, use my forked version of Gensim; otherwise, the canonical version works.
Pre-Trained Doc2Vec Models
Pre-Trained Word2Vec Models
For reproducibility, we also release the pre-trained word2vec skip-gram models trained on Wikipedia and AP News:
Directory Structure and Files
- train_model.py: example Python script that trains a doc2vec model on the toy data
- infer_test.py: example Python script that infers test document vectors using a trained model
- toy_data: directory containing some toy train/test documents and pre-trained word embeddings
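
The train-then-infer workflow that these two scripts cover can be sketched roughly as follows. This is a minimal illustration, not the contents of train_model.py or infer_test.py: the helper names `tokenize_lines` and `train_and_infer` are hypothetical, the hyperparameter values are placeholders, and the one-document-per-line, whitespace-tokenized input format is an assumption. Parameter names follow the canonical Gensim 4.x API; older releases and the forked version may use different names (for example `size` and `iter`).

```python
from typing import Iterable, List


def tokenize_lines(lines: Iterable[str]) -> List[List[str]]:
    """Split each non-empty line into whitespace-separated tokens.

    Assumption: the input holds one pre-tokenized document per line.
    """
    return [line.split() for line in lines if line.strip()]


def train_and_infer(train_docs, test_docs, vector_size=50, epochs=20):
    """Train a DBOW doc2vec model, then infer vectors for unseen documents.

    Hypothetical helper; the hyperparameters here are illustrative only.
    """
    # Imported inside the function so tokenize_lines works without Gensim.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Each training document needs a unique tag; its index is used here.
    tagged = [TaggedDocument(words=doc, tags=[i])
              for i, doc in enumerate(train_docs)]

    # dm=0 selects DBOW; dbow_words=1 also trains skip-gram word vectors.
    model = Doc2Vec(vector_size=vector_size, window=5, min_count=1,
                    dm=0, dbow_words=1, epochs=epochs)
    model.build_vocab(tagged)
    model.train(tagged, total_examples=model.corpus_count,
                epochs=model.epochs)

    # infer_vector takes a list of tokens and returns one vector per call.
    return [model.infer_vector(doc, epochs=epochs) for doc in test_docs]
```

With Gensim installed, `train_and_infer(tokenize_lines(open(train_path)), tokenize_lines(open(test_path)))` would return one vector per test document, where `train_path` and `test_path` are placeholders for your own files. Inference is stochastic, so repeated calls on the same document generally give slightly different vectors.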
Publications
- Jey Han Lau and Timothy Baldwin (2016). An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation. In Proceedings of the 1st Workshop on Representation Learning for NLP.