lxc-xx / doc2vec

This repository contains Python scripts for training doc2vec (paragraph vector) models and for inferring vectors for test documents.

Requirements

Gensim: if you need to load pre-trained word embeddings when training doc2vec, use my forked version of Gensim; otherwise, the canonical version works.

Pre-Trained Doc2Vec Models

Pre-Trained Word2Vec Models

For reproducibility, we have also released pre-trained word2vec skip-gram models trained on Wikipedia and AP News.

Directory Structure and Files

train_model.py: example Python script that trains a doc2vec model on the toy data
infer_test.py: example Python script that infers test document vectors using a trained model
toy_data: directory containing some toy train/test documents and pre-trained word embeddings

Publications

Jey Han Lau and Timothy Baldwin (2016). An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation. In Proceedings of the 1st Workshop on Representation Learning for NLP, 2016.

License

Apache License 2.0