
MultiVec

C++ implementation of word2vec, bivec, and paragraph vector.

Features

Monolingual model

  • Most of word2vec's features [1, 6]
  • Evaluation on the analogical reasoning task (multithreaded version of word2vec's compute-accuracy)
  • Batch and online paragraph vector [2]
  • Save & load full model, including configuration and vocabulary
  • Python wrapper

Bilingual model

  • Bivec-like training with a parallel corpus [3, 7]
  • Save & load full model
  • Trains two monolingual models, which can be exported and used as standalone monolingual models
  • Python wrapper

Dependencies

  • GCC 4.4+
  • CMake 2.6+
  • Python and NumPy headers for the Python wrapper

Installation

git clone https://github.com/eske/multivec.git
mkdir multivec/build
cd multivec/build
cmake ..
make
cd ..

The bin directory should now contain 4 binaries:

  • multivec-mono, which is used to generate monolingual models;
  • multivec-bi, which generates bilingual models;
  • word2vec, a modified version of the original word2vec that matches our command-line interface;
  • compute-accuracy, which evaluates word embeddings on the analogical reasoning task (a multithreaded version of word2vec's compute-accuracy program).

Python wrapper

cd python-wrapper
make

Use from Python (multivec.so must be in the PYTHONPATH, e.g. working directory):

python2
>>> from multivec import BiModel
>>> model = BiModel()  # create an empty model before loading
>>> model.load('models/news.en-fr.bin')
>>> model.src_model
<monomodel.MonoModel object at 0x7fbd5585d138>
>>> model.src_model.word_vec('france')
array([ 0.33916989,  1.50113714, -1.37295866, -1.49976909, -1.75945604,
        0.17705017,  1.73590481,  2.26124287, -1.98765969, -2.01758456,
       -1.00831568, -0.47787675, -0.19950299, -2.3867569 , -0.01307649,
       ... ], dtype=float32)
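The vectors returned by word_vec are plain NumPy arrays, so standard vector operations apply directly. Below is a minimal sketch of a cosine-similarity check between two source-language words; it assumes the models/news.en-fr.bin file from above exists and that BiModel can be constructed empty before loading:

import numpy as np
from multivec import BiModel

model = BiModel()
model.load('models/news.en-fr.bin')

def cosine(u, v):
    # cosine similarity between two embedding vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# related words should score noticeably higher than unrelated ones
print(cosine(model.src_model.word_vec('france'),
             model.src_model.word_vec('paris')))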

Usage examples

First create two directories, data and models, at the root of the project, where you will put the text corpora and trained models. The script scripts/prepare.py can be used to pre-process a corpus (punctuation normalization, tokenization and lowercasing).

mkdir data
mkdir models
wget http://www.statmt.org/europarl/v7/de-en.tgz -P data
tar xzf data/de-en.tgz -C data
scripts/prepare.py data/europarl-v7.de-en.en en > data/europarl.en
scripts/prepare.py data/europarl-v7.de-en.de de > data/europarl.de

To train a monolingual model using text corpus data/europarl.en:

bin/multivec-mono --train data/europarl.en --save models/europarl.en.bin --threads 16

To train a bilingual model using parallel corpus data/europarl.en, data/europarl.de:

bin/multivec-bi --train-src data/europarl.en --train-trg data/europarl.de --save models/europarl.en-de.bin --threads 16
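
Since the source and target embeddings are trained jointly, translation pairs should end up close in the shared vector space. A short sketch of a cross-lingual similarity check, assuming the wrapper exposes a trg_model attribute symmetric to the src_model shown earlier (an assumption; only src_model appears in this document):

import numpy as np
from multivec import BiModel

model = BiModel()
model.load('models/europarl.en-de.bin')

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# trg_model is assumed to mirror src_model
en = model.src_model.word_vec('house')
de = model.trg_model.word_vec('haus')
print(cosine(en, de))  # expected to be high for a good bilingual model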

To load a bilingual model and export it to source and target monolingual models:

bin/multivec-bi --load models/europarl.en-de.bin --save-src models/europarl.en.bin --save-trg models/europarl.de.bin

To evaluate a trained English model on the analogical reasoning task, first export the model to the word2vec format, then use compute-accuracy:

bin/multivec-mono --load models/europarl.en.bin --save-vectors-bin models/vectors.bin
bin/compute-accuracy models/vectors.bin 0 < word2vec/questions-words.txt
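
What compute-accuracy measures can also be reproduced by hand on a single analogy: the offset king - man + woman should land near queen. A minimal sketch through the Python wrapper, assuming MonoModel is importable from the multivec module and loads like BiModel (both are assumptions):

import numpy as np
from multivec import MonoModel  # assumed import path, mirroring BiModel

model = MonoModel()
model.load('models/europarl.en.bin')

# analogy offset: king - man + woman ~ queen
v = model.word_vec('king') - model.word_vec('man') + model.word_vec('woman')
q = model.word_vec('queen')
print(np.dot(v, q) / (np.linalg.norm(v) * np.linalg.norm(q)))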

TODO

  • better software architecture for paragraph vector/online paragraph vector
  • paragraph vector: DBOW model (similar to skip-gram)
  • paragraph vector: option to concatenate, sum or average with word vectors on the projection layer
  • incremental training: possibility to train without erasing the model
  • GIZA alignment for bilingual model
  • bilingual paragraph vector training

Acknowledgement

This toolkit is part of the KEHATH project (https://kehath.imag.fr/), funded by the French National Research Agency.

Scientific paper

If you use this toolkit, please cite:

@InProceedings{MultiVecLREC2016,
Title                    = {{MultiVec: a Multilingual and Multilevel Representation Learning Toolkit for NLP}},
Author                   = {Alexandre Bérard and Christophe Servan and Olivier Pietquin and Laurent Besacier},
Booktitle                = {The 10th edition of the Language Resources and Evaluation Conference (LREC 2016)},
Year                     = {2016},
Month                    = {May}
}

References

  1. Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al. (2013)
  2. Distributed Representations of Sentences and Documents, Le and Mikolov (2014)
  3. Bilingual Word Representations with Monolingual Quality in Mind, Luong et al. (2015)
  4. Learning Distributed Representations for Multilingual Text Sequences, Pham et al. (2015)
  5. BilBOWA: Fast Bilingual Distributed Representations without Word Alignments, Gouws et al. (2014)
  6. Word2vec project
  7. Bivec project

About

License: Apache License 2.0

