gustavo-momente / word2vec_variations

A Python wrapper for word2vec C code


word2vec_variations

Introduction


These files, with the exception of the ones cited in the Dependencies section, were coded by me during my internship at LIP6. One objective of this software is to provide a Python interface to word2vec, since the latter is written in C. Another objective was to make it easy to vary the parameters that influence word2vec and then run the evaluation tasks without having to write long bash scripts. It's important to note that the Python code calls the compiled C and Perl code using the TODO library.
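The calling mechanism is left as a TODO above, but the standard subprocess module is a natural fit. Below is a minimal sketch, under that assumption, of driving the compiled word2vec binary from Python; the flags shown are genuine word2vec options, though the exact invocation used by this repo may differ.

```python
# Sketch only: assumes the subprocess module is the "TODO library" above.
import subprocess

# -train/-output/-size/-window/-binary are real word2vec command-line
# options; paths and values here are illustrative.
cmd = ["./word2vec/word2vec",
       "-train", "text9",         # training corpus
       "-output", "vectors.bin",  # learned vectors
       "-size", "200",            # embedding dimensionality
       "-window", "5",            # context window
       "-binary", "1"]            # write binary output
ret = subprocess.call(cmd)
if ret != 0:
    raise RuntimeError("word2vec exited with status %d" % ret)
```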

Two tasks are used for benchmarking. The first one, proposed by Mikolov et al., is based on analogical reasoning; it's referenced in the files as compute-accuracy or ca, as the former is the name of the file that executes this task. The other task is SemEval-2012 Task 2, whose objective is to measure degrees of relational similarity. This task is implemented by the SE_Test1.py file, but it uses the scoring algorithms supplied by the challenge organizers (see the Dependencies section).
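To make the analogy task concrete: compute-accuracy checks questions of the form "man is to king as woman is to ?" by a nearest-neighbour search around king - man + woman. The sketch below reproduces that idea in plain NumPy on toy vectors; it is an illustration, not the repo's C implementation, and with random vectors the returned word is arbitrary (trained vectors should yield queen).

```python
# Illustration of the analogy benchmark's core computation.
import numpy as np

def nearest(vocab, vecs, query, exclude):
    # vecs: one L2-normalized row per word; cosine similarity via dot product
    sims = vecs.dot(query / np.linalg.norm(query))
    for idx in np.argsort(-sims):          # best match first
        if vocab[idx] not in exclude:      # skip the question words
            return vocab[idx]

vocab = ["king", "man", "woman", "queen"]
vecs = np.random.randn(4, 50)              # toy vectors, random on purpose
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
query = vecs[0] - vecs[1] + vecs[2]        # king - man + woman
print(nearest(vocab, vecs, query, {"king", "man", "woman"}))
```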

Then, after running the benchmarks with varying parameters, evolution plots can be generated with log_analyzer.py. Moreover, one can run PCA and rank analyses using pca_success.py.
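As a rough idea of the PCA part, the following sketch projects a vector matrix to two dimensions with scikit-learn and plots it with matplotlib; pca_success.py's actual analysis (including the rank part) goes further than this.

```python
# Minimal PCA projection sketch; the placeholder matrix stands in for
# vectors learned by word2vec.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

vectors = np.random.randn(100, 200)            # placeholder word vectors
coords = PCA(n_components=2).fit_transform(vectors)
plt.scatter(coords[:, 0], coords[:, 1], s=5)
plt.title("2-D PCA projection of word vectors")
plt.savefig("pca.png")
```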

Furthermore, we'd like to point out that word2vec-g is a modified version of word2vec that allows loading a starting network configuration, and compute-accuracy-w is a modified version of compute-accuracy; the only difference from its counterpart is that compute-accuracy-w also outputs exactly which questions were answered correctly, and this output is used by pca_success.py.

Finally, pseudo-classes can be obtained from WordNet using word_net_tree.py. In our approach, these are later used to create a new corpus; the learning is run over it to obtain representations for the pseudo-classes, and the learning is then run over the original corpus with a start configuration derived from those pseudo-class representations. This start configuration is produced by generate_net.py.
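The sketch below shows one simple way such a pseudo-class corpus could be built with NLTK, mapping each word to a hypernym-derived class token; word_net_tree.py's actual class construction is more involved, so treat the pseudo_class function as hypothetical.

```python
# Hypothetical illustration of the pseudo-class substitution step.
from nltk.corpus import wordnet as wn

def pseudo_class(word):
    # map a word to a class token derived from its first noun synset's
    # first hypernym; fall back to the word itself otherwise
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets or not synsets[0].hypernyms():
        return word
    return "CLASS_" + synsets[0].hypernyms()[0].name().split(".")[0]

sentence = "the cat sat on the mat"
print(" ".join(pseudo_class(w) for w in sentence.split()))
```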

Usage Instructions


For all the Python files, the -h flag can be passed to show usage instructions. Moreover, the Tests directory contains comprehensive usage patterns for almost all the files.

Anyway, a typical usage, without entering the world of WordNet-derived pseudo-classes, is the following:

  1. Run make_logs.py for a given configuration and its variations;
  2. Repeat the previous step a few times;
  3. Average those logs using average_logs.py;
  4. Generate the plots using log_analyzer.py;
  5. Sort the results using sort_results.py.

Steps 2 and 3 can be skipped for a fast evaluation.
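For completeness, here is one way the workflow above could be chained from Python. The scripts' real flags are not repeated here; pass -h to each script and fill in the argument lists accordingly.

```python
# Hedged workflow sketch: run each step as its own process from the
# repository root; argument lists are left empty on purpose.
import subprocess

def run(script, *args):
    subprocess.check_call(["python", script] + list(args))

run("make_logs.py")      # step 1: produce logs (add your flags)
run("average_logs.py")   # step 3: average the repeated runs
run("log_analyzer.py")   # step 4: generate the plots
run("sort_results.py")   # step 5: sort the results
```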

Dependencies


This project uses Python 2.7.x, except for the files in the word2vec folder, which are in C: they are the Mikolov et al. implementation from https://code.google.com/p/word2vec/, with the exception of compute-accuracy-w.c and word2vec-g.c, which are modified versions of their word2vec counterparts. The compilation of these files was added to the existing Makefile.

For the Python code, the following dependencies are required:

  • NumPy
  • matplotlib
  • scikit-learn
  • NLTK

One important point is that we need the WordNet corpus provided by NLTK, but there appears to be a loop in the tree model of their version of WordNet, which can make word_net_tree.py enter an infinite loop. Fortunately, this seems to be corrected in the newest WordNet release, so we recommend the following approach (a sketch of the cycle guard this calls for follows the list):

  1. Install NLTK;
  2. Install the WordNet corpora using NLTK (reference);
  3. Find where the corpus was installed and replace the files with the newest version from the WordNet site; version 3.1 has proven stable in our tests.
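For the curious, the infinite loop arises because a naive recursive walk of the hypernym/hyponym structure never terminates if the graph contains a cycle. A visited set is the standard guard; the sketch below is independent of word_net_tree.py's actual traversal.

```python
# Cycle-safe walk over the WordNet noun hierarchy.
from nltk.corpus import wordnet as wn

def walk(synset, visited=None):
    if visited is None:
        visited = set()
    if synset in visited:        # a cycle would revisit a synset: stop here
        return
    visited.add(synset)
    for child in synset.hyponyms():
        walk(child, visited)

walk(wn.synset("entity.n.01"))   # root of the noun hierarchy
```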

Moreover, the files in the SE2012 folder are from the SemEval-2012 Task 2 challenge, and we call most of their Perl scripts from within SE_Test1.py, so a Perl interpreter is needed, along with the Statistics::RankCorrelation module.
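A quick way to check the Perl side before launching SE_Test1.py is to try loading the module from the command line; `perl -M<Module> -e 1` exits with status 0 only if the module loads.

```python
# Dependency check for the Perl module used by the SemEval-2012 scripts.
import subprocess

ret = subprocess.call(["perl", "-MStatistics::RankCorrelation", "-e", "1"])
print("Statistics::RankCorrelation is available" if ret == 0
      else "missing: install it from CPAN first")
```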

Furthermore, a corpus is needed for learning; the algorithm proposed by Mikolov et al. works better with large corpora. There is a small corpus called text8 that can be downloaded here. However, as you can see if you check the Tests folder, we didn't use this one. We chose a larger corpus, which we called text9; as with text8, the text data is extracted from Wikipedia. In fact, text8 is a cropped version of text9. Instructions to download and clean Wikipedia articles are described here, where text9 is called fil9 and the raw Wikipedia data enwik9.
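Since text8 is a cropped version of text9, a small test corpus can also be produced locally by truncating text9; the 10**8-byte cut below matches the published size of text8, but take the exact boundary as an assumption.

```python
# Assumption: text8 corresponds to the first 10**8 bytes of text9 (fil9).
with open("text9", "rb") as src, open("text8_local", "wb") as dst:
    dst.write(src.read(10 ** 8))
```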

Finally, in the Scripts folder there are some scripts whose names contain fb or freebase; these are related to the Freebase triplets known as FB15k. More information can be found on their site.

License


License information (MIT License) can be found in the LICENSE file. Moreover, all the dependencies have their own license descriptions.
