26hzhang/Word2VecfJava

Word2VecfJava

A Java implementation of the paper Dependency-Based Word Embeddings, published by Levy et al. at ACL, together with some extensions.

The algorithm uses the Skip-Gram method and trains a shallow neural network; the input corpus is pre-processed with the Stanford Dependency Parser. For background on word embedding techniques, consult the related literature online. Usage is shown in the examples.

Requirements

DL4J: GitHub page [link], Maven source [link].
ND4J: GitHub page [link], Maven source [link].
Stanford NLP: GitHub page [link], Maven sources [link] (for Maven, import both the corenlp snippet and the corenlp snippet with classifier models).
Guava: Maven sources [link].

Notes

The Word2Vecf project is a modification of the original Word2Vec proposed by Mikolov, allowing:

performing multiple iterations over the data;
using arbitrary context features;
dumping the context vectors at the end of the process.

Unlike the original Word2Vec project, which can be used directly, Word2Vecf requires some pre-computation: it DOES NOT handle vocabulary construction and DOES NOT read sentences or paragraphs as input directly.

The expected files are:

word_vocabulary: file mapping words (strings) to their counts.
context_vocabulary: file mapping contexts (strings) to their counts, used to construct the sampling table for negative sampling.
training_data: text file of word-context pairs. Each pair takes a separate line. The format of a pair is "(word context)", i.e. space-delimited, where word and context are strings. To prefer some contexts over others, construct the training data so that it contains the bias.
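
As a rough illustration of how the two vocabulary files relate to the training data, the sketch below counts words and contexts from a list of pair lines. The class PairVocabulary and its method names are hypothetical and are not part of this project; it also assumes each line holds exactly two whitespace-delimited tokens with no surrounding punctuation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not part of the project): counts words and contexts in word-context pairs. */
final class PairVocabulary {
    final Map<String, Integer> wordCounts = new TreeMap<>();
    final Map<String, Integer> contextCounts = new TreeMap<>();

    /** Consumes training_data lines, assumed to look like "word context". */
    static PairVocabulary fromPairs(List<String> lines) {
        PairVocabulary vocab = new PairVocabulary();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) {
                continue; // skip blank or malformed lines
            }
            vocab.wordCounts.merge(parts[0], 1, Integer::sum);
            vocab.contextCounts.merge(parts[1], 1, Integer::sum);
        }
        return vocab;
    }

    public static void main(String[] args) {
        // Toy pairs; in practice the lines would be read from the training_data file.
        List<String> pairs = Arrays.asList(
                "discovers nsubj_scientist",
                "discovers dobj_star",
                "scientist nsubjI_discovers");
        PairVocabulary vocab = fromPairs(pairs);
        // Each map can then be written out as "string count" lines.
        vocab.wordCounts.forEach((word, count) -> System.out.println(word + " " + count));
        vocab.contextCounts.forEach((ctx, count) -> System.out.println(ctx + " " + count));
    }
}
```

Writing each map out one entry per line would give files of the same shape as word_vocabulary and context_vocabulary; any frequency thresholding would be applied before writing.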

To make the project more usable, the pre-computations are also implemented inside the project. Since Word2Vecf produces dependency-based word embeddings, the Stanford Dependency Parser is used; more usage information can be found on its website.
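
To make the pre-computation step more concrete, the sketch below turns already-parsed dependency arcs into word-context pair lines. It does not call the Stanford parser: the Arc type is a hypothetical stand-in for the parser output, and the context naming ("relation_modifier" for the head, "relationI_head" for the modifier) is an assumed convention chosen for illustration, not necessarily the one this project writes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: converts dependency arcs into "word context" training lines. */
final class DependencyContexts {

    /** One dependency arc, head --relation--> modifier (illustrative type only). */
    static final class Arc {
        final String head;
        final String relation;
        final String modifier;

        Arc(String head, String relation, String modifier) {
            this.head = head;
            this.relation = relation;
            this.modifier = modifier;
        }
    }

    /** Emits two pairs per arc: one for the head and an inverse-marked one for the modifier. */
    static List<String> toPairs(List<Arc> arcs) {
        List<String> pairs = new ArrayList<>();
        for (Arc arc : arcs) {
            // Assumed naming convention; contexts must not contain spaces.
            pairs.add(arc.head + " " + arc.relation + "_" + arc.modifier);
            pairs.add(arc.modifier + " " + arc.relation + "I_" + arc.head);
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<Arc> arcs = Arrays.asList(
                new Arc("discovers", "nsubj", "scientist"),
                new Arc("discovers", "dobj", "star"),
                new Arc("scientist", "amod", "australian"));
        toPairs(arcs).forEach(System.out::println);
    }
}
```

Each emitted line is one training pair, so the output can be used directly as training_data and counted to build the two vocabulary files.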

Semantic Property Task

WordSim353: The WordSim353 set contains 353 word pairs. It was constructed by asking human subjects to rate the degree of semantic similarity or relatedness between two words on a numerical scale. Performance is measured by the Pearson correlation between the cosine similarity of the two word embeddings and the average score given by the participants. [pdf]
TOEFL: The TOEFL set contains 80 multiple-choice synonym questions, each with 4 candidates. For example, the question word levied has the choices imposed (correct), believed, requested and correlated. The nearest neighbor of the question word among the candidates, based on cosine distance, is chosen as the answer, and accuracy is used to measure performance. [pdf]
Analogy: The analogy task has approximately 9K semantic and 10.5K syntactic analogy questions. The questions are similar to “man is to (woman) as king is to queen” or “predict is to (predicting) as dance is to dancing”. Following previous work, the nearest neighbor of "queen − king + man" in the vocabulary is used as the answer, and accuracy is used to measure performance. This dataset is relatively large compared to the previous two sets; therefore, results on it are more stable than those on the previous two datasets. [pdf]
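
The three evaluation procedures above can be sketched in a few lines. The class EmbeddingEval below is a hypothetical illustration, not the evaluation code of this project; it assumes embeddings are available as an in-memory map from word to double[] and uses toy two-dimensional vectors.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the similarity, synonym and analogy evaluations. */
final class EmbeddingEval {

    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Pearson correlation, e.g. between model similarities and human scores. */
    static double pearson(double[] x, double[] y) {
        double meanX = Arrays.stream(x).average().orElse(0);
        double meanY = Arrays.stream(y).average().orElse(0);
        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < x.length; i++) {
            cov += (x[i] - meanX) * (y[i] - meanY);
            varX += (x[i] - meanX) * (x[i] - meanX);
            varY += (y[i] - meanY) * (y[i] - meanY);
        }
        return cov / Math.sqrt(varX * varY);
    }

    /** Returns the candidate whose vector is closest to the query (TOEFL-style choice). */
    static String nearest(double[] query, Map<String, double[]> candidates) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> entry : candidates.entrySet()) {
            double score = cosine(query, entry.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    /** Nearest neighbor of plus1 - minus + plus2, excluding the three question words. */
    static String analogy(Map<String, double[]> vectors, String plus1, String minus, String plus2) {
        double[] p1 = vectors.get(plus1), m = vectors.get(minus), p2 = vectors.get(plus2);
        double[] target = new double[p1.length];
        for (int i = 0; i < target.length; i++) {
            target[i] = p1[i] - m[i] + p2[i];
        }
        Map<String, double[]> rest = new HashMap<>(vectors);
        rest.keySet().removeAll(Arrays.asList(plus1, minus, plus2));
        return nearest(target, rest);
    }

    public static void main(String[] args) {
        Map<String, double[]> vectors = new HashMap<>();
        vectors.put("man", new double[] {1, 0});
        vectors.put("woman", new double[] {1, 1});
        vectors.put("king", new double[] {3, 0});
        vectors.put("queen", new double[] {3, 1});
        vectors.put("apple", new double[] {0, -1});
        System.out.println(analogy(vectors, "queen", "king", "man"));
    }
}
```

For WordSim353, the cosine value of every pair would be collected into one array and the human ratings into another before calling pearson; for TOEFL, the candidates map would hold only the four answer options for the question word.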

Reference

eikdk/Word2VecJava
word2vec -- google sources, download
Yoav Goldberg/word2vecf
orenmel/lexsub
GoogleNews-vectors-negative300.bin (Pre-trained Google News corpus (3 billion running words) word vector model (3 million 300-dimension English word vectors))

Other Information

Version Log.

Word2Vecf C Codes Usage
