nnakul / word-embeddings

Implementing a feed-forward neural network in conjunction with advanced sampling techniques to learn high-quality distributed vector representations of words.


Word Embeddings

Word embedding is a technique for representing words for text analysis, typically as a real-valued vector that encodes the meaning of a word so that words closer together in the vector space are expected to be similar in meaning. Word embeddings can be obtained with a family of language-modelling and feature-learning techniques in which words or phrases from the vocabulary are mapped to vectors of real numbers. This project is mainly inspired by the work of Mikolov et al. published in the paper Efficient Estimation of Word Representations in Vector Space. In that paper, the authors proposed two novel model architectures for computing continuous vector representations of words from very large data sets: the Continuous Bag-of-Words Model and the Continuous Skip-gram Model. This project uses the Continuous Skip-gram Model to learn word embeddings from an 11-million-word text corpus with a vocabulary size of 202,000 (without re-sampling and filtering).
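
As a quick illustration of the skip-gram setup (a sketch, not the repository's code): each word in a sentence is paired with the words inside a fixed-size window around it, and the model is trained to predict the context word from the centre word.

```python
def skipgram_pairs(token_ids, window=5):
    """Yield (centre, context) word-index pairs from one tokenized sentence."""
    for i, centre in enumerate(token_ids):
        lo, hi = max(0, i - window), min(len(token_ids), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield centre, token_ids[j]

# Example with a toy sentence of word ids [0, 1, 2, 3] and window=1:
# list(skipgram_pairs([0, 1, 2, 3], window=1))
# -> [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
```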

In conjunction with the basics discussed in that paper, the project also implements the advanced sampling techniques and extensions of the Continuous Skip-gram Model presented in another paper by Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality. These extensions improve both the quality of the vectors and the training speed: subsampling the frequent words yields a significant speedup and also produces more regular word representations.
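
The subsampling rule from that second paper discards each occurrence of a word w with probability 1 − sqrt(t / f(w)), where f(w) is the word's relative frequency and t is a small threshold (the paper suggests values around 1e-5). A minimal sketch (not the repository's code; `counts` is assumed to be a NumPy array of word frequencies indexed by word id):

```python
import numpy as np

def subsample(token_ids, counts, t=1e-5, rng=np.random.default_rng()):
    """Randomly drop occurrences of very frequent words."""
    freqs = counts / counts.sum()                    # relative frequency f(w) per word id
    keep_prob = np.minimum(1.0, np.sqrt(t / freqs))  # P(keep w) = min(1, sqrt(t / f(w)))
    return [w for w in token_ids if rng.random() < keep_prob[w]]
```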

Negative Sampling

Negative Sampling (NEG) can be seen as an approximation to Noise Contrastive Estimation (NCE). NCE approximates the loss of the softmax increasingly well as the number of noise samples increases. NEG simplifies NCE and does away with this guarantee, since the objective of NEG is to learn high-quality word representations rather than to achieve low perplexity on a test set, as is the goal in language modelling.

NEG uses a logistic loss function to minimise the negative log-likelihood of words in the training set. The task is to use logistic regression to distinguish the true target from a subset of k other possible targets, where the subset is drawn from a noise distribution over all targets. The targets in this k-subset are called negative samples, and the noise distribution is based on the words' frequencies in the training corpus (as described in the paper). In this project, 10 negative samples are drawn for every training sample when computing the forward loss (k = 10).
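
A minimal NumPy sketch of this loss for one (centre, context) pair. The names `W_in`, `W_out` (input and output embedding matrices) and `noise_probs` (the noise distribution; the papers use the unigram distribution raised to the 3/4 power) are assumptions, not identifiers from the repository:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(centre, context, W_in, W_out, noise_probs, k=10,
                      rng=np.random.default_rng()):
    """Negative-sampling loss: -log s(u_o . v_c) - sum_i log s(-u_i . v_c)."""
    v_c = W_in[centre]                                    # centre-word vector, shape (D,)
    negatives = rng.choice(len(noise_probs), size=k, p=noise_probs)
    pos_score = W_out[context] @ v_c                      # score of the true target
    neg_scores = W_out[negatives] @ v_c                   # scores of the k negative samples
    return -np.log(sigmoid(pos_score)) - np.sum(np.log(sigmoid(-neg_scores)))

# Noise distribution (as in the papers): unigram counts raised to the 3/4 power.
# noise_probs = counts ** 0.75
# noise_probs /= noise_probs.sum()
```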

Training

The model was trained over 9 epochs, with a learning rate of 0.003. Training took almost 5 hours.
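
For concreteness, a stochastic-gradient update for the negative-sampling objective, wrapped in an epoch loop with the hyperparameters stated above, could look roughly like this. It is only a sketch: `training_pairs`, `noise_probs`, `vocab_size` and the matrix names are assumptions, not the repository's code.

```python
import numpy as np

def sgd_step(centre, context, negatives, W_in, W_out, lr=0.003):
    """One negative-sampling update for a single (centre, context) pair."""
    v_c = W_in[centre]
    targets = np.concatenate(([context], negatives))   # true target first, then negatives
    labels = np.zeros(len(targets))
    labels[0] = 1.0
    scores = W_out[targets] @ v_c
    errors = 1.0 / (1.0 + np.exp(-scores)) - labels    # sigmoid(score) - label
    grad_v = errors @ W_out[targets]                   # gradient w.r.t. the centre vector
    W_out[targets] -= lr * np.outer(errors, v_c)       # update output vectors
    W_in[centre] -= lr * grad_v                        # update the centre (input) vector

# rng = np.random.default_rng()
# for epoch in range(9):                               # 9 epochs, as stated above
#     for centre, context in training_pairs:
#         negatives = rng.choice(vocab_size, size=10, p=noise_probs)
#         sgd_step(centre, context, negatives, W_in, W_out, lr=0.003)
```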


Performance

The function below is the original objective that would be used to compute the loss in the uppermost layer of the network if no sampling-based or approximated-softmax approach were used. This loss function, together with the second equation, is used to measure the performance of the model on test corpora of varying sizes (performance being the sigmoid of the reciprocal of the loss).

$$E = -\log p(w_O \mid w_I) = -\,u_{w_O}^{\top} v_{w_I} + \log \sum_{w=1}^{W} \exp\!\left(u_{w}^{\top} v_{w_I}\right)$$

$$\text{performance} = \sigma\!\left(\frac{1}{\bar{E}}\right), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$

where $v_{w_I}$ and $u_{w_O}$ are the input (centre-word) and output (context-word) vectors, $W$ is the vocabulary size, and $\bar{E}$ is the loss averaged over a corpus.
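
A small NumPy sketch of this evaluation (illustrative names, not the repository's evaluation code): the full-softmax negative log-likelihood is averaged over the (centre, context) pairs of a corpus, and the performance is the sigmoid of its reciprocal.

```python
import numpy as np

def softmax_nll(centre, context, W_in, W_out):
    """Full-softmax negative log-likelihood of one (centre, context) pair."""
    scores = W_out @ W_in[centre]              # one score per vocabulary word, shape (V,)
    scores -= scores.max()                     # for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[context]

def performance(pairs, W_in, W_out):
    """Sigmoid of the reciprocal of the average loss over a corpus."""
    avg_loss = np.mean([softmax_nll(c, o, W_in, W_out) for c, o in pairs])
    return 1.0 / (1.0 + np.exp(-1.0 / avg_loss))
```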

In addition to the training corpus, the performance of the model was evaluated on 6 test corpora of increasing word counts. For comparison, a Gensim Word2Vec model was trained on the same training corpus and evaluated on all of the corpora. The performance chart of the two models is shown below.

[Performance chart: My Model vs. the Gensim Word2Vec model on the training corpus and the 6 test corpora]

My Model performs better than Gensim's model on all of the corpora. This might be due to the noise filtering and re-sampling that My Model's pipeline applies to refine the training corpus first, steps that may not be performed before training Gensim's model. Gensim's model performs almost the same on the test corpora as on the training corpus (excellent fitting). My Model does not show drastic fluctuations in performance as the corpus size increases, but its performance on every test corpus is observed to be lower than on the training corpus.
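
For reference, a comparable Gensim baseline can be trained roughly as follows (Gensim 4.x API; the exact settings used for this comparison are not stated here, so the values below are illustrative):

```python
from gensim.models import Word2Vec

sentences = [["the", "king", "rules", "the", "country"],
             ["the", "queen", "rules", "the", "country"]]   # tokenized corpus

model = Word2Vec(
    sentences,
    sg=1,             # skip-gram architecture
    negative=10,      # negative samples per positive pair
    vector_size=100,  # embedding dimension
    window=5,         # context window size
    min_count=1,      # keep every word in this toy corpus
    epochs=9,
)
print(model.wv.most_similar("king", topn=3))
```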

Analogical Reasoning

Analogical reasoning tasks check the model's ability to organize concepts automatically and to learn the relationships between them implicitly, since during training no supervised information was provided about, say, what a capital city means or how the father-mother semantic relationship corresponds to the son-daughter one. This model performs considerably well on the analogical reasoning tasks.
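
A typical analogy query ("man is to king as woman is to ?") can be answered with cosine similarity over the learned vectors. The sketch below uses illustrative names (`W_in`, `word_to_id`, `id_to_word`) rather than the repository's actual API:

```python
import numpy as np

def analogy(a, b, c, W_in, word_to_id, id_to_word, topn=1):
    """Return the words whose vectors are closest to vec(b) - vec(a) + vec(c)."""
    q = W_in[word_to_id[b]] - W_in[word_to_id[a]] + W_in[word_to_id[c]]
    sims = (W_in @ q) / (np.linalg.norm(W_in, axis=1) * np.linalg.norm(q) + 1e-9)
    exclude = {word_to_id[a], word_to_id[b], word_to_id[c]}
    best = [i for i in np.argsort(-sims) if i not in exclude][:topn]
    return [id_to_word[i] for i in best]

# analogy("man", "king", "woman", W_in, word_to_id, id_to_word)
# is expected to rank "queen" highly for well-trained embeddings.
```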

Improvements

The implementation does not deal with phrases. For example, the meanings of "Canada" and "Air" cannot easily be combined to obtain "Air Canada". The Distributed Representations of Words and Phrases and their Compositionality paper presents a simple data-driven method for finding phrases in text and shows that learning good vector representations for millions of phrases in the training corpus is possible.
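
That method scores each bigram ab as (count(ab) − δ) / (count(a) × count(b)) and merges bigrams whose score exceeds a chosen threshold into single tokens such as "air_canada". A rough sketch (the δ and threshold values are illustrative):

```python
from collections import Counter

def find_phrases(tokens, delta=5, threshold=1e-4):
    """Return the set of bigrams that score above the threshold."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        (a, b)
        for (a, b), n_ab in bigrams.items()
        if (n_ab - delta) / (unigrams[a] * unigrams[b]) > threshold
    }
```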

About

License: GNU General Public License v3.0


Languages

Jupyter Notebook: 74.8%
Python: 25.2%