shayneobrien/language-modeling

Language modeling on the Penn Treebank (PTB) corpus using a trigram model with linear interpolation, a neural probabilistic language model, and a regularized LSTM.

Home page: https://harvard-ml-courses.github.io/cs287-web/

Introduction

In this repository we train three language models on the canonical Penn Treebank (PTB) corpus. This corpus is split into training and validation sets of approximately 929K and 73K tokens, respectively. We implement (1) a traditional trigram model with linear interpolation, (2) a neural probabilistic language model as described by (Bengio et al., 2003), and (3) a regularized Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) units following (Zaremba et al., 2015). We also experiment with a series of modifications to the LSTM model and achieve a perplexity of 92.9 on the validation set with a multi-layer model.
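
As context for the trigram model, linear interpolation mixes unigram, bigram, and trigram maximum-likelihood estimates with weights that sum to one (learned here by expectation maximization). The sketch below is a minimal illustration of that scoring step; the function name `interpolated_prob` and the count structures are our own, and the repository's actual implementation may differ.

```python
from collections import Counter

def interpolated_prob(w3, w2, w1, unigrams, bigrams, trigrams, total, lambdas):
    """P(w3 | w1, w2) as a linear interpolation of MLE n-gram estimates.

    `lambdas` = (l1, l2, l3) must sum to 1; in the report these weights are
    learned by expectation maximization.
    """
    l1, l2, l3 = lambdas
    p_uni = unigrams[w3] / total
    p_bi = bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] > 0 else 0.0
    p_tri = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] > 0 else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Toy usage: count n-grams over a tokenized corpus.
tokens = "the cat sat on the mat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
p = interpolated_prob("sat", "cat", "the", unigrams, bigrams, trigrams,
                      len(tokens), lambdas=(0.1, 0.3, 0.6))
```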

Problem Description

In the Stanford Sentiment Treebank sentiment classification task, we are provided with a corpus of sentences taken from movie reviews. Each sentence has been tagged as either positive, negative, or neutral; we follow (Kim, 2014) in removing the neutral examples and formulating the task as a binary decision between positive and negative sentences.

Model and Algorithms

For each model variant, we formalize prediction for test case $i$ as

$$\hat{y}^{(i)} = f(W x^{(i)} + b) \qquad (1)$$

where $f$ is an activation function, $W$ are our learned weights, $x^{(i)}$ is our feature vector for input $i$, and $b$ is a bias vector. Note that in this problem set, $y^{(i)} \in \{0, 1\}$ for all $i$, since we only consider positive and negative sentiment inputs in the SST-2 dataset. We use PyTorch for all model implementations, and all models are trained for 10 epochs using batches of size 10, a learning rate of 1e-4, the Adam optimizer, and the negative log likelihood loss function. The only exception to this setup was multinomial naive Bayes, which was fit in one epoch with smoothing parameter $\alpha = 1.0$.
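
A minimal sketch of this shared training setup, assuming a generic `model` and a `train_iter` that yields batches with `text` and `label` fields; these names are placeholders rather than the repository's exact code.

```python
import torch
import torch.nn as nn

def train(model, train_iter, epochs=10, lr=1e-4):
    """Generic training loop: Adam, learning rate 1e-4, NLL loss, 10 epochs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.NLLLoss()
    model.train()
    for epoch in range(epochs):
        for batch in train_iter:           # batches of size 10
            optimizer.zero_grad()
            log_probs = model(batch.text)  # models are assumed to emit log-probabilities
            loss = criterion(log_probs, batch.label)
            loss.backward()
            optimizer.step()
```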

Multinomial Naive Bayes

Let $f^{(i)}$ be the feature count vector for training case $i$ with classification label $y^{(i)} \in \{-1, 1\}$. $V$ is the set of features, and $f^{(i)}_j$ represents the number of occurrences of feature $V_j$ in training case $i$. Define the count vectors $\mathbf{p} = \alpha + \sum_{i:\, y^{(i)} = 1} \hat{f}^{(i)}$ and $\mathbf{q} = \alpha + \sum_{i:\, y^{(i)} = -1} \hat{f}^{(i)}$ for the smoothing parameter $\alpha$. We follow Wang and Manning in binarizing the counts: $\hat{f}^{(i)} = \mathbf{1}\{f^{(i)} > 0\}$. With regard to Equation (1), $W = \log\left(\mathbf{p}/\lVert\mathbf{p}\rVert_1\right) - \log\left(\mathbf{q}/\lVert\mathbf{q}\rVert_1\right)$ is the log-count ratio between positive and negative examples, $b = \log(N_+/N_-)$ where $N_+$ and $N_-$ are the number of positive and negative training cases in the training dataset, $x^{(i)} = \hat{f}^{(i)}$ is the binarized vector of feature occurrences for input $i$, and $\mathbf{1}\{\cdot\}$ is a binary indicator function that maps to 1 if its argument is greater than 0 and to 0 otherwise. We consider only unigrams as features.
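
As a concrete illustration of these statistics, the following sketch computes the log-count ratio and bias from a binarized document-term matrix. The names (`nb_weights`, `X`, `y`) and the NumPy formulation are our own assumptions, not the repository's implementation.

```python
import numpy as np

def nb_weights(X, y, alpha=1.0):
    """Compute the naive Bayes log-count ratio W and bias b.

    X: (n_cases, n_features) array of unigram counts.
    y: (n_cases,) array with labels in {-1, +1}.
    """
    X = (X > 0).astype(float)                      # binarize counts
    p = alpha + X[y == 1].sum(axis=0)              # smoothed positive-class counts
    q = alpha + X[y == -1].sum(axis=0)             # smoothed negative-class counts
    W = np.log(p / p.sum()) - np.log(q / q.sum())  # log-count ratio
    b = np.log((y == 1).sum() / (y == -1).sum())   # log ratio of class sizes
    return W, b

# Prediction for a new binarized case x: sign(W @ x + b)
```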

Logistic Regression

We learn weight and bias matrices $W$ and $b$ so as to optimize

$$\min_{W,\, b} \; -\sum_{i} \log p\left(y^{(i)} \mid x^{(i)}; W, b\right)$$

where $x^{(i)}$ is a $|V|$-dimensional vector representing bag-of-words unigram counts for each training sample. In our implementation, we represent $W$ and $b$ with a single fully-connected layer, which maps $x^{(i)}$ directly to a two-unit output layer under sigmoid activation.
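
A minimal PyTorch sketch of this model, assuming a vocabulary of size `vocab_size`; the class name and defaults are illustrative.

```python
import torch.nn as nn

class LogisticRegression(nn.Module):
    """Bag-of-words logistic regression: a single fully-connected layer."""
    def __init__(self, vocab_size, num_classes=2):
        super().__init__()
        self.linear = nn.Linear(vocab_size, num_classes)

    def forward(self, x):
        # x: (batch, vocab_size) unigram count vectors;
        # two output units under sigmoid activation, as described above.
        return self.linear(x).sigmoid()
```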

Continuous Bag-of-Words

In the CBOW architecture, each word $w_i$ in a sentence input of word-length $n$ is mapped to a $d$-dimensional embedding vector $E(w_i)$. The embedding vectors for all words are averaged to produce a single feature vector that represents the entire input. In particular,

$$x = \frac{1}{n} \sum_{i=1}^{n} E(w_i)$$

where $x \in \mathbb{R}^{d_s}$ and $E(w_i) \in \mathbb{R}^{d_w}$, with $d_s = d_w$ for all words, where $d_s$ and $d_w$ are the dimensions of the sentence embedding and word embeddings, respectively. This encoding is then passed into a single fully-connected layer that maps directly to two output units, representing the output classes, under softmax activation.
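
A minimal PyTorch sketch of the CBOW model described above; the 300-dimensional embedding size and the names used here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Average word embeddings, then one fully-connected layer to two classes."""
    def __init__(self, vocab_size, embed_dim=300, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):
        # tokens: (batch, seq_len) word indices
        x = self.embed(tokens).mean(dim=1)   # (batch, embed_dim) sentence vector
        return torch.log_softmax(self.fc(x), dim=-1)
```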

Convolutional Neural Network

Let $x_i \in \mathbb{R}^k$ be the $k$-dimensional word vector corresponding to the $i$-th word in the input sentence. After padding all sentences in an input batch to the same length $n$, where $n$ is the length of the longest sentence in the batch, each sentence is then represented as

$$x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$$

where $\oplus$ is the concatenation operation. Let $x_{i:i+j}$ represent the concatenation of words $x_i, x_{i+1}, \ldots, x_{i+j}$. In convolutional neural networks, we apply convolution operations with a filter $w \in \mathbb{R}^{hk}$ to produce features, where the filter size $h$ is effectively the window size of words to convolve over. Let $c_i$ be a feature generated by this operation. Then

$$c_i = f(w \cdot x_{i:i+h-1} + b)$$

where $b$ is a bias term and $f$ is the rectified linear unit (ReLU) function. Applying the filter over all possible windows of the words in our input sentence produces the feature map

$$c = [c_1, c_2, \ldots, c_{n-h+1}].$$

In our implementation, we convolve over three filter sizes and then concatenate the features of each into a single vector $c$. We apply a max-over-time pooling operation (Collobert et al., 2011) to this vector of concatenated feature maps and obtain $\hat{c} = \max(c)$. We then apply dropout with $p = 0.50$ to $\hat{c}$ as a regularization measure against overfitting, pass the result into a fully-connected layer, and compute the softmax over the output.
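
A rough PyTorch sketch of this architecture. The specific filter sizes (3, 4, 5), the number of filters, and the embedding size are illustrative assumptions; the text above only states that three filter sizes are used.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Kim (2014)-style CNN: convolutions over word windows, max-over-time
    pooling, dropout, and a fully-connected output layer."""
    def __init__(self, vocab_size, embed_dim=300, num_filters=100,
                 filter_sizes=(3, 4, 5), num_classes=2, dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=h) for h in filter_sizes]
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, tokens):
        # tokens: (batch, seq_len) padded word indices
        x = self.embed(tokens).transpose(1, 2)           # (batch, embed_dim, seq_len)
        feats = [torch.relu(conv(x)).max(dim=2).values   # max-over-time pooling
                 for conv in self.convs]
        c_hat = self.dropout(torch.cat(feats, dim=1))    # concatenated pooled features
        return torch.log_softmax(self.fc(c_hat), dim=-1)
```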

Modified CNN

Finally, we also implemented a series of modifications to the CNN architecture that give a slight performance improvement on the SST-2 dataset. In this implementation, we utilize Stanford’s GloVe pre-trained vectors (Pennington et al., 2014) and make the following changes:

  • Following (Kim, 2014), we use two copies of the word embedding table during the convolution and max-pooling steps: one that is non-static, i.e. updated during training as a regular module in the model, and another that is omitted from the optimizer and kept static throughout the training run. In the forward pass of the model, these two sets of embeddings are concatenated along the “channel” dimension and then passed into the three convolutional layers as a single tensor, with two values for each of the 300 dimensions in the GloVe model.

  • After producing the combined feature vector representing the max-pooled features from the three convolutional kernels, we simply append the non-padded word count of the input as a single extra dimension, producing a 301-dimension tensor which then gets mapped to the 2-unit output. From an engineering standpoint, we find that this marginally improves performance on the SST-2 dataset, where, on average, positive sentences are slightly longer than negative ones (19.41 words versus 19.17). It is not clear whether this would hold across different datasets or whether it is specific to SST-2. (Though it is also not entirely clear that it wouldn't, and it raises an interesting corpus-linguistic question: are “positive” sentences generally longer than “negative” ones?) A sketch of both modifications appears after this list.
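
The sketch below illustrates both modifications under assumed names: two copies of the GloVe embeddings stacked along a channel dimension, and the non-padded word count appended to the max-pooled feature vector before the output layer. It is a simplified illustration, not the repository's exact module.

```python
import torch
import torch.nn as nn

class ModifiedTextCNN(nn.Module):
    def __init__(self, glove_weights, num_filters=100, filter_sizes=(3, 4, 5),
                 num_classes=2, dropout=0.5):
        super().__init__()
        embed_dim = glove_weights.size(1)  # 300 for the GloVe vectors
        # Two copies of the embedding table: non-static (trained) and static (frozen).
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.embed_static = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        # Two input channels, one per embedding copy.
        self.convs = nn.ModuleList(
            [nn.Conv2d(2, num_filters, kernel_size=(h, embed_dim)) for h in filter_sizes]
        )
        self.dropout = nn.Dropout(dropout)
        # +1 for the appended sentence-length feature (300 -> 301 dimensions).
        self.fc = nn.Linear(num_filters * len(filter_sizes) + 1, num_classes)

    def forward(self, tokens, lengths):
        # tokens: (batch, seq_len); lengths: (batch,) non-padded word counts
        x = torch.stack([self.embed(tokens), self.embed_static(tokens)], dim=1)
        feats = [torch.relu(conv(x)).squeeze(3).max(dim=2).values for conv in self.convs]
        pooled = self.dropout(torch.cat(feats, dim=1))                        # (batch, 300)
        c_hat = torch.cat([pooled, lengths.float().unsqueeze(1)], dim=1)      # (batch, 301)
        return torch.log_softmax(self.fc(c_hat), dim=-1)
```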

Experiments

In addition to the two changes described above, we also experimented with a wide range of other modifications to our models, including:

  1. Combining the CBOW model with the CNN architecture by concatenating the max-pooled CNN vectors with the averaged CBOW vector before mapping to the final output units.

  2. Replacing the GloVe embeddings with the GoogleNews embeddings (Mikolov et al., 2013). This idea came from the thought that there might be some useful domain specificity for PTB, as these embeddings were trained on news articles.

  3. Implementing “multi-channel” embeddings as described by (Kim, 2014) in the context of CNN architectures. Instead of using just a single embedding layer that is updated during training, the pre-trained weight matrix is copied into two separate embedding layers: one that is updated during training, and another that is omitted from the optimizer and left unchanged during training. During a forward pass, word indexes are mapped through each table separately, and the two tensors are then concatenated along the embedding dimension to produce a single, 600-dimension embedding tensor for each token.

  4. Experimenting with different approaches to batching. Instead of modeling the corpus as a single, unbroken sequence during training (as with torchtext’s BPTTIterator), we tried splitting the corpus into individual sentences and producing separate training cases for each token in each sentence. For example, for the sentence “I like black cats” we produce five contexts:

    a. “<SOS> I”

    b. “<SOS> I like”

    c. “<SOS> I like black”

    d. “<SOS> I like black cats”

    e. “<SOS> I like black cats <EOS>”

    The model is then trained to predict the last token in each context at time step t from the first t-1 tokens (see the sketch after this list). We used PyTorch’s pack_padded_sequence function to handle variable-length inputs to the LSTM. Practically, this was appealing because it makes it easier to engineer a wider range of features from the context before a word: for example, it becomes easy to implement bidirectional LSTMs with both a forward and a backward pass over the t-1 context, which, to our knowledge, would be difficult or impossible under the original training regime enforced by BPTTIterator. We realized after trying this, though, that it will never be competitive with BPTTIterator’s continuous representation of the corpus, because the sentences in the corpus are grouped by article, and thus also at a thematic and conceptual level. This means that the model can learn useful information across sentence boundaries about what type of word should come next.

  5. Experimenting with different regularization strategies, such as varying the dropout percentages, applying dropout to the initial embedding layers, etc.
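
The sentence-based batching described in item 4 can be sketched as follows; `make_contexts`, the toy vocabulary, and the tensor sizes are illustrative assumptions, but `pad_sequence` and `pack_padded_sequence` are the standard PyTorch utilities mentioned above.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

def make_contexts(sentence, stoi):
    """Turn one sentence into (context, target) training cases.

    For "I like black cats" this yields the prefixes "<SOS>", "<SOS> I", ...,
    "<SOS> I like black cats", each paired with the token to predict next.
    """
    ids = [stoi["<SOS>"]] + [stoi[w] for w in sentence.split()] + [stoi["<EOS>"]]
    return [(torch.tensor(ids[:t]), ids[t]) for t in range(1, len(ids))]

# Toy vocabulary and batch construction.
stoi = {w: i for i, w in enumerate(["<SOS>", "<EOS>", "I", "like", "black", "cats"])}
cases = make_contexts("I like black cats", stoi)
contexts, targets = zip(*cases)
lengths = torch.tensor([len(c) for c in contexts])

# Pad the variable-length contexts, embed them, and pack so the LSTM skips padding.
embed = nn.Embedding(len(stoi), 16)
lstm = nn.LSTM(16, 32, batch_first=True)
order = lengths.argsort(descending=True)                       # longest first for packing
padded = pad_sequence(list(contexts), batch_first=True)[order]  # (batch, max_len)
packed = pack_padded_sequence(embed(padded), lengths[order], batch_first=True)
_, (h_n, _) = lstm(packed)                                      # h_n: final hidden states
# h_n[-1] would feed a vocabulary-sized softmax to predict each context's target token.
```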

None of these changes improved on the initial single-layer, 1000-unit LSTM. Our best performing model was the one described in Section 3.4. The perplexities we achieved with each of our Section 3 models are reported in Table 1.

Table 1: Validation-set perplexities.

Model                              Perplexity
Linearly Interpolated Trigram      178.03
Neural Language Model (5-gram)     162.2
1-layer LSTM                       101.5
3-layer LSTM + skip connections    92.9
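
For reference, these perplexities correspond to the exponential of the average per-token negative log likelihood on the validation set; a minimal sketch, assuming a placeholder `model` and validation iterator, is:

```python
import math
import torch
import torch.nn as nn

def validation_perplexity(model, val_iter):
    """Perplexity = exp(total NLL / number of predicted tokens)."""
    criterion = nn.CrossEntropyLoss(reduction="sum")
    model.eval()
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for batch in val_iter:              # batch.text: inputs, batch.target: next tokens
            logits = model(batch.text)      # assumed shape: (num_tokens, vocab_size)
            total_nll += criterion(logits, batch.target.view(-1)).item()
            total_tokens += batch.target.numel()
    return math.exp(total_nll / total_tokens)
```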

Though the multi-layer LSTM with connections beat the simple LSTM baseline, we were unable to replicate the 78.4 validation perplexity reported by (Zaremba et al., 2015) using the same corpus and similar architectures. Namely, when using the configurations described in the paper (the 2-layer, 650- and 1500-unit architectures), our models overfit within 5-6 epochs, even when applying dropout in a way that matched the approach described in the paper. In contrast, (Zaremba et al., 2015) mention training for as many as 55 epochs.

Conclusion

We trained four classes of models – a traditional trigram model with linear interpolation, with weights learned by expectation maximization; a simple neural network language model following (Bengio et al., 2003); a single-layer LSTM baseline; and an extension to this model that uses three layers of different sizes, skip connections for the first two layers, and regularization as described by (Zaremba et al., 2015). The final model achieves a perplexity of 92.9, compared to 78.4 and 82.7 reported by (Zaremba et al., 2015) using roughly equivalent hyperparameters.

References

Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin. “A Neural Probabilistic Language Model.” Journal of Machine Learning Research 3, pages 1137–1155. 2003.

D. Jurafsky. “Language Modeling: Introduction to N-grams.” Lecture. Stanford University CS124. 2012.

Y. Kim. “Convolutional Neural Networks for Sentence Classification.” Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751. 2014.

T. Mikolov, K. Chen, G. Corrado, J. Dean. “Efficient estimation of word representations in vector space.” arXiv preprint arXiv:1301.3781. 2013.

J. Pennington, R. Socher, C. Manning. “GloVe: Global Vectors for Word Representation.” Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543. 2014.

W. Zaremba, I. Sutskever, O. Vinyals. “Recurrent Neural Network Regularization.” arXiv preprint arXiv:1409.2329. 2015.
