An implementation of the MDSENT architectures described in the paper MDSENT at SemEval-2016 Task 4: A Supervised System for Message Polarity Classification by Hang Gao and Tim Oates.
First, run the following script:
./fetch_and_preprocess.sh
This downloads the following data:
- TREC dataset (question classification task)
- GloVe word vectors (Common Crawl, 840B tokens) -- note: the download is around 2 GB
Alternatively, the download and preprocessing scripts can be called individually.
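The downloaded GloVe file stores one word per line, followed by the components of its vector as space-separated floats (300 of them in `glove.840B.300d.txt`). A minimal sketch of parsing one such line:

```python
# Sketch: parsing a line of a pre-trained GloVe text file. Assumes the
# standard format: a token, then its float components, space-separated.
def parse_glove_line(line):
    parts = line.rstrip().split(" ")
    word = parts[0]
    vector = [float(x) for x in parts[1:]]
    return word, vector

# Hypothetical example line, truncated to 3 dimensions for brevity;
# the real glove.840B.300d.txt uses 300 dimensions per word.
word, vec = parse_glove_line("the 0.04656 0.21318 -0.0074364\n")
```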
This implementation only supports classification tasks. To train a network, run:
python2.7 train.py --train data/trec/train --dev data/trec/dev --batch 15 --epoch 10 --save model.pkg --word_vectors data/glove/glove.840B.300d.txt --word_vocab data/trec/wvocab-cased.txt --char_vocab data/trec/cvocab-cased.txt --w_filter 3 100 --w_filter 4 100 --w_filter 5 100 --c_filter 3 100 --c_filter 4 100 --c_filter 5 100 --num_class 6
where:
train
: the location of preprocessed training data
dev
: the location of preprocessed development data
batch
: the batch size
epoch
: number of training epochs
save
: the location to save the trained model
word_vectors
: the location of pre-trained word embeddings
char_vectors
: the location of pre-trained character embeddings
word_vocab
: the location of preprocessed vocabulary for words
char_vocab
: the location of preprocessed vocabulary for characters
w_filter
: specifies a type of convolutional filter for word-based input by its height (int) and number (int)
c_filter
: specifies a type of convolutional filter for character-based input by its height (int) and number (int)
num_class
: number of classes for the classification task
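One plausible reading of the filter options (an assumption, not the authors' code): each `--w_filter h n` adds `n` convolutional filters of height `h`, and after max-over-time pooling every filter contributes a single feature, so the example command's three word-filter types of 100 filters each yield a 300-dimensional word-level representation. A small sketch of that arithmetic:

```python
# Sketch (assumed semantics): each "--w_filter h n" option adds n filters
# of height h. After max-over-time pooling, each filter contributes one
# feature, so the feature dimension is the total number of filters.
def feature_dim(filter_specs):
    """filter_specs: list of (height, num_filters) pairs."""
    return sum(n for _, n in filter_specs)

def conv_output_length(sent_len, height):
    """Positions produced by a narrow ("valid") convolution of the given
    height sliding over a sentence of sent_len tokens."""
    return sent_len - height + 1

# Filter types from the example training command: heights 3, 4, 5,
# with 100 filters each.
w_filters = [(3, 100), (4, 100), (5, 100)]
```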
To make predictions with a trained model, run:
python2.7 test.py --test data/trec/test --model model.pkg --word_vocab data/trec/wvocab-cased.txt --char_vocab data/trec/cvocab-cased.txt
where:
test
: the location of preprocessed test data
model
: the location of the trained model
word_vocab
: the location of preprocessed vocabulary for words
char_vocab
: the location of preprocessed vocabulary for characters
Note: the implementation is in Theano, so it is recommended to set floatX to float32 in the Theano flags to avoid precision problems.
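Theano reads this setting either from the `THEANO_FLAGS` environment variable at invocation time or from a `~/.theanorc` configuration file, for example:

```ini
# ~/.theanorc
[global]
floatX = float32
```

Equivalently, for a single run: `THEANO_FLAGS=floatX=float32 python2.7 train.py ...`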