R-NET in Tensorflow

This repository is a Tensorflow implementation of R-NET, a neural network designed to solve the Question Answering (QA) task.
This implementation is specifically designed for SQuAD, a large-scale dataset that has recently drawn attention in the field of QA.
If you have any questions, contact b03902012@ntu.edu.tw.

Dependencies

Python 3.6
Tensorflow-gpu 1.2.1
Numpy 1.13.1
NLTK

Usage

First we need to download SQuAD as well as the pre-trained GloVe word embeddings. This should take roughly 30 minutes, depending on network speed.

cd Data
sh download.sh
cd ..

Data preprocessing, which includes tokenization and collecting the pre-trained word embeddings, takes about 15 minutes. Two kinds of files, {data/shared}_{train/dev}.json, will be generated and stored in Data.
- shared: contains the original and tokenized articles, GloVe word embeddings and character dictionaries.
- data: contains the ID, corresponding article ID, tokenized question and the answer indices.

python preprocess.py --gen_seq True
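The "answer indices" stored in the data files are token-level positions. The following is a minimal sketch of how a character-level answer could be mapped to token indices. It is not the code in preprocess.py: the regex tokenizer stands in for NLTK, and the names tokenize_with_offsets and char_span_to_token_span are hypothetical.

```python
import re


def tokenize_with_offsets(text):
    """Split text into word/punctuation tokens with their character spans.

    Assumption: a simple regex tokenizer is used here for illustration;
    the repository itself relies on NLTK.
    """
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def char_span_to_token_span(tokens, answer_start, answer_text):
    """Return inclusive (first, last) token indices covering the answer."""
    answer_end = answer_start + len(answer_text)
    covered = [i for i, (_, start, end) in enumerate(tokens)
               if start < answer_end and end > answer_start]
    if not covered:
        raise ValueError("answer does not overlap any token")
    return covered[0], covered[-1]


context = "The quick brown fox jumps over the lazy dog."
answer_text = "brown fox"
answer_start = context.index(answer_text)

tokens = tokenize_with_offsets(context)
span = char_span_to_token_span(tokens, answer_start, answer_text)
print(span)  # (2, 3)
```

Keeping character offsets next to each token makes the mapping independent of how the tokenizer treats whitespace.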

Train R-NET by simply executing the following. The program will
1. Read the training data, and then build the model. This should take around an hour, depending on hardware.
2. Train for 12 epochs, by default.
Hyper-parameters can be specified in Models/config.json. The training log, including the mean loss and mean EM score for each epoch, will be stored in Results/rnet_training_result.txt. Note that the scores reported during training can be lower than the scores from the official evaluator. The models will be stored in Models/save/.

python rnet.py

To evaluate the model on the dev set, execute the following. The predictions will be stored in Results/rnet_prediction.txt. Note that the scores reported during evaluation can be lower than the scores from the official evaluator.

python evaluate.py

To get the final official score, you need to use the official evaluation script, which is in the Results directory.

python Results/evaluate-v1.1.py Data/dev-v1.1.json Results/rnet_prediction.txt
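For readers unfamiliar with the two metrics, here is a rough sketch of how SQuAD-style EM and F1 are commonly computed for a single prediction. It is written from the usual definitions and is only illustrative. The script in Results remains the reference for reported scores, and the helper names below are assumptions, not necessarily its API.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, squeeze whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, ground_truth):
    """True when both strings are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(ground_truth)


def f1(prediction, ground_truth):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(metric, prediction, ground_truths):
    """Score against every reference answer and keep the best value."""
    return max(metric(prediction, gt) for gt in ground_truths)


print(exact_match("The Eiffel Tower!", "eiffel tower"))  # True
print(round(f1("the cat sat", "cat sat down"), 3))       # 0.8
```

Because a question can have several reference answers, the per-question score is usually the maximum over them, which best_over_references illustrates.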

Current Results

Model	Dev EM Score	Dev F1 Score
Original Paper	71.1	79.5
My Implementation	60.1	68.9
My Implementation (w/o char emb)	57.8	67.9

You can find the current leaderboard and compare with other models.

Discussion

Reproduction

As shown above, I have not yet been able to reproduce the results of the original paper. Several technical details concern me:

Data preprocessing. I have tried two preprocessing approaches, one used in the implementation of Match-LSTM and the other in the implementation of Bi-DAF. While the latter includes a lot of reasonable processing, I chose the former because it empirically yields better performance.
Dropout has not yet been applied in my implementation. I am currently conducting experiments on this.
As pointed out in another implementation of R-NET in Keras,

The first formula in (11) of the report contains a strange summand W_v^Q V_r^Q. Both tensors are trainable and are not used anywhere else in the network. We have replaced this product with a single trainable vector.

However, instead of replacing the product with a single trainable vector, I followed the notation and still used two vectors.
Variable sharing. The notation in the original paper was very confusing to me. For example, W_v^P appears in both equations (4) and (8). In my opinion, they should not be the same, since they are multiplied by vectors from totally different spaces. As a result, I treat them as different variables.
Hyper-parameter ambiguity. Some hyper-parameters aren't specified in the original paper, including the character embedding matrix dimension, the truncation of articles and questions, and the length of the answer span during inference. I set these hyper-parameters empirically, mostly following the settings of Match-LSTM and Bi-DAF.
Any other implementation mistakes and bugs.
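To make the "length of answer span during inference" item concrete, the sketch below shows one common way such a limit is applied when decoding. It assumes the model outputs independent start and end probabilities and scores a span by their product. Neither that assumption nor the names best_span and max_answer_len come from this repository.

```python
def best_span(start_probs, end_probs, max_answer_len):
    """Return the (start, end) pair, inclusive, with the highest
    start_prob * end_prob, subject to
    start <= end < start + max_answer_len.

    max_answer_len is a hypothetical hyper-parameter name.
    """
    best = (0, 0)
    best_score = -1.0
    for i, p_start in enumerate(start_probs):
        last = min(len(end_probs), i + max_answer_len)
        for j in range(i, last):
            score = p_start * end_probs[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best


start_probs = [0.1, 0.6, 0.2, 0.1]
end_probs = [0.05, 0.15, 0.1, 0.7]

print(best_span(start_probs, end_probs, max_answer_len=2))  # (2, 3)
print(best_span(start_probs, end_probs, max_answer_len=4))  # (1, 3)
```

In this toy input the limit changes the predicted span, so an unspecified value for it can shift EM and F1 on its own.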

OOM

The full model could not be trained on an NVIDIA Tesla K40m with 12GiB of memory: Tensorflow reports out-of-memory (OOM) errors. There are a few possible solutions.

Run on CPU. This can be achieved by hiding the GPUs from the process on the command line, as follows. In fact, the results of my implementation shown in the previous section were generated by a model trained on CPU. However, training this way is extremely slow: in my experience, it takes roughly 24 hours per epoch.

CUDA_VISIBLE_DEVICES="" python rnet.py

Reduce hyperparameters. Modifying these parameters might help:
- p_length
- Word embedding dimension: change from 300d GloVe vectors to 100d.
Don't use character embeddings. To achieve this, one currently has to modify Models/models_rnet directly. I plan to make this a parameter in Models/config.json, but this feature won't be available soon. According to Bi-DAF, character embeddings don't help much. However, Bi-DAF uses 1D-CNNs to generate the character embeddings, while R-NET uses RNNs. As shown in the previous section, removing character embeddings lowered the dev EM score from 60.1 to 57.8 and the dev F1 score from 68.9 to 67.9. Further investigation is needed for this part.
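As a side note on the CPU option above, the same device mask can be set from inside Python instead of on the command line. This is only a sketch: it assumes the assignment runs before Tensorflow is imported, and rnet.py is not known to do this.

```python
import os

# Hide all CUDA devices from this process. The assignment has to happen
# before Tensorflow initializes CUDA, so it is placed ahead of the import.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# import tensorflow as tf  # import only after the variable is set

print(repr(os.environ["CUDA_VISIBLE_DEVICES"]))  # ''
```

Setting the variable on the command line, as shown earlier, has the same effect and needs no code changes.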
