Dialogue systems for language learning

What is it?

A dialogue system meant to be used for language learning.

Based on:

  1. The Google Neural Machine Translation model, which is based on Thang Luong's thesis on Neural Machine Translation

  2. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models

Created by Tudor Paraschivescu for the Cambridge UROP project "Dialogue systems for language learning". (Working paper available on request to apc38 @ cam.ac.uk)

Dependencies

Overall: tensorflow >= 1.2.1, numpy, nltk
Preprocessing: scikit-learn (for the train-test split), tqdm (for checking progress)

TensorFlow can be installed by following the TensorFlow installation instructions. A virtualenv installation with pip install is recommended; note that you need pip version >= 8.1.
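
For example, a minimal virtualenv-based setup might look like the following (the environment path and the exact TensorFlow version pin are illustrative assumptions, not prescribed by this repository):

    # create and activate an isolated environment (assumes virtualenv is installed)
    virtualenv ~/envs/dialogue-tutor
    source ~/envs/dialogue-tutor/bin/activate

    # make sure pip is recent enough (>= 8.1), then install TensorFlow
    pip install --upgrade pip
    pip install "tensorflow>=1.2.1"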

As in requirements.txt, also run: pip install nltk numpy scipy sklearn six tqdm

Data

For training a model, download the Cornell Movie-Dialogs Corpus.

Change to the Dialogue-systems-for-language-learning (root) directory, run mkdir -p data/cornell, and unzip the 'cornell movie-dialogs corpus' folder from the zip file into it.

Make sure you're in the root directory again and run the script python preprocessing/cornell/simple_pre.py. This will process the Cornell corpus and write the output to data/cornell/processed/simple. N.B. For the hierarchical model referred to in 'Training' below, I suspect you need to run preprocessing/cornell/hier.py, but this is undocumented and I haven't tried it yet.

Also make an output directory in the repository root (mkdir output).
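
Put together, the data preparation steps above might look like this (the clone location and the path to the downloaded zip are assumptions; adjust them to your setup):

    # from wherever you cloned the repository
    cd Dialogue-systems-for-language-learning
    mkdir -p data/cornell
    # extract the 'cornell movie-dialogs corpus' folder from the downloaded zip into data/cornell
    # (the zip path below is an assumption; adjust it to wherever you saved the download)
    unzip ~/Downloads/cornell_movie_dialogs_corpus.zip -d data/cornell
    # preprocess the corpus; output goes to data/cornell/processed/simple
    python preprocessing/cornell/simple_pre.py
    # output directory used later for training checkpoints and chat logs
    mkdir -p output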

Training

To begin training a simple NMT model, run 'chatbot/run.py' with the following arguments:

    --src=enc --tgt=dec \
    --vocab_file="<repo-path>/data/cornell/processed/simple/vocab"  \
    --train_prefix="<repo-path>/data/cornell/processed/simple/train" \
    --dev_prefix="<repo-path>/data/cornell/processed/simple/val"  \
    --test_prefix="<repo-path>/data/cornell/processed/simple/test" \
    --out_dir="<repo-path>/output/cornell" \
    --num_train_steps=12000 \
    --steps_per_stats=100 \
    --num_layers=2 \
    --num_units=128 \
    --dropout=0.2 \
    --metrics=bleu
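
Assembled into a single shell command (with <repo-path> replaced by the local path to your clone), the invocation looks roughly like this; a sketch of how the flags above fit together rather than a command copied verbatim from the repository:

    python chatbot/run.py \
        --src=enc --tgt=dec \
        --vocab_file="<repo-path>/data/cornell/processed/simple/vocab" \
        --train_prefix="<repo-path>/data/cornell/processed/simple/train" \
        --dev_prefix="<repo-path>/data/cornell/processed/simple/val" \
        --test_prefix="<repo-path>/data/cornell/processed/simple/test" \
        --out_dir="<repo-path>/output/cornell" \
        --num_train_steps=12000 \
        --steps_per_stats=100 \
        --num_layers=2 \
        --num_units=128 \
        --dropout=0.2 \
        --metrics=bleu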

To begin training a hierarchical model, run 'chatbot/run.py' with the following arguments:

    --src=enc --tgt=dec \
    --vocab_file="<repo-path>/data/cornell/processed/hier/vocab" \
    --train_prefix="<repo-path>/data/cornell/processed/hier/train" \
    --dev_prefix="<repo-path>/data/cornell/processed/hier/val" \
    --test_prefix="<repo-path>/data/cornell/processed/hier/test" \
    --out_dir="<repo-path>/output/cornell" \
    --num_train_steps=12000 \
    --steps_per_stats=100 \
    --num_layers=2 \
    --num_units=128 \
    --dropout=0.2 \
    --metrics=bleu \
    --architecture=hier \
    --context_num_layers=2

This will run the training for 12000 iterations. The hyperparameters used are the standard ones from the NMT guide. The following hyperparameters can be tweaked to change the model:

num_train_steps: Overall number of training steps executed before stopping.
num_units: Number of units in the hidden layer of the encoder and decoder RNNs.
num_layers: Number of layers used by the RNNs.
encoder_type: uni | bi. Default is uni. Chooses whether the encoder is unidirectional or bidirectional.
residual: Whether to add residual connections.
optimizer: sgd | adam. Chooses the optimizer used for training.
learning_rate: Default is 1.0. Should be changed to between 0.001 and 0.0001 if using adam.
start_decay_step, decay_steps, decay_factor: Hyperparameters which affect the learning rate decay.
unit_type: lstm | gru | layer_norm_lstm. Type of the RNN cell used.
forget_bias: Forget bias for the BasicLSTMCell.
src_reverse: Whether to reverse the source utterance.
num_buckets: Number of buckets into which data of similar length is grouped.
num_gpus: Number of GPUs on the machine. Default is 1.
metrics: Comma-separated list of evaluation metrics. Can be bleu,rouge,accuracy.
context_num_layers: The number of layers of the context encoder (used with --architecture=hier).

For more information, see 'chatbot/argument_parser.py', where all of the arguments are parsed.
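
For example, to switch to a bidirectional encoder, GRU cells, and the Adam optimizer with a correspondingly lower learning rate, you could add the following flags to the training command above (an illustrative combination of the options listed, not a tuned configuration):

    --encoder_type=bi \
    --unit_type=gru \
    --optimizer=adam \
    --learning_rate=0.001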

Chatting

To chat with your model, run 'chatbot/run.py' with the arguments:

    --chat=True \
    --chat_logs_output_file="<repo-path>/output/cornell/chat_logs.txt" \
    --out_dir="<repo-path>/output/cornell" \
    --architecture=hier \
    --beam_width=5 \
    --top_responses=3 \
    --number_token="<number>" \
    --name_token="<person>"
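
As with training, these flags are passed to chatbot/run.py in a single shell command, for example (a sketch, with <repo-path> as a placeholder for your local clone):

    python chatbot/run.py \
        --chat=True \
        --chat_logs_output_file="<repo-path>/output/cornell/chat_logs.txt" \
        --out_dir="<repo-path>/output/cornell" \
        --architecture=hier \
        --beam_width=5 \
        --top_responses=3 \
        --number_token="<number>" \
        --name_token="<person>"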

The hyperparameters which can be tweaked for a different experience are:

chat_logs_output_file: The output file where your chat will be recorded. If none is provided then there will be no record of the chat.
out_dir: Should always point to the output directory from the training stage.
beam_width: Number of nodes expanded in the beam search.
top_responses: Sample over this number of responses. Useful when also using beam_width.
number_token: The token used for replacing numbers. Used for postprocessing, which can be changed by tweaking the 'postprocess_output' method in 'utils/chatbot_utils'.
name_token: Similar to number_token, but for names.

Issues

  1. InvalidArgumentError: Multiple OpKernel registrations match NodeDef: a known TensorFlow bug (see the corresponding TensorFlow GitHub issue). If it persists, deactivate beam search by setting beam_width=0.
  2. Import errors: all commands should be run from the root directory of the repository.

About

Using dialogue systems to create a virtual tutor able to correct a student's English.

License: Apache License 2.0

