End-to-End Generative Dialogue

This code implements multi-layer recurrent neural network encoder-decoder models (RNN, GRU, and LSTM) for training/sampling from conversation dialogue. In other words, the model takes a series of sentences of preceding dialogue and attempts to predict a response sentence. All models and their training, evaluation, and sampling are detailed thoroughly in the paper End-to-End Generative Dialogue.

Here is a link to the paper describing this project.

Primary contributors

Kevin Yang

Michael Farrell

Colton Gyulay

Requirements

This code is written in Lua, and an installation of Torch is assumed. Training requires a few packages which can easily be installed through LuaRocks (which comes with a Torch installation). Datasets are formatted and loaded using hdf5, which can be installed using this guide.

$ luarocks install nn
$ luarocks install rnn

If you want to train on an Nvidia GPU using CUDA, you'll need to install the CUDA Toolkit as well as the cutorch and cunn packages:

$ luarocks install cutorch
$ luarocks install cunn

If you'd like to chat with your trained model, you'll need the penlight package:

$ luarocks install penlight

.
├── misc	                  # Miscellaneous files 
├── notebooks               # Notebooks used for testing
├── paper                   # Our writeup about this project
├── slide                   # Presentation slides about this project
├── src                     # All of our code is here
├──.gitignore	
├── README.md
└── current_datafiles.txt   # List of currently used datafiles

Usage

Data

Input data is stored in the data directory. At the moment, this code is only compatible with the MovieTriples dataset, as defined by Serban et al., 2015. Unzip the MovieTriples dataset and place its contents into data/MovieTriples. Note: the MovieTriples dataset is not publicly available, though training on arbitrary dialogue will be supported soon.

Preprocessing is done using Python:

$ cd src
$ python preprocess.py

A limited dataset (limited input length) can also be used for testing:

$ python preprocess.py --seqlength 5

Training

You can start training the model using train.lua.

$ th train.lua -data_file data/conv-train.hdf5 -val_data_file data/conv-val.hdf5 -save_file conv-model -gpuid 1

Here we are setting the flag gpuid to 1, which trains using the GPU. You can train on the CPU by omitting this flag or setting its value to -1. For a full list of model settings, please consult $ th train.lua -help.

Checkpoints are created after each epoch, and are saved within the src directory. Each checkpoint's name indicates the number of completed epochs of training as well as that checkpoint's perplexity, essentially a measure of how confused the checkpoint is in its predictions. The lower the number, the better the checkpoint's predictions (and output text while sampling).

Sampling

Given a checkpoint file, we can generate responses to input dialogue examples:

$ th run_beam.lua -model conv-model_epoch4.00_39.19.t7 -src_file data/dev_src_words.txt -targ_file data/dev_targ_words.txt -output_file pred.txt -src_dict data/src.dict -targ_dict data/targ.dict

Chatting

It's also possible to chat directly with a checkpoint:

$ th chat.lua -model conv-model_epoch4.00_39.19.t7 -targ_dict data/targ.dict

These models have a tendency to respond tersely and vaguely. It's a work in progress!

Advanced Usage

We have implemented support for training the model using Distributed SGD to farm our clients that will simultaneously compute gradients/update parameters, both remotely and locally.

Setup

In order to run code in parallel, you need the Distributed SGD which contains this directory (End-To-End-Generative-Dialogue) as a submodule:

$ git clone --recursive https://github.com/michaelfarrell76/Distributed-SGD.git
$ cd Distributed-SGD/lua-lua

Next make sure that you have the correct requirements installed via the Requirements section of the Distributed SGD repo.

Next setup your data folder in Distributed-SGD/lua-lua/End-To-End-Generative-Dialogue according to the instructions in the data section. Once this is setup correctly, you should have a folder Distributed-SGD/lua-lua/End-To-End-Generative-Dialogue/data/MovieTriples full with the MovieTriples dataset files.

Finally, run the preprocessing code:

$ cd Distributed-SGD/lua-lua/End-To-End-Generative-Dialogue/src
$ python preprocess.py

Running code in parallel

Local

To run a worker with 2 parallel clients on your own machine:

$ th train.lua -data_file data/conv-train.hdf5 -val_data_file data/conv-val.hdf5 -save_file conv-model -parallel -n_proc 2

localhost

You can start a worker that talks to its clients through localhost.

First follow the setup instructions under the 'Remote -localhost' header, through the 'Allow ssh connections' subheader.

Once you've successfully added your ssh to the list of authorized keys, and allowed for remote login, you can run this through localhost with the following command:

$ PATH_TO_SRC=Desktop/GoogleDrive/FinalProject/Distributed-SGD/lua-lua/End-To-End-Generative-Dialogue/src/
$ PATH_TO_TORCH=/Users/michaelfarrell/torch/install/bin/th
$ th train.lua -data_file data/conv-train.hdf5 -val_data_file data/conv-val.hdf5 -save_file conv-model -parallel -n_proc 2 -localhost -extension $PATH_TO_SRC -torch_path $PATH_TO_TORCH

Running remotely on gcloud servers

Ideally, you can have the clients farmed out to different remote computers instead of running locally.

First, zip the data folder that contains the MovieTriples dataset in the main End-To-End-Generative-Dialogue folder:

$ cd End-To-End-Generative-Dialogue
$ zip -r data.zip data

Next navigate to the 'lua-lua' folder and follow the setup instructions under the 'Remote -gcloud' header. Follow these instructions exactly as they are written through the section 'Adding ssh keys again', except in the section labeled 'Setup the disk', instead of running the command

$  source setup_image.sh

in the 'Setup the disk' step, you should run

$  source End-To-End-Generative-Dialogue/src/setup_image_End-to-End.sh

which contains the necessary changes to the setup for this task.

Once the servers are all setup and you are connected to your host server, you should navigate back to the src directory and you can run the train.lua file on remote servers:

$ cd Distributed-SGD/lua-lua/End-To-End-Generative-Dialogue/src
$ EXTENSION=Distributed-SGD/lua-lua/End-To-End-Generative-Dialogue/src/
$ TORCH_PATH=/home/michaelfarrell/torch/install/bin/th
$ th train.lua -data_file data/conv-train.hdf5 -val_data_file data/conv-val.hdf5 -save_file conv-model -parallel -remote -extension $EXTENSION -torch_path $TORCH_PATH -n_proc 12

Running the remote server: If the servers have been initialized, you will first want to connect to one of them:

$ ssh -o "StrictHostKeyChecking no" -i ~/.ssh/gcloud-sshkey $USERNAME@$IP_ADDR

Once connected, you need to again setup an ssh key as listed in the instructions: "Set up an ssh key to connect to our servers" above. Once the key is created and added to the account, then:

$ cd End-To-End-Generative-Dialogue/src
$ th train.lua -data_file data/conv-train.hdf5 -val_data_file data/conv-val.hdf5 -save_file conv-model -parallel -remote -extension End-To-End-Generative-Dialogue/src/ -torch_path /home/michaelfarrell/torch/install/bin/th -n_proc 4

Relevant links

TODO

Preprocessing (preprocess.py)

Add subTle datset cleaning to preprocessing code (and any other additional datasets we may need)
Modify preprocessing code to have longer sequences (rather than just (U_1, U_2, U_3), have (U_1, ..., U_n) for some n. With this we could try to add more memory to the model we currently have now)
Modify preprocessing code to return entire conversations (rather than fixing n, have the entire back and forth of a conversation together. This could be useful for trying to train a model more specific to our objective. This could be used for testing how the model does for a specific conversation )
Finish cleaning up file (i.e. finish factoring code. I started this but things are going to be modified when subTle is added so I never finished. It shouldn't be bad at all)

Parallel (parallel_functions.lua)

Add way to do localhost without password on server
Get working on google servers
Make sure server setup is correctly done

Acknowledgments

Our implementation utilizes code from the following:

michaelfarrell76 / End-To-End-Generative-Dialogue