ChatLearner

A chatbot implemented in TensorFlow based on the new sequence-to-sequence (seq2seq) model, with certain rules integrated.

This chatbot is built on the new seq2seq model (dynamic-RNN based) in TensorFlow 1.3 (requires 1.2.1 or later). The code largely follows the tutorial for the new NMT model (https://github.com/tensorflow/nmt).

Due to changes made to the tf.contrib.data API in TensorFlow 1.4, the existing implementation in the main branch will not work with TensorFlow 1.4. An upgrade is coming soon ...

Highlights and Specialties:

Why should you spend time checking out this repository? Here are some possible reasons:

  1. The Papaya Data Set for training the chatbot. You can easily find tons of training data online, but you cannot find any with such high quality. See the detailed description below about the data set.

  2. The concise code style and clear implementation of the new seq2seq model based on dynamic RNN (a.k.a. the new NMT model). It is customized for chatbots and much easier to understand compared with the official tutorial.

  3. Some rules are integrated to demonstrate how to combine traditional rule-based chatbots with the new deep learning models. No matter how powerful a deep learning model is, it cannot answer questions that require even simple arithmetic calculations, among many other things. The approach demonstrated here can easily be adapted to retrieve news or other online information. With the rules implemented, the chatbot can then properly answer many interesting questions. For example:

    • "What time is it now?" or "What day is it today?" or "What's the date yesterday?"
    • "Read me a story please." or "Tell me a joke." It can then present stories and jokes randomly and not being limited by the sequence length of the decoder.
    • "How much is twelve thousand three hundred four plus two hundred fifty six?" or "What is the sum of five and six?" or "How much is twelve thousand three-hundred and four divided by two-hundred-fifty-six?" or "If x=55 and y=19, how much is y - x?" or "How much do you get if you subtract eight from one hundred?" or even "If x = 99 and y = 228 / x, how much is y?"

    If you are not interested in rules, you can easily remove the lines related to knowledgebase.py and functiondata.py. A minimal sketch of this rule-first routing appears right after this list.

  4. A SOAP-based web service allows you to present the GUI in Java, while the model is trained and running in Python and TensorFlow.

  5. A simple in-graph solution to convert a string tensor to lower case in TensorFlow. It is required if you utilize the new Dataset API (tf.contrib.data.TextLineDataset) in TensorFlow to load training data from text files. One possible approach is sketched together with the data-loading example in the Training Data section below.

  6. The repository also contains a chatbot implementation based on the legacy seq2seq model. In case you are interested in that, please check the Legacy_Chatbot branch at https://github.com/bshao001/ChatLearner/tree/Legacy_Chatbot.
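
To make point 3 above concrete, here is a minimal sketch of the rule-first routing idea: intercept the questions a pure seq2seq model cannot answer (time, date, simple arithmetic) and only fall back to the neural model otherwise. The function names and patterns below are illustrative only; they are not the actual interfaces of knowledgebase.py or functiondata.py.

import datetime
import re


def try_rules(question):
    """Return a rule-based answer, or None to fall back to the seq2seq model."""
    q = question.lower()
    if "what time is it" in q:
        return datetime.datetime.now().strftime("It is %H:%M now.")
    if "what day is it" in q:
        return datetime.date.today().strftime("Today is %A, %B %d.")
    # Tiny arithmetic rule, e.g. "how much is 12 plus 34?"
    m = re.search(r"how much is (\d+)\s*(plus|minus)\s*(\d+)", q)
    if m:
        a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
        return str(a + b if op == "plus" else a - b)
    return None


def answer(question, seq2seq_infer):
    """Check the rules first; only ask the neural model when no rule matches."""
    rule_answer = try_rules(question)
    return rule_answer if rule_answer is not None else seq2seq_infer(question)

The actual rules in this repository go further (spelled-out numbers, variables such as x and y, stories, jokes), but the routing principle is the same.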

Comparison of the new and legacy seq2seq models in TensorFlow:

  1. The main advantage of the new model is speed: both training and inference are faster. Because the new model is based on dynamic RNN, a GPU (or CPU) can afford a larger batch size. If you could train the legacy model with a batch size of 64 or 128, you can train the new model with the batch size doubled (128 or 256), cutting the training time roughly in half.

  2. Bucketing can be used to speed up training for both models. However, it creates extra trouble for the old seq2seq model because of the different padding lengths: if you want your model to remember certain question-and-answer pairs, there is no way at inference time to tell the model which bucket a pair was trained in, since you only have the length of the question (see an expedient workaround in my implementation of the legacy model). The great thing about the new model is that it does not have this problem, as its padding does not introduce extra noise.

  3. Based on my limited observations, the new NMT model also appears to have slightly larger capacity: it can accommodate a larger vocabulary with a model of the same size (in terms of the number of layers and the number of units).

  4. I haven't found any disadvantages of the new model. If I had to name one, the TensorFlow team did not provide an integrated interface like seq2seq.py for the legacy model, so you have to wire the encoder and the decoder together yourself. On the other hand, implementing beam search becomes easier, and it is supported in this implementation (a minimal sketch follows this list). Beam search clearly improves the inference results, and it can also vary the responses within the same trained model, which makes the chatbot more interesting.
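
As a companion to point 4, here is a hedged, self-contained sketch of the usual beam-search decoding pattern with the TF 1.3 seq2seq API (tf.contrib.seq2seq.BeamSearchDecoder). The sizes and variable names are illustrative only, not the actual hyperparameters or code in modelcreator.py.

import tensorflow as tf

# Illustrative sizes only; not the hyperparameters used by this chatbot.
vocab_size, embed_dim, num_units = 1000, 64, 128
batch_size, beam_width, sos_id, eos_id = 2, 10, 1, 2

embedding = tf.get_variable("embedding", [vocab_size, embed_dim])
# Stand-in for the final state produced by the dynamic-RNN encoder.
encoder_state = tf.zeros([batch_size, num_units])

# Each beam needs its own copy of the encoder state.
tiled_state = tf.contrib.seq2seq.tile_batch(encoder_state, multiplier=beam_width)

decoder = tf.contrib.seq2seq.BeamSearchDecoder(
    cell=tf.contrib.rnn.GRUCell(num_units),
    embedding=embedding,
    start_tokens=tf.fill([batch_size], sos_id),
    end_token=eos_id,
    initial_state=tiled_state,
    beam_width=beam_width,
    output_layer=tf.layers.Dense(vocab_size, use_bias=False))

outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder, maximum_iterations=30)

# Shape [batch_size, max_time, beam_width]; beam 0 holds the best hypothesis,
# while the other beams provide the varied responses mentioned above.
predicted_ids = outputs.predicted_ids

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(predicted_ids).shape)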

Training Data (Papaya Data Set)

  1. The training data are composed of two sets. The first set was handcrafted: we created the samples to maintain a consistent persona for the chatbot, who can therefore be trained to be polite, patient, humorous, and philosophical, and to be aware that he is a robot while pretending to be a 9-year-old boy named Papaya. The second set was cleaned from online resources, including the scenario conversations designed for training robots and the Cornell movie dialogs.

  2. The training data set is split into three categories: two subsets are augmented during training (at different levels, i.e., repeated different numbers of times), while the third is not. The augmented subsets teach the model the rules to follow, together with some knowledge and common sense, while the third subset just helps to train the language model.

  3. The scenario conversations were extracted and reorganized from http://www.eslfast.com/robot/. If your model supports context, it would work much better by utilizing these conversations.

  4. The original Cornell data set can be found at http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html. We cleaned it using a Python script (the script can also be found in the Corpus folder); we then cleaned it manually by quickly searching certain patterns. The final data is available here: https://github.com/bshao001/ChatLearner/blob/master/Data/Corpus/Augment0/cornell_cleaned_new.txt

  5. For the Reddit data, a cleaned subset (about 110K pairs) is included in this repository. The vocab file and model parameters were created and adjusted based on all the included data files. In case you need a larger set, you can also find scripts to parse and clean the Reddit comments in the Corpus/RedditData folder. In order to use those scripts, you need to download a torrent of Reddit comments from a torrent link here. Normally a single month of comments is big enough (it can generate roughly 3M pairs of training samples). You can tune the parameters in the scripts based on your needs.

  6. The data files in this data set were already preprocessed with the NLTK tokenizer, so they are ready to be fed into the model using the new Dataset API in TensorFlow.
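
Since the files are pre-tokenized, loading them with the TF 1.3 Dataset API mostly comes down to splitting on spaces and batching. The sketch below also shows one possible version of the in-graph lower-casing trick from Highlight 5 (split into characters, map A-Z through a lookup table, re-join). The file names, batch size, and padding token are assumptions for illustration; this is not necessarily the exact pipeline used in this repository.

import tensorflow as tf

# In-graph lower-casing: a small character lookup table (A-Z -> a-z).
UPPER = [chr(c) for c in range(ord('A'), ord('Z') + 1)]
LOWER = [chr(c) for c in range(ord('a'), ord('z') + 1)]
lower_table = tf.contrib.lookup.HashTable(
    tf.contrib.lookup.KeyValueTensorInitializer(UPPER, LOWER), default_value='')


def to_lower(line):
    """Split a string tensor into characters, lower-case A-Z, and re-join."""
    chars = tf.string_split([line], delimiter='').values
    mapped = lower_table.lookup(chars)
    # Characters missing from the table map to '' -> keep the original character.
    return tf.reduce_join(tf.where(tf.equal(mapped, ''), chars, mapped))


# Hypothetical file names; the real corpus files live under Data/Corpus.
src_dataset = tf.contrib.data.TextLineDataset('questions.txt')
tgt_dataset = tf.contrib.data.TextLineDataset('answers.txt')

dataset = tf.contrib.data.Dataset.zip((src_dataset, tgt_dataset))
dataset = dataset.map(lambda q, a: (to_lower(q), to_lower(a)))
# The files are already tokenized with NLTK, so a plain space split is enough.
dataset = dataset.map(lambda q, a: (tf.string_split([q]).values,
                                    tf.string_split([a]).values))
dataset = dataset.padded_batch(
    32,
    padded_shapes=(tf.TensorShape([None]), tf.TensorShape([None])),
    padding_values=('_pad_', '_pad_'))

# Run tf.tables_initializer() and iterator.initializer before pulling batches.
iterator = dataset.make_initializable_iterator()
questions, answers = iterator.get_next()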

Before You Proceed

  1. Please make sure you have the correct TensorFlow version. The code works only with TensorFlow 1.3, not with 1.2 or lower, and not with 1.4, because the tf.contrib.data API used here was newly introduced and will change in version 1.4. I will upgrade the software when TensorFlow 1.4 is officially released. Please stay tuned.

  2. Please make sure the PYTHONPATH environment variable is set up. It needs to point to the project root directory, which contains the chatbot, Data, and webui folders. If you are running in an IDE such as PyCharm, it will set this up for you, but if you run any Python scripts from the command line, the environment variable must be set, otherwise you will get module import errors.

  3. Please make sure you are using the same vocab.txt file for both training and inference/prediction. Keep in mind that your model never sees words the way we do: it is all integers in, integers out, and the words and their order in vocab.txt provide the mapping between words and integers (see the sketch after this list).

  4. Spend a little time thinking about how big your model should be, what the maximum encoder/decoder lengths should be, how large the vocabulary should be, and how many pairs of training data you want to use. Be advised that a model has a capacity limit: how much data it can learn or remember. With a fixed number of layers, number of units, and type of RNN cell (such as GRU), and with the encoder/decoder lengths decided, it is mainly the vocabulary size that determines your model's ability to learn, not the number of training samples. If you can keep the vocabulary size from growing as you add more training data, it will probably work; but in reality, more training samples usually mean the vocabulary also grows very quickly, and you may then notice that your model cannot accommodate that amount of data at all. Feel free to open an issue to discuss this if you want.
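
As a minimal illustration of point 3 above, here is how vocab.txt can drive the word-to-integer mapping on the way in and the integer-to-word mapping on the way out, using the TF 1.3 lookup API (the same approach as the NMT tutorial). The path and the unknown-token handling here are assumptions for illustration.

import tensorflow as tf

# Assumed path; point this at wherever your vocab.txt actually lives.
vocab_file = 'Data/vocab.txt'

# words -> ids (what the model actually consumes); assume line 0 is the unknown token
word2id = tf.contrib.lookup.index_table_from_file(vocab_file, default_value=0)
# ids -> words (to turn decoder output back into text)
id2word = tf.contrib.lookup.index_to_string_table_from_file(
    vocab_file, default_value='_unk_')

tokens = tf.constant(['how', 'are', 'you'])
ids = word2id.lookup(tokens)       # integers in ...
words = id2word.lookup(ids)        # ... integers out

with tf.Session() as sess:
    sess.run(tf.tables_initializer())
    print(sess.run(ids), sess.run(words))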

Training

Other than Python 3.5.2, NumPy, and TensorFlow 1.3, you also need NLTK (Natural Language Toolkit) version 3.2.4, including its data.
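
Since the corpus was preprocessed with the NLTK tokenizer (Training Data, item 6), a quick way to confirm that NLTK and its data are installed correctly is to tokenize a sample sentence; the punkt download below fetches the tokenizer models that word_tokenize relies on.

import nltk

# One-time download of the Punkt models used by word_tokenize.
nltk.download('punkt')

# If NLTK and its data are set up correctly, this prints the token list,
# e.g. ['What', "'s", 'the', 'date', 'yesterday', '?']
print(nltk.word_tokenize("What's the date yesterday?"))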

Before starting the long training process, you may want to try my trained model. You can download it here. Unzip the .rar file, and copy the Result folder into the Data folder under your project root. A vocab.txt file is also included in case I update it without updating the trained model in the future.

During training, I strongly suggest that you try playing with the colocate_gradients_with_ops parameter of tf.gradients. You can find a line like this in modelcreator.py: gradients = tf.gradients(self.train_loss, params). Add colocate_gradients_with_ops=True and run the training for at least one epoch, noting down the time; then set it to False (or simply remove it) and run the training for at least one epoch again, and see whether the times required for one epoch differ significantly. To me, at least, the difference was shocking.
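
Concretely, the experiment amounts to toggling a single keyword argument on the line quoted above; colocate_gradients_with_ops places each gradient op on the same device as the forward op it differentiates.

# In modelcreator.py, with the existing names self.train_loss and params:

# baseline (the default, colocate_gradients_with_ops=False):
gradients = tf.gradients(self.train_loss, params)

# the experiment: colocate each gradient op with its corresponding forward op
gradients = tf.gradients(self.train_loss, params,
                         colocate_gradients_with_ops=True)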

Other than those, training is straightforward. Remember to create a folder named Result under the Data folder first. Then just run the following commands:

cd chatbot
python bottrainer.py

A good GPU is highly recommended for training, as it can be very time-consuming. Training the model with the existing parameters and the Papaya data set on a single GPU (NVIDIA GeForce GTX 1080 Ti) takes about 5 hours to reach the desired perplexity. You can modify the model parameters based on your available computing resources. If you are using a CPU-only build of TensorFlow, make sure you change num_gpus to 0 in the hparams.json file. You will find the training results under the Data/Result/ folder. Make sure the following two files exist, as both are required for testing and prediction (the .meta file is optional, as the inference graph is created independently):

  1. basic.data-00000-of-00001
  2. basic.index
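
These two files form a standard TensorFlow checkpoint with the prefix basic. If you want a quick sanity check that the checkpoint was written correctly (or to inspect the variables it contains), the standard checkpoint reader can open it; the Data/Result path below assumes the layout described above.

import tensorflow as tf

# List every variable stored in the 'basic' checkpoint under Data/Result.
reader = tf.train.NewCheckpointReader('Data/Result/basic')
for name, shape in sorted(reader.get_variable_to_shape_map().items()):
    print(name, shape)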

Testing / Inference

For testing and prediction, we provide both a simple command-line interface and a web-based interface. Note that the vocab.txt file (and, for this chatbot, the files in KnowledgeBase) is also required for inference. To quickly check how the trained model performs, use the following command-line interface:

cd chatbot
python botui.py

Wait until you get the command prompt "> ".

A demo test result is provided as well. Please check it to see how this chatbot behaves now: https://github.com/bshao001/ChatLearner/blob/master/Data/Test/responses.txt

Web Interface

A SOAP-based web service architecture is implemented, with a Python server and a Java client. A nice GUI is also included for your reference. For details, please check: https://github.com/bshao001/ChatLearner/tree/master/webui

References and Credits:

  1. The new NMT model: https://github.com/tensorflow/nmt
  2. Tornado Web Service: https://github.com/rancavil/tornado-webservices
  3. Reddit data parser: https://github.com/pender/chatbot-rnn


License: Apache License 2.0

