jann

Hi. I am jann. I am a retrieval-based chatbot. I would make a great baseline.

Allow me to (re)introduce myself

I use approximate nearest neighbor lookup, via Spotify's Annoy library (Apache License 2.0), over a distributed semantic embedding space produced by Google's Universal Sentence Encoder (code: Apache License 2.0) from TensorFlow Hub.
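
To make the idea of a lookup over an embedding space concrete, here is a toy sketch. It is not jann's code: the hand-written 3-dimensional vectors are made-up stand-ins for Universal Sentence Encoder embeddings, and the exhaustive cosine-similarity scan stands in for Annoy's approximate search. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, vectors, k=2):
    """Return the indices of the k vectors most similar to the query (exact scan)."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine_similarity(query, vectors[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings" (assumed values, for illustration only).
lines = ["hello there", "good morning", "the train is late"]
vectors = [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1], [0.0, 0.2, 0.9]]

best = nearest_neighbors([1.0, 0.2, 0.0], vectors, k=2)
print([lines[i] for i in best])
```

An exact scan like this compares the query against every stored vector, which is why an approximate index such as Annoy becomes attractive as the number of lines grows.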

Objectives

The goal of jann is to explicitly describe each step of the process of building a semantic-similarity, retrieval-based text chatbot. It is designed to use diverse text sources as input (e.g. Facebook messages, tweets, emails, movie lines, speeches, restaurant reviews, ...), so long as the data is collected in a single text file ready for processing.

Install and configure requirements

Note: jann development is tested with Python 3.8.6 on macOS 11.5.2 and Ubuntu 20.04.

To run jann on your local system or a server, you will need to perform the following installation steps.

# OSX: Install homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"

# OSX: Install wget
brew install wget

# Configure and activate virtual environment
python3.8 -m venv venv
source venv/bin/activate

python --version
# Ensure Python 3.8.x (jann development is tested with Python 3.8.6)

# Upgrade Pip
pip install --upgrade pip setuptools

# Install requirements
pip install -r requirements.txt

# Install Jann
python setup.py install

# Set environment variable for TensorFlow Hub
export TFHUB_CACHE_DIR=Jann/data/module

# Make the TFHUB_CACHE_DIR
mkdir -p ${TFHUB_CACHE_DIR}

# Download and unpack the Universal Sentence Encoder Lite model (~25 MB)
wget "https://tfhub.dev/google/universal-sentence-encoder-lite/2?tf-hub-format=compressed" -O ${TFHUB_CACHE_DIR}/module_lite.tar.gz
cd ${TFHUB_CACHE_DIR};
mkdir -p universal-sentence-encoder-lite-2 && tar -zxvf module_lite.tar.gz -C universal-sentence-encoder-lite-2;
cd -

Download Cornell Movie Dialog Corpus

Download the Cornell Movie Dialog Corpus and extract it to Jann/data/CMDC.

# Create the CMDC data subdirectory and change into it
mkdir -p Jann/data/CMDC
cd Jann/data/CMDC/

# Download the corpus
wget http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip

# Unzip the corpus and move lines and convos to the main directory
unzip cornell_movie_dialogs_corpus.zip
mv cornell\ movie-dialogs\ corpus/movie_lines.txt movie_lines.txt
mv cornell\ movie-dialogs\ corpus/movie_conversations.txt movie_conversations.txt

# Change directory back to jann's main directory
cd -

As an example, we might use the first 50 lines of movie dialogue from the Cornell Movie Dialog Corpus.

You can change the number of lines used from the corpus by editing the line export NUMLINES='50' in run_examples/run_CMDC.sh.
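
As a rough illustration of what "the first N lines of dialogue" means, the sketch below pulls the utterance text out of records shaped like those in movie_lines.txt. It assumes the corpus's " +++$+++ " field separator, with the utterance in the last field. The function name and the sample records are made up for this example, and this is not the code that run_CMDC.sh runs.

```python
from itertools import islice

SEPARATOR = " +++$+++ "  # field separator assumed from the corpus format


def first_utterances(raw_lines, num_lines=50):
    """Return the text field of the first num_lines records."""
    utterances = []
    for raw in islice(raw_lines, num_lines):
        fields = raw.rstrip("\n").split(SEPARATOR)
        utterances.append(fields[-1])  # the utterance is the last field
    return utterances


# Made-up records in the corpus's layout: line id, speaker id, movie id, name, text.
sample = [
    "L1 +++$+++ u0 +++$+++ m0 +++$+++ SPEAKER_A +++$+++ Hello there.\n",
    "L2 +++$+++ u1 +++$+++ m0 +++$+++ SPEAKER_B +++$+++ Hello to you too.\n",
    "L3 +++$+++ u0 +++$+++ m0 +++$+++ SPEAKER_A +++$+++ Shall we begin?\n",
]

print(first_utterances(sample, num_lines=2))
```

Because the function only iterates over its input, it can be handed an open file object as easily as a list, and it stops reading once it has collected num_lines records.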

Tests

pytest --cov-report=xml --cov-report=html --cov=Jann

You should see all the tests passing.

(simple) Run Basic Example

cd Jann
# make sure that the run code is runnable
chmod +x run_examples/run_CMDC.sh
# run it
./run_examples/run_CMDC.sh

(advanced) Running Model Building

jann is composed of several submodules, each of which can be run in sequence as follows:

# Ensure that the virtual environment is activated
source venv/bin/activate

# Change directory to Jann
cd Jann

# Number of trees to build in the Annoy index
export NUMTREES='100'

# Number of neighbors to return
export NUMNEIGHBORS='10'

# Input file containing the lines to embed
export INFILE="data/CMDC/all_lines_50.txt"

# Embed the lines using the encoder (Universal Sentence Encoder)
python embed_lines.py --infile=${INFILE} --verbose

# Process the embeddings and save as unique strings and numpy array
python process_embeddings.py --infile=${INFILE} --verbose

# Index the embeddings using an approximate nearest neighbor index (Annoy)
python index_embeddings.py --infile=${INFILE} --verbose --num_trees=${NUMTREES}

# Build a simple command line interaction for model testing
python interact_with_model.py --infile=${INFILE} --verbose --num_neighbors=${NUMNEIGHBORS}
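
The "unique strings" step above matters because Annoy identifies items by integer id, so the position of each stored string has to match the id of its vector. The sketch below shows one way to deduplicate lines while keeping strings and vectors aligned. It is an assumption about the general approach, not the contents of process_embeddings.py, and the function name is hypothetical.

```python
def unique_with_vectors(lines, vectors):
    """Keep the first occurrence of each line together with its vector."""
    seen = set()
    unique_lines, unique_vectors = [], []
    for line, vector in zip(lines, vectors):
        if line in seen:
            continue  # duplicate text: skip it and its vector
        seen.add(line)
        unique_lines.append(line)
        unique_vectors.append(vector)
    return unique_lines, unique_vectors


# Toy data: the third line repeats the first.
lines = ["Hi.", "How are you?", "Hi."]
vectors = [[0.1, 0.2], [0.3, 0.4], [0.1, 0.2]]

unique_lines, unique_vectors = unique_with_vectors(lines, vectors)
print(unique_lines)
```

After this step, index i in unique_lines and index i in unique_vectors refer to the same utterance, which is the property a lookup from neighbor id back to text relies on.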

Interaction

For interaction with the model, the only files needed are the unique strings (_unique_strings.csv) and the Annoy index (.ann) file.

With the unique strings and the index file you can build a basic interaction.

This is demonstrated in the interact_with_model.py file.

Pairs

Conversational dialogue is composed of sequences of utterances. The sequence can be seen as pairs of utterances: inputs and responses.

A nearest-neighbour search returns stored lines that are semantically related to a given input. By storing input<>response pairs, rather than only inputs, jann can look up the stored inputs most similar to a new input and reply with their paired responses. An example is shown in run_examples/run_CMDC_pairs.sh.
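
A minimal sketch of the pairs idea follows. The similarity function is passed in as a parameter, and a crude token-overlap score stands in for embedding similarity so that the example runs without any model. All names and the sample pairs are hypothetical, and this is not the logic in run_CMDC_pairs.sh.

```python
def respond(query, pairs, similarity):
    """Return the stored response whose input is most similar to the query."""
    best_input, best_response = max(
        pairs, key=lambda pair: similarity(query, pair[0])
    )
    return best_response


def token_overlap(a, b):
    """Jaccard overlap of lowercase tokens; a stand-in for embedding similarity."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Made-up (input, response) pairs.
pairs = [
    ("how are you today", "Doing fine, thanks."),
    ("what time is it", "Almost noon."),
]

print(respond("how are you", pairs, token_overlap))
```

The search runs over the inputs only; the response is just carried along with the matching input, which is what lets the bot answer rather than echo a similar sentence.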

Run Web Server

jann is designed to run as a web service to be queried by a dialogue interface builder. For instance, jann is natively configured to be compatible with Dialogflow Webhook Service. The web service runs using the Flask micro-framework and uses the performance-oriented gunicorn application server to launch the application with 4 workers.

cd Jann

# run the pairs set up and test the interaction
./run_examples/run_CMDC_pairs.sh

# pairs set up will write files needed for web server deployment
# default data_key is all_lines_0

# start development server
python app.py

# or serve the pairs model with gunicorn and 4 workers
gunicorn --bind 0.0.0.0:8000 app:JANN -w 4
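
To show the shape of the webhook exchange without starting a server, here is a framework-free sketch of a handler. It uses the request field queryResult.queryText and the response field fulfillmentText that appear in the curl example in this README. The function names are hypothetical, the echo model is a placeholder for the real lookup, and this is not the code in app.py.

```python
def handle_webhook(payload, respond):
    """Map a Dialogflow-style request body to a fulfillment response body."""
    query_text = payload.get("queryResult", {}).get("queryText", "")
    return {"fulfillmentText": respond(query_text)}


def echo_model(text):
    """Placeholder model: a real deployment would query the index here."""
    return "You said: " + text


request_body = {"queryResult": {"queryText": "hello"}}
print(handle_webhook(request_body, echo_model))
```

Keeping the handler a plain function of a dict makes it easy to unit test, and a Flask route would then only need to parse the JSON body and serialize the returned dict.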

Monitoring

A monitoring dashboard is helpful for tracking statistics on the bot. Flask-MonitoringDashboard is already installed as part of Jann; see Jann/app.py.

To view the dashboard, navigate to http://0.0.0.0:8000/dashboard. The default user/pass is: admin / admin.

Load / Lag Testing with Locust

Once jann is running, in a new terminal window you can test the load on the server with Locust, as defined in Jann/tests/locustfile.py:

source venv/bin/activate
cd Jann/tests
locust --host=http://0.0.0.0:8000

You can then navigate a web browser to http://0.0.0.0:8089/, and simulate N users spawning at M users per second and making requests to jann.

Testing the model by hand

curl --header "Content-Type: application/json" \
  --request POST \
  --data '{"queryResult": {"queryText": "that sounds really depressing"}}' \
  http://0.0.0.0:8000/model_inference

Response:

{"fulfillmentText":"Oh, come on, man. Tell me you wouldn't love it!"}
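
The same request can be made from Python. The sketch below uses only the standard library and mirrors the curl call above: same endpoint, header, and JSON body. The function names are hypothetical. The final lines only build the request; calling query_jann requires a running jann server.

```python
import json
import urllib.request


def build_inference_request(query_text, base_url="http://0.0.0.0:8000"):
    """Build the POST request that the curl example sends."""
    body = json.dumps({"queryResult": {"queryText": query_text}}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/model_inference",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query_jann(query_text, base_url="http://0.0.0.0:8000"):
    """Send the request and return the reply text (needs a running server)."""
    request = build_inference_request(query_text, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["fulfillmentText"]


request = build_inference_request("that sounds really depressing")
print(request.full_url, request.get_method())
```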

Custom Datasets

You can use any dataset you want! Format your source text with a single entry on each line, as follows:

# data/custom_data/example.txt
This is the first line.
This is the second line, a response to the first line.
This is the third line.
This is the fourth line, a response to the third line.
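
Read literally, the example above pairs line 1 with line 2 and line 3 with line 4. The sketch below turns text in that layout into (input, response) pairs under that assumption. It is one possible reading of the format rather than jann's own loader, and the function name is hypothetical.

```python
def lines_to_pairs(text):
    """Pair consecutive entries: (line 1, line 2), (line 3, line 4), ..."""
    entries = [line.strip() for line in text.splitlines()]
    entries = [entry for entry in entries if entry]  # drop blank lines
    # zip stops at the shorter slice, so a trailing unpaired line is ignored.
    return list(zip(entries[0::2], entries[1::2]))


example = """This is the first line.
This is the second line, a response to the first line.
This is the third line.
This is the fourth line, a response to the third line.
"""

for prompt, reply in lines_to_pairs(example):
    print(prompt, "->", reply)
```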

Using other Universal Sentence Encoder embedding modules

There is a collection of Universal Sentence Encoders trained on a variety of data.

Note from TensorFlow Hub: The module performs best effort text input preprocessing, therefore it is not required to preprocess the data before applying the module.

# Standard Model (914 MB)
wget 'https://tfhub.dev/google/universal-sentence-encoder/4?tf-hub-format=compressed' -O module_standard.tar.gz
mkdir -p universal-sentence-encoder && tar -zxvf module_standard.tar.gz -C universal-sentence-encoder

Annoy parameters

Annoy has two parameters for tuning the approximate nearest neighbour search:

Set n_trees as large as possible given the amount of memory you can afford. It is fixed when the index is built: more trees give more accurate results but a larger index.
Set search_k as large as possible given the time constraints you have for the queries. It is chosen at query time and is a tradeoff between accuracy and speed.
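
For a sense of scale: Annoy's documentation states that when search_k is not given, it defaults to n * n_trees, where n is the number of neighbors requested. The helper below just encodes that rule. The function name is hypothetical, and the values are the NUMNEIGHBORS and NUMTREES settings used earlier in this README.

```python
def default_search_k(num_neighbors, num_trees):
    """Annoy's documented default for search_k when none is supplied."""
    return num_neighbors * num_trees


# NUMNEIGHBORS='10' and NUMTREES='100' from the model building steps.
print(default_search_k(num_neighbors=10, num_trees=100))  # 1000
```

Passing a larger search_k than this default asks Annoy to inspect more candidates per query, which tends to improve accuracy at the cost of latency.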

Run details for GCP serving using nginx and uwsgi

You will need to configure your server with the necessary software:

sudo apt update
sudo apt -y upgrade
sudo apt install unzip python3-pip python3-dev python3-venv build-essential libssl-dev libffi-dev python3-setuptools
sudo apt-get install nginx
git clone https://github.com/korymath/jann
# and follow the installation and configuration steps above
sudo /etc/init.d/nginx start    # start nginx

Then, you can reference a more in-depth guide here. And here is a walkthrough on how to configure nginx on GCP.

You will need the uwsgi_params file, which is available in the nginx directory of the uWSGI distribution, or from the nginx GitHub repository.

uwsgi_param  QUERY_STRING       $query_string;
uwsgi_param  REQUEST_METHOD     $request_method;
uwsgi_param  CONTENT_TYPE       $content_type;
uwsgi_param  CONTENT_LENGTH     $content_length;

uwsgi_param  REQUEST_URI        $request_uri;
uwsgi_param  PATH_INFO          $document_uri;
uwsgi_param  DOCUMENT_ROOT      $document_root;
uwsgi_param  SERVER_PROTOCOL    $server_protocol;
uwsgi_param  REQUEST_SCHEME     $scheme;
uwsgi_param  HTTPS              $https if_not_empty;

uwsgi_param  REMOTE_ADDR        $remote_addr;
uwsgi_param  REMOTE_PORT        $remote_port;
uwsgi_param  SERVER_PORT        $server_port;
uwsgi_param  SERVER_NAME        $server_name;

Copy it into your project directory (e.g. /home/${USER}/jann/uwsgi_params). In a moment we will tell nginx to refer to it.

We will serve our application over HTTP on port 80, so we need to enable it:

sudo ufw allow 'Nginx HTTP'

This will allow HTTP traffic on port 80, the default HTTP port.

We can check the rule has been applied with:

sudo ufw status

# Status: active
# To                         Action      From
# --                         ------      ----
# Nginx HTTP                 ALLOW       Anywhere                  
# Nginx HTTP (v6)            ALLOW       Anywhere (v6)

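Make a systemd unit file (for example /etc/systemd/system/jann.service, so that the systemctl restart jann command below can find the service):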

[Unit]
Description=JANN as a well served Flask application.
After=network.target
[Service]
User=korymath
Group=www-data
WorkingDirectory=/home/korymath/jann/Jann
Environment="PATH=/home/korymath/jann/venv/bin"
ExecStart=/home/korymath/jann/venv/bin/uwsgi --ini wsgi.ini
[Install]
WantedBy=multi-user.target

Then, copy the following into a file on your server, named: /etc/nginx/sites-available/JANN.conf

# JANN.conf
server {
    listen      80;
    server_name 35.209.230.155;
    location / {
        include     /home/korymath/jann/uwsgi_params;
        uwsgi_pass unix:/home/korymath/jann/Jann/jann.sock;
    }
}

Then, we tell nginx how to refer to the server:

# link the site configuration to nginx enabled sites 
sudo ln -s /etc/nginx/sites-available/JANN.conf /etc/nginx/sites-enabled/
# restart nginx
sudo systemctl restart nginx
# restart jann
sudo systemctl restart jann

Common Errors/Warnings and Solutions

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/importlib/_bootstrap.py:205: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
  return f(*args, **kwds)

Solution (for OSX 10.13):

pip install --ignore-installed --upgrade https://github.com/lakshayg/tensorflow-build/releases/download/tf1.9.0-macos-py27-py36/tensorflow-1.9.0-cp36-cp36m-macosx_10_13_x86_64.whl

FileNotFoundError

FileNotFoundError: [Errno 2] No such file or directory: 'data/CMDC/movie_lines.txt'

Solution:

Ensure that the input movie lines file is extracted to the correct path (data/CMDC/movie_lines.txt, relative to the Jann directory).

ValueError

ValueError: Signature 'spm_path' is missing from meta graph.

Solution:

Currently jann is configured to use the universal-sentence-encoder-lite module from TFHub as it is small, lightweight, and ready for rapid deployment. This module depends on the SentencePiece library and the SentencePiece model published with the module.

You will need to make some minor code adjustments to use the heavier modules (such as universal-sentence-encoder and universal-sentence-encoder-large).

Start Contributing

The guide for contributors can be found here. It covers everything you need to know to start contributing to jann.

Credits

jann is made with love by Kory Mathewson.

Icon made by Freepik from www.flaticon.com is licensed by CC 3.0 BY.
