zoe

Chatbot experiments

This is a retrieval-based chatbot model. It is built on word embedding vectors (word2vec) combined with heuristics. Given a user query, the model predicts the matching question from a predefined list. The task is similar to search: documents are ranked by their similarity to the user query. Main idea: compute the average vector of all words in each sentence/document and compare these vectors using cosine similarity.
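The main idea above can be sketched in a few lines. This is a minimal illustration, not the code from this repository: the names `TOY_VECTORS`, `average_vector`, `cosine_similarity` and `sentence_similarity` are hypothetical, and the 3-dimensional vectors are made up so that the example runs without a trained word2vec model.

```python
import math

# Made-up 3-dimensional "embeddings" for illustration only.
# In practice these would come from a trained word2vec model.
TOY_VECTORS = {
    "reset": [0.9, 0.1, 0.0],
    "change": [0.8, 0.2, 0.1],
    "password": [0.1, 0.9, 0.0],
    "weather": [0.0, 0.1, 0.9],
}


def average_vector(sentence, vectors):
    """Average the vectors of all known words; None if no word is known."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sentence_similarity(s1, s2, vectors=TOY_VECTORS):
    """Similarity of two sentences via their averaged word vectors."""
    v1 = average_vector(s1, vectors)
    v2 = average_vector(s2, vectors)
    if v1 is None or v2 is None:
        return 0.0  # assumption: unknown-only sentences match nothing
    return cosine_similarity(v1, v2)


close = sentence_similarity("reset password", "change password")
far = sentence_similarity("reset password", "weather")
```

With these toy vectors, `close` is much larger than `far`, which is the behaviour the averaging approach relies on.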

Assumptions:

the model works with short sentences only
a user query can contain arbitrary (possibly invalid) text

Input data:

chat history to train/tune the model
labeled data (pairs "user query - correct question") to train/tune model
domain-specific dictionaries - acronyms, terms etc. (optional)

Files:

zoe.py: entry point
nl_processor.py: input data parser/cleaner
predict_question_model.py: model that predicts questions using the given sentence similarity metric
sentences_similarity_metric.py: computes a sentence similarity metric based on gensim word vectors

Main steps:

load cleaned/labeled data
call model.fit(); the model stores the list of questions and finds the similarity threshold that best separates correct questions from incorrect ones for a given query
call model.predict() to get the predicted question (if any) for a given query
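The steps above can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: the class `ThresholdQuestionModel` and its method signatures are hypothetical, the threshold is chosen by a simple grid search over labeled pairs, and a token-overlap (Jaccard) metric stands in for the word2vec-based similarity so that the example is self-contained.

```python
def jaccard_similarity(a, b):
    """Stand-in metric (token overlap), NOT the word2vec metric."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


class ThresholdQuestionModel:
    """Hypothetical model: nearest question plus a tuned threshold."""

    def __init__(self, similarity):
        self.similarity = similarity
        self.questions = []
        self.threshold = 0.0

    def _best_match(self, query):
        # Score every stored question and keep the highest-scoring one.
        scored = [(self.similarity(query, q), q) for q in self.questions]
        if not scored:
            return 0.0, None
        return max(scored, key=lambda pair: pair[0])

    def fit(self, questions, labeled_pairs):
        """Store questions; pick the threshold with the best accuracy.

        labeled_pairs holds (query, correct question or None) tuples.
        """
        self.questions = list(questions)
        best_accuracy = -1.0
        for step in range(0, 101, 5):  # candidates 0.00, 0.05, ..., 1.00
            candidate = step / 100
            correct = 0
            for query, expected in labeled_pairs:
                score, question = self._best_match(query)
                predicted = question if score >= candidate else None
                correct += predicted == expected
            accuracy = correct / len(labeled_pairs) if labeled_pairs else 0.0
            if accuracy > best_accuracy:  # ties keep the lowest threshold
                best_accuracy, self.threshold = accuracy, candidate
        return self

    def predict(self, query):
        """Return the closest question, or None if below the threshold."""
        score, question = self._best_match(query)
        return question if score >= self.threshold else None


questions = ["how do i reset my password", "what are your opening hours"]
labeled = [
    ("reset my password", "how do i reset my password"),
    ("opening hours today", "what are your opening hours"),
    ("tell me a joke", None),  # no predefined question applies
]
model = ThresholdQuestionModel(jaccard_similarity).fit(questions, labeled)
answer = model.predict("reset password please")
no_answer = model.predict("tell me a joke")
```

Passing the similarity function into the constructor mirrors the idea of a model that works with "the given" metric, so the stand-in can be swapped for a word-vector metric without changing the fit/predict logic.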

Pros:

recognises semantically close words/sentences
fast and simple
interpretable

Cons:

works poorly with acronyms, domain-specific terms and proper names (persons' names, locations etc.)
- can be fixed by adding dictionaries and hooks (easy but not the best solution)
- or by pretraining on domain-specific wiki data (requires a lot of data and effort)
works poorly with long sentences
can't recognize negation or sarcasm
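The dictionary-based fix for acronyms mentioned above could look roughly like this. It is only a sketch: the `ACRONYMS` dictionary and the `expand_acronyms` hook are hypothetical examples, not part of this repository. The hook rewrites a query before it reaches the similarity metric, so that words missing from the embedding vocabulary are replaced by phrases that are likely present.

```python
import re

# Hypothetical domain dictionary: acronym -> expansion.
ACRONYMS = {
    "pto": "paid time off",
    "vpn": "virtual private network",
}


def expand_acronyms(text, acronyms=ACRONYMS):
    """Replace known acronyms (case-insensitive) with their expansions."""

    def replace(match):
        token = match.group(0)
        # Unknown tokens are returned unchanged.
        return acronyms.get(token.lower(), token)

    # Match alphabetic tokens only, so punctuation is left untouched.
    return re.sub(r"[A-Za-z]+", replace, text)


expanded = expand_acronyms("How do I request PTO?")
```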

TODOs:

support languages other than English
compare speed and accuracy against other word2vec-based methods:
- gensim.models.KeyedVectors.wmdistance()
- gensim.models.KeyedVectors.n_similarity()
- smooth inverse frequency (SIF) weighting
- weighted average of w2v vectors (e.g. tf-idf weights) etc.
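As an illustration of the last item, a tf-idf-weighted average could be sketched like this. The function names and the toy 2-dimensional vectors are hypothetical, and the smoothed IDF formula is one common variant, chosen here as an assumption. Rare words receive larger weights, so frequent words such as "the" pull the sentence vector less than they would in a plain average.

```python
import math
from collections import Counter


def inverse_document_frequency(documents):
    """Smoothed IDF per word: log((1 + N) / (1 + df)) + 1."""
    n = len(documents)
    df = Counter(w for doc in documents for w in set(doc.lower().split()))
    return {w: math.log((1 + n) / (1 + count)) + 1.0 for w, count in df.items()}


def weighted_average_vector(sentence, vectors, idf):
    """IDF-weighted average of word vectors; None if no word is known.

    Repeated words are counted once per occurrence, so the effective
    weight of a word is its term frequency times its IDF.
    """
    words = [w for w in sentence.lower().split() if w in vectors]
    if not words:
        return None
    # Assumption: words unseen in the IDF corpus get a neutral weight of 1.0.
    weights = [idf.get(w, 1.0) for w in words]
    total = sum(weights)
    dim = len(vectors[words[0]])
    return [
        sum(wt * vectors[w][i] for w, wt in zip(words, weights)) / total
        for i in range(dim)
    ]


# Made-up corpus and 2-dimensional vectors for illustration only.
corpus = ["the password", "the weather", "the hours"]
toy_vectors = {"the": [1.0, 0.0], "password": [0.0, 1.0]}
idf = inverse_document_frequency(corpus)
vec = weighted_average_vector("the password", toy_vectors, idf)
```

Here "the" appears in every document and "password" in only one, so the resulting vector lies closer to the vector of "password" than a plain average would.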
