lankastersky / zoe

Chatbot experiments

This is a retrieval-based chatbot model. It is built on word embedding vectors (word2vec) plus a heuristic. Given a user query, the model predicts the matching question from a predefined list. The task resembles search: a set of documents is ranked by similarity to the user query. The main idea is to average the vectors of all words in each sentence/document and then compare the averaged vectors using cosine similarity (via).
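The averaging and cosine similarity idea can be sketched in a few lines. This is only an illustration, not the code from sentences_similarity_metric.py: the tiny `TOY_VECTORS` table and the function names are made up here, and a real setup would look words up in pretrained word2vec vectors (for example via gensim) instead.

```python
import math

# Toy 3-dimensional "embeddings" (an assumption for illustration only);
# real word2vec vectors typically have hundreds of dimensions.
TOY_VECTORS = {
    "reset":    [0.9, 0.1, 0.0],
    "change":   [0.8, 0.2, 0.1],
    "password": [0.1, 0.9, 0.0],
    "weather":  [0.0, 0.1, 0.9],
    "today":    [0.1, 0.0, 0.8],
}

def average_vector(tokens, vectors):
    """Average the vectors of all known tokens; return None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def sentence_similarity(s1, s2, vectors=TOY_VECTORS):
    """Similarity of two sentences via their averaged word vectors."""
    v1 = average_vector(s1.lower().split(), vectors)
    v2 = average_vector(s2.lower().split(), vectors)
    if v1 is None or v2 is None:
        return 0.0  # nothing recognisable in at least one sentence
    return cosine_similarity(v1, v2)
```

With these toy vectors, `sentence_similarity("reset password", "change password")` scores much higher than `sentence_similarity("reset password", "weather today")`, which is the behaviour the model relies on.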

Assumptions:

  • the model works with short sentences only
  • a user query can contain arbitrary, possibly invalid, text
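Because a query may contain anything, the input has to be cleaned defensively before words are looked up. The sketch below is a hypothetical tokenizer, not the actual logic of nl_processor.py; the function name and the choice to keep only ASCII letters are assumptions made for the example.

```python
import re

def tokenize(query):
    """Lowercase the input and keep only alphabetic tokens.

    Non-string input, digits, and punctuation yield no tokens instead of
    raising. Note that "can't" is split into "can" and "t" by this rule.
    """
    if not isinstance(query, str):
        return []
    return re.findall(r"[a-z]+", query.lower())
```

A query that produces an empty token list can then be rejected early, before any similarity is computed.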

Input data:

  • chat history to train/tune the model
  • labeled data (pairs of "user query - correct question") to train/tune the model
  • domain-specific dictionaries: acronyms, terms, etc. (optional)

Files:

  • zoe.py: entry point
  • nl_processor.py: input data parser/cleaner
  • predict_question_model.py: model that predicts questions using a given sentence similarity metric
  • sentences_similarity_metric.py: computes the sentence similarity metric from gensim word vectors

Main steps:

  • load cleaned/labeled data
  • call model.fit(); the model stores the list of questions and finds the best similarity threshold for detecting the correct question for a given query
  • call model.predict() to get the predicted question (if any) for the given query
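The fit/predict steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from predict_question_model.py: the class name, the `(query, correct_question or None)` pair format, and the threshold search (try every observed best-match score and keep the one that gets the most labeled pairs right) are all hypothetical. A simple word-overlap score stands in for the word-vector similarity so that the example runs without any model files.

```python
def word_overlap(a, b):
    """Stand-in similarity (Jaccard overlap of words), not the word2vec metric."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

class ThresholdQuestionModel:
    """Hypothetical retrieval model: best match plus a tuned similarity threshold."""

    def __init__(self, similarity):
        self.similarity = similarity  # callable(str, str) -> float
        self.questions = []
        self.threshold = 0.0

    def _best_match(self, query):
        """Return (score, question) for the most similar stored question."""
        scored = [(self.similarity(query, q), q) for q in self.questions]
        return max(scored, key=lambda pair: pair[0])

    def fit(self, questions, labeled_pairs):
        """Store the questions and pick the threshold with the most correct labels.

        labeled_pairs holds (query, correct_question) tuples; use None as the
        correct question when no stored question should be returned.
        """
        self.questions = list(questions)
        matches = [(self._best_match(q), expected) for q, expected in labeled_pairs]
        best_correct = -1
        for threshold in sorted({score for (score, _), _ in matches}):
            correct = 0
            for (score, question), expected in matches:
                predicted = question if score >= threshold else None
                correct += int(predicted == expected)
            if correct > best_correct:
                best_correct, self.threshold = correct, threshold
        return self

    def predict(self, query):
        """Return the best-matching question, or None if it is below the threshold."""
        if not self.questions:
            return None
        score, question = self._best_match(query)
        return question if score >= self.threshold else None

# Usage with made-up data:
questions = ["how do i reset my password", "what is the weather today"]
labeled = [
    ("reset my password", questions[0]),
    ("weather today", questions[1]),
    ("tell me a joke", None),
    ("what is my name", None),
]
model = ThresholdQuestionModel(word_overlap).fit(questions, labeled)
```

Returning `None` below the threshold is what lets the model answer "no matching question" for off-topic queries rather than always returning the nearest one.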

Pros:

  • recognises semantically close words/sentences
  • fast and simple
  • interpretable

Cons:

  • works poorly with acronyms, domain-specific terms, and proper names (people's names, locations, etc.)
    • can be fixed by adding dictionaries and hooks (easy, but not the best solution)
    • or by pretraining on domain-specific wiki data (requires a lot of data and effort)
  • works poorly with long sentences
  • cannot recognize negation or sarcasm
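The dictionary-based fix mentioned in the list above could look roughly like this: acronyms are expanded into ordinary words before the word vectors are looked up. The dictionary entries and the function name are invented for illustration; the repository's own dictionaries and hooks may be organised differently.

```python
# Hypothetical domain dictionary; the entries are made up for illustration.
ACRONYMS = {
    "pto": "paid time off",
    "vpn": "virtual private network",
}

def expand_acronyms(tokens, dictionary=ACRONYMS):
    """Replace each known acronym with the words of its expansion.

    Lookup is case-insensitive; unknown tokens are passed through unchanged.
    """
    expanded = []
    for token in tokens:
        expanded.extend(dictionary.get(token.lower(), token).split())
    return expanded
```

For example, `expand_acronyms(["how", "to", "request", "PTO"])` yields tokens that a general-purpose embedding model is far more likely to know than the bare acronym.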

TODOs:
