
Context Based Question Answering

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Context-Based Question Answering is an easy-to-use Extractive QA search engine, which extracts answers to questions based on the provided context.


Introduction


Question answering (QA) is a computer science discipline within the fields of information retrieval and natural language processing (NLP), which is concerned with building systems that automatically answer questions posed by humans in a natural language.

Source: Wikipedia

Extractive QA is a popular task for natural language processing (NLP) research, where models must extract a short snippet from a document in order to answer a natural language question.

Source: Facebook AI

Context-Based Question Answering (CBQA) is a web-based Extractive QA search engine for inference, built mainly on the Haystack and Transformers libraries.

The CBQA application allows the user to add context and perform Question Answering (QA) on that context.

This application's main components are built on the following Haystack core components (a minimal sketch of how they fit together follows the list):

  • FileConverter: Extracts pure text from files (pdf, docx, pptx, html and many more).
  • PreProcessor: Cleans and splits texts into smaller chunks.
  • DocumentStore: Database storing the documents, metadata and vectors for our search. We recommend Elasticsearch or FAISS, but there are also more lightweight options for fast prototyping (SQL or In-Memory).
  • Retriever: Fast algorithms that identify candidate documents for a given query from a large collection of documents. Retrievers narrow down the search space significantly and are therefore key for scalable QA. Haystack supports sparse methods (TF-IDF, BM25, custom Elasticsearch queries) and state-of-the-art dense methods (e.g. sentence-transformers and Dense Passage Retrieval).
  • Reader: Neural network (e.g. BERT or RoBERTa) that reads through texts in detail to find an answer. The Reader takes multiple passages of text as input and returns top-n answers. Models are trained via FARM or Transformers on SQuAD-like tasks. You can just load a pretrained model from Hugging Face's model hub or fine-tune it on your own domain data.

Source: Haystack's Key Components docs
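
For orientation, the sketch below shows how these components typically fit together. It assumes the Haystack 0.7.x API used by this project; module paths, parameter names, and the example model name are assumptions and may differ in other versions.

# A minimal sketch of wiring Haystack's components into an extractive QA
# pipeline. Module paths and parameters are assumptions based on Haystack
# 0.7.x; the model name is only an example.
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
from haystack.retriever.sparse import ElasticsearchRetriever
from haystack.reader.transformers import TransformersReader
from haystack.pipeline import ExtractiveQAPipeline

# Connect to a running Elasticsearch server (see Installation below).
document_store = ElasticsearchDocumentStore(host="localhost", port=9200, index="document")

# The sparse retriever narrows the search space; the reader extracts answers.
retriever = ElasticsearchRetriever(document_store=document_store)
reader = TransformersReader(model_name_or_path="deepset/roberta-base-squad2")

pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)
prediction = pipeline.run(query="What is extractive QA?", top_k_retriever=10, top_k_reader=5)
for answer in prediction["answers"]:
    print(answer["answer"], "|", answer["context"])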

In CBQA, context can be added in the following formats:

  • Textual Context (Using the TextBox field)
  • File Uploads (.pdf, .txt, and .docx)

The context is uploaded to a per-user temporary directory for pre-processing and deleted after being indexed into Elasticsearch, which is the only DocumentStore type used in this system.

Each user has a separate Elasticsearch index to store the context documents. Text extraction and pre-processing of the context files are handled by Haystack's FileConverter and PreProcessor modules.
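
As a sketch of this indexing flow, the snippet below converts an uploaded PDF, pre-processes it, and writes the chunks to a per-user index. The Haystack 0.7.x API is assumed, and the index name and file path are hypothetical examples.

# Illustrative indexing flow (Haystack 0.7.x API assumed; method names
# may differ between versions).
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
from haystack.file_converter.pdf import PDFToTextConverter
from haystack.preprocessor.preprocessor import PreProcessor

# Hypothetical per-user index name.
document_store = ElasticsearchDocumentStore(host="localhost", port=9200, index="user_42")

# Extract raw text from the uploaded file in the user's temporary directory
# (hypothetical path).
converter = PDFToTextConverter()
doc = converter.convert(file_path="tmp/user_42/context.pdf", meta={"name": "context.pdf"})

# Clean the text and split it into smaller chunks for retrieval.
preprocessor = PreProcessor(split_by="word", split_length=200)
docs = preprocessor.process(doc)

document_store.write_documents(docs)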

The ElasticsearchRetriever is used in CBQA to retrieve relevant documents for the search query.

Transformers-based Readers are used to extract answers from the retrieved documents. The Readers used in CBQA are pre-trained Transformer models hosted on Hugging Face's model hub.

Currently, CBQA provides an interface to perform QA in English and French, using four Transformer-based models:

  • BERT
  • RoBERTa
  • DistilBERT
  • CamemBERT

The interface provides an option to choose the inference device between CPU and GPU.
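
To illustrate what that option controls, here is how an inference device is typically selected when loading a question-answering pipeline with the Transformers library directly (the model name is an example, not necessarily one of CBQA's models):

# Illustrative device selection with the Transformers library.
from transformers import pipeline

device = -1  # -1 runs inference on the CPU; 0 selects the first GPU
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad",  # example model
              device=device)

result = qa(question="What is extractive QA?",
            context="Extractive QA models extract a short snippet from a document.")
print(result["answer"], result["score"])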

The output is presented in a tabular form with the following columns (a sample answer shape is sketched after the list):

  • Answers (extracted answers based on the question and context)
  • Context (the context window related to the answer)
  • Document Title (the title of the context file related to the answer)
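
For reference, each row of that table is populated from fields like the following. This is an assumed, abridged shape of a single Haystack-style answer dict with made-up values; exact keys vary by version.

# Assumed, abridged shape of one extracted answer (values are made up).
sample_answer = {
    "answer": "a short snippet",                    # -> Answers column
    "context": "... models extract a short snippet from a document ...",  # -> Context column
    "meta": {"name": "context.pdf"},                # -> Document Title column
    "score": 12.3,                                  # relevance score (not shown in the table)
}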

Motivation


The motivation behind the CBQA project is to provide an easy-to-use interface for performing Question Answering (QA) in multiple languages, with the option of using different pre-trained models, and to enable organizations to deploy this project locally with minimal or no modifications.

Technologies


  • Python (version 3.7)
  • Haystack (version 0.7.0)
  • Transformers (version 4.3.1)
  • Elasticsearch (version 7.10)
  • Flask (version 1.1.2)
  • Gunicorn (version 20.0.4)
  • Bootstrap (version 4.5.3)
  • jQuery (version 3.5.1)
  • HTML, CSS, and JavaScript
  • Docker

Languages


en: English

fr: French

Installation


Before starting the installation, clone this repository using the following commands in the terminal.

$ git clone https://github.com/Karthik-Bhaskar/Context-Based-Question-Answering.git
$ cd Context-Based-Question-Answering/

You can get started using one of two options:

  1. Installation using pip
  2. Running as a Docker container

Installation using pip


The main dependencies required to run the CBQA application are listed in requirements.txt.

To install the dependencies:

  1. Create a Python virtual environment with Python version 3.7 and activate it
  2. Install dependency libraries using the following command in the terminal.
$ pip install -r requirements.txt

Before executing the CBQA application, please start the Elasticsearch server. Elasticsearch is the DocumentStore type used in this application. To download and install Elasticsearch, please check here.

If you are using a Docker environment, run Elasticsearch in Docker using the following commands in the terminal. If you want to install Docker Engine on your machine, please check here.

$ docker pull docker.elastic.co/elasticsearch/elasticsearch:7.10.0
$ docker run -p 9200:9200 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.10.0

Make sure the Elasticsearch server is running using the following command in a new terminal.

$ curl http://localhost:9200

You should get a response like the one below.

{
  "name" : "facddac422e8",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "a4A3-yBhQdKBlpSDkpRelg",
  "version" : {
    "number" : "7.10.0",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "51e9d6f22758d0374a0f3f5c6e8f3a7997850f96",
    "build_date" : "2020-11-09T21:30:33.964949Z",
    "build_snapshot" : false,
    "lucene_version" : "8.7.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

After installing the dependency libraries and starting the Elasticsearch server, you are good to go.

  • To run the application using a WSGI server like Gunicorn, use the following command in a new terminal.

    $ gunicorn -w 1 --threads 4 app:app

    This runs the application on a Gunicorn server at http://localhost:8000 with a single worker and four threads.

  • To run the application on the Flask development server (not recommended for production), use the following command in a new terminal.

    $ python app.py

    Now the application will be running on the Flask development server at http://localhost:5000.

In either case, you should see a statement like the one below (the date and time will differ) in the terminal after the application has started.

User tracker thread started @  2021-03-04 18:25:20.803277

The above statement means that the thread handling user connections has started and the application is ready to accept users.

Note: When performing QA using pre-trained models, the selected model gets downloaded from Hugging Face's model hub on first use; this may take a while depending on your internet speed. If you restart the application while testing, make sure to remove the auto-created temporary user directories in the project and the user index on Elasticsearch (this will be fixed soon).

Running as a Docker container


To install Docker Engine on your machine, please check here.

This project includes a docker-compose.yml file describing the services needed to get started quickly: Elasticsearch, Kibana, and the CBQA application.
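
The actual file ships with the repository; purely as an illustration of its shape, a compose file wiring these three services together might look like the sketch below. Image names, versions, and settings here are assumptions, not the repository's exact configuration.

# Illustrative sketch only; see the repository's docker-compose.yml for
# the real configuration.
version: "3"
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.10.0
    environment:
      - discovery.type=single-node
    ports:
      - "9200:9200"
  kibana:
    image: docker.elastic.co/kibana/kibana:7.10.0  # version assumed to match Elasticsearch
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch
  web-app:
    image: cbqa-app  # hypothetical image name; the real one is hosted on Docker Hub
    ports:
      - "5000:5000"
    depends_on:
      - elasticsearch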

Note: If containers exit with code 137 (out of memory), increase the memory allocated to your Docker Engine.

The Docker image for the CBQA application has already been built and is hosted on Docker Hub.

To start the complete package, which includes Elasticsearch, Kibana, and the CBQA application, as Docker containers, use the following command in the terminal from the root of this project directory.

$ docker-compose up

Please wait until all the services have started after pulling the images.

Make sure the Elasticsearch server is running using the following command in a new terminal.

$ curl http://localhost:9200

You should get a response like the one shown in the Installation using pip section above.

The CBQA application service in the Docker container runs on http://localhost:5000; the application's container name is web-app.

The Elasticsearch service will be running on http://localhost:9200, and the Kibana service on http://localhost:5601.

Note: When performing QA using pre-trained models, the selected model gets downloaded from Hugging Face's model hub on first use; this may take a while depending on your internet speed. If you restart the application while testing, make sure to remove the auto-created temporary user directories in the project and the user index on Elasticsearch (this will be fixed soon).

To stop the containers, use the following command in a new terminal.

$ docker-compose down

Demo


Full demo video on YouTube.


Features


Ready:

  • Easy to use UI to perform QA
  • Pre-trained models selection
  • QA Language support (English and French)
  • Simultaneous user handling support (while using a WSGI server)

Future development:

  • Ability to plug in a large-scale document corpus for the QA engine
  • Generative QA support
  • Additional QA Language support
  • Improving the UI

Status


Project Status: Active – The project has reached a stable, usable state and is being actively developed.

License


This project is licensed under the Apache License 2.0.

Credits


Citations


@article{DBLP:journals/corr/abs-1810-04805,
  author    = {Jacob Devlin and
               Ming{-}Wei Chang and
               Kenton Lee and
               Kristina Toutanova},
  title     = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
               Understanding},
  journal   = {CoRR},
  volume    = {abs/1810.04805},
  year      = {2018},
  url       = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint    = {1810.04805},
  timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{DBLP:journals/corr/abs-1907-11692,
  author    = {Yinhan Liu and
               Myle Ott and
               Naman Goyal and
               Jingfei Du and
               Mandar Joshi and
               Danqi Chen and
               Omer Levy and
               Mike Lewis and
               Luke Zettlemoyer and
               Veselin Stoyanov},
  title     = {RoBERTa: {A} Robustly Optimized {BERT} Pretraining Approach},
  journal   = {CoRR},
  volume    = {abs/1907.11692},
  year      = {2019},
  url       = {http://arxiv.org/abs/1907.11692},
  archivePrefix = {arXiv},
  eprint    = {1907.11692},
  timestamp = {Thu, 01 Aug 2019 08:59:33 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1907-11692.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{DBLP:journals/corr/abs-1910-01108,
  author    = {Victor Sanh and
               Lysandre Debut and
               Julien Chaumond and
               Thomas Wolf},
  title     = {DistilBERT, a distilled version of {BERT:} smaller, faster, cheaper
               and lighter},
  journal   = {CoRR},
  volume    = {abs/1910.01108},
  year      = {2019},
  url       = {http://arxiv.org/abs/1910.01108},
  archivePrefix = {arXiv},
  eprint    = {1910.01108},
  timestamp = {Tue, 02 Jun 2020 12:48:59 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1910-01108.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{DBLP:journals/corr/abs-1911-03894,
  author    = {Louis Martin and
               Benjamin M{\"{u}}ller and
               Pedro Javier Ortiz Su{\'{a}}rez and
               Yoann Dupont and
               Laurent Romary and
               {\'{E}}ric Villemonte de la Clergerie and
               Djam{\'{e}} Seddah and
               Beno{\^{\i}}t Sagot},
  title     = {CamemBERT: a Tasty French Language Model},
  journal   = {CoRR},
  volume    = {abs/1911.03894},
  year      = {2019},
  url       = {http://arxiv.org/abs/1911.03894},
  archivePrefix = {arXiv},
  eprint    = {1911.03894},
  timestamp = {Sun, 01 Dec 2019 20:31:34 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1911-03894.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{dHoffschmidt2020FQuADFQ,
  title={FQuAD: French Question Answering Dataset},
  author={Martin d'Hoffschmidt and Maxime Vidal and Wacim Belblidia and Tom Brendl{\'{e}} and Quentin Heinrich},
  journal={ArXiv},
  year={2020},
  volume={abs/2002.06071}
}

Community Contribution


You are most welcome to contribute to this project, be it a small typo correction or a feature enhancement. Currently, there is no set structure to follow for contributions, but you can open an Issue under "Community contribution" or contact me to discuss your idea before getting started 🙂

Contact


Karthik Bhaskar - Feel free to contact me.

