Machine Learning Design : Searching in Question Answer Dataset

Problem:

Given a query, efficiently retrieve similar questions from a question-answer repository.

This can be solved by building text similarity algorithms. Text similarity is useful in many different areas:

Question Answer Forum : Given a collection of questions, find the questions that are most similar to a given query.
Image search : Given a text description, retrieve all images that match the description.

Objective:

Given a query text, retrieve the most similar questions efficiently
Faster response time
Lower service cost
Better utilization of the in-house knowledge base

First Cut Solution Approach : Keyword based - Lexical Search

In this technique we rank the results by how many words they share with the query text. In keyword-based search we represent text as a vector with one dimension for each word in the corpus vocabulary.

The vector for a query is based on the number of times each term in the vocabulary appears in it. This is known as the "Bag of Words" representation. TF-IDF and inverted-index based approaches also fall under this type of search.

                    W1           W2             W3           W4
    Q1 :            How          to             install      pip?
                    D1           D2             D4           D3
                    D2           D4             D3           D5
                    .            .              .            .
                    .            .              .            .
                    .            .              .            .
                    D3           D3             D8 ...       D13

Each column lists all the documents in which the word in the column header appears. Document D3 contains every word in the query, so it is the most similar to the query.
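The table above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a tiny made-up corpus (the documents D1 to D4 below are hypothetical and not taken from the dataset) and a simple "count the shared words" score rather than TF-IDF.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def rank_by_shared_words(query, index):
    """Rank documents by how many distinct query words they contain."""
    hits = Counter()
    for word in set(tokenize(query)):
        for doc_id in index.get(word, ()):
            hits[doc_id] += 1
    # Sort by descending count, then by id so ties are deterministic.
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy corpus, for illustration only.
docs = {
    "D1": "How do I read a file in Python?",
    "D2": "How to uninstall a package",
    "D3": "How to install pip on Windows?",
    "D4": "Where to install Java",
}
index = build_inverted_index(docs)
ranking = rank_by_shared_words("How to install pip?", index)
print(ranking)
```

Only the documents listed under at least one query word are ever scored, which is what makes the inverted index fast.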

Problem With Keyword based Search:

A lexical search ranks documents by how many words they share with the query. But a document may use different words, in a different order, and still be semantically similar. In addition, a bag-of-words representation does not take word order into account, and word order matters for understanding a natural-language query and its context.

Let's say we have the following questions:

Q1. How to install new packages via python?
Q2. use new library via pip?

Questions Q1 and Q2 are worded differently, but semantically they mean the same thing: how to install a new package using some software.

Q3. install elasticsearch
Q4. setup Lucene and Apache Solr
Q5. Install fulltext search engine software

Similarly, Q3, Q4, and Q5 are worded differently, but they all relate to the same kind of software: Apache Solr, Elasticsearch, and Lucene are all full-text search engines.

With keyword-based search, these questions would not be found to be similar.
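To see how little these questions overlap lexically, we can measure the share of common words. The sketch below uses Jaccard overlap (shared words divided by all distinct words), which is my choice for illustration and not the scoring that Elasticsearch uses.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Shared words divided by all distinct words of the two texts."""
    words_a, words_b = tokenize(a), tokenize(b)
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


q1 = "How to install new packages via python?"
q2 = "use new library via pip?"
q3 = "install elasticsearch"
q4 = "setup Lucene and Apache Solr"

print(jaccard(q1, q2))  # only "new" and "via" are shared
print(jaccard(q3, q4))  # no shared words at all
```

Q1 and Q2 share only two filler words, and Q3 and Q4 share none, even though a human reader sees them as closely related.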

How to Improve, What next?

Semantic Search :

We would like to build a query representation that captures the linguistic content of the query text. We call this an embedding: a dense numeric vector representation of the given text.

These vectors capture the semantic meaning of the words: texts with closely related meanings should have vectors that are close together, and synonyms should lie near each other in the vector space.

What are the benefits of Embeddings

The main problem with "Bag of Words" vectors is that they are very high-dimensional (d = size of the vocabulary) and sparse.
Text embedding vectors are dense and lower-dimensional, and they encode the semantic meaning of the text.
"Bag of Words" also fails to capture word order, which is very important for understanding the wider context.
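A small sketch makes both weaknesses visible. The two sentences and the extra vocabulary words below are made up for illustration; a real vocabulary would have tens of thousands of entries.

```python
from collections import Counter


def bag_of_words(text, vocab):
    """Return the count of each vocabulary word in the text, in vocab order."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]


# Hypothetical sentences that differ only in word order.
a = "python calls c"
b = "c calls python"

# Vocabulary = words of both sentences plus a few unrelated (hypothetical) words.
vocab = sorted(set(a.split()) | set(b.split()) | {"install", "pip", "package"})

vec_a = bag_of_words(a, vocab)
vec_b = bag_of_words(b, vocab)
print(vocab)
print(vec_a)
print(vec_b)
```

Both sentences get exactly the same vector although they mean different things, and every vocabulary word that does not occur in the sentence contributes a zero.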

How to use Embeddings for Similarity Search

Let's say we have a collection of questions and answers. A user asks a question, and we want to retrieve the most similar questions from the corpus.

For fast retrieval we need a suitable data structure; an index is the solution.

We first train a query embedding model to generate dense, lower-dimensional vectors for queries, or we use a pretrained model already trained on a large dataset such as Wikipedia or Common Crawl.
We then create an index of the question embeddings.
At run time the user query is passed through the embedding model to get a query vector, and we compare the query vector to all the question vectors in the dataset using cosine similarity to get the top k results.
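The last step, comparing the query vector with every question vector and keeping the best k, can be sketched in plain Python. The 3-dimensional vectors and the ids "a", "b", "c" are hypothetical stand-ins for real 512-dimensional embeddings.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, question_vecs, k=3):
    """Return the k (score, question_id) pairs with the highest similarity."""
    scored = (
        (cosine_similarity(query_vec, vec), qid)
        for qid, vec in question_vecs.items()
    )
    return heapq.nlargest(k, scored)


# Hypothetical toy embeddings.
question_vecs = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.0, 1.0, 0.0],
    "c": [1.0, 1.0, 0.0],
}
results = top_k([1.0, 0.0, 0.0], question_vecs, k=2)
print(results)
```

This brute-force scan touches every vector, which is also what a match_all query combined with a scoring script does inside Elasticsearch.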

Implementation on StackOverflow Questions answers dataset

StackSample is a dataset with the text of 10% of the questions and answers from the Stack Overflow programming Q&A website. I used 200000 questions from this dataset (nearly 20% of it).

I created a script to download the dataset and built the embedding model in TensorFlow, using Google's Universal Sentence Encoder. We used only the pretrained model; we have not fine-tuned it.

We created the following Elasticsearch index:

    "mappings": {
        "properties": {
            "title": {
                "type": "text"
            },
            "title_vector": {
                "type": "dense_vector",
                "dims": 512
            }
        }
    }

Here the "dense_vector" dimension is 512, so we must make sure that our embedding model generates 512-dimensional vectors.

To index a question, we pass it through the model to generate a 512-dimensional vector, which is then stored in the "title_vector" field.
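The indexing script itself is not shown here, so the following is only a sketch of how the documents could be prepared. The function name build_index_actions and the stand-in fake_embed are hypothetical; in the real project the embed function would wrap the Universal Sentence Encoder, and the generated actions would be sent to Elasticsearch, for example with elasticsearch.helpers.bulk.

```python
def build_index_actions(titles, embed, index_name="questions-index", dims=512):
    """Yield one bulk-index action per question title.

    embed is any callable that maps a string to a sequence of `dims` numbers.
    """
    for doc_id, title in enumerate(titles):
        vector = [float(x) for x in embed(title)]
        if len(vector) != dims:
            # The mapping declares "dims": 512, so other sizes would be rejected.
            raise ValueError(f"expected {dims} dimensions, got {len(vector)}")
        yield {
            "_index": index_name,
            "_id": doc_id,
            "_source": {"title": title, "title_vector": vector},
        }


def fake_embed(text):
    """Hypothetical stand-in for the real model: a constant 512-dim vector."""
    return [len(text) / 100.0] * 512


titles = ["How to install pip?", "Delete a line from a file in java"]
actions = list(build_index_actions(titles, fake_embed))
print(len(actions), len(actions[0]["_source"]["title_vector"]))
```

Checking the vector length before indexing catches a mismatch between the model and the "dims" value of the mapping early.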

{
  "script_score": {
    "query": {"match_all": {}},
    "script": {
      "source": "cosineSimilarity(params.query_vector, 'title_vector') + 1.0",
      "params": {"query_vector": query_vector}
    }
  }
}

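We used cosineSimilarity for the similarity search. The script adds 1.0 because Elasticsearch does not allow negative scores, so the reported scores fall between 0 and 2.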
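As a sketch, the query above can be wrapped in a small helper that builds the full search body. The helper names, the "size" value, and the "_source" filter are my assumptions; only the script_score part comes from the query shown above.

```python
def build_semantic_query(query_vector, size=10):
    """Build a search body that scores every document by cosine similarity."""
    return {
        "size": size,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "cosineSimilarity(params.query_vector, 'title_vector') + 1.0",
                    "params": {"query_vector": list(query_vector)},
                },
            }
        },
        # Return only the title, not the stored 512-dimensional vector.
        "_source": ["title"],
    }


def score_to_cosine(score):
    """Undo the + 1.0 shift to recover the raw cosine similarity."""
    return score - 1.0


body = build_semantic_query([0.1, 0.2, 0.3], size=5)
print(body["query"]["script_score"]["script"]["params"])
print(score_to_cosine(1.8460855))
```

A reported score of 1.8460855 therefore corresponds to a cosine similarity of about 0.846.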

How to set up project

Download and set up Docker on your system
Set up Elasticsearch

   # Pull the Elasticsearch 7.7.0 image
   docker pull docker.elastic.co/elasticsearch/elasticsearch:7.7.0
   docker image ls
   # Run the container with 6 GB of RAM on port 9200; --name sets <name_of_elastic_instance>
   docker run -m 6G -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" --name my_elastic_stack docker.elastic.co/elasticsearch/elasticsearch:7.7.0
   docker ps
   docker stats
   # Open a shell inside the running container
   docker exec -it my_elastic_stack bash

Set up directories

        pwd
        cd /usr/share/elasticsearch/
        mkdir searchqa
        cd searchqa

Install packages

    # Check the Linux release first:
    # on Fedora or Red Hat use yum,
    # on Ubuntu use sudo apt-get
    
    cat /etc/*-release
    
    yum -y update
    yum install -y python3
    yum install -y vim
    yum install -y wget
    yum install -y tar
    yum clean all
    
    pip3.6 install --upgrade pip
    pip3.6 --version
    pip3.6 install elasticsearch
    pip3.6 install pandas
    # https://tfhub.dev/google/universal-sentence-encoder/4

    pip3.6 install --upgrade --no-cache-dir tensorflow
    pip3.6 install --upgrade tensorflow-hub

Download the dataset and copy it to the Docker location

    # Download the dataset from: https://www.kaggle.com/datasets/stackoverflow/stacksample
    # Copy the zip file from the local machine into the container
    docker cp stack_sampels.zip my_elastic_stack:/usr/share/elasticsearch/searchqa/
    
    mkdir data
    # After unzipping, move all CSV files to the data folder
    mv *.csv ./data/
    rm -rf stack_sampels.zip
    
    # Test that Elasticsearch is running
    curl http://localhost:9200/

Set up Tf Model

    # Download the model from: https://tfhub.dev/google/universal-sentence-encoder/4
    # Copy the archive from the local machine into the container
    docker cp universal-sentence-encoder_4.tar.gz my_elastic_stack:/usr/share/elasticsearch/searchqa/data/
    yum install -y tar
    # tar -C needs the target directory to exist
    mkdir -p ./USE4/
    tar -xvzf universal-sentence-encoder_4.tar.gz -C ./USE4/

Create ElasticSearch Index

    python index.py

    # Check the index statistics (document count, size)
    curl -X GET "localhost:9200/questions-index/_stats?pretty"
    
    # Fetch a document by id
    curl "http://localhost:9200/questions-index/_doc/80"

Create A Flask API Search

    pip3.6 install flask
    # Set the locale first; if this step fails, check which locales are installed in the container
    LC_ALL=en_US
    export LC_ALL
    export FLASK_APP=search_controller.py
    python3.6 -m flask run

Test Python API

  time curl http://127.0.0.1:5000/search/how+to+install+python
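The query is passed as part of the URL path with "+" in place of spaces. A small sketch of how such a URL can be built and decoded with the Python standard library follows; the helper names are hypothetical, and whether search_controller.py decodes the "+" in exactly this way is an assumption.

```python
from urllib.parse import quote_plus, unquote_plus


def build_search_url(query, base_url="http://127.0.0.1:5000"):
    """Encode the query text so it is safe to place in the /search/ path."""
    return f"{base_url}/search/{quote_plus(query)}"


def decode_query(path_segment):
    """Turn the encoded path segment back into the original query text."""
    return unquote_plus(path_segment)


url = build_search_url("how to install python")
print(url)
print(decode_query("DELETE+FILE+FROM+LINUX"))
```

Encoding the query this way also keeps characters such as "#" or "?" in a question title from being interpreted as URL syntax.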

Project Diagram

Results

Query : time curl http://127.0.0.1:5000/search/DELETE+FILE+FROM+LINUX
Results :  Type                 Score        Questions
  
 KeyWord [Lexical] Search : 13.654223	Delete numbers from a file
 KeyWord [Lexical] Search : 12.897192	Ignore file from delete during WebDeploy
 KeyWord [Lexical] Search : 12.897192	PHP, delete path from TXT file
 KeyWord [Lexical] Search : 12.219694	Paperclip - delete a file from Amazon S3?
 KeyWord [Lexical] Search : 11.75382	Linux File Logs
 KeyWord [Lexical] Search : 11.609821	change sqlite file size after "DELETE FROM table"
 KeyWord [Lexical] Search : 11.609821	Delete a line from a file in java
 KeyWord [Lexical] Search : 11.609821	Delete a character from a file in C
 KeyWord [Lexical] Search : 11.057932	How to delete lines from file after reading it?
 KeyWord [Lexical] Search : 11.057932	Delete Duplicate records from large csv file C# .Net
====================================================================================
 Semantic Search : 1.6982048	remove certain tag in files under linux?
 Semantic Search : 1.6897635	Removing a file in a Restricted Folder in Linux
 Semantic Search : 1.6710172	Removing almost all directories and files in linux
 Semantic Search : 1.6593639	Delete contents of a directory recursively on Windows
 Semantic Search : 1.6564437	Maillog file in linux
 Semantic Search : 1.6534	Bash: Delete until a specific file
 Semantic Search : 1.6532037	How to only get file name with linux `find`?
 Semantic Search : 1.6449786	linux releative path to fullpath file name
 Semantic Search : 1.6393253	Delete read only files with Ant on windows
 Semantic Search : 1.6357477	Delete unused files


Query : time curl http://127.0.0.1:5000/search/Difference+betwen+lucene+elasticsearch+apache+solr
Results:     Type            Score        Question
 KeyWord [Lexical] Search : 23.839565	ElasticSearch, Sphinx, Lucene, Solr, Xapian. Which fits for which usage?
 KeyWord [Lexical] Search : 17.337696	Solr lucene and "similar" keywords
 KeyWord [Lexical] Search : 16.623175	searching using Apache Solr
 KeyWord [Lexical] Search : 16.376442	Denormalizing relational data for lucene/solr
 KeyWord [Lexical] Search : 16.376442	Choosing a solr/lucene commit strategy
 KeyWord [Lexical] Search : 16.376442	Version incompatibility between Lucene and Solr
 KeyWord [Lexical] Search : 14.780258	What is the difference betwen including modules and embedding modules?
 KeyWord [Lexical] Search : 14.780258	What is the difference betwen boost::multi_array views and subarrays
 KeyWord [Lexical] Search : 14.74178	Solr/Lucene behaves weird with some word searches
 KeyWord [Lexical] Search : 14.74178	how do I normalise a solr/lucene score?
=====================================================================================================================
 Semantic Search : 1.8460855	ElasticSearch, Sphinx, Lucene, Solr, Xapian. Which fits for which usage?
 Semantic Search : 1.8184654	Version incompatibility between Lucene and Solr
 Semantic Search : 1.8119385	searching using Apache Solr
 Semantic Search : 1.8102566	solr query analyzers
 Semantic Search : 1.7987325	Neo4j indexing with Lucene and query with SOLR
 Semantic Search : 1.7930639	What is the difference between EdgeNGramTokenizerFactory EdgeNGramFilterFactory in SOLR?
 Semantic Search : 1.7795568	SOLR getting started, little help
 Semantic Search : 1.7773565	Solr lucene and "similar" keywords
 Semantic Search : 1.77633	Solr analyzer default type
 Semantic Search : 1.7696992	How index cifs server with solr

Other Extended Applications:

Case Management in Customer Service Portal
Discussion Forum
Given a query, find the answer in video or audio

Semantic-Search-On-Question-Answers

High Level Design Of Semantic Search On Stack Overflow question answers dataset.

cr21 / Semantic-Search-On-Question-Answers