Given a query, efficiently retrieve similar questions from a question-answer repository.
This can be solved by building text similarity algorithms. Text similarity can be useful in many different areas.
- Question Answer Forum: Given a collection of questions, find the questions that are most similar to a given query.
- Image search: Given a text description, retrieve all images that match the description.
- Given a query text, retrieve the most similar questions efficiently
- Faster response time
- Lower service cost
- Better utilization of the in-house knowledge base
In this technique we rank the results based on how many words they share with the query text. In keyword-based search we represent text as a vector with one dimension for each word in the corpus.
The vector for the entire query is based on the number of times each term in the vocabulary appears. This is also known as the "Bag of Words" representation. TF-IDF and inverted-index-based approaches also fall under this type of search.
Q1 : How to install pip?
W1    W2    W3    W4
D1    D2    D4    D3
D2    D4    D3    D5
.     .     .     .
.     .     .     .
.     .     .     .
D3    D3    D8 ........ D13
Each column lists all the documents in which the word in the column header appears. Document D3 contains all the words in the query, so it is the most similar to the query.
A lexical search approach ranks documents based on how many words they share with the query. But a document may use different words in a different order and still be semantically similar. In addition, a bag-of-words representation does not take word order into account, and word order matters for understanding a natural-language query and its context.
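The inverted-index lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy corpus, the document ids, and the helper names (`tokenize`, `build_inverted_index`, `rank_by_shared_words`) are hypothetical and not part of this project's code.

```python
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    cleaned = (w.strip("?.,!:;") for w in text.lower().split())
    return [w for w in cleaned if w]

def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def rank_by_shared_words(query, index):
    """Rank documents by how many distinct query words they contain."""
    hits = Counter()
    for word in set(tokenize(query)):
        for doc_id in index.get(word, ()):
            hits[doc_id] += 1
    # Highest count first; ties are broken by document id.
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical toy corpus, for illustration only.
docs = {
    "D1": "How do I uninstall a package?",
    "D2": "How to read a file in Python",
    "D3": "How to install pip on Windows?",
    "D4": "Where to install fonts",
}
index = build_inverted_index(docs)
ranking = rank_by_shared_words("How to install pip?", index)
print(ranking)
```

In this toy corpus D3 is the only document that contains all four query words, so it is ranked first.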
Let's say we have the following questions:
Q1. How to install new packages via python?
Q2. use new library via pip?
Questions Q1 and Q2 are worded differently, but semantically they mean the same thing: how to install a new package using some software.
Q3. install elasticsearch
Q4. setup Lucene and Apache Solr
Q5. Install fulltext search engine software
Similarly, Q3, Q4, and Q5 are worded differently, but they all relate to the same kind of software: Apache Solr, Elasticsearch, and Lucene are all full-text search engines.
If we had used keyword-based search, these questions would not be found similar.
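To make the limitation concrete, the following sketch measures word overlap between the example questions using Jaccard similarity (shared words divided by total distinct words). Jaccard is used here only as a simple stand-in for a keyword-based score, and the helper names are hypothetical.

```python
def words(text):
    """Lowercased set of words with surrounding punctuation removed."""
    return {w.strip("?.,!") for w in text.lower().split()} - {""}

def jaccard(a, b):
    """Shared words divided by the total number of distinct words."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb)

q1 = "How to install new packages via python?"
q2 = "use new library via pip?"
q3 = "install elasticsearch"
q4 = "setup Lucene and Apache Solr"
q5 = "Install fulltext search engine software"

# Q1 and Q2 share only "new" and "via".
print(jaccard(q1, q2))
# Q3 and Q4 share no words at all, although both are about search engines.
print(jaccard(q3, q4))
# Q3 and Q5 share only "install".
print(jaccard(q3, q5))
```

A keyword-based ranker sees these pairs as almost unrelated, which is the gap that semantic embeddings are meant to close.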
We would like to build a query representation that captures the linguistic content of the query text. We call it an embedding: a dense numeric vector representation of the given text.
These vectors capture the semantic meaning of the words: closely related texts should have vectors that are close together, and synonyms should lie near each other in the vector space.
- The main problem with "Bag of Words" vectors is that they are very high dimensional (d = size of the vocabulary) and sparse.
- Text embedding vectors are dense and lower dimensional, and they carry the semantic meaning of the text.
- "Bag of Words" also fails to capture word order, which is very important for understanding the larger context.
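The points above can be checked on a tiny example. The sketch below builds bag-of-words count vectors over a hypothetical four-sentence corpus; the corpus and the helper names are assumptions made for illustration.

```python
def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()

def bag_of_words(text, vocab):
    """Count vector with one dimension per vocabulary word."""
    tokens = tokenize(text)
    return [tokens.count(word) for word in vocab]

# Hypothetical toy corpus.
corpus = [
    "dog bites man",
    "man bites dog",
    "how to install pip",
    "delete a file from linux",
]
vocab = sorted({w for doc in corpus for w in tokenize(doc)})
vectors = [bag_of_words(doc, vocab) for doc in corpus]

# Dimensionality equals the vocabulary size and grows with the corpus.
print(len(vocab))
# Word order is lost: both sentences map to the same vector.
print(vectors[0] == vectors[1])
# Most entries are zero, even for this tiny corpus.
zero_fraction = sum(v.count(0) for v in vectors) / (len(vocab) * len(vectors))
print(zero_fraction)
```

With a real vocabulary of tens of thousands of words, each vector would have that many dimensions and be almost entirely zeros.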
Let's say we have a collection of questions and answers. A user asks a question, and we want to retrieve the most similar questions from the corpus.
For fast search we need a data structure built for fast retrieval; an index is the solution for this.
- We first train a query embedding model to generate dense, lower-dimensional vectors for queries, or we use a pretrained model already trained on a large dataset such as Wikipedia or Common Crawl.
- We then create an index for the query embeddings.
- At run time the user query is passed through the query embedding model to get a query vector; we then compare the query vector to all the question vectors in the dataset using cosine similarity to get the top k results.
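The run-time step above can be sketched as a brute-force comparison. This is a minimal illustration, not the project's implementation: the 3-dimensional vectors are made-up stand-ins for real embeddings, and `top_k` is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, question_vectors, k=2):
    """Return the k (question, score) pairs with the highest similarity."""
    scored = [
        (question, cosine_similarity(query_vector, vector))
        for question, vector in question_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up 3-dimensional vectors; a real embedding model would
# produce many more dimensions (512 in the setup described later).
question_vectors = {
    "How to install new packages via python?": [0.9, 0.1, 0.0],
    "use new library via pip?": [0.8, 0.2, 0.1],
    "Delete a line from a file in java": [0.0, 0.2, 0.9],
}
query_vector = [0.85, 0.15, 0.05]  # stand-in for the embedded user query

results = top_k(query_vector, question_vectors, k=2)
for question, score in results:
    print(round(score, 3), question)
```

Comparing against every stored vector is linear in the corpus size, which is why the vectors are kept in an index rather than in a plain Python dictionary.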
StackSample is a dataset with the text of 10% of the questions and answers from the Stack Overflow programming Q&A website. I used 200,000 questions from this dataset (nearly 20% of it).
I created a script to download the dataset and built the embedding model in TensorFlow, using Google's Universal Sentence Encoder. Only the pretrained model was used; it was not fine-tuned.
We created the following Elasticsearch index:
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "title_vector": {
        "type": "dense_vector",
        "dims": 512
      }
    }
  }
}
Here the "dense_vector" dimension is 512, so we must make sure that our embedding model generates 512-dimensional vectors.
To index a question, we pass it through the model to generate a 512-dimensional vector, which is then stored in the "title_vector" field.
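As a rough sketch of the indexing step, the helpers below build the index definition and one document body as plain Python dictionaries and check the vector length before anything is sent. The function names are hypothetical, and the client calls mentioned in the comments assume the 7.x official Python client; the project's own index.py may differ.

```python
EMBEDDING_DIMS = 512  # must match "dims" in the index mapping

def build_index_body(dims=EMBEDDING_DIMS):
    """Index definition with a text field and a dense_vector field."""
    return {
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "title_vector": {"type": "dense_vector", "dims": dims},
            }
        }
    }

def build_document(title, vector, dims=EMBEDDING_DIMS):
    """Document body for one question; rejects vectors of the wrong size."""
    if len(vector) != dims:
        raise ValueError(f"expected {dims} dimensions, got {len(vector)}")
    return {"title": title, "title_vector": [float(x) for x in vector]}

# Assumed usage with the 7.x Python client (not executed here):
#   es = Elasticsearch()
#   es.indices.create(index="questions-index", body=build_index_body())
#   es.index(index="questions-index", body=build_document(title, vector))

# Placeholder vector; in practice it would come from the embedding model.
doc = build_document("How to install pip?", [0.0] * EMBEDDING_DIMS)
print(len(doc["title_vector"]))
```

Validating the length up front gives a clearer error than the one Elasticsearch raises when a vector does not match the mapped dimension.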
{
"script_score": {
"query": {"match_all": {}},
"script": {
"source": "cosineSimilarity(params.query_vector, 'title_vector') + 1.0",
"params": {"query_vector": query_vector}
}
}
}
We used cosineSimilarity for the similarity search. The + 1.0 offset keeps the score non-negative, which Elasticsearch requires for script scores.
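A small sketch of how this query could be wrapped in a full search body, assuming a top-k size and a `_source` filter that are not shown in the original snippet. `build_semantic_query` and `score_to_cosine` are hypothetical helpers.

```python
def build_semantic_query(query_vector, k=10):
    """Search body that scores every document by cosine similarity."""
    return {
        "size": k,  # assumption: return only the top k hits
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "cosineSimilarity(params.query_vector, 'title_vector') + 1.0",
                    "params": {"query_vector": list(query_vector)},
                },
            }
        },
        "_source": ["title"],  # assumption: only the title is needed back
    }

def score_to_cosine(score):
    """Undo the + 1.0 offset to recover the raw cosine similarity."""
    return score - 1.0

# Placeholder vector; in practice it would come from the embedding model.
body = build_semantic_query([0.0] * 512, k=10)
print(body["size"])
print(round(score_to_cosine(1.8460855), 4))
```

Because of the offset, reported scores fall between 0 and 2; for example, a score of 1.8460855 corresponds to a cosine similarity of about 0.846.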
- Download and set up Docker on the system
- Set up Elasticsearch
# Pull the Elasticsearch 7.7.0 image
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.7.0
docker image ls
# Run the container with 6 GB of RAM on port 9200; --name sets <name_of_elastic_instance>
docker run -m 6G -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" --name my_elastic_stack docker.elastic.co/elasticsearch/elasticsearch:7.7.0
docker ps
docker stats
# open a shell inside the running container
docker exec -it my_elastic_stack bash
- Set up directories
pwd
cd /usr/share/elasticsearch/
mkdir searchqa
cd searchqa
- Install packages
# check the Linux release to pick the package manager
# on Fedora or Red Hat, use yum
# on Ubuntu, use sudo apt-get
cat /etc/*-release
yum -y update
yum install -y python3
yum install -y vim
yum -y install wget
yum install -y tar
yum clean all
pip3.6 install --upgrade pip
pip3.6 --version
pip3.6 install elasticsearch
pip3.6 install pandas
# https://tfhub.dev/google/universal-sentence-encoder/4
pip3.6 install --upgrade --no-cache-dir tensorflow
pip3.6 install --upgrade tensorflow-hub
- Download the dataset and copy it to the Docker location
Download dataset from : https://www.kaggle.com/datasets/stackoverflow/stacksample
# copy the dataset zip file from the local machine into the container
docker cp stack_sampels.zip my_elastic_stack:/usr/share/elasticsearch/searchqa/
mkdir data
# after unzipping, move all csv files to the data folder
mv *.csv ./data/
rm -rf stack_sampels.zip
# test whether Elasticsearch is running
http://localhost:9200/
- Set up Tf Model
Download model from : https://tfhub.dev/google/universal-sentence-encoder/4
# copy the model archive from the local machine into the container
docker cp universal-sentence-encoder_4.tar.gz my_elastic_stack:/usr/share/elasticsearch/searchqa/data/
yum install -y tar
# tar -C requires the target directory to exist
mkdir -p ./USE4
tar -xvzf universal-sentence-encoder_4.tar.gz -C ./USE4/
- Create ElasticSearch Index
python index.py
# check the index statistics, including how many documents were indexed
curl -X GET "localhost:9200/questions-index/_stats?pretty"
# Search by Id
http://localhost:9200/questions-index/_doc/80
- Create A Flask API Search
pip3.6 install flask
# set the locale; if this raises an error, look up the message for your system
LC_ALL=en_US
export LC_ALL
export FLASK_APP=search_controller.py
python3.6 -m flask run
- Test Python API
time curl http://127.0.0.1:5000/search/how+to+install+python
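The same request can be made from Python. The sketch below assumes the Flask app is running locally on port 5000 and that the endpoint accepts the `+`-separated query in the path, as in the curl command above; the response format is not documented here, so the body is returned as raw text. `build_search_url` and `search` are hypothetical helpers.

```python
from urllib.parse import quote_plus
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:5000/search/"

def build_search_url(query, base_url=BASE_URL):
    """Encode the query the same way as the curl examples (spaces become '+')."""
    return base_url + quote_plus(query)

def search(query):
    """Call the search API and return the raw response body as text.

    Requires the Flask app to be running; not called in this sketch.
    """
    with urlopen(build_search_url(query)) as response:
        return response.read().decode("utf-8")

url = build_search_url("how to install python")
print(url)  # http://127.0.0.1:5000/search/how+to+install+python
```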
Query : time curl http://127.0.0.1:5000/search/DELETE+FILE+FROM+LINUX
Results : Type Score Questions
KeyWord [Lexical] Search : 13.654223 Delete numbers from a file
KeyWord [Lexical] Search : 12.897192 Ignore file from delete during WebDeploy
KeyWord [Lexical] Search : 12.897192 PHP, delete path from TXT file
KeyWord [Lexical] Search : 12.219694 Paperclip - delete a file from Amazon S3?
KeyWord [Lexical] Search : 11.75382 Linux File Logs
KeyWord [Lexical] Search : 11.609821 change sqlite file size after "DELETE FROM table"
KeyWord [Lexical] Search : 11.609821 Delete a line from a file in java
KeyWord [Lexical] Search : 11.609821 Delete a character from a file in C
KeyWord [Lexical] Search : 11.057932 How to delete lines from file after reading it?
KeyWord [Lexical] Search : 11.057932 Delete Duplicate records from large csv file C# .Net
====================================================================================
Semantic Search : 1.6982048 remove certain tag in files under linux?
Semantic Search : 1.6897635 Removing a file in a Restricted Folder in Linux
Semantic Search : 1.6710172 Removing almost all directories and files in linux
Semantic Search : 1.6593639 Delete contents of a directory recursively on Windows
Semantic Search : 1.6564437 Maillog file in linux
Semantic Search : 1.6534 Bash: Delete until a specific file
Semantic Search : 1.6532037 How to only get file name with linux `find`?
Semantic Search : 1.6449786 linux releative path to fullpath file name
Semantic Search : 1.6393253 Delete read only files with Ant on windows
Semantic Search : 1.6357477 Delete unused files
Query : time curl http://127.0.0.1:5000/search/Difference+betwen+lucene+elasticsearch+apache+solr
Results: Type Score Question
KeyWord [Lexical] Search : 23.839565 ElasticSearch, Sphinx, Lucene, Solr, Xapian. Which fits for which usage?
KeyWord [Lexical] Search : 17.337696 Solr lucene and "similar" keywords
KeyWord [Lexical] Search : 16.623175 searching using Apache Solr
KeyWord [Lexical] Search : 16.376442 Denormalizing relational data for lucene/solr
KeyWord [Lexical] Search : 16.376442 Choosing a solr/lucene commit strategy
KeyWord [Lexical] Search : 16.376442 Version incompatibility between Lucene and Solr
KeyWord [Lexical] Search : 14.780258 What is the difference betwen including modules and embedding modules?
KeyWord [Lexical] Search : 14.780258 What is the difference betwen boost::multi_array views and subarrays
KeyWord [Lexical] Search : 14.74178 Solr/Lucene behaves weird with some word searches
KeyWord [Lexical] Search : 14.74178 how do I normalise a solr/lucene score?
=====================================================================================================================
Semantic Search : 1.8460855 ElasticSearch, Sphinx, Lucene, Solr, Xapian. Which fits for which usage?
Semantic Search : 1.8184654 Version incompatibility between Lucene and Solr
Semantic Search : 1.8119385 searching using Apache Solr
Semantic Search : 1.8102566 solr query analyzers
Semantic Search : 1.7987325 Neo4j indexing with Lucene and query with SOLR
Semantic Search : 1.7930639 What is the difference between EdgeNGramTokenizerFactory EdgeNGramFilterFactory in SOLR?
Semantic Search : 1.7795568 SOLR getting started, little help
Semantic Search : 1.7773565 Solr lucene and "similar" keywords
Semantic Search : 1.77633 Solr analyzer default type
Semantic Search : 1.7696992 How index cifs server with solr
- Case Management in Customer Service Portal
- Discussion Forum
- Given a query, find the answer in video or audio
High-level design of semantic search on the Stack Overflow question-answer dataset.