
GoldenRetriever - Information retrieval using fine-tuned semantic similarity

GoldenRetriever is part of the HotDoc NLP project, which provides a series of open-source AI tools for natural language processing. HotDoc NLP is part of the AI Makerspace program. Please visit the demo page where you will be able to query a sample knowledge base.

GoldenRetriever is a framework for an information retrieval engine (QnA, knowledge base query, etc.) that works in four steps, sketched in code after the list:

  • Step 1: The knowledge base is first separated into "documents" or clauses. Each clause is an indexed unit of information, e.g. a contract clause, a sentence, or a paragraph.
  • Step 2: The clauses (and the query) are encoded with the same encoder (InferSent, Google USE¹, or Google USE-QA²).
  • Step 3: A similarity score is computed between the query and each clause (cosine distance, arc-cosine distance, dot product, or nearest neighbors).
  • Step 4: Clauses with the highest score (or nearest neighbors) are returned as the retrieved document.
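A minimal sketch of steps 2 to 4 in NumPy, assuming a generic encode callable that maps a list of strings to an (n, d) array of embeddings; the function and argument names are illustrative, not the repository's actual API:

```python
import numpy as np

def retrieve(query, clauses, encode, k=5):
    """Return the k clauses most similar to the query.

    `encode` is any sentence encoder mapping a list of strings to an
    (n, d) array of embeddings; this interface is an assumption, not
    the repository's actual API.
    """
    # Step 2: encode the clauses and the query with the same encoder.
    clause_vecs = np.asarray(encode(clauses))   # shape (n, d)
    query_vec = np.asarray(encode([query]))[0]  # shape (d,)

    # Step 3: cosine similarity between the query and every clause.
    clause_norms = np.linalg.norm(clause_vecs, axis=1)
    scores = clause_vecs @ query_vec / (clause_norms * np.linalg.norm(query_vec))

    # Step 4: return the k highest-scoring clauses.
    top = np.argsort(scores)[::-1][:k]
    return [(clauses[i], float(scores[i])) for i in top]
```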

model_finetuning.py currently optimizes the framework for retrieving clauses from a contract or a set of terms and conditions, given a natural language query.

There is potential for fine-tuning following Yang et al.'s (2018) paper on learning textual similarity from conversations.

A fully connected layer is inserted after the clauses are encoded to maximize the dot product between the transformed clauses and the encoded query.
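As a sketch of that idea, assuming TensorFlow and the in-batch softmax formulation from Yang et al. (2018), where each query's matching clause sits on the diagonal of the score matrix; the layer and function names here are illustrative, not those in model_finetuning.py:

```python
import tensorflow as tf

dim = 512  # embedding size of the underlying encoder (assumed)

# Trainable fully connected layer applied to the encoded clauses.
response_layer = tf.keras.layers.Dense(dim, use_bias=False)

def score_matrix(query_vecs, clause_vecs):
    # Transform the clause embeddings, then take dot products
    # against every query: result has shape (m queries, n clauses).
    transformed = response_layer(clause_vecs)
    return tf.matmul(query_vecs, transformed, transpose_b=True)

def in_batch_softmax_loss(query_vecs, clause_vecs):
    # Matched (query, clause) pairs lie on the diagonal, so maximizing
    # the diagonal's softmax probability maximizes the matched dot product.
    logits = score_matrix(query_vecs, clause_vecs)
    labels = tf.range(tf.shape(logits)[0])
    return tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(
            labels, logits, from_logits=True))
```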

In the transfer learning use-case, the Google USE-QA model is further fine-tuned using a triplet cosine loss function. This pushes correct question-knowledge pairs closer together while keeping at least a margin of angular distance between questions and wrong knowledge clauses. This method can be used to overfit towards any fixed FAQ dataset without losing the semantic similarity capabilities of the sentence encoder.
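A minimal sketch of such a triplet cosine loss, again assuming TensorFlow; the margin value is an illustrative hyperparameter, not the one used in this repository:

```python
import tensorflow as tf

def triplet_cosine_loss(q, pos, neg, margin=0.3):
    """Hinge loss on cosine similarity: pull (q, pos) together and keep
    (q, neg) at least `margin` lower in similarity."""
    q = tf.math.l2_normalize(q, axis=1)
    pos = tf.math.l2_normalize(pos, axis=1)
    neg = tf.math.l2_normalize(neg, axis=1)
    sim_pos = tf.reduce_sum(q * pos, axis=1)  # cosine(q, correct clause)
    sim_neg = tf.reduce_sum(q * neg, axis=1)  # cosine(q, wrong clause)
    return tf.reduce_mean(tf.maximum(0.0, margin - sim_pos + sim_neg))
```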

Deployment

The model is served as a Flask app.

Run python app.py to launch a web interface from which you can query some pre-set documents.

To run the Flask API using Docker:

  1. Clone this repository.
  2. Build the container image: docker build -f api.Dockerfile -t goldenretriever .
  3. Run the container: docker run -p 5000:5000 -it goldenretriever
  4. Access the endpoints at http://localhost:5000; an illustrative request is sketched below.
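For example, a hypothetical request to the running container; the /query route and the payload keys are assumptions, so check app.py for the actual endpoints:

```python
import requests

# Hypothetical request: route name and JSON keys are illustrative,
# not taken from the repository's code.
resp = requests.post(
    "http://localhost:5000/query",
    json={"query": "What is covered under accidental damage?"},
)
print(resp.json())
```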

Alternatively, to run the Streamlit app using Docker:

  1. Clone this repository.
  2. Build the container image: docker build -f streamlit.Dockerfile -t goldenretriever .
  3. Run the container: docker run -p 5000:5000 goldenretriever
  4. Access the web interface on your browser by navigating to http://localhost:5000.

Testing

For comparison, we apply three sentence encoding models to the InsuranceQA corpus. Each test case consists of a question and 100 candidate answers, one or more of which are correct.

The evaluation metric is accuracy@k, where k is the number of clauses the model returns for a given query: a score of 1 indicates that the k returned clauses contain at least one correct answer, and a score of 0 indicates that none of them do.
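A minimal sketch of the metric for a single test case; averaging it over all test cases gives the scores in the table below (function and argument names are illustrative):

```python
def accuracy_at_k(ranked_answer_ids, correct_ids, k):
    """1 if any of the top-k retrieved answers is correct, else 0."""
    return int(any(a in correct_ids for a in ranked_answer_ids[:k]))
```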

| Model           | acc@1  | acc@2  | acc@3  | acc@4  | acc@5  |
|-----------------|--------|--------|--------|--------|--------|
| InferSent       | 0.083  | 0.134  | 0.1814 | 0.226  | 0.268  |
| Google USE      | 0.251  | 0.346  | 0.427  | 0.481  | 0.534  |
| Google USE-QA   | 0.387  | 0.519  | 0.590  | 0.648  | 0.698  |
| TF-IDF baseline | 0.2457 | 0.3492 | 0.4127 | 0.4611 | 0.4989 |

Footnotes

  1. Google Universal Sentence Encoder
  2. Google Universal Sentence Encoder for Question-Answer Retrieval

Acknowledgements

This project is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG-RP-2019-050). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.
