VeenDuco / appl-docchat

Chat with your own documents pressure cooker

Home page: https://github.com/pbl-nl


Chat with your docs!

A RAG (Retrieval-Augmented Generation) setup for further exploration of chatting with company documents



How to use this repo

NB: This repo has been tested on a Windows platform

Preparation

  1. Clone this repo to a folder of your choice
  2. Create a subfolder vector_stores in the root folder of the cloned repo
  3. Create a file .env and enter your OpenAI API key in the first line of this file:
    OPENAI_API_KEY="sk-....."
    Save and close the .env file
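
The application presumably loads this key at startup. As an illustration, here is a minimal sketch of how a .env file is typically read in Python, assuming the python-dotenv package; the repo's own loading code may differ:

# Minimal sketch: load the OpenAI API key from .env with python-dotenv.
# Assumes python-dotenv is installed; not necessarily how this repo does it.
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment
openai_api_key = os.environ["OPENAI_API_KEY"]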

Conda virtual environment setup

  1. Open an Anaconda prompt or other command prompt
  2. Go to the root folder of the project and create a Python environment with conda, using the command
    conda env create -f appl-docchat.yml
    NB: The environment is named appl-docchat by default. This can be changed to a name of your choice in the first line of the yml file
  3. Activate this environment using the command
    conda activate appl-docchat

Pip virtual environment setup

  1. Open an Anaconda prompt or other command prompt
  2. Go to the root folder of the project and create a Python virtual environment using the command
    python -m venv venv
    This creates a basic virtual environment folder named venv in the root of your project folder. NB: The environment is named venv here; it can be changed to a name of your choice
  3. Activate this environment using the command
    venv\Scripts\activate
  4. All required packages can now be installed with the command
    pip install -r requirements.txt

Ingesting documents

The file ingest.py can be used to vectorize all documents in a chosen folder and store the vectors and texts in a vector database for later use.
Run it in the activated virtual environment with the command:
python ingest.py
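
For orientation, here is a hedged sketch of what such an ingestion step typically looks like in a LangChain-based RAG setup; the loader, splitter, and vector store choices are assumptions, not the exact contents of ingest.py. The chunk settings mirror the defaults mentioned in the user stories below:

# Hedged sketch of a typical LangChain ingestion pipeline; ingest.py in this
# repo may differ. Assumes the langchain, pypdf and chromadb packages.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

docs = PyPDFLoader("docs/example.pdf").load()    # load one PDF into Documents
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)          # cut into overlapping chunks
Chroma.from_documents(chunks, OpenAIEmbeddings(),  # embed and persist vectors
                      persist_directory="vector_stores/example")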

Querying documents

The file query.py can be used to query any folder with documents, provided that the associated vector database exists.
Run it in the activated virtual environment with the command:
python query.py
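
A hedged sketch of the querying side in the same LangChain idiom; again an assumption about the internals, not a copy of query.py:

# Hedged sketch of querying a persisted vector store; query.py may differ.
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

store = Chroma(persist_directory="vector_stores/example",
               embedding_function=OpenAIEmbeddings())
qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(model_name="gpt-3.5-turbo"),
                                 retriever=store.as_retriever())
print(qa.run("What is this document about?"))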

Ingesting and querying documents through a user interface

The functionalities described above can also be used through a User Interface.
The UI can be started with the command:
streamlit run streamlit_app.py
When this command is used, a browser session opens automatically.
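
To give an idea of the wiring, here is a minimal placeholder sketch of a Streamlit front end; the actual streamlit_app.py is more elaborate and calls into the same querying logic as query.py:

# Minimal placeholder sketch of a Streamlit front end; streamlit_app.py in
# this repo is more elaborate and wired to the real querying logic.
import streamlit as st

st.title("Chat with your docs")
question = st.text_input("Ask a question about the ingested documents")
if question:
    st.write(f"You asked: {question}")  # placeholder for the generated answer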

Evaluation of Question Answer results

The file evaluate.py can be used to evaluate the generated answers for a list of questions, provided that the file eval.json exists. This file must contain not only the list of questions but also the corresponding list of desired answers (ground truth).
Evaluation is done in the activated virtual environment with the command:
python evaluate.py
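
The exact schema of eval.json is defined by this repo; purely as an illustration, a hypothetical structure pairing questions with ground-truth answers per document folder could look like this (shown as a Python literal):

# Hypothetical illustration of the kind of content eval.json holds; the real
# schema expected by evaluate.py may differ.
import json

example = {
    "my_folder": {
        "questions": ["What is the main topic of the report?"],
        "ground_truth": ["The report describes the yearly emission figures."],
    }
}
print(json.dumps(example, indent=2))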

Monitoring the evaluation results through a user interface

All evaluation results can be viewed by using a dedicated User Interface.
This evaluation UI can be started with the command:
streamlit run streamlit_evaluate.py
When this command is used, a browser session opens automatically.

User Stories for improvements

User stories are divided into 2 groups:

  • RESEARCH: these user stories are not meant to change any code; they require research and prepare for an actual BUILD task.
  • BUILD: these user stories change the code. They add functionality to the application or are performance-related.

Furthermore, every user story below has an indication of whether it extends the functionality of the application (FUNC) or is aimed at optimizing the results (EVAL).
User stories are written from the perspective of either the user of the application, or the developer of the application.

  1. For everyone: add your folder name and the list of questions and ground-truth answers to eval.json.
  2. Ingestion (1): As a user I want to synchronize the vector database with the document folder I am using. If the document folder has changed (extra file(s) or deleted file(s)), either add extra documents to the vector database or delete documents from the vector database. FUNC, BUILD
  3. Ingestion (2): As a user I want to query not only PDFs, but also other file types containing text, such as Word documents, plain text files, and HTML pages. FUNC, BUILD
  4. Ingestion (3): As a developer I want to determine the optimal settings for chunking. Current settings are chunk size = 1000 and chunk overlap = 200 (see settings.py). Can we do some tests with evaluation documents and find an optimal chunk size and overlap? EVAL, BUILD
  5. Ingestion (3): As a developer I want to use an optimal set of chunks. Can we implement content-aware text chunking, keeping related content together in one chunk (up to a maximum chunk size)? EVAL, BUILD. For inspiration: https://github.com/nlmatics/llmsherpa#layoutpdfreader
  6. Ingestion (4) & Retrieval (7): As a developer I want to generate the best answers to the user questions. The application currently uses OpenAI's text-embedding-ada-002 as embedding model, which is not the best one according to the Hugging Face MTEB embedding leaderboard (see https://huggingface.co/spaces/mteb/leaderboard). Implement an alternative embedding model and evaluate any change in performance. EVAL, BUILD
  7. Retrieval (9): As a user I don't want the chatbot to hallucinate. Add a lower bound for the similarity score to filter out text chunks. If none of the text chunks reaches the lower bound value, answer "I don't know" (in the language of the user); see the sketch after this list. EVAL, BUILD
  8. Retrieval (9): As a user I want to know which returned chunks are the preferred ones. Add the similarity score of each chunk to the sources in the response and rank the chunks according to their similarity scores. FUNC, BUILD
  9. Retrieval (11): As a developer I want to evaluate the impact of switching the LLM from gpt-3.5 to gpt-4. EVAL, BUILD
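
As a concrete starting point for user story 7, below is a hedged sketch of similarity-score filtering; the threshold value and function name are illustrative assumptions, not taken from this repo:

# Hedged sketch for user story 7: keep only chunks that reach a similarity
# threshold and fall back to "I don't know". The threshold is an assumption.
SCORE_THRESHOLD = 0.75

def retrieve_with_threshold(store, question: str) -> str:
    # LangChain vector stores expose similarity_search_with_relevance_scores,
    # which returns (Document, score) pairs with scores normalized to [0, 1].
    results = store.similarity_search_with_relevance_scores(question, k=4)
    good = [doc for doc, score in results if score >= SCORE_THRESHOLD]
    if not good:
        return "I don't know"
    # In the real application this context would be passed on to the LLM.
    return "\n\n".join(doc.page_content for doc in good)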

References

This repo is mainly inspired by:
