langchain-cohere-qdrant-doc-retrieval
This Flask backend API accepts documents in multiple formats (.txt, .docx, .pptx, .jpg, .png, .eml, .html, and .pdf) and lets you run semantic search over them in the 100+ languages supported by the Cohere multilingual API. The embeddings are stored in a Qdrant vector database.
Setup
The following steps explain how to run the application on macOS or Linux.
Prerequisites
- Python 3
- Git
- virtualenv
- Homebrew
Installation
- Clone the repository
git clone https://github.com/menloparklab/langchain-cohere-qdrant-doc-retrieval docQA
- Change into the directory
cd docQA
- Create and activate a virtual environment
python3 -m venv env
source env/bin/activate
- Install the required packages
pip install -r requirements.txt
Unstructured depends on Detectron2, which must be installed separately:
pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"
- Install Homebrew
Follow the installation guide on the Homebrew website.
- Install the following brew packages
brew install libmagic poppler tesseract libxml2 libxslt
- Create a .env file and set the following environment variables:
cohere_api_key="insert here"
openai_api_key="insert here"
qdrant_url="insert here"
qdrant_api_key="insert here"
Replace the values with your own API keys and Qdrant URL.
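Before starting the server, it can help to confirm that every variable has been filled in. The following is a minimal sketch that uses only the Python standard library; the helper names parse_env_text and missing_keys are hypothetical and not part of this repository, and the parsing assumes the simple KEY="value" layout shown above.

```python
import os

# The four variable names listed in the .env step above.
REQUIRED_KEYS = ("cohere_api_key", "openai_api_key", "qdrant_url", "qdrant_api_key")


def parse_env_text(text):
    """Parse KEY="value" lines into a dict, skipping blanks and comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes so "abc" and abc are treated the same.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, placeholder="insert here"):
    """Return required keys that are absent, empty, or still the placeholder."""
    return [k for k in REQUIRED_KEYS if values.get(k, "") in ("", placeholder)]


if __name__ == "__main__":
    if os.path.exists(".env"):
        with open(".env", encoding="utf-8") as fh:
            print("Missing or unset keys:", missing_keys(parse_env_text(fh.read())))
    else:
        print("No .env file found in the current directory")
```

Running the script from the project directory lists any keys that are still unset; an empty list means all four values have been provided.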
Qdrant URL and API key
Sign up for a free cloud-based Qdrant account and create a new cluster. You can then obtain the qdrant_url and qdrant_api_key values used in the .env file above.
- Run the application using the following command:
gunicorn app:app
- Access the API endpoints
The API endpoints will be live at the following routes:
/embed
/retrieve
Conclusion
You have successfully installed and run the DocQA system on your local machine. Feel free to explore the code and adapt it to your requirements.
Connecting to a frontend
The deployed API endpoints, /embed and /retrieve, can now be called from any frontend application. Bubble users can watch this video for detailed instructions.
Include the following header in every API request: "Content-Type": "application/json"
JSON body for /embed:
{ "collection_name": "{collection_name}", "file_url": "{file_url}" }
JSON body for /retrieve:
{ "collection_name": "{collection_name}", "query": "{query}" }
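As an illustration of the request shapes above, here is a minimal Python client sketch built on the standard library. It rests on assumptions that are not confirmed by this README: that both routes accept POST, and that the server is reachable at gunicorn's default bind address (127.0.0.1:8000). The helper names are hypothetical, and the response is returned as raw text because the response format is not documented here.

```python
import json
import urllib.request

# Assumption: gunicorn's default bind address; change it for a deployed server.
BASE_URL = "http://127.0.0.1:8000"


def build_request(route, payload, base_url=BASE_URL):
    """Build a JSON request for one of the API routes."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + route,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # assumption: the routes accept POST
    )


def embed_request(collection_name, file_url, base_url=BASE_URL):
    """Request that asks the API to embed the document at file_url."""
    payload = {"collection_name": collection_name, "file_url": file_url}
    return build_request("/embed", payload, base_url)


def retrieve_request(collection_name, query, base_url=BASE_URL):
    """Request that runs a semantic search query against a collection."""
    payload = {"collection_name": collection_name, "query": query}
    return build_request("/retrieve", payload, base_url)


def send(request, timeout=60):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the server running, a call such as send(retrieve_request("my_docs", "your question here")) returns the body of the /retrieve response; the collection name and query in that call are placeholders.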
For Bubble users
Embed JSON for Bubble:
{ "collection_name": "<collection_name>", "file_url": "<file_url>" }
Retrieve JSON for Bubble:
{ "collection_name": "<collection_name>", "query": "<query>" }
Feel free to reach out on Twitter if you have any questions.