chromadb pdf-processing similarity-search text-extraction vector-embeddings

PDF to Text Chroma Search

Python scripts that converts PDF files to text, splits them into chunks, and stores their vector representations using GPT4All embeddings in a Chroma DB. It also provides a script to query the Chroma DB for similarity search based on user input.

Requirements

Python 3.x
PyPDF2
chromadb
langchain

Installation

Clone the repository:

git clone https://github.com/your-username/pdf-to-text-chroma-search.git

Install the required dependencies:

pip install PyPDF2 chromadb langchain

Usage

Script 1: Convert PDFs to text, split into chunks, and store in Chroma DB

Place your PDF files in the input directory.
Run the following command to convert the PDFs to text, split them into chunks, and store their vector representations in the Chroma DB:

python write_script.py

Script 2: Load Chroma DB and query user input

Run the following command to load the Chroma DB and query user input:

python read_script.py

Enter your query when prompted.

About

Python scripts that converts PDF files to text, splits them into chunks, and stores their vector representations using GPT4All embeddings in a Chroma DB. It also provides a script to query the Chroma DB for similarity search based on user input.

chromadb pdf-processing similarity-search text-extraction vector-embeddings

Languages

Language:Python 100.0%