arpithparikh / chunken

CHUNK Extraction Node for RAG

CHUNKEN - CHUNK Extraction Node

CHUNKEN is designed to further process the text extracted by TEXTEN by chunking it into manageable parts and creating embeddings for these chunks. It integrates with OpenAI for generating embeddings and uses Pinecone and MongoDB for storage and retrieval of chunked data.

Key Features

  • Text Preprocessing: Cleans and preprocesses text, removing unnecessary elements like headers, footers, and URLs.
  • Chunking: Divides large text files into smaller chunks for easier processing and analysis.
  • Embeddings Generation: Uses OpenAI embeddings to generate vector representations of text chunks.
  • Storage Integration: Stores chunk metadata and embeddings in MongoDB and Pinecone for efficient retrieval.
  • Orphaned Chunk Management: Identifies and deletes orphaned chunks in Pinecone to maintain data integrity.
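One way to picture the orphaned chunk check is as a set difference between the chunk IDs held in the vector store and the chunk IDs still referenced by the metadata store. The sketch below is an illustration only: the function name `find_orphaned_ids` and the sample IDs are hypothetical, and the repository's actual logic may differ.

```python
from typing import Iterable, List


def find_orphaned_ids(vector_ids: Iterable[str], known_ids: Iterable[str]) -> List[str]:
    """Return IDs present in the vector store but absent from the metadata store."""
    known = set(known_ids)
    # Sort for a deterministic result; duplicate vector IDs collapse into one.
    return sorted({vid for vid in vector_ids if vid not in known})


# Hypothetical IDs: what Pinecone holds vs. what MongoDB still references.
pinecone_ids = ["doc1-0", "doc1-1", "doc2-0", "doc3-0"]
mongo_ids = ["doc1-0", "doc1-1", "doc3-0"]

orphans = find_orphaned_ids(pinecone_ids, mongo_ids)
print(orphans)  # ['doc2-0']
```

The resulting IDs could then be handed to the vector store's delete operation, so that Pinecone no longer holds embeddings without a matching MongoDB record.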

Prerequisites

Before you begin, ensure you have met the following requirements:

  • You have installed Python 3.7 or later.
  • You have a working internet connection.

Installation

  1. Clone the repository:

    git clone https://github.com/msuliot/chunken.git
    cd chunken
  2. Set up a virtual environment:

    python -m venv venv
    source venv/bin/activate   # On Windows, use `venv\Scripts\activate`
  3. Install the dependencies:

    pip install -r requirements.txt

Usage

To run the CHUNKEN application, use the following command:

python app.py

Configuration

The configuration is managed through a config.json file. Create a configuration file with the following structure:

{
    "input_directories": [
      "path/to/text/output/directory"
    ],
    "database": "blades-of-grass-demo",
    "namespace": "demo24",
    "chunk_size": 1800, 
    "chuck_extension_limit": 248,
    "scheduler_interval": 60
}
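The README does not define how chunk_size and chuck_extension_limit interact. A plausible reading, assumed here, is that chunk_size is a target length in characters and chuck_extension_limit is the number of extra characters a chunk may take in order to end on a sentence boundary. The following minimal sketch implements that assumed behavior; `chunk_text` is a hypothetical helper, not the project's function, and the values are shrunk so the effect is visible on a short string.

```python
import json
from typing import List

# Same keys as the config.json example; values are reduced for this demo.
config = json.loads('{"chunk_size": 40, "chuck_extension_limit": 15}')


def chunk_text(text: str, chunk_size: int, extension_limit: int) -> List[str]:
    """Split text into chunks of roughly chunk_size characters.

    Assumed behavior: a chunk may grow by up to extension_limit extra
    characters so that it ends right after a sentence-ending period.
    """
    if chunk_size <= 0 or extension_limit < 0:
        raise ValueError("chunk_size must be positive and extension_limit non-negative")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Look for a period at the nominal boundary or within the extension window.
            window_end = min(end + extension_limit, len(text))
            period = text.find(".", end - 1, window_end)
            if period != -1:
                end = period + 1
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        start = end
    return chunks


text = (
    "Chunking splits long documents into parts. "
    "Each part is embedded separately. "
    "Short parts are easier to search."
)
chunks = chunk_text(text, config["chunk_size"], config["chuck_extension_limit"])
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

With this approach no chunk exceeds chunk_size plus the extension limit, and a chunk is cut at the nominal size whenever no period falls inside the window.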

Environment variables are managed through a .env file. Create the file with the following structure:

OPENAI_API_KEY='key_here'
PINECONE_API_KEY='key_here'
MONGO="connection_string_here"
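How app.py loads the .env file is not shown here; a common choice is the python-dotenv package, which is an assumption rather than something the README confirms. Whatever loads the file, it can help to verify at startup that all three variables are present. The sketch below does this with the standard library only; `read_required_env` is a hypothetical helper, and the demo uses a stand-in mapping instead of the real process environment.

```python
import os
from typing import Dict, Iterable, Mapping

# Variable names taken from the .env example above.
REQUIRED_VARS = ("OPENAI_API_KEY", "PINECONE_API_KEY", "MONGO")


def read_required_env(
    names: Iterable[str], environ: Mapping[str, str] = os.environ
) -> Dict[str, str]:
    """Return the requested variables, raising if any is missing or empty."""
    values = {name: environ.get(name, "").strip() for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return values


# Demo with placeholder values instead of real credentials.
demo_env = {
    "OPENAI_API_KEY": "key_here",
    "PINECONE_API_KEY": "key_here",
    "MONGO": "connection_string_here",
}
settings = read_required_env(REQUIRED_VARS, demo_env)
print(sorted(settings))  # ['MONGO', 'OPENAI_API_KEY', 'PINECONE_API_KEY']
```

Failing early with a message that names the missing variables is usually easier to diagnose than an authentication error raised later by one of the services.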

Languages

Language:Python 100.0%