shivaram/factsheet-generator

About The Project

The factsheet generator uses Retrieval-Augmented Generation (RAG), running on a distributed cluster, to extract key facts from a dataset of documents.

System Design

RAG overview

In the RAG pipeline, each document is first split into chunks of 500-1000 tokens. An embedding is then generated for each chunk, and the chunk-embedding pairs are stored in a PostgreSQL database with the pgvector extension. Once the entire dataset has been processed, a similarity search over the stored embeddings retrieves the chunks most relevant to a given query.
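The chunk, embed, store, and search steps can be sketched in a few lines. This is a minimal illustration, not the project's implementation: it assumes whitespace-delimited tokens, uses a word-count vector (`toy_embed`, a hypothetical stand-in for the real embedding model), and keeps the chunk-embedding pairs in an in-memory list rather than in pgvector.

```python
import math
from collections import Counter


def split_into_chunks(text, chunk_size=500):
    """Split text into chunks of at most chunk_size whitespace-delimited tokens."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


def toy_embed(text):
    """Stand-in embedding (assumption): a sparse word-count vector.

    A real pipeline would call an embedding model here.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, store, k=3):
    """Return the k chunks whose embeddings are most similar to the query.

    `store` is a list of (chunk, embedding) pairs, standing in for the
    pgvector table described above.
    """
    query_embedding = toy_embed(query)
    ranked = sorted(
        store,
        key=lambda pair: cosine_similarity(query_embedding, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:k]]
```

With a real vector database the ranking would happen inside the database rather than in Python; the sketch only shows what the similarity search computes.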

Queries are written beforehand for each fact that should be extracted. For each stratigraphic unit, the retrieved data and the corresponding query are combined into a prompt, from which an LLM generates the fact.
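As a rough sketch of the prompt-assembly step, the function below combines a unit name, one query, and the retrieved chunks into a single prompt string. The query texts in `FACT_QUERIES`, the prompt wording, and the function name `build_prompt` are all hypothetical; the project's actual queries and prompt template are not shown here.

```python
# Hypothetical examples of pre-written queries, one entry per fact.
FACT_QUERIES = {
    "lithology": "What is the dominant lithology of this unit?",
    "age": "What is the geologic age of this unit?",
}


def build_prompt(unit_name, query, context_chunks):
    """Combine the retrieved chunks and a query into one LLM prompt."""
    # Number the chunks so the model's answer can be traced to its source.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        f"Stratigraphic unit: {unit_name}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer using only the context above."
    )
```

Calling `build_prompt` once per entry in `FACT_QUERIES` would yield one prompt, and so one generated fact, per query for a given unit.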

System overview

Worker nodes running the LLM and the embedding model are distributed across the COSMOS machines using Docker Swarm. A master node delegates tasks to the workers via gRPC requests. The embeddings and factsheets generated by the worker nodes are sent to a container running PostgreSQL with pgvector.
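The delegation step can be illustrated with a small scheduling sketch. The scheduling policy is not described above, so round-robin assignment is an assumption, and the function only computes which worker would receive which task; in the real system each assignment would correspond to a gRPC request from the master node. The name `assign_tasks` is hypothetical.

```python
from itertools import cycle


def assign_tasks(tasks, workers):
    """Assign each task to a worker in round-robin order (assumed policy).

    Returns a dict mapping each worker to its list of tasks.
    """
    if not workers:
        raise ValueError("at least one worker is required")
    assignments = {worker: [] for worker in workers}
    # cycle() repeats the worker list so tasks are spread evenly.
    for task, worker in zip(tasks, cycle(workers)):
        assignments[worker].append(task)
    return assignments
```

For example, three documents spread over two workers give the first worker two documents and the second worker one.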
