xfer / rag-stack

🤖 Deploy a private ChatGPT alternative hosted within your VPC. 🔮 Connect it to your organization's knowledge base and use it as a corporate oracle. Supports open-source LLMs like Llama 2, Falcon, and GPT4All.

🧺 RAGstack

Retrieval-augmented generation (RAG) is a technique where the capabilities of a large language model (LLM) are augmented by retrieving information from other systems and inserting it into the LLM’s context window via a prompt. This gives the LLM information beyond what was provided in its training data, which is necessary for almost every enterprise use case. Examples include data from current web pages, data from SaaS apps like Confluence or Salesforce, and data from documents like sales contracts and PDFs.

For giving a model access to this kind of information, RAG generally works better than fine-tuning: it’s cheaper, it’s faster, and it’s more reliable, since the source of the information is provided with each response.
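The retrieve-then-prompt flow described above can be sketched in a few lines of Python. This is a minimal illustration, not RAGstack's implementation: the `Passage` type, the word-overlap `score` function, the prompt wording, and the sample passages are all hypothetical stand-ins. A real deployment would score passages with embeddings stored in a vector database.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source: str  # where the text came from, e.g. a file name (hypothetical)
    text: str


def score(query: str, passage: Passage) -> int:
    """Toy relevance score: how many query words also appear in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.text.lower().split())
    return len(query_words & passage_words)


def retrieve(query: str, passages: list[Passage], k: int = 2) -> list[Passage]:
    """Return the k passages with the highest toy score."""
    return sorted(passages, key=lambda p: score(query, p), reverse=True)[:k]


def build_prompt(query: str, retrieved: list[Passage]) -> str:
    """Insert the retrieved passages, labeled by source, ahead of the question."""
    context = "\n".join(f"[{p.source}] {p.text}" for p in retrieved)
    return (
        "Answer using only the context below and cite the sources.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


# Made-up sample data, for illustration only.
passages = [
    Passage("contract.pdf", "The contract renews every 12 months."),
    Passage("wiki/onboarding", "New hires receive a laptop on day one."),
    Passage("crm/notes", "The customer asked about contract renewal terms."),
]
question = "When does the contract renew?"
prompt = build_prompt(question, retrieve(question, passages))
print(prompt)
```

Because each passage carries its source label into the prompt, the model can point back to where an answer came from, which is the reliability benefit mentioned above.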

RAGstack deploys the following resources for retrieval-augmented generation:

Open-source LLM

  • GPT4All: When you run locally, RAGstack will download and deploy Nomic AI's gpt4all model, which runs on consumer CPUs.

  • Falcon-7b: On the cloud, RAGstack deploys Technology Innovation Institute's falcon-7b model onto a GPU-enabled GKE cluster.

  • Llama 2: On the cloud, RAGstack can also deploy the 7B-parameter version of Meta's Llama 2 model onto a GPU-enabled GKE cluster.

Vector database

  • Qdrant: Qdrant is an open-source, self-hostable vector database written in Rust and built for high performance.
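To make the role of the vector database concrete, here is a rough sketch of the core operation it provides: finding the stored vectors closest to a query vector. This is plain Python, not Qdrant's API, and the tiny three-dimensional vectors and document names are invented for illustration. A production vector database uses an approximate index rather than the brute-force scan shown here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vector: list[float], index: dict[str, list[float]], limit: int = 1) -> list[str]:
    """Brute-force nearest-neighbor search: rank every stored vector by similarity."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:limit]]


# Hypothetical embeddings; real ones have hundreds of dimensions.
index = {
    "pricing.pdf": [0.9, 0.1, 0.0],
    "handbook.pdf": [0.0, 0.2, 0.9],
}
nearest = search([1.0, 0.0, 0.1], index)
print(nearest)
```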

Server + UI

A simple server and UI that handle PDF upload, so that you can chat over your PDFs using Qdrant and the open-source LLM of your choice.
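Before uploaded PDF text can be searched, it is typically split into smaller pieces that are embedded and stored one by one. How RAGstack does this is not described here, so the following is only a sketch of one common approach, fixed-size character chunks with overlap. The function name, chunk size, and overlap are assumptions.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at
    a chunk boundary still appears whole in one of the two chunks.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


# Stand-in for text extracted from an uploaded PDF.
page_text = "RAGstack lets you chat over your PDFs. " * 10
chunks = chunk_text(page_text, chunk_size=100, overlap=20)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk would then be embedded and written to the vector database along with a reference to the PDF it came from.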

Run locally

To run locally, run ./run-dev. This will download ggml-gpt4all-j-v1.3-groovy.bin into server/llm/local/ and run the server, LLM, and Qdrant vector database locally.

All services will be ready once you see the following message:

INFO:     Application startup complete.
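If you start the stack from a script, you may want to block until that message appears. The helper below is a small sketch that scans any stream of log lines for the marker. The function name and the earlier sample lines are hypothetical; only the final marker line comes from the output shown above.

```python
from typing import Iterable

READY_MARKER = "Application startup complete."


def wait_until_ready(lines: Iterable[str]) -> bool:
    """Consume log lines until the ready marker appears.

    Returns True as soon as the marker is seen, or False if the stream
    ends without it.
    """
    for line in lines:
        if READY_MARKER in line:
            return True
    return False


# Hypothetical log stream; only the last line is taken from the README.
sample_log = [
    "(earlier output from the services)",
    "(more earlier output)",
    "INFO:     Application startup complete.",
]
ready = wait_until_ready(sample_log)
print(ready)
```

In practice the lines could come from the standard output of a `subprocess.Popen` handle wrapping the run script, read line by line.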

Deploy to Google Cloud

To deploy the RAG stack using Falcon-7B running on GPUs to your own Google Cloud project, go through the following steps:

  1. Run ./deploy-gcp.sh. This will prompt you for your GCP project ID, service account key file, and region.
  2. If you get an error on the Falcon-7B deployment step, run the following commands and then run ./deploy-gcp.sh again:
     gcloud config set compute/zone YOUR-REGION-HERE
     gcloud container clusters get-credentials gpu-cluster
     kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

The deployment script was implemented using Terraform.

  3. You can run the frontend by creating a .env file in ragstack-ui and setting VITE_SERVER_URL to the URL of the ragstack-server instance in Google Cloud Run.
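The frontend step comes down to a one-line .env file. As a sketch, the helper below writes that line; the function name is hypothetical and the URL is a placeholder for your own ragstack-server address. The example writes into a temporary directory so it does not touch a real checkout; pass the path to ragstack-ui to use it for real.

```python
import tempfile
from pathlib import Path


def write_ui_env(ui_dir: Path, server_url: str) -> Path:
    """Write a .env file in ui_dir that sets VITE_SERVER_URL."""
    env_path = ui_dir / ".env"
    env_path.write_text(f"VITE_SERVER_URL={server_url}\n", encoding="utf-8")
    return env_path


# Placeholder URL; replace with the URL of your ragstack-server instance.
with tempfile.TemporaryDirectory() as tmp:
    env_file = write_ui_env(Path(tmp), "https://ragstack-server.example.com")
    env_contents = env_file.read_text(encoding="utf-8")

print(env_contents)
```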

Roadmap

  • ✅ GPT4all support
  • ✅ Falcon-7b support
  • ✅ Deployment on GCP
  • 🚧 Llama-2-40b support
  • 🚧 Deployment on AWS

Credits

The code for containerizing Falcon 7B is from Het Trivedi's tutorial repo. Check out his Medium article on how to dockerize Falcon here!

License: MIT License


Languages

  • Python 75.1%
  • TypeScript 7.8%
  • HCL 6.5%
  • Shell 4.7%
  • Dockerfile 2.4%
  • CSS 1.8%
  • JavaScript 0.9%
  • Makefile 0.4%
  • HTML 0.3%