adammikulis / DULlama

DULlama is a simple local RAG pipeline using a single script and library

This is the first iteration of a bare-bones local RAG pipeline in C# (requires the .NET 8.0 runtime) using the LLamaSharp library. The program builds a basic vector database pairing the original text with embeddings generated by the chosen LLM. When the user submits a query, it is matched against the database using cosine similarity, and the top results (the original text) are injected into the LLM's context to ground its response. In testing, this immediately corrected wrong answers that the same LLM produced on its own without RAG. I originally wrote this for my employer, which is why the prompts are specific to that data, but you can adapt it to any dataset.
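
For reference, the matching step boils down to cosine similarity over the stored embeddings. A minimal sketch of that idea (the `Entry` and `VectorDb` names here are illustrative, not the actual types in Program.cs):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One row of the vector database: the original text plus its embedding.
public record Entry(string Text, float[] Embedding);

public static class VectorDb
{
    // Cosine similarity: dot(a, b) / (|a| * |b|)
    public static double CosineSimilarity(float[] a, float[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Return the original text of the top-k entries closest to the query embedding.
    public static IEnumerable<string> TopMatches(IEnumerable<Entry> db, float[] query, int k = 3) =>
        db.OrderByDescending(e => CosineSimilarity(e.Embedding, query))
          .Take(k)
          .Select(e => e.Text);
}
```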

All Llama 2 and Mistral 7B models at higher quantization (q) levels require a minimum of 16GB of RAM for CPU inference (8GB of VRAM if using the CUDA backend on a GPU). Models with smaller bit quantization need less memory and processing at the cost of accuracy (the smallest models can run on a GTX 1060). This program has not been tested with 13B, 34B, 70B, or Mixtral variants, but it should be compatible if your hardware can support larger models. The model path in Program.cs points to a general "C:\ai\models" folder; change it to wherever you keep your model downloads.
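
If you relocate your models, only the path needs to change. A minimal loading sketch using LLamaSharp's `ModelParams`/`LLamaWeights` API (the filename and parameter values below are illustrative and may differ from what Program.cs actually uses):

```csharp
using LLama;
using LLama.Common;

// Change this to wherever you keep your GGUF downloads (filename is illustrative).
var modelPath = @"C:\ai\models\mistral-7b-instruct-v0.2.Q4_K_M.gguf";

var parameters = new ModelParams(modelPath)
{
    ContextSize = 4096,   // prompt + response window
    GpuLayerCount = 32    // set to 0 for CPU-only inference
};

using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);
var executor = new InteractiveExecutor(context);
```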

The current backend is CUDA 12; if you don't already have version 12.1 installed, download the toolkit here: https://developer.nvidia.com/cuda-toolkit-archive

I currently recommend Mistral v0.2 over Llama 2 due to its faster inference, smaller memory footprint, and superior benchmark performance.

Mistral 7B downloads:

Math-enhanced: https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-DARE-GGUF

Coding: https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-code-ft-GGUF

General instruct: https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF

Llama 2 7B downloads:

Coding: https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GGUF

General: https://huggingface.co/TheBloke/Llama-2-7B-GGUF

Future goals:

  • Add processing for additional files/datatypes (sqr, sql, pdf, etc.)
  • Allow the LLM to process the entire conversation while still focusing on the most recent prompt to avoid repeating answers
  • Integrate some telemetry, such as tokens/second (see the sketch after this list)
  • Build a better UI
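
For the telemetry goal, a rough sketch of measuring tokens/second with a `Stopwatch`, continuing from the loading sketch above; counting yielded chunks only approximates the true token count:

```csharp
using System;
using System.Diagnostics;
using LLama.Common;

// 'executor' is the InteractiveExecutor from the model-loading sketch above.
var prompt = "What is DULlama?"; // illustrative prompt
var sw = Stopwatch.StartNew();
int tokens = 0;

await foreach (var chunk in executor.InferAsync(prompt, new InferenceParams { MaxTokens = 256 }))
{
    Console.Write(chunk);
    tokens++; // each yielded chunk is roughly one token
}

sw.Stop();
Console.WriteLine($"\n{tokens / sw.Elapsed.TotalSeconds:F1} tok/s");
```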

License: MIT
