
rag-python

ML/LLM experiments with LlamaIndex to develop a personal assistant for cognitively impaired patients.

Index

  • Setup (Python scripts, Ollama)
  • Experiments
  • Datasets

Setup

Python scripts

PyTorch does not support Python 3.11, so the application must run with Python 3.10.

Install dependencies:

pip install -r requirements.txt
# additional dependencies
pip install pypdf fastembed chromadb accelerate streamlit langchainhub

In some scripts, the value of device must match the available hardware:

  • auto is the default
  • cpu always works, but processing will be very slow
  • mps uses the GPU on Apple Silicon (M1)
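
As an illustrative sketch (the model name and prompt are placeholders, not taken from the repository scripts), the device value can be applied when loading a model with Hugging Face transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

device = "auto"  # or "cpu", or "mps" on Apple Silicon

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)

if device == "auto":
    # let accelerate place the layers on the best available devices
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
else:
    # move the whole model to the requested device (cpu or mps)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("Hello!", max_new_tokens=32)[0]["generated_text"])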

Some scripts connect to HuggingFace. Set the following environment variables:

  • HF_HOME=path: path of the HuggingFace cache
  • HUGGING_FACE_HUB_TOKEN=token: HuggingFace token to download the models
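
For illustration only (the cache path and token below are placeholders), the same variables can also be set from Python, as long as this happens before the Hugging Face libraries are imported:

import os

# placeholder values: replace with the real cache path and token
os.environ["HF_HOME"] = "/path/to/huggingface-cache"
os.environ["HUGGING_FACE_HUB_TOKEN"] = "hf_xxxxxxxx"

# import the libraries only after the variables are set
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")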

Ollama

Ollama is required for most scripts.

Install ollama:

brew install ollama
mkdir -p ~/.ollama
# optional: store the models on an external drive
ln -s "/{PATH}/ollama" ~/.ollama/models
ollama serve
ollama pull mistral
ollama pull llama2

Run it:

ollama serve
ollama run llama2

That should open an interactive shell to chat with Llama2.
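
The Python scripts talk to the same local server. A minimal sketch, assuming the default endpoint http://localhost:11434 and the requests package, to verify from Python that the server answers:

import requests

# default local Ollama endpoint; adjust if the server runs elsewhere
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Say hello in one sentence.", "stream": False},
    timeout=120,
)
print(response.json()["response"])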

Experiments

  • 01-use-local-knowledge: basic experiment using llama-index and Llama to index and query a dataset (see the sketch after this list).
  • 02-chat-bot: experiment using ollama/llama2 + streamlit/langchain/chromadb to discuss a PDF with the LLM.
  • 03-fine-tuning: experiment fine-tuning BERT with a dataset of reviews.
  • 04-training-with-colab: same as 03, but using Colab.
  • 05-create-a-bio: generate knowledge with LLMs and use the results to build the knowledge base for further iterations.
  • 06-sentence-split: evaluates how SentenceSplitter works.
  • 07-rag-pipeline: variation of 06-sentence-split.
  • 08-query-chroma: test to verify how Chroma retrieves knowledge based on queries and filters.
  • 09-refiner: utilisation of LLMs to re-rank results from the vector database.
  • 10-keywords-extraction: methods to extract keywords (or key-phrases) from a text.
  • 11-query-chroma-with-kw: use keywords to pre-filter the nodes returned by a query.
  • 12-faiss: alternative indexing techniques using the FAISS library.
  • 13-ingest-ebook: comparison between two extractors in order to parse a medical book in PDF.
  • 14-smarter-ingest: extension of the SimpleDirectoryReader with enhanced PDF processing.
  • 15-diagnosis: attempts to define the probability of a diagnosis based on dialogs.
  • 16-relevant: find relevant questions to diagnose a disease.
  • 17-better-dialogs: attempt to improve the dialogs with RAG.
  • 18-translate-in-you-form: translate a diagnosis into a dialog directed to the patient.
  • 19-elastic: multilevel indexing of PDFs storing the embeddings in Elasticsearch.
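
A minimal sketch of the pattern behind 01-use-local-knowledge, assuming a recent llama-index release with the Ollama and HuggingFace embedding integrations installed and a running Ollama server (the dataset path and question are placeholders):

from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

# use the local Ollama server for generation and a local model for embeddings
Settings.llm = Ollama(model="llama2", request_timeout=120.0)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# load one of the datasets and build an in-memory vector index
documents = SimpleDirectoryReader("datasets/bio").load_data()  # placeholder path
index = VectorStoreIndex.from_documents(documents)

# query the indexed knowledge
response = index.as_query_engine().query("Where was she born?")
print(response)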

Datasets

  • bio: the bio of a fictional woman generated by 05-create-a-bio.
  • bio-single-file: like bio but in a single file.
  • dementia-wiki-txt: an extract of the Wikipedia page about dementia.
  • dementia-wiki-polluted: same as dementia-wiki-txt, but polluted with a sentence claiming a relation between dementia and alien kidnapping (used to study hallucinations).
  • TwentyThousandLeaguesUnderTheSea: Twenty Thousand Leagues Under the Seas by Jules Verne. Source: https://www.gutenberg.org/
  • gutenberg: five books from https://www.gutenberg.org/. On the Origin of Species By Means of Natural Selection by Charles Darwin, Paradise Lost by John Milton, The Fall of the House of Usher by Edgar Allan Poe, The Republic by Plato, and Twenty Thousand Leagues under the Sea by Jules Verne.
