The sky doesn't care
What my poor heart wants And the desert Can't hear my cries The moon doesn't mind that I'm left all alone And she's gone, gone
Well, that was a fun keyboard test!
Craft a digital replica of myself by extracting insights from my WhatsApp conversations, employing the Mistral 7B algorithm.
This repository lets you create an AI chatbot clone of yourself, using your WhatsApp chats as training data. The default model used is Mistral-7B-Instruct-v0.2. The code in this repository heavily builds upon llama-recipes (https://github.com/facebookresearch/llama-recipes), where you can find more examples on different things to do with llama models.
This repository includes code to:
- Preprocess exported WhatsApp chats into a suitable format for finetuning
- Finetune a model on your WhatsApp chats, using 4-bit quantized LoRa
- Chat with your finetuned AI clone, via a commandline interface
- Clone this repository
- To install the required dependencies, run the following commands from inside the cloned repository:
pip install -U pip setuptools
pip install --extra-index-url https://download.pytorch.org/whl/test/cu118 -e .
To prepare your WhatsApp chats for training, follow these steps:
- Export your WhatsApp chats as .txt files. This can be done directly in the WhatsApp app on your phone, for each chat individually. You can export just one .txt from a single chat or many .txt files from all your chats. Unfortunately, formatting seems to vary between regions. I am based on Europe, so the regex in the preprocessing.py might have to be adjusted if you are based in a different region.
- Copy the .txt files you exported into
data/preprocessing/raw_chats/train
. If you want to have a validation set, copy the validation chats intodata/preprocessing/raw_chats/validation
. - Run
python data/preprocessing/preprocess.py
. This will convert your raw chats into a format suitable for training and save CSV files todata/preprocessing/processed_chats
Run python -m finetuning --dataset "custom_dataset" --custom_dataset.file "scripts/custom_dataset.py" --whatsapp_username "[insert whatsapp_username]"
. Where you replace[insert whatsapp_username]
with your name, as it appears in the exported .txt files from WhatsApp. This is necessary to assign the correct role of "assistant" and "user" for training.
This will first download the base model from Huggingface, if necessary, and then start a LoRa finetune with 4-bit quantization.
The config for the training can be set in configs/training.py
. In particular, you can enable evaluation on the validation set after each epoch by setting run_validation: bool=True
After successful finetuning, run python3 commandline_chatbot.py --peft_model [insert checkpoint folder] --model_name mistralai/Mistral-7B-Instruct-v0.2
, where [insert checkpoint folder]
should be replaced with the output directory you specified in configs/training.py
. Default is checkpoints
folder.
Running this command loads the finetuned model and let's you have a conversation with it in the commandline.
You can define your own system prompt by changing the start_prompt_english
prompt text in the commandline_chatbot.py
file.
At least 22GB vRAM required for Mistral 7B finetune. I ran the finetune on a RTX 4060. When experimenting with other models, vRAM requirement might vary. vRAM requirements could probably be significantly reduced with some optimizations (maybe unsloth.ai).
- If your custom_dataset.py script is not finishing or throws a recursion error, you may need to increase the maximum number of recursions in
train.py
. Please editsys.setrecursionlimit(50000)
. - The system language when exporting chats should probably be English. Some data cleaning steps depend on this.
- Finetuning works best with English chats. If your chats are in another language, you may need to adjust the preprocessing and training parameters accordingly.
- This code should also work with other models than Mistral 7B, but I haven’t tried it myself. If you want to experiment with different models, you can find them here.
- In my finetunes, I have not exported group chats from WhatsApp. I am unsure if this would also work or cause any errors.
It looks like this project could be useful to automatically extract the history from a specific device using whatsapp web api:
https://github.com/mautrix/whatsapp
From the api it looks like you can request old messages through BuildHistorySyncRequest