NeoLLaMder is a project to fine-tune an LLM on the content of the Neolanders Discord server.
Preparing the data for the LLM involves the following steps:

- Export the data with Discrub and place the JSON files in the `raw_data` directory
- Run `1_format_data.py`, which removes the unnecessary data and creates formatted JSON files in the `formatted_data` directory
- Run `2_clean_data.py`, which reformats the usernames according to the allowed usernames list, made up of users who have consented to having their username retained in the training data
- Run `3_extract_users.py` to separate the messages from individual users into separate JSON files
`allowed_usernames.txt` is the list of usernames to be retained in the training data. `substitute_usernames.txt` is a list of usernames generated with Mockaroo that are used for pseudonymous users. Each pseudonymous user is assigned a username that remains consistent throughout the processing of the formatted Discord data. These usernames are also substituted in messages from other users that mention the user.
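The consistent-pseudonym behaviour described above can be sketched as follows. This is a hypothetical illustration, not the actual code in `2_clean_data.py`; the function names and the assumption that usernames are matched by plain string replacement are mine.

```python
def build_pseudonym_map(real_usernames, allowed, substitutes):
    """Assign each non-consenting user a stable fake username.

    Consenting users (in `allowed`) keep their real name; everyone else
    gets the next unused name from `substitutes`, assigned exactly once
    so it stays consistent across the whole dataset.
    """
    mapping = {}
    pool = iter(substitutes)
    for name in real_usernames:
        if name in allowed:
            mapping[name] = name           # consented: keep as-is
        elif name not in mapping:
            mapping[name] = next(pool)     # stable pseudonym
    return mapping

def clean_message(text, mapping):
    """Substitute pseudonyms for mentions of users inside message bodies."""
    for real, fake in mapping.items():
        if real != fake:
            text = text.replace(real, fake)
    return text
```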
`format_quotes_completion.py` and `format_quotes_alpaca.py` reformat `quotes.json` from the `cleaned_data` folder to use the completion and Alpaca formats for training. Below is an example of each format:

```json
{"text": "quoted_username: \"Completion format example quote\""}
{"instruction": "Generate a quote from Username", "output": "Alpaca format example quote"}
```
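The two output records above can be produced with formatters along these lines. This is a sketch, not the repo's actual scripts; the `username`/`content` field names are an assumption about the schema of `quotes.json`.

```python
import json

def to_completion(quote):
    # Completion format: the quote as a single training string.
    return {"text": f'{quote["username"]}: "{quote["content"]}"'}

def to_alpaca(quote):
    # Alpaca format: instruction/output pair.
    return {
        "instruction": f'Generate a quote from {quote["username"]}',
        "output": quote["content"],
    }

def write_jsonl(quotes, path, formatter):
    # One JSON object per line, as in the examples above.
    with open(path, "w", encoding="utf-8") as f:
        for q in quotes:
            f.write(json.dumps(formatter(q)) + "\n")
```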
The data pipeline scripts were created in their entirety by prompting GPT-4.

This model is being trained and tested with Axolotl.
This template was run on an RTX 3090. You'll need to adjust `gradient_accumulation_steps` and `micro_batch_size` in `qLora.yml` if you're running on a larger or smaller GPU.
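When adjusting these, note that the effective batch size is `micro_batch_size` × `gradient_accumulation_steps`, so keeping the product constant preserves training behaviour while changing memory use. The values below are illustrative only, not taken from the repo's `qLora.yml`:

```yaml
# Illustrative values -- not the repo's actual defaults.
# 24 GB GPU (e.g. RTX 3090): effective batch size = 2 * 4 = 8
micro_batch_size: 2
gradient_accumulation_steps: 4

# Smaller GPU, same effective batch size:
# micro_batch_size: 1
# gradient_accumulation_steps: 8
```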
- Start the container
- Download your dataset and config file inside the container
- Start the fine-tune:

  ```shell
  accelerate launch -m axolotl.cli.train qLora.yml
  ```

- Login to HuggingFace:

  ```shell
  huggingface-cli login
  ```

- Merge the QLoRA adapter into the base model and push to HuggingFace:

  ```shell
  python merge_peft.py --base_model=mistralai/Mistral-7B-v0.1 --peft_model=./qlora-out --hub_id=Zetaphor/Neolandtest
  ```
```shell
git clone https://github.com/ggerganov/llama.cpp.git
pip install -r llama.cpp/requirements.txt
mv Neolandtest llama.cpp/models            # Move the merged model into the llama.cpp models folder
cd llama.cpp
make LLAMA_CUBLAS=1 -j4                    # Build llama.cpp binaries with cuBLAS support
# python convert.py models/Neolandtest --outfile models/Neolandtest.gguf --outtype q8_0 # Optional 8-bit quantization
python3 convert.py ./models/Neolandtest/   # Convert to GGUF (F32)
./quantize models/Neolandtest.gguf models/quantized_q5_K_M.gguf q5_K_M   # 5-bit (q5_K_M) quantization
# Modify hf_upload and run again for the new file
```
This probably violates the Discord TOS. Use this responsibly and make sure you have the consent of both the admins and the community whose data you intend to scrape and train on.

This is a fun project that was done with the consent of the people involved, and it is intended to be used only within that private server. Don't be a dick and use this on a large public server where it's practically impossible to get consent.