This project is intended purely for educational purposes. It guides developers through building an LLM from scratch and then fine-tuning it for different tasks. Building a good model requires understanding how every component works and how it consumes data, and making this project taught me a great deal about both. In this repository, you will find tools and materials to:
- Build a core LLM: Train a foundational language model to understand text.
- Sentiment Analysis Adaptation: Fine-tune the base LLM to interpret and analyze sentiments in texts.
- Financial QnA Adaptation: Adjust the LLM to address question and answer tasks specifically tailored for the financial sector.
- Performance Benchmarks (WIP): Evaluate the model's effectiveness in sentiment analysis and QnA tasks.
- User Interface (WIP): An interactive platform to test the model on sentiment analysis and QnA tasks.
The code for training the base model comes mostly from the repository llama2.c by Andrej Karpathy. His repository is a masterclass of educational content in AI development and is highly recommended for all learners. The main modifications are:
- Restructured the training code
- Integrated the 8-bit optimizer library for faster base-model training
- Made updates to deal with padding in input and output (not quite optimal yet)
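The padding handling mentioned above boils down to right-padding variable-length sequences and masking padded positions out of the loss. Here is a minimal sketch; the pad id of 0 and the ignore value of -100 are assumptions for illustration, not the repository's exact implementation:

```python
import numpy as np

PAD_ID = 0      # assumed pad token id
IGNORE = -100   # label value the loss function is told to skip

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad variable-length token sequences into one array, and build
    matching next-token targets where padded positions are masked out."""
    max_len = max(len(s) for s in sequences)
    inputs = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    targets = np.full((len(sequences), max_len), IGNORE, dtype=np.int64)
    for i, seq in enumerate(sequences):
        inputs[i, : len(seq)] = seq
        # next-token targets: shift left by one; last position has no target
        targets[i, : len(seq) - 1] = seq[1:]
    return inputs, targets

x, y = pad_batch([[5, 6, 7, 8], [9, 10]])
# x[1] == [9, 10, 0, 0]; y[1] == [10, -100, -100, -100]
```

Masking targets (rather than only inputs) is what keeps padded positions from contributing gradient, which is the part the note above flags as "not quite optimal yet".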
All models, from base to fine-tuned, are built from scratch, so you can see how each part works.
.
├── alpaca_finance # Preprocess Alpaca's QnA dataset
├── config.py # Configuration for model and network
├── finetune_dataset.py # Dataset and dataloader for the fine-tuning task
├── finetune_model.py # Fine-tuning model
├── Makefile # Compiling instruction for C code
├── model.py # Base model
├── run.c # C code for forward pass
├── sentiment_finance # Preprocess the news data for sentiment analysis
├── tinystories.py # Tiny Stories to train base model
├── tokenizer.bin # Vocabulary encoder-decoder in binary (C code)
├── tokenizer.model # Vocabulary encoder-decoder (python code)
├── tokenizer.py # Script for encoding and decoding text and token IDs
├── train_base_model.py # Training runner for the base model
└── train_ft_model.py # Training runner for the fine-tuning model
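The tokenizer files listed above handle the text-to-id round trip (the real repo uses a SentencePiece model, tokenizer.model). Conceptually, encoding and decoding look like this toy word-level version, whose vocabulary and ids are invented purely for demonstration:

```python
# Toy illustration of the encode/decode round trip tokenizer.py provides.
# This tiny word-level vocabulary is made up; the actual tokenizer is a
# subword SentencePiece model loaded from tokenizer.model.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
inv_vocab = {i: w for w, i in vocab.items()}

def encode(text):
    # unknown words fall back to the <unk> id
    return [vocab.get(w, 0) for w in text.split()]

def decode(ids):
    return " ".join(inv_vocab[i] for i in ids)

ids = encode("the cat sat")
assert decode(ids) == "the cat sat"
```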
We recommend installing Miniconda for managing the Python environment, but this repo also works with alternatives such as venv.
- Install miniconda by following these instructions
- Create a conda environment
conda create --name your_env_name python=3.10
- Activate the conda environment
conda activate your_env_name
- Install the dependencies
pip install -r requirements.txt
Run the command below. Note: This script originates from the llama2.c repository but has been slightly modified.
python tinystories.py download
Run the following command to preprocess the raw data into the base model's input format
python -m sentiment_finance.make_dataset
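The preprocessing step turns each raw news record into a supervised prompt/label pair. The sketch below is a hypothetical illustration of that mapping; the field names ("headline", "sentiment") and the prompt template are assumptions, not the exact logic in sentiment_finance/make_dataset:

```python
# Hypothetical sketch of turning one news record into a supervised example.
# Field names and the prompt template are invented for illustration; see
# sentiment_finance/make_dataset for the actual preprocessing.
def to_example(record):
    prompt = (
        f"Classify the sentiment of this headline: {record['headline']}\n"
        "Sentiment:"
    )
    return {"input": prompt, "target": record["sentiment"]}

ex = to_example(
    {"headline": "Shares surge after earnings beat", "sentiment": "positive"}
)
```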
- Download the Cleaned_date.json file from Hugging Face.
- Save it in the alpaca_finance/data folder.
- Then, run the following command to preprocess the raw data into the base model's input format.
python -m alpaca_finance.make_dataset
Run the following command to train the base model
python train_base_model.py training
After training, run the following command to test the base model
python train_base_model.py test
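Testing the base model amounts to autoregressive generation: repeatedly sampling the next token from the model's logits. A self-contained sketch of temperature sampling (not the repo's exact code) looks like this:

```python
import numpy as np

def sample_next(logits, temperature=1.0, rng=None):
    """Sample the next token id from raw logits.
    temperature -> 0 approaches greedy argmax decoding."""
    rng = rng or np.random.default_rng(0)
    if temperature <= 1e-8:
        return int(np.argmax(logits))
    z = logits / temperature
    z = z - z.max()                       # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()   # softmax over the vocabulary
    return int(rng.choice(len(probs), p=probs))

logits = np.array([0.1, 2.5, -1.0])
sample_next(logits, temperature=0.0)  # greedy -> token id 1
```

Higher temperatures flatten the distribution and produce more varied text, which is useful when eyeballing TinyStories-style completions.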
To test the base model in C, you first need to compile the C code by running the following command
gcc -o run run.c -lm
then
./run out/model.bin
NOTE: you can also use the Makefile to compile the C code if preferred.
For fine-tuning, we use Hugging Face's LoRA approach to extract the layers to adapt, but we implemented our own custom optimizer for our custom model. In the future, we plan to implement our own minimal version of the LoRA approach. Run the following command to fine-tune the model for sentiment analysis
python train_ft_model.py --task training --dataset_name news
Run the following command to fine-tune the model for question answering
python train_ft_model.py --task training --dataset_name alpaca
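The core LoRA idea used here is to freeze the base weight W and learn only a low-rank update B·A, scaled by alpha/r. This numpy sketch shows the forward pass under those assumptions; shapes and the zero initialization of B follow the standard LoRA recipe, not this repo's exact code:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """y = x W^T + (alpha/r) * x A^T B^T
    W is frozen; only the low-rank adapters A (r, d_in) and B (d_out, r)
    are trained."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))          # frozen base weight
A = rng.normal(size=(2, 8)) * 0.01   # rank-2 adapter, small random init
B = np.zeros((4, 2))                 # B starts at zero -> no change at step 0
x = rng.normal(size=(3, 8))
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)
```

Starting B at zero means fine-tuning begins exactly at the pretrained model's behavior, which is why LoRA training is stable even with random A.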
Coming soon...
- TinyStories: Provided the data for training the base model
- FinGPT: Provided the data for sentiment analysis
- Gaurang Bharti: Put together data from Stanford's Alpaca and FiQA for question-and-answer fine-tuning
- Jonathan Chang: Provided the minimal implementation of LoRA approach