This project is intended purely for educational purposes. It guides developers through building an LLM from scratch and then fine-tuning it for different tasks. Building a good model requires understanding how every component works and how it consumes data, and making this project taught me a great deal about both. In this repository, you will find tools and materials to:
- Build a core LLM: Train a foundational language model to understand text.
- Sentiment Analysis Adaptation: Fine-tune the base LLM to interpret and analyze sentiments in texts.
- Financial QnA Adaptation: Adjust the LLM to address question and answer tasks specifically tailored for the financial sector.
- Performance Benchmarks (WIP): Evaluate the model's effectiveness in sentiment analysis and QnA tasks.
- User Interface (WIP): An interactive platform to test the model on sentiment analysis and QnA tasks.
The code for training the base model comes mostly from the repository llama2.c by Andrej Karpathy. His repository is a masterclass of educational content in AI development and is highly recommended for all learners. The main modifications are:
- Restructured the training code
- Integrated the 8-bit optimizer library for faster base-model training
- Made updates to deal with padding in input and output (not quite optimal yet)
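The padding handling mentioned above boils down to right-padding variable-length sequences and masking padded positions out of the loss. Here is a minimal sketch; the pad id of 0 and the ignore value of -100 are assumptions for illustration, not the repository's exact implementation:

```python
import numpy as np

PAD_ID = 0      # assumed pad token id
IGNORE = -100   # label value the loss function is told to skip

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad variable-length token sequences into one array, and build
    matching next-token targets where padded positions are masked out."""
    max_len = max(len(s) for s in sequences)
    inputs = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    targets = np.full((len(sequences), max_len), IGNORE, dtype=np.int64)
    for i, seq in enumerate(sequences):
        inputs[i, : len(seq)] = seq
        # next-token targets: shift left by one; last position has no target
        targets[i, : len(seq) - 1] = seq[1:]
    return inputs, targets

x, y = pad_batch([[5, 6, 7, 8], [9, 10]])
# x[1] == [9, 10, 0, 0]; y[1] == [10, -100, -100, -100]
```

Masking targets (rather than only inputs) is what keeps padded positions from contributing gradient, which is the part the note above flags as "not quite optimal yet".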
All models, from base to fine-tuned, are built from scratch, so you can see how each part works.
.
├── alpaca_finance # Preprocess Alpaca's QnA dataset
├── config.py # Configuration for model and network
├── finetune_dataset.py # Dataset and dataloader for the fine-tuning task
├── finetune_model.py # Fine-tuning model
├── Makefile # Compiling instruction for C code
├── model.py # Base model
├── run.c # C code for forward pass
├── sentiment_finance # Preprocess the news data for sentiment analysis
├── tinystories.py # Tiny Stories to train base model
├── tokenizer.bin # Vocabulary encoder-decoder in binary (C code)
├── tokenizer.model # Vocabulary encoder-decoder (python code)
├── tokenizer.py # Script for encoding and decoding text and token IDs
├── train_base_model.py # Training runner for the base model
└── train_ft_model.py # Training runner for the fine-tuning model
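The tokenizer files listed above handle the text-to-id round trip (the real repo uses a SentencePiece model, tokenizer.model). Conceptually, encoding and decoding look like this toy word-level version, whose vocabulary and ids are invented purely for demonstration:

```python
# Toy illustration of the encode/decode round trip tokenizer.py provides.
# This tiny word-level vocabulary is made up; the actual tokenizer is a
# subword SentencePiece model loaded from tokenizer.model.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
inv_vocab = {i: w for w, i in vocab.items()}

def encode(text):
    # unknown words fall back to the <unk> id
    return [vocab.get(w, 0) for w in text.split()]

def decode(ids):
    return " ".join(inv_vocab[i] for i in ids)

ids = encode("the cat sat")
assert decode(ids) == "the cat sat"
```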
We recommend installing Miniconda for managing the Python environment, but this repo also works with alternatives such as venv.
- Install miniconda by following these instructions
- Create a conda environment
conda create --name your_env_name python=3.10
- Activate the conda environment
conda activate your_env_name
- Install the dependencies
pip install -r requirements.txt
Run the command below. Note: This script originates from the llama2.c repository but has been slightly modified.
python tinystories.py download
Run the following command to preprocess the raw data into the base model's input format
python -m sentiment_finance.make_dataset
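The preprocessing step turns each raw news record into a supervised prompt/label pair. The sketch below is a hypothetical illustration of that mapping; the field names ("headline", "sentiment") and the prompt template are assumptions, not the exact logic in sentiment_finance/make_dataset:

```python
# Hypothetical sketch of turning one news record into a supervised example.
# Field names and the prompt template are invented for illustration; see
# sentiment_finance/make_dataset for the actual preprocessing.
def to_example(record):
    prompt = (
        f"Classify the sentiment of this headline: {record['headline']}\n"
        "Sentiment:"
    )
    return {"input": prompt, "target": record["sentiment"]}

ex = to_example(
    {"headline": "Shares surge after earnings beat", "sentiment": "positive"}
)
```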
- Download the Cleaned_date.json file from Hugging Face.
- Save it in the alpaca_finance/data folder.
- Then, run the following command to preprocess the raw data into the base model's input format.
python -m alpaca_finance.make_dataset
Run the following command to train the base model
python train_base_model.py training
After training, run the following command to test the base model
python train_base_model.py test
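Testing the base model amounts to autoregressive generation: repeatedly sampling the next token from the model's logits. A self-contained sketch of temperature sampling (not the repo's exact code) looks like this:

```python
import numpy as np

def sample_next(logits, temperature=1.0, rng=None):
    """Sample the next token id from raw logits.
    temperature -> 0 approaches greedy argmax decoding."""
    rng = rng or np.random.default_rng(0)
    if temperature <= 1e-8:
        return int(np.argmax(logits))
    z = logits / temperature
    z = z - z.max()                       # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()   # softmax over the vocabulary
    return int(rng.choice(len(probs), p=probs))

logits = np.array([0.1, 2.5, -1.0])
sample_next(logits, temperature=0.0)  # greedy -> token id 1
```

Higher temperatures flatten the distribution and produce more varied text, which is useful when eyeballing TinyStories-style completions.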
To test the base model in C, you first need to compile the C code by running the following command
gcc -o run run.c -lm
then
./run out/model.bin
NOTE: you can also use the Makefile to compile the C code if preferred.
For fine-tuning, we use Hugging Face's LoRA approach to extract the layers to adapt, but we implemented our own custom optimizer for our custom model. In the future, we plan to implement our own minimal version of the LoRA approach. Run the following command to fine-tune the model for sentiment analysis
python train_ft_model.py --task training --dataset_name news
Run the following command to fine-tune the model for question answering
python train_ft_model.py --task training --dataset_name alpaca
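The core LoRA idea used here is to freeze the base weight W and learn only a low-rank update B·A, scaled by alpha/r. This numpy sketch shows the forward pass under those assumptions; shapes and the zero initialization of B follow the standard LoRA recipe, not this repo's exact code:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """y = x W^T + (alpha/r) * x A^T B^T
    W is frozen; only the low-rank adapters A (r, d_in) and B (d_out, r)
    are trained."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))          # frozen base weight
A = rng.normal(size=(2, 8)) * 0.01   # rank-2 adapter, small random init
B = np.zeros((4, 2))                 # B starts at zero -> no change at step 0
x = rng.normal(size=(3, 8))
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)
```

Starting B at zero means fine-tuning begins exactly at the pretrained model's behavior, which is why LoRA training is stable even with random A.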
Coming soon...
- TinyStories: Provided the data for training the base model
- FinGPT: Provided the data for sentiment analysis
- Gaurang Bharti: Put together data from Stanford's Alpaca and FiQA for question-and-answer fine-tuning
- Jonathan Chang: Provided the minimal implementation of LoRA approach