Dai-shen / CALM-train

The training process for Credit and Risk Assessment Large Language Model (CALM)

CALM: Credit and Risk Assessment Large Language Model

  • Due to licensing restrictions on LLaMA weights, the model cannot be used for commercial purposes. Please adhere strictly to LLaMA's usage policy.
  • Because of the limitations of LLaMA's license, we cannot distribute the complete model weights directly; we release only the LoRA weights of CALM-7B.

Content

1. Preparing the environment

Create the environment with Conda, then install the required packages with pip:

pip install -r requirements.txt
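
For reference, a minimal sketch of the Conda step that precedes the pip install above (the environment name and Python version are placeholder assumptions, not values specified by the repo):

conda create -n calm python=3.10 -y   # placeholder environment name and Python version
conda activate calm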

2. Run

2.1 Download data

Before running, please download the raw data to data/CRA_resample_0.045M.json

2.1.1 Convert data format

export raw_data=/path_to/CRA_resample_0.045M.json
export conv_data=/path_to/CRA_resample_0.045M_conv.json
export data_name=CRA
export dev_data=/path_to/CRA-resample-dev3k.json
export train_data=/path_to/CRA-resample-train4w.json

python scripts/convert_to_conv_data.py \
    --orig_data ${raw_data} \
    --write_data ${conv_data} \
    --dataset_name CRA
head -n 3000 ${conv_data} > ${dev_data}
tail -n +3001 ${conv_data} > ${train_data}

We designate the first 3000 entries as the validation set, while the remaining data serves as the training set.
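
An optional sanity check (this assumes the converted file has one JSON record per line, which is what the head/tail split above relies on):

wc -l ${conv_data} ${dev_data} ${train_data}   # the dev split should contain exactly 3000 lines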

2.2 Model training

Training strategy

  • LoRA + int8

The launch script for training is train/scripts/run.sh. Modify the parameters in run.sh to match your setup; the key parameters are described below, followed by a sketch of how the corresponding variables can be set.

bash scripts/run_sft.sh
  • model_name_or_path: The pretrained base model (an LLaMA model must first be converted to the Hugging Face format so that it can be loaded with from_pretrained)
  • train_file: Training data
  • validation_file: Validation data
  • output_dir: Path for training logs and saved models
  • cache_dir: Path for intermediate data-processing caches
  • cutoff_len: Maximum input sequence length (1024 or above is suggested for LLaMA models, 512 or above for Bloom models)
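
As a sketch, the variables used by the launch command in 2.2.1 could be set in run.sh like this (all paths and values below are placeholders, not repo defaults):

export model_name_or_path=/path_to/llama-2-7b-hf        # base model in Hugging Face format
export train_file=/path_to/CRA-resample-train4w.json    # from step 2.1.1
export validation_file=/path_to/CRA-resample-dev3k.json
export output_dir=/path_to/output
export cache_dir=/path_to/cache
export log_dir=/path_to/logs
export cutoff_len=1024                                  # 1024 or above suggested for LLaMA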

2.2.1 LoRA

nohup torchrun --nproc_per_node 2 src/entry_point/sft_train.py \
    --model_name_or_path ${model_name_or_path} \
    --bf16 True \
    --llama True \
    --use_lora True \
    --deepspeed configs/deepspeed_config_stage3.json \
    --lora_config configs/lora_config_llama.json \
    --train_file ${train_file} \
    --validation_file ${validation_file} \
    --per_device_train_batch_size 6 \
    --per_device_eval_batch_size 6 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 5 \
    --model_max_length ${cutoff_len} \
    --save_strategy "steps" \
    --save_total_limit 3 \
    --learning_rate 3e-4 \
    --weight_decay 0.00001 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 10 \
    --evaluation_strategy "steps" \
    --seed 1234 \
    --gradient_checkpointing \
    --cache_dir ${cache_dir} \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    > ${log_dir}/train.log 2>&1 &

Parameters

  • use_lora: Train with LoRA
  • use_int8_training: Train with 8-bit quantization, which significantly reduces memory usage
  • lora_config: Path to the LoRA parameter configuration; when training a Bloom model, change it to configs/lora_config_bloom.json
  • deepspeed: When training on long sequences, DeepSpeed stage 3 is recommended; it shards model parameters across multiple cards, freeing memory to fit longer sequences

Note: use_int8_training and deepspeed are mutually exclusive; enable only one of them.
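
Since the two options are mutually exclusive, a sketch of switching the launch in 2.2.1 from DeepSpeed to LoRA + int8 is to drop the --deepspeed argument and enable use_int8_training instead (this assumes sft_train.py accepts the boolean in the same form as the other flags above; check the script before relying on it):

nohup torchrun --nproc_per_node 2 src/entry_point/sft_train.py \
    --model_name_or_path ${model_name_or_path} \
    --bf16 True \
    --llama True \
    --use_lora True \
    --use_int8_training True \
    --lora_config configs/lora_config_llama.json \
    --train_file ${train_file} \
    --validation_file ${validation_file} \
    --per_device_train_batch_size 6 \
    --per_device_eval_batch_size 6 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 5 \
    --model_max_length ${cutoff_len} \
    --save_strategy "steps" \
    --save_total_limit 3 \
    --learning_rate 3e-4 \
    --weight_decay 0.00001 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 10 \
    --evaluation_strategy "steps" \
    --seed 1234 \
    --gradient_checkpointing \
    --cache_dir ${cache_dir} \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    > ${log_dir}/train.log 2>&1 &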

The structure of the output_dir:

output_dir/
├── checkpoint-244/
│   ├── pytorch_model.bin
│   └── trainer_state.json
├── checkpoint-527/
│   ├── pytorch_model.bin
│   └── trainer_state.json
├── adapter_model.bin
├── print_log.txt
└── adapter_config.json

The top-level directory stores the final model produced by training.

2.2.2 Merge the model with LoRA

If you wish to merge the LoRA weights into a pretrained model, run the following command:

model_name_or_path=model_path_to/llama-2-7b-chat-T/
lora_path=lora_path_to/checkpoint_2/3739
output_path=out_path_to/CRA__model_2/model_3739

CUDA_VISIBLE_DEVICES=0 python src/merge_llama_with_lora.py \
    --model_name_or_path ${model_name_or_path} \
    --output_path ${output_path} \
    --lora_path ${lora_path} \
    --llama

The merged weights will be saved in the "output_path" directory. You can subsequently load them directly using "from_pretrained".
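
As a quick sketch of that last step (this loads the full merged model into memory, and assumes the transformers package is installed and that ${output_path} is still set from the commands above):

python -c "
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('${output_path}')
print(model.config.model_type)   # should report a llama-family model type
"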
