
CorDA: Context-Oriented Decomposition Adaptation of Large Language Models

(under construction)

Authors: Yibo Yang, Xiaojie Li, Zhongzhu Zhou, Shuaiwen Leon Song, Jianlong Wu, Liqiang Nie, Bernard Ghanem

Paper: [PDF](https://arxiv.org/pdf/2406.05223)



Getting Started

Clone the repo and install the dependencies.

git clone https://github.com/iboing/CorDA.git
cd CorDA
pip install -r requirements.txt

The datasets in json/jsonl format used to collect covariance matrices (MetaMath for math, CodeFeedback for code, WizardLM_evol_instruct and alpaca for instruction following) can be downloaded from our huggingface repo. The other datasets will be downloaded automatically when running the code.
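For instance, the calibration files can be fetched programmatically with huggingface_hub (a minimal sketch; the repo id below is a hypothetical placeholder for the repo linked above, not a confirmed name):

# Sketch: download the calibration json/jsonl files from the Hugging Face Hub.
# NOTE: "your-org/CorDA-calib-data" is a placeholder; substitute the actual
# dataset repo id linked in this README.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="your-org/CorDA-calib-data",  # placeholder repo id
    repo_type="dataset",
    local_dir="data/calib",
)
print(f"Calibration files downloaded to {local_path}")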

Step 1: Context-oriented Decomposition

You can skip Step 1 by directly downloading the decomposed model from our huggingface repo.

CorDA can initialize a learnable adapter in two modes: knowledge-preserved adaptation (tools/build_KPA.sh) and instruction-previewed adaptation (tools/build_IPA.sh). Knowledge-preserved adaptation samples questions from QA datasets, such as TriviaQA and nq_open, to obtain covariance matrices for decomposition, and uses the smallest $r$ singular values and vectors to initialize the adapter. Instruction-previewed adaptation samples queries and responses from the finetuning dataset to obtain covariance matrices, and initializes the adapter with the largest $r$ singular values and vectors.
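To make the two modes concrete, here is a toy numpy sketch of the decomposition idea (illustrative only; build_corda.py is the reference implementation and the paper gives the exact formulation): the input covariance is folded into the SVD of the weight and folded back out, so the singular components are ranked by how much they matter for the calibration context.

# Toy sketch of context-oriented decomposition (illustrative only).
import numpy as np

d_out, d_in, r, n = 64, 48, 8, 256
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))      # pretrained weight
X = rng.standard_normal((n, d_in))          # calibration activations

C = X.T @ X / n + 1e-6 * np.eye(d_in)       # input covariance (damped for invertibility)
U, S, Vt = np.linalg.svd(W @ C, full_matrices=False)
V_adj = Vt @ np.linalg.inv(C)               # so that W = U @ diag(S) @ V_adj

# KPA: freeze the largest components (they carry the sampled knowledge)
# and initialize the adapter from the smallest r components.
# IPA (--first_eigen): initialize the adapter from the largest r instead.
adapter_init = (U[:, -r:] * S[-r:]) @ V_adj[-r:, :]
frozen_part = (U[:, :-r] * S[:-r]) @ V_adj[:-r, :]
assert np.allclose(adapter_init + frozen_part, W, atol=1e-6)  # exact split of W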

📖 Knowledge-preserved adaptation

CUDA_VISIBLE_DEVICES=0 python build_corda.py \
    --model_id "meta-llama/Llama-2-7b-hf" \
    --cov_aware \
    --r {rank} \
    --use_cache \
    --calib_dataset "nqopen" \
    --calib_loader_size 256 \
    --save_model \
    --save_path {path_to_decomposed_model}

Arguments:

  • --model_id is the pre-trained model for decomposition.
  • --cov_aware adopts our context-oriented decomposition and collects covariance matrices.
  • --r is the low rank of LoRA, e.g. 128.
  • --use_cache reuses the dataloader and covariance matrices saved in CorDA/cache to avoid calculating the covariance matrices again.
  • --calib_dataset specifies the dataset to sample data from to obtain covariance matrices. KPA mode uses QA datasets, with choices of "nqopen" and "trivia_qa".
  • --calib_loader_size is the number of sampled data.
  • --save_model saves the initialized model in --save_path.

🔭 Instruction-previewed adaptation

CUDA_VISIBLE_DEVICES=0 python build_corda.py \
    --model_id "meta-llama/Llama-2-7b-hf" \
    --cov_aware \
    --r {rank} \
    --use_cache \
    --first_eigen \
    --calib_dataset "MetaMATH" \
    --calib_loader_size 256 \
    --save_model \
    --save_path {path_to_decomposed_model}

Arguments:

  • --first_eigen uses the largest $r$ singular values and vectors to initialize the learnable adapter for the instruction-previewed adaptation mode.
  • --calib_dataset specifies the dataset to sample data from to obtain covariance matrices. IPA mode uses finetuning datasets, with choices of "MetaMATH", "codefeedback", "WizLMinstruct", and "alpaca".

Step 2: Adapter Training

LoRA:

sh tools/train_LoRA.sh {path_to_trained_model}

CorDA:

sh tools/train_CorDA.sh {path_to_decomposed_model} {path_to_trained_model}

Step 3: Inference

After training, the LoRA adapter can be merged into the base model by running:

python merge_adapter_to_base_model.py --base_model "meta-llama/Llama-2-7b-hf" --adapter {path_to_trained_model}/ft --output_path {path_to_merged_model}
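Conceptually, this merge folds the low-rank update into the base weight. If the trained adapter follows the standard PEFT checkpoint layout, the same effect can be obtained with the PEFT API (a minimal sketch under that assumption; merge_adapter_to_base_model.py remains the reference script):

# Sketch: merging a LoRA adapter into the base model with PEFT.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, "path/to/trained_model/ft")  # placeholder path
merged = model.merge_and_unload()  # W <- W + (alpha / r) * B @ A for each LoRA layer
merged.save_pretrained("path/to/merged_model")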

The current CorDA code is based on a customized model rather than huggingface/PEFT, so the CorDA model can be used for inference directly (trust_remote_code needs to be true when loading the model) without merging. If you still want to merge the trained CorDA adapter into the frozen weights to restore the original LLaMA model architecture, you can execute the command:

(optional)
python merge_adapter_for_corda.py --model_id {path_to_trained_model}/ft --save_path {path_to_merged_model}

This may lead to slightly lower performance than direct inference without merging, due to numerical variation.
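For direct inference without merging, it is enough to load the trained model with trust_remote_code enabled; a minimal sketch (the path is a placeholder):

# Sketch: loading the trained CorDA model directly for inference.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "path/to/trained_model/ft"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path, trust_remote_code=True, torch_dtype=torch.float16, device_map="auto")
inputs = tokenizer("Question: who wrote the novel Dune?\nAnswer:", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))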

Inference on world knowledge:

Inference on world knowledge benchmarks is based on EleutherAI/lm-evaluation-harness. For example, we evaluate on nq_open by:

accelerate launch -m lm_eval \
    --model hf \
    --model_args pretrained={path_to_trained_model}/ft,trust_remote_code=True,dtype=float16 \
    --output_path {result_path}/nq_open.json \
    --tasks nq_open \
    --batch_size auto \
    --max_batch_size 8 \
    --device cuda

Inference on Math:

Evaluation on Gsm8k and Math can be performed by:

(for CorDA:)
sh tools/inference_Math.sh {path_to_trained_model}/ft
(for LoRA:)
sh tools/inference_Math.sh {path_to_merged_model}

Inference on Code and Instruction Following:

Evaluation on HumanEval and MBPP is based on bigcode-evaluation-harness. Evaluation on MTBench is based on FastChat. We use their default settings for evaluation.

Results

Method                      TriviaQA   NQ open   GSM8k   Math
LoRA                        44.17      1.91      42.68   5.92
CorDA (KPA with nqopen)     45.23      10.44     45.64   6.94
CorDA (IPA with MetaMath)   -          -         54.59   8.54

Compared with LoRA, CorDA in knowledge-preserved adaptation (KPA) not only performs better on the finetuning task but also mitigates the forgetting of world knowledge. CorDA in instruction-previewed adaptation (IPA) further enhances finetuning performance.

The models can be downloaded from our huggingface repo.

Wikitext/PTB Results

To reproduce Figure 2 and Table 6 of our paper, which compare our context-oriented decomposition with ASVD and plain SVD, you can execute tools/full_decompose.sh. Concretely,

CUDA_VISIBLE_DEVICES=0 python -u build_corda.py \
    --model_id="meta-llama/Llama-2-7b-hf" \
    --r {the smallest rank to discard} \
    --mode full_decompose \
    --use_cache \
    --calib_dataset "wikitext2" \
    --calib_loader_size 256 \
    [--cov_aware for context-oriented decomposition, --act_aware for ASVD, or remove this argument for plain SVD]

Arguments:

  • --r is the number of the smallest singular values and vectors to discard, i.e., the x-axis of Figure 2.
  • --cov_aware adopts our context-oriented SVD. Use --act_aware for ASVD or remove this argument for the Plain SVD.
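To see the intuition behind this comparison in miniature, the following toy numpy sketch (an illustration, not the repo's evaluation code) discards the $r$ smallest components under plain SVD and under a covariance-aware SVD, and compares the output error on the calibration activations; on anisotropic inputs the covariance-aware variant typically preserves the output better:

# Toy comparison: output error after discarding the r smallest components.
import numpy as np

d_out, d_in, r, n = 64, 48, 8, 256
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))
X = rng.standard_normal((n, d_in)) * np.linspace(0.1, 3.0, d_in)  # anisotropic activations

def discard_plain(W, r):
    # Plain SVD: drop the r smallest singular components of W itself.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :-r] * S[:-r]) @ Vt[:-r, :]

def discard_cov(W, X, r):
    # Covariance-aware: decompose W @ C, drop the r smallest, fold C back out.
    C = X.T @ X / len(X) + 1e-6 * np.eye(X.shape[1])
    U, S, Vt = np.linalg.svd(W @ C, full_matrices=False)
    return (U[:, :-r] * S[:-r]) @ (Vt[:-r, :] @ np.linalg.inv(C))

for name, W_hat in [("plain SVD", discard_plain(W, r)),
                    ("cov-aware", discard_cov(W, X, r))]:
    print(f"{name}: output error {np.linalg.norm(X @ (W - W_hat).T):.3f}")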

Citation

If you find our work/code useful for your research, please consider citing:

@article{yang2024corda,
  title={CorDA: Context-Oriented Decomposition Adaptation of Large Language Models},
  author={Yang, Yibo and Li, Xiaojie and Zhou, Zhongzhu and Song, Shuaiwen Leon and Wu, Jianlong and Nie, Liqiang and Ghanem, Bernard},
  journal={arXiv preprint arXiv:2406.05223},
  year={2024}
}

About

CorDA: Context-Oriented Decomposition Adaptation of Large Language Models

https://arxiv.org/pdf/2406.05223

License: Apache License 2.0


