
🤗 PEFT-SFT

Sparse Fine-Tuning for Large Language Models

This is a fork of 🤗 PEFT implementing efficient sparse fine-tuning (SFT) as described in the paper Scaling Sparse Fine-Tuning to Large Language Models. The scripts for the instruction-tuning experiments from the paper can be found at https://github.com/ducdauge/sft-llm. You can also find a simple QA example with 🤗 Trainer here.

Installation

You can install this package as follows:

git clone https://github.com/AlanAnsell/peft.git
cd peft
python setup.py develop # or "pip install .", but develop mode is recommended

or use

pip install git+https://github.com/AlanAnsell/peft.git
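
To check the installation, you can verify that the SFT additions are importable. A minimal sketch, using only names that appear later in this README:

# Should run without error if the fork is installed correctly.
from peft import SftConfig, SftTrainer, SftAdamW, SftSM3, SftSelector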

Creating an SFT model

You can prepare a model for SFT as follows:

from transformers import AutoModelForCausalLM
from peft import get_peft_config, get_peft_model, SftConfig, TaskType
model_name_or_path = "meta-llama/Llama-2-7b-hf"

peft_config = SftConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    density=0.01,
    selection_algorithm="rigl", # or "sm3" for moment approximation SFT
    target_modules=["q_proj", "o_proj", "v_proj", "k_proj", "gate_proj", "up_proj", "down_proj"],
)

model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
model = get_peft_model(model, peft_config)
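
As a quick sanity check, you can inspect how many weights sparse fine-tuning will actually update. A minimal sketch, assuming the standard 🤗 PEFT print_trainable_parameters helper is also available on SFT models:

# Reports trainable vs. total parameter counts; with density=0.01 the
# trainable fraction should be roughly 1% of the model's parameters.
model.print_trainable_parameters()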

SFT with 🤗 Trainer

Because SFT updates the set of trainable parameters during training, some code needs to be added to the training loop. If you are using 🤗 Trainer, wrap your Trainer class with SftTrainer and construct the resulting class as usual, passing your peft_config as the sft_config argument:

from peft import SftTrainer

...

trainer_cls = SftTrainer(MyTrainer) # MyTrainer = Trainer or any subclass thereof
trainer = trainer_cls(
    model=model,
    args=training_args,
    ...
    sft_config=peft_config,
)

You should then be able to use trainer as you would normally.
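
For reference, here is a fuller sketch of the same pattern. The TrainingArguments values are illustrative, and train_dataset and tokenizer are assumed to come from your own data pipeline:

from transformers import Trainer, TrainingArguments
from peft import SftTrainer

training_args = TrainingArguments(
    output_dir="llama-2-7b-sft",   # placeholder output path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=1,
)

trainer_cls = SftTrainer(Trainer)  # wrap the stock Trainer class
trainer = trainer_cls(
    model=model,                   # PeftModel created with get_peft_model above
    args=training_args,
    train_dataset=train_dataset,   # assumed to exist
    tokenizer=tokenizer,           # assumed to exist
    sft_config=peft_config,
)
trainer.train()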

SFT with a custom training loop

If you are using a custom training loop, use the SftAdamW or SftSM3 optimizer, depending on whether you are using accumulated-gradient or moment-approximation SFT respectively, and construct an SftSelector object:

import torch

from peft import SftAdamW, SftSM3, SftSelector

...

optimizer_grouped_parameters = [
    {
        "params": [
            p for n, p in model.named_parameters()
            if p.requires_grad
        ],
        "weight_decay": weight_decay,
    },
]

if peft_config.selection_algorithm == "sm3":
    deltas = {
        delta.values: delta
        for _1, _2, delta in model.active_deltas()
    }
    optimizer = SftSM3(
        optimizer_grouped_parameters,
        deltas,
        lr=learning_rate,
    )
else:
    optimizer = SftAdamW(
        optimizer_grouped_parameters,
        lr=learning_rate,
        momentum_dtype=torch.float32,
    )

...

selector = SftSelector(
    model,
    optimizer,
    peft_config,
    num_train_steps, # total expected duration of training in update steps
    gradient_accumulation_steps, # grad accumulation steps per update step
)

Then call the selector's .step() method at the end of each update step, e.g.

for i, batch in enumerate(train_dataloader):
    ...
    loss = model(**batch).loss
    loss.backward()
    ...

    if (i + 1) % gradient_accumulation_steps == 0:
        ...
        optimizer.step()
        optimizer.zero_grad()
        selector.step()

SFT options

The following hyperparameters can be modified through the SftConfig (an illustrative configuration follows this list):

  • density / num_tunable_weights: set the number of tunable parameters as a proportion of total model parameters or as an absolute number, respectively. Defaults to density=0.01.
  • selection_algorithm: sets the SFT selection algorithm. Supply "rigl" for gradient accumulation/RigL-style SFT or "sm3" for moment approximation SFT with the SM3 optimizer. Defaults to "rigl".
  • reselection_steps: sets the number of steps between parameter reselections. Defaults to 20. You may want to use a larger value for small batch sizes.
  • selection_accumulation_steps: for gradient accumulation SFT, controls the number of steps over which gradients are accumulated.
  • initial_reselection_rate: the proportion of parameters that will be reselected initially. This is reduced linearly to zero over the course of training. Defaults to 0.2.
  • target_modules: controls which linear modules SFT is applied to. If not supplied, SFT will be applied to all linear modules within Transformer blocks.
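
As an illustration of these options, a configuration using an absolute parameter budget and a longer reselection interval might look like the following (the values are illustrative, not recommendations; SftConfig and TaskType are imported as in the example above):

peft_config = SftConfig(
    task_type=TaskType.CAUSAL_LM,
    num_tunable_weights=20_000_000,  # absolute budget instead of density
    selection_algorithm="sm3",       # moment approximation SFT
    reselection_steps=40,            # reselect less often, e.g. for small batch sizes
    initial_reselection_rate=0.2,
    target_modules=["q_proj", "v_proj"],
)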

PEFT

For details on using PEFT please refer to the HuggingFace documentation or the 🤗 PEFT repository.

Citing

If you use our SFT implementation, please use the following snippet to cite our work:

@misc{ansell2024scaling,
      title={Scaling Sparse Fine-Tuning to Large Language Models}, 
      author={Alan Ansell and Ivan Vulić and Hannah Sterz and Anna Korhonen and Edoardo M. Ponti},
      year={2024},
      eprint={2401.16405},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

If you want to cite 🤗 PEFT in your publication, use the following snippet:

@Misc{peft,
  title =        {PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods},
  author =       {Sourab Mangrulkar and Sylvain Gugger and Lysandre Debut and Younes Belkada and Sayak Paul},
  howpublished = {\url{https://github.com/huggingface/peft}},
  year =         {2022}
}

About

License: Apache License 2.0

