# Evaluating Parameter-Efficient Transfer Learning Approaches on SURE Benchmark for Speech Understanding
Fine-tuning is widely used as the default algorithm for transfer learning from pre-trained models. However, parameter inefficiency arises when all the parameters of a large pre-trained model must be updated for each individual downstream task. As the number of parameters grows, fine-tuning is prone to overfitting and catastrophic forgetting. In addition, full fine-tuning can become prohibitively expensive when the model is used for many tasks. To mitigate this issue, parameter-efficient transfer learning algorithms, such as adapters and prefix tuning, have been proposed as a way to introduce a few trainable parameters that can be plugged into large pre-trained models such as BERT and HuBERT. In this paper, we introduce the Speech UndeRstanding Evaluation (SURE) benchmark for parameter-efficient learning on various speech-processing tasks. Additionally, we introduce a new adapter, ConvAdapter, based on 1D convolution. We show that ConvAdapter outperforms standard adapters while performing comparably to prefix tuning and LoRA with only 0.94% of trainable parameters on some of the tasks in SURE. We further explore the effectiveness of parameter-efficient transfer learning for speech synthesis tasks such as Text-to-Speech (TTS).
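ConvAdapter replaces the linear down- and up-projections of a standard bottleneck adapter with 1D convolutions over the time dimension. The following is a minimal, illustrative sketch of such a block in PyTorch (the class name, hidden sizes, and kernel size are assumptions for illustration, not the exact implementation in this repository):

```python
import torch
import torch.nn as nn

class ConvAdapterSketch(nn.Module):
    """Illustrative 1D-convolutional adapter: down-project, nonlinearity,
    up-project, residual connection. Only these adapter weights would be trained."""

    def __init__(self, dim: int = 768, bottleneck: int = 64, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2  # keep the sequence length unchanged
        self.down = nn.Conv1d(dim, bottleneck, kernel_size, padding=padding)
        self.act = nn.GELU()
        self.up = nn.Conv1d(bottleneck, dim, kernel_size, padding=padding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim), as produced by a transformer layer
        residual = x
        x = x.transpose(1, 2)  # (batch, dim, time) for Conv1d
        x = self.up(self.act(self.down(x)))
        x = x.transpose(1, 2)
        return residual + x
```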
- Set up the environment:

```bash
conda create --name speechprompt python==3.8.5
conda activate speechprompt
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 -c pytorch
```

- Install the other dependencies:

```bash
pip install -r requirements.txt
```
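To check that the environment was created correctly (optional; not part of the original setup steps), the installed versions can be verified from inside the activated environment:

```python
# Optional sanity check: confirm the pinned PyTorch stack imports and CUDA is visible.
import torch
import torchvision
import torchaudio

print("torch:", torch.__version__)              # expected 1.10.0
print("torchvision:", torchvision.__version__)  # expected 0.11.0
print("torchaudio:", torchaudio.__version__)    # expected 0.10.0
print("CUDA available:", torch.cuda.is_available())
```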
First, we need to specify the dataset and arguments. As an example, let's use "esd" as the dataset and "finetune" as the tuning method for the speech emotion recognition task:
```bash
CUDA_VISIBLE_DEVICES=2,3 python train.py \
    --dataset "esd" \
    --data_dir "/data/path/ESD" \
    --output_dir '/data/path/output_earlystop_ser_esd_finetune_2e3' \
    --do_train True \
    --do_eval True \
    --do_predict False \
    --evaluation_strategy "steps" \
    --save_strategy "steps" \
    --save_steps 500 \
    --eval_steps 25 \
    --learning_rate 2e-3 \
    --feat_adapter_name "conv_adapter" \
    --trans_adapter_name "adapterblock" \
    --output_adapter False \
    --mh_adapter False \
    --prefix_tuning False \
    --lora_adapter False \
    --feat_enc_adapter False \
    --fine_tune True \
    --per_device_train_batch_size 64 \
    --gradient_accumulation_steps 4 \
    --per_device_eval_batch_size 64 \
    --num_train_epochs 100 \
    --warmup_ratio 0.1 \
    --logging_steps 20 \
    --logging_dir '/data/path/output_earlystop_ser_esd_finetune_2e3/log' \
    --load_best_model_at_end True \
    --metric_for_best_model "f1"
```
- dataset: the dataset to use, such as "esd", "fleurs", "fluent_commands", etc.
- data_dir: path to the dataset, for instance "../data/path/ESD"
- output_dir: path for checkpoints and logs, for instance '../data/path/output_earlystop_ser_esd_finetune_2e3'
- do_train: set to True to train
- do_eval: set to True to evaluate
- do_predict: set to True to run inference
- evaluation_strategy: standard Hugging Face Trainer argument; set it according to the official Hugging Face documentation (see the sketch after this list)
- save_strategy: standard Hugging Face Trainer argument
- save_steps: standard Hugging Face Trainer argument
- eval_steps: standard Hugging Face Trainer argument
- learning_rate: standard Hugging Face Trainer argument
- feat_adapter_name: the adapter type added in the feature encoder; not used in this paper and can be skipped
- trans_adapter_name: the adapter type added in the transformer layers, e.g. "adapterblock" for ConvAdapter and "bottleneck" for the Bottleneck Adapter
- output_adapter: True to add the adapter after the feed-forward block of every transformer layer; only affects ConvAdapter and the Bottleneck Adapter
- mh_adapter: True to add the adapter after the multi-head attention block of every transformer layer; only affects ConvAdapter and the Bottleneck Adapter
- prefix_tuning: True to add prefix tuning
- lora_adapter: True to add LoRA
- feat_enc_adapter: True to add an adapter in the feature encoder of wav2vec2
- fine_tune: True if only full fine-tuning (no adapters) is needed
- per_device_train_batch_size: standard Hugging Face Trainer argument
- gradient_accumulation_steps: standard Hugging Face Trainer argument
- per_device_eval_batch_size: standard Hugging Face Trainer argument
- num_train_epochs: standard Hugging Face Trainer argument
- warmup_ratio: standard Hugging Face Trainer argument
- logging_steps: standard Hugging Face Trainer argument
- logging_dir: standard Hugging Face Trainer argument
- load_best_model_at_end: standard Hugging Face Trainer argument
- metric_for_best_model: standard Hugging Face Trainer argument
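The arguments labelled "standard Hugging Face Trainer argument" above are typically forwarded to the Hugging Face `TrainingArguments`. As a rough sketch of how the flags from the example command could map onto that API (this is an assumption about how `train.py` consumes them, not a copy of its code):

```python
from transformers import TrainingArguments

# Hypothetical mapping of the CLI flags above onto Hugging Face TrainingArguments;
# train.py may parse and forward them differently.
training_args = TrainingArguments(
    output_dir="/data/path/output_earlystop_ser_esd_finetune_2e3",
    do_train=True,
    do_eval=True,
    evaluation_strategy="steps",
    save_strategy="steps",
    save_steps=500,
    eval_steps=25,
    learning_rate=2e-3,
    per_device_train_batch_size=64,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=64,
    num_train_epochs=100,
    warmup_ratio=0.1,
    logging_steps=20,
    logging_dir="/data/path/output_earlystop_ser_esd_finetune_2e3/log",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)
```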
Let's further explain the five training methods of the model. For example, to start a new emotion classification task with a particular method, set the corresponding parameters as below:
## finetune
--fine_tune True
## bottleneck
--trans_adapter_name "bottleneck"
--output_adapter True
## prefix-tuning
--prefix_tuning True
## lora
--lora_adapter True
## ConvAdapter
--trans_adapter_name "adapterblock"
--output_adapter True
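For readers unfamiliar with what `--lora_adapter True` enables, here is a minimal sketch of the LoRA idea, a trainable low-rank update added on top of a frozen linear layer (the class name, rank, and scaling below are illustrative assumptions, not this repository's implementation):

```python
import torch
import torch.nn as nn

class LoRALinearSketch(nn.Module):
    """Illustrative LoRA layer: a frozen base Linear plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False        # the pre-trained weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # the update starts at zero (no-op at init)
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen projection plus scaled low-rank correction.
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))
```

Wrapping an existing projection, e.g. `LoRALinearSketch(nn.Linear(768, 768))`, leaves only `lora_a` and `lora_b` trainable, which is where the parameter savings come from.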
We also provide examples for each training method in "emotion_cls.sh"; use the following command to start a new emotion classification task:
```bash
bash emotion_cls.sh
```
To further monitor the convergence of model training, we can view the log files with TensorBoard:
```bash
tensorboard --logdir=/data/path/output_earlystop_asr_fleurs_lora_2e3/log --bind_all
```
```bibtex
@inproceedings{li2023evaluating,
  title={Evaluating Parameter-Efficient Transfer Learning Approaches on SURE Benchmark for Speech Understanding},
  author={Li, Yingting and Mehrish, Ambuj and Zhao, Shuai and Bhardwaj, Rishabh and Zadeh, Amir and Majumder, Navonil and Mihalcea, Rada and Poria, Soujanya},
  booktitle={ICASSP},
  year={2023}
}
```
Note: Please cite our paper if you find this repository useful.