How to train an EncoderDecoder Adapter/Prediction Head needed?
julianpollmann opened this issue
Environment info
- adapter-transformers version: 3.2.1
- Platform: Linux-6.2.0-27-generic-x86_64-with-glibc2.37
- Python version: 3.10.9
- PyTorch version (GPU?): 1.13.1 (GPU)
Details
Hey, I'm trying to train an adapter for a Seq2Seq task with language adapters. Since most of the language adapters on the hub are pretrained for BERT or RoBERTa, I cannot use e.g. BART for the task adapter. I set up an EncoderDecoder model with bert-base-multilingual-cased as base, but even with very little training data the training loss of adapter training stagnates at a high level (~4) and the model does not predict anything meaningful. When fully fine-tuning with the same training settings, the training loss quickly decreases to around 0. Setups I tried:
- Training a task adapter with bart-base - works
- Full fine-tuning of an EncoderDecoder model based on bert-base-multilingual-cased using the Huggingface Trainer - works
- Training a task adapter with an EncoderDecoder model based on bert-base-multilingual-cased - the model repeatedly predicts the same word; training loss stagnates at a high level.
Base model setup
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-multilingual-cased",
    "bert-base-multilingual-cased",
    tie_encoder_decoder=True
)
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.vocab_size = model.config.decoder.vocab_size
bad_words = ['[CLS]']
bad_words_ids = [tokenizer.vocab[token] for token in bad_words]
model.config.bad_words_ids = [bad_words_ids]
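A point worth checking in this setup: BERT tokenizers normally define [CLS], [SEP] and [PAD] but no BOS/EOS tokens, so `tokenizer.bos_token_id` and `tokenizer.eos_token_id` may be `None` unless they were set elsewhere. The sketch below falls back to the [CLS] and [SEP] IDs in that case. `resolve_special_ids` is a hypothetical helper, and the stand-in tokenizer with its ID values is illustrative only.

```python
from types import SimpleNamespace


def resolve_special_ids(tokenizer):
    """Pick start/end/pad token IDs, falling back to CLS/SEP when BOS/EOS are unset."""
    start = tokenizer.bos_token_id
    if start is None:
        start = tokenizer.cls_token_id
    end = tokenizer.eos_token_id
    if end is None:
        end = tokenizer.sep_token_id
    ids = {
        "decoder_start_token_id": start,
        "eos_token_id": end,
        "pad_token_id": tokenizer.pad_token_id,
    }
    missing = [key for key, value in ids.items() if value is None]
    if missing:
        raise ValueError(f"tokenizer does not define: {', '.join(missing)}")
    return ids


# Stand-in for a BERT-style tokenizer (illustrative ID values, not read from a checkpoint).
fake_tokenizer = SimpleNamespace(
    bos_token_id=None, eos_token_id=None,
    cls_token_id=101, sep_token_id=102, pad_token_id=0,
)
print(resolve_special_ids(fake_tokenizer))
```

With a real tokenizer, the returned values could be copied onto the config with `setattr(model.config, key, value)` for each item.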
Adapter setup
I tried to add a task adapter using multiple methods:
adapter_config = AdapterConfig.load("pfeiffer", reduction_factor=2)
model.add_adapter("simplification", config=adapter_config, set_active=True)
model.train_adapter(["simplification"])
or
setup_adapter_training(
    model=model,
    adapter_args=AdapterArguments(train_adapter=True),
    adapter_name="simplification",
    adapter_config_kwargs={"reduction_factor": 2}
)
When training an adapter using BART, a prediction head is added. With the EncoderDecoder model this seems to be missing. The saved adapter does not contain a head_config.json like the BART-trained adapter.
What do I need to change to train this task adapter with an EncoderDecoder Model?
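To make the head_config.json observation reproducible, a small check like the following can report which files a saved adapter directory contains. This is a minimal sketch using only the two file names mentioned in this issue; `saved_adapter_files` is a hypothetical helper, and the temporary directory only simulates a saved adapter.

```python
import tempfile
from pathlib import Path


def saved_adapter_files(adapter_dir, expected=("adapter_config.json", "head_config.json")):
    """Return {file name: True/False} for each expected file in adapter_dir."""
    root = Path(adapter_dir)
    return {name: (root / name).is_file() for name in expected}


# Simulate an adapter that was saved without a prediction head.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "adapter_config.json").write_text("{}")
    print(saved_adapter_files(tmp))
    # prints {'adapter_config.json': True, 'head_config.json': False}
```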
Training setup
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    fp16=True,
    output_dir="./bert2bert/",
    max_steps=3000,
    logging_steps=50,
    save_strategy="no",
    # eval_steps=50,
    learning_rate=3e-4,
    warmup_ratio=0.1,
    optim="adamw_torch",
    remove_unused_columns=False,
)
trainer = Seq2SeqAdapterTrainer(
    tokenizer=tokenizer,
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],
)
train_results = trainer.train()
EncoderDecoder adapter_config.json
{
  "config": {
    "adapter_residual_before_ln": false,
    "cross_adapter": false,
    "factorized_phm_W": true,
    "factorized_phm_rule": false,
    "hypercomplex_nonlinearity": "glorot-uniform",
    "init_weights": "bert",
    "inv_adapter": null,
    "inv_adapter_reduction_factor": null,
    "is_parallel": false,
    "learn_phm": true,
    "leave_out": [],
    "ln_after": false,
    "ln_before": false,
    "mh_adapter": false,
    "non_linearity": "relu",
    "original_ln_after": true,
    "original_ln_before": true,
    "output_adapter": true,
    "phm_bias": true,
    "phm_c_init": "normal",
    "phm_dim": 4,
    "phm_init_range": 0.0001,
    "phm_layer": false,
    "phm_rank": 1,
    "reduction_factor": 2,
    "residual_before_ln": true,
    "scaling": 1.0,
    "shared_W_phm": false,
    "shared_phm_rule": true,
    "use_gating": false
  },
  "hidden_size": null,
  "model_class": "EncoderDecoderModel",
  "model_name": null,
  "model_type": "encoder-decoder",
  "name": "simplification",
  "version": "3.2.1"
}
Bart adapter_config.json:
{
  "config": {
    "adapter_residual_before_ln": false,
    "cross_adapter": false,
    "factorized_phm_W": true,
    "factorized_phm_rule": false,
    "hypercomplex_nonlinearity": "glorot-uniform",
    "init_weights": "bert",
    "inv_adapter": null,
    "inv_adapter_reduction_factor": null,
    "is_parallel": false,
    "learn_phm": true,
    "leave_out": [],
    "ln_after": false,
    "ln_before": false,
    "mh_adapter": false,
    "non_linearity": "relu",
    "original_ln_after": true,
    "original_ln_before": true,
    "output_adapter": true,
    "phm_bias": true,
    "phm_c_init": "normal",
    "phm_dim": 4,
    "phm_init_range": 0.0001,
    "phm_layer": false,
    "phm_rank": 1,
    "reduction_factor": 8,
    "residual_before_ln": true,
    "scaling": 1.0,
    "shared_W_phm": false,
    "shared_phm_rule": true,
    "use_gating": false
  },
  "config_id": "26cd1b10db746518",
  "hidden_size": 768,
  "model_class": "BartForConditionalGeneration",
  "model_name": "facebook/bart-base",
  "model_type": "bart",
  "name": "simplification",
  "version": "3.2.1"
}
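Since the two files are long and mostly identical, a short diff makes the differences easier to see. The sketch below flattens nested keys and lists only the entries that differ; the two dictionaries are excerpts of the configs above (full files could be loaded with `json.load` instead), and `flatten` and `diff_configs` are hypothetical helper names.

```python
MISSING = "<missing>"


def flatten(d, prefix=""):
    """Turn nested dicts into a flat {"outer.inner": value} mapping."""
    flat = {}
    for key, value in d.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


def diff_configs(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ."""
    fa, fb = flatten(a), flatten(b)
    return {
        key: (fa.get(key, MISSING), fb.get(key, MISSING))
        for key in sorted(fa.keys() | fb.keys())
        if fa.get(key, MISSING) != fb.get(key, MISSING)
    }


# Excerpts of the two adapter_config.json files shown above.
enc_dec = {
    "config": {"reduction_factor": 2, "non_linearity": "relu"},
    "hidden_size": None,
    "model_class": "EncoderDecoderModel",
    "model_name": None,
    "model_type": "encoder-decoder",
    "name": "simplification",
}
bart = {
    "config": {"reduction_factor": 8, "non_linearity": "relu"},
    "config_id": "26cd1b10db746518",
    "hidden_size": 768,
    "model_class": "BartForConditionalGeneration",
    "model_name": "facebook/bart-base",
    "model_type": "bart",
    "name": "simplification",
}

for key, (left, right) in diff_configs(enc_dec, bart).items():
    print(f"{key}: {left!r} -> {right!r}")
```

Apart from the reduction factor, the differences are in the metadata: the EncoderDecoder config has no `config_id`, and its `hidden_size` and `model_name` are null.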