BertGeneration training yields "RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn"
julianpollmann opened this issue · comments
Environment info
- adapter-transformers version: 3.2.1
- Platform: Linux-6.2.0-27-generic-x86_64-with-glibc2.37
- Python version: 3.10.9
- PyTorch version (GPU?): 1.13.1 (GPU)
Details
I'm trying to train an EncoderDecoder adapter with BertGeneration using the Seq2SeqAdapterTrainer:
encoder = BertGenerationEncoder.from_pretrained(
    "bert-base-multilingual-cased",
    bos_token_id=101,
    eos_token_id=102,
)
decoder = BertGenerationDecoder.from_pretrained(
    "bert-base-multilingual-cased",
    add_cross_attention=True,
    is_decoder=True,
    bos_token_id=101,
    eos_token_id=102,
)
model = EncoderDecoderModel(encoder=encoder, decoder=decoder)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.vocab_size = model.config.encoder.vocab_size
adapter_config = AdapterConfig.load("pfeiffer", reduction_factor=2)
model.add_adapter("simplification", config=adapter_config, set_active=True)
model.train_adapter(["simplification"])
This results in a RuntimeError:
The following columns in the training set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: token_type_ids. If token_type_ids are not expected by `EncoderDecoderModel.forward`, you can safely ignore this message.
***** Running training *****
Num examples = 10
Num Epochs = 3000
Instantaneous batch size per device = 16
Total train batch size (w. parallel, distributed & accumulation) = 16
Gradient Accumulation steps = 1
Total optimization steps = 3000
Number of trainable parameters = 0
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
xxx/adapters/lib/python3.10/site-packages/transformers/models/encoder_decoder/modeling_encoder_decoder.py:654: FutureWarning: Version v4.12.0 introduces a better way to train encoder-decoder models by computing the loss inside the encoder-decoder framework rather than in the decoder itself. You may observe training discrepancies if fine-tuning a model trained with versions anterior to 4.12.0. The decoder_input_ids are now created based on the labels, no need to pass them yourself anymore.
warnings.warn(DEPRECATION_WARNING, FutureWarning)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[22], line 1
----> 1 train_results = trainer.train()
File ~/.conda/envs/adapters/lib/python3.10/site-packages/transformers/trainer.py:1543, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1538 self.model_wrapped = self.model
1540 inner_training_loop = find_executable_batch_size(
1541 self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
1542 )
-> 1543 return inner_training_loop(
1544 args=args,
1545 resume_from_checkpoint=resume_from_checkpoint,
1546 trial=trial,
1547 ignore_keys_for_eval=ignore_keys_for_eval,
1548 )
File ~/.conda/envs/adapters/lib/python3.10/site-packages/transformers/trainer.py:1791, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
1789 tr_loss_step = self.training_step(model, inputs)
1790 else:
-> 1791 tr_loss_step = self.training_step(model, inputs)
1793 if (
1794 args.logging_nan_inf_filter
1795 and not is_torch_tpu_available()
1796 and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
1797 ):
1798 # if loss is nan or inf simply add the average of previous logged losses
1799 tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)
File ~/.conda/envs/adapters/lib/python3.10/site-packages/transformers/trainer.py:2549, in Trainer.training_step(self, model, inputs)
2546 loss = loss / self.args.gradient_accumulation_steps
2548 if self.do_grad_scaling:
-> 2549 self.scaler.scale(loss).backward()
2550 elif self.use_apex:
2551 with amp.scale_loss(loss, self.optimizer) as scaled_loss:
File ~/.conda/envs/adapters/lib/python3.10/site-packages/torch/_tensor.py:488, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
478 if has_torch_function_unary(self):
479 return handle_torch_function(
480 Tensor.backward,
481 (self,),
(...)
486 inputs=inputs,
487 )
--> 488 torch.autograd.backward(
489 self, gradient, retain_graph, create_graph, inputs=inputs
490 )
File ~/.conda/envs/adapters/lib/python3.10/site-packages/torch/autograd/__init__.py:197, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
192 retain_graph = create_graph
194 # The reason we repeat same the comment below is that
195 # some Python versions print out the first line of a multi-line function
196 # calls in the traceback and some print out the last line
--> 197 Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
198 tensors, grad_tensors_, retain_graph, create_graph, inputs,
199 allow_unreachable=True, accumulate_grad=True)
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
This issue is also present in transformers, where a change of the optimizer is discussed as a solution. Changing the optimizer in the Seq2SeqAdapterTrainer did not work:
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="no",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    fp16=True,
    output_dir="./bert2bert/",
    max_steps=3000,
    logging_steps=50,
    save_strategy="no",
    # eval_steps=50,
    learning_rate=3e-4,
    warmup_ratio=0.1,
    # optim="adamw_torch",
    remove_unused_columns=True,
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=100,
    num_training_steps=3000,
)
optimizers = (optimizer, scheduler)
trainer = Seq2SeqAdapterTrainer(
    tokenizer=tokenizer,
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    optimizers=optimizers,
)
train_results = trainer.train()
Training with the same settings without adapter and Seq2SeqTrainer does work.
This seems to be caused by the trainer freezing the model weights while the EncoderDecoderMixin adds no adapter modules other than the invertible adapters.
As a result, the trainer finds no parameters with requires_grad=True to pass to the optimizer, which matches the "Number of trainable parameters = 0" line in the log above.
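One way to confirm this before calling trainer.train() is to count the parameters that still require gradients. The following is a minimal sketch, not part of adapter-transformers: the helper names trainable_parameter_report and assert_has_trainable_parameters are hypothetical, and the code only assumes the model exposes named_parameters() the way a torch.nn.Module does.

```python
from typing import List, Tuple


def trainable_parameter_report(model) -> Tuple[int, int, List[str]]:
    """Return (trainable_count, total_count, trainable_names) for a model.

    `model` is assumed to behave like a torch.nn.Module: named_parameters()
    yields (name, parameter) pairs, and each parameter has .requires_grad
    and .numel().
    """
    trainable, total, names = 0, 0, []
    for name, param in model.named_parameters():
        n = param.numel()
        total += n
        if param.requires_grad:
            trainable += n
            names.append(name)
    return trainable, total, names


def assert_has_trainable_parameters(model) -> None:
    """Fail early with a readable message instead of inside loss.backward()."""
    trainable, total, _ = trainable_parameter_report(model)
    if trainable == 0:
        raise ValueError(
            f"No trainable parameters (0 of {total}); "
            "loss.backward() would fail because nothing requires grad."
        )
```

Calling assert_has_trainable_parameters(model) right after model.train_adapter(["simplification"]) should raise in the setup described here. The list of names returned by trainable_parameter_report(model) shows which adapter weights, if any, were actually registered.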
I tried adding some of these layers to the mixin, as done in the BARTMixin, without success:
def _init_adapter_modules(self):
    if not isinstance(self.encoder, ModelAdaptersMixin) or not isinstance(self.decoder, ModelAdaptersMixin):
        return

    # Relay all invertible adapter calls to encoder
    self.invertible_adapters = self.encoder.base_model.invertible_adapters
    self.add_invertible_adapter = self.encoder.base_model.add_invertible_adapter
    self.get_invertible_adapter = self.encoder.base_model.get_invertible_adapter
    self.enable_invertible_adapters = self.encoder.base_model.enable_invertible_adapters
    self.invertible_adapters_forward = self.encoder.base_model.invertible_adapters_forward

    # Decoder should use invertible adapters of encoder
    self.decoder.base_model.invertible_adapters = self.encoder.base_model.invertible_adapters
    self.decoder.base_model.add_invertible_adapter = lambda *args: None
    self.decoder.base_model.get_invertible_adapter = self.encoder.base_model.get_invertible_adapter

    # Added these lines from BARTMixin
    # Also tried with self.encoder.base_model...
    self.encoder.attention_adapters = AdapterLayer("mh_adapter", self.config)
    self.encoder.output_adapters = AdapterLayer("output_adapter", self.config)
    self.encoder.attention_adapters._init_adapter_modules()
    self.encoder.output_adapters._init_adapter_modules()
    self.decoder.cross_attention_adapters = AdapterLayer("cross_adapter", self.config)
    self.decoder.cross_attention_adapters._init_adapter_modules()
Thanks for reporting these issues, and sorry for not getting back to you earlier. Unfortunately, our current encoder-decoder implementation is very hacky and has all sorts of issues. We'll try to look into this.