adapter-hub / adapters

A Unified Library for Parameter-Efficient and Modular Transfer Learning

Home Page: https://docs.adapterhub.ml

BertGeneration training yields "RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn"

julianpollmann opened this issue

Environment info

  • adapter-transformers version: 3.2.1
  • Platform: Linux-6.2.0-27-generic-x86_64-with-glibc2.37
  • Python version: 3.10.9
  • PyTorch version (GPU?): 1.13.1 (GPU)

Details

I'm trying to train an adapter on an EncoderDecoderModel built from BertGeneration, using the Seq2SeqAdapterTrainer:

from transformers import (
    BertGenerationDecoder,
    BertGenerationEncoder,
    BertTokenizerFast,
    EncoderDecoderModel,
)
from transformers.adapters import AdapterConfig

# Tokenizer used below (a BertTokenizerFast, as the trainer log confirms)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")

encoder = BertGenerationEncoder.from_pretrained(
    "bert-base-multilingual-cased",
    bos_token_id=101,
    eos_token_id=102
)
decoder = BertGenerationDecoder.from_pretrained(
    "bert-base-multilingual-cased",
    add_cross_attention=True,
    is_decoder=True,
    bos_token_id=101,
    eos_token_id=102
)
model = EncoderDecoderModel(encoder=encoder, decoder=decoder)

model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.vocab_size = model.config.encoder.vocab_size

adapter_config = AdapterConfig.load("pfeiffer", reduction_factor=2)
model.add_adapter("simplification", config=adapter_config, set_active=True)
model.train_adapter(["simplification"])

This results in a RuntimeError:

The following columns in the training set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: token_type_ids. If token_type_ids are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 10
  Num Epochs = 3000
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 3000
  Number of trainable parameters = 0
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
xxx/adapters/lib/python3.10/site-packages/transformers/models/encoder_decoder/modeling_encoder_decoder.py:654: FutureWarning: Version v4.12.0 introduces a better way to train encoder-decoder models by computing the loss inside the encoder-decoder framework rather than in the decoder itself. You may observe training discrepancies if fine-tuning a model trained with versions anterior to 4.12.0. The decoder_input_ids are now created based on the labels, no need to pass them yourself anymore.
  warnings.warn(DEPRECATION_WARNING, FutureWarning)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[22], line 1
----> 1 train_results = trainer.train()

File ~/.conda/envs/adapters/lib/python3.10/site-packages/transformers/trainer.py:1543, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1538     self.model_wrapped = self.model
   1540 inner_training_loop = find_executable_batch_size(
   1541     self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
   1542 )
-> 1543 return inner_training_loop(
   1544     args=args,
   1545     resume_from_checkpoint=resume_from_checkpoint,
   1546     trial=trial,
   1547     ignore_keys_for_eval=ignore_keys_for_eval,
   1548 )

File ~/.conda/envs/adapters/lib/python3.10/site-packages/transformers/trainer.py:1791, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1789         tr_loss_step = self.training_step(model, inputs)
   1790 else:
-> 1791     tr_loss_step = self.training_step(model, inputs)
   1793 if (
   1794     args.logging_nan_inf_filter
   1795     and not is_torch_tpu_available()
   1796     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   1797 ):
   1798     # if loss is nan or inf simply add the average of previous logged losses
   1799     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File ~/.conda/envs/adapters/lib/python3.10/site-packages/transformers/trainer.py:2549, in Trainer.training_step(self, model, inputs)
   2546     loss = loss / self.args.gradient_accumulation_steps
   2548 if self.do_grad_scaling:
-> 2549     self.scaler.scale(loss).backward()
   2550 elif self.use_apex:
   2551     with amp.scale_loss(loss, self.optimizer) as scaled_loss:

File ~/.conda/envs/adapters/lib/python3.10/site-packages/torch/_tensor.py:488, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    478 if has_torch_function_unary(self):
    479     return handle_torch_function(
    480         Tensor.backward,
    481         (self,),
   (...)
    486         inputs=inputs,
    487     )
--> 488 torch.autograd.backward(
    489     self, gradient, retain_graph, create_graph, inputs=inputs
    490 )

File ~/.conda/envs/adapters/lib/python3.10/site-packages/torch/autograd/__init__.py:197, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    192     retain_graph = create_graph
    194 # The reason we repeat same the comment below is that
    195 # some Python versions print out the first line of a multi-line function
    196 # calls in the traceback and some print out the last line
--> 197 Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    198     tensors, grad_tensors_, retain_graph, create_graph, inputs,
    199     allow_unreachable=True, accumulate_grad=True)

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

This issue is also present in transformers itself, where changing the optimizer is discussed as a solution. However, passing a custom optimizer to the Seq2SeqAdapterTrainer did not work:

import torch
from transformers import Seq2SeqTrainingArguments, get_linear_schedule_with_warmup
from transformers.adapters import Seq2SeqAdapterTrainer

training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="no",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    fp16=True, 
    output_dir="./bert2bert/",
    max_steps=3000,
    logging_steps=50,
    save_strategy="no",
    # eval_steps=50,
    learning_rate=3e-4,
    warmup_ratio=0.1,
    #optim="adamw_torch",
    remove_unused_columns=True
)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=100,
    num_training_steps=3000,
)
optimizers = (optimizer, scheduler)

trainer = Seq2SeqAdapterTrainer(
    tokenizer=tokenizer,
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],
    optimizers=optimizers
)
train_results = trainer.train()

Training with the same settings, but without adapters and with the plain Seq2SeqTrainer, does work.
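For reference, the working baseline looks roughly like this (a sketch, assuming the model is rebuilt as above but without the add_adapter/train_adapter calls, and the same tokenizer, training_args, data_collator, and tokenized_dataset):

from transformers import Seq2SeqTrainer

# Plain HF trainer, no adapters: all model weights keep requires_grad=True.
baseline_trainer = Seq2SeqTrainer(
    tokenizer=tokenizer,
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],
)
baseline_trainer.train()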

It seems this is caused by the trainer freezing the model weights while the EncoderDecoderMixin does not add any adapter modules other than the invertible adapters.
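A quick way to check which modules were actually created is to list every submodule whose name contains the adapter name (a diagnostic sketch; it assumes the ModuleDict naming used by adapter-transformers 3.x):

# List all submodules that belong to the "simplification" adapter.
for name, _ in model.named_modules():
    if "simplification" in name:
        print(name)
# If the analysis above is correct, only the encoder's invertible adapter shows
# up here; no bottleneck adapter layers are inserted into the transformer blocks.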

As a result, the trainer does not get any parameters with requires_grad=True to pass to the optimizer over here.
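This is easy to verify right after the train_adapter call; the result matches the "Number of trainable parameters = 0" line in the trainer log above:

# Parameters the trainer would hand to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
print(len(trainable), sum(p.numel() for p in trainable))  # prints "0 0" in this setup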

I tried adding some of these layers to the mixin, as is done in the BARTMixin, without success:

def _init_adapter_modules(self):
    if not isinstance(self.encoder, ModelAdaptersMixin) or not isinstance(self.decoder, ModelAdaptersMixin):
        return

    # Relay all invertible adapter calls to encoder
    self.invertible_adapters = self.encoder.base_model.invertible_adapters
    self.add_invertible_adapter = self.encoder.base_model.add_invertible_adapter
    self.get_invertible_adapter = self.encoder.base_model.get_invertible_adapter
    self.enable_invertible_adapters = self.encoder.base_model.enable_invertible_adapters
    self.invertible_adapters_forward = self.encoder.base_model.invertible_adapters_forward
    # Decoder should use invertible adapters of encoder
    self.decoder.base_model.invertible_adapters = self.encoder.base_model.invertible_adapters
    self.decoder.base_model.add_invertible_adapter = lambda *args: None
    self.decoder.base_model.get_invertible_adapter = self.encoder.base_model.get_invertible_adapter

    # Added these lines from BARTMixin
    # Also tried with self.encoder.base_model...
    self.encoder.attention_adapters = AdapterLayer("mh_adapter", self.config)
    self.encoder.output_adapters = AdapterLayer("output_adapter", self.config)
    self.encoder.attention_adapters._init_adapter_modules()
    self.encoder.output_adapters._init_adapter_modules()

    self.decoder.cross_attention_adapters = AdapterLayer("cross_adapter", self.config)
    self.decoder.cross_attention_adapters._init_adapter_modules()

Any ideas, @calpt @hSterz?

Thanks for reporting these issues and sorry for not getting back to you earlier. Unfortunately, our current encoder-decoder implementation is very hacky and has all sorts of issues. We'll try to look into this.