Training started failing yesterday.
thelinuxkid opened this issue
Describe the bug
I started getting this error yesterday after I reinstalled Ludwig. This was not happening before with the same config and code.
My hunch is that some external library updated and broke something.
Traceback (most recent call last):
  File "train_enric_actions.py", line 109, in <module>
    train(config_path, base_model, dataset, model_name, output_directory)
  File "train_enric_actions.py", line 95, in train
    model.train(
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/ludwig/api.py", line 654, in train
    train_stats = trainer.train(
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/ludwig/trainers/trainer.py", line 967, in train
    should_break = self._train_loop(
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/ludwig/trainers/trainer.py", line 1137, in _train_loop
    loss, all_losses = self.train_step(inputs, targets, should_step=should_step, profiler=profiler)
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/ludwig/trainers/trainer.py", line 314, in train_step
    self.distributed.backward(loss, self.dist_model)
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/ludwig/distributed/base.py", line 56, in backward
    loss.backward()
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 271, in backward
    outputs = ctx.run_function(*detached_inputs)
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 654, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 293, in forward
    raise ValueError(
ValueError: Attention mask should be of size (1, 1, 519, 1038), but is torch.Size([1, 1, 519, 519])
Training: 0%|
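One detail that may matter (my own back-of-the-envelope check, not from the logs): the expected mask's last dimension, 1038, is exactly twice the actual sequence length of 519. If the attention mask is shaped (batch, 1, query_len, key_len) with key_len = query_len + past_kv_len, that would be consistent with a stale key/value cache of one full sequence being carried into the forward re-run that gradient checkpointing performs during backward:

```python
# Arithmetic sanity check on the shapes in the ValueError above.
# Assumption: the expected mask shape is (batch, 1, query_len, key_len),
# where key_len = query_len + past_kv_len.
query_len = 519
past_kv_len = 519                  # hypothetical stale KV cache of one full sequence
expected_key_len = query_len + past_kv_len
assert expected_key_len == 1038    # matches "(1, 1, 519, 1038)" in the error
```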
To Reproduce
Steps to reproduce the behavior:
config:
{
    "model_type": "llm",
    "base_model": "{{base_model}}",
    "generation": {"temperature": 0.1},
    "quantization": {"bits": 4},
    "adapter": {"type": "lora"},
    "prompt": {
        "template": "blah blah"
    },
    "input_features": [
        {
            "name": "context_and_intent",
            "type": "text"
        }
    ],
    "output_features": [
        {
            "name": "action",
            "type": "text",
            "preprocessing": {
                "fallback_label": "unsure"
            },
            "decoder": {
                "type": "text_extractor",
                "match": {
                    "unsure": {
                        "type": "contains",
                        "value": "unsure"
                    },
                    "cat1": {
                        "type": "contains",
                        "value": "cat1"
                    },
                    "cat2": {
                        "type": "contains",
                        "value": "cat2"
                    }
                }
            }
        }
    ],
    "preprocessing": {
        "split": {
            "type": "random",
            "probabilities": [0.95, 0, 0.05]
        }
    },
    "trainer": {
        "type": "finetune",
        "epochs": 13,
        "early_stop": -1,
        "optimizer": {
            "type": "paged_adam"
        },
        "weight_decay": 0.1,
        "batch_size": 1,
        "learning_rate": 0.0002,
        "eval_batch_size": 2,
        "learning_rate_scheduler": {
            "decay": "cosine",
            "warmup_fraction": 0.03
        },
        "gradient_accumulation_steps": 16,
        "enable_gradient_checkpointing": true
    }
}
import logging

from ludwig.api import LudwigModel

# config, df, version, model_name, and output_directory are defined
# earlier in the training script.
model = LudwigModel(config=config, logging_level=logging.INFO)
model.train(
    dataset=df,
    experiment_name=version,
    model_name=model_name,
    output_directory=output_directory,
    skip_save_processed_input=True,
)
Model
Weyaxi/OpenHermes-2.5-neural-chat-7b-v3-1-7B
Expected behavior
Training should proceed normally, as it did before the reinstall.
Environment:
- OS: Ubuntu 20.04.6 LTS (focal)
- CUDA version: 12.1 (also tried 11.8 and got the same error)
- Python version: 3.8.10
- Ludwig version: 0.8.6
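As a quick sanity check on the installed versions of the libraries in the stack (snippet added for illustration, not from the original report):

```python
import torch
import transformers

# Print the versions of the two libraries most likely involved in the regression.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
```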
@thelinuxkid This should be fixed in the latest master. The issue stems from the transformers release that went out yesterday (transformers 4.36), which currently has some compatibility issues with Ludwig. We're investigating internally to make sure Ludwig is fully compatible with transformers 4.36, but for now, are you able to either re-install dependencies from Ludwig master or just downgrade transformers to 4.35.2?
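For example (the exact repository URL and branch name below are assumptions, not taken from the comment above):

```
# Option 1: downgrade transformers to the last known-good release
pip install "transformers==4.35.2"

# Option 2: install Ludwig from the master branch
# (assumed repository: github.com/ludwig-ai/ludwig)
pip install "git+https://github.com/ludwig-ai/ludwig.git@master"
```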
Thanks! That's what I suspected. Will do!
Awesome! Will come back with an update when things are compatible :)