Training started failing yesterday.
thelinuxkid opened this issue
Describe the bug
I started getting this error yesterday after I reinstalled Ludwig. This was not happening before with the same config and code.
My hunch is that some external library updated and broke something.
Traceback (most recent call last):
  File "train_enric_actions.py", line 109, in <module>
    train(config_path, base_model, dataset, model_name, output_directory)
  File "train_enric_actions.py", line 95, in train
    model.train(
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/ludwig/api.py", line 654, in train
    train_stats = trainer.train(
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/ludwig/trainers/trainer.py", line 967, in train
    should_break = self._train_loop(
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/ludwig/trainers/trainer.py", line 1137, in _train_loop
    loss, all_losses = self.train_step(inputs, targets, should_step=should_step, profiler=profiler)
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/ludwig/trainers/trainer.py", line 314, in train_step
    self.distributed.backward(loss, self.dist_model)
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/ludwig/distributed/base.py", line 56, in backward
    loss.backward()
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 271, in backward
    outputs = ctx.run_function(*detached_inputs)
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 654, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/ubuntu/.local/share/virtualenvs/ludwig-JgQxVRRw/lib/python3.8/site-packages/transformers/models/mistral/modeling_mistral.py", line 293, in forward
    raise ValueError(
ValueError: Attention mask should be of size (1, 1, 519, 1038), but is torch.Size([1, 1, 519, 519])
Training: 0%|
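One detail that may matter (my own back-of-the-envelope check, not from the logs): the expected mask's last dimension, 1038, is exactly twice the actual sequence length of 519. If the attention mask is shaped (batch, 1, query_len, key_len) with key_len = query_len + past_kv_len, that would be consistent with a stale key/value cache of one full sequence being carried into the forward re-run that gradient checkpointing performs during backward:

```python
# Arithmetic sanity check on the shapes in the ValueError above.
# Assumption: the expected mask shape is (batch, 1, query_len, key_len),
# where key_len = query_len + past_kv_len.
query_len = 519
past_kv_len = 519                  # hypothetical stale KV cache of one full sequence
expected_key_len = query_len + past_kv_len
assert expected_key_len == 1038    # matches "(1, 1, 519, 1038)" in the error
```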
To Reproduce
Steps to reproduce the behavior:
config:
{
    "model_type": "llm",
    "base_model": "{{base_model}}",
    "generation": {"temperature": 0.1},
    "quantization": {"bits": 4},
    "adapter": {"type": "lora"},
    "prompt": {
        "template": "blah blah"
    },
    "input_features": [
        {
            "name": "context_and_intent",
            "type": "text"
        }
    ],
    "output_features": [
        {
            "name": "action",
            "type": "text",
            "preprocessing": {
                "fallback_label": "unsure"
            },
            "decoder": {
                "type": "text_extractor",
                "match": {
                    "unsure": {
                        "type": "contains",
                        "value": "unsure"
                    },
                    "cat1": {
                        "type": "contains",
                        "value": "cat1"
                    },
                    "cat2": {
                        "type": "contains",
                        "value": "cat2"
                    }
                }
            }
        }
    ],
    "preprocessing": {
        "split": {
            "type": "random",
            "probabilities": [0.95, 0, 0.05]
        }
    },
    "trainer": {
        "type": "finetune",
        "epochs": 13,
        "early_stop": -1,
        "optimizer": {
            "type": "paged_adam"
        },
        "weight_decay": 0.1,
        "batch_size": 1,
        "learning_rate": 0.0002,
        "eval_batch_size": 2,
        "learning_rate_scheduler": {
            "decay": "cosine",
            "warmup_fraction": 0.03
        },
        "gradient_accumulation_steps": 16,
        "enable_gradient_checkpointing": true
    }
}
import logging

from ludwig.api import LudwigModel

# config, df, version, model_name, and output_directory are defined
# earlier in the training script.
model = LudwigModel(config=config, logging_level=logging.INFO)
model.train(
    dataset=df,
    experiment_name=version,
    model_name=model_name,
    output_directory=output_directory,
    skip_save_processed_input=True,
)
Model
Weyaxi/OpenHermes-2.5-neural-chat-7b-v3-1-7B
Expected behavior
Training should proceed normally, as it did before the reinstall.
Environment:
- OS: Ubuntu 20.04.6 LTS (focal)
- CUDA version: 12.1 (also tried 11.8 and got the same error)
- Python version: 3.8.10
- Ludwig version: 0.8.6
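As a quick sanity check on the installed versions of the libraries in the stack (snippet added for illustration, not from the original report):

```python
import torch
import transformers

# Print the versions of the two libraries most likely involved in the regression.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
```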
@thelinuxkid This should be fixed in the latest master. The issue stems from the transformers release that went out yesterday (transformers 4.36), which currently has some compatibility issues with Ludwig. We're investigating internally to make sure Ludwig is fully compatible with transformers 4.36, but for now, are you able to either re-install dependencies from Ludwig master or just downgrade transformers to 4.35.2?
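For example (the exact repository URL and branch name below are assumptions, not taken from the comment above):

```
# Option 1: downgrade transformers to the last known-good release
pip install "transformers==4.35.2"

# Option 2: install Ludwig from the master branch
# (assumed repository: github.com/ludwig-ai/ludwig)
pip install "git+https://github.com/ludwig-ai/ludwig.git@master"
```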
Thanks! That's what I suspected. Will do!
Awesome! Will come back with an update when things are compatible :)