Fine tuning on a personal annotated dataset (in conll-2012 propbank style)

Question

felgaet opened this issue 2 years ago

felgaet commented 2 years ago

Dear @Riccorl,
Thanks for sharing this code. I would like to ask you:

If I have my own annotated dataset (in conll-2012 style), can I easily fine-tune the model on it? If yes, could you please show me how?

Riccardo Orlando · Answer 1 · Tue Feb 08 2022 17:45:40 GMT+0800 (China Standard Time)

Sorry for the late answer. In case you haven't solved the problem yet, here is how you can use your own data.

In the training config file, e.g. this one, you can set train_data_path and validation_data_path to the path of the folder containing your data. Then you can train the model:

allennlp train path/to/your/config -s path/to/model --include-package transformer_srl

felgaet · Answer 2 · Thu Feb 10 2022 21:26:20 GMT+0800 (China Standard Time)

Dear @Riccorl,
Thanks for the reply. Now, when I run the command you indicated, I get this error:

"allennlp.common.checks.ConfigurationError: No instances were read from the given filepath home/username/project_name/data/train/data/english/annotations. Is the path correct?"

However, the conll2012 files I would like to train the model on are in that path.

Could you help me?

Riccardo Orlando · Answer 3 · Thu Feb 10 2022 21:33:08 GMT+0800 (China Standard Time)

What does that path contain? The AllenNLP reader searches for gold_conll files.

felgaet · Answer 4 · Thu Feb 10 2022 23:26:24 GMT+0800 (China Standard Time)

Thanks @Riccorl for your answer. The problem was, as you said, the file extension: the reader requires ".gold_conll" files, while mine were ".conll". Changing the extension seems to work.

What I would like to do, however, is fine-tune a model previously trained for SRL, for example "tli8hf/robertabase-crf-conll2012".
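
For anyone hitting the same "No instances were read" error, the rename described above could be scripted roughly as follows. This is only a sketch: the helper name rename_to_gold_conll and its dry_run flag are made up for illustration, and it assumes every ".conll" file under the data folder really is CoNLL-2012 formatted.

```python
from pathlib import Path


def rename_to_gold_conll(data_dir, dry_run=True):
    """Rename every *.conll file under data_dir to *.gold_conll.

    Returns the (old, new) path pairs. With dry_run=True nothing is renamed,
    so the planned changes can be inspected first.
    """
    renamed = []
    # sorted() materializes the listing before any file is renamed.
    for old_path in sorted(Path(data_dir).rglob("*.conll")):
        new_path = old_path.with_suffix(".gold_conll")
        if not dry_run:
            old_path.rename(new_path)
        renamed.append((old_path, new_path))
    return renamed
```

Calling rename_to_gold_conll("path/to/data") first and checking the returned pairs, then calling it again with dry_run=False, keeps the operation easy to review. Files that already end in ".gold_conll" are not matched by the "*.conll" pattern and are left alone.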

I edited the configuration file as follows:

{
    "dataset_reader": {
        "type": "transformer_srl_span",
        "model_name": "tli8hf/robertabase-crf-conll2012",
    },

    "data_loader": {
        "batch_sampler": {
            "type": "bucket",
            "batch_size": 32
        }
    },

    "train_data_path": std.extVar("SRL_TRAIN_DATA_PATH"),
    "validation_data_path": std.extVar("SRL_VALIDATION_DATA_PATH"),

    "model": {
        "type": "transformer_srl_span",
        "embedding_dropout": 0.1,
        "bert_model": "tli8hf/robertabase-crf-conll2012",
    },

    "trainer": {
        "optimizer": {
            "type": "huggingface_adamw",
            "lr": 5e-5,
            "correct_bias": false,
            "weight_decay": 0.01,
            "parameter_groups": [
                [["bias", "LayerNorm.bias", "LayerNorm.weight", "layer_norm.weight"], {"weight_decay": 0.0}],
            ],
        },

        "learning_rate_scheduler": {
            "type": "slanted_triangular",
        },
        "checkpointer": {
            "num_serialized_models_to_keep": 2,
        },
        "grad_norm": 1.0,
        "num_epochs": 15,
        "validation_metric": "+f1_role",
        "cuda_device": -1,
    },
}

And I get the following error:

2022-02-10 16:13:58,685 - CRITICAL - root - Uncaught exception
Traceback (most recent call last):
  File "/home/username/miniconda3/envs/transformer-srl/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/allennlp/__main__.py", line 34, in run
    main(prog="allennlp")
  File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/allennlp/commands/__init__.py", line 118, in main
    args.func(args)
  File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/allennlp/commands/train.py", line 119, in train_model_from_args
    file_friendly_logging=args.file_friendly_logging,
  File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/allennlp/commands/train.py", line 178, in train_model_from_file
    file_friendly_logging=file_friendly_logging,
  File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/allennlp/commands/train.py", line 242, in train_model
    file_friendly_logging=file_friendly_logging,
  File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/allennlp/commands/train.py", line 466, in _train_worker
    metrics = train_loop.run()
  File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/allennlp/commands/train.py", line 528, in run
    return self.trainer.train()
  File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/allennlp/training/trainer.py", line 966, in train
    return self._try_train()
  File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/allennlp/training/trainer.py", line 1001, in _try_train
    train_metrics = self._train_epoch(epoch)
  File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/allennlp/training/trainer.py", line 716, in _train_epoch
    batch_outputs = self.batch_outputs(batch, for_training=True)
  File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/allennlp/training/trainer.py", line 604, in batch_outputs
    output_dict = self._pytorch_model(**batch)
  File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/transformer_srl/models.py", line 146, in forward
    input_ids=input_ids, token_type_ids=verb_indicator, attention_mask=mask,
  File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/transformers/modeling_roberta.py", line 684, in forward
    input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
  File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/transformers/modeling_roberta.py", line 119, in forward
    token_type_embeddings = self.token_type_embeddings(token_type_ids)
  File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 126, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/torch/nn/functional.py", line 1852, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

The same happens for example if I try to train roberta-base.

Do you have any ideas how to fix this error?
Thanks for your great help.

Riccardo Orlando · Answer 5 · Thu Feb 10 2022 23:32:14 GMT+0800 (China Standard Time)

This is probably because RoBERTa doesn't support token type ids, while the model uses them as the verb indicator. To make it work with RoBERTa-based models, you would have to change this logic: remove the token type ids from the input and find another way to indicate the verb to the model.
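
One way to check this before training is to look at the type_vocab_size field of the checkpoint's Hugging Face config: the verb indicator takes the values 0 and 1, so the token type embedding table needs at least two rows. The sketch below uses the values published in the config.json files of bert-base-uncased (2) and roberta-base (1); the helper name supports_verb_indicator is hypothetical, and the value for tli8hf/robertabase-crf-conll2012 was not checked here.

```python
import json


def supports_verb_indicator(config, num_indicator_values=2):
    """Return True if the token type embedding table has one row per
    verb-indicator value (0 = not the predicate, 1 = the predicate)."""
    return config.get("type_vocab_size", 0) >= num_indicator_values


# Excerpts of the published config.json files (only the relevant fields).
bert_config = json.loads('{"model_type": "bert", "type_vocab_size": 2}')
roberta_config = json.loads('{"model_type": "roberta", "type_vocab_size": 1}')

print(supports_verb_indicator(bert_config))     # True
print(supports_verb_indicator(roberta_config))  # False
```

With a single-row table, a token type id of 1 indexes past the end of the embedding matrix, which would be consistent with the "IndexError: index out of range in self" raised in token_type_embeddings in the traceback above.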

felgaet · Answer 6 · Fri Feb 11 2022 01:27:40 GMT+0800 (China Standard Time)

I understand. Thanks, @Riccorl.
Alternatively, how can I fine-tune starting not from "bert-base-uncased" but from the SRL model that you have made available in this repository (srl_bert_base_conll2012.tar.gz)?

When I give its path in the jsonnet config file, I get:

OSError: Can't load config for './home/username/frame_disambiguation/srl_bert_base_conll2012.tar.gz'. Make sure that:
- './home/username/frame_disambiguation/srl_bert_base_conll2012.tar.gz' is a correct model identifier listed on 'https://huggingface.co/models'
- or './home/username/frame_disambiguation/srl_bert_base_conll2012.tar.gz' is the correct path to a directory containing a config.json file

Thanks for your help and patience.

Riccardo Orlando · Answer 7 · Fri Feb 11 2022 17:36:11 GMT+0800 (China Standard Time)

In that case, you have to unpack it. Inside you will find the weights. You can load them into your model, but you will have to add some code for it somewhere (e.g. torch.load followed by load_state_dict), since the file contains the weights for the model as a whole (transformer + classifiers) and not just the Hugging Face part.
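
A rough sketch of what that could look like is below. It assumes the archive follows the usual AllenNLP layout with a weights.th file at the top level, and that model is an already constructed instance of the same architecture; the function names extract_archive and load_pretrained_weights are made up for illustration, and where exactly to call them in the training code is left open.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, target_dir):
    """Unpack a model archive (.tar.gz) and return the path to weights.th."""
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    # Only unpack archives from a source you trust.
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target_dir)
    weights_path = Path(target_dir) / "weights.th"
    if not weights_path.is_file():
        raise FileNotFoundError("no weights.th found in " + str(archive_path))
    return weights_path


def load_pretrained_weights(model, weights_path):
    """Copy the archived weights into an existing model (requires PyTorch)."""
    import torch  # imported lazily so that extraction works without PyTorch

    state_dict = torch.load(str(weights_path), map_location="cpu")
    # strict=False tolerates missing or unexpected parameter names; tensors
    # whose shapes differ (e.g. a different label set) still raise an error.
    return model.load_state_dict(state_dict, strict=False)
```

The value returned by load_state_dict lists the missing and unexpected keys, which is worth printing once to confirm that the transformer and classifier weights were actually picked up.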

felgaet · Answer 8 · Mon Feb 14 2022 19:48:04 GMT+0800 (China Standard Time)

Thanks a lot!