Fine tuning on a personal annotated dataset (in conll-2012 propbank style)
felgaet opened this issue
Dear @Riccorl,
Thanks for sharing this code. I would like to ask you:
- If I have my own annotated dataset (in conll-2012 style), can I easily fine-tune the model on it? If yes, could you please show me how?
Sorry for the late answer. If you haven't solved your problem yet, here's how you can use your own data.
In the training config file (e.g. this one), change train_data_path and validation_data_path to the paths of the folders containing your data. Then you can train the model with:
allennlp train path/to/your/config -s path/to/model --include-package transformer_srl
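If the config reads the two paths with std.extVar(...), as the config posted later in this thread does, the values come from environment variables. Below is a minimal sketch of launching the same command from Python under that assumption. The helper names are hypothetical; the variable names SRL_TRAIN_DATA_PATH and SRL_VALIDATION_DATA_PATH are the ones used in that config.

```python
import os
import subprocess


def build_env(train_dir, validation_dir, base_env=None):
    """Return a copy of the environment with the two data paths set."""
    env = dict(os.environ if base_env is None else base_env)
    env["SRL_TRAIN_DATA_PATH"] = train_dir
    env["SRL_VALIDATION_DATA_PATH"] = validation_dir
    return env


def build_train_command(config_path, output_dir):
    """Build the `allennlp train` command from this thread as an argument list."""
    return [
        "allennlp", "train", config_path,
        "-s", output_dir,
        "--include-package", "transformer_srl",
    ]


def run_training(config_path, output_dir, train_dir, validation_dir):
    """Start training; raises CalledProcessError if allennlp exits with an error."""
    subprocess.run(
        build_train_command(config_path, output_dir),
        env=build_env(train_dir, validation_dir),
        check=True,
    )
```

Calling run_training("path/to/your/config", "path/to/model", train_dir, validation_dir) should then behave like exporting the two variables in the shell before running the command by hand.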
Dear @Riccorl,
Thanks for the reply. Now when I run the command you indicated I get the error:
"allennlp.common.checks.ConfigurationError: No instances were read from the given filepath home/username/project_name/data/train/data/english/annotations. Is the path correct?"
But the conll2012 files on which I would like to train the model are in that path.
Could you help me?
What does that path contain? The AllenNLP reader only looks for gold_conll files.
Thanks @Riccorl for your answer. The problem was, as you said, the file extension: the reader requires ".gold_conll" files, while mine were ".conll". Changing the extension seems to work correctly.
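The extension fix can be scripted. This is a minimal sketch, assuming the data folder holds ".conll" files and the reader accepts any file whose name ends in "gold_conll". Both function names are hypothetical, and the script copies rather than renames so the originals stay in place.

```python
import shutil
from pathlib import Path


def copy_to_gold_conll(data_dir):
    """Copy every *.conll file under data_dir to a sibling *.gold_conll file.

    Returns the newly created paths. Existing targets are left untouched.
    """
    created = []
    # sorted() materializes the listing before any new files are written.
    for src in sorted(Path(data_dir).rglob("*.conll")):
        dst = src.with_suffix(".gold_conll")
        if not dst.exists():
            shutil.copyfile(src, dst)
            created.append(dst)
    return created


def files_seen_by_reader(data_dir):
    """List the files under data_dir whose names end in "gold_conll"."""
    return sorted(
        p for p in Path(data_dir).rglob("*")
        if p.is_file() and p.name.endswith("gold_conll")
    )
```

If files_seen_by_reader returns an empty list for the folder given in the config, the "No instances were read" error above is the expected result.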
What I would like to do, however, is fine-tune a model previously trained for SRL, for example "tli8hf/robertabase-crf-conll2012".
I edited the configuration file as follows:
{
    "dataset_reader": {
        "type": "transformer_srl_span",
        "model_name": "tli8hf/robertabase-crf-conll2012",
    },
    "data_loader": {
        "batch_sampler": {
            "type": "bucket",
            "batch_size" : 32
        }
    },
    "train_data_path": std.extVar("SRL_TRAIN_DATA_PATH"),
    "validation_data_path": std.extVar("SRL_VALIDATION_DATA_PATH"),
    "model": {
        "type": "transformer_srl_span",
        "embedding_dropout": 0.1,
        "bert_model": "tli8hf/robertabase-crf-conll2012",
    },
    "trainer": {
        "optimizer": {
            "type": "huggingface_adamw",
            "lr": 5e-5,
            "correct_bias": false,
            "weight_decay": 0.01,
            "parameter_groups": [
                [["bias", "LayerNorm.bias", "LayerNorm.weight", "layer_norm.weight"], {"weight_decay": 0.0}],
            ],
        },
        "learning_rate_scheduler": {
            "type": "slanted_triangular",
        },
        "checkpointer": {
            "num_serialized_models_to_keep": 2,
        },
        "grad_norm": 1.0,
        "num_epochs": 15,
        "validation_metric": "+f1_role",
        "cuda_device": -1,
    },
}
And I get the following error:
2022-02-10 16:13:58,685 - CRITICAL - root - Uncaught exception
Traceback (most recent call last):
File "/home/username/miniconda3/envs/transformer-srl/bin/allennlp", line 8, in <module>
sys.exit(run())
File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/allennlp/__main__.py", line 34, in run
main(prog="allennlp")
File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/allennlp/commands/__init__.py", line 118, in main
args.func(args)
File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/allennlp/commands/train.py", line 119, in train_model_from_args
file_friendly_logging=args.file_friendly_logging,
File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/allennlp/commands/train.py", line 178, in train_model_from_file
file_friendly_logging=file_friendly_logging,
File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/allennlp/commands/train.py", line 242, in train_model
file_friendly_logging=file_friendly_logging,
File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/allennlp/commands/train.py", line 466, in _train_worker
metrics = train_loop.run()
File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/allennlp/commands/train.py", line 528, in run
return self.trainer.train()
File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/allennlp/training/trainer.py", line 966, in train
return self._try_train()
File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/allennlp/training/trainer.py", line 1001, in _try_train
train_metrics = self._train_epoch(epoch)
File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/allennlp/training/trainer.py", line 716, in _train_epoch
batch_outputs = self.batch_outputs(batch, for_training=True)
File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/allennlp/training/trainer.py", line 604, in batch_outputs
output_dict = self._pytorch_model(**batch)
File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/transformer_srl/models.py", line 146, in forward
input_ids=input_ids, token_type_ids=verb_indicator, attention_mask=mask,
File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/transformers/modeling_roberta.py", line 684, in forward
input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/transformers/modeling_roberta.py", line 119, in forward
token_type_embeddings = self.token_type_embeddings(token_type_ids)
File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 126, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/home/username/miniconda3/envs/transformer-srl/lib/python3.6/site-packages/torch/nn/functional.py", line 1852, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
The same happens for example if I try to train roberta-base.
Do you have any ideas how to fix this error?
Thanks for your great help.
This is probably because RoBERTa doesn't support token type IDs: its token type embedding table has only one entry, so an ID of 1 is out of range. This model, however, uses token type IDs as the verb indicator. To make it work with RoBERTa-based models, you would have to change this logic: remove the token type IDs from the input and find another way to indicate verbs to the model.
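You can check whether an encoder can take the verb indicator before training. This is a minimal sketch, assuming the checkpoint ships a Hugging Face config.json with a type_vocab_size field (2 for bert-base-uncased, 1 for roberta-base). The function names are hypothetical.

```python
import json


def load_config(path):
    """Read a Hugging Face config.json into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def supports_token_type_ids(config):
    """True if the encoder has embeddings for both token type IDs 0 and 1.

    The verb indicator marks predicate tokens with 1, so the token type
    embedding table needs at least two rows.
    """
    return config.get("type_vocab_size", 0) >= 2
```

A False result means that passing the verb indicator as token_type_ids should fail with the same IndexError as in the traceback above.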
I understand. Thanks, @Riccorl.
How can I instead fine-tune starting not from "bert-base-uncased" but from the SRL model that you have made available in this repository (srl_bert_base_conll2012.tar.gz)?
When I give its path in the jsonnet config file, I get:
"OSError: Can't load config for './home/username/frame_disambiguation/srl_bert_base_conll2012.tar.gz'. Make sure that:
- './home/username/frame_disambiguation/srl_bert_base_conll2012.tar.gz' is a correct model identifier listed on 'https://huggingface.co/models'
- or './home/username/frame_disambiguation/srl_bert_base_conll2012.tar.gz' is the correct path to a directory containing a config.json file"
Thanks for your help and patience.
In that case, you have to unpack the archive. Inside you will find the weights. You can load them into your model, but you will have to add some code for it somewhere (e.g. torch.load followed by load_state_dict), since the file contains the weights for the model as a whole (transformer + classifiers) and not just the Hugging Face part.
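The unpack-and-load steps might look roughly like this. It is a sketch under two assumptions: the archive stores the weights in a single ".th" file, and the function is called on an already constructed model. Both function names are hypothetical, and where to call them inside transformer_srl is not covered here.

```python
import tarfile
from pathlib import Path


def unpack_archive(archive_path, out_dir):
    """Extract a .tar.gz archive and return the first *.th weights file found."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(out_dir)
    weight_files = sorted(out_dir.rglob("*.th"))
    if not weight_files:
        raise FileNotFoundError(f"no .th weights file found in {archive_path}")
    return weight_files[0]


def load_matching_weights(model, weights_path):
    """Copy into `model` every tensor whose name and shape match.

    Returns the names of the tensors in the file that were skipped.
    """
    import torch  # imported here so the unpacking step works without PyTorch

    state = torch.load(weights_path, map_location="cpu")
    own = model.state_dict()
    compatible = {
        name: tensor for name, tensor in state.items()
        if name in own and tensor.shape == own[name].shape
    }
    # strict=False lets parameters missing from `compatible` keep their init.
    model.load_state_dict(compatible, strict=False)
    return sorted(set(state) - set(compatible))
```

Filtering by shape matters when the new dataset has a different label set: the classifier layers would then be skipped and trained from scratch, while the transformer weights are reused.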
Thanks a lot!