TextPair Classification with multilabel Problem
felixvor opened this issue · comments
Question
Not sure if this is a bug or I am doing something wrong here. I am trying to train a model with multilabel classification and two text inputs (i.e. textpair).
I prepared an example dataset with the following format:
```
text	text_b	label
Sentence A.	Sentence B.	0,1,2
Another A.	Another B.	1,2
...
```
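For reference, here is a minimal sketch of how rows in this format could be parsed into multilabel targets. The function name and the one-hot encoding are my own illustration of the data layout, not FARM's internal code:

```python
def parse_multilabel_row(row, label_list, delimiter="\t"):
    """Split a TSV row into (text, text_b, one_hot) where the label
    column holds comma-separated label ids, e.g. "0,1,2"."""
    text, text_b, label_str = row.split(delimiter)
    labels = label_str.split(",") if label_str else []
    # One-hot encode against the known label list: one 0/1 slot per label.
    one_hot = [1 if lbl in labels else 0 for lbl in label_list]
    return text, text_b, one_hot

text, text_b, one_hot = parse_multilabel_row(
    "Another A.\tAnother B.\t1,2", label_list=["0", "1", "2"]
)
# one_hot is [0, 1, 1]: labels "1" and "2" are set, "0" is not.
```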
I found that using TextPairClassificationProcessor
with multilabel=True
seems to work fine to prepare the data for training, which I checked with the debugger. But on training start I get the following error:
```
...
06/22/2021 00:14:41 - INFO - farm.modeling.language_model - Loaded bert-base-cased
06/22/2021 00:14:41 - INFO - farm.modeling.prediction_head - Prediction head initialized with size [768, 3]
06/22/2021 00:14:44 - INFO - farm.modeling.optimization - Loading optimizer `TransformersAdamW`: '{'correct_bias': False, 'weight_decay': 0.01, 'lr': 2e-05}'
06/22/2021 00:14:45 - INFO - farm.modeling.optimization - Using scheduler 'get_linear_schedule_with_warmup'
06/22/2021 00:14:45 - INFO - farm.modeling.optimization - Loading schedule `get_linear_schedule_with_warmup`: '{'num_warmup_steps': 67.2, 'num_training_steps': 672}'
06/22/2021 00:14:47 - INFO - farm.train -
***Growing***
Train epoch 0/1 (Cur. train loss: 0.0000):   0%|          | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "text_pair_classification.py", line 131, in <module>
    text_pair_classification()
  File "text_pair_classification.py", line 98, in text_pair_classification
    trainer.train()
  File "C:\Users\Admin\anaconda3\envs\farm\lib\site-packages\farm\train.py", line 301, in train
    per_sample_loss = self.model.logits_to_loss(logits=logits, global_step=self.global_step, **batch)
  File "C:\Users\Admin\anaconda3\envs\farm\lib\site-packages\farm\modeling\adaptive_model.py", line 386, in logits_to_loss
    all_losses = self.logits_to_loss_per_head(logits, **kwargs)
  File "C:\Users\Admin\anaconda3\envs\farm\lib\site-packages\farm\modeling\adaptive_model.py", line 370, in logits_to_loss_per_head
    all_losses.append(head.logits_to_loss(logits=logits_for_one_head, **kwargs))
  File "C:\Users\Admin\anaconda3\envs\farm\lib\site-packages\farm\modeling\prediction_head.py", line 360, in logits_to_loss
    return self.loss_fct(logits, label_ids.view(-1))
  File "C:\Users\Admin\anaconda3\envs\farm\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Admin\anaconda3\envs\farm\lib\site-packages\torch\nn\modules\loss.py", line 1121, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "C:\Users\Admin\anaconda3\envs\farm\lib\site-packages\torch\nn\functional.py", line 2824, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
ValueError: Expected input batch_size (16) to match target batch_size (48).
```
In the last line, '16' is my batch size and '48' is exactly batch_size * num_prediction_head_outputs (I verified this with different batch sizes and label lists). I hit my limits when trying to debug your training and loss calculation code, and I was wondering if you could help me find a solution. Is FARM suitable for multilabel classification with text pairs? I would like to contribute and make this use case more accessible, but I currently do not know where to start looking for a fix. Maybe you have an idea?
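The mismatch can be reproduced with simple shape arithmetic: a single-label `CrossEntropyLoss` expects one target per example (shape `(batch_size,)`), while multilabel targets carry one 0/1 slot per label (shape `(batch_size, num_labels)`), so flattening them with `.view(-1)` yields `batch_size * num_labels` targets. A plain-Python sketch of just the shapes (no FARM code):

```python
def flattened_target_size(batch_size, num_labels):
    """Multilabel targets are a 0/1 slot per label; view(-1) flattens
    the (batch_size, num_labels) matrix into one dimension."""
    targets = [[0] * num_labels for _ in range(batch_size)]
    flat = [slot for row in targets for slot in row]
    return len(flat)

# CrossEntropyLoss compares len(logits) == batch_size against this
# flattened length, which is exactly the ValueError seen above.
print(flattened_target_size(16, 3))  # 48, as in the traceback
print(flattened_target_size(10, 5))  # 50, as in the later experiment
```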
Any help would be appreciated :)
Hi @DieseKartoffel The data format looks good to me (it is the same as in our multilabel classification example, https://github.com/deepset-ai/FARM/blob/master/examples/doc_classification_multilabel.py, except for the additional `text_b` input). Are you providing `label_list = ["0", "1", "2"]` to the `TextPairClassificationProcessor`? I will try to reproduce the error message on my side.
Hey Julian, thank you for looking into this. Yes, I tried to stay close to your examples for debugging :-)

I made sure to use the correct labels; if a label from the dataset is not part of `label_list`, FARM already outputs a useful error and does not start the training. I also tried different labels with correspondingly prepared datasets. For example, with batch size 10 and `label_list=["a", "b", "c", "d", "e"]` I got `ValueError: Expected input batch_size (10) to match target batch_size (50)`.
So far, I could not replicate the error. Could you maybe share some code and a small data example? What I did so far is the following:

- I took the https://github.com/deepset-ai/FARM/blob/master/examples/doc_classification_multilabel.py example
- Replaced `TextClassificationProcessor` with `TextPairClassificationProcessor`
- Changed the `basic_texts` variable to contain pairs of texts by copying the `text` value to `text_b`
- Added a `text_b` column to the `train.tsv` and `val.tsv` datasets that contains the same text as the `text` column
```python
basic_texts = [
    {"text": ("You ... ...", "You ... ...")},
    {"text": ("What a lovely world", "What a lovely world")},
]
```
The output that I get is the following:

```
[{'task': 'text_classification', 'predictions': [{'start': None, 'end': None, 'context': "('You ... ...', 'You ... ...')", 'label': "['toxic', 'obscene', 'insult']", 'probability': array([0.93692017, 0.19396962, 0.8908834 , 0.10999262, 0.8351795 ,
       0.2840815 ], dtype=float32)}, {'start': None, 'end': None, 'context': "('What a lovely world', 'What a lovely world')", 'label': '[]', 'probability': array([0.371408  , 0.00837683, 0.1528986 , 0.00711144, 0.16077891,
       0.01845325], dtype=float32)}]}]
```
I was able to get it working by reproducing your approach step by step. I then compared that code to my project and found that I was using the wrong prediction head... a very easy fix which I should have spotted from the start. Thank you very much for your help!
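For future readers: the underlying pattern is that a single-label head scores one class per example with a cross-entropy loss, while a multilabel head needs a BCE-with-logits loss that takes a 0/1 target for every label (in FARM this is, if I recall the head names correctly, `MultiLabelTextClassificationHead` rather than `TextClassificationHead`). A pure-Python sketch of the multilabel loss, just to show that logits and targets share the same `(batch_size, num_labels)` shape so no batch-size mismatch can occur:

```python
import math

def bce_with_logits(logits, targets):
    """Per-example multilabel loss: sigmoid each logit, then average
    binary cross-entropy over the labels of that example."""
    assert len(logits) == len(targets)  # shapes line up by construction
    losses = []
    for logit_row, target_row in zip(logits, targets):
        row_loss = 0.0
        for z, y in zip(logit_row, target_row):
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            row_loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        losses.append(row_loss / len(logit_row))
    return losses

# Two examples, three labels each: one loss value per example,
# unlike CrossEntropyLoss fed with flattened multilabel targets.
per_sample = bce_with_logits(
    logits=[[2.0, -1.0, 0.5], [-2.0, 3.0, 1.0]],
    targets=[[1, 0, 1], [0, 1, 1]],
)
```

This mirrors what `torch.nn.BCEWithLogitsLoss(reduction="none")` computes, which is the loss family a multilabel prediction head is built on.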