prajdabre / yanmtt

Yet Another Neural Machine Translation Toolkit


finetuning BART for text classification tasks as seq2seq

GorkaUrbizu opened this issue · comments

Hi,

I trained a monolingual BART using your toolkit, and now I want to evaluate the model on NLU (natural language understanding) tasks, as we don't have any proper seq2seq dataset to evaluate its generative capabilities yet.

The idea is to evaluate the model on sequence labeling and text classification tasks, including sentence-pair classification, but to get started I would like to evaluate it on a single text classification task in the form of text-label pairs, such as topic classification or NLI.

I think your finetuning script train_nmt.py should be enough for that, as the labels could be predicted as target sequences. Otherwise, I thought of finetuning the BART model using Hugging Face tools, but I don't know whether any changes are needed to the model, vocab/tokenizer, and config files, so I want to try your toolkit's finetuning options first; they worked fine for a paraphrasing task using my BART model.

I would like to know whether using --is_summarization makes sense for this type of task, and whether you see any other limitation or any option I should use during finetuning.

I had something like this in mind:

python3 train_nmt.py -n 1 -nr 0 -g 1 \
--is_summarization \
--model_path models/bart_topic \
--pretrained_model models/bart_base_512 \
--tokenizer_name_or_path tokenizers/mbart-bpe50k \
--train_slang xx \
--train_tlang xx \
--dev_slang xx \
--dev_tlang xx \
--train_src train.src.xx \
--train_tgt train.trg.xx \
--dev_src dev.src.xx \
--dev_tgt dev.trg.xx \
--max_src 512 \
--max_tgt 512 \
--batch_size_indicates_lines \
--batch_size 32 \
--num_batches 1000 \
--warmup_steps 100 \
--no_reload_optimizer_ctr_and_scheduler \
--lr 3e-5 \
--hard_truncate_length 1024 \
--shard_files

The source files have one text per line, while the target files contain the corresponding labels as text.
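
For example, hypothetical contents of the training files for a topic classification set (the texts and label strings below are just placeholders):

train.src.xx (one input text per line):
the central bank raised interest rates again this quarter
the home team won the cup after a dramatic final

train.trg.xx (the label for the same line number):
economy
sports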

But something weird happens during training, and I get this printed nonstop from the beginning:

Shuffling corpus: xx-xx
Finished epoch 999 for language: xx-xx
Shuffling corpus: xx-xx
Finished epoch 1000 for language: xx-xx

The epoch counter increases by about 100 per second, which doesn't make sense for a dataset of thousands of examples. Maybe I'm not reading the files in the correct way, but the same approach to reading files worked fine for a paraphrasing task before.

Thanks for your time,
Gorka

PS: I'm using the old version of the code, which doesn't include the latest updates/changes to the toolkit from this week.

Hi Gorka,

The changes I made to the toolkit are in the form of a branch called Dev which I plan to merge with main in a few days. So the main branch is exactly the same as before.

As for your situation:

  1. A BART model by itself is a sequence-to-sequence model, not a classification model for NLU.
  2. The right thing to do would be to take the model and train a classifier on top of it. This is something I wanted to do long ago, but NLU is not my main line of work, so I have postponed this task for a long time (contributions welcome). Fortunately this is not hard, as the mBART implementation has a class called MBartForSequenceClassification. You should fine-tune a model created from that class (see the sketch after this list). I think it would be a fun exercise to change my fine-tuning script to handle classification.
  3. If you don't want to code this up, then simply using my fine-tuning script should work, but the label will be produced via sequence generation rather than sequence classification. In that sense your sentence-label training data will act as a kind of parallel corpus. However, this won't give you the best results.
  4. Now, for why you are getting epoch skips: most likely all examples are being skipped due to sentence length constraints. Check the batch generation function and see where the examples may be getting skipped. (Actually, it's line 873 in common_utils.py. Your label is exactly 1 word, and examples with sequence length 1 or less are discarded. I should have made that check strictly less than 1 🤦. I'll make that change in a bit, but feel free to do it yourself too.)
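
For item 2, here is a minimal sketch of what fine-tuning a classification head on top of the pretrained model could look like with the Hugging Face classes. It assumes the checkpoint and tokenizer have already been exported to a Hugging Face compatible format (the paths and number of labels are placeholders, and a YANMTT checkpoint may need some conversion first):

import torch
from transformers import AutoTokenizer, MBartForSequenceClassification

# Placeholder paths; a YANMTT checkpoint/tokenizer may need conversion
# to the Hugging Face format before from_pretrained will accept it.
tokenizer = AutoTokenizer.from_pretrained("tokenizers/mbart-bpe50k")
model = MBartForSequenceClassification.from_pretrained(
    "models/bart_base_512_hf", num_labels=3)  # e.g. 3 topic classes

# One training step on a single example (a real script would batch and loop).
enc = tokenizer("some input text", return_tensors="pt",
                truncation=True, max_length=512)
labels = torch.tensor([1])         # gold class index
out = model(**enc, labels=labels)  # classification head over the EOS state
out.loss.backward()                # cross-entropy over num_labels classes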

Thanks for your fast response!

  1. I know; we don't plan to use BART for NLU tasks other than to evaluate our BART model on them and compare it to BERT-like models.

  2. This seems like a better approach for my use case. I have enough experience with the Hugging Face package, so I will continue down this path. I might change your script to handle classification, or I might start from the scripts we use to finetune BERT-like models on the NLU tasks and adapt them to handle an mBART model.

  3. I will try the approach you suggested in the 2nd item.

  4. Now it makes sense! I can change that myself (see the sketch below).
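
For reference, the fix from item 4 boils down to relaxing a minimum-length filter in the batch generation code. A hypothetical before/after illustration (the actual code and variable names in common_utils.py will differ):

# before: targets with 1 token or fewer are skipped,
# which discards every single-word label
if len(tgt_tokens) <= 1:
    continue
# after: only empty targets are skipped
if len(tgt_tokens) < 1:
    continue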

Thanks again for sharing your insights and tips on using this toolkit; they were really helpful. I'm closing this issue now.

Regards,
Gorka