In train_nmt.py I get RuntimeError: The expanded size of the tensor (30) must match the existing size (25) at non-singleton dimension 1.
GorkaUrbizu opened this issue
Hi!
I trained a monolingual BART using the following command:
python3 pretrain_nmt.py -n 1 -nr 0 -g 2 --model_path models/bart_base \
--tokenizer_name_or_path tokenizers/mbart-bpe50k \
--langs xx --mono_src data/train.xx \
--batch_size 4096 \
--multistep_optimizer_steps 4 \
--num_batches 1800000 \
--warmup_steps 16000 \
--encoder_layers 6 \
--decoder_layers 6 \
--max_length 128 \
--encoder_attention_heads 12 \
--decoder_attention_heads 12 \
--decoder_ffn_dim 3072 \
--encoder_ffn_dim 3072 \
--d_model 768 \
--lr 1e-4 \
--hard_truncate_length 1024 \
--shard_files
and now I would like to fine-tune it on a seq2seq task (paraphrasing) with a small dataset, to see whether the model learned anything during pretraining:
python3 train_nmt.py -n 1 -nr 0 -g 1 \
--model_path models/bart_base_ft \
--pretrained_model models/bart_base \
--tokenizer_name_or_path tokenizers/mbart-bpe50k \
--train_slang src \
--train_tlang trg \
--dev_slang src \
--dev_tlang trg \
--train_src data/train.src \
--train_tgt data/train.trg \
--dev_src data/test.src \
--dev_tgt data/test.trg \
--max_src 128 \
--max_tgt 128 \
--batch_size_indicates_lines \
--batch_size 32 \
--num_batches 1000 \
--encoder_layers 6 \
--decoder_layers 6 \
--encoder_attention_heads 12 \
--decoder_attention_heads 12 \
--decoder_ffn_dim 3072 \
--encoder_ffn_dim 3072 \
--d_model 768 \
--lr 3e-5 \
--hard_truncate_length 1024 \
--shard_files
and I get the following error:
...
Using label smoothing of 0.1
Using gradient clipping norm of 1.0
Using softmax temperature of 1.0
Masking ratio: 0.3
Training for: ['src-trg']
Corpora stats: {'src-trg': 568}
Shuffling corpus: src-trg
Running eval on dev set(s)
BLEU score using sacrebleu after 450000 iterations is 33.4095177159796 for language pair src-trg
New peak reached for src-trg . Saving.
Global BLEU score using sacrebleu after 450000 iterations is: 33.4095177159796
New peak reached. Saving.
Saving the model
Loading from checkpoint
Traceback (most recent call last):
  File "train_nmt.py", line 884, in <module>
    run_demo()
  File "train_nmt.py", line 881, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args, train_files, dev_files, quit_condition)) #
  File "/home/gurbizu/BART/yanmtt/bartenv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/gurbizu/BART/yanmtt/bartenv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/gurbizu/BART/yanmtt/bartenv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/gurbizu/BART/yanmtt/bartenv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/gurbizu/BART/yanmtt/train_nmt.py", line 513, in model_create_load_run_save
    lprobs, labels, args.label_smoothing, ignore_index=tok.pad_token_id
  File "/home/gurbizu/BART/yanmtt/common_utils.py", line 82, in label_smoothed_nll_loss
    smooth_loss.masked_fill_(pad_mask, 0.0)
RuntimeError: The expanded size of the tensor (27) must match the existing size (22) at non-singleton dimension 1. Target sizes: [32, 27, 1]. Tensor sizes: [32, 22, 1]
Any idea what could cause this?
Hi,
There are two issues you need to take care of:
- Use the flag --no_reload_optimizer_ctr_and_scheduler. This ensures that the optimizer, the iteration counter, and the learning-rate scheduler are reset instead of being restored from the pretraining checkpoint (which is why your log reports 450000 iterations for a fresh fine-tuning run).
- The training-time error comes from the fact that "src" and "trg" are not in your vocabulary. First, when you trained your mbart-bpe50k tokenizer, what were the file name extensions? In the example I gave, my train files were "examples/data/train.vi,examples/data/train.en,examples/data/train.hi". This automatically adds vi, en, and hi as language tokens to the tokenizer, as <2vi>, <2en>, and <2hi>. In your case, if your data file was "data/train.xx", then the token will be <2xx>. During pre-training you specify "--langs xx --mono_src data/train.xx", which is correct. But during fine-tuning, you should be passing the following flags:
--train_slang xx
--train_tlang xx
--dev_slang xx
--dev_tlang xx
--train_src data/train.src.xx
--train_tgt data/train.trg.xx
--dev_src data/test.src.xx
--dev_tgt data/test.trg.xx
Essentially, this amounts to renaming the train.{src,trg} files to train.{src,trg}.xx. However, renaming is not actually needed: the file names do not matter as long as you specify the right slang and tlang.
Furthermore, you will need to pass the flag --is_summarization. The reason is that your slang and tlang are both xx, and my batching code auto-masks the input sentences whenever the source and target languages are the same, which is not what you want for a paraphrasing task. A corrected command putting all of this together is sketched below.
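Putting both fixes together, your fine-tuning command would look something like this (file names kept as-is, since renaming is optional; all other hyperparameters unchanged from your command):
python3 train_nmt.py -n 1 -nr 0 -g 1 \
--model_path models/bart_base_ft \
--pretrained_model models/bart_base \
--tokenizer_name_or_path tokenizers/mbart-bpe50k \
--train_slang xx \
--train_tlang xx \
--dev_slang xx \
--dev_tlang xx \
--train_src data/train.src \
--train_tgt data/train.trg \
--dev_src data/test.src \
--dev_tgt data/test.trg \
--is_summarization \
--no_reload_optimizer_ctr_and_scheduler \
--max_src 128 \
--max_tgt 128 \
--batch_size_indicates_lines \
--batch_size 32 \
--num_batches 1000 \
--encoder_layers 6 \
--decoder_layers 6 \
--encoder_attention_heads 12 \
--decoder_attention_heads 12 \
--decoder_ffn_dim 3072 \
--encoder_ffn_dim 3072 \
--d_model 768 \
--lr 3e-5 \
--hard_truncate_length 1024 \
--shard_files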
Finally, here is why you get the error during training:
The decoder input sentence is "<2trg> I am a boy" and the decoder label sentence is "I am a boy </s>". Note that <2trg> is not a token in the vocabulary you trained. Therefore, after tokenization the decoder input is "< 2 tr g >_ I_ am_ a_ boy_" while the labels are "I_ am_ a_ boy_ </s>". Both should have the same number of tokens, otherwise there is a shape mismatch during loss computation; in your case there clearly is one, because the incorrect language token gets split into several subword pieces.
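This is exactly where your RuntimeError comes from: in label_smoothed_nll_loss, the pad mask is built from the labels while the per-token smoothed loss follows the decoder logits, so the two can no longer be broadcast against each other. A minimal PyTorch sketch reproducing the failure with the shapes from your traceback (illustrative only, not the actual yanmtt code):
import torch

# The per-token loss follows the decoder logits: 27 time steps, because the
# unknown "<2trg>" token was split into several subword pieces.
smooth_loss = torch.zeros(32, 27, 1)

# The pad mask is built from the labels, which cover only 22 time steps.
pad_mask = torch.zeros(32, 22, 1, dtype=torch.bool)

# masked_fill_ tries to broadcast the mask to the loss tensor's shape and fails:
# RuntimeError: The expanded size of the tensor (27) must match the existing
# size (22) at non-singleton dimension 1. Target sizes: [32, 27, 1]. Tensor sizes: [32, 22, 1]
smooth_loss.masked_fill_(pad_mask, 0.0)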
Hope this helps.
Thanks! Your explanations were really helpful! I knew I was doing something wrong but couldn't guess what.