prajdabre / yanmtt

Yet Another Neural Machine Translation Toolkit


In train_nmt.py I get RuntimeError: The expanded size of the tensor (30) must match the existing size (25) at non-singleton dimension 1.

GorkaUrbizu opened this issue · comments

Hi!

I trained a monolingual BART using the following command:

python3 pretrain_nmt.py -n 1 -nr 0 -g 2 --model_path models/bart_base \
--tokenizer_name_or_path tokenizers/mbart-bpe50k \
--langs xx --mono_src data/train.xx \
--batch_size 4096 \
--multistep_optimizer_steps 4 \
--num_batches 1800000 \
--warmup_steps 16000 \
--encoder_layers 6 \
--decoder_layers 6 \
--max_length 128 \
--encoder_attention_heads 12 \
--decoder_attention_heads 12 \
--decoder_ffn_dim 3072 \
--encoder_ffn_dim 3072 \
--d_model 768 \
--lr 1e-4 \
--hard_truncate_length 1024 \
--shard_files

and now I would like to fine-tune it on a seq2seq task (paraphrasing) with a small dataset, to see whether the model learned something during pretraining:

python3 train_nmt.py -n 1 -nr 0 -g 1 \
--model_path models/bart__base_ft \
--pretrained_model models/bart_base \
--tokenizer_name_or_path tokenizers/mbart-bpe50k \
--train_slang src \
--train_tlang trg \
--dev_slang src \
--dev_tlang trg \
--train_src data/train.src \
--train_tgt data/train.trg \
--dev_src data/test.src \
--dev_tgt data/test.trg \
--max_src 128 \
--max_tgt 128 \
--batch_size_indicates_lines \
--batch_size 32 \
--num_batches 1000 \
--encoder_layers 6 \
--decoder_layers 6 \
--encoder_attention_heads 12 \
--decoder_attention_heads 12 \
--decoder_ffn_dim 3072 \
--encoder_ffn_dim 3072 \
--d_model 768 \
--lr 3e-5 \
--hard_truncate_length 1024 \
--shard_files

and I get the following error:

...
Using label smoothing of 0.1
Using gradient clipping norm of 1.0
Using softmax temperature of 1.0
Masking ratio: 0.3
Training for: ['src-trg']
Corpora stats: {'src-trg': 568}
Shuffling corpus: src-trg
Running eval on dev set(s)
BLEU score using sacrebleu after 450000 iterations is 33.4095177159796 for language pair src-trg
New peak reached for src-trg . Saving.
Global BLEU score using sacrebleu after 450000 iterations is: 33.4095177159796
New peak reached. Saving.
Saving the model
Loading from checkpoint
Traceback (most recent call last):
  File "train_nmt.py", line 884, in <module>
    run_demo()
  File "train_nmt.py", line 881, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,train_files, dev_files, quit_condition))         #
  File "/home/gurbizu/BART/yanmtt/bartenv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/gurbizu/BART/yanmtt/bartenv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/gurbizu/BART/yanmtt/bartenv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/gurbizu/BART/yanmtt/bartenv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/gurbizu/BART/yanmtt/train_nmt.py", line 513, in model_create_load_run_save
    lprobs, labels, args.label_smoothing, ignore_index=tok.pad_token_id
  File "/home/gurbizu/BART/yanmtt/common_utils.py", line 82, in label_smoothed_nll_loss
    smooth_loss.masked_fill_(pad_mask, 0.0)
RuntimeError: The expanded size of the tensor (27) must match the existing size (22) at non-singleton dimension 1.  Target sizes: [32, 27, 1].  Tensor sizes: [32, 22, 1]

Any idea what could cause this?

Hi,

There are two issues you need to take care of:

  1. Use the flag --no_reload_optimizer_ctr_and_scheduler. This ensures that the optimizer (along with the step counter and scheduler) is reset rather than reloaded from the pretrained checkpoint.
  2. The train-time error comes from the fact that "src" and "trg" are not in your vocabulary. Firstly, when you trained your mbart-bpe50k tokenizer, what were the file name extensions? In the example I gave, my train files were "examples/data/train.vi,examples/data/train.en,examples/data/train.hi". What this does is automatically add vi, en, and hi as language tokens to the tokenizer, as <2vi>, <2en>, and <2hi>. In your case, since your data file was "data/train.xx", the token will be <2xx>. During pre-training you specify "--langs xx --mono_src data/train.xx", which is correct. But during fine-tuning, you should be passing the following flags:
    --train_slang xx
    --train_tlang xx
    --dev_slang xx
    --dev_tlang xx
    --train_src data/train.src.xx
    --train_tgt data/train.trg.xx
    --dev_src data/test.src.xx
    --dev_tgt data/test.trg.xx

Essentially, you rename the train.{src,trg} files to train.{src,trg}.xx. Renaming is not strictly needed, however, as long as you specify the right slang and tlang.
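If you want to double-check which language tokens your tokenizer actually knows, something along these lines should do it (a rough sketch, assuming the saved tokenizer directory loads with Hugging Face's AutoTokenizer; the path is the one from your commands):

from transformers import AutoTokenizer

# Load the tokenizer used for pretraining (path taken from the commands above).
tok = AutoTokenizer.from_pretrained("tokenizers/mbart-bpe50k")

# A language token added during tokenizer training should stay a single piece.
print(tok.tokenize("<2xx>"))    # e.g. ['<2xx>']

# A token that was never added gets split into several subword pieces.
print(tok.tokenize("<2trg>"))   # e.g. ['<', '2', 'tr', 'g', '>'] or similar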

Furthermore, you will also need to pass the flag --is_summarization. The reason is that the slang and tlang are both xx, and my batching code auto-masks input sentences if the source and target languages are the same; --is_summarization tells it not to do that.
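Putting it all together, your fine-tuning command would look roughly like this (a sketch that keeps your original file names and hyperparameters, changes only the language flags, and adds the two flags mentioned above):

python3 train_nmt.py -n 1 -nr 0 -g 1 \
--model_path models/bart__base_ft \
--pretrained_model models/bart_base \
--tokenizer_name_or_path tokenizers/mbart-bpe50k \
--train_slang xx \
--train_tlang xx \
--dev_slang xx \
--dev_tlang xx \
--train_src data/train.src \
--train_tgt data/train.trg \
--dev_src data/test.src \
--dev_tgt data/test.trg \
--max_src 128 \
--max_tgt 128 \
--batch_size_indicates_lines \
--batch_size 32 \
--num_batches 1000 \
--encoder_layers 6 \
--decoder_layers 6 \
--encoder_attention_heads 12 \
--decoder_attention_heads 12 \
--decoder_ffn_dim 3072 \
--encoder_ffn_dim 3072 \
--d_model 768 \
--lr 3e-5 \
--hard_truncate_length 1024 \
--is_summarization \
--no_reload_optimizer_ctr_and_scheduler \
--shard_files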

Finally, the explanation for why you get the error during training:

The decoder input sentence is "<2trg> I am a boy" and the decoder label sentence is "I am a boy </s>". Note that <2trg> is not a token in the vocabulary you trained. Therefore, after tokenization the decoder input is "< 2 tr g >_ I_ am_ a_ boy_" while the labels are "I_ am_ a_ boy_ </s>". Both should have the same number of tokens, otherwise there will be a mismatch during loss computation, and in this case there clearly is a mismatch because an incorrect language token is being used.
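For what it is worth, the shape mismatch in common_utils.py can be reproduced in isolation. This minimal sketch just recreates the failing masked_fill_ call with the shapes from your traceback; the variable names mirror label_smoothed_nll_loss but it is not the actual YANMTT code:

import torch

# Shapes taken from the traceback: the decoder input tokenized to 27 pieces,
# but the labels only to 22, so the two sides of the loss no longer line up.
batch, dec_len, lab_len, vocab = 32, 27, 22, 50000

lprobs = torch.randn(batch, dec_len, vocab).log_softmax(dim=-1)  # one distribution per decoder input token
labels = torch.randint(5, vocab, (batch, lab_len, 1))            # one id per label token

smooth_loss = -lprobs.sum(dim=-1, keepdim=True)  # shape [32, 27, 1]
pad_mask = labels.eq(1)                          # shape [32, 22, 1]
smooth_loss.masked_fill_(pad_mask, 0.0)          # RuntimeError: expanded size (27) vs existing size (22)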

Hope it helps.

Thanks! Your explanations were really helpful! I knew I was doing something wrong but couldn't guess what.