LiyuanLucasLiu / Transformer-Clinic

Understanding the Difficulty of Training Transformers

Home Page: https://arxiv.org/abs/2004.08249

tmp_weight is not defined

sshleifer opened this issue

Hi,

In this line, the variable tmp_weight is not defined. How should it be set?

Another Q: what torch version did you use?
When I set tmp_weight=1.0 and run

GPUID=1
TOKEN_NUMBER=4096
UPDATE_FREQUENCE=1
CUDA_VISIBLE_DEVICES=$GPUID fairseq-train \
  $dbin/iwslt14.tokenized.de-en.joined_dict -s de -t en \
  --arch transformer_iwslt_de_en --share-all-embeddings \
  --user-dir radam_fairseq --optimizer radam \
  --clip-norm 0.0 --lr 7e-4 --lr-scheduler inverse_sqrt \
  --warmup-init-lr 1e-7 --warmup-updates 6000 --max-update 100000 \
  --dropout 0.3 --attention-dropout 0.1 --relu-dropout 0.1 \
  --weight-decay 0.0001 --criterion label_smoothed_cross_entropy \
  --label-smoothing 0.1 --save-dir iwslt14deen/iwslt-preln-1111 \
  --init-type adaptive-profiling --max-tokens $TOKEN_NUMBER \
  --update-freq $UPDATE_FREQUENCE --seed 1111 \
  --log-format simple --restore-file x.pt \
  --threshold-loss-scale 0.03125 \
  --log-interval 100

I get

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [32, 104, 1536]], which is output 0 of AddBackward0, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

Any advice?
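
For reference, this class of autograd error can be reproduced outside fairseq. A minimal standalone sketch (unrelated to this repo's code) that triggers the same message — pow saves its input for the backward pass, and the in-place add_ bumps that tensor's version counter:

import torch

x = torch.ones(3, requires_grad=True)
y = x + 1           # output 0 of AddBackward0, version 0
z = y.pow(2)        # pow saves its input y for the backward pass
y.add_(1)           # in-place update bumps y's version counter to 1
z.sum().backward()  # RuntimeError: ... is at version 1; expected version 0 instead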

I'm so sorry for this bug...

The tmp_weight should have been removed during a refactoring (#7); I just fixed this issue on the current master branch.
As to the torch version, I am using 1.5.0.
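
For readers landing on this issue: the method change in transformer_layer.py implements Admin's rescaled residual connection, x_i = x_{i-1} · ω_i + f_i(x_{i-1}). Below is a minimal sketch of that mechanism, reconstructed from the paper rather than copied from this repo (the class and attribute names are mine):

import torch
import torch.nn as nn

class AdminResidual(nn.Module):
    # Admin-style residual rescaling: out = residual * omega + f(residual).
    # omega is a per-dimension weight; with --init-type adaptive-profiling it
    # would be initialized from the profiled output variance of a forward pass.
    def __init__(self, dim):
        super().__init__()
        self.omega = nn.Parameter(torch.ones(dim))  # placeholder init, set by profiling

    def forward(self, residual, sublayer_out):
        return residual * self.omega + sublayer_out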

Awesome, thanks! I have IWSLT running with torch 1.6.0.
I was also wondering which files were changed from the initial fairseq besides transformer_layer.py.

If you know which commit/day you copied fairseq, that would also be helpful! April 20th, 2020 seems slightly off, but I'm not quite sure.

Glad it works!

The performance gain is not significant on IWSLT (due to the small dataset and shallow model).

This commit is the first commit that includes the fairseq folder, but that folder is the original implementation of Admin (extracted from my private repo) rather than a direct clone of the fairseq repo.

As to the changes, transformer_layer.py is the only file changed for the method itself; a few more files were changed to accommodate it. I did some checking and list most of the changed files below (this may omit something and need some debugging; a sketch after the list shows one way to double-check):

  • generate.py
  • fairseq/options.py
  • fairseq/trainer.py
  • fairseq/models/transformer.py
  • fairseq/modules/transformer_layer.py
  • fairseq/tasks/fairseq_task.py
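
One way to double-check this list is to diff the folder against a stock fairseq checkout. A small hypothetical helper, assuming both trees have been cloned side by side (the paths are placeholders):

import filecmp

# Recursively compare this repo's fairseq/ tree against an upstream checkout
# and print the files that differ.
cmp = filecmp.dircmp("Transformer-Clinic/fairseq", "fairseq-upstream/fairseq")
cmp.report_full_closure()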

Hope it helps, and Happy Thanksgiving :-)