Chia-Hsuan-Lee / DST-as-Prompting

Source code for Dialogue State Tracking with a Language Model using Schema-Driven Prompting


On reproducing the experimental results in the paper

nxpeng9235 opened this issue

Hi,

Congratulations on the acceptance to EMNLP 2021; it is a concise and solid piece of work! I am currently following your research and trying to reproduce the experimental results in the original paper using your code. However, I have had some trouble matching the reported JGA scores.

My experiments were all on MultiWOZ v2.2, with domain and slot descriptions. Here are my hyperparameter settings and corresponding results.

  • T5-small, lr=5e-5, n_epoch=3, batchsize=8, JGA=55.3
  • T5-base, lr=5e-5, n_epoch=3, batchsize=8, JGA=56.0
  • T5-base, lr=5e-4, n_epoch=2, batchsize=16 (bs=8 with grad_accumulation=2), JGA=56.1
  • T5-base, lr=5e-4, n_epoch=2, batchsize=64 (bs=8 with grad_accumulation=8), JGA=56.2 [same setting as the paper]
The experiments were run on a single A100 40GB with Python 3.9.12, PyTorch 1.12.1, and CUDA 11.6; all other hyperparameters were left at their defaults (a sketch of the last configuration is given below). There is still a gap between my results and the JGA reported in the paper, which is 57.6.
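
For reference, here is a minimal sketch of how the last configuration maps onto standard HuggingFace `transformers` training arguments; the output path is hypothetical and the repo's actual training script may differ:

```python
# Minimal sketch (not the repo's actual script): T5-base, lr=5e-4, 2 epochs,
# effective batch size 64 via gradient accumulation on a single GPU.
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="t5_base_mwoz22",      # hypothetical output path
    learning_rate=5e-4,
    num_train_epochs=2,
    per_device_train_batch_size=8,    # bs=8 fits on one A100 40GB
    gradient_accumulation_steps=8,    # 8 * 8 = 64 effective batch size
)
# These arguments would then be passed to a Seq2SeqTrainer along with the
# schema-prompted MultiWOZ 2.2 training data.
```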

I am wondering whether there are any other tricks for achieving better results. If so, would you be willing to share them? Much appreciated!
Looking forward to your reply :-D

Best

Hi, thanks for your interest! My best guess is that this is an optimization difference between training on multiple machines and accumulating gradients within a single machine.
For T5-base we used multiple GPUs, and I honestly can't remember the exact configs we used.
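
For concreteness, the two regimes contrasted above reach the same effective batch size but are not numerically identical runs; the launch commands below are illustrative assumptions (`train.py` and its flags are hypothetical, not the repo's actual interface):

```python
# Both setups target an effective batch size of 64, yet differ in data sharding,
# dropout randomness, and per-step loss averaging, which could plausibly account
# for a gap of roughly one JGA point.
#
# (a) multi-GPU data parallelism (hypothetical launch, e.g. 8 GPUs):
#   torchrun --nproc_per_node=8 train.py --per_device_train_batch_size 8
# (b) single GPU with gradient accumulation:
#   python train.py --per_device_train_batch_size 8 --gradient_accumulation_steps 8

def effective_batch_size(num_gpus: int, per_device: int, accum_steps: int) -> int:
    """Examples consumed per optimizer step under data-parallel training."""
    return num_gpus * per_device * accum_steps

assert effective_batch_size(8, 8, 1) == effective_batch_size(1, 8, 8) == 64
```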