yizhongw / Tk-Instruct

Tk-Instruct is a Transformer model that is tuned to solve many NLP tasks by following instructions.

Home Page: https://arxiv.org/abs/2204.07705

[Question] parameters for performance reproduction in paper

kuri-leo opened this issue · comments

Hello Yizhong and everyone!

Thanks for your great work and contribution. While attempting to replicate the results in Fig. 5b of the paper with its settings, I found a gap, and I was wondering if you could share some experience on this.

I ran scripts/train_tk_instruct.sh four times, changing only --max_num_instances_per_task or --seed:

  • with --max_num_instances_per_task 8, it reports train/predict_rougeL 45.866; Fig. 5b reports 48.5
  • with --max_num_instances_per_task 8 and --seed 1337, it reports train/predict_rougeL 46.762
  • with --max_num_instances_per_task 64, it reports train/predict_rougeL 49.6898; Fig. 5b reports 54.7
  • with --max_num_instances_per_task 100 (the default), it reports train/predict_rougeL 49.3467

I simply copied data/splits/default/test_tasks.txt into data/splits/default/dev_tasks.txt and kept the default settings for everything else. I'm not sure whether the parameters in scripts/train_tk_instruct.sh match the defaults used in the paper, so I'm hoping you can kindly offer some suggestions.
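
For reference, that copy step amounts to the following (a minimal sketch; the paths are the repo's default split layout):

```python
# Reuse the test tasks as the dev split, as described above.
import shutil

shutil.copyfile(
    "data/splits/default/test_tasks.txt",
    "data/splits/default/dev_tasks.txt",
)
```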

Thanks in advance!

Cheers,
Leo

Hi Leo, thanks for reporting your results. May I ask how many GPUs you used? This matters because the effective batch size is per_device_train_batch_size × num_gpus × gradient_accumulation_steps. In my experiments, I used 8 A100 GPUs, which results in a batch size of 16.

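For concreteness, here is a minimal sketch of that arithmetic (the decomposition of 16 into per-device batch 2 × 8 GPUs × 1 accumulation step is an assumption for illustration; check the actual flags in scripts/train_tk_instruct.sh):

```python
# Effective batch size as described above:
# per_device_train_batch_size * num_gpus * gradient_accumulation_steps.
def effective_batch_size(per_device_train_batch_size: int,
                         num_gpus: int,
                         gradient_accumulation_steps: int) -> int:
    return per_device_train_batch_size * num_gpus * gradient_accumulation_steps

# Assumed decomposition of the paper's batch size of 16 on 8 GPUs:
print(effective_batch_size(2, 8, 1))  # -> 16
# The same flags on a single GPU shrink the effective batch size by 8x:
print(effective_batch_size(2, 1, 1))  # -> 2
```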

Hey Yizhong,

Thanks for your rapid reply :-)

In my test, I only used one A100 for debugging, so I will try 8 GPUs and share the results later.

Thank you again and have a nice day!

Cheers,
Leo

Hi Yizhong,

Thanks for your hints. I have now run two tests and obtained 48.5 (8 instances per task) and 55.1235 (64 instances per task), which match or exceed the 48.5 and 54.7 reported in the paper.

That's amazing!

Leo

Hi everyone,

It's strange that I can only get 48 on 8 × 3090 GPUs without changing any parameters. Does anyone know a possible reason?

@Yufang-Liu

Hi Yufang,

Many factors can cause variations in the scores; in my previous tests the cause was the effective batch size. Beyond that, I suggest checking whether any optimizations such as half-precision (fp16) or ZeRO have been enabled automatically via the DeepSpeed or Accelerate packages.
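
As a quick check, something like the following can surface silently enabled options (a minimal sketch; "deepspeed_config.json" is a placeholder for whatever file your launch command passes via --deepspeed):

```python
# Inspect a DeepSpeed config for half-precision and ZeRO settings.
import json

with open("deepspeed_config.json") as f:  # placeholder path
    cfg = json.load(f)

fp16 = cfg.get("fp16", {}).get("enabled", False)
bf16 = cfg.get("bf16", {}).get("enabled", False)
zero_stage = cfg.get("zero_optimization", {}).get("stage", 0)

print(f"fp16 enabled: {fp16}")
print(f"bf16 enabled: {bf16}")
print(f"ZeRO stage:   {zero_stage}")
```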

Hope this may help.

Leo

Hi Leo, thanks a lot for your helpful suggestions!

I found that the reason was the versions of the installed packages. With matching package versions, I got the same results on 8 × 3090 GPUs. I'm still not sure which package affects the performance.
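
For anyone hitting the same issue, here is a minimal sketch for recording the versions of the packages most likely to matter (the package list is a guess, not a confirmed culprit):

```python
# Print installed versions of packages that commonly affect training/eval results,
# so runs on different machines can be compared.
from importlib.metadata import version, PackageNotFoundError

for pkg in ["torch", "transformers", "datasets", "deepspeed", "accelerate", "rouge-score"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```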