yizhongw / Tk-Instruct

Tk-Instruct is a Transformer model that is tuned to solve many NLP tasks by following instructions.

Home Page: https://arxiv.org/abs/2204.07705

[Question] parameters for performance reproduction in paper

kuri-leo opened this issue · comments

Hello Yizhong and everyone!

Thanks for your great work and contribution. While attempting to replicate the results in Fig. 5b of the paper with its settings, I found a gap, and I was wondering if you could share some experience on this.

I ran scripts/train_tk_instruct.sh four times, changing only --max_num_instances_per_task or --seed:

  • with --max_num_instances_per_task 8, it reports train/predict_rougeL 45.866; Fig. 5b reports 48.5
  • with --max_num_instances_per_task 8 and --seed 1337, it reports train/predict_rougeL 46.762
  • with --max_num_instances_per_task 64, it reports train/predict_rougeL 49.6898; Fig. 5b reports 54.7
  • with --max_num_instances_per_task 100 (the default), it reports train/predict_rougeL 49.3467

I simply copied data/splits/default/test_tasks.txt into data/splits/default/dev_tasks.txt and kept the default settings for everything else. I'm not sure whether the parameters in scripts/train_tk_instruct.sh match the defaults used in the paper, so I'm hoping you can kindly offer some suggestions.
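
For reference, that copy step amounts to the following (a minimal sketch; the paths are the repo's default split layout):

```python
# Reuse the test tasks as the dev split, as described above.
import shutil

shutil.copyfile(
    "data/splits/default/test_tasks.txt",
    "data/splits/default/dev_tasks.txt",
)
```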

Thanks in advance!

Cheers,
Leo

Hi Leo, thanks for reporting your results. May I ask how many GPUs you used? This matters because the effective batch size is per_device_train_batch_size × num_gpus × gradient_accumulation_steps. In my experiments, I used 8 A100 GPUs, which results in a batch size of 16.

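For concreteness, here is a minimal sketch of that arithmetic (the decomposition of 16 into per-device batch 2 × 8 GPUs × 1 accumulation step is an assumption for illustration; check the actual flags in scripts/train_tk_instruct.sh):

```python
# Effective batch size as described above:
# per_device_train_batch_size * num_gpus * gradient_accumulation_steps.
def effective_batch_size(per_device_train_batch_size: int,
                         num_gpus: int,
                         gradient_accumulation_steps: int) -> int:
    return per_device_train_batch_size * num_gpus * gradient_accumulation_steps

# Assumed decomposition of the paper's batch size of 16 on 8 GPUs:
print(effective_batch_size(2, 8, 1))  # -> 16
# The same flags on a single GPU shrink the effective batch size by 8x:
print(effective_batch_size(2, 1, 1))  # -> 2
```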

Hey Yizhong,

Thanks for your rapid reply :-)

In my test, I only used one A100 for debugging, so I will try 8 GPUs and share the results later.

Thank you again and have a nice day!

Cheers,
Leo

Hi Yizhong,

Thanks for your hints. I have now run two tests and obtained 48.5 (8 instances per task) and 55.1235 (64 instances per task), which match or exceed the 48.5 and 54.7 reported in the paper.

That's amazing!

Leo

Hi everyone,

It's strange that I can only get 48 on 8 × 3090 GPUs without changing any parameters. Does anyone know a possible reason?

@Yufang-Liu

Hi Yufang,

Many factors can cause variations in the scores; in my previous tests the cause was the effective batch size. Beyond that, I suggest checking whether any optimizations such as half-precision (fp16) or ZeRO have been enabled automatically via the DeepSpeed or Accelerate packages.
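
As a quick check, something like the following can surface silently enabled options (a minimal sketch; "deepspeed_config.json" is a placeholder for whatever file your launch command passes via --deepspeed):

```python
# Inspect a DeepSpeed config for half-precision and ZeRO settings.
import json

with open("deepspeed_config.json") as f:  # placeholder path
    cfg = json.load(f)

fp16 = cfg.get("fp16", {}).get("enabled", False)
bf16 = cfg.get("bf16", {}).get("enabled", False)
zero_stage = cfg.get("zero_optimization", {}).get("stage", 0)

print(f"fp16 enabled: {fp16}")
print(f"bf16 enabled: {bf16}")
print(f"ZeRO stage:   {zero_stage}")
```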

Hope this may help.

Leo

Hi Leo, thanks a lot for your helpful suggestions!

I found that the reason was the versions of the installed packages. With matching package versions, I got the same results on 8 × 3090 GPUs. I'm still not sure which package affects the performance.
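
For anyone hitting the same issue, here is a minimal sketch for recording the versions of the packages most likely to matter (the package list is a guess, not a confirmed culprit):

```python
# Print installed versions of packages that commonly affect training/eval results,
# so runs on different machines can be compared.
from importlib.metadata import version, PackageNotFoundError

for pkg in ["torch", "transformers", "datasets", "deepspeed", "accelerate", "rouge-score"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```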