jongwooko / distillm

Official PyTorch implementation of DistiLLM: Towards Streamlined Distillation for Large Language Models (ICML 2024)

Home Page: https://arxiv.org/abs/2402.03898

Inconsistent with your reported scores of teacher models

liuxy1103 opened this issue · comments

I'm using the SFT part of your code to reproduce the results of OpenLLaMA2-7B and GPT-2-XL, and the dolly-eval ROUGE-L scores are 27.51 and 27.04, respectively. My dataset was obtained from MiniLLM, the hyperparameters are the defaults from your config files, and I trained on 4 A100 (40G) GPUs.

Hi @liuxy1103, it seems that I have to check the performance of my teacher models and review (re-run) the code on my own. As I cannot use A100 GPUs right now, it will take some time.

Meanwhile, did you run distillm based on your current teacher? If possible, could you run the distillm code and report your results to me?

I'm not sure if there's any randomness in this code. I've also run it on V100 GPUs and got different results. Also, I'd like to know what causes the difference between your results and MiniLLM's.
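For context, by "randomness" I mean the usual RNG sources. Here is a minimal, generic PyTorch sketch of pinning them all (my own snippet, not code from this repo) that I use to check whether the variance comes from seeding or from hardware differences:

```python
import random

import numpy as np
import torch


def set_all_seeds(seed: int = 42) -> None:
    """Pin every RNG source so repeated runs on the same machine match."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Deterministic kernels trade throughput for bit-exact repeatability.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


set_all_seeds(42)
```

Even with fixed seeds, A100 and V100 may not match exactly, since they select different kernels (and, depending on the PyTorch version, TF32 matmuls may be enabled by default on A100), so I would only expect bit-exact repeats on the same GPU type.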

For GPT-2 (base) experiments, I got (23.6733, 24.1762, 23.8693, 23.4965, 23.9844) for MiniLLM. Since my code is based on the MiniLLM codebase as of Aug. 2023, it may be a different version from the one you are working with. I did not change any scripts or code for the MiniLLM and SFT parts, as they do not require any modification. Checking the differences between the codebases will take some time.

Can I assume that the SFT code is from MiniLLM, and that the results of your re-execution differ from what the MiniLLM paper reports?

Yes, as I reported in the paper, all results are from my re-implementation (in your words, re-execution). By the way, could you share your code if you manage to reproduce MiniLLM?

I found that MiniLLM's results could not be reproduced, even after loading the checkpoint they released. Have you tried the checkpoint they provided? If possible, I would like to get your SFT checkpoint for testing.

OpenLLaMA2-7B (SFT):

| run | dolly | Self-Instruct | Vicuna | SN | UN |
|---|---|---|---|---|---|
| 1 | 26.88 | 17.9443 | 17.694 | 32.748 | 31.3981 |
| 2 | 27.3815 | 16.8773 | 17.1679 | 33.1547 | 31.4187 |
| 3 | 28.2254 | 16.8735 | 18.1522 | 33.4686 | 31.5469 |
| 4 | 27.4134 | 17.2188 | 16.9904 | 33.2153 | 31.4363 |
| 5 | 27.65 | 17.8472 | 17.7174 | 32.8547 | 31.4921 |
| mean | 27.51006 | 17.35222 | 17.54438 | 33.08826 | 31.45842 |

GPT-2-XL (SFT):

| run | dolly | Self-Instruct | Vicuna | SN | UN |
|---|---|---|---|---|---|
| 1 | 26.7611 | 14.5515 | 16.4227 | 24.5059 | 29.5738 |
| 2 | 26.3861 | 15.0279 | 17.0802 | 24.8715 | 29.6167 |
| 3 | 26.8233 | 14.6385 | 16.9929 | 25.2251 | 29.6849 |
| 4 | 28.1064 | 14.4444 | 16.9931 | 24.7591 | 29.734 |
| 5 | 27.1202 | 15.0413 | 16.4705 | 24.5255 | 29.67 |
| mean | 27.03942 | 14.74072 | 16.79188 | 24.77742 | 29.65588 |

The two tables above are the results of running the SFT code for my OpenLLaMA2-7B and GPT-2-XL models.

One more question: is your reported experiment a re-evaluation of the checkpoint selected on the validation set? My validation-set results are as follows. I hope you can help me reproduce these results, because this has a huge impact on the progress of my research. Thank you!

openllama2-7B:

| step | exact match | rougeL |
|---|---|---|
| 12318 | 6.4 | 31.4662 |

gpt2-xl:

| step | exact match | rougeL |
|---|---|---|
| 14288 | 4.8 | 29.6821 |
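For reference, this is roughly how I cross-check the ROUGE-L numbers above (a generic sketch using the rouge_score package; the repo ships its own metric code, so small numerical differences against its evaluation are possible):

```python
from rouge_score import rouge_scorer


def mean_rouge_l(predictions, references):
    """Average ROUGE-L F1 (x100) over (prediction, reference) pairs."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = [
        scorer.score(ref, pred)["rougeL"].fmeasure
        for pred, ref in zip(predictions, references)
    ]
    return 100.0 * sum(scores) / len(scores)


# Toy example; in practice predictions are the model generations on dolly-eval
# and references are the ground-truth responses.
print(mean_rouge_l(["the quick brown fox"], ["a quick brown fox jumps"]))
```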

"For models within 1B parameters, we search for the learning rates in {5e-4, 1e-4, 5e-5}, the batch sizes in {8, 16, 32}within the possible maximum batch size for A100 40GB GPUs, and train these models for 20 epochs. For models that have more than 1B parameters, we search for the learning rate in {5e-5, 1e-5, 5e-6}, the batch sizes of 8, and train these models for 10 epochs."

Regarding the various configuration files in the project: do the default configurations already reflect this search, or do I need to run it again? I strictly followed the defaults.

> | dolly | Self-Instruct | Vicuna | SN | UN |
> |---|---|---|---|---|
> | 26.7611 | 14.5515 | 16.4227 | 24.5059 | 29.5738 |
> | 26.3861 | 15.0279 | 17.0802 | 24.8715 | 29.6167 |
> | 26.8233 | 14.6385 | 16.9929 | 25.2251 | 29.6849 |
> | 28.1064 | 14.4444 | 16.9931 | 24.7591 | 29.734 |
> | 27.1202 | 15.0413 | 16.4705 | 24.5255 | 29.67 |
> | 27.03942 | 14.74072 | 16.79188 | 24.77742 | 29.65588 |

Same problem on UN. I've been trying to reproduce the GPT-2-XL results in distillm and MiniLLM for a long time. Using the checkpoint provided by MiniLLM, I got the following results:

| dolly | Self-Instruct | Vicuna | SN | UN |
|---|---|---|---|---|
| 26.7886 | 15.2692 | 16.2763 | 27.966 | 31.7629 |
| 26.6182 | 14.986 | 15.941 | 26.9682 | 31.6883 |
| 26.7466 | 14.6881 | 16.4003 | 26.9641 | - |

Although these are slightly better than yours, there always seem to be gaps between the reported results and mine on UN.
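For completeness, this is roughly how I load the released checkpoint and generate before scoring (a generic sketch assuming a Hugging Face format GPT-2 checkpoint and a made-up prompt; the repo's eval scripts handle the actual prompt template, batching, and sampling, all of which can shift ROUGE-L):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "path/to/minillm-gpt2-xl-sft"  # local path to the released checkpoint under test
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16)
model = model.cuda().eval()

# Hypothetical instruction-style prompt; the real template comes from the eval data.
prompt = "Instruction: List three uses of a paper clip.\nResponse:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

In particular, whether decoding is greedy or sampled (and with which temperature and seed) can move the scores, which might explain part of the gap.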

I have assessed the current situation. It is difficult to distribute the parameters because it is hard to access the server where I ran the experiments, and it will take time to identify which parts of the code or scripts are preventing the results from being reproduced, especially for the teacher.

I will close the issue for now. Instead, I will upload the student model parameters for the lightweight OpenLLaMA-3B LoRA student trained with distillm. I plan to clean up the code so that the teacher model is reproducible as soon as possible, so I would appreciate your patience.