jongwooko / distillm

Official PyTorch implementation of DistiLLM: Towards Streamlined Distillation for Large Language Models (ICML 2024)

Home Page: https://arxiv.org/abs/2402.03898

Inconsistent with your reported scores of teacher models

liuxy1103 opened this issue · comments

I'm using the SFT part of your code to reproduce the results of OpenLLaMA2-7B and GPT-2-XL, and the dolly-eval ROUGE-L scores are 27.51 and 27.04, respectively. My dataset was obtained from MiniLLM, the hyperparameters are the defaults from your config files, and I trained on 4 A100 (40G) GPUs.

Hi @liuxy1103, it seems that I have to check the performance of my teacher models and review (re-run) the code on my own. As I cannot use A100 GPUs right now, it will take some time.

Meanwhile, did you run distillm based on your current teacher? If possible, could you run the distillm code and report your results to me?

I'm not sure if there's any randomness in this code. I've also run it on V100 GPUs and got different results. Also, I'd like to know what causes the difference between your results and MiniLLM's.
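For context, by "randomness" I mean the usual RNG sources. Here is a minimal, generic PyTorch sketch of pinning them all (my own snippet, not code from this repo) that I use to check whether the variance comes from seeding or from hardware differences:

```python
import random

import numpy as np
import torch


def set_all_seeds(seed: int = 42) -> None:
    """Pin every RNG source so repeated runs on the same machine match."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Deterministic kernels trade throughput for bit-exact repeatability.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


set_all_seeds(42)
```

Even with fixed seeds, A100 and V100 may not match exactly, since they select different kernels (and, depending on the PyTorch version, TF32 matmuls may be enabled by default on A100), so I would only expect bit-exact repeats on the same GPU type.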

For GPT-2 (base) experiments, I got (23.6733, 24.1762, 23.8693, 23.4965, 23.9844) for MiniLLM. Since my code is based on the MiniLLM codebase as of Aug. 2023, it may be a different version from the one you are working with. I did not change any scripts or code for the MiniLLM and SFT parts, as they do not require any modification. Checking the differences between the codebases will take some time.

Can I assume that the SFT code is from MiniLLM, and that the results of your re-execution differ from what the MiniLLM paper reports?

Yes, as I reported in the paper, all results are from my re-implementation (in your words, re-execution). By the way, could you share your code if you manage to reproduce MiniLLM?

I found that MiniLLM's results could not be reproduced, even after loading the checkpoint they released. Have you tried the checkpoint they provided? If possible, I would like to get your SFT checkpoint for testing.

OpenLLaMA2-7B (SFT):

| run | dolly | Self-Instruct | Vicuna | SN | UN |
|---|---|---|---|---|---|
| 1 | 26.88 | 17.9443 | 17.694 | 32.748 | 31.3981 |
| 2 | 27.3815 | 16.8773 | 17.1679 | 33.1547 | 31.4187 |
| 3 | 28.2254 | 16.8735 | 18.1522 | 33.4686 | 31.5469 |
| 4 | 27.4134 | 17.2188 | 16.9904 | 33.2153 | 31.4363 |
| 5 | 27.65 | 17.8472 | 17.7174 | 32.8547 | 31.4921 |
| mean | 27.51006 | 17.35222 | 17.54438 | 33.08826 | 31.45842 |

GPT-2-XL (SFT):

| run | dolly | Self-Instruct | Vicuna | SN | UN |
|---|---|---|---|---|---|
| 1 | 26.7611 | 14.5515 | 16.4227 | 24.5059 | 29.5738 |
| 2 | 26.3861 | 15.0279 | 17.0802 | 24.8715 | 29.6167 |
| 3 | 26.8233 | 14.6385 | 16.9929 | 25.2251 | 29.6849 |
| 4 | 28.1064 | 14.4444 | 16.9931 | 24.7591 | 29.734 |
| 5 | 27.1202 | 15.0413 | 16.4705 | 24.5255 | 29.67 |
| mean | 27.03942 | 14.74072 | 16.79188 | 24.77742 | 29.65588 |

The two tables above are the results of running the SFT code for my OpenLLaMA2-7B and GPT-2-XL models.

One more question: is your reported experiment a re-evaluation of the checkpoint selected on the validation set? My validation-set results are as follows. I hope you can help me reproduce these results, because this has a huge impact on the progress of my research. Thank you!

openllama2-7B:

| step | exact match | rougeL |
|---|---|---|
| 12318 | 6.4 | 31.4662 |

gpt2-xl:

| step | exact match | rougeL |
|---|---|---|
| 14288 | 4.8 | 29.6821 |
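For reference, this is roughly how I cross-check the ROUGE-L numbers above (a generic sketch using the rouge_score package; the repo ships its own metric code, so small numerical differences against its evaluation are possible):

```python
from rouge_score import rouge_scorer


def mean_rouge_l(predictions, references):
    """Average ROUGE-L F1 (x100) over (prediction, reference) pairs."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = [
        scorer.score(ref, pred)["rougeL"].fmeasure
        for pred, ref in zip(predictions, references)
    ]
    return 100.0 * sum(scores) / len(scores)


# Toy example; in practice predictions are the model generations on dolly-eval
# and references are the ground-truth responses.
print(mean_rouge_l(["the quick brown fox"], ["a quick brown fox jumps"]))
```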

"For models within 1B parameters, we search for the learning rates in {5e-4, 1e-4, 5e-5}, the batch sizes in {8, 16, 32}within the possible maximum batch size for A100 40GB GPUs, and train these models for 20 epochs. For models that have more than 1B parameters, we search for the learning rate in {5e-5, 1e-5, 5e-6}, the batch sizes of 8, and train these models for 10 epochs."

Regarding the various configuration files in the project: do the default configurations already reflect this search, or do I need to run it again? I strictly followed the defaults.

> | dolly | Self-Instruct | Vicuna | SN | UN |
> |---|---|---|---|---|
> | 26.7611 | 14.5515 | 16.4227 | 24.5059 | 29.5738 |
> | 26.3861 | 15.0279 | 17.0802 | 24.8715 | 29.6167 |
> | 26.8233 | 14.6385 | 16.9929 | 25.2251 | 29.6849 |
> | 28.1064 | 14.4444 | 16.9931 | 24.7591 | 29.734 |
> | 27.1202 | 15.0413 | 16.4705 | 24.5255 | 29.67 |
> | 27.03942 | 14.74072 | 16.79188 | 24.77742 | 29.65588 |

Same problem on UN. I've been trying to reproduce the GPT-2-XL results in distillm and MiniLLM for a long time. Using the checkpoint provided by MiniLLM, I got the following results:

| dolly | Self-Instruct | Vicuna | SN | UN |
|---|---|---|---|---|
| 26.7886 | 15.2692 | 16.2763 | 27.966 | 31.7629 |
| 26.6182 | 14.986 | 15.941 | 26.9682 | 31.6883 |
| 26.7466 | 14.6881 | 16.4003 | 26.9641 | - |

Although these are slightly better than yours, there always seem to be gaps between the reported results and mine on UN.
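For completeness, this is roughly how I load the released checkpoint and generate before scoring (a generic sketch assuming a Hugging Face format GPT-2 checkpoint and a made-up prompt; the repo's eval scripts handle the actual prompt template, batching, and sampling, all of which can shift ROUGE-L):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "path/to/minillm-gpt2-xl-sft"  # local path to the released checkpoint under test
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16)
model = model.cuda().eval()

# Hypothetical instruction-style prompt; the real template comes from the eval data.
prompt = "Instruction: List three uses of a paper clip.\nResponse:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

In particular, whether decoding is greedy or sampled (and with which temperature and seed) can move the scores, which might explain part of the gap.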

I have assessed the current situation. It is difficult to distribute the parameters because it is hard to access the server where I ran the experiments, and it will take time to identify which parts of the code or scripts are preventing the results from being reproduced, especially for the teacher.

I will close the issue for now. Instead, I will upload the student model parameters for the lightweight OpenLLaMA-3B LoRA student trained with distillm. I plan to clean up the code so that the teacher model is reproducible as soon as possible, so I would appreciate your patience.