WangRongsheng / Aurora

🐳 Aurora is a [Chinese Version] MoE model: further work based on Mixtral-8x7B that activates the model's chat capability in the Chinese open domain.

Home Page: https://arxiv.org/abs/2312.14557

what optimization strategy is used?

g-h-chen opened this issue

Hi Rongsheng,
Thanks for your work! I'm wondering what optimization strategy is used (ZeRO-1/2/3)?

Also, can you reveal how many GPU hours you used in your training?

> Hi Rongsheng, Thanks for your work! I'm wondering what optimization strategy is used (ZeRO-1/2/3)?

@g-h-chen Hi, this should be helpful for you: https://github.com/WangRongsheng/Aurora?tab=readme-ov-file#train
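For context, this kind of single-GPU fine-tuning of a large MoE model is usually done with parameter-efficient LoRA adapters on a frozen base. The sketch below is a generic illustration using transformers + peft, not the repository's actual training script (see the linked README section for that); the model id, LoRA target modules, and hyperparameters are assumptions.

```python
# Illustrative only: a generic LoRA fine-tuning setup, not Aurora's training code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # assumed base model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# LoRA keeps the base weights frozen and trains small adapter matrices,
# which is what makes fine-tuning an 8x7B MoE on one GPU feasible.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumed target modules
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```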

> Also, can you reveal how many GPU hours you used in your training?

We used a single NVIDIA H100. The training time information is here: https://huggingface.co/wangrongsheng/Aurora/blob/main/train_results.json
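If you want to convert that file into GPU hours, a small snippet like the one below can do it; it assumes the standard HuggingFace Trainer output fields (e.g. `train_runtime` in seconds), so adjust the key if the file differs.

```python
# Read the reported training time from a downloaded copy of train_results.json.
import json

with open("train_results.json") as f:
    results = json.load(f)

runtime_s = results.get("train_runtime")  # assumed HF Trainer field, in seconds
if runtime_s is not None:
    print(f"Training took {runtime_s / 3600:.1f} GPU hours on one H100")
```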

Thanks for your reply. I read the source code but found no sign of any ZeRO optimization being used. Is this the case? Did I miss anything?

We support DeepSpeed ZeRO, but we don't use it. We will update the README; you can check it later.
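For anyone who does want to turn ZeRO on, the sketch below shows roughly how a ZeRO-2 configuration could be passed to the HuggingFace Trainer. This is not part of Aurora's released setup, and the batch sizes and offload settings are assumptions.

```python
# Not used by Aurora (per the reply above); shown only as a sketch of enabling
# DeepSpeed ZeRO-2 through the HuggingFace Trainer. "auto" values are filled in
# from TrainingArguments at runtime. Requires the deepspeed package.
from transformers import TrainingArguments

ds_config = {
    "zero_optimization": {
        "stage": 2,                            # ZeRO-2: shard optimizer state + gradients
        "offload_optimizer": {"device": "cpu"},
        "overlap_comm": True,
    },
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="outputs",            # placeholder path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    deepspeed=ds_config,             # accepts a dict or a path to a JSON config
)
```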

Roger that! Thanks!