WangRongsheng / Aurora

🐳 Aurora is a Chinese-version MoE model. Aurora is further work based on Mixtral-8x7B that activates the model's chat capability in the Chinese open domain.

Home Page: https://arxiv.org/abs/2312.14557

Why does the paper utilize duplicate data from the Alpaca dataset?

danghieuan opened this issue

Dear @WangRongsheng,

Thank you for your contribution; this paper is amazing. However, I have a question regarding the instruction fine-tuning datasets, as described below:

Regarding the datasets alpaca_data_zh_51k and alpaca_gpt4_data_zh: why were both Alpaca-derived datasets used? The alpaca_gpt4_data_zh dataset appears to be of higher quality, since its responses are more natural than those in the original alpaca_data_zh_51k. Is it more beneficial to use both datasets for instruction fine-tuning, or would it be preferable to use only alpaca_gpt4_data_zh because of its more natural responses?

Thank you for your clarification.

Best regards,

Hi, @danghieuan

I am sorry for taking so long to respond to you.

alpaca_data_zh_51k contains responses generated by ChatGPT (GPT-3.5), while alpaca_gpt4_data_zh contains responses generated by GPT-4. In general, the quality of alpaca_gpt4_data_zh is higher than that of alpaca_data_zh_51k, but compared with data collected from other websites, we consider both datasets to be of high quality.

On the other hand, instruction fine-tuning requires a sufficient amount of training data, so we adopted both datasets.
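For illustration, here is a minimal sketch of how the two Alpaca-format files could be combined into one instruction-tuning set. The file names, output path, and deduplication-on-prompt step are placeholders and assumptions for this example, not necessarily the exact preprocessing used for Aurora:

```python
import json

# Assumed local paths; both files follow the standard Alpaca schema:
# a JSON list of {"instruction", "input", "output"} records.
paths = ["alpaca_data_zh_51k.json", "alpaca_gpt4_data_zh.json"]

merged, seen = [], set()
for path in paths:
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    for rec in records:
        # Deduplicate on the (instruction, input) pair so prompts shared by
        # both datasets are kept only once; the first file listed wins, so
        # reorder `paths` if you want GPT-4 responses to take precedence.
        key = (rec.get("instruction", ""), rec.get("input", ""))
        if key not in seen:
            seen.add(key)
            merged.append(rec)

with open("alpaca_zh_merged.json", "w", encoding="utf-8") as f:
    json.dump(merged, f, ensure_ascii=False, indent=2)

print(f"kept {len(merged)} records from {len(paths)} files")
```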

I hope this helps. Thank you.

Thank you, @WangRongsheng. I got it.