WangRongsheng / Aurora

🐳 Aurora is a Chinese-version MoE model. Aurora is further work based on Mixtral-8x7B that activates the model's chat capability in the Chinese open domain.

Home Page: https://arxiv.org/abs/2312.14557

Why does the paper utilize duplicate data from the Alpaca dataset?

danghieuan opened this issue

Dear @WangRongsheng,

Thank you for your contribution; this paper is amazing. However, I have a question regarding the instruction fine-tuning datasets, as described below:

Regarding the datasets alpaca_data_zh_51k and alpaca_gpt4_data_zh: why were both Alpaca-derived datasets used? The alpaca_gpt4_data_zh dataset appears to be of higher quality, since its responses are more natural than those in the original alpaca_data_zh_51k. Is it more beneficial to use both datasets for instruction fine-tuning, or would it be preferable to use only alpaca_gpt4_data_zh because of its more natural responses?

Thank you for your clarification.

Best regards,

Hi, @danghieuan

I am sorry for taking so long to respond to you.

alpaca_data_zh_51k contains responses generated by ChatGPT (GPT-3.5), while alpaca_gpt4_data_zh contains responses generated by GPT-4. In general, the quality of alpaca_gpt4_data_zh is higher than that of alpaca_data_zh_51k, but compared with data collected from other websites, we consider both datasets to be of high quality.

On the other hand, instruction fine-tuning requires a sufficient amount of training data, so we adopted both datasets.
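For illustration, here is a minimal sketch of how the two Alpaca-format files could be combined into one instruction-tuning set. The file names, output path, and deduplication-on-prompt step are placeholders and assumptions for this example, not necessarily the exact preprocessing used for Aurora:

```python
import json

# Assumed local paths; both files follow the standard Alpaca schema:
# a JSON list of {"instruction", "input", "output"} records.
paths = ["alpaca_data_zh_51k.json", "alpaca_gpt4_data_zh.json"]

merged, seen = [], set()
for path in paths:
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    for rec in records:
        # Deduplicate on the (instruction, input) pair so prompts shared by
        # both datasets are kept only once; the first file listed wins, so
        # reorder `paths` if you want GPT-4 responses to take precedence.
        key = (rec.get("instruction", ""), rec.get("input", ""))
        if key not in seen:
            seen.add(key)
            merged.append(rec)

with open("alpaca_zh_merged.json", "w", encoding="utf-8") as f:
    json.dump(merged, f, ensure_ascii=False, indent=2)

print(f"kept {len(merged)} records from {len(paths)} files")
```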

I hope this helps. Thank you.

Thank you, @WangRongsheng. I got it.