Why does the paper utilize duplicate data from the Alpaca dataset?
danghieuan opened this issue
Dear @WangRongsheng,
Thank you for your contribution; the paper is amazing. However, I have a question about the instruction fine-tuning datasets mentioned below.

Regarding alpaca_data_zh_51k and alpaca_gpt4_data_zh, can you explain why both Alpaca datasets were used? It appears that alpaca_gpt4_data_zh is of higher quality, as it contains more natural responses than the original alpaca_data_zh_51k. Is it more beneficial to use both datasets for the instruction fine-tuning step, or would it be preferable to use only alpaca_gpt4_data_zh because of its more natural responses?
Thank you for your clarification.
Best regards,
Hi, @danghieuan
I am sorry for taking so long to respond to you.
alpaca_data_zh_51k comes from responses generated by ChatGPT (GPT-3.5), while alpaca_gpt4_data_zh comes from responses generated by GPT-4. In general, the quality of alpaca_gpt4_data_zh is better than that of alpaca_data_zh_51k, but compared with data collected from other websites, we consider both alpaca_data_zh_51k and alpaca_gpt4_data_zh to be of high quality.

On the other hand, instruction fine-tuning requires a sufficient amount of training data, which is why we adopted both datasets.
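For anyone who wants to combine the two files themselves, here is a minimal sketch. It assumes both files follow the standard Alpaca JSON schema (a list of records with `instruction`, `input`, and `output` fields); the file paths and the `alpaca_zh_merged.json` output name are illustrative, not the repo's actual script:

```python
import json

# Hypothetical paths; point these at the actual dataset files.
FILES = ["alpaca_data_zh_51k.json", "alpaca_gpt4_data_zh.json"]

def load_alpaca(path):
    """Load one Alpaca-format JSON file: a list of
    {"instruction", "input", "output"} records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Concatenate both datasets into one instruction-tuning corpus.
merged = [rec for path in FILES for rec in load_alpaca(path)]

# Optionally drop exact duplicates by (instruction, input), so prompts
# shared by the two sources appear only once (keeping the first, i.e.
# whichever file is listed earlier).
seen, deduped = set(), []
for rec in merged:
    key = (rec["instruction"], rec.get("input", ""))
    if key not in seen:
        seen.add(key)
        deduped.append(rec)

with open("alpaca_zh_merged.json", "w", encoding="utf-8") as f:
    json.dump(deduped, f, ensure_ascii=False, indent=2)
```

If you prefer the GPT-4 responses when both files contain the same prompt, list alpaca_gpt4_data_zh first in `FILES` so its records win the deduplication.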
I hope this helps. Thank you.
Thank you, @WangRongsheng. I got it.