Yale-LILY / dart

Dataset for NAACL 2021 paper: "DART: Open-Domain Structured Data Record to Text Generation"


About the size of the DART test set and its performance

JinliangLu96 opened this issue · comments

Recently, I used GPT to run generation on the DART dataset. However, I found that the test set may differ from the one used in other works. I can only obtain 5,097 samples for testing, while the GEM website says their test set has 12,552 samples. The data provided by (Li et al., 2021) (https://github.com/XiangLi1999/PrefixTuning) also has 12,552 samples, but they do not provide gold references.
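One possible source of such a count mismatch (an assumption on my part, not confirmed for DART) is whether each source input appears once with multiple gold references, or is flattened into one (source, reference) pair per line. A minimal sketch of how grouping a flattened file by source recovers the number of unique test inputs (the sources and references below are placeholders, not real DART data):

```python
from collections import defaultdict

# Hypothetical flattened test data: one (source, reference) pair per row.
# If a 12,552-line file stores one reference per line, grouping by source
# would recover a smaller number of unique test inputs.
pairs = [
    ("<tripleset A>", "reference A1"),
    ("<tripleset A>", "reference A2"),
    ("<tripleset B>", "reference B1"),
]

def group_references(pairs):
    """Group flattened (source, reference) pairs into source -> [references]."""
    grouped = defaultdict(list)
    for src, ref in pairs:
        grouped[src].append(ref)
    return dict(grouped)

grouped = group_references(pairs)
print(len(pairs))    # flattened count: 3
print(len(grouped))  # unique sources: 2
```

Whether BLEU is computed against all references per source or against one reference per flattened sample can also change the score substantially, which might relate to the gap described below.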

Using the official evaluation scripts and test set, I obtain about 37-38 BLEU, which is much lower than the results (46-47 BLEU) reported by (Li et al., 2021) and other works (such as the leaderboard on GitHub: https://github.com/Yale-LILY/dart). So I am confused about which one is correct.

Could you please answer these questions if possible? I would appreciate it.

Reference

  1. Li, X. L., & Liang, P. (2021). Prefix-Tuning: Optimizing Continuous Prompts for Generation. arXiv preprint arXiv:2101.00190.