train/valid/test split for dataset

Question

taokz opened this issue a year ago · comments

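Could you provide the train/valid/test split used for the summarization task? The original data comes without a split, and the split is said to be random.
Alternatively, I would like to know what the input format looks like. For example, in --train_file ./dataprepare/data/healthcaremagic/train.json \, what is the format of this JSON file? Your comment also says the file can be a CSV; in that case, is it just two columns, source text and target text?
Thanks!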

YuanHongyi · Answer 1 · Sat Apr 01 2023 20:34:55 GMT+0800 (China Standard Time)

We have already provided the data preprocessing scripts in downstream_src/dataprepare/. The JSON file contains a list of samples, and each sample is a dict with three keys: 'id' (an integer identifier), 'src' (the source string), and 'tgt' (the target string). Please refer to the provided scripts for more details.
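
For illustration, here is a minimal Python sketch of a file in the format described above, together with a seeded random split. It is not the script from downstream_src/dataprepare/. The helper names (build_samples, random_split, write_json), the 80/10/10 ratio, the seed, and the output file names are all assumptions made for this example; only the 'id', 'src', and 'tgt' keys come from the answer.

```python
import json
import random


def build_samples(pairs):
    """Turn (source, target) string pairs into dicts with 'id', 'src', 'tgt'."""
    return [{"id": i, "src": src, "tgt": tgt} for i, (src, tgt) in enumerate(pairs)]


def random_split(samples, valid_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle a copy of the samples and cut it into train/valid/test lists.

    The fractions and the seed are assumptions, not values from the repository.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_frac)
    n_test = int(len(shuffled) * test_frac)
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test


def write_json(samples, path):
    """Write the list of sample dicts as a single JSON array."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(samples, f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    # Placeholder pairs; replace with the real (source, target) texts.
    pairs = [(f"patient question {i}", f"summary {i}") for i in range(10)]
    train, valid, test = random_split(build_samples(pairs))
    for name, part in (("train", train), ("valid", valid), ("test", test)):
        write_json(part, f"{name}.json")
```

Fixing the seed keeps the split reproducible across runs, which matters when the original data ships without an official split.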