GanjinZero / BioBART

BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model [ACL-BioNLP 2022]

Home Page: https://arxiv.org/abs/2204.03905

train/valid/test split for dataset

taokz opened this issue

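Could you provide the train/valid/test split for the summarization task? The original data comes without a split; it is only described as random.
Alternatively, I would like to understand the input format. For example, with --train_file ./dataprepare/data/healthcaremagic/train.json \, what is the format of this JSON file? Your comment also says the file can be a CSV. In that case, is it just two columns, source text and target text?
Thanks!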

The data preprocessing scripts are already provided in downstream_src/dataprepare/. The JSON file contains a list of samples, and each sample is a dict with three keys: 'id' (an integer identifier), 'src' (the source string), and 'tgt' (the target string). Please refer to the provided scripts for more details.
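For illustration, here is a minimal sketch of producing files in the format described above. It is not the repository's preprocessing script: the helper names (make_samples, random_split, write_split), the 80/10/10 ratios, the seed, and the output directory are all assumptions made for this example. It also assumes each file is a single JSON array of dicts; check the scripts in downstream_src/dataprepare/ for the exact layout they emit.

```python
import json
import os
import random


def make_samples(pairs):
    """Turn (source, target) string pairs into dicts with 'id', 'src', 'tgt' keys."""
    return [{"id": i, "src": src, "tgt": tgt} for i, (src, tgt) in enumerate(pairs)]


def random_split(samples, valid_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle a copy of the samples and cut it into train/valid/test parts.

    The fractions and the seed are assumptions, not the split used in the paper.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_frac)
    n_test = int(len(shuffled) * test_frac)
    n_train = len(shuffled) - n_valid - n_test
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],
    }


def write_split(split, out_dir):
    """Write each part to <out_dir>/<name>.json as one JSON array."""
    os.makedirs(out_dir, exist_ok=True)
    for name, part in split.items():
        path = os.path.join(out_dir, name + ".json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(part, f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    # Toy data standing in for real (question, summary) pairs.
    toy_pairs = [("patient question %d" % i, "short summary %d" % i) for i in range(20)]
    split = random_split(make_samples(toy_pairs))
    write_split(split, "./toy_split")  # hypothetical output directory
```

Fixing the seed keeps the random split reproducible, so results from different runs stay comparable even though no official split is published with the original data.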