Information Extraction Datasets Collections
Contributions of public information extraction datasets are welcome (especially Chinese information extraction datasets).
2022-02-24-Updated:
TODO:
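| Datasets | Domain | Language | Intro | Ent Types | PaperWithCode | Train/Dev/Test (Preprocess Code) | Download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CoNLL02 | News | English | | | | | download |
| CoNLL03 | News | English | | | | | download |
| CoNLL03 | News | English | | LOC, ORG, PER, MISC | | doc | download |
| CoNLL 2017 | News | | Multilingual: treebanks developed for 40+ languages, with cross-linguistically consistent annotation and recoverability of the original raw texts | | | | download |
| Cross-lingual Name Tagging | Wiki | 282 Languages | | | | doc | download |
| OntoNotes4.0 | News | English, Chinese, Arabic | | PERSON, NORP, LOC, GPE, PRODUCT, EVENT, LAW | | doc | download |
| OntoNotes5.0 | News | English, Chinese, Arabic | | | | | download |
| NNE [2019] | News | English | A Dataset for Nested Named Entity Recognition in English Newswire | | | | download |
| MSRA | News | Chinese | | Person, Location, Organization | | | download |
| WeiBo | Weibo (microblog) | Chinese | | Location, Person, Organization, Administrative Region | | | |
| Resume | Resumes | Chinese | | Person, Nationality, Native Place, Ethnicity, Major, Degree, Organization, Title | | | download |
| BosonNER | News | Chinese | | Time, Location, Person, Organization, Company, Product | | | download |
| ClueNER | News | Chinese | | Organization, Person, Address, Company, Government, Book, Game, Movie, Position, Scenic Spot | | | download |
| People Daily | News | Chinese | | Location, Organization, Person | | | download |
| CCKS2019-Task1 | Electronic medical records | Chinese | Dataset for CCKS 2019 evaluation Task 1, "Named entity recognition for Chinese electronic medical records" | Laboratory Test, Imaging Examination, Surgery, Disease and Diagnosis, Drug, Anatomical Site | | | download |
| CCKS2020-Task | | Chinese | Named entity dataset for test and evaluation | Test Element, Performance Indicator, System Component, Task Scenario | | | download |
| CCKS2017-2020 | Electronic medical records | Chinese | | Symptom and Sign, Examination and Test, Disease and Diagnosis, Treatment, Body Part | | | download |
| CCKS2018 | Electronic medical records | Chinese | | Anatomical Site, Symptom Description, Independent Symptom, Drug, Surgery | | | download |
| CCKS2019 | Electronic medical records | Chinese | | Disease, Diagnosis, Examination, Test, Surgery, Drug, Anatomical Site | | | download |
| CCKS2020 | Electronic medical records | Chinese | | Disease, Diagnosis, Examination, Test, Surgery, Drug, Anatomical Site | | | download |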
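
NER corpora such as the CoNLL ones listed above are commonly distributed as token-per-line files with BIO-style tags. The sketch below assumes that layout: whitespace-separated columns, the token in the first column, the tag in the last, and blank lines between sentences. The function names `read_conll` and `bio_to_spans` are hypothetical helpers, not part of any dataset's tooling, and individual datasets may use different columns or tagging schemes (for example BIOES), so check each download before relying on it.

```python
def read_conll(lines, token_col=0, tag_col=-1):
    """Group token-per-line rows into sentences of (token, tag) pairs.

    Assumes whitespace-separated columns and blank lines between sentences.
    """
    sentences, current = [], []
    for raw in lines:
        line = raw.strip()
        # Blank lines and -DOCSTART- markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split()
        current.append((cols[token_col], cols[tag_col]))
    if current:
        sentences.append(current)
    return sentences


def bio_to_spans(tags):
    """Convert BIO tags into (entity_type, start, end_exclusive) spans."""
    spans = []
    start, etype = None, None
    # A trailing "O" sentinel closes any entity still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((etype, start, i))
            start, etype = None, None
        # "B-" always opens a span; a stray "I-" opens one leniently.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


# Illustrative sample in a CoNLL03-like layout (token, POS, chunk, NER tag).
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
""".splitlines()

for sentence in read_conll(sample):
    tokens = [tok for tok, _ in sentence]
    for etype, start, end in bio_to_spans([tag for _, tag in sentence]):
        print(etype, " ".join(tokens[start:end]))
```

Working with spans rather than raw tags makes it easier to compare datasets whose entity type inventories differ, as they do across the rows above.
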
Relation Extraction
| Datasets | Domain | Language | Introduction | Rel Types | PaperWithCode | Train/Dev/Test (Preprocess Code) | Download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ACE04 | | English | | | | Common splits: link1, link2 | download |
| ACE05 | | English | | | | Common split: link | download |
| Conll04 | News | English | | | | | download |
| GENIA | Bio | English | The GENIA corpus is the primary collection of biomedical literature compiled and annotated within the scope of the GENIA project. The corpus was created to support the development and evaluation of information extraction and text mining systems for the domain of molecular biology. | | | | download |
| ADE | Bio, Drug | English | A benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. | | | Note: evaluated with 10-fold cross-validation | download |
| ChemProt | BioPapers | English | ChemProt consists of 1,820 PubMed abstracts with chemical-protein interactions annotated by domain experts and was used in the BioCreative VI text mining chemical-protein interactions shared task. | | | | download |
| SciERC | SciPapers | English | | | | | download |
| DialogRE | Film | English, Chinese | The first human-annotated dialogue-based relation extraction dataset, containing 1,788 dialogues originating from the complete transcripts of the famous American television situation comedy Friends. | | | 5,936 / 1,928 / 1,858 (36 rels) | download |
| DocRED | News | English | DocRED is a document-level relation extraction dataset built from Wikipedia. Each document is annotated with named entity mentions, coreference information, intra- and inter-sentence relations, and supporting evidence. It covers 96 Wikidata relation types spanning science, art, time, and personal life. | | | | download |
| TACRED | News | English | TACRED is a large-scale relation extraction dataset with 106,264 examples built over newswire and web text. Examples in TACRED cover 41 relation types as used in the TAC KBP challenges. | | | | download |
| CDR | Sci-Biometrics | English | A human-annotated dataset in the biomedical domain. It consists of 500 documents for training. The task is to predict the binary interactions between Chemical and Disease concepts. | | | Data preprocessing: link | download |
| GDA | Sci-Biometrics | English | A large-scale dataset in the biomedical domain. It contains 29,192 articles. The task is to predict the binary interactions between Gene and Disease concepts. | | | Data preprocessing: link | download |
| SciREX | Sci-CS | English | SCIREX is a document-level IE dataset that encompasses multiple IE tasks, including salient entity identification and document-level N-ary relation identification from scientific articles. The dataset is annotated by integrating automatic and human annotations, leveraging existing scientific knowledge resources. | | sota | Data preprocessing: link | download |
| SciCN | Sci-CS | Chinese | | | | | download |
| NYT-10 | Wiki | English | Built in 2010 from the NYT corpus by distant supervision against Freebase. It contains 53 relations, such as founders and place_of_birth (including one NA relation). | | | Data split and preprocessing: CasRel | download |
| WebNLG | Wiki | English | The WebNLG challenge consists in mapping data to text. The training data consists of Data/Text pairs where the data is a set of triples extracted from DBpedia and the text is a verbalisation of these triples. | | | Data split and preprocessing: CasRel | download |
| SemEval-2010-Task8 | News | English | The dataset used for Task 8 of SemEval-2010 (the International Workshop on Semantic Evaluation). It contains 8,000 training samples and 2,717 test samples. | | link | | download |
| FewRel | Wiki | English | Contains 100 relation classes and 70,000 relation instances. The average sentence length is 24.99. | | | | download |
| Wiki80 | Wiki | English | Wiki80 is a relation dataset derived from FewRel. It contains 80 relations such as location, part of, and follows, with 700 instances per relation, for 56,000 samples in total. | | | | download |
| DuIE2.0 | News | Chinese | Contains more than 430,000 triples, 210,000 Chinese sentences, and 48 predefined relation types. | | | | download |
| CCKS2019 | Electronic medical records | Chinese | A hierarchical relation classification task with three coarse classes (kinship, social, teacher-student), four mid-level classes (spouse, blood relative, in-law, friendship), and 35 fine-grained relation types (for example current husband, ex-wife). | | | | download |
| Chinese Literature Text | Literature | Chinese | An entity and relation dataset for Chinese literature, annotated with 7 entity types (Thing, Person, Location, Time, Metric, Organization, Abstract) and 9 relation types (Located, Part-Whole, Family, General-Special, Social, Ownership, Use, Create, Near). | | | | download |
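
The ADE entry above is evaluated with 10-fold cross-validation rather than a fixed train/dev/test split. As a rough illustration of what that means, the sketch below partitions example indices into k folds and yields one train/test pair per fold. The function name `k_fold_splits`, the seed, and the shuffling strategy are assumptions made for illustration. They do not reproduce the splits used in published ADE results, which normally come from the preprocessing code linked in the table.

```python
import random


def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.

    Indices are shuffled once with a fixed seed, then dealt round-robin
    into k folds so that fold sizes differ by at most one.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for j, fold in enumerate(folds) if j != held_out for idx in fold]
        yield train, test


# Usage: every example is held out exactly once across the 10 folds.
for fold_id, (train, test) in enumerate(k_fold_splits(23, k=10)):
    print(fold_id, len(train), len(test))
```

Scores reported under this protocol are usually averaged over the k held-out folds, so they are not directly comparable with single-split results for the other datasets.
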
Event Extraction