yangjingla / IE-Datasets-Collections

A collection of Chinese and English information extraction datasets

Information Extraction Datasets Collections

Contributions of public information extraction datasets are welcome (especially Chinese information extraction datasets).


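Updated 2022-02-24. TODO:

  • Create separate documents for the Chinese and the English datasets
  • Provide an introduction to the related paper for each dataset
  • Preprocess each dataset with the preprocessing code
  • Report the entity types and the train/dev/test split of each dataset
  • Upload the processed data to Drive and provide download links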
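
The statistics item in the TODO list above (entity types plus split sizes) could be sketched roughly as follows. This is a minimal sketch, assuming each split is already loaded as a list of BIO tag sequences; the `splits` dictionary and the `count_entity_types` helper are hypothetical illustrations, not code from this repository. Datasets tagged with other schemes (for example BMES) would need a different start-of-entity check.

```python
from collections import Counter


def count_entity_types(tag_sequences):
    """Count entities per type in BIO-tagged sequences (one entity per B- tag)."""
    counts = Counter()
    for tags in tag_sequences:
        for tag in tags:
            if tag.startswith("B-"):
                counts[tag[2:]] += 1
    return counts


# Hypothetical toy splits; a real dataset would be read from its own files.
splits = {
    "train": [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "I-ORG"]],
    "dev": [["B-LOC", "O"]],
    "test": [["B-PER", "O", "B-MISC"]],
}

for name, sequences in splits.items():
    # Report the number of sentences and the per-type entity counts of each split.
    print(name, len(sequences), dict(count_entity_types(sequences)))
```

Counting only `B-` tags keeps the sketch short; it assumes every entity starts with a well-formed `B-` tag.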

Named Entity Recognition

Datasets Domain Language Intro Ent Types PaperWithCode Train/Dev/Test(Preprocess Code) Download
CoNLL02 News English download
CoNLL03 News English download
CoNLL03 News English LOC, ORG, PER, MISC doc download
CoNLL 2017 News Multilingual Treebanks for 40+ languages with cross-linguistically consistent annotation and recoverability of the original raw texts download
Cross-lingual Name Tagging Wiki 282 Languages doc download
OntoNotes4.0 News English, Chinese, Arabic PERSON, NORP, LOC, GPE, PRODUCT, EVENT, LAW doc download
OntoNotes5.0 News English, Chinese, Arabic download
NNE [2019] News English A Dataset for Nested Named Entity Recognition in English Newswire download
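MSRA News Chinese person, location, organization download
WeiBo Weibo (microblog) Chinese location, person, organization, administrative region
Resume Resumes Chinese person, nationality, native place, ethnicity, major, degree, organization, job title download
BosonNER News Chinese time, location, person, organization, company, product download
ClueNER News Chinese organization, person, address, company, government, book, game, movie, position, scenic spot download
People Daily News Chinese location, organization, person download
CCKS2019-Task1 Electronic medical records Chinese Dataset of CCKS 2019 evaluation Task 1, "Named entity recognition for Chinese electronic medical records" laboratory test, imaging examination, surgery, disease and diagnosis, drug, anatomical site download
CCKS2020-Task Chinese Named entity dataset for test and evaluation test element, performance indicator, system component, task scenario download
CCKS2017-2020 Electronic medical records Chinese symptoms and signs, examination and test, disease and diagnosis, treatment, body part download
CCKS2018 Electronic medical records Chinese anatomical site, symptom description, independent symptom, drug, surgery download
CCKS2019 Electronic medical records Chinese disease, diagnosis, examination, laboratory test, surgery, drug, anatomical site download
CCKS2020 Electronic medical records Chinese disease, diagnosis, examination, laboratory test, surgery, drug, anatomical site download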

Relation Extraction

Datasets Domain Language Introduction Rel Types PaperWithCode Train/Dev/Test(Preprocess Code) Download
ACE04 English Common splits: link1, link2 download
ACE05 English Common split: link download
CoNLL04 News English download
GENIA Bio English The GENIA corpus is the primary collection of biomedical literature compiled and annotated within the scope of the GENIA project. The corpus was created to support the development and evaluation of information extraction and text mining systems for the domain of molecular biology. download
ADE Bio, Drug English A benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. Note: this dataset is evaluated with 10-fold cross-validation. download
ChemProt BioPapers English ChemProt consists of 1,820 PubMed abstracts with chemical-protein interactions annotated by domain experts and was used in the BioCreative VI text mining chemical-protein interactions shared task. download
SciERC SciPapers English download
DialogRE Film English, Chinese The first human-annotated dialogue-based relation extraction dataset, containing 1,788 dialogues originating from the complete transcripts of the American television situation comedy Friends. 5,936 / 1,928 / 1,858 (36 rels) download
DocRED News English DocRED is a document-level relation extraction dataset built from Wikipedia. Each document is annotated with named entity mentions, coreference information, intra- and inter-sentence relations, and supporting evidence. It covers 96 Wikidata relation types spanning science, art, time, and personal life. download
TACRED News English TACRED is a large-scale relation extraction dataset with 106,264 examples built over newswire and web text. Examples in TACRED cover 41 relation types as used in the TAC KBP challenges. download
CDR Sci-Biometrics English A human-annotated dataset in the biomedical domain. It consists of 500 documents for training. The task is to predict the binary interactions between Chemical and Disease concepts. Preprocessing: link download
GDA Sci-Biometrics English A large-scale dataset in the biomedical domain. It contains 29,192 articles. The task is to predict the binary interactions between Gene and Disease concepts. Preprocessing: link download
SciREX Sci-CS English SciREX is a document-level IE dataset that encompasses multiple IE tasks, including salient entity identification and document-level N-ary relation identification from scientific articles. The dataset is annotated by integrating automatic and human annotations, leveraging existing scientific knowledge resources. sota Preprocessing: link download
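SciCN Sci-CS Chinese download
NYT-10 Wiki English Built in 2010 by distant supervision, aligning the NYT corpus with Freebase. It contains 53 relation types (including one NA type), such as founders and place_of_birth. Splits and preprocessing: CasRel download
WebNLG Wiki English The WebNLG challenge consists in mapping data to text. The training data consists of Data/Text pairs where the data is a set of triples extracted from DBpedia and the text is a verbalisation of these triples. Splits and preprocessing: CasRel download
SemEval-2010-Task8 News English The dataset used for Task 8 of SemEval-2010 (the International Workshop on Semantic Evaluation). It contains 8,000 training samples and 2,717 test samples. link download
FewRel Wiki English Contains 100 relation classes and 70,000 relation instances. The average sentence length is 24.99. download
Wiki80 Wiki English Wiki80 is a relation dataset derived from FewRel. It contains 80 relation types such as location, part of, and follows, with 700 instances per relation and 56,000 samples in total. download
DuIE2.0 News Chinese Contains over 430,000 triples, 210,000 Chinese sentences, and 48 predefined relation types download
CCKS2019 Electronic medical records Chinese A hierarchical relation classification task with 3 top-level categories (family, social, teacher-student), 4 mid-level categories (spouse, blood relative, in-law, friendship), and 35 fine-grained relation types (for example current husband, ex-wife) download
Chinese Literature Text Literary works Chinese An entity-relation dataset for Chinese literature, annotated with 7 entity types (thing, person, location, time, metric, organization, abstract) and 9 relation types (located, part-whole, family, general-special, social, ownership, use, create, near) download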
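
The ADE entry above notes that the corpus is evaluated with 10-fold cross-validation rather than a fixed train/dev/test split. As a rough illustration of that protocol, here is a minimal sketch in plain Python. The `k_fold_splits` helper, the seed, and the toy `examples` list are assumptions made for this example; published results may rely on fixed, shared fold files, which this sketch does not reproduce.

```python
import random


def k_fold_splits(examples, k=10, seed=0):
    """Yield (train, test) pairs so that each example is in exactly one test fold."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # fixed seed keeps the folds reproducible
    folds = [items[i::k] for i in range(k)]  # k folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Hypothetical stand-in for the dataset: 25 example ids.
examples = list(range(25))
for fold_id, (train, test) in enumerate(k_fold_splits(examples, k=10)):
    print(fold_id, len(train), len(test))
```

Scores are then usually averaged over the folds, so every example contributes to evaluation exactly once.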

Event Extraction

  • TODO

ToolKit

ToolKit Intro Repo
spaCy English NLP toolkit homepage
FastHAN Chinese NLP toolkit homepage
HanLP Chinese NLP toolkit homepage
ZJU-DEEPKG Knowledge graph extraction homepage
THU-NRE Information extraction toolkit homepage

More Datasets

Forum Intro Website
Hugging Face Datasets https://huggingface.co/datasets?sort=downloads&search=web
Github
Kaggle

Contact
