SIRConceicao / N-ary-Corpora-Collection

Catalogue of N-ary biomedical corpora focused on the biomedical field.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

N-ary Corpora Collection

The purpose of this repository is to gather N-ary datasets that are used for the text mining relation extraction task. It will be mostly focused on biomedical datasets, although other relevant ones from other scientific domains might be present. Only datasets built for n-ary relations are mentioned and not others that can be adapted to the task.

Motivation:

Relation extraction (RE) is a task of text mining that aims to analyse the relations between the identified entities [1]. N-ary relation extraction aims to extract relations from n entities. Currently, only few datasets are available for this type of RE. N-ary relation extraction can help to answer more specific questions such as:

  • given a mutation in a gene, which drug would it respond to, resulting in a gene-mutation-drug, ternary relation [2];
  • given a gene variation how does it impact drug response phenotype, a ternary relation of gene variation-drug-phenotype [3];
  • which type of drugs combinations will result in a positive effect [4];
  • given a specific mutation in a gene, how does it affect the reaction to the drug [5].

1. Biomedical Datasets

# Year Entities N-ary Nº Relations Type Annotation Level Relation Source Reference & Dataset
1.1 2017 Drug-Gene-Mutation  Binary & Ternary Ternary 3462 | Drug-Gene 137 464 | Drug-Mutation 3192 Silver Sent & Doc Filtered from 1 Million Full text from PubMed Central Cross-Sentence N-ary Relation Extraction with Graph LSTMs | [Dataset]
1.2 2020 Gene variations, Genes, Drugs & Phenotypes Ternary 2871 Gold Sent 911 PubMed Abstracts PGxCorpus, a manually annotated corpus for pharmacogenomics | [Dataset]
1.3 2022 Drug combinations Variable length N-ary 1248 Gold Sent,Par or Abs 1634 PubMed Abstracts A Dataset for N-ary Relation Extraction of Drug Combinations | [Dataset]
Note: Sent = Sentence, Par = Paragraph, Abs = Abstract, Doc = Document

#1.1 - Drug-Gene-Mutation

A silver standard drug-gene-mutation dataset in the context of molecular tumour boards. Filtering from an initial circa one million full text articles from PubMed Central and applying distant supervision, the final dataset resulted in 3,462 ternary relation instances, where just 59 relations were unique. The dataset could also be divided in sub-relations of drug-gene with 137,469 instances and of drug-mutation with 3,192 instances. Distant supervision was applied to the binary pairs using the Gene Drug Knowledge Database.

Characteristics

  • Language : English
  • Format : TSV
  • Standard : Silver
  • Data origin : Filtered from 1M PubMed abstracts
  • Number of instances : 144,150
  • N-ary : 2-ary (drug-gene & drug-mutation) & 3-ary (drug-gene-mutation)
  • Total relations :
    • Positive examples :
      • 3-ary : 3462 (59 unique)
      • 2-ary : Drug-Gene 137,496 & Drug-Mutation 3192
    • Negative examples were created by randomly sampling co-occurring entity triples without known interactions

#1.2 - PGxCorpus

The PGxCorpus, is a manually annotated corpus consisting in entities of interest in the pharmacogenomics field, such as gene variations, phenotypes, genes and drugs. This corpus consists of 945 sentences with 6,761 annotated entities and 2,871 relations, 10 types of entities and 7 types of relations. Although this corpus is not specifically built for n-ary relations, 92% of its sentences have three target entities of genomic factor, chemical and phenotype.

Characteristics

  • Language : English
  • Format : Brat
  • Standard : Gold
  • Data origin : 911 PubMed abstracts
  • Number of instances : 945 phrases
  • N-ary : 2-ary & 3-ary (drug-genetic factors-phenotype)
  • Total relations : 2871

#1.3 - Drug Combinations Dataset

Studies have suggested that the combination of two of more drugs have a more positive impact treating some medical conditions than a single drug. This dataset was build using 1600 manually annotated abstracts, having a variable-length n-ary relations between drug entities (from 2 to 15 drug mentions.) These mentions might be within a sentence or in a paragraph or abstract (enclosing context).

Characteristics:

  • Language : English
  • Format : JSON lines
  • Standard : Gold
  • Data origin : 1600 PubMed abstracts
  • Number of instances :1634
  • N-ary : variable lenght n-ary (drug-drug (...))
  • Total relations : 1248
    • Per label:
      • POS_COMB : 838
      • OTHER_COMB : 410
      • NO_COMB : 591
  • Train set size : 1362
  • Test set size : 272

2. Other scientific domains datasets

# Year Entities N-ary Nº Relations Type Annotation Level Relation Source Reference & Dataset
2.1 2020 Dataset, Metric, Task, Method Binary & Quaternary Gold 16 2-ary & 5 4-ary (average per document) Doc 483 Fully annotated documents from Papers with Code SciREX: A Challenge Dataset for Document-Level Information Extraction | [Dataset]
Note: Doc = Document

#2.1 - Scirex

SCIREX is a document-level dataset that includes several Information Extraction tasks, such as document-level N-ary relation identification from scientific publications and entity identification. It presents relations between the entities of type (Dataset, Method, Metric and Task) which focus on the main results of a scientific article. It is fully annotated with entities, their mentions, their coreferences, and their document level relation

Characteristics

  • Language : English
  • Format : JSON lines
  • Standard : Gold
  • Data origin : 483 Fully annotated documents from Papers with Code
  • Number of instances : UNK
  • N-ary : 2-ary & 4-ary
  • Total relations : 16 2-ary & 5 4-ary (average per document)

References:

[1] J. Liu, H. Ren, M. Wu, J. Wang, and H. jin Kim, “Multiple relations extraction among multiple entities in unstructured text,” Soft Computing, vol. 22, pp. 4295–4305, 2018.

[2] N. Peng, H. Poon, C. Quirk, K. Toutanova, and W. tau Yih, “Cross-sentence n-ary relation extraction with graph lstms,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 101–115, 4 2017

[3] J. Legrand, R. Gogdemir, C. Bousquet, K. Dalleau, M.-D. Devignes, W. Digan, C.-J. Lee, N.-C. Ndiaye, N. Petitpain, and P. Ringot, “Pgxcorpus, a manually annotated corpus for pharmacogenomics,” Scientific data, vol. 7, pp. 1–13, 2020.

[4] A. Tiktinsky, V. Viswanathan, D. Niezni, D. Meron Azagury, Y. Shamay, H. Taub-Tabib, T. Hope, and Y. Goldberg, “A dataset for n-ary relation extraction of drug combinations,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (Seattle, United States), pp. 3190–3203, Association for Computational Linguistics, July 2022.

[5] R. Jia, C. Wong, and H. Poon, “Document-level n-ary relation extraction with multiscale representation learning,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), (Minneapolis, Minnesota), pp. 3693–3704, Association for Computational Linguistics, June 2019.


This section provides the information about the search queries and platforms for this work:

[Date : 5-12-2022]

Search queries: "n-ary"; "n-ary" AND "relation extraction"; "n-ary" AND "relation extraction" AND "biomedical";
"n-ary" AND "relation extraction" OR "biomedical"

Web Search Platforms: [Google Scholar]; [PubMed]; [Semantic Scholar]

About

Catalogue of N-ary biomedical corpora focused on the biomedical field.