BioMedical-NLP-Dataset

A collection of natural language processing datasets for the biomedical and health domain.

Benchmark Evaluation
Information Extraction
Named Entity Recognition
Entity Normalization
Relation Extraction
Event Extraction
Coreference Resolution
Text Classification
Text Similarity
Document Retrieval
Question Answering
Knowledge Graph
Pre-trained Language Model
Large Language Model
Others

Benchmark Evaluation

Benchmark evaluation in NLP for the biomedical or health domain means measuring the performance of NLP models or algorithms on a standardized set of domain-specific tasks and datasets. A benchmark here is a set of tasks or datasets that the research community widely accepts as representative of the challenges in the domain. These benchmarks typically include tasks such as named entity recognition, relation extraction, and text classification.

CBLUE

CBLUE (Chinese Biomedical Language Understanding Evaluation) is a benchmark for Chinese medical information processing.

paper github website
BLURB

BLURB (Biomedical Language Understanding and Reasoning Benchmark) is a collection of resources for biomedical natural language processing.

paper website

Information Extraction

Information extraction (IE) in Natural Language Processing (NLP) for the Biomedicine or Health domain refers to the process of automatically extracting structured information from unstructured or semi-structured biomedical or health-related texts such as electronic health records (EHRs), clinical trial reports, scientific publications, and social media posts. The goal of IE is to identify and extract specific pieces of information such as named entities (e.g., diseases, drugs, genes, proteins), relationships between them (e.g., drug-disease associations, gene-protein interactions), and events (e.g., adverse drug reactions, disease diagnoses) mentioned in the text.

Named Entity Recognition

Named Entity Recognition (NER) is a natural language processing (NLP) task that involves identifying and categorizing named entities in unstructured text. In the context of biomedicine or health domain, NER specifically refers to identifying and categorizing named entities such as diseases, symptoms, treatments, drugs, genes, proteins, and other biomedical concepts.
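
As a rough illustration of what "identifying and categorizing" entities means in practice, many NER datasets label each token with a BIO tag (B = beginning of an entity, I = inside, O = outside). The sketch below assumes that tagging scheme; the function name `bio_to_spans`, the example sentence, and the `Drug`/`Disease` labels are hypothetical and do not come from any dataset listed here.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity_text, entity_type) pairs from BIO-tagged tokens."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts: flush the one in progress, if any.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # Continue the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" tag, or an "I-" tag with no matching open entity (dropped here).
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hypothetical example sentence and labels.
tokens = ["Aspirin", "reduces", "the", "risk", "of", "myocardial", "infarction", "."]
tags = ["B-Drug", "O", "O", "O", "O", "B-Disease", "I-Disease", "O"]
print(bio_to_spans(tokens, tags))
```

Individual challenges may distribute annotations in other formats (for example standoff character offsets), so a conversion step is usually needed before a decoder like this applies.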

2019
- BioNLP-OST 2019 CRAFT-CA task: Concept Annotation Task
  
  Chemical Entities of Biological Interest (CHEBI), Cell Ontology (CL), Gene Ontology Biological Process (GO_BP), Gene Ontology Cellular Component (GO_CC), Gene Ontology Molecular Function (GO_MF), Molecular Process Ontology (MOP), NCBI Taxonomy (NCBITaxon), Protein Ontology (PR), Sequence Ontology (SO), Uberon (UBERON).
- BioNLP-OST 2019 PharmaCoNER task
  
  Entity types: Normalizables, No_Normalizables, Proteinas, Unclear
- BioNLP-OST 2019 AGAC task
  
  Task 1 is a conventional NER task with 12 labels covering molecular phenomena related to gene mutation: Variation (Var), Molecular Physiological Activity (MPA), Interaction, Pathway, Cell Physiological Activity (CPA), Regulation (Reg), Positive Regulation (PosReg), Negative Regulation (NegReg), Disease, Gene, Protein, Enzyme.
  
  Task 2 is a relation extraction task that captures the thematic roles between entities: ThemeOf, CauseOf.
  
  Task 3 is a prediction task for novel link discovery that extracts triples of gene, function change, and disease from the corpus texts.
- BioNLP-OST 2019 Bacteria-Biotope Task
  
  The BB task is an information extraction task involving entity recognition, entity normalization, and relation extraction.
  
  4 entity types: Microorganism, Habitat, Geographical, Phenotype.
  
  2 relation types: Lives_in, Exhibits.
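- CCKS 2019: Named Entity Recognition for Chinese Electronic Medical Records
  
  Subtask 1: medical named entity recognition. Entity types: diseases and diagnoses, imaging examinations, laboratory tests, surgeries, drugs, and anatomical sites. Subtask 2: medical entity and attribute extraction (cross-hospital transfer).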
  
  data
2018
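- CHIP 2018 Task 1: Clinical Entity and Attribute Extraction from Chinese Electronic Medical Records
  
  Extract common fields for tumor-related diseases from the text descriptions of medical imaging examination results, including primary tumor site, primary tumor size, and metastatic sites. Training data: 600 imaging examination reports related to lung cancer and breast cancer. Test data: 200 reports.
- CCKS 2018: Named Entity Recognition for Chinese Electronic Medical Records
  
  Given a set of plain-text electronic medical record documents, the goal is to identify and extract entity mentions relevant to clinical medicine. Entity types: anatomical sites, symptom descriptions, independent symptoms, drugs, and surgeries.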
  
  data
2017
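- CCKS 2017 Task 2: Named Entity Recognition for Electronic Medical Records
  
  Entity types: body parts, symptoms and signs, diseases and diagnoses, examinations and tests, and treatments.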
  
  data
2015
- BioCreative V Track 2-CHEMDNER-patents
  
  Automatic extraction of chemical and biological data from medicinal chemistry patents.
  
  The CHEMDNER-patents corpus consists of training, development, and test sets, each comprising 7,000 manually annotated records.
  
  CEMP (chemical entity mention in patents, main task)
  
  CPD (chemical passage detection, text classification task)
  
  GPRO (gene and protein related object task)
  
  paper
2012
- BioCreative IV Track 2-CHEMDNER Task: Chemical compound and drug name recognition task
  
  detect mentions of chemical compounds and drugs.
  
  paper
- BioCreative IV Track 4-GO Task
  
  SubTask A: Retrieving GO evidence sentences for relevant genes.
  
  SubTask B: Predicting GO terms for relevant genes.
  
  paper
2009
- n2c2 2009: Medication Extraction Challenge
  
  Medication extraction challenge aims to encourage development of natural language processing systems for the extraction of medication-related information from narrative patient records. Information to be targeted includes medications, dosages, modes of administration, frequency of administration, and the reason for administration.
  
  paper
2006
- BioCreative II Task 1A: Gene Mention Tagging
  
  named entity extraction of gene and gene product mentions in text.
  
  paper
- n2c2 2006: Deidentification and Smoking Challenge
  
  The challenge studies two questions on the same data. Task 1: automatic de-identification of clinical records. PHI categories: Patients, Doctors, Locations, Hospitals, Dates, IDs, Phone Numbers, Ages.
  
  paper
2004
- BioCreative I Task 1A: Gene Mention Identification
  
  focused on the identification of gene or protein names in running text.
  
  paper

Entity Normalization

Entity normalization (EN) in Natural Language Processing (NLP) for the Biomedicine or Health domain refers to the process of mapping a specific named entity (e.g., a disease, a drug, a gene) mentioned in a text to a unique identifier in a reference knowledge base or ontology. The goal of EN is to resolve ambiguity and ensure consistency in entity representations across different texts and knowledge resources.
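
The mapping step described above can be sketched with a dictionary lookup plus fuzzy string matching. This is only a minimal illustration under stated assumptions: the lexicon, the `CONCEPT:` identifiers, and the `normalize` function are hypothetical, and `difflib` stands in for whatever candidate ranking a real system would use against an actual ontology.

```python
import difflib
from typing import Dict, Optional

# Hypothetical mini-lexicon: surface forms mapped to made-up concept identifiers.
LEXICON: Dict[str, str] = {
    "myocardial infarction": "CONCEPT:0001",
    "heart attack": "CONCEPT:0001",
    "hypertension": "CONCEPT:0002",
    "high blood pressure": "CONCEPT:0002",
}


def normalize(mention: str, cutoff: float = 0.8) -> Optional[str]:
    """Map a mention to a concept identifier, or None if nothing is close enough."""
    key = mention.strip().lower()
    # 1) Exact match after simple case and whitespace normalization.
    if key in LEXICON:
        return LEXICON[key]
    # 2) Fallback: closest surface form by character-level similarity.
    close = difflib.get_close_matches(key, list(LEXICON), n=1, cutoff=cutoff)
    return LEXICON[close[0]] if close else None


print(normalize("Heart Attack"))   # synonym resolves to the same identifier
print(normalize("hypertensoin"))   # misspelling handled by the fuzzy fallback
print(normalize("fever"))          # not in the lexicon
```

Synonyms resolve to one identifier because several surface forms share a lexicon entry, which is the consistency property the paragraph above describes.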

2019
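- CHIP 2019 Task 1: Clinical Terminology Normalization
  
  The main goal is to semantically normalize real surgical entities mined from Chinese electronic medical records. Given an original surgical term, systems must return the corresponding standard surgical term. All original terms come from real medical data and were annotated against the surgical vocabulary of "ICD9-2017协和临床版" (the ICD9-2017 Peking Union clinical edition). Training set: 4,000 entries. Validation set: 1,000 entries. Test set: 2,000 entries.
  
  Solutions: first place, second place, third place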
- 2019 n2c2 Track 3: n2c2/UMass Track on Clinical Concept Normalization
2017
- BioCreative VI Track 1: Interactive Bio-ID Assignment (IAT-ID)
  
  Bioentity normalization task. For gene/gene products, the identifier types are Entrez and UniProtKB. Small chemicals are identified using ChEBI as the primary identifier. Subcellular structures are identified using the Gene Ontology Cellular Component (GO CC) identifier. Cell lines are identified using Cellosaurus as the primary identifier. Cell types are identified using the Cell Ontology identifier. Tissues and organs are identified using Uberon as the identifier. Finally, organisms are identified using NCBI Taxon.
2010
- BioCreative III: GN: Gene Normalization
  
  Link genes or gene products mentioned in the literature to standard database identifiers. Two characteristics make this challenge distinctive: 1) full-length articles are provided instead of abstracts; 2) no species information is provided, so the task is not species-specific.
2006
- BioCreative II Task 1B: Human Gene Normalizations
  
  Systems will be required to return the EntrezGene (formerly Locus Link) identifiers corresponding to the human genes and direct gene products appearing in a given MEDLINE abstract.
2004
- BioCreative I Task 1B: Gene Normalizations
  
  focused on creating normalized gene lists.

Relation Extraction

Relation Extraction (RE) is a natural language processing (NLP) task that involves identifying and extracting semantic relationships between entities in text. In the biomedical or health domain, RE refers to the process of automatically identifying and extracting the relationships between biomedical entities such as genes, proteins, diseases, drugs, and biological processes mentioned in scientific literature. The goal of RE is to extract useful information from unstructured text, such as research articles, clinical notes, and electronic health records, which can be used for a variety of applications, including drug discovery, personalized medicine, and clinical decision support.

2018
- n2c2 2018 — Track 2: Adverse Drug Events and Medication Extraction in EHRs
  
  This task aims to answer the question: “Can NLP systems automatically discover drug to adverse event (ADE) relations in clinical narratives?” It has three subtasks: 1) Concepts: identifying drug names, dosages, durations, and other entities. 2) Relations: identifying relations of drugs with adverse drug events (ADEs) and other entities, given gold-standard entities. 3) End-to-end: identifying relations of drugs with ADEs and other entities on system-predicted entities.
  
  paper
2017
- BioCreative VI Track 5: Text mining chemical-protein interactions
  
  Automatically detect relations between chemical compounds/drugs and genes/proteins in running text (PubMed abstracts).
2013
- BioNLP-ST-2013: Gene Regulation Network (GRN)
  
  corpus including entities, events and relations, including genic interactions.
2012
- n2c2 2012: Temporal Relations Challenge
  
  The 2012 i2b2 temporal relations challenge data include 310 discharge summaries consisting of 178,000 tokens. Clinically relevant events include clinical concepts, clinical departments, evidentials, and occurrences. Temporal relations: BEFORE, AFTER, SIMULTANEOUS, OVERLAP, BEGUN_BY, ENDED_BY, DURING, BEFORE_OVERLAP.
  
  paper
2011
- BioNLP Shared Task 2011: Entity Relations Supporting Task (REL)
  
  The task concerns the detection of relations stated to hold between a gene or gene product and a related entity such as a protein domain or protein complex.
  
  Entities: human-annotated gene and gene product entities, annotated as "Protein"
  
  Relation Type: Subunit-Complex, Protein-Component
2010
- BioCreative III: PPI: Protein-Protein Interactions
  
  The aim of this task is to promote the development of automated systems that are able to extract biologically relevant information directly from the literature, in this case related to protein-protein interaction (PPI) annotation information.
- n2c2 2010: Relations Challenge
  
  Three tasks: 1) extraction of medical problems, tests, and treatments; 2) classification of assertions made on medical problems as present, absent, or possible; 3) relations of medical problems, tests, and treatments.
  
  A total of 394 training reports, 477 test reports, and 877 unannotated reports were de-identified and released to challenge participants with data use agreements.
  
  paper
2006
- BioCreative II Task 2: Protein-Protein Interactions
  
  focuses on the prediction of protein interactions from full text articles.
2004
- GAD Gene-Disease Associations
  
  Gene-disease associations curated from genetic association studies.
  
  10697 genes, 12774 diseases, 74928 gene-disease associations
  
  paper

Event Extraction

Event Extraction (EE) is a natural language processing (NLP) task that involves identifying and extracting the occurrence of specific events or processes in text. In the biomedical or health domain, EE refers to the process of automatically identifying and extracting events or processes related to biomedical entities such as genes, proteins, diseases, drugs, and biological processes mentioned in scientific literature. The goal of EE is to automatically detect and extract information about specific biomedical events or processes mentioned in text, such as the activation of a gene, the inhibition of a protein, or the progression of a disease. This information can be used for a variety of applications, including drug discovery, personalized medicine, and clinical decision support.

2019
- BioNLP-OST 2019 SeeDev Task
  
  The SeeDev representation scheme defines 16 entity types. Task 1: binary relation extraction. Task 2: full event extraction, in which these entities participate in 21 types of events that can be grouped into five categories.
2016
- BioNLP-ST 2016: Bacteria Biotope-Event extraction of microorganisms and habitats with ontologies and their linking
  
  Entities: Bacteria, Habitat, Geographical. Events: Lives_In.
- BioNLP-ST 2016: Event extraction of genetic and molecular mechanisms involved in plant seed development (SeeDev)
  
  16 different types of entities. 5 sets of event types that may be combined in complex events.
2013
- BioNLP-ST 2013: Cancer Genetics (CG) Task
  
  The CG task aims to advance the automatic extraction of information from statements on the biological processes relating to the development and progression of cancer.
- BioNLP-ST-2013: Pathway Curation (PC) task
  
  The PC task aims to evaluate the applicability of event extraction systems to support the curation, evaluation and maintenance of biomolecular pathway models and to encourage the further development of methods for these tasks.
- BioNLP-ST-2013: Bacteria Biotopes (BB)
  
  Entity recognition of bacteria taxa and bacteria habitats. Bacteria habitat categorization through the OntoBiotope-Habitat ontology. Extraction of localization relations between bacteria and habitats.
2011
- BioNLP Shared Task 2011: GENIA Event Extraction (GENIA)
  
  The GENIA task aims at extracting events occurring upon genes or gene products, which are typed as "Protein" without differentiating genes from gene products. Other types of physical entities, e.g. cells, cell components, are not differentiated from each other, and their type is given as "Entity"
- BioNLP Shared Task 2011: Epigenetics and Post-translational Modifications Task (EPI)
  
  This task focuses on events relating to epigenetic change, including DNA methylation and histone modification, as well as other common post-translational protein modifications.
  
  Event types: Hydroxylation, Phosphorylation, Ubiquitination, DNA methylation, Glycosylation, Acetylation, Methylation, Catalysis.
- BioNLP Shared Task 2011: Infectious Diseases Task (ID)
  
  This task focuses on the biomolecular mechanisms of infectious diseases.
  
  Five entities: Genes and gene products, Two-component systems, Chemicals, Organisms, Regulons/Operons.
  
  Ten event types: Gene expression, Transcription, Protein catabolism, Phosphorylation, Localization, Binding, Regulation, Positive regulation, Negative regulation, Process.
- BioNLP Shared Task 2011: Bacteria Biotopes (BB)
  
  The task consists of extracting bacteria localization events, that is, mentions of a given species and the place where it lives.
  
  Entities: Host, HostPart, Geographical, Environment, Food, Medical, Soil, Water.
  
  Events: Localization, PartOf.
- BioNLP Shared Task 2011: Bacteria Gene Interactions (BI)
  
  This task consists in a full extraction of genetic processes mentioned in scientific texts concerning the bacterium Bacillus subtilis.
  
  Entities: GeneProduct, Protein, PolymeraseComplex, Gene, ProteinFamily, GeneFamily, GeneComplex, Regulon, Site, Promoter, Action, Transcription, Expression.
  
  Events: RegulonDependence, BindTo, TranscriptionFrom, RegulonMember, SiteOf, TranscriptionBy, PromoterOf, PromoterDependence, ActionTarget, Interaction.
- BioNLP Shared Task 2011: Bacteria Gene Renaming (RENAME)
  
  The task consists in extracting gene renaming acts and gene synonymy reminders in scientific texts about bacteria.
  
  Entities: All gene and protein names have been annotated as text-bound entities of type Gene.
  
  Events: The only type of event is Renaming where both arguments are of type Gene.
2004
- BioCreative I Task 2: Functional Annotations
  
  automatic extraction and assignment of Gene Ontology (GO) annotations of human proteins, using full text articles.

Coreference Resolution

Coreference resolution is a natural language processing (NLP) task that involves identifying all the expressions (words, phrases, or pronouns) in a text that refer to the same entity. In the biomedical or health domain, coreference resolution is used to identify all the mentions of medical concepts or entities, such as diseases, treatments, drugs, or anatomical parts, that refer to the same thing.

2019
- BioNLP-OST 2019 CRAFT-CR task: Coreference Resolution Task
2011
- BioNLP Shared Task 2011: Protein/Gene Coreference Task (COREF)
  
  The COREF task addresses the problem of finding anaphoric references to proteins or genes.
- n2c2 2011: Coreference Challenge
  
  paper

Text Classification

Text classification in NLP for the biomedicine or health domain refers to the process of automatically categorizing or labeling text data based on their content with the aim of extracting meaningful information and insights. In this domain, text classification is commonly used to classify various types of medical documents, such as clinical notes, discharge summaries, pathology reports, and research articles, into predefined categories, such as disease diagnosis, treatment options, medication information, and patient outcomes.

2019
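- CHIP 2019 Task 3: Short Text Classification of Clinical Trial Screening Criteria
  
  A clinical trial is scientific research conducted with human volunteers, also called subjects. Screening criteria are the main indicators, defined by the people responsible for the trial, for determining whether a subject qualifies for a given clinical trial. They are divided into inclusion criteria and exclusion criteria and are usually written as unstructured free-text sentences.
  
  The main goal of this task is to classify clinical trial screening criteria. All data come from real clinical trials and were preprocessed and manually annotated. Given 44 predefined screening criterion categories and a set of Chinese sentences describing clinical trial screening criteria, participants must return the category of each criterion.
  
  Training set: 22,962; validation set: 7,682; test set: 7,697.
  
  Solutions: first place, second place, third place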
  
  paper
2008
- n2c2 2008: Obesity Challenge
  
  The obesity challenge is a multi-class, multi-label classification task focused on obesity and its co-morbidities. The data for the challenge consist of discharge summaries from Partners Healthcare. All records have been fully de-identified. Obesity information and co-morbidities have been marked at a document level as present, absent, questionable, or unmentioned in the documents.
  
  paper
2006
- n2c2 2006: Deidentification and Smoking Challenge
  
  The challenge studies two questions on the same data. Task 2: identification of the smoking status of patients. Patient records are classified into five possible smoking status categories: Past Smoker, Current Smoker, Smoker, Non-Smoker, Unknown.
  
  paper

Text Similarity

Text similarity in NLP for the biomedicine or health domain refers to the process of quantifying the degree of semantic or syntactic similarity between two or more pieces of text data. In this domain, text similarity is commonly used to compare different medical documents, such as clinical notes, research articles, and patient records, to identify relevant information, track disease progression, and monitor treatment outcomes. Text similarity in biomedicine or health domain typically involves the use of various NLP techniques such as word embeddings, sentence embeddings, and document embeddings. These techniques transform text data into numerical representations that can be compared using various similarity metrics, such as cosine similarity, Jaccard similarity, and Euclidean distance.
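
The three metrics named above can be sketched in a few lines. As a simplifying assumption, plain bag-of-words counts stand in here for the word, sentence, or document embeddings mentioned in the paragraph; the function names and the two example questions are hypothetical.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Whitespace tokenizer; real systems would use a domain-aware tokenizer."""
    return text.lower().split()


def jaccard(a: List[str], b: List[str]) -> float:
    """Overlap of the two token sets divided by their union."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cosine(a: List[str], b: List[str]) -> float:
    """Cosine of the angle between the two term-count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean(a: List[str], b: List[str]) -> float:
    """Euclidean distance between the two term-count vectors (0 = identical)."""
    ca, cb = Counter(a), Counter(b)
    vocab = set(ca) | set(cb)
    return math.sqrt(sum((ca[t] - cb[t]) ** 2 for t in vocab))


# Hypothetical question pair.
q1 = tokenize("what are the symptoms of diabetes")
q2 = tokenize("what are the early symptoms of diabetes")
print(round(jaccard(q1, q2), 3), round(cosine(q1, q2), 3), euclidean(q1, q2))
```

Jaccard and cosine are similarities (higher means more alike), while Euclidean distance is a dissimilarity, so the scores are not directly interchangeable.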

2019
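- CHIP 2019 Task 2: Ping An Healthcare Technology Disease Question Answering Transfer Learning Competition
  
  The main goal of this task is transfer learning across diseases on Chinese disease question answering data. Specifically, given question pairs from 5 different diseases, systems must decide whether the two sentences have the same or similar meaning. All data come from real patient questions on the internet and were filtered and manually annotated for intent matching. Diseases: diabetes, hypertension, hepatitis, AIDS, breast cancer.
  
  Training set sizes, respectively: 10,000, 2,500, 2,500, 2,500, 2,500. Validation set sizes, respectively: 2,000, 2,000, 2,000, 2,000, 2,000. Test set size: 50,000.
  
  Solutions: first place, second place, third place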
2018
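- CHIP 2018 Task 2: Ping An Healthcare Technology Intelligent Patient Health Consultation Question Matching Competition
  
  The main goal is question intent matching on real Chinese patient health consultation data. Given two sentences, systems must decide whether their intents are the same or similar. All data come from real patient questions on the internet and were filtered and manually annotated for intent matching.
  
  Training set: about 20,000 labeled, de-identified examples. Test set: about 10,000 examples without labels.
  
  Solutions: first place, second place, third place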

Document Retrieval

Document retrieval in NLP for the biomedicine or health domain refers to the process of retrieving relevant medical documents or articles from large collections of text data based on user queries or information needs. In this domain, document retrieval is commonly used to search for specific information related to diseases, treatments, medications, and patient outcomes. Document retrieval in biomedicine or health domain typically involves the use of information retrieval (IR) techniques such as keyword-based search, Boolean search, and vector space model. These techniques index the text data and match the query terms with the most relevant documents based on their similarity scores.
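
The vector space model mentioned above can be sketched as TF-IDF weighting plus cosine ranking. This is a minimal illustration, not a production retriever: the toy corpus, the function names, and the smoothed IDF formula `log((1 + N) / (1 + df)) + 1` are assumptions made for the example.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def build_idf(docs: List[List[str]]) -> Dict[str, float]:
    """Smoothed inverse document frequency (assumed formula for this sketch)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse TF-IDF vector; terms unseen in the corpus are ignored."""
    tf = Counter(tokens)
    return {t: c * idf[t] for t, c in tf.items() if t in idf}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def search(query: str, corpus: List[str], top_k: int = 2) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the best-matching documents."""
    docs = [tokenize(d) for d in corpus]
    idf = build_idf(docs)
    doc_vecs = [tfidf(d, idf) for d in docs]
    q = tfidf(tokenize(query), idf)
    scored = sorted(((cosine(q, v), i) for i, v in enumerate(doc_vecs)), reverse=True)
    return [(i, s) for s, i in scored[:top_k] if s > 0]


# Hypothetical toy corpus.
corpus = [
    "metformin is a first line treatment for type 2 diabetes",
    "hypertension increases the risk of stroke",
    "insulin therapy for type 1 diabetes in children",
]
results = search("diabetes treatment", corpus)
print(results)
```

Documents sharing no term with the query score zero and are filtered out, which is also the main limitation of purely lexical matching: synonyms such as "therapy" and "treatment" do not match each other.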

2019
- BioNLP-OST 2019 RDoC Task
  
  Task 1 (RDoC-IR) is on retrieving PubMed abstracts related to RDoC constructs: 250 abstracts for training and 200 abstracts for testing. Task 2 (RDoC-SE) is on extracting the most relevant sentences for an RDoC construct from a relevant abstract: 250 abstracts for training and 50 abstracts for testing.
2018
- n2c2 2018 — Track 1: Cohort Selection for Clinical Trials
  
  This task aims to answer the question, “Can NLP systems use narrative medical records to identify which patients meet selection criteria for clinical trials?” The task requires NLP systems to compare each patient to a list of selection criteria, and determine if the patients meet, do not meet, or possibly meet each criterion.
  
  paper
2010
- BioCreative III: IAT: Interactive Demonstration Task for Gene Indexing and Retrieval
  
  Focuses on indexing (identifying which genes are being studied in an article and linking these genes to standard database identifiers) and gene-oriented document retrieval (identifying full-text papers relevant to a selected gene).

Question Answering

Question answering in NLP for the biomedicine or health domain refers to the process of automatically answering natural language questions related to medical information or knowledge. In this domain, question answering is commonly used to support clinical decision-making, patient care, and medical research.

2019
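- CCIR 2019 Evaluation: Data Query Question Answering Based on Electronic Medical Records
  
  Given a medical knowledge graph, a medical event graph, and a set of natural language questions, participants return the answers to the questions.
  
  Patient event graph dataset download
  
  Training data: 1,800 natural language questions with SPARQL queries and answers. Validation data: 600 natural language questions with SPARQL queries and answers. Test data: 600 natural language questions.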
- PubMedQA: A Dataset for Biomedical Research Question Answering
  
  paper github website

Knowledge Graph

A knowledge graph in NLP for biomedicine or health domain refers to a structured representation of medical knowledge and information in the form of a graph. It represents entities and their relationships in the medical domain, such as diseases, symptoms, treatments, medications, and clinical trials, as nodes and edges in a graph. The nodes represent the entities, and the edges represent the relationships between them. Knowledge graphs in biomedicine or health domain typically involve the extraction and integration of information from various sources, such as medical literature, clinical trials, electronic health records, and medical ontologies. The information is then organized into a graph structure, which enables efficient navigation, querying, and reasoning over the medical knowledge. Knowledge graphs in biomedicine or health domain can be used for various tasks such as information retrieval, question answering, and decision support. They can also be used to support clinical research, drug discovery, and personalized medicine.
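
The node-and-edge structure described above can be sketched as a set of (head, relation, tail) triples with a small query helper. The `KnowledgeGraph` class, the relation names, and the example triples are hypothetical and purely illustrative; real biomedical knowledge graphs are typically stored in a graph database or an RDF triple store.

```python
from collections import defaultdict
from typing import DefaultDict, List, Optional, Set, Tuple


class KnowledgeGraph:
    """Directed, labeled graph stored as head -> [(relation, tail), ...]."""

    def __init__(self) -> None:
        self._edges: DefaultDict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add(self, head: str, relation: str, tail: str) -> None:
        self._edges[head].append((relation, tail))

    def neighbors(self, head: str, relation: Optional[str] = None) -> List[str]:
        """Tails reachable from head, optionally restricted to one relation."""
        return [tail for rel, tail in self._edges.get(head, [])
                if relation is None or rel == relation]

    def two_hop(self, head: str, first: str, second: str) -> Set[str]:
        """Follow relation `first`, then relation `second` (simple path query)."""
        return {end
                for mid in self.neighbors(head, first)
                for end in self.neighbors(mid, second)}


# Illustrative toy triples, not taken from any dataset listed here.
kg = KnowledgeGraph()
for triple in [
    ("metformin", "treats", "type 2 diabetes"),
    ("insulin", "treats", "type 2 diabetes"),
    ("type 2 diabetes", "has_symptom", "polyuria"),
    ("type 2 diabetes", "has_symptom", "fatigue"),
]:
    kg.add(*triple)

print(kg.neighbors("metformin", "treats"))
print(sorted(kg.two_hop("metformin", "treats", "has_symptom")))
```

The two-hop query is a small instance of the "navigation, querying, and reasoning" the paragraph refers to: it answers which symptoms belong to diseases treated by a given drug without that fact being stored directly.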

2020
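- CCKS 2020: COVID-19 Knowledge Graph Construction and Question Answering
  
  Four subtasks: 1) type inference for the COVID-19 encyclopedia knowledge graph; 2) hypernym-hyponym relation prediction for the COVID-19 concept graph; 3) link prediction for the COVID-19 antiviral drug research graph; 4) question answering over the COVID-19 encyclopedia knowledge graph.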

Pre-trained Language Model

A pre-trained language model for the biomedical or health domain is a model that has been trained on a large corpus of domain text so that it learns the underlying patterns and relationships of language in that domain. Pre-training typically uses unsupervised (self-supervised) objectives on large amounts of unlabeled text to learn the relationships between words and phrases. Once pre-training is complete, the model can be fine-tuned on a specific task, such as text classification or named entity recognition, using a smaller labeled dataset. Pre-trained language models have become popular in NLP because they let researchers and practitioners reach strong results on a wide range of tasks with comparatively little labeled data and task-specific compute. In the biomedical and health domain, they have been used to extract information from medical records, analyze scientific literature, and develop predictive models for disease diagnosis and treatment.

BioBERT: a pre-trained biomedical language representation model for biomedical text mining

paper github
BERTCNER: Chinese clinical named entity recognition (CNER) using pre-trained BERT model

paper github
BlueBERT: pre-trained on PubMed abstracts and clinical notes (MIMIC-III)

paper github
ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission

paper github
LinkBERT: Pretraining Language Models with Document Links

paper github
SciBERT: A Pretrained Language Model for Scientific Text

paper github
PubMedBERT: Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

paper website

Large Language Model

Large language models (LLMs) have the potential to substantially change biomedical and health applications. Trained on very large text corpora that can include health and medical content, they can assist with tasks such as medical diagnosis, drug discovery, clinical decision-making, patient engagement, medical research, education, natural language processing, clinical trials, and public health. They may improve patient outcomes by supporting more accurate diagnoses, more effective treatments, and more personalized care, and more applications are likely to emerge as the field evolves.
