A curated list of popular Datasets, Models and Papers for LLMs in Medical/Healthcare.
Dataset | Description | Link | Size |
---|---|---|---|
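MedDialog | The MedDialog dataset (Chinese) contains Chinese-language conversations between doctors and patients. It has 1.1 million dialogues and 4 million utterances. The data keeps growing as more dialogues are added. The raw dialogues come from haodf.com (好大夫网). | Download link | 3.3GB |
Chinese medical dialogue data (中文医疗对话数据集) | Medical Q&A data from six departments: `Andriatria_男科` (andrology) 94596 Q&A pairs; `IM_内科` (internal medicine) 220606 Q&A pairs; `OAGD_妇产科` (obstetrics and gynecology) 183751 Q&A pairs; `Oncology_肿瘤科` (oncology) 75553 Q&A pairs; `Pediatric_儿科` (pediatrics) 101602 Q&A pairs; `Surgical_外科` (surgery) 115991 Q&A pairs; 792099 Q&A pairs in total. | Download link | 800k entries, 330MB |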
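Huatuo-26M | Huatuo-26M is the largest Chinese medical QA dataset to date. It contains more than 26 million high-quality medical QA pairs covering a wide range of topics, including diseases, symptoms, treatments, and drug information. | Download link | 4.54GB |
huatuo_encyclopedia_qa | This dataset contains 364,420 medical QA entries, some of which include multiple questions phrased in different ways. The authors extracted medical QA pairs from plain text (for example, medical encyclopedias and medical articles). They collected 8,699 disease encyclopedia entries and 2,736 drug encyclopedia entries from Chinese Wikipedia, and also crawled 226,432 high-quality medical articles from the Qianwen Health website. | Download link | 605MB |
Chinese medical dialogue dataset (Huatuo project) | 220k Chinese medical dialogue entries (Huatuo project): FreedomIntelligence/HuatuoGPT-sft-data-v1 | Download link | 333MB |
Medical LLM dataset (pretraining, instruction fine-tuning, and reward datasets) | 2.4 million Chinese medical data entries (covering pretraining, instruction fine-tuning, and reward datasets) | Download link | 2.1GB |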
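Surgical consultation data BillGPT/Chinese-medical-dialogue-data | 60.8K surgical consultation entries. Example: "Patient: What is Xinhuang Pian (新癀片) used for? I would like to know what effect taking it has. Doctor: Analysis: Hello, Xinhuang Pian mainly clears heat and detoxifies, promotes blood circulation to remove stasis, and reduces swelling and pain. It is used for sore throat, toothache, arthralgia, hypochondriac pain, jaundice, and unexplained swelling caused by heat toxin and blood stasis. Advice: If you have symptoms such as a sore throat, it works well, but people with gastritis should avoid it where possible, because it can cause gastrointestinal reactions and contains ingredients that irritate the stomach." | Download link | 936MB |
Chinese medical instruction-tuning dataset (Instruct-tuning) | Built from public and self-built Chinese medical knowledge bases, mainly drawing on cMeKG. The knowledge base is organized around diseases, drugs, and examination indicators, with fields including complications, high-risk factors, histological examination, clinical symptoms, drug treatment, and adjuvant therapy. QA data was generated around the knowledge base through the GPT3.5 API, using multiple prompt forms to make full use of the knowledge. | Download link | 7.6K entries |
MeChat: Chinese mental health support dialogue LLM and dataset | The dataset uses ChatGPT to rewrite real mental-health mutual-support QA into multi-turn mental health support dialogues (single-turn to multi-turn inclusive language expansion via ChatGPT). It contains 56k multi-turn dialogues whose topics, vocabulary, and discourse semantics are richer and more diverse, which better matches long multi-turn dialogue scenarios. | Download link | 56k entries |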
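CMB-Chinese Medical Benchmark | CMB is a comprehensive, multi-level evaluation platform for Chinese medical models. It contains 280,839 multiple-choice questions and 74 complex case consultation questions, covering all clinical medical professions and exams at different career levels, and it assesses both the medical knowledge and the clinical consultation ability of a model. | Download link | 30MB |
ChatMed_Consult_Dataset | The queries (or prompts) in ChatMed_Consult_Dataset come from 110,113 medical consultation questions found on the internet, reflecting the real-world consultation needs of different users and patients. All responses are currently generated by the OpenAI GPT-3.5 engine. The authors plan to screen the doctor and patient answers available online and select the best ones to build a higher-quality dataset. | Download link | 395MB |
Traditional Chinese Medicine instruction dataset ChatMed_TCM_Dataset | Based on the open-source [TCM knowledge graph](https://github.com/ywjawmw/TCM_KG), it uses an entity-centric self-instruct method and calls ChatGPT to obtain 110k+ instruction entries about traditional Chinese medicine. | Download link | 110MB |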
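cMedQA Chinese community medical QA dataset | A dataset of medical Q&A from Chinese online communities. This is version 1.0; the provider will update and expand the database from time to time. To protect privacy, the data is anonymized and contains no personal information. | Download link | 80MB |
WebMedQA (online medical QA) | WebMedQA is a real-world medical QA dataset collected from online health consultation websites such as Baidu Doctor and 120Ask. Users first fill in a personal information form and then describe their illness and health questions. The questions are open to all registered clinicians and users until the asker selects the most satisfactory answer and closes the question. Doctors and enthusiastic users can offer diagnoses and advice under a posted question, and their titles and specialties are displayed alongside their answers. Askers can also follow up if they are interested in one of the answers. The category of each question is also chosen by its asker. | Download link | 75MB |
ChineseBLUE benchmark | The ChineseBLUE benchmark consists of different biomedical text-mining tasks with corpora. The tasks cover a variety of text types (biomedical web data and clinical notes), dataset sizes, and difficulty levels, and, more importantly, highlight common biomedical text-mining challenges. | Download link | 400MB |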
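Yidu-S4K | Named entity recognition; entity and attribute extraction | Download link | 4K entries |
Yidu-N7K | Clinical term normalization | Download link | 7K entries |
HealthCareMagic-100k | 100k real conversations between patients and doctors from HealthCareMagic.com | Download link | 137MB |
icliniq-10k | 10k real conversations between patients and doctors from icliniq.com | Download link | 20MB |
GenMedGPT-5k | 5k patient-doctor conversations generated with ChatGPT and a disease database. | Download link | 5K entries |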
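
Several entries in the table describe building instruction data by wrapping knowledge-base fields in multiple prompt forms before sending them to a model. The sketch below illustrates only that prompt-construction step. The record layout, the template wording, and the `build_prompts` helper are all hypothetical, the values are placeholders rather than real medical facts, and the model call itself is left out, so it should be read as an illustration and not as the method used by any listed dataset.

```python
# Hypothetical knowledge-base record. Field names follow the table's
# description; values are placeholders, not real medical facts.
RECORD = {
    "entity": "DiseaseX",
    "complications": ["ComplicationA", "ComplicationB"],
    "clinical_symptoms": ["SymptomA"],
    "drug_treatment": ["DrugA", "DrugB"],
}

# Several phrasings per field, so one fact yields several differently
# worded prompts. The wording here is an assumption for illustration.
TEMPLATES = {
    "complications": [
        "What complications can {entity} cause?",
        "List the known complications of {entity}.",
    ],
    "clinical_symptoms": [
        "What are the clinical symptoms of {entity}?",
        "How does {entity} usually present?",
    ],
    "drug_treatment": [
        "Which drugs are used to treat {entity}?",
        "Describe the drug treatment options for {entity}.",
    ],
}


def build_prompts(record, templates):
    """Return (prompt, reference_facts) pairs for each field/template combination."""
    pairs = []
    for field, phrasings in templates.items():
        facts = record.get(field)
        if not facts:
            continue  # skip fields this record does not cover
        for phrasing in phrasings:
            pairs.append((phrasing.format(entity=record["entity"]), facts))
    return pairs


if __name__ == "__main__":
    for prompt, facts in build_prompts(RECORD, TEMPLATES):
        print(prompt, "->", ", ".join(facts))
```

Each pair keeps the source facts next to the prompt, so a generated answer can later be checked against the knowledge base that produced the question.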
- Med-PaLM 2: Towards Expert-Level Medical Question Answering with Large Language Models [Paper]
- KeBioLM: Improving Biomedical Pretrained Language Models with Knowledge [Paper]
- BioELMo: Probing Biomedical Embeddings from Language Models [Paper]
- BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model [Paper]
- ClinicalT5: A Generative Language Model for Clinical Text [Paper]
- GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records [Paper]
- ChatCAD: Interactive Computer-Aided Diagnosis on Medical Image using Large Language Models [Paper] [Code]
- DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4 [Paper]
- Capabilities of GPT-4 on Medical Challenge Problems [Paper]
- BioBERT: a pre-trained biomedical language representation model for biomedical text mining [Paper]
- Publicly Available Clinical BERT Embeddings [Paper]
- BioMegatron: Larger Biomedical Domain Language Model [Paper]
- Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks [Paper]
- Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction [Paper]
- DoctorGLM: Fine-tuning your Chinese Doctor is not a Herculean Task [Paper] [Code]
- HuatuoGPT, Towards Taming Language Models To Be a Doctor [Paper] [Code]
- BioELECTRA: Pretrained Biomedical text Encoder using Discriminators [Paper]
- LinkBERT: Pretraining Language Models with Document Links [Paper]
- BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining [Paper]
- Large Language Models Encode Clinical Knowledge [Paper]
- A large language model for electronic health records [Paper]
- Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing [Paper]
- BEHRT: Transformer for Electronic Health Records [Paper]
- Federated Learning of Medical Concepts Embedding using BEHRT [Paper] [Code]
- RadBERT: Adapting Transformer-based Language Models to Radiology [Paper] [HuggingFace]
- Highly accurate protein structure prediction with AlphaFold [Paper] [Code]
- Accurate prediction of protein structures and interactions using a three-track neural network [Paper]
- Protein complex prediction with AlphaFold-Multimer [Paper]
- FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours [Paper] [Code]
- HelixFold: An Efficient Implementation of AlphaFold2 using PaddlePaddle [Paper] [Code]
- Uni-Fold: An Open-Source Platform for Developing Protein Folding Models beyond AlphaFold [Paper] [Code]
- OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization [Paper] [Code]
- ManyFold: an efficient and flexible library for training and validating protein folding models [Paper] [Code]
- ColabFold: making protein folding accessible to all [Paper] [Code]
- Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences [Paper] [Code]
- ProGen: Language Modeling for Protein Generation [Paper] [Code]
- ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing [Paper] [Code]
- Evolutionary-scale prediction of atomic level protein structure with a language model [Paper]
- High-resolution de novo structure prediction from primary sequence [Paper] [Code]
- Single-sequence protein structure prediction using a language model and deep learning [Paper]
- Improved the Protein Complex Prediction with Protein Language Models [Paper]
- MSA Transformer [Paper] [Code]
- Deciphering antibody affinity maturation with language models and weakly supervised learning [Paper]
- xTrimoABFold: De novo Antibody Structure Prediction without MSA [Paper]
- scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data [Paper] [Code]
- Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions [Paper] [Code]
- E2Efold-3D: End-to-End Deep Learning Method for accurate de novo RNA 3D Structure Prediction [Paper] [Code]
This project is licensed under the MIT License.
This project is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
If this project is helpful to you, please cite it.

```
@misc{medllmdata2023,
  author = {Jun Wang and Changyu Hou and Xiaorui Wang and Guotong Xie},
  title = {Awesome Dataset for Medical LLM: A curated list of popular Datasets, Models and Papers for LLMs in Medical/Healthcare},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/onejune2018/Awesome-Medical-Healthcare-Dataset-For-LLM}},
}
```