FreedomIntelligence / Huatuo-26M

The Largest-scale Chinese Medical QA Dataset: with 26,000,000 question answer pairs.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Huatuo-26M

📃 Paper • 🤗 Huatuo-Lite • 🤗 huatuo_encyclopedia_qa • 🤗 knowledge_graph_qa • 🤗 huatuo_consultation_qa
中文 | English

👩🏻‍⚕Introduction

  • Huatuo-26M is currently the largest Chinese medical question-and-answer dataset. This dataset contains over 26 million high-quality medical Q&A pairs, covering various aspects such as diseases, symptoms, treatment methods, and drug information.
  • Huatuo-Lite is a refined and optimized dataset based on Huatuo-26M, having undergone multiple purifications and rewrites. It features more data dimensions and higher data quality.

📚Data Content

The Huatuo-26M dataset is collected and integrated from multiple sources, including:

Each question-answer pair in the dataset contains the following fields:

  • questions:Problem Description
  • answers:Doctor/Expert Answers
  • Huatuo-Lite dataset also includes Hospital Department and Related Diseases fields

The following is the huatuo test set we used in the paper, which consists of random sampling of data from multiple sources.

🤖Data Usage

The Huatuo-26M dataset can be used for a variety of AI research and applications in the medical field, such as:

  • Natural Language Processing: Including but not limited to Q&A systems, text classification, sentiment analysis, etc.
  • Machine Learning model training: Such as disease prediction, personalized treatment recommendation, etc.
  • AI applications in the medical field: Such as intelligent diagnosis systems, medical consultation chatbots, etc.

🚀Quick Start

To start using the Huatuo-26M dataset, you can follow the steps below:

import datasets
# part 1
knowledge_graph_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_knowledge_graph_qa')
# part 2
encyclopedia_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_encyclopedia_qa')
# part 3 (only url)
consultation_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_consultation_qa')

# testdatasets (6k)
huatuo_testdatasets = datasets.load_dataset('FreedomIntelligence/huatuo26M-testdatasets')

👩🏻‍🔬Experiment Record

Benchmark

  • Retrieval Evaluation:

    Click to expand retrieve
  • Answer Generation Evaluation:

    Click to expand retrieve

Application

  • Zero-shot transfer to other QA datasets:

    Click to expand retrieve
  • As external knowledge for RAG:

    Click to expand retrieve
  • As pre-training data for language model (LM):

    Click to expand retrieve
  • As fine-tuning data for Medical LLM:

    Click to expand

🚁License

The Huatuo-26M dataset is licensed under Apache 2.0. Please make sure you have read and agreed to the license terms before using it.

📱Contact Us

If you have any questions or need help, please feel free to ask us via email (xidongw@163.com)or in the Issues section.


😁Citation

@misc{li2023huatuo26m,
      title={Huatuo-26M, a Large-scale Chinese Medical QA Dataset}, 
      author={Jianquan Li and Xidong Wang and Xiangbo Wu and Zhiyi Zhang and Xiaolong Xu and Jie Fu and Prayag Tiwari and Xiang Wan and Benyou Wang},
      year={2023},
      eprint={2305.01526},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

About

The Largest-scale Chinese Medical QA Dataset: with 26,000,000 question answer pairs.