chris-lovejoy / medical-datasets-for-education

High-Quality, Open-Source Medical Datasets for Education (Crowd-Sourced)

This is a curated list of medical datasets best suited to learning and teaching. It is not intended to be a comprehensive list of all medical datasets. Rather, the intention is to include data that has been vetted for both quality and ease of access (i.e. open-source) and is therefore well suited to educational purposes. The rationale behind this repository is explained in more detail here.

Principles of this repository

  • The hope is for this to become a crowd-sourced resource, so contributions are warmly invited. Please open an issue, submit a pull request, or send an email to hi at chrislovejoy.me.
  • Links to high-quality tutorials utilising the dataset will be included along with the dataset description, where such tutorials and walk-throughs are available.
  • This list exists in collaboration with other lists of medical datasets, which have different approaches and focuses.

(1) Medical Images

X-Rays

CheXpert: 224,316 chest radiographs from 65,240 patients. Each report was labeled for the presence of 14 observations as positive, negative, or uncertain.

CT

The National CT Colonography Trial: 825 cases of CT colonography imaging with accompanying spreadsheets that provide polyp descriptions and their location within the colon segments.

MRI

fastMRI: Several thousand knee MRIs. Requires application for access (online form).

Histology

Automatic Non-rigid Histological Image Registration (ANHIR) challenge dataset: 50+ histological sets of whole slide images.

Other (Ultrasound, Retinal)

EchoNet-Dynamic: 10,030 echocardiogram videos.

(2) Natural Language Data

Clinical

MIMIC-III: Anonymized critical care EHR database on 38,597 patients and 53,423 ICU admissions. Requires registration.

Biomedical Research

S2ORC: The Semantic Scholar Open Research Corpus: 81.1M English-language academic papers spanning many academic disciplines.

(3) Other Modalities and Multi-Modal

Bioinformatics / Biospecimen

The Cancer Genome Atlas Program: over 2.5 petabytes of genomic, epigenomic, transcriptomic, and proteomic data from over 20,000 primary cancer and matched normal samples spanning 33 cancer types.

Time series

Parkinson Speech Dataset: 26 types of sound recordings taken from 20 Parkinson's patients and 20 healthy subjects.

Subsections to potentially add:

  • EHR-derived data
  • public health / population health data

Other Lists of Medical Datasets
