CAVED123 / TDC-DATASET

This repository hosts Therapeutics Data Commons (TDC), an open, user-friendly, and extensive dataset hub for medicinal machine learning tasks. So far, it includes 100+ datasets for 20+ tasks (ranging from target identification, virtual screening, and QSAR to patient recruitment and safety surveillance) across most drug development stages (from discovery and development to clinical trials and post-market monitoring).

Features

  • Extensive: covers 100+ datasets for 20+ tasks in most of the drug development stages.
  • Ready-to-use: the output can be fed directly into prediction libraries such as scikit-learn and DeepPurpose.
  • User-friendly: loading a dataset takes 3 lines of code, and helper functions cover conversion to DGL/PyG graphs for interaction data, cold/scaffold splits, label distribution visualization, binarization, log conversion, and more.
  • Benchmark: provides a benchmark mode for fair comparison. We also provide a leaderboard!
  • Easy-to-contribute: provides a simple way to contribute a new dataset (just write a loading function; see the CONTRIBUTE page).

Example

GIF placeholder
![](fig/example.gif)
from tdc.property_pred import ADME
data = ADME(name = 'LogD74')
# scaffold split using benchmark seed
split = data.get_split(method = 'scaffold', seed = 'benchmark')
# visualize label distribution
data.label_distribution()
# binarize 
data.binarize()
# convert to log
data.convert_to_log()
# get data in the various formats
data.get_data(format = 'DeepPurpose')
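
The example above does not spell out what binarization and log conversion do to the labels. The following is a minimal standalone sketch of one plausible reading, not TDC's own implementation: the function names, the threshold, the "greater than threshold maps to 1" direction, and the nanomolar-to-negative-log-molar conversion are all assumptions made for illustration.

```python
import math


def binarize(labels, threshold):
    """Map each continuous label to 1 if it exceeds the threshold, else 0."""
    return [1 if y > threshold else 0 for y in labels]


def to_neg_log10_molar(values_nm):
    """Convert nanomolar values to -log10(molar); 1000 nM gives about 6."""
    return [-math.log10(v * 1e-9) for v in values_nm]


# Hypothetical binding affinities in nM (not taken from any TDC dataset).
affinities_nm = [1.0, 50.0, 1000.0, 25000.0]

# Assumed threshold of 100 nM, chosen only for this illustration.
binary_labels = binarize(affinities_nm, threshold=100.0)
log_labels = to_neg_log10_molar(affinities_nm)
```

Binarization turns a regression label into a classification label, while a log transform compresses values that span several orders of magnitude. The two are usually alternatives rather than steps applied one after the other.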

Installation

pip install tdc

Cite

arxiv placeholder

Core Data Overview

TDC is organized into task formulations, each associated with many datasets. For example, ADMET is a task formulation with its own set of datasets. To load dataset Y from task formulation X, simply call X(name = Y).

Property Prediction

Interaction Prediction

Generation

  • Paired Molecule Generation: MolGenPaired

    Dataset name: loading call, stats (#pairs/#drugs)
    DRD2: MolGenPaired(name = 'DRD2'), 34,404/21,703
    QED: MolGenPaired(name = 'QED'), 88,306/52,262
    logP: MolGenPaired(name = 'LogP'), 99,909/99,794
    JNK3
    GSK-3beta
  • Retrosynthesis: RETRO

    Dataset name (stats: #drugs)
    USPTO-50K
  • Forward Synthesis: FORWARD

    Dataset name (stats: #drugs)
    USPTO-50K
  • Reaction Prediction: REACT

    Dataset name (stats: #drugs)
    USPTO-50K

Data Split

To retrieve the dataset split, simply call:

data = X(name = Y)
data.get_split(seed = 'benchmark')
# {'train': df_train, 'val': df_val, 'test': df_test}

You can specify the splitting method, random seed, and split fractions, e.g. data.get_split(method = 'cold_drug', seed = 1, frac = [0.7, 0.1, 0.2]). For drug property prediction, a scaffold split is also available: set method = 'scaffold'.
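
To make the difference between a random split and a cold split concrete, here is a minimal standalone sketch using only the Python standard library. It is not TDC's implementation: the function names, the row format, and the toy data are hypothetical, and it assumes that a cold split means every row for a given drug lands in exactly one fold.

```python
import random


def random_split(items, frac=(0.7, 0.1, 0.2), seed=1):
    """Shuffle items with a fixed seed and cut them into train/val/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * frac[0])
    n_val = int(len(shuffled) * frac[1])
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }


def cold_split(rows, key, frac=(0.7, 0.1, 0.2), seed=1):
    """Split on unique entities so no entity appears in more than one fold."""
    entities = sorted({row[key] for row in rows})
    entity_folds = random_split(entities, frac=frac, seed=seed)
    fold_of = {e: fold for fold, members in entity_folds.items() for e in members}
    folds = {"train": [], "val": [], "test": []}
    for row in rows:
        folds[fold_of[row[key]]].append(row)
    return folds


# Toy interaction rows: 10 hypothetical drugs, each paired with 3 targets.
rows = [{"Drug": f"D{i}", "Target": f"T{j}", "Y": (i + j) % 2}
        for i in range(10) for j in range(3)]

split = cold_split(rows, key="Drug", frac=(0.7, 0.1, 0.2), seed=1)
train_drugs = {row["Drug"] for row in split["train"]}
test_drugs = {row["Drug"] for row in split["test"]}
```

Because the fractions apply to unique drugs rather than rows, the test fold contains only drugs the model never saw during training. The same grouping idea would carry over to a scaffold split if each row carried a precomputed scaffold identifier to use as the key.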

Benchmark and Leaderboard

We are actively working on a more systematic way to benchmark methods and maintain a leaderboard. We plan to release this feature in the next version. In the meantime, if you have expertise or interest in helping build it, please email kexinhuang@hsph.harvard.edu.

Examples: How to Make Predictions

TDC is designed for rapid experimentation. The data output can be used directly with powerful prediction packages. Here, we show how to use DeepPurpose for more advanced drug/protein encoders such as MPNN and Transformers.

Using DeepPurpose

CLICK HERE FOR THE CODE!

Contribute

TDC is designed to be a community-driven effort. We know DrugDataLoader covers only the tip of the iceberg of the data out there. You can easily upload your data by writing a function that takes the expected input and returns the expected output. See the step-by-step instructions on the CONTRIBUTE page.

Contact

Email kexinhuang@hsph.harvard.edu or open an issue.

Disclaimer

TDC is an open-source effort. Many datasets are aggregated from various public websites. We use the Attribution-NonCommercial-ShareAlike 4.0 International license to satisfy the requirements of many of these datasets. If this still infringes the copyright of a dataset author, please let us know and we will take the dataset down as soon as possible.
