
HINT: Hierarchical Interaction Network for Predicting Clinical Trial Approval Probability

Table Of Contents

  • Installation
    • Setup conda environment
    • Activate conda
  • Raw Data
    • ClinicalTrials.gov
    • DrugBank
    • MoleculeNet
  • Data Preprocessing
    • Collect all the records
    • Disease to ICD-10 code
    • Drug to SMILES
    • ICD-10 code hierarchy
    • Sentence Embedding for trial protocol
    • Selection of clinical trial
    • Data split
    • Generated Dataset and Statistics
  • Learn and Inference
    • Phase I/II/III prediction
    • Indication prediction
  • Contact


Setup conda environment

conda env create -f conda.yml

Alternatively, build the conda environment step by step.

conda create -n predict_drug_clinical_trial python==3.7 
conda activate predict_drug_clinical_trial 

Then install the required packages with conda or pip. This may take a long time.

conda install -c rdkit rdkit  
pip install tqdm scikit-learn 
pip install torch
pip install seaborn 
pip install scipy

Activate conda environment

conda activate predict_drug_clinical_trial

Raw Data


ClinicalTrials.gov

  • description

    • We download all clinical trial records from ClinicalTrials.gov. At the time of download it contained 348,891 records; the number grows over time as new trials are registered. Each record describes a trial in detail, including its NCT ID (the identifier of each clinical study), disease names, drugs, brief title and summary, phase, eligibility criteria, and statistical analysis results.
  • output

    • ./raw_data: store all the xml files for all the trials (identified by NCT ID).
    • TrialTrove: ./trialtrove/trial_outcomes_v1.csv
mkdir -p raw_data
cd raw_data
wget https://clinicaltrials.gov/AllPublicXML.zip

Then unzip the ZIP file. The unzipped files occupy over 8.6 GB, so make sure you have enough disk space.

unzip AllPublicXML.zip
cd ../


DrugBank

  • description

    • We use DrugBank to obtain the molecular structure of each drug as a SMILES (simplified molecular-input line-entry system) string.
  • input

    • None
  • output

    • data/drugbank_drugs_info.csv


ClinicalTable is a public API that converts disease names (natural language) into ICD-10 codes.


MoleculeNet

  • description

    • We use five datasets covering the main categories of drug pharmacokinetics (PK) and toxicity. For absorption, we use the bioavailability dataset. For distribution, we use the blood-brain-barrier experimental results. For metabolism, we use the CYP2C19 experiment, which is hosted on the PubChem BioAssay portal under AID 1851. For excretion, we use the clearance dataset from the eDrug3D database. For toxicity, we use the ToxCast dataset provided by MoleculeNet. A drug is labeled not toxic only if it is not toxic in every toxicology assay; otherwise it is labeled toxic.
  • input

    • None
  • output

    • data/ADMET
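
The toxicity rule described above can be sketched as follows. This is a minimal illustration, not the repository's code: the function names and the input layout (one list of binary assay outcomes per drug, where 1 means toxic in that assay) are assumptions.

```python
from typing import Dict, List


def toxicity_label(assay_outcomes: List[int]) -> int:
    """Return 0 (not toxic) only if every assay is negative, otherwise 1 (toxic)."""
    return int(any(outcome == 1 for outcome in assay_outcomes))


def label_drugs(assays_by_drug: Dict[str, List[int]]) -> Dict[str, int]:
    """Collapse per-assay outcomes into one binary toxicity label per drug."""
    return {drug: toxicity_label(outcomes) for drug, outcomes in assays_by_drug.items()}


if __name__ == "__main__":
    # Hypothetical assay outcomes, for illustration only.
    example = {"drug_a": [0, 0, 0], "drug_b": [0, 1, 0]}
    print(label_drugs(example))
```

A single positive assay is enough to mark a drug as toxic, so the label is conservative with respect to toxicity.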

Data Preprocessing

Collect all the records

  • description

    • Collect all the records downloaded from ClinicalTrials.gov. The current version has 348,891 trial IDs.
  • input

    • raw_data/: raw data, store all the xml files for all the trials (identified by NCT ID).
  • output

    • data/all_xml: a sorted list of the paths to the XML files of all trials (one file per NCT ID).
find raw_data/ -name "NCT*.xml" | sort > data/all_xml
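
To make the later steps concrete, here is a hedged sketch of reading one trial record with Python's standard library. The tag names (id_info/nct_id, study_type, phase, condition, intervention/intervention_type, intervention/intervention_name) reflect our understanding of the ClinicalTrials.gov XML layout and should be checked against the downloaded files. The function name parse_trial and the sample record are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def parse_trial(xml_text: str) -> Dict[str, object]:
    """Extract a few fields from one trial record (tag names are assumptions)."""
    root = ET.fromstring(xml_text)
    # Keep only interventions whose type is "Drug".
    drugs: List[str] = [
        node.findtext("intervention_name", default="")
        for node in root.findall("intervention")
        if node.findtext("intervention_type", default="").lower() == "drug"
    ]
    return {
        "nctid": root.findtext("id_info/nct_id", default=""),
        "study_type": root.findtext("study_type", default=""),
        "phase": root.findtext("phase", default=""),
        "diseases": [c.text or "" for c in root.findall("condition")],
        "drugs": drugs,
    }


if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    sample = (
        "<clinical_study><id_info><nct_id>NCT00000000</nct_id></id_info>"
        "<study_type>Interventional</study_type><phase>Phase 2</phase>"
        "<condition>example disease</condition>"
        "<intervention><intervention_type>Drug</intervention_type>"
        "<intervention_name>example drug</intervention_name></intervention>"
        "</clinical_study>"
    )
    print(parse_trial(sample))
```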

Disease to ICD-10 code

  • description

    • The diseases in ClinicalTrials.gov are described in natural language.

    • ICD-10 is the 10th revision of the International Statistical Classification of Diseases and Related Health Problems (ICD), a medical classification list maintained by the World Health Organization (WHO). Mapping diseases to ICD-10 codes lets the model leverage the hierarchical information inherent in medical ontologies.

    • We use ClinicalTable, a public API, to convert each disease name (natural language) into ICD-10 codes.

  • input

    • raw_data/
    • data/all_xml
  • output

    • data/diseases.csv
python src/collect_disease_from_raw.py
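
The lookup itself might look like the sketch below. The endpoint URL, the query parameters (sf, terms), and the response layout are assumptions based on the public ClinicalTables ICD-10-CM search service, and the function names are hypothetical; verify them against the service documentation and src/collect_disease_from_raw.py before relying on them.

```python
import json
import urllib.parse
import urllib.request
from typing import List

# Assumed endpoint of the ClinicalTables ICD-10-CM search API; verify before use.
BASE_URL = "https://clinicaltables.nlm.nih.gov/api/icd10cm/v3/search"


def build_query_url(disease_name: str) -> str:
    """Build the search URL for one disease name (parameters are assumptions)."""
    params = urllib.parse.urlencode({"sf": "code,name", "terms": disease_name})
    return f"{BASE_URL}?{params}"


def extract_codes(response: list) -> List[str]:
    """Assumes the response is [total, [codes...], extra, [[code, name], ...]]."""
    return list(response[1]) if len(response) > 1 and response[1] else []


def disease_to_icd10(disease_name: str, timeout: float = 10.0) -> List[str]:
    """Query the API and return the matching ICD-10 codes (requires network access)."""
    with urllib.request.urlopen(build_query_url(disease_name), timeout=timeout) as resp:
        return extract_codes(json.load(resp))
```

Separating URL construction and response parsing from the network call keeps the two pure functions easy to test offline.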

Drug to SMILES

  • description

    • SMILES (simplified molecular-input line-entry system) is a string representation of a molecule's structure.

    • The drugs in ClinicalTrials.gov are described in natural language.

    • DrugBank contains rich information about drugs.

    • We use DrugBank to get the molecule structures in terms of SMILES.

  • input

    • data/drugbank_drugs_info.csv
  • output

    • data/drug2smiles.pkl
python src/drug2smiles.py 
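
A minimal sketch of this mapping step, using only the standard library. The column names name and smiles are hypothetical placeholders for whatever data/drugbank_drugs_info.csv actually contains, and lower-casing drug names for matching is our assumption rather than documented behavior.

```python
import csv
import io
import pickle
from typing import Dict, TextIO


def build_drug2smiles(
    csv_file: TextIO, name_col: str = "name", smiles_col: str = "smiles"
) -> Dict[str, str]:
    """Map lower-cased drug names to SMILES strings (column names are hypothetical)."""
    mapping: Dict[str, str] = {}
    for row in csv.DictReader(csv_file):
        name = (row.get(name_col) or "").strip().lower()
        smiles = (row.get(smiles_col) or "").strip()
        if name and smiles:  # skip drugs without a known structure
            mapping[name] = smiles
    return mapping


if __name__ == "__main__":
    # In-memory stand-in for the DrugBank CSV, for illustration only.
    demo = io.StringIO("name,smiles\nAspirin,CC(=O)OC1=CC=CC=C1C(=O)O\nUnknownDrug,\n")
    drug2smiles = build_drug2smiles(demo)
    print(drug2smiles)
    payload = pickle.dumps(drug2smiles)  # the pipeline stores the mapping as a pickle
```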

Selection of clinical trial

We design the following inclusion/exclusion criteria to select eligible clinical trials for learning.

  • inclusion criteria

    • study-type is interventional
    • intervention-type is drug
    • p-value in primary-outcome is available
    • disease codes are available
    • drug molecules are available
    • eligibility criteria are available
  • exclusion criteria

    • study-type is observational
    • intervention-type is surgery, biological, device
    • p-value in primary-outcome is not available
    • disease codes are not available
    • drug molecules are not available
    • eligibility criteria are not available
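
The criteria above amount to a simple conjunction of checks. The sketch below illustrates that logic under assumed field names (study_type, intervention_type, p_value, icdcodes, smiless, criteria); it is not the code in src/collect_raw_data.py.

```python
from typing import Dict


def is_eligible(trial: Dict[str, object]) -> bool:
    """Return True only if the trial meets every inclusion criterion."""
    return (
        str(trial.get("study_type", "")).lower() == "interventional"
        and str(trial.get("intervention_type", "")).lower() == "drug"
        and trial.get("p_value") is not None            # primary-outcome p-value present
        and bool(trial.get("icdcodes"))                 # at least one disease code
        and bool(trial.get("smiless"))                  # at least one drug molecule
        and bool(str(trial.get("criteria") or "").strip())  # non-empty eligibility text
    )


if __name__ == "__main__":
    # Hypothetical trial record, for illustration only.
    trial = {
        "study_type": "Interventional",
        "intervention_type": "Drug",
        "p_value": 0.03,
        "icdcodes": ["C34.1"],
        "smiless": ["CCO"],
        "criteria": "Inclusion: adults aged 18 or older",
    }
    print(is_eligible(trial))
```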

The resulting CSV file (data/raw_data.csv) contains the following features:

  • nctid: NCT ID, e.g., NCT00000378, NCT04439305.
  • status: one of completed; terminated; active, not recruiting; withdrawn; unknown status; suspended; recruiting.
  • why_stop: empty for completed trials. Otherwise, common reasons include slow/low/poor accrual and lack of efficacy.
  • label: 0 (failure) or 1 (approved).
  • phase: I, II, III or IV.
  • diseases: list of diseases.
  • icdcodes: list of ICD-10 codes.
  • drugs: list of drug names.
  • smiless: list of SMILES strings.
  • criteria: eligibility criteria.
  • input

    • data/diseases.csv
    • data/drug2smiles.pkl
    • data/all_xml
    • trialtrove/*
  • output

    • data/raw_data.csv
python src/collect_raw_data.py | tee data_process.log 


Data Split

  • description (Split criteria)

    • phase I: phase I trials, augmented with phase IV trials as positive samples.
    • phase II: phase II trials, augmented with phase IV trials as positive samples.
    • phase III: phase III trials, augmented with failed phase I and II trials as negative samples and successful phase IV trials as positive samples.
    • indication: trials that fail in phase I, II, or III are negative samples. Trials that pass phase III or enter phase IV are positive samples.
  • input

    • data/raw_data.csv
  • output:

    • data/phase_I_{train/valid/test}.csv
    • data/phase_II_{train/valid/test}.csv
    • data/phase_III_{train/valid/test}.csv
    • data/indication_{train/valid/test}.csv
python src/data_split.py 

ICD-10 code hierarchy

  • description

    • Get all the ancestor codes of each ICD-10 code.
  • input

    • data/raw_data.csv
  • output:

    • data/icdcode2ancestor_dict.pkl
python src/icdcode_encode.py 
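
One simple way to approximate the ancestor lookup is to treat every shorter prefix of a code, down to its three-character category, as an ancestor. The sketch below uses that prefix rule as an assumption; the repository's src/icdcode_encode.py may derive ancestors differently, and the function names are hypothetical.

```python
from typing import Dict, List


def icd10_ancestors(code: str) -> List[str]:
    """List ancestors of an ICD-10 code by prefix truncation, nearest first."""
    compact = code.replace(".", "").upper()
    ancestors: List[str] = []
    # Shorten the code one character at a time, stopping at the 3-character category.
    for length in range(len(compact) - 1, 2, -1):
        prefix = compact[:length]
        ancestors.append(prefix if length == 3 else f"{prefix[:3]}.{prefix[3:]}")
    return ancestors


def build_ancestor_dict(codes: List[str]) -> Dict[str, List[str]]:
    """Map each code to its ancestor list."""
    return {code: icd10_ancestors(code) for code in codes}


if __name__ == "__main__":
    print(build_ancestor_dict(["C34.10", "C34"]))
```

A three-character category has no ancestors under this rule, so chapter-level grouping would need a separate lookup table.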

Sentence Embedding

  • description

    • We use BERT to compute an embedding for each sentence in the trial protocol.
  • input

    • data/raw_data.csv
  • output:

    • data/sentence2embedding.pkl
python src/protocol_encode.py 

Data Statistics

Dataset      # Train   # Valid   # Test   # Total   Split Date
Phase I      1028      146       295      1469      08/13/2014
Phase II     2667      381       762      3810      03/20/2014
Phase III    4286      612       1225     6123      04/07/2014
Indication   3767      538       1077     5382      05/21/2014

We use a temporal split: earlier trials (before the split date) are used for training and validation, and later trials (after the split date) are used for testing. The train:valid:test ratio is 7:1:2.
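
The temporal split can be sketched as below. This is an illustration under stated assumptions, not the logic of src/data_split.py: the record layout (NCT ID plus one trial date), the choice to divide the earlier trials chronologically, and the 7/8 train fraction implied by the 7:1 train:valid ratio are all ours.

```python
from datetime import date, timedelta
from typing import List, Tuple

Trial = Tuple[str, date]  # (NCT ID, trial date); hypothetical record layout


def temporal_split(
    trials: List[Trial], split_date: date, train_fraction: float = 7 / 8
) -> Tuple[List[Trial], List[Trial], List[Trial]]:
    """Trials before split_date go to train/valid; the rest go to test."""
    earlier = sorted((t for t in trials if t[1] < split_date), key=lambda t: t[1])
    later = sorted((t for t in trials if t[1] >= split_date), key=lambda t: t[1])
    n_train = int(len(earlier) * train_fraction)
    return earlier[:n_train], earlier[n_train:], later


if __name__ == "__main__":
    # Ten hypothetical trials on consecutive days, for illustration only.
    start = date(2014, 1, 1)
    trials = [(f"trial_{i}", start + timedelta(days=i)) for i in range(10)]
    train, valid, test = temporal_split(trials, split_date=date(2014, 1, 9))
    print(len(train), len(valid), len(test))
```

Because the test set contains only trials after the split date, the evaluation never uses information from the future of the training data.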

Learn and Inference

After processing the data, we train the Hierarchical Interaction Network (HINT) on the following four tasks: phase I, phase II, phase III, and indication-level prediction. The following figure illustrates the pipeline of HINT.


Phase I/II/III Prediction

Phase-level prediction predicts the approval probability of a single-phase study.

python src/learn_phaseI.py
python src/learn_phaseII.py
python src/learn_phaseIII.py

Indication level Prediction

Indication-level prediction predicts whether a drug can pass all three phases and reach final market approval.

python src/learn_indication.py 


We evaluate the predictions with three metrics:

  • PR-AUC (Precision-Recall Area Under Curve). The precision-recall curve summarizes the trade-off between the true positive rate (recall) and the positive predictive value (precision) of a predictive model across different probability thresholds.
  • F1. The F1 score is the harmonic mean of precision and recall.
  • ROC-AUC (Area Under the Receiver Operating Characteristic Curve). The ROC curve summarizes the trade-off between the true positive rate and the false positive rate of a predictive model across different probability thresholds.
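
As a worked illustration of two of these metrics, the sketch below computes F1 at a fixed threshold and ROC-AUC as the probability that a randomly chosen positive is scored above a randomly chosen negative. It is a plain-Python teaching example with hypothetical function names and made-up inputs. In practice, scikit-learn (already in the install list) offers f1_score, roc_auc_score, and average_precision_score for the same purpose.

```python
from typing import List


def f1_at_threshold(y_true: List[int], y_prob: List[float], threshold: float = 0.5) -> float:
    """F1 = harmonic mean of precision and recall at the given threshold."""
    preds = [int(p >= threshold) for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc_pairwise(y_true: List[int], y_prob: List[float]) -> float:
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative sample")
    # Ties between a positive and a negative score count as half a correct pair.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up labels and scores, for illustration only.
    labels = [1, 0, 1, 1, 0]
    scores = [0.9, 0.4, 0.6, 0.3, 0.2]
    print(f1_at_threshold(labels, scores), roc_auc_pairwise(labels, scores))
```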


The empirical results are given for reference. The mean and standard deviation of 5 independent runs are reported.

Dataset      PR-AUC            F1                ROC-AUC
Phase I      0.7406 (0.0221)   0.8474 (0.0144)   0.8383 (0.0186)
Phase II     0.6030 (0.0198)   0.7127 (0.0163)   0.7850 (0.0136)
Phase III    0.6279 (0.0165)   0.6419 (0.0183)   0.7257 (0.0109)
Indication   0.7136 (0.0120)   0.7798 (0.0087)   0.7987 (0.0111)

Jupyter Notebook Tutorial

Please see learn_phaseI.ipynb for details.


Contact

Please contact futianfan@gmail.com for help or submit an issue. This is joint work with Kexin Huang, Cao (Danica) Xiao, Lucas M. Glass, and Jimeng Sun.


