ytabatabaee / DL4H

Final Project of Deep Learning for Healthcare Spring 2022 - Reproducing DeepDTA

Reproducing DeepDTA: Deep Drug–Target Binding Affinity Prediction

This repository contains the code and data for the final project of the Deep Learning for Healthcare course at UIUC in Spring 2022. The project attempts to reproduce the major results of the DeepDTA paper (Öztürk et al., 2018) and includes some additional experiments beyond the paper.

DeepDTA

DeepDTA is a deep learning-based model that predicts the level of interaction, or binding affinity, between a drug and a target protein. DeepDTA uses convolutional neural networks (CNNs) to learn representations from the raw sequences of proteins and ligands.
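
To make "raw sequences" concrete, here is a minimal sketch of one common way to turn a SMILES string into a fixed-length integer vector that a CNN embedding layer could consume. The vocabulary built from the toy inputs, the helper names `build_vocab` and `encode_sequence`, and the use of 0 as the padding value are illustrative assumptions, not the repository's actual encoding tables.

```python
def build_vocab(sequences):
    """Map every character seen in `sequences` to a positive integer (0 is kept for padding)."""
    chars = sorted({ch for seq in sequences for ch in seq})
    return {ch: idx + 1 for idx, ch in enumerate(chars)}


def encode_sequence(seq, vocab, max_len):
    """Integer-encode `seq`, truncating or zero-padding it to exactly `max_len` entries."""
    encoded = [vocab[ch] for ch in seq[:max_len]]
    return encoded + [0] * (max_len - len(encoded))


smiles = ["CCO", "C1=CC=CC=C1"]  # ethanol and benzene, used only as toy inputs
vocab = build_vocab(smiles)
encoded = [encode_sequence(s, vocab, max_len=8) for s in smiles]
print(vocab)
print(encoded)
```

The short string is padded with zeros and the long one is cut off at `max_len`, so every input ends up with the same length regardless of the molecule's size.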

Citation to the paper: Hakime Öztürk, Arzucan Özgür, Elif Ozkirimli, DeepDTA: deep drug–target binding affinity prediction, Bioinformatics, Volume 34, Issue 17, 01 September 2018, Pages i821–i829, https://doi.org/10.1093/bioinformatics/bty593

Code repository of the paper: https://github.com/hkmztrk/DeepDTA

Code repository of the baseline SimBoost (official - in R): https://github.com/hetong007/SimBoost

Code repository of the baseline SimBoost (unofficial - in Python): https://github.com/mahtaz/Simboost-ML_project-

Code repository of the baseline KronRLS: https://github.com/aatapa/RLScore

Dependencies

DeepDTA is written in Python 3 and has the following dependencies.

Note that most of the TensorFlow and Keras calls used in the DeepDTA code are deprecated in TensorFlow 2.x, so the code must be run with TensorFlow 1.x. Google Colab loads TensorFlow 2.x by default; the 1.x version can be selected with the following command: %tensorflow_version 1.x

SimBoost (official code) is written in R and has the following dependencies.

SimBoost (unofficial code) is written in Python 3 and has the following dependencies.

KronRLS is written in Python 2 and has the following dependencies.

To compute the Area Under the Precision-Recall curve (AUPR) as a performance measure, all methods use the auc.jar package, which can be downloaded from http://mark.goadrich.com/programs/AUC/ and is also included in the repository as auc.jar.

Data

The paper uses the Davis Kinase binding affinity dataset (Davis et al., 2011), containing 442 proteins and 68 compounds with overall 30,056 interactions, and the KIBA large-scale kinase inhibitors bioactivity dataset (Tang et al., 2014), containing 229 proteins and 2111 compounds with overall 118,254 interactions.
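
As a quick arithmetic check on these numbers, the sketch below compares the measured interactions with the number of possible protein-compound pairs in each dataset. It uses only the counts quoted above; the variable names are illustrative.

```python
# Dataset sizes quoted above: (proteins, compounds, measured interactions).
datasets = {
    "davis": (442, 68, 30056),
    "kiba": (229, 2111, 118254),
}

density = {}
for name, (n_proteins, n_compounds, n_measured) in datasets.items():
    n_possible = n_proteins * n_compounds  # every protein paired with every compound
    density[name] = n_measured / n_possible
    print(f"{name}: {n_measured} of {n_possible} pairs measured ({density[name]:.1%})")
```

The Davis matrix is fully populated (442 × 68 = 30,056), whereas only about a quarter of the possible KIBA pairs have a measured value, which is why missing entries have to be handled when loading the KIBA affinity matrix.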

Raw Data Download: The raw datasets are available for download from the following links:

Preprocessed Data: The preprocessed datasets are located under DeepDTA/data directory as kiba and davis. Each dataset directory contains several files named as follows:

  • proteins.txt: This file contains raw amino-acid sequences of proteins.
  • ligands_can.txt: This file contains the raw SMILES sequences of ligands (compounds) in canonical form.
  • Y: This file contains binding affinity values between proteins and ligands.
  • target-target_similarities_WS.txt: This file contains the Smith-Waterman (SW) matrices of similarity between target pairs.
  • drug-drug_similarities_2D.txt: This file contains the Pubchem Sim matrices of similarity between drug pairs.

Data Statistics: The data_statistics.ipynb file demonstrates some statistics of the datasets, including distribution of the binding affinity scores and distribution of the protein and SMILES sequence lengths for both Davis and KIBA datasets.

Codes

Preprocessing

The preprocessed data is already available at DeepDTA/data. All the experiments were done on this preprocessed data, and no extra preprocessing is required. The data_statistics.ipynb jupyter notebook shows how the data can be loaded and used in any code, and demonstrates a reproduction of Figure 1 in the paper.

Additional Explanation of the Data

This section is partially taken from the original README of the paper's repository.

Similarity files

For each dataset, there are two similarity files, drug-drug and target-target similarities.

  • Drug-drug similarities are obtained via Pubchem structure clustering.
  • Target-target similarities are obtained via S-W similarity.

These files were used to reproduce the results of two other methods (Pahikkala et al., 2017; He et al., 2017) and for some experiments with the DeepDTA model; please refer to the paper for details.

Binding affinity files

  • For the davis dataset, the standard value is Kd in nM. In the article, the following transformation was used: pKd = -log10(Kd / 10^9)

  • For the KIBA dataset, the standard value is the KIBA score. Two versions of the binding affinity text files are provided, corresponding to the original values and the transformed values. In the article, the transformed values were used.

  • nan values indicate there is no experimental value for that drug-target pair.
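
The log transformation applied to the Davis values, and the check for missing entries, can be sketched in plain Python as follows. The helper names `kd_nm_to_pkd` and `is_measured` are hypothetical and do not appear in the repository.

```python
import math


def kd_nm_to_pkd(kd_nm):
    """Convert a dissociation constant in nM to pKd = -log10(Kd in molar)."""
    return -math.log10(kd_nm / 1e9)


def is_measured(value):
    """Return False for nan entries, which mark pairs with no experimental value."""
    return not math.isnan(value)


for kd in (1.0, 100.0, 10000.0):
    print(kd, "nM ->", round(kd_nm_to_pkd(kd), 3))

print(is_measured(float("nan")), is_measured(11.2))
```

A smaller Kd means tighter binding, so the transformation maps stronger binders to larger pKd values: 1 nM corresponds to a pKd of about 9, and 10000 nM to about 5.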

Train and test folds

There are two files for each dataset: a train fold file and a test fold file. Both files store the positions of the binding affinity values within the binding affinity matrices given in the text files.

  • Since the authors performed 5-fold cross-validation, the train fold file contains five different sets of positions.
  • The test set is the same for all five training sets.

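Using the folds

  • Load the affinity matrix Y:
import pickle
import numpy as np

# Under Python 3, a pickle written by Python 2 may need encoding='latin1':
# Y = pickle.load(open("Y", "rb"), encoding='latin1')
Y = pickle.load(open("Y", "rb"))
# log transformation for davis (Kd in nM -> pKd); set log_space = False for kiba
log_space = True
if log_space:
    Y = -np.log10(Y / 1e9)
# row and column indices of all entries that are not nan
label_row_inds, label_col_inds = np.where(~np.isnan(Y))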
  • label_row_inds: drug indices for the corresponding affinity matrix positions (flattened)

    e.g. the 36275th point in the KIBA Y matrix indicates the 364th drug (same order as in the SMILES file)

    label_row_inds[36275]
  • label_col_inds: protein indices for the corresponding affinity matrix positions (flattened)

    e.g. the 36275th point in the KIBA Y matrix indicates the 120th protein (same order as in the protein sequence file)

    label_col_inds[36275]
  • You can then load the fold files as follows:

    import json
    test_fold = json.load(open(yourdir + "folds/test_fold_setting1.txt"))
    train_folds = json.load(open(yourdir + "folds/train_fold_setting1.txt"))
    
    test_drug_indices = label_row_inds[test_fold]
    test_protein_indices = label_col_inds[test_fold]

    Remember that train_folds contains an array of 5 lists, each of which corresponds to a training set.
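
One way the five lists might be combined during cross-validation is sketched below. It assumes that each list is a single fold of flat positions, that one fold is held out for validation, and that the remaining four are merged for training. The helper `split_for_fold` and the toy data are hypothetical; how run_experiments.py actually consumes the folds should be checked in the source.

```python
def split_for_fold(train_folds, val_index):
    """Return (train_positions, val_positions) when fold `val_index` is held out for validation."""
    val_positions = list(train_folds[val_index])
    train_positions = [
        pos
        for i, fold in enumerate(train_folds)
        if i != val_index
        for pos in fold
    ]
    return train_positions, val_positions


# Toy stand-in for the contents of train_fold_setting1.txt: five lists of flat positions.
toy_folds = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
train_pos, val_pos = split_for_fold(toy_folds, val_index=2)
print(train_pos)  # positions from the four remaining folds
print(val_pos)    # positions held out for validation
```

The resulting position lists can be used to index label_row_inds and label_col_inds in the same way as test_fold above.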

Preprocessing for SimBoost

Code for preprocessing the raw data and generating the similarity matrices for SimBoost is available in the simboost_R/preprocessing/ directory. Running the commands below generates .Rda files in the simboost_R/data/ directory.

Rscript preprocess_metz_davis.R
Rscript preprocess_Kiba.R

Note that the kiba dataset must be downloaded manually, whereas davis is downloaded automatically by the code.

Training and Evaluation

Training and evaluation can be run with a single command for every method, and evaluation can also be run separately by loading the pretrained models. The commands used to run the experiments for each method are listed below.

DeepDTA

DeepDTA can be run with the following command.

$ cd DeepDTA/source/
$ python3 run_experiments.py --num_windows 32 \
                          --seq_window_lengths 8 \
                          --smi_window_lengths 6 \
                          --batch_size 256 \
                          --num_epoch 100 \
                          --max_seq_len 1000 \
                          --max_smi_len 100 \
                          --dataset_path 'data/kiba/' \
                          --problem_type 1 \
                          --log_dir 'logs/'

We explain some of the non-trivial parameters below:

  • --num_windows: The number of filters in the first convolutional layer.
  • --seq_window_lengths, --smi_window_lengths: The window (filter) lengths for protein (seq) and compound (smi) sequences. Several values can be provided, such as 4 8 12.
  • --max_seq_len, --max_smi_len: The fixed lengths of protein (seq) and compound (smi) sequences. In the paper, these were set to 1000 and 100 respectively for kiba, and to 1200 and 85 for davis.
  • --problem_type: 1 for kiba and 0 for davis. Indicates whether a log transformation is needed.
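
When several window lengths are supplied, the number of model configurations grows multiplicatively. The sketch below enumerates such a grid with `itertools.product`, assuming every combination of the supplied values is evaluated, as a grid search would do. The example values are hypothetical, and whether run_experiments.py iterates in exactly this way is not confirmed here.

```python
from itertools import product

# Hypothetical values, as if passed on the command line.
num_windows = [32]
seq_window_lengths = [4, 8, 12]
smi_window_lengths = [4, 6, 8]

grid = list(product(num_windows, seq_window_lengths, smi_window_lengths))
for n_filters, seq_len, smi_len in grid:
    print(f"filters={n_filters} seq_window={seq_len} smi_window={smi_len}")
print(len(grid), "configurations")
```

With one filter count and three choices for each window length, nine configurations would be trained, so run time scales accordingly.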

To use a different performance measure or run one of the DeepDTA baselines, you can change the following two lines at the end of run_experiments.py:

perfmeasure = get_cindex # specify performance measure, e.g. mse loss
deepmethod = build_combined_categorical # specify model type, e.g. baseline, combined, etc
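
For reference, the two performance measures reported in this project, the concordance index (CI) and the mean squared error (MSE), can be sketched in plain Python as follows. This is a simple quadratic-time illustration of the standard definitions, not the repository's get_cindex implementation; the function names and toy values are chosen for this example only.

```python
def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the true order.

    A pair (i, j) is comparable when y_true[i] > y_true[j].
    Tied predictions for a comparable pair count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(y_true)):
        for j in range(len(y_true)):
            if y_true[i] > y_true[j]:
                comparable += 1
                if y_pred[i] > y_pred[j]:
                    concordant += 1.0
                elif y_pred[i] == y_pred[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs: all true values are equal")
    return concordant / comparable


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [5.0, 6.2, 7.1, 8.0]  # toy affinity values
y_pred = [5.1, 6.0, 6.0, 7.5]  # toy predictions with one tie
print("CI: %.3f" % concordance_index(y_true, y_pred))
print("MSE: %.3f" % mean_squared_error(y_true, y_pred))
```

CI depends only on the ranking of the predictions, so a model can have a high CI and still a large MSE if its predictions are shifted or scaled; this is why both measures are reported.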

SimBoost

We only ran the Python version of SimBoost in this project. The Python code for training and evaluating SimBoost is available in the Jupyter notebooks simboost_python/SimBoost_kiba.ipynb and simboost_python/SimBoost_davis.ipynb.

  • Command for running the R version:
$ cd simboost_R/xgboost/
$ Rscript Sequential.feature.*.R
$ Rscript Sequential.cv.xgb.quantile.exec.R
$ Rscript Sequential.cv.xgb.exec.R

where * is the name of the dataset (kiba or davis).

KronRLS

KronRLS can be run with the following commands. Note that setup.py should only be run to install for the first time.

$ cd KronRLS/
$ python2 setup.py # use only in the first run
$ python2 kronecker_experiments.py

The dataset and problem type (regression or classification) are selected in the main function of kronecker_experiments.py: uncomment the function you want to run, such as kiba_regression() or davis_regression().

Log files and plots

The MSE loss and CI-score plots of the DeepDTA model and its variants are provided in the DeepDTA/source/figures/ directory and the log files for training and evaluation are provided in the DeepDTA/source/logs/ directory.

Pretrained Models

Most of the pretrained models are provided in the pretrained_models directory on GitHub, but those larger than 100MB are provided on Google Drive, with links available below:

Loading DeepDTA pretrained models

Note: Because the Keras and TensorFlow versions used in DeepDTA are old and deprecated, recent versions of the h5py package cannot load the pretrained models. Reinstall h5py at version 2.10.0 with the following command:

pip install 'h5py==2.10.0' --force-reinstall

The following code can then be used to load a pretrained model:

from keras.models import load_model
model = load_model('combined_davis.h5', custom_objects={"cindex_score": cindex_score})
model.summary()

Depending on where you run the code, you may also need to define or import the cindex_score function, which is available in DeepDTA/source/run_experiments.py.

Evaluating the pretrained models

The evaluation and training code is not separated for any of the methods, but evaluation can also be done easily with the pretrained models. A complete example is available at the end of simboost_python/SimBoost_davis.ipynb. All the .pkl models can be loaded with pickle as shown below.

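import pickle
# mean_squared_error is assumed here to be scikit-learn's implementation
from sklearn.metrics import mean_squared_error

# X_test, Y_test and cindex_score must already be defined, as in the notebook
loaded_model = pickle.load(open('davis_simboost.pkl', 'rb'))
Y_pred = loaded_model.predict(X_test)

print("Davis Test CI-Index: %.3f" % cindex_score(Y_test, Y_pred))
print("Davis Test MSE: %.3f" % mean_squared_error(Y_test, Y_pred))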

Results

Davis

For each experiment on the Davis dataset, the total number of training samples was 20036 and the total number of test samples was 5010.

| Method | Proteins | Compounds | CI-Index | MSE Loss |
|---|---|---|---|---|
| KronRLS | S–W | Pubchem Sim | 0.867 | 0.376 |
| SimBoost | S–W | Pubchem Sim | 0.862 | 0.298 |
| DeepDTA | S–W | Pubchem Sim | 0.771 | 0.685 |
| DeepDTA | CNN | Pubchem Sim | 0.810 | 0.490 |
| DeepDTA | S–W | CNN | 0.823 | 0.462 |
| DeepDTA | CNN | CNN | 0.876 | 0.255 |

KIBA

For each experiment on the KIBA dataset, the total number of training samples was 78836 and the total number of test samples was 19709.

| Method | Proteins | Compounds | CI-Index | MSE Loss |
|---|---|---|---|---|
| KronRLS | S–W | Pubchem Sim | 0.794 | 0.373 |
| SimBoost | S–W | Pubchem Sim | 0.824 | 0.279 |
| DeepDTA | S–W | Pubchem Sim | 0.704 | 1.59 |
| DeepDTA | CNN | Pubchem Sim | 0.702 | 0.541 |
| DeepDTA | S–W | CNN | 0.759 | 0.355 |
| DeepDTA | CNN | CNN | 0.857 | 0.211 |
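
The sample counts quoted for the two datasets can be cross-checked against the total interaction counts given in the Data section. The short sketch below uses only numbers stated in this README; the interpretation of the remainder is an inference, not something the README states.

```python
# Interaction and sample counts quoted in this README.
counts = {
    "davis": {"total": 30056, "train": 20036, "test": 5010},
    "kiba": {"total": 118254, "train": 78836, "test": 19709},
}

remainder = {}
for name, c in counts.items():
    remainder[name] = c["total"] - c["train"] - c["test"]
    print(f"{name}: {remainder[name]} interactions are in neither the training nor the test count")
```

For both datasets the remainder equals the test set size. That would be consistent with one of the five cross-validation folds being held out for validation in each run, although this reading is an assumption.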

Acknowledgements

Please cite the original paper if you are using this code in your work.

@article{ozturk2018deepdta,
  title={DeepDTA: deep drug--target binding affinity prediction},
  author={{\"O}zt{\"u}rk, Hakime and {\"O}zg{\"u}r, Arzucan and Ozkirimli, Elif},
  journal={Bioinformatics},
  volume={34},
  number={17},
  pages={i821--i829},
  year={2018},
  publisher={Oxford University Press}
}
