Continuous and Data-Driven Descriptors (CDDD)

Implementation of the Paper "Learning Continuous and Data-Driven Molecular Descriptors by Translating Equivalent Chemical Representations" by Robin Winter, Floriane Montanari, Frank Noe and Djork-Arne Clevert.¹

Installing

Prerequisites

python 3
tensorflow 1.10
numpy
rdkit
scikit-learn

Conda

Create a new enviorment:

git clone https://github.com/jrwnter/cddd.git
cd cddd
conda env create -f environment.yml
source activate cddd

Install tensorflow without GPU support:

pip install --ignore-installed --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.10.0-cp36-cp36m-linux_x86_64.whl

Or with GPU support:

pip install --ignore-installed --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.10.0-cp36-cp36m-linux_x86_64.whl

And install the cddd package:

pip install .

Downloading Pretrained Model

A pretrained model as described in ref. 1 is available on Google Drive. Download and unzip by execuiting the bash script "download_default_model.sh":

./download_default_model.sh

The default_model.zip file can also be downloaded manualy under https://drive.google.com/open?id=1oyknOulq_j0w9kzOKKIHdTLo5HphT99h

Testing

Extract molecular descriptors from two QSAR datasets (ref. 2,3) and evaluate the perfromance of a SVM trained on these descriptors.

cd example
python3 run_qsar_test.py --model_dir ../default_model

or with gpu support on e.g. device 0:

python3 run_qsar_test.py --model_dir ../default_model --use_gpu --device 0

The accuracy on the Ames dataset should be arround 0.814 +/- 0.006.

The r2 on the Lipophilicity dataset should be arround 0.731 +/- 0.029.

Getting Started

Extracting Molecular Descripotrs

Run the script run_cddd.py to extract molecular descripotrs of your provided SMILES:

cddd --input smiles.smi --output descriptors.csv  --smiles_header smiles

Supported input:

.csv-file with one SMILES per row
.smi-file with one SMILES per row

For .csv: Specify the header of the SMILES column with the flag --smiles_header (default: smiles)

Inference Module

The pretrained model can also be imported and used directly in python via the inference class:

import pandas as pd
from cddd.inference import InferenceModel
from cddd.preprocessing import preprocess_smiles

Load and preprocess data:

ames_df = pd.read_csv("example/ames.csv", index_col=0)
ames_df["smiles_preprocessed"] = ames_df.smiles.map(preprocess_smiles)
ames_df = ames_df.dropna()
smiles_list = ames_df["smiles_preprocessed"].tolist()

Create a instance of the inference class:

inference_model = InferenceModel()

Encode all SMILES into the continuous embedding (molecular descriptor):

smiles_embedding = inference_model.seq_to_emb(smiles_list)

The infernce model instance can also be used to decode a molecule embedding back to a interpretable SMILES string:

decoded_smiles_list = inference_model.emb_to_seq(smiles_embedding)

References

[1] R. Winter, F. Montanari, F. Noe and D. Clevert, Chem. Sci, 2019, https://pubs.rsc.org/en/content/articlelanding/2019/sc/c8sc04175j#!divAbstract

[2] K. Hansen, S. Mika, T. Schroeter, A. Sutter, A. Ter Laak, T. Steger-Hartmann, N. Heinrich and K.-R. MuÌ´Lller, J. Chem. Inf. Model., 2009, 49, 2077–2081.

[3] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing and V. Pande, Chemical Science, 2018, 9, 513–530.

About

Implementation of the Paper "Learning Continuous and Data-Driven Molecular Descriptors by Translating Equivalent Chemical Representations" by Robin Winter, Floriane Montanari, Frank Noe and Djork-Arne Clevert.

MIT License

Languages

Language:Python 99.4%Language:Shell 0.3%Language:Dockerfile 0.3%