Hierarchical Graph-to-Graph Translation for Molecules

Our paper is at https://arxiv.org/abs/1907.11223

Installation

First install the dependencies via conda:

PyTorch >= 1.0.0
networkx
RDKit
numpy
Python >= 3.6

Then run pip install . from the repository root.
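
As a quick sanity check after installation, the sketch below reports which of the listed dependencies can be imported. It assumes the usual import names for these packages (torch for PyTorch, rdkit for RDKit); the helper name check_dependencies is made up for illustration and is not part of this repository.

```python
import importlib.util
import sys


def check_dependencies(modules=("torch", "networkx", "rdkit", "numpy")):
    """Return a dict mapping each top-level module name to True if it is importable."""
    # find_spec returns None when a top-level module cannot be located.
    return {name: importlib.util.find_spec(name) is not None for name in modules}


# Python >= 3.6 is listed as a requirement above.
print("Python >= 3.6:", sys.version_info >= (3, 6))
for name, found in check_dependencies().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

This only checks that the modules can be located; it does not verify the minimum PyTorch version.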

Data Format

The training file should contain pairs of molecules (molA, molB) that are similar to each other, where molB has better chemical properties than molA. See data/qed/train_pairs.txt for an example.
The test file is a list of molecules to be optimized. See data/qed/test.txt for an example.
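
A minimal sketch for checking a training file before preprocessing. It assumes each line holds two whitespace-separated SMILES strings (molA, then molB); that layout is an assumption here, so compare it against data/qed/train_pairs.txt before relying on it. The function name read_pairs is hypothetical and is not part of this repository.

```python
def read_pairs(lines):
    """Parse (molA, molB) pairs, assuming two whitespace-separated SMILES per line."""
    pairs = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 2 fields, got {len(fields)}")
        pairs.append((fields[0], fields[1]))
    return pairs


# Toy input with placeholder SMILES (not taken from the dataset).
example = ["CCO CCN", "c1ccccc1 c1ccncc1"]
print(read_pairs(example))
```

To check a real file, pass an open file handle instead of the example list, for instance read_pairs(open("data/qed/train_pairs.txt")).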

Sample training procedure

Extract substructure vocabulary from a given set of molecules:

python get_vocab.py < data/qed/mols.txt > vocab.txt

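Replace data/qed/mols.txt with your own molecule file. Note that this command writes vocab.txt to the current directory, while the commands below read data/qed/vocab.txt; pass whichever path holds your vocabulary to --vocab.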

Preprocess training data:

python preprocess.py --train data/qed/train_pairs.txt --vocab data/qed/vocab.txt --ncpu 16 < data/qed/train_pairs.txt
mkdir train_processed
mv tensor* train_processed/

Replace the --train and --vocab arguments with the paths to your training and vocabulary files.

Train the model:

mkdir models/
python gnn_train.py --train train_processed/ --vocab data/qed/vocab.txt --save_dir models/

Make predictions on your lead compounds:

python ensemble_decode.py --test data/qed/valid.txt --vocab data/qed/vocab.txt --model_dir models/ > results.csv

For faster decoding during debugging, run:

python ensemble_decode.py --test data/qed/valid.txt --vocab data/qed/vocab.txt --model_dir models/ --num_decode 20 > results.csv

The output file has the following format, with one candidate compound per row:

lead compound smiles	new compound smiles	similarity
c1ccc(c2cncnc2)cc1[C@@]3(c4ccc(OC)cc4)N=C(N)OC3	COc1ccc([C@@]2(c3cccc(C#N)c3)COC(N)=N2)cc1	0.6364
c1ccc(c2cncnc2)cc1[C@@]3(c4ccc(OC)cc4)N=C(N)OC3	NC1=NC@@(c2cccc(-c3ccccc3)c2)CO1	0.5273
c1ccc(c2cncnc2)cc1[C@@]3(c4ccc(OC)cc4)N=C(N)OC3	CCOc1ccc([C@@]2([C@]3(c4ccc(OC)cc4)COC(N)=N3)COC(N)=N2)cc1	0.4310
c1ccc(c2cncnc2)cc1[C@@]3(c4ccc(OC)cc4)N=C(N)OC3	NC1=NC@@(c2ccc(-c3ccccc3)cc2)CO1	0.4717
c1ccc(c2cncnc2)cc1[C@@]3(c4ccc(OC)cc4)N=C(N)OC3	COc1ccc(C@@Hc2cccc(-c3cncnc3)c2)cc1	0.4643
c1ccc(c2cncnc2)cc1[C@@]3(c4ccc(OC)cc4)N=C(N)OC3	NC1=NC@@(c2cccc(-c3ccccc3)c2)CO1	0.5472
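
The output can be post-processed along the following lines. This sketch assumes tab-separated columns with a header row, as in the sample above, and keeps the candidate with the highest similarity for each lead compound; that selection rule is only an illustration, and best_per_lead is a hypothetical helper, not part of this repository.

```python
import csv
import io


def best_per_lead(handle):
    """Return {lead: (candidate, similarity)}, keeping the most similar candidate."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    best = {}
    for row in reader:
        if len(row) != 3:
            continue  # ignore blank or malformed rows
        lead, candidate, similarity = row[0], row[1], float(row[2])
        if lead not in best or similarity > best[lead][1]:
            best[lead] = (candidate, similarity)
    return best


# Toy rows in the same three-column layout (placeholder SMILES, not real results).
sample = io.StringIO(
    "lead compound smiles\tnew compound smiles\tsimilarity\n"
    "CCO\tCCN\t0.50\n"
    "CCO\tCCOC\t0.75\n"
)
print(best_per_lead(sample))
```

To run it on real output, pass open("results.csv", newline="") in place of the in-memory sample.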
