Our paper is at https://arxiv.org/abs/1907.11223
First install the dependencies via conda:
- PyTorch >= 1.0.0
- networkx
- RDKit
- numpy
- Python >= 3.6
And then run pip install .
- The training file should contain pairs of molecules (molA, molB) that are similar to each other but molB has better chemical properties. Please see
data/qed/train_pairs.txt
. - The test file is a list of molecules to be optimized. Please see
data/qed/test.txt
.
- Extract substructure vocabulary from a given set of molecules:
python get_vocab.py < data/qed/mols.txt > vocab.txt
Please replace data/qed/mols.txt
with your molecules data file.
- Preprocess training data:
python preprocess.py --train data/qed/train_pairs.txt --vocab data/qed/vocab.txt --ncpu 16 < data/qed/train_pairs.txt
mkdir train_processed
mv tensor* train_processed/
Please replace --train
and --vocab
with training and vocab file.
- Train the model:
mkdir models/
python gnn_train.py --train train_processed/ --vocab data/qed/vocab.txt --save_dir models/
- Make prediction on your lead compounds:
python ensemble_decode.py --test data/qed/valid.txt --vocab data/qed/vocab.txt --model_dir models/ > results.csv
If you want a faster decoding for debugging purposes, run
python ensemble_decode.py --test data/qed/valid.txt --vocab data/qed/vocab.txt --model_dir models/ --num_decode 20 > results.csv
The output is a CSV file having the following format:
lead compound smiles | new compound smiles | similarity |
---|---|---|
c1ccc(c2cncnc2)cc1[C@@]3(c4ccc(OC)cc4)N=C(N)OC3 | COc1ccc([C@@]2(c3cccc(C#N)c3)COC(N)=N2)cc1 | 0.6364 |
c1ccc(c2cncnc2)cc1[C@@]3(c4ccc(OC)cc4)N=C(N)OC3 | NC1=NC@@(c2cccc(-c3ccccc3)c2)CO1 | 0.5273 |
c1ccc(c2cncnc2)cc1[C@@]3(c4ccc(OC)cc4)N=C(N)OC3 | CCOc1ccc([C@@]2([C@]3(c4ccc(OC)cc4)COC(N)=N3)COC(N)=N2)cc1 | 0.4310 |
c1ccc(c2cncnc2)cc1[C@@]3(c4ccc(OC)cc4)N=C(N)OC3 | NC1=NC@@(c2ccc(-c3ccccc3)cc2)CO1 | 0.4717 |
c1ccc(c2cncnc2)cc1[C@@]3(c4ccc(OC)cc4)N=C(N)OC3 | COc1ccc(C@@Hc2cccc(-c3cncnc3)c2)cc1 | 0.4643 |
c1ccc(c2cncnc2)cc1[C@@]3(c4ccc(OC)cc4)N=C(N)OC3 | NC1=NC@@(c2cccc(-c3ccccc3)c2)CO1 | 0.5472 |