t-SMILES: A Scalable Fragment-based Molecular Representation Algorithm for De Novo Molecule Generation

This study introduces a scalable, fragment-based, multiscale molecular representation algorithm called t-SMILES (tree-based SMILES). It describes molecules using SMILES-type strings obtained by performing a breadth-first search on a full binary tree formed from a fragmented molecular graph.
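
The traversal described above can be illustrated with a small, self-contained sketch. It is not the repository's implementation: the `Node` class, the `bfs_full_binary` function, the `&` padding symbol for empty children, and the `^` separator between fragments are all assumptions made for illustration, and the fragment tree is built by hand rather than produced by a real decomposition algorithm.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional

PAD = "&"  # assumed placeholder for an empty child (illustrative only)


@dataclass
class Node:
    """One tree node holding a fragment written as a SMILES string."""
    label: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def bfs_full_binary(root: Node) -> List[str]:
    """Return labels in breadth-first order, padding every real node
    to exactly two children so the traversed tree is a full binary tree."""
    labels = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        labels.append(node.label)
        if node.label == PAD:
            continue  # padding nodes are leaves and have no children
        for child in (node.left, node.right):
            queue.append(child if child is not None else Node(PAD))
    return labels


# Hypothetical fragment tree: "CC" has one child "C1CCCCC1",
# which in turn has one (right) child "CO".
root = Node("CC", left=Node("C1CCCCC1", right=Node("CO")))
sequence = bfs_full_binary(root)
print("^".join(sequence))  # "^" as a fragment separator is an assumption
```

Because every missing child is written out explicitly, the tree shape can be recovered from the flat string alone, which is the property that makes a breadth-first encoding of a full binary tree reversible.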

Systematic evaluations show that:

It supports a multilingual system for molecular description in which each decomposition algorithm defines its own language; these languages complement one another and together contribute to a whole mixed chemical space. Under this framework, classical SMILES can be unified as a special case of t-SMILES, achieving better-balanced performance with hybrid decomposition algorithms;
It significantly improves generalization performance compared with classical SMILES, DeepSMILES, and SELFIES;
It performs well on the low-resource datasets JNK3 and AID1706, whether the model is used as-is, with data augmentation, or with pre-training and fine-tuning;
It outperforms previous fragment-based models while being competitive with classical SMILES and graph-based methods on ZINC, QM9, and ChEMBL;
It is adaptable to any decomposition method, such as BRICS, JTVAE, MMPA, or Scaffold;
It enables the robust application of sequence-based generative models, such as LSTM, Transformer, VAE, and AAE, in molecular modeling.

Here we provide the source code of our method.

Dependencies

We recommend Anaconda for managing the Python version and installed packages.

Please make sure the following packages are installed:

Python (version >= 3.7)
PyTorch (version == 1.7)

$ conda install pytorch torchvision cudatoolkit=x.x -c pytorch

Note: the exact command depends on your GPU device and CUDA toolkit (x.x is the CUDA version).

RDKit (version >= 2020.03)

$ conda install -c rdkit rdkit

Networkx (version >= 2.4)

$ pip install networkx

Numpy (version >= 1.19)

$ conda install numpy

Pandas (version >= 1.2.2)

$ conda install pandas

Matplotlib (version >= 2.0)

$ conda install matplotlib
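
After installation, a quick check such as the following may help confirm that the environment matches the versions listed above. This is a minimal sketch and not part of the repository: the helper names (`parse_version`, `satisfies`, `check`) are hypothetical, and it assumes each package exposes a `__version__` attribute on its top-level module.

```python
import importlib
import re
from typing import Dict, List, Tuple

# (import name, operator, version), copied from the dependency list above
REQUIREMENTS = [
    ("torch", "==", "1.7"),
    ("rdkit", ">=", "2020.03"),
    ("networkx", ">=", "2.4"),
    ("numpy", ">=", "1.19"),
    ("pandas", ">=", "1.2.2"),
    ("matplotlib", ">=", "2.0"),
]


def parse_version(text: str) -> Tuple[int, ...]:
    """Keep only the leading dotted-number part, e.g. '1.7.0+cu110' -> (1, 7, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group().split("."))


def satisfies(installed: str, op: str, required: str) -> bool:
    """'==' compares only as many components as the requirement gives;
    any other operator is treated as '>='."""
    have, want = parse_version(installed), parse_version(required)
    if op == "==":
        return have[:len(want)] == want
    return have >= want


def check(requirements: List[Tuple[str, str, str]]) -> Dict[str, str]:
    """Try to import each package and report its status."""
    report = {}
    for name, op, required in requirements:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = "missing"
            continue
        installed = getattr(module, "__version__", None)
        if installed is None:
            report[name] = "unknown version"
        elif satisfies(installed, op, required):
            report[name] = "ok"
        else:
            report[name] = f"found {installed}, need {op} {required}"
    return report


if __name__ == "__main__":
    for name, status in check(REQUIREMENTS).items():
        print(f"{name}: {status}")
```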

Usage

To design novel drug molecules with t-SMILES, run the following scripts in order:

DataSet/Graph/CNJTMol.py

preprocess()

It contains a preprocess function that generates t-SMILES from a dataset.

DataSet/Tokenlizer.py

preprocess(delimiter=',', invalid_token = '&', save_split = False)

It defines a tokenizer that can be used to generate the vocabulary of t-SMILES and SMILES.
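
To give a rough idea of what vocabulary generation for SMILES-type strings involves, here is a minimal regex-based sketch. It does not reproduce the code in DataSet/Tokenlizer.py: the function names `tokenize` and `build_vocab`, the token pattern, and the treatment of `&` and `^` as single tokens are assumptions made for illustration.

```python
import re
from typing import Dict, Iterable, List

# Assumed token pattern: bracket atoms, two-letter halogens, two-digit ring
# closures, then single letters, digits, and bond/branch/marker symbols.
TOKEN_PATTERN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[()=#+\-\\/.:@&^*~]"
)


def tokenize(text: str) -> List[str]:
    """Split a SMILES-type string into tokens; fail on uncovered characters."""
    tokens = TOKEN_PATTERN.findall(text)
    if "".join(tokens) != text:
        raise ValueError(f"unrecognized characters in {text!r}")
    return tokens


def build_vocab(strings: Iterable[str]) -> Dict[str, int]:
    """Map every distinct token to an integer index (sorted for stability)."""
    tokens = {token for s in strings for token in tokenize(s)}
    return {token: index for index, token in enumerate(sorted(tokens))}


# Hypothetical inputs: one plain SMILES and one string with assumed markers.
vocab = build_vocab(["CC(=O)Cl", "c1cc[nH]c1^&"])
print(vocab)
```

Matching multi-character tokens such as `Cl`, `Br`, and bracket atoms before single letters is what keeps `Cl` from being split into `C` and `l`.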

DataSet/Graph/CNJMolAssembler.py

rebuild_file()

It reconstructs molecules from t-SMILES to generate classical SMILES.

In this study, MolGPT, RNN, VAE, and AAE generative models are used for evaluation.

Acknowledgement

We thank the following Git repositories, which gave us a lot of inspiration:

GPT2: https://github.com/samwisegamjeee/pytorch-transformers
MolGPT: https://github.com/devalab/molgpt
MGM: https://github.com/nyu-dl/dl4chem-mgm
JTVAE: https://github.com/wengon-jin/icml18-jtnn
hgraph2graph: https://github.com/wengong-jin/hgraph2graph
FragDGM: https://github.com/marcopodda/fragment-based-dgm
Guacamol: https://github.com/BenevolentAI/guacamol_baselines
MOSES: https://github.com/molecularsets/moses
