robinlingwood / BIMODAL

Bidirectional Molecule Generation with Recurrent Neural Networks

Please note that up-to-date versions of this code are found here.

This is the supporting code for: Grisoni F., Moret M., Lingwood R., Schneider G., "Bidirectional Molecule Generation with Recurrent Neural Networks". Journal of Chemical Information and Modeling (2020). Available here.

You can use this repository for the generation of SMILES with bidirectional recurrent neural networks (RNNs). In addition to the methods' code, several pre-trained models for each approach are included.

The following methods are implemented:

  • Bidirectional Molecule Design by Alternate Learning (BIMODAL), designed for SMILES generation – see Grisoni et al. 2020.
  • Synchronous Forward Backward RNN (FB-RNN), based on Mou et al. 2016.
  • Neural Autoregressive Distribution Estimator (NADE), re-adapted for SMILES generation from Berglund et al. 2015.
  • Forward RNN, i.e., unidirectional RNN for SMILES generation.

Table of Contents

  1. Prerequisites
  2. Using the code
    1. Sampling from a pre-trained model
    2. Training a New Model
    3. Fine-tuning a model
  3. Authors
  4. License
  5. How to Cite

Prerequisites

This repository can be cloned with the following command:

git clone https://github.com/ETHmodlab/BIMODAL

To install the packages needed to run the code, we recommend using conda. Once conda is installed, you can create the virtual environment from the provided brnn.yml file:

cd path/to/repository/
conda env create -f brnn.yml

To activate the dedicated environment:

conda activate brnn

Your code should now be ready to use!

Using the code

Sampling from a pre-trained model

This repository provides 22 pre-trained models that you can use for sampling (stored in evaluation/). These models were trained for 10 epochs on a set of 271,914 bioactive molecules from ChEMBL22 (Kd/Ki/IC50/EC50 < 1 μM).

To sample SMILES, you can create a new file in model/ and use the Sampler class. For example, to sample from the pre-trained BIMODAL model with 512 units:

from sample import Sampler
experiment_name = 'BIMODAL_fixed_512'
s = Sampler(experiment_name)
s.sample(N=100, stor_dir='../evaluation', T=0.7, fold=[1], epoch=[9], valid=True, novel=True, unique=True, write_csv=True)

Parameters:

  • experiment_name (str): name of the experiment with pre-trained model you want to sample from (you can find pre-trained models in evaluation/)
  • stor_dir (str): directory where the models are stored. The sampled SMILES will also be saved there (if write_csv=True)
  • N (int): number of SMILES to sample
  • T (float): sampling temperature
  • fold (list of int): fold(s) to use for sampling
  • epoch (list of int): epoch(s) to use for sampling
  • valid (bool): if set to True, only valid SMILES are accepted (increases the sampling time)
  • novel (bool): if set to True, only novel SMILES are accepted (increases the sampling time)
  • unique (bool): if set to True, only unique SMILES are accepted (increases the sampling time)
  • write_csv (bool): if set to True, a .csv file of the generated SMILES is exported to the specified directory.

Notes:

  • For the provided pre-trained models, only fold=[1] and epoch=[9] are provided.
  • The list of available models and their description are provided in evaluation/model_names.md
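
To give an intuition for the sampling temperature T, the following minimal sketch shows how dividing token scores by T changes the probabilities from which the next token is drawn. It is an illustration only and not the repository's Sampler code; the function names, the toy vocabulary, and the scores are made up for this example.

```python
import math
import random


def temperature_probs(logits, T=0.7):
    """Turn raw scores into probabilities; T < 1 sharpens, T > 1 flattens."""
    scaled = [x / T for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens, logits, T=0.7, rng=random):
    """Draw one token from the temperature-scaled distribution."""
    probs = temperature_probs(logits, T)
    return rng.choices(tokens, weights=probs, k=1)[0]


tokens = ["C", "N", "O"]   # toy vocabulary, for illustration only
logits = [2.0, 1.0, 0.5]   # made-up scores, not output of a trained model

low_t = temperature_probs(logits, T=0.7)
high_t = temperature_probs(logits, T=1.5)
print("T=0.7:", low_t)
print("T=1.5:", high_t)
print("sampled token:", sample_token(tokens, logits, T=0.7))
```

With a lower T the most likely token receives a larger share of the probability mass, so the sampled SMILES tend to be more conservative; a higher T spreads the mass and yields more diverse (and typically more often invalid) strings.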

Training a New Model

Alternatively, if you want to pre-train a model on your own data, you will need to execute three steps: (i) data processing, (ii) training, and (iii) evaluation. Please be aware that you will need access to a GPU to pre-train your own model, as this is a computationally intensive step.

Preprocessing

Data can be processed by using preprocessing/main_preprocessor.py:

from main_preprocessor import preprocess_data
preprocess_data(filename_in='../data/chembl_smiles', model_type='BIMODAL', starting_point='fixed', augmentation=1)

Parameters:

  • filename_in (str): name of the file containing the SMILES strings (.csv or .tar.xz)
  • model_type (str): name of the chosen generative method
  • starting_point (str): starting point type ('fixed' or 'random')
  • augmentation (int): augmentation folds [Default = 1]

Notes:

  • In preprocessing/main_preprocessor.py you will find information on advanced pre-processing options (e.g., stereochemistry, canonicalization).
  • Please note that the pre-processed data has to be stored in data/.
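
The molecular_size parameter used later for training refers to the length of each string after padding. As a rough sketch of what padding to a fixed length means, the snippet below wraps a SMILES string in a start and an end token and fills the rest with a padding token. Only the start token G is named in this README; the end token E, the padding token A, and the function name are assumptions for this illustration, and the repository's preprocessing may place the tokens differently.

```python
def pad_smiles(smiles, molecular_size, start="G", end="E", pad="A"):
    """Wrap a SMILES string in start/end tokens and right-pad it to a fixed length.

    'E' and 'A' are placeholder tokens chosen for this sketch only.
    Characters are treated as tokens, which ignores two-letter atoms
    such as Cl or Br.
    """
    wrapped = start + smiles + end
    if len(wrapped) > molecular_size:
        raise ValueError("SMILES does not fit into molecular_size")
    return wrapped + pad * (molecular_size - len(wrapped))


padded = pad_smiles("CCO", molecular_size=8)
print(padded)  # GCCOEAAA
```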

Training

Training requires a parameter file (.ini) with a given set of parameters. You can find examples for all models in experiments/, and further details about the parameters below:

| Section | Parameter | Description | Comments |
| --- | --- | --- | --- |
| Model | model | Type | ForwardRNN, FBRNN, BIMODAL, NADE |
| | hidden_units | Number of hidden units | Suggested value: 256 for ForwardRNN, FBRNN and NADE; 128 for BIMODAL |
| | generation | To be defined only for NADE (other models defined through preprocessing) | fixed, random |
| Data | data | Name of data file | Has to be located in data/ |
| | encoding_size | Number of different SMILES tokens | 55 |
| | molecular_size | Length of string with padding | See preprocessing |
| | missing_token | To add in the parameter file only for NADE | M |
| Training | epochs | Number of epochs | Suggested value: 10 |
| | learning_rate | Learning rate | Suggested value: 0.001 |
| | n_folds | Folds in cross-validation | See below: More than 1 for cross_validation, 1 to use only one fold of the data for validation |
| | batch_size | Batch size | Suggested value: 128 |
| Evaluation | samples | Number of generated SMILES after each epoch | |
| | temp | Sampling temperature | Suggested value: 0.7 |
| | starting_token | Starting token for sampling | G for all models except NADE, which requires a sequence consisting of missing values (see publication) |

Note:

  • Be aware that values such as the number of tokens or the missing token for NADE have to be defined as in the table above. We kept them as parameters so that you can easily change them if you wish to use this code for other applications.
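
As a rough illustration of how such a parameter file can be laid out and read, the sketch below parses an .ini-style string with Python's standard configparser module. The section and parameter names follow the table; the exact spelling of the section headers in the files under experiments/ may differ, and the values for data, molecular_size, n_folds and samples are placeholders chosen for this example.

```python
import configparser

# Hypothetical parameter file content; check experiments/ for the real templates.
EXAMPLE_INI = """
[Model]
model = BIMODAL
hidden_units = 128

[Data]
data = my_smiles
encoding_size = 55
molecular_size = 100

[Training]
epochs = 10
learning_rate = 0.001
n_folds = 5
batch_size = 128

[Evaluation]
samples = 100
temp = 0.7
starting_token = G
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE_INI)

# Typed accessors convert the stored strings to int / float.
hidden_units = config.getint("Model", "hidden_units")
learning_rate = config.getfloat("Training", "learning_rate")
starting_token = config.get("Evaluation", "starting_token")

print(hidden_units, learning_rate, starting_token)
```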

Options for training:

  • Cross-validation:
from trainer import Trainer

t = Trainer(experiment_name = 'BIMODAL_fixed_512')
t.cross_validation(stor_dir = '../evaluation/', restart = False)
  • Single run: 1/n_folds of data used for validation
from trainer import Trainer

t = Trainer(experiment_name = 'BIMODAL_fixed_512')
t.single_run(stor_dir = '../evaluation/', restart = False)

Parameters:

  • experiment_name: Name of parameter file (.ini)
  • stor_dir: Directory where outputs can be found
  • restart: If True, automatic restart from saved models (e.g. to be used if your training was interrupted before completion)
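
The difference between the two training options can be pictured with a small index-splitting sketch. It is a generic illustration of k-fold splitting, not the Trainer implementation; the function names and the way items are assigned to folds are assumptions made for this example.

```python
def make_folds(n_items, n_folds):
    """Assign item indices to n_folds nearly equal, non-overlapping folds."""
    return [list(range(i, n_items, n_folds)) for i in range(n_folds)]


def split(n_items, n_folds, val_fold):
    """Return (train, validation) indices with fold `val_fold` held out."""
    folds = make_folds(n_items, n_folds)
    val = folds[val_fold]
    train = [i for k, fold in enumerate(folds) if k != val_fold for i in fold]
    return train, val


# Cross-validation: every fold serves once as the validation set.
cv_splits = [split(10, 5, k) for k in range(5)]

# Single run: one fold (1/n_folds of the data) is held out for validation.
train_idx, val_idx = split(10, 5, 0)
print(len(train_idx), len(val_idx))  # 8 2
```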

Evaluation

You can evaluate the outputs of your experiment with evaluation/main_evaluator.py, which offers the following options:

from evaluation import Evaluator

stor_dir = '../evaluation/'
e = Evaluator(experiment_name = 'BIMODAL_fixed_512')
# Plot training and validation loss within one figure
e.eval_training_validation(stor_dir=stor_dir)
# Plot percentage of novel, valid and unique SMILES
e.eval_molecule(stor_dir=stor_dir)

Parameters:

  • experiment_name: Name of parameter file (.ini)
  • stor_dir: Directory where outputs can be found

Note:

  • In this example, the loss plot is saved as '{experiment_name}/statistic/all_statistic.png'.
  • In this example, the plot of novel, valid and unique SMILES is saved as '../evaluation/{experiment_name}/molecules/novel_valid_unique_molecules.png'.
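
For orientation, the percentages of valid, unique and novel SMILES could be computed along the following lines. This is a simplified sketch and not the Evaluator code: the function name is made up, the denominators are one possible convention, strings are compared as written rather than as canonical SMILES, and the validity check is a placeholder that would normally be delegated to a cheminformatics toolkit.

```python
def generation_stats(sampled, training_set, is_valid=lambda s: True):
    """Return percentages of valid, unique and novel strings among `sampled`.

    `is_valid` is a placeholder callable; by default every string counts as valid.
    """
    n = len(sampled)
    if n == 0:
        return {"valid": 0.0, "unique": 0.0, "novel": 0.0}
    valid = [s for s in sampled if is_valid(s)]
    unique = set(valid)                   # duplicates removed
    novel = unique - set(training_set)    # not seen during training
    return {
        "valid": 100.0 * len(valid) / n,
        "unique": 100.0 * len(unique) / n,
        "novel": 100.0 * len(novel) / n,
    }


sampled = ["CCO", "CCO", "CCN", "c1ccccc1"]   # toy samples
training_set = {"CCN"}                        # toy training data
stats = generation_stats(sampled, training_set)
print(stats)  # {'valid': 100.0, 'unique': 75.0, 'novel': 50.0}
```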

Fine-tuning a model

Fine-tuning requires a pre-trained model and a parameter file (.ini). Examples of the parameter files (BIMODAL and ForwardRNN) are provided in experiments/.

You can start the fine-tuning procedure with model/main_fine_tuner.py. The parameter file uses the following parameters:

| Section | Parameter | Description | Comments |
| --- | --- | --- | --- |
| Model | model | Type | ForwardRNN, FBRNN, BIMODAL, NADE |
| | hidden_units | Number of hidden units | Suggested value: 256 for ForwardRNN, FBRNN and NADE; 128 for BIMODAL |
| | generation | Only NADE (other models defined through preprocessing) | fixed, random |
| Data | data | Name of data file | Has to be located in data/ |
| | encoding_size | Number of different SMILES tokens | 55 |
| | molecular_size | Length of string with padding | See preprocessing |
| | missing_token | To add in the parameter file only for NADE | M |
| Training | epochs | Number of epochs | Suggested value: 10 |
| | learning_rate | Learning rate | Suggested value: 0.001 |
| | batch_size | Batch size | Suggested value: 128 |
| Evaluation | samples | Number of generated SMILES after each epoch | |
| | temp | Sampling temperature | Suggested value: 0.7 |
| | starting_token | Starting token for sampling | G for all models except NADE, which requires a sequence consisting of missing values (see publication) |
| Fine-Tuning | start_model | Name of pre-trained model to be used for fine-tuning | |

To fine-tune a model, you can run:

from fine_tuner import FineTuner  # module name assumed by analogy with trainer.py and sample.py

t = FineTuner(experiment_name = 'BIMODAL_random_512_FineTuning_template')
t.fine_tuning(stor_dir='../evaluation/', restart=False)

Parameters:

  • experiment_name: Name of parameter file (.ini)
  • stor_dir: Directory where outputs can be found
  • restart: If True, automatic restart from saved models (e.g. to be used if your training was interrupted before completion)

Note:

  • The batch size should not exceed the number of SMILES that you have in your fine-tuning file (taking into account the data augmentation).
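
A quick sanity check for this constraint might look as follows. The sketch assumes that every input SMILES yields exactly `augmentation` strings after preprocessing, which is an upper bound if augmentation produces duplicates; the function name is hypothetical.

```python
def check_batch_size(n_smiles, augmentation, batch_size):
    """Raise if batch_size exceeds the (assumed) number of fine-tuning strings."""
    effective = n_smiles * augmentation  # assumption: augmentation multiplies the set size
    if batch_size > effective:
        raise ValueError(
            f"batch_size={batch_size} exceeds the {effective} available SMILES"
        )
    return effective


# 20 fine-tuning molecules with 5-fold augmentation and a batch size of 64
effective = check_batch_size(n_smiles=20, augmentation=5, batch_size=64)
print(effective)  # 100
```

If the check fails, either lower batch_size in the parameter file or increase the augmentation during preprocessing.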

Authors

See also the list of contributors who participated in this project.

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

How to Cite

If you use this code (or parts thereof), please cite it as:

@article{grisoni2020,
  title={Bidirectional Molecule Generation with Recurrent Neural Networks},
  author={Grisoni, Francesca and Moret, Michael and Lingwood, Robin and Schneider, Gisbert},
  journal={Journal of Chemical Information and Modeling},
  volume={Article ASAP},
  number={},
  pages={},
  year={2020},
  doi = {10.1021/acs.jcim.9b00943},
  url = {https://pubs.acs.org/doi/10.1021/acs.jcim.9b00943},
 publisher={ACS Publications}
}
