zaixizhang / mol_opt

mol_opt: A Benchmark for Practical Molecular Optimization



This repository hosts an open-source benchmark for Practical Molecular Optimization (PMO), to facilitate the transparent and reproducible evaluation of algorithmic advances in molecular optimization. This repository supports 25 molecular design algorithms on 23 tasks with a particular focus on sample efficiency (oracle calls). The preprint version of the paper is available at https://arxiv.org/pdf/2206.12411.pdf
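Because sample efficiency is measured in oracle calls, it can help to picture the oracle as a scorer with a hard call budget. The following is a minimal sketch under stated assumptions, not the benchmark's own oracle class: `BudgetedOracle`, `max_calls`, `top_k_mean`, and the toy scorer are hypothetical names introduced here, and counting only unique molecules against the budget is an assumption.

```python
from typing import Callable, Dict, List


class BudgetedOracle:
    """Wrap a scoring function, count unique calls, and stop at a fixed budget.

    Hypothetical helper for illustration; not the benchmark's own oracle class.
    """

    def __init__(self, score_fn: Callable[[str], float], max_calls: int) -> None:
        self.score_fn = score_fn
        self.max_calls = max_calls
        self.cache: Dict[str, float] = {}  # molecule string -> score

    @property
    def n_calls(self) -> int:
        # Assumption: only unique molecules consume budget; repeats hit the cache.
        return len(self.cache)

    @property
    def finished(self) -> bool:
        return self.n_calls >= self.max_calls

    def __call__(self, smiles: str) -> float:
        if smiles in self.cache:
            return self.cache[smiles]
        if self.finished:
            raise RuntimeError("oracle budget exhausted")
        score = self.score_fn(smiles)
        self.cache[smiles] = score
        return score

    def top_k_mean(self, k: int) -> float:
        """Mean of the k best scores seen so far (fewer if the cache is smaller)."""
        best: List[float] = sorted(self.cache.values(), reverse=True)[:k]
        return sum(best) / len(best) if best else 0.0


if __name__ == "__main__":
    # Toy scorer (an assumption): fraction of "C" characters in the string.
    def toy_score(smiles: str) -> float:
        return smiles.count("C") / max(len(smiles), 1)

    oracle = BudgetedOracle(toy_score, max_calls=3)
    for s in ["CCO", "CCC", "CCO", "CN"]:
        print(s, oracle(s))
    print("calls used:", oracle.n_calls, "top-2 mean:", oracle.top_k_mean(2))
```

An optimizer written against such a wrapper can poll `finished` in its main loop, so every method is compared at the same number of scored molecules.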

Install

conda create -n molopt python=3.7
conda activate molopt 
pip install torch 
pip install PyTDC 
pip install PyYAML
conda install -c rdkit rdkit 
pip install wandb   
wandb login  # requires a registered wandb account

We recommend using PyTorch 1.10.2 and PyTDC 0.3.6.

Activate conda

conda activate molopt 

25 Methods

Based on their ML methodology, the methods are categorized as follows:

  • virtual screening
    • screening randomly searches the ZINC database.
    • molpal uses a molecular property predictor to prioritize high-scoring molecules.
  • GA (genetic algorithm)
    • graph_ga based on molecular graph.
    • smiles_ga based on SMILES
    • selfies_ga based on SELFIES
    • stoned based on SELFIES
    • synnet based on synthesis
  • VAE (variational auto-encoder)
    • smiles_vae based on SMILES
    • selfies_vae based on SELFIES
    • jt_vae based on junction tree (fragment as building block)
    • dog_ae based on synthesis
  • BO (Bayesian optimization)
    • gpbo
  • RL (reinforcement learning)
    • reinvent
    • reinvent_selfies
    • graphinvent
    • moldqn
  • HC (hill climbing)
    • smiles_lstm_hc is SMILES-level HC.
    • smiles_ahc is SMILES-level augmented HC.
    • selfies_lstm_hc is SELFIES-level HC
    • mimosa is graph-level HC
    • dog_gen is synthesis based HC
  • gradient (gradient ascent)
    • dst is based on the molecular graph.
    • pasithea is based on SELFIES.
  • SBM (score-based modeling)
    • gflownet
    • gflownet_al
    • mars

time is the approximate average wall-clock time for a single run in our benchmark; it does not include pretraining or data preprocessing. We have already processed the data and pretrained the models, and both are available in the repository.

| method | assembly | additional package | time | requires_gpu |
|---|---|---|---|---|
| screening | - | - | 2 min | no |
| molpal | - | ray | 1 hour | no |
| graph_ga | fragment | joblib | 3 min | no |
| smiles_ga | SMILES | joblib, nltk | 2 min | no |
| stoned | SELFIES | - | 3 min | no |
| selfies_ga | SELFIES | selfies | 20 min | no |
| graph_mcts | atom | - | 2 min | no |
| smiles_lstm_hc | SMILES | guacamol | 4 min | no |
| smiles_ahc | SMILES | | 4 min | no |
| selfies_lstm_hc | SELFIES | guacamol, selfies | 4 min | yes |
| smiles_vae | SMILES | botorch | 20 min | yes |
| selfies_vae | SELFIES | botorch, selfies | 20 min | yes |
| jt_vae | fragment | botorch | 20 min | yes |
| gpbo | fragment | botorch, networkx | 15 min | no |
| reinvent | SMILES | - | 2 min | yes |
| reinvent_selfies | SELFIES | selfies | 3 min | yes |
| moldqn | atom | networks, requests | 60 min | yes |
| mimosa | fragment | - | 10 min | yes |
| mars | fragment | chemprop, networkx, dgl | 20 min | yes |
| dog_gen | synthesis | extra conda | 120 min | yes |
| dog_ae | synthesis | extra conda | 50 min | yes |
| synnet | synthesis | dgl, pytorch_lightning, networkx, matplotlib | 2-5 hours | yes |
| pasithea | SELFIES | selfies, matplotlib | 50 min | yes |
| dst | fragment | - | 120 min | no |
| gflownet | fragment | torch_{geometric,sparse,cluster}, pdb | 30 min | yes |
| gflownet_al | fragment | torch_{geometric,sparse,cluster}, pdb | 30 min | yes |

Run with one-line code

There are three types of runs defined in our code base:

  • simple: A single run for each oracle, for testing purposes. This is the default.
  • production: Multiple independent runs with various random seeds for each oracle.
  • tune: A hyper-parameter tuning over the search space defined in main/MODEL_NAME/hparam_tune.yaml for each oracle.
## run a single test run on qed with wandb logging online
python run.py MODEL_NAME --wandb online
## specify multiple random seeds 
python run.py MODEL_NAME --seed 0 1 2 
## run 5 runs with different random seeds with specific oracle with wandb logging offline
python run.py MODEL_NAME --task production --n_runs 5 --oracles qed 
## run a hyper-parameter tuning starting from smiles in a smi_file, 30 runs in total without wandb logging
python run.py MODEL_NAME --task tune --n_runs 30 --smi_file XX --wandb disabled --other_args XX 

Valid values for MODEL_NAME are listed in the table above.
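To make the three run types concrete, here is a rough sketch of how a command line like the ones above could expand into individual jobs. It is not the actual run.py: only the flag names are taken from the commands above. The helper names (`build_parser`, `plan_runs`), the defaults inferred from the comments (qed, offline logging), the `--n_runs` default, and the choice of seeds 0..n_runs-1 for production runs are all assumptions.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction: flag names come from the README commands;
    # defaults and choices are assumptions, not the repository's real values.
    parser = argparse.ArgumentParser(description="Sketch of a run.py-style CLI")
    parser.add_argument("method", help="MODEL_NAME, e.g. graph_ga")
    parser.add_argument("--task", choices=["simple", "production", "tune"],
                        default="simple")
    parser.add_argument("--n_runs", type=int, default=5)
    parser.add_argument("--seed", type=int, nargs="+", default=[0])
    parser.add_argument("--oracles", nargs="+", default=["qed"])
    parser.add_argument("--smi_file", default=None)
    parser.add_argument("--wandb", choices=["online", "offline", "disabled"],
                        default="offline")
    return parser


def plan_runs(argv: Optional[List[str]] = None) -> List[dict]:
    """Expand parsed arguments into one job description per unit of work."""
    args = build_parser().parse_args(argv)
    jobs: List[dict] = []
    for oracle in args.oracles:
        if args.task == "tune":
            # One sweep per oracle, with n_runs sampled configurations.
            jobs.append({"method": args.method, "oracle": oracle,
                         "task": "tune", "n_trials": args.n_runs})
            continue
        if args.task == "production":
            seeds = list(range(args.n_runs))  # assumption: seeds 0..n_runs-1
        else:
            seeds = args.seed
        for seed in seeds:
            jobs.append({"method": args.method, "oracle": oracle,
                         "task": args.task, "seed": seed})
    return jobs


if __name__ == "__main__":
    for job in plan_runs(["graph_ga", "--task", "production",
                          "--n_runs", "3", "--oracles", "qed"]):
        print(job)
```

Separating parsing from job planning keeps the expansion logic testable without launching any optimizer.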

Hyperparameters

We separate hyperparameters into task-level controls, defined via argparse, and algorithm-level controls, defined in hparam_default.yaml. There is no clear boundary between the two, but we recommend keeping all hyperparameters in the self._optimize function as task-level.

  • running hyperparameter: parser argument.
  • default model hyperparameter: hparam_default.yaml
  • tuning model hyperparameter: hparam_tune.yaml

For algorithm-level hyperparameters, we adopt the straightforward yaml file format. One should define a default set of hyper-parameters in main/MODEL_NAME/hparam_default.yaml:

population_size: 50
offspring_size: 100
mutation_rate: 0.02
patience: 5
max_generations: 1000
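To illustrate how hyperparameters like these typically interact, here is a toy genetic algorithm over bit strings driven by the same five keys. It is a sketch under assumptions, not graph_ga: the bit-string genome, the 1-bit-counting fitness, and the reading of patience as "stop after this many generations without improvement" are all choices made for this example.

```python
import random
from typing import Dict, Tuple

# Same keys and values as the default config above, held in a plain dict.
HPARAMS = {
    "population_size": 50,
    "offspring_size": 100,
    "mutation_rate": 0.02,
    "patience": 5,
    "max_generations": 1000,
}

Genome = Tuple[int, ...]


def fitness(genome: Genome) -> int:
    # Toy objective (assumption): number of 1-bits, standing in for an oracle score.
    return sum(genome)


def crossover(a: Genome, b: Genome, rng: random.Random) -> Genome:
    cut = rng.randrange(1, len(a))  # single-point crossover
    return a[:cut] + b[cut:]


def mutate(genome: Genome, rate: float, rng: random.Random) -> Genome:
    # Flip each bit independently with probability `rate`.
    return tuple(1 - g if rng.random() < rate else g for g in genome)


def run_ga(hp: Dict[str, float], genome_len: int = 32, seed: int = 0) -> Tuple[Genome, int]:
    """Return the best genome found and the number of generations executed."""
    rng = random.Random(seed)
    population = [tuple(rng.randint(0, 1) for _ in range(genome_len))
                  for _ in range(hp["population_size"])]
    best = max(population, key=fitness)
    stale = 0
    generation = 0
    for generation in range(1, hp["max_generations"] + 1):
        offspring = []
        for _ in range(hp["offspring_size"]):
            a, b = rng.sample(population, 2)
            offspring.append(mutate(crossover(a, b, rng), hp["mutation_rate"], rng))
        # Keep the fittest population_size individuals among parents and offspring.
        population = sorted(population + offspring, key=fitness,
                            reverse=True)[:hp["population_size"]]
        if fitness(population[0]) > fitness(best):
            best, stale = population[0], 0
        else:
            stale += 1
            if stale >= hp["patience"]:  # early stop: no recent improvement
                break
    return best, generation


if __name__ == "__main__":
    best, generations = run_ga(HPARAMS)
    print("best fitness:", fitness(best), "generations:", generations)
```

In this toy setup, population_size and offspring_size set how many candidates are scored per generation, which is what ties these settings to the oracle-call budget.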

The search space for hyper-parameter tuning is defined in main/MODEL_NAME/hparam_tune.yaml:

name: graph_ga
method: random
metric:
  goal: maximize
  name: avg_top100
parameters:
  population_size:
    values: [20, 40, 50, 60, 80, 100, 150, 200]
  offspring_size:
    values: [50, 100, 200, 300]
  mutation_rate:
    distribution: uniform
    min: 0
    max: 0.1
  patience:
    value: 5
  max_generations:
    value: 1000

We use the sweep function in wandb for convenient visualization. The yaml file should follow the format above. Further details are available in the wandb sweep documentation.
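For intuition about what method: random does with this search space, the sketch below draws one configuration per call from the same parameter specs, written as a Python dict. It is a stand-alone approximation, not wandb's implementation: `sample_config` is a hypothetical helper, and it handles only the three spec forms used above (value, values, and a uniform distribution with min and max).

```python
import random
from typing import Any, Dict

# The tuning config above, transcribed as a Python dict.
SWEEP = {
    "name": "graph_ga",
    "method": "random",
    "metric": {"goal": "maximize", "name": "avg_top100"},
    "parameters": {
        "population_size": {"values": [20, 40, 50, 60, 80, 100, 150, 200]},
        "offspring_size": {"values": [50, 100, 200, 300]},
        "mutation_rate": {"distribution": "uniform", "min": 0, "max": 0.1},
        "patience": {"value": 5},
        "max_generations": {"value": 1000},
    },
}


def sample_config(parameters: Dict[str, Dict[str, Any]],
                  rng: random.Random) -> Dict[str, Any]:
    """Draw one hyperparameter configuration from the search space."""
    config: Dict[str, Any] = {}
    for name, spec in parameters.items():
        if "value" in spec:  # fixed constant, never varied
            config[name] = spec["value"]
        elif "values" in spec:  # categorical: pick one listed option
            config[name] = rng.choice(spec["values"])
        elif spec.get("distribution") == "uniform":  # continuous range
            config[name] = rng.uniform(spec["min"], spec["max"])
        else:
            raise ValueError(f"unsupported parameter spec for {name!r}: {spec}")
    return config


if __name__ == "__main__":
    rng = random.Random(0)
    for trial in range(3):
        print(trial, sample_config(SWEEP["parameters"], rng))
```

Each sampled configuration would then be scored by the metric named in the config (avg_top100, to be maximized); the sampler itself never looks at the metric.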

Contribute

Our repository is an open-source initiative. To contribute a better set of parameters or include your model in our benchmark, check our Contribution Guidelines!

About

License: MIT License

