This is the official structure-based drug design (SBDD) codebase of the paper
"Geometry-Complete Diffusion for 3D Molecule Generation and Optimization", published in Communications Chemistry (Nature Portfolio).
- System requirements
- Installation guide
- Tutorials
- Demo
- Instructions for use
- Acknowledgements
- License
- Citation
This package supports Linux. The package has been tested on the following Linux system:
Description: AlmaLinux release 8.9 (Midnight Oncilla)
This package is developed and tested under Python 3.10.x. The primary Python packages and their versions are as follows. For more details, please refer to the environment.yaml file.
hydra-core=1.3.2
matplotlib-base=3.7.1
numpy=1.24.3
pyg=2.3.0=py310_torch_2.0.0_cu118
python=3.10.11
pytorch=2.0.1=py3.10_cuda11.8_cudnn8.7.0_0
pytorch-scatter=2.1.1=py310_torch_2.0.0_cu118
pytorch-lightning=2.0.2
scikit-learn=1.2.2
torchmetrics=0.11.4

Install mamba (~500 MB: ~1 minute)
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh # accept all terms and install to the default location
rm Mambaforge-$(uname)-$(uname -m).sh # (optionally) remove installer after using it
source ~/.bashrc # alternatively, one can restart their shell session to achieve the same result

Install dependencies (~15 GB: ~10 minutes)
# clone project
git clone https://github.com/BioinfoMachineLearning/GCDM-SBDD
cd GCDM-SBDD
# create conda environment
mamba env create -f environment.yaml
conda activate GCDM-SBDD # note: one still needs to use `conda` to (de)activate environments
# install local project as package
pip3 install -e .

Download checkpoints (~500 MB extracted: ~2 minutes)
Note: Make sure you are in the project's root directory beforehand (e.g., ~/GCDM-SBDD/)
# fetch and extract model checkpoints directory
wget https://zenodo.org/record/13375913/files/GCDM_SBDD_Checkpoints.tar.gz
tar -xzf GCDM_SBDD_Checkpoints.tar.gz
rm GCDM_SBDD_Checkpoints.tar.gz

NOTE: Trained EGNN baseline checkpoint files are also included in GCDM_SBDD_Checkpoints.tar.gz.
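After the environment and checkpoints are in place, a quick sanity check can catch a broken install early. The following is a minimal sketch, not part of the project: the helper names are hypothetical, the version pins are copied from the package list above (only packages whose conda name is assumed to match the installed distribution name), and the `checkpoints` directory name is an assumption about what the archive extracts to.

```python
from importlib import metadata
from pathlib import Path

# Pins copied from the package list above. Only packages whose conda name
# is assumed to match the installed distribution name are checked here.
EXPECTED_VERSIONS = {
    "hydra-core": "1.3.2",
    "numpy": "1.24.3",
    "scikit-learn": "1.2.2",
    "torchmetrics": "0.11.4",
}


def check_versions(expected):
    """Return {package: (expected, installed_or_None, matches)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        report[name] = (wanted, found, found == wanted)
    return report


def list_checkpoints(root="checkpoints"):
    """Return sorted relative paths of all .ckpt files under root.

    The default directory name is an assumption, not confirmed by the project.
    """
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(str(p.relative_to(root)) for p in root.rglob("*.ckpt"))


if __name__ == "__main__":
    for name, (wanted, found, ok) in check_versions(EXPECTED_VERSIONS).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: expected {wanted}, found {found} ({status})")
    print("checkpoints:", list_checkpoints() or "none found")
```

Run it from the project's root directory with the GCDM-SBDD environment activated; a mismatch is only a hint to re-check environment.yaml, not necessarily an error.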
For docking, download QuickVina 2 and copy it to your Conda environment's binary (bin) directory:
wget https://github.com/QVina/qvina/raw/master/bin/qvina2.1
chmod +x qvina2.1
mv qvina2.1 $HOME/mambaforge/envs/GCDM-SBDD/bin

MGLTools is needed to prepare the receptor for docking (pdb -> pdbqt), but it can break your Mamba environment, so we recommend creating a separate one:
mamba create -n mgltools -c bioconda mgltools

Download the dataset (Binding MOAD):
wget https://zenodo.org/record/13375913/files/every_part_a.zip
wget https://zenodo.org/record/13375913/files/every_part_b.zip
wget https://zenodo.org/record/13375913/files/every.csv
unzip every_part_a.zip
unzip every_part_b.zip

Process the raw data using
python process_bindingmoad.py <bindingmoad_dir>

or, to suppress warnings,
python -W ignore process_bindingmoad.py <bindingmoad_dir>

Download and extract the CrossDocked dataset as described by the authors of Pocket2Mol: https://github.com/pengxingang/Pocket2Mol/tree/main/data
Process the raw data using
python process_crossdock.py <crossdocked_dir> --no_H

We provide a two-part tutorial series of Jupyter notebooks that walks users through a real-world example of using GCDM-SBDD for pocket-based molecule generation and filtering.
To sample small molecules for a given pocket with a trained model, use the following command:
python generate_ligands.py <checkpoint>.ckpt --pdbfile <pdb_file>.pdb --outdir <output_dir> --resi_list <list_of_pocket_residue_ids>

For example:
python generate_ligands.py last.ckpt --pdbfile 1abc.pdb --outdir results/ --resi_list A:1 A:2 A:3 A:4 A:5 A:6 A:7

Alternatively, the binding pocket can be specified based on a reference ligand in the same PDB file:
python generate_ligands.py <checkpoint>.ckpt --pdbfile <pdb_file>.pdb --outdir <output_dir> --ref_ligand <chain>:<resi>

Optional flags:
| Flag | Description |
|---|---|
| --n_samples | Number of sampled molecules |
| --all_frags | Keep all disconnected fragments |
| --sanitize | Sanitize molecules (invalid molecules will be removed if this flag is present) |
| --relax | Relax generated structure in force field |
| --resamplings | Inpainting parameter (doesn't apply if conditional model is used) |
| --jump_length | Inpainting parameter (doesn't apply if conditional model is used) |
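When sampling for many pockets, it can help to assemble the command line programmatically. The sketch below is illustrative only: `residue_ids` and `build_generate_command` are hypothetical helpers, not part of the project, and they use only the flags documented above. Requiring exactly one of `--resi_list` or `--ref_ligand` is an assumption of this sketch, not a documented constraint of generate_ligands.py.

```python
import sys


def residue_ids(chain, start, stop):
    """Return residue IDs in <chain>:<resi> form, e.g. A:1 ... A:7 (stop inclusive)."""
    return [f"{chain}:{i}" for i in range(start, stop + 1)]


def build_generate_command(checkpoint, pdbfile, outdir, resi_list=None,
                           ref_ligand=None, n_samples=None, sanitize=False):
    """Build the argument list for generate_ligands.py (hypothetical helper)."""
    # Assumption of this sketch: the pocket is given in exactly one of two ways.
    if (resi_list is None) == (ref_ligand is None):
        raise ValueError("specify exactly one of resi_list or ref_ligand")

    cmd = [sys.executable, "generate_ligands.py", checkpoint,
           "--pdbfile", pdbfile, "--outdir", outdir]
    if resi_list is not None:
        cmd += ["--resi_list", *resi_list]
    else:
        cmd += ["--ref_ligand", ref_ligand]
    if n_samples is not None:
        cmd += ["--n_samples", str(n_samples)]
    if sanitize:
        cmd.append("--sanitize")
    return cmd


if __name__ == "__main__":
    # Mirrors the example command above; printed rather than executed.
    example = build_generate_command(
        "last.ckpt", "1abc.pdb", "results/", resi_list=residue_ids("A", 1, 7)
    )
    print(" ".join(example))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the project's root directory; passing a list rather than a shell string avoids quoting problems in file paths.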
Starting a new training run:
python -u train.py config=<config>.yml

Resuming a previous run:
python -u train.py config=<config>.yml resume=<checkpoint>.ckpt

test.py can be used to sample molecules for the entire testing set:
python test.py <checkpoint>.ckpt --test_dir <bindingmoad_dir>/processed_noH/test/ --outdir <output_dir> --fix_n_nodes

The optional --fix_n_nodes flag makes the model produce ligands with the same number of nodes as the original molecule. The other optional flags are identical to those of generate_ligands.py.
For assessing basic molecular properties, create an instance of the MoleculeProperties class and run its evaluate method:
from analysis.metrics import MoleculeProperties
mol_metrics = MoleculeProperties()
all_qed, all_sa, all_logp, all_lipinski, per_pocket_diversity = \
    mol_metrics.evaluate(pocket_mols)

evaluate() expects a list of lists, where each inner list contains all RDKit molecules generated for one pocket.
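The nested input format can be built from generated files as sketched below. This is an illustration under stated assumptions, not the project's own loader: `nest_by_pocket` and `load_records` are hypothetical helpers, and `load_records` assumes one SDF file per pocket, named after the pocket. `Chem.SDMolSupplier` is RDKit's standard SDF reader and yields `None` for entries it cannot parse.

```python
from collections import defaultdict


def nest_by_pocket(records):
    """Group (pocket_id, molecule) pairs into a list of lists, one inner list per pocket."""
    grouped = defaultdict(list)
    for pocket_id, mol in records:
        if mol is not None:  # RDKit returns None for molecules it cannot parse
            grouped[pocket_id].append(mol)
    # Sort by pocket ID so the outer list has a reproducible order.
    return [grouped[pocket_id] for pocket_id in sorted(grouped)]


def load_records(sdf_paths):
    """Yield (pocket_id, molecule) pairs from SDF files.

    Assumption: each file holds all molecules for one pocket and is named after it.
    """
    from pathlib import Path
    from rdkit import Chem  # imported lazily so nest_by_pocket works without RDKit

    for path in sdf_paths:
        for mol in Chem.SDMolSupplier(str(path)):
            yield Path(path).stem, mol
```

With these helpers, `pocket_mols = nest_by_pocket(load_records(sdf_paths))` produces the list of lists that evaluate() expects; pockets whose molecules all failed to parse are dropped.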
For computing docking scores, run QuickVina as described below.
First, convert all protein PDB files to PDBQT files using MGLTools:
conda activate mgltools
cd analysis
python2 docking_py27.py <bindingmoad_dir>/processed_noH/test/ <output_dir> bindingmoad
cd ..
conda deactivate

Then, compute QuickVina scores:
conda activate GCDM-SBDD
python3 analysis/docking.py --pdbqt_dir <docking_py27_outdir> --sdf_dir <test_outdir> --out_dir <qvina_outdir> --write_csv --write_dict --dataset moad

NOTE: One can reference analysis/inference_analysis.py and analysis/molecule_analysis.py to analyze the generated molecules.
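The two docking steps above run in different Conda environments, which makes them awkward to script. One possible approach, sketched here as an assumption rather than a supported workflow, is to use `conda run -n <env>` so that neither environment needs to be activated by hand. `build_docking_commands` is a hypothetical helper; the script names and flags are copied from the commands above.

```python
from pathlib import Path


def build_docking_commands(test_dir, pdbqt_dir, sdf_dir, qvina_dir):
    """Return [(command, working_directory), ...] for the two docking steps."""
    # Resolve to absolute paths because step 1 runs from inside analysis/.
    test_dir, pdbqt_dir, sdf_dir, qvina_dir = (
        str(Path(p).resolve()) for p in (test_dir, pdbqt_dir, sdf_dir, qvina_dir)
    )
    # Step 1: PDB -> PDBQT conversion in the separate mgltools environment.
    prepare = [
        "conda", "run", "-n", "mgltools",
        "python2", "docking_py27.py", test_dir, pdbqt_dir, "bindingmoad",
    ]
    # Step 2: QuickVina scoring in the main GCDM-SBDD environment.
    score = [
        "conda", "run", "-n", "GCDM-SBDD",
        "python3", "analysis/docking.py",
        "--pdbqt_dir", pdbqt_dir,
        "--sdf_dir", sdf_dir,
        "--out_dir", qvina_dir,
        "--write_csv", "--write_dict",
        "--dataset", "moad",
    ]
    return [(prepare, "analysis"), (score, ".")]
```

From the project's root directory, each step can then be executed in order with `subprocess.run(command, cwd=working_directory, check=True)`, so a failed conversion stops the run before scoring starts.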
To build this project in a Docker container, you can use the following commands:
## Build the image
docker build -t gcdm-sbdd .
## Run the container (with GPUs and mounting the current directory)
docker run -it --gpus all -v .:/mnt --name gcdm-sbdd gcdm-sbdd

This Docker image is also available on Docker Hub at cford38/gcdm-sbdd, which can be run with the following command:
# docker pull cford38/gcdm-sbdd
docker run -it --gpus all -v .:/mnt --name gcdm-sbdd cford38/gcdm-sbdd

(Note: This image includes the checkpoints in the main working directory /software/GCDM-SBDD/checkpoints/.)
GCDM-SBDD builds upon the source code and data from the following projects:
We thank all their contributors and maintainers!
This project is covered under the MIT License.
If you use the code or data associated with this package or otherwise find this work useful, please cite:
@article{morehead2024geometry,
title={Geometry-complete diffusion for 3D molecule generation and optimization},
author={Morehead, Alex and Cheng, Jianlin},
journal={Communications Chemistry},
volume={7},
number={1},
pages={150},
year={2024},
publisher={Nature Publishing Group UK London}
}