- Sampled molecule SDF files from 3D-SBDD, GraphBP, Pocket2Mol, TargetDiff, and our InterDiff are released on Zenodo; you can get the data here.
- The project code has also been deployed to Google Drive and Google Colab, and you can run the test code directly through the interdiff.ipynb file.
Please use Mamba to manage the environment.
micromamba create -n interdiff python=3.8 -y
micromamba activate interdiff
micromamba install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia
### url: https://pytorch-geometric.com/whl/torch-1.13.1%2Bcu116.html
### SYSTEM_TYPE: win_amd64/linux_x86_64
export SYSTEM_TYPE=linux_x86_64
### We provide the *.whl for torch_geometric in _env.
pip install _env/$SYSTEM_TYPE/torch_scatter-2.1.1+pt113cu116-cp38-cp38-linux_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install _env/$SYSTEM_TYPE/torch_cluster-1.6.1+pt113cu116-cp38-cp38-linux_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install _env/$SYSTEM_TYPE/torch_sparse-0.6.17+pt113cu116-cp38-cp38-linux_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install _env/$SYSTEM_TYPE/torch_spline_conv-1.2.2+pt113cu116-cp38-cp38-linux_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install torch_geometric -i https://pypi.tuna.tsinghua.edu.cn/simple
micromamba install rdkit=2022.03 openbabel tensorboard pyyaml easydict python-lmdb -c conda-forge
pip install meeko==0.1.dev3 scipy pdb2pqr vina==1.2.2 transformers
python -m pip install git+https://github.com/Valdes-Tresanco-MS/AutoDockTools_py3
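After installation, a quick check that the main modules can be located may save debugging time later. The snippet below is a minimal sketch, not part of the repository; the list of import names is an assumption derived from the packages installed above, so adjust it if your setup differs.

```python
import importlib.util

# Import names assumed to correspond to the packages installed above.
REQUIRED_MODULES = [
    "torch", "torch_geometric", "torch_scatter", "torch_cluster",
    "torch_sparse", "torch_spline_conv", "rdkit", "openbabel",
    "lmdb", "yaml", "easydict", "meeko", "vina", "scipy", "transformers",
]


def find_missing(modules):
    """Return the top-level module names that cannot be located."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

This only checks that the modules are discoverable; it does not verify versions or CUDA availability.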
We provide the preprocessed data in the data folder on Google Drive:
pocket_with_prompt.lmdb
prompt_split.lmdb
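To confirm that a downloaded database is intact, you can count its records. This is a minimal sketch using the lmdb module from the python-lmdb package installed above; the example path is hypothetical, and whether the database is stored as a single file or as a directory is detected at run time rather than assumed.

```python
import os


def count_lmdb_entries(db_path):
    """Return the number of records stored in an LMDB database."""
    if not os.path.exists(db_path):
        raise FileNotFoundError(f"LMDB database not found: {db_path}")

    import lmdb  # imported lazily so the path check works without lmdb

    env = lmdb.open(
        db_path,
        subdir=os.path.isdir(db_path),  # LMDB may be a single file or a directory
        readonly=True,
        lock=False,
        readahead=False,
    )
    try:
        return env.stat()["entries"]
    finally:
        env.close()


if __name__ == "__main__":
    # Hypothetical location; change it to wherever you saved the download.
    path = "data/pocket_with_prompt.lmdb"
    if os.path.exists(path):
        print(path, "contains", count_lmdb_entries(path), "records")
```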
If you want to process the dataset from scratch, you need to download the CrossDocked2020 v1.1 data processed by Guan et al. The original dataset is filtered to keep only the data with RMSD < 1 Å.
- Run the script extract_pockets.py to extract the pockets:
python -m scripts.data_preparation.extract_pockets \
    --source_data_path ../interdiff_data/crossdocked_v1.1_rmsd1.0 \
    --save_pocket_path ../interdiff_data/crossdocked_v1.1_rmsd1.0_pocket \
    --save_db_path ../interdiff_data/pocket.lmdb \
    --num_workers 128
- Run the script extract_prompt.py to detect interaction types and add prompts to the dataset:
python -m scripts.data_preparation.extract_prompt \
    --source_data_path ../interdiff_data/crossdocked_v1.1_rmsd1.0 \
    --source_db_path ../interdiff_data/pocket.lmdb \
    --temp_path ../interdiff_data/temp \
    --save_db_path ../interdiff_data/pocket_with_prompt.lmdb \
    --num_workers 128
After processing the data, you will obtain a pocket_with_prompt.lmdb database, which contains the data required for training, and the train/test split file prompt_split.pt.
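To check the split before training, you can load the file and count the entries in each subset. The sketch below assumes that the split file holds a dict mapping split names (for example train and test) to lists of indices; this layout is an assumption and is not confirmed by this README. The helper names are hypothetical.

```python
def summarize_split(split):
    """Map each split name to its number of entries."""
    if not isinstance(split, dict):
        raise TypeError(f"expected a dict of splits, got {type(split).__name__}")
    return {name: len(indices) for name, indices in split.items()}


def load_split(path):
    """Load a split file saved with torch.save."""
    import torch  # imported lazily; only needed when reading a real file

    return torch.load(path)


# Example with an in-memory split (assumed layout):
example = {"train": [0, 1, 2, 3], "test": [4, 5]}
print(summarize_split(example))
```

For a real file, `summarize_split(load_split(path))` would give the same kind of summary, provided the assumed layout holds.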
python train_diffusion.py # Only 100 samples are used for training here, for illustration. You can change the file to train on the whole dataset.
You can get the pretrained model weights here; put the *.pt file in pretrained/checkpoints.
Here we sample ligands for the test set. The processed test set is stored at data/test_data.pt. The sampling.py script can sample multiple targets at the same time: with a batch_size of 10, it samples ten different targets and generates one ligand for each target.
python sampling.py
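The batching described above can be pictured as splitting the list of targets into consecutive chunks. This is only an illustration with a hypothetical helper; it is not the data loader used by the sampling script.

```python
def batch_targets(targets, batch_size=10):
    """Split a list of targets into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [targets[i:i + batch_size] for i in range(0, len(targets), batch_size)]


# Placeholder target names, purely for illustration.
targets = [f"target_{i}" for i in range(25)]
for batch in batch_targets(targets, batch_size=10):
    # One ligand would be generated per target in the batch.
    print(len(batch), "targets in this batch")
```

With 25 targets and a batch size of 10, the last batch holds the remaining 5 targets.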
Once sampling of the protein targets has finished, the samples are saved in the outputs_system folder as equa_last.pt, which stores the coordinates and atom types of the ligands. Next, we reconstruct the molecules from the equa_last.pt file and detect the interactions; the illustrative results are saved in outputs_system/equa_aromatic.csv.
python sampling.py --evaluation True
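If you want to look at a sampled ligand in a molecular viewer before reconstruction, coordinates and atom types can be written out as an XYZ block. The sketch below assumes that atom types are available as atomic numbers and coordinates as (x, y, z) tuples in Å; the actual layout of equa_last.pt is not described here, so treat the input format as an assumption. It does not infer bonds, which the evaluation step handles.

```python
# Atomic number -> element symbol for elements common in drug-like ligands.
ELEMENT_SYMBOLS = {
    1: "H", 5: "B", 6: "C", 7: "N", 8: "O", 9: "F",
    15: "P", 16: "S", 17: "Cl", 35: "Br", 53: "I",
}


def to_xyz_block(atomic_numbers, coords, comment=""):
    """Format atoms as an XYZ block: atom count, comment line, one atom per line."""
    if len(atomic_numbers) != len(coords):
        raise ValueError("atom types and coordinates must have the same length")
    lines = [str(len(atomic_numbers)), comment]
    for number, (x, y, z) in zip(atomic_numbers, coords):
        lines.append(f"{ELEMENT_SYMBOLS[number]} {x:.4f} {y:.4f} {z:.4f}")
    return "\n".join(lines) + "\n"


# Two-atom toy example (not a real sample).
print(to_xyz_block([6, 8], [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0)], comment="toy"))
```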
To sample for an arbitrary pocket, provide the input in PDB (Protein Data Bank) format. The pocket must be extracted beforehand by yourself.
python -m scripts.sample_for_pocket configs/sampling.yml --pdb_path examples.pdb
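If you need to extract a pocket yourself, one common approach is to keep every residue that has at least one atom within a cutoff distance of a reference ligand. The sketch below is a plain-Python illustration of that idea, not the repository's extract_pockets script; the 10 Å radius, the whole-residue selection, and the function names are assumptions.

```python
import math


def parse_atoms(pdb_text):
    """Yield (residue_key, (x, y, z), line) for each ATOM record."""
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        # Fixed-width PDB columns: chain ID (22), residue number plus
        # insertion code (23-27), and x/y/z coordinates (31-54).
        residue_key = (line[21], line[22:27])
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        yield residue_key, xyz, line


def extract_pocket(pdb_text, ligand_coords, radius=10.0):
    """Keep all atoms of residues with any atom within radius of the ligand."""
    atoms = list(parse_atoms(pdb_text))
    selected = set()
    for key, xyz, _ in atoms:
        if any(math.dist(xyz, lig) <= radius for lig in ligand_coords):
            selected.add(key)
    kept = [line for key, _, line in atoms if key in selected]
    return "\n".join(kept + ["END"]) + "\n"
```

The ligand coordinates could come from a co-crystallized or docked reference ligand; the returned text can be saved as a .pdb file and passed to --pdb_path.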