This is the repo for the Tencent SolCuration project. We provide an aqueous solubility dataset curation tool and 7 solubility datasets, each in its original and curated form.
Simply run:
git clone https://github.com/Mengjintao/SolCuration
The above command will download the data curation code and all the original and curated datasets automatically. You can also download your favorite dataset directly from our repo manually.
The easiest way to install the dependencies for Chemprop and AttentiveFP is via conda. Here are the steps:
- Install conda from Anaconda Distribution
cd /path/to/SolCuration
conda env create -f environment.yml
source activate mpnn
(or conda activate mpnn for newer versions of conda)
pip install -r requirements.txt
After the installation, you can check your environment with the following commands:
conda activate mpnn
python --version
should return Python 3.7.6
python
import torch
print(torch.cuda.device_count())
should return the number of GPU cards on your host.
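The same checks can be bundled into one script. This is a minimal sketch, not part of the repo: the function name `check_environment` is hypothetical, and it assumes only that PyTorch, when installed, is importable as `torch`. If PyTorch is missing, it reports that instead of crashing.

```python
import importlib
import sys


def check_environment():
    """Collect the Python version and, if PyTorch is importable, the GPU count."""
    report = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in the active environment.
        report["torch"] = None
        report["gpus"] = None
    else:
        report["torch"] = torch.__version__
        report["gpus"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Run it inside the activated mpnn environment; a `torch` value of `None` suggests the conda environment was not created or activated correctly.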
cd /path/to/SolCuration
git clone https://github.com/Mengjintao/Chemprop
cd Chemprop
bash test.sh
This launches a test run for Chemprop.
cd /path/to/SolCuration
git clone https://github.com/Mengjintao/AttentiveFP
cd AttentiveFP
python runScaffold.py ../../org/esol/esol_org.csv 124 0 2 5 100 5 2
This launches a test run for AttentiveFP.
Currently, we use the szsc partition (2,000 computing nodes with 6,400 CPU cores and 8,000 GPUs) at the National Supercomputer Center in Shenzhen as our GPU cluster.
cd /path/to/Chemprop
bash Chemprop_GridSearch.sh
cd /path/to/AttentiveFP
bash AttentiveFP_GridSearch.sh
Chemprop
cd /path/to/Chemprop
bash Collect_RMSE.sh > RMSE.txt
- We can load data from RMSE.txt using MS Excel for further analysis.
AttentiveFP
cd /path/to/AttentiveFP
cd solubility
bash Collect_RMSE.sh > RMSE.txt
- We can load data from RMSE.txt using MS Excel for further analysis.
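For reference, RMSE is the root-mean-square error between predicted and measured values. The following is a minimal sketch under stated assumptions: the function name `rmse` and the sample numbers are hypothetical, and this is not the code that Collect_RMSE.sh runs.

```python
import math


def rmse(measured, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(m - p) ** 2 for m, p in zip(measured, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical log-solubility values, for illustration only.
measured = [-1.0, -2.0, -3.0]
predicted = [-1.5, -2.0, -2.5]
print(round(rmse(measured, predicted), 4))
```

A lower RMSE means the predictions are closer to the measured values on average, with large errors penalized more heavily than small ones.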
Collect the Pearson correlation coefficient R^2 and Spearman's rank-order correlation coefficient R_s on the BPU, BDZ, PCA, CDK, and BPZ&BDZ series evaluation datasets.
Chemprop
cd /path/to/Chemprop
bash Collect_Coefficient.sh > R2out.txt
- We can load data from R2out.txt using MS Excel and plot the figures for further analysis.
AttentiveFP
cd /path/to/AttentiveFP
cd solubility
bash Collect_Coefficient.sh > R2out.txt
- We can load data from R2out.txt using MS Excel and plot the figures for further analysis.
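To make the two coefficients concrete, here is a minimal sketch that computes them in pure Python. It assumes R^2 means the square of the Pearson correlation coefficient and that ties are handled with average ranks. The function names and sample values are hypothetical, and Collect_Coefficient.sh may compute these quantities differently.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rs(xs, ys):
    """Spearman's rank-order coefficient: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# Hypothetical measured and predicted log-solubility values.
measured = [-0.5, -1.2, -2.8, -3.1, -4.0]
predicted = [-0.7, -1.0, -2.5, -3.4, -3.9]
print("R^2:", round(pearson_r(measured, predicted) ** 2, 4))
print("R_s:", round(spearman_rs(measured, predicted), 4))
```

R^2 reflects how well predictions follow a linear relationship with the measurements, while R_s depends only on whether the compounds are ranked in the same order.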
- Python 3.6+
- PyTorch 1.0+
- RDKit
- torchvision
- pandas
- tqdm
- openbabel
Adjust the data curation parameters and rerun the data curation workflow to generate customized curated datasets.
cd /path/to/SolCuration/src
- Modify or add the weights at lines 477-484 in cure.cpp.
- Recompile cure.cpp with
g++ cure.cpp -o cure
cd ..
- Generate customized datasets with
bash dataGeneration.sh
- Generate additional features with RDKit
bash featureGeneration.sh
- Sorkun, Murat Cihan and Khetan, Abhishek and Er, Suleyman. AqSolDB, a curated reference set of aqueous solubility and 2D descriptors for a diverse set of compounds. Nature Scientific Data, 6, pp. 1-8 (2019)
- Yang, Kevin and Swanson, Kyle and Jin, Wengong and Coley, Connor and Eiden, Philipp and Gao, Hua and Guzman-Perez, Angel and Hopper, Timothy and Kelley, Brian and Mathea, Miriam and others. Analyzing learned molecular representations for property prediction. Journal of Chemical Information and Modeling, 59, pp. 3370-3388 (2019)
- Xiong, Zhaoping and Wang, Dingyan and Liu, Xiaohong and Zhong, Feisheng and Wan, Xiaozhe and Li, Xutong and Li, Zhaojun and Luo, Xiaomin and Chen, Kaixian and Jiang, Hualiang and others. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. Journal of Medicinal Chemistry (2019)
Please cite as: