arneschneuing / DiffSBDD

A Euclidean diffusion model for structure-based drug design.

Data preprocessing with coarse graining does not seem to work

stratisMarkou opened this issue

Following the README instructions, I have downloaded the CrossDocked dataset, unzipped it, and am trying to run the preprocessing script on it with and without the --ca_only flag.

Running

python process_crossdock.py my_data_dir --no_H

runs without errors, but running

python process_crossdock.py .data --no_H --ca_only

fails, giving the error

KeyError "'R' not in amino acid dict (.data/crossdocked_pocket10/WNK1_HUMAN_202_483_0/5tf9_A_rec_5wdy_a6s_lig_tt_min_0_pocket10.pdb, .data/crossdocked_pocket10/WNK1_HUMAN_202_483_0/5tf9_A_rec_5wdy_a6s_lig_tt_min_0.sdf)" WNK1_HUMAN_202_483_0/5tf9_A_rec_5wdy_a6s_lig_tt_min_0_pocket10.pdb WNK1_HUMAN_202_483_0/5tf9_A_rec_5wdy_a6s_lig_tt_min_0.sdf
#failed: 10: 100%|█████████████| 10/10 [00:00<00:00, 128.31it/s]
Traceback (most recent call last):
  File "/home/stratis/repos/DiffSBDD/process_crossdock.py", line 364, in <module>
    lig_coords = np.concatenate(lig_coords, axis=0)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: need at least one array to concatenate

It looks like, in the second case, the script fails to find certain entries in the amino acid dict and skips every protein-ligand complex, leaving lig_coords as an empty list that can't be concatenated. Looking at the dataset_params dictionary, there seem to be two sets of preprocessing parameters, crossdock_full and crossdock. Changing line 24 of the preprocessing script from dataset_info = dataset_params['crossdock_full'] to dataset_info = dataset_params['crossdock'] and running the preprocessing with --ca_only works without any errors, but I'm not sure the resulting data is correctly preprocessed. Is there something wrong with the preprocessing script, or am I doing something wrong on my side?
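For anyone trying to follow the failure mode described above, here is a minimal sketch of it in isolation. It is not the code from process_crossdock.py: the two encoders and the collect_coords helper are hypothetical stand-ins, and the real dictionaries in dataset_params may look different. It only shows how a lookup table without one-letter residue codes causes every complex to be skipped, which leaves an empty list for np.concatenate.

```python
import numpy as np

# Hypothetical encoders, for illustration only. The real dictionaries live in
# the repository's dataset_params and may differ from these.
ATOM_LIKE_ENCODER = {"C": 0, "N": 1, "O": 2, "S": 3}
RESIDUE_ENCODER = {aa: i for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}


def collect_coords(complexes, encoder):
    """Gather ligand coordinates, skipping complexes with unknown residues."""
    lig_coords, n_failed = [], 0
    for residues, coords in complexes:
        try:
            for res in residues:
                if res not in encoder:
                    raise KeyError(f"{res!r} not in amino acid dict")
        except KeyError:
            # Mirrors the reported behaviour: the complex is counted as failed
            # and silently dropped.
            n_failed += 1
            continue
        lig_coords.append(coords)
    return lig_coords, n_failed


# Two toy complexes: (one-letter residue string, ligand coordinate array).
complexes = [("RK", np.zeros((3, 3))), ("AR", np.ones((2, 3)))]

# With the atom-like encoder every complex fails, so the list stays empty and
# np.concatenate(skipped_all, axis=0) would raise ValueError.
skipped_all, n_failed = collect_coords(complexes, ATOM_LIKE_ENCODER)

# With a residue-level encoder both complexes are kept and can be merged.
kept, _ = collect_coords(complexes, RESIDUE_ENCODER)
merged = np.concatenate(kept, axis=0)
```

In this toy setup the ValueError is only a symptom. The actual problem is the earlier KeyError, which is why switching the encoder makes both errors disappear.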

Hi Stratis,
I think the process_crossdock.py file is indeed outdated and should be updated. As far as I can tell, your solution should be fine as a temporary fix because dataset_params['crossdock'] contains the correct amino acid types required for the coarse-grained model (maybe @yuanqidu can confirm). We will try to upload a correct version as soon as possible.
Sorry for the inconvenience!
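Until an updated script is available, one way to sanity-check a parameter set before running with --ca_only is to confirm that its amino acid dictionary covers the 20 standard one-letter residue codes. The sketch below assumes the encoder is a plain dict keyed by one-letter codes. missing_residue_codes is a hypothetical helper, not part of the repository, and the example dictionaries are made up.

```python
STANDARD_RESIDUES = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard one-letter codes


def missing_residue_codes(encoder):
    """Return the standard residue codes that are absent from `encoder`."""
    return sorted(set(STANDARD_RESIDUES) - set(encoder))


# Made-up examples: an atom-type-like table and a residue-level table.
atom_like = {"C": 0, "N": 1, "O": 2, "S": 3}
residue_level = {aa: i for i, aa in enumerate(STANDARD_RESIDUES)}

missing_atom_like = missing_residue_codes(atom_like)
missing_residue_level = missing_residue_codes(residue_level)

if missing_atom_like:
    print("atom_like lacks residue codes:", "".join(missing_atom_like))
if not missing_residue_level:
    print("residue_level covers all standard residues")
```

Running a check like this up front surfaces the mismatch as one clear message instead of a per-complex KeyError followed by an empty concatenation.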