PNNL-CompBio / CLEAN-Contact

PyTorch Implementation of CLEAN-Contact: Contrastive Learning-enabled Enzyme Functional Annotation Prediction with Structural Inference

Home Page: https://www.biorxiv.org/content/10.1101/2024.05.14.594148.abstract

not grabbing PDBs from Alphafold

abbyjerger opened this issue

The script extract_structure_representation.py is supposed to grab PDB files from the AlphaFold URL. It currently uses urllib.request.urlretrieve(), which doesn't seem to work under certain security protocols on systems such as PNNL's HPC Deception. We should use a different way to pull these PDBs.
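For reference, one alternative to urlretrieve() is to stream the response from urllib.request.urlopen() with an explicit TLS context and a timeout. This is only a minimal sketch: fetch_pdb and its parameters are hypothetical names, not anything in the repository, and whether it behaves differently on a restricted system depends on that site's proxy and certificate setup.

```python
import shutil
import ssl
import urllib.request


def fetch_pdb(url, dest, timeout=30):
    """Stream `url` into the file `dest` (hypothetical helper, not repo code)."""
    # Default context: system CA bundle, certificate verification enabled.
    context = ssl.create_default_context()
    with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
        with open(dest, "wb") as out:
            # Copy in chunks so the whole file is never held in memory.
            shutil.copyfileobj(resp, out)
    return dest
```

Unlike urlretrieve(), this makes the timeout and certificate handling explicit, so a failure surfaces as a urllib.error.URLError or ssl.SSLError that can be reported per ID.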

Closing this issue because it now seems that this is not the actual problem I'm running into. I'll open a new issue to address the actual problem.

PDB files are being successfully obtained (if they exist in the AlphaFold database, as expected) when I run extract_structure_representation.py on PNNL's HPC. The earlier errors might have been related to how I was testing.

Potential edits to extract_structure_representation.py, or to functionality relating to PDBs, that we should discuss later:

  • Only create error_ids.txt if it would contain at least one ID.
  • Perhaps set up a script that lets users check whether their IDs are in AlphaFold before running any other steps.
  • At the end of extract_structure_representation.py, we might want to check that every ID is now present in all the affected folders: the user's PDB file location, data/contact_maps, data/resnet_data, and (from the previous retrieve_esm2_embedding step) data/esm2_data.
  • Include more specific exception handling: if an ID shows up in error_ids.txt, is it because of an issue with urlretrieve() or the URL, or because the ID just doesn't exist in AlphaFold?
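
A rough sketch of how three of these ideas (conditional error_ids.txt, a final folder check, and more specific exceptions) might fit together. Everything here is an assumption for discussion, not the script's current code: the function names are hypothetical, AF_URL is a guessed URL pattern whose version suffix may not match what the script uses, and the folder check simply looks for any file named after the ID because the real file extensions are not stated in this thread.

```python
import urllib.error
import urllib.request
from pathlib import Path

# ASSUMPTION: illustrative URL pattern only; the script's real URL may differ.
AF_URL = "https://alphafold.ebi.ac.uk/files/AF-{uid}-F1-model_v4.pdb"


def classify_download(uid, dest_dir, retrieve=urllib.request.urlretrieve):
    """Try one download; return 'ok', 'not_found', or 'network_error'."""
    dest = Path(dest_dir) / f"{uid}.pdb"
    try:
        retrieve(AF_URL.format(uid=uid), str(dest))
        return "ok"
    except urllib.error.HTTPError as exc:
        # A 404 most likely means the ID has no entry at that URL;
        # any other HTTP status is treated as a server/network problem.
        return "not_found" if exc.code == 404 else "network_error"
    except (urllib.error.URLError, OSError):
        # DNS failures, timeouts, TLS errors, unwritable destination, etc.
        return "network_error"


def write_error_ids(failures, path="error_ids.txt"):
    """Write 'ID<TAB>reason' lines; create no file when nothing failed."""
    if not failures:
        return False
    with open(path, "w") as fh:
        for uid, reason in failures.items():
            fh.write(f"{uid}\t{reason}\n")
    return True


def missing_outputs(ids, folders):
    """Return {folder: [IDs with no file named '<ID>.*' in that folder]}."""
    missing = {}
    for folder in folders:
        absent = [uid for uid in ids if not any(Path(folder).glob(f"{uid}.*"))]
        if absent:
            missing[folder] = absent
    return missing
```

With this shape, error_ids.txt would record why each ID failed, and a final call such as missing_outputs(ids, ["data/contact_maps", "data/resnet_data", "data/esm2_data"]) would report any folder that is still incomplete.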