BenevolentAI / DeeplyTough

DeeplyTough: Learning Structural Comparison of Protein Binding Sites


Cannot retrieve some cluster files

L40S38 opened this issue · comments

Hi.

I executed the commands to evaluate on the Vertex and ProSPECCTS datasets, but both fail with essentially the same error, shown below.

(I exported STRUCTURE_DATA_DIR=$DEEPLYTOUGH/datasets_structure. I have also shortened the absolute path to the repository in the output.)

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 5324k  100 5324k    0     0  1471k      0  0:00:03  0:00:03 --:--:-- 1472k
INFO:datasets.vertex:Preprocessing: downloading data and extracting pockets, this will take time.
INFO:root:cluster file path: DeeplyTough/datasets_structure/bc-30.out
WARNING:root:Cluster definition not found, will download a fresh one.
WARNING:root:However, this will very likely lead to silent incompatibilities with any old 'pdbcode_mappings.pickle' files! Please better remove those manually.
Traceback (most recent call last):
  File "DeeplyTough/deeplytough/scripts/vertex_benchmark.py", line 68, in <module>
    main()
  File "DeeplyTough/deeplytough/scripts/vertex_benchmark.py", line 32, in main
    database.preprocess_once()
  File "DeeplyTough/deeplytough/datasets/vertex.py", line 49, in preprocess_once
    clusterer = RcsbPdbClusters(identity=30)
  File "DeeplyTough/deeplytough/misc/utils.py", line 248, in __init__
    self._fetch_cluster_file()
  File "DeeplyTough/deeplytough/misc/utils.py", line 262, in _fetch_cluster_file
    self._download_cluster_sets(cluster_file_path)
  File "DeeplyTough/deeplytough/misc/utils.py", line 253, in _download_cluster_sets
    request.urlretrieve(f'https://cdn.rcsb.org/resources/sequence/clusters/bc-{self.identity}.out', cluster_file_path)
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 248, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Evaluation on the TOUGH-M1 dataset succeeded, so I suspect one of the URLs used when preparing the Vertex and ProSPECCTS data has expired.
Would you mind checking?
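
For reference, the 404 can be reproduced outside of DeeplyTough with plain urllib, against the same URL that appears in the traceback:

    from urllib import request, error

    # The legacy per-chain cluster file the code tries to fetch:
    url = 'https://cdn.rcsb.org/resources/sequence/clusters/bc-30.out'
    try:
        request.urlopen(url)
    except error.HTTPError as e:
        print(e.code, e.reason)  # prints: 404 Not Found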

Hey @L40S38, thanks for opening a ticket. It seems this is due to the RCSB PDB cluster file moving. See https://www.rcsb.org/news/feature/6205750d8f40f9265109d39f (in fact, the old file has been discontinued and replaced, so this may even have scientific implications for DeeplyTough).

I will have a look into it. If you don't need the cluster file (e.g. if you are happy with random splitting, or you just want to run the existing models), I believe you can simply specify a different splitting method.
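
In the meantime, a purely random split needs nothing from RCSB. A minimal sketch of the idea (a hypothetical helper, not DeeplyTough's actual splitting code):

    import random

    def random_split(pdb_codes, test_fraction=0.2, seed=0):
        # Shuffle deterministically, then carve off a test set; no cluster file needed.
        codes = sorted(pdb_codes)
        random.Random(seed).shuffle(codes)
        n_test = int(len(codes) * test_fraction)
        return codes[n_test:], codes[:n_test]  # train, test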

commented

Hi, long time no see.

I solved this problem, so let me share the fix.

- The URL used to retrieve the cluster file (in deeplytough/misc/utils.py) should be changed to the one below (a sketch of the change follows after the code at the end):

https://cdn.rcsb.org/resources/sequence/clusters/clusters-by-entity-{self.identity}.txt

- Also, the sequence identifiers in the cluster file changed from {pdb_code}_{chain_id} to {pdb_code}_{entity_id}, so the cluster ID lookup now fails for most proteins. You therefore need to obtain the entity id of each chain somehow and search for the cluster it belongs to, or split another way (e.g. uniprot_folds). In my case, I modified pdb_chain_to_uniprot in deeplytough/datasets/toughm1.py, which is called while building pdbcode_mappings.pickle during TOUGH-M1 preprocessing, so that it also extracts the entity_id (a lookup sketch against the new file format follows further below):

    # Imports shown so the snippet is self-contained; they are likely already
    # present in toughm1.py.
    import logging
    import requests

    logger = logging.getLogger(__name__)

    def pdb_chain_to_uniprot(pdb_code, query_chain_id):
        """
        Get the PDB chain mapping to a UniProt accession using the PDBe API,
        and return the entity id of the matching chain.
        """
        result = 'None'
        entity_id = 'None'
        r = requests.get(f'http://www.ebi.ac.uk/pdbe/api/mappings/uniprot/{pdb_code}')
        fam = r.json()[pdb_code]['UniProt']

        for fam_id in fam.keys():
            for chain in fam[fam_id]['mappings']:
                if chain['chain_id'] == query_chain_id:
                    if result != 'None' and fam_id != result:
                        logger.warning(f'DUPLICATE {fam_id} {result}')
                    result = fam_id  # UniProt accession (kept for the duplicate check)
                    entity_id = chain['entity_id']
        if result == 'None':
            logger.warning(f'No uniprot accession found for {pdb_code}: {query_chain_id}')
        return entity_id
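
The URL change from the first bullet would be a small edit in _download_cluster_sets; here is a sketch based on the call visible in the traceback (the exact surrounding code may differ):

    from urllib import request

    def _download_cluster_sets(self, cluster_file_path):
        # bc-{identity}.out was retired; RCSB now publishes entity-based clusters.
        url = f'https://cdn.rcsb.org/resources/sequence/clusters/clusters-by-entity-{self.identity}.txt'
        request.urlretrieve(url, cluster_file_path)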

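Once entity ids are available, the lookup against the new file can be sketched as follows. This assumes the file keeps the old bc-*.out layout of one cluster per line with whitespace-separated members, now of the form {pdb_code}_{entity_id}; the PDB code in the usage example is a placeholder:

    def load_entity_clusters(path):
        # Map '{PDB}_{entity}' -> cluster index, one cluster per line.
        mapping = {}
        with open(path) as f:
            for cluster_idx, line in enumerate(f):
                for member in line.split():
                    mapping[member] = cluster_idx
        return mapping

    clusters = load_entity_clusters('datasets_structure/clusters-by-entity-30.txt')
    entity_id = pdb_chain_to_uniprot('1abc', 'A')  # '1abc'/'A' are placeholder values
    cluster_id = clusters.get(f'1ABC_{entity_id}')
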
I hope this helps with your runs.