`ncfp` not recovering all coding sequences from NCBI
widdowquinn opened this issue
Leighton Pritchard commented
Summary:
`ncfp` does not recover all coding sequences from NCBI, even if a coding sequence is available
Description:
The UniProt sequence below
>tr|F5NV06|F5NV06_SHIFL MliC domain-containing protein OS=Shigella flexneri K-227 OX=766147 GN=SFK227_1958 PE=4 SV=1
MKKLLIIILPVLLSGCSAFNQLVERMQTDTLEYQCDEKPLTVKLNNPCQEVSFVYDNQLL
HLKQGLSASGARYSDGIYVFWSKGEEATVYKRDRIVLNNCQLQNPQR
corresponds to the NCBI record
https://www.ncbi.nlm.nih.gov/protein/333018885
whose coding sequence is in the nucleotide accession
https://www.ncbi.nlm.nih.gov/nuccore/AFGY01000021.1
but in debug mode `ncfp` reports:
[DEBUG] [ncbi_cds_from_protein.sequences]: Guessing sequence type for tr|F5NV06|F5NV06_SHIFL...
[DEBUG] [ncbi_cds_from_protein.sequences]: ...guessed UniProt
[DEBUG] [ncbi_cds_from_protein.sequences]: Uniprot record has GN field: SFK227_1958
[DEBUG] [ncbi_cds_from_protein.sequences]: Recovered EMBL database record: AFGY01000021
[DEBUG] [ncbi_cds_from_protein.sequences]: Adding record tr|F5NV06|F5NV06_SHIFL to cache with query AFGY01000021
Process input sequences: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 5.12it/s]
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: 1 sequences taken forward with query
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Identifying nucleotide accessions...
Search NT IDs: 0%| | 0/1 [00:00<?, ?it/s][DEBUG] [ncbi_cds_from_protein.entrez]: Entry has nt query, using direct ESearch
[DEBUG] [ncbi_cds_from_protein.entrez]: ESearch query: ('AFGY01000021',)
Search NT IDs: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.81it/s]
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Added 1 new UIDs to cache
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Collecting GenBank accessions...
Fetch UID accessions: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.24s/it]
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Updated GenBank accessions for 1 UIDs
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Fetching GenBank headers...
[DEBUG] [ncbi_cds_from_protein.entrez]: Found 1 UIDs with no GenBank headers
[DEBUG] [ncbi_cds_from_protein.entrez]: Checking EPost histories, batch size is 1
[DEBUG] [ncbi_cds_from_protein.entrez]: Found 1 EPost histories, fetching headers
[...]
DEBUG:ncbi_cds_from_protein.entrez:Parsed 1 records
Fetching GenBank headers: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:07<00:00, 7.22s/it]
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Fetched GenBank headers for 0 UIDs
INFO:ncbi_cds_from_protein.scripts.ncfp:Fetched GenBank headers for 0 UIDs
[WARNING] [ncbi_cds_from_protein.scripts.ncfp]: No GenBank header downloads were required! (in cache?)
WARNING:ncbi_cds_from_protein.scripts.ncfp:No GenBank header downloads were required! (in cache?)
[...]
[WARNING] [ncbi_cds_from_protein.scripts.ncfp]: No record found for sequence input tr|F5NV06|F5NV06_SHIFL
WARNING:ncbi_cds_from_protein.scripts.ncfp:No record found for sequence input tr|F5NV06|F5NV06_SHIFL
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Matched 0/1 records
INFO:ncbi_cds_from_protein.scripts.ncfp:Matched 0/1 records
and the `ncfp*.fasta` output files are empty.
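For anyone triaging this, the first steps in the debug log (guessing the header type as UniProt and reading its `GN` field) can be approximated outside `ncfp` with a couple of regular expressions. This is only a rough sketch and not `ncfp`'s own parsing code: the function name `parse_uniprot_header` and both patterns are hypothetical, written against the header shown above.

```python
import re

# Assumed header layout: db|ACCESSION|ENTRY_NAME description KEY=value ...
# Both patterns are illustrative and are not taken from the ncfp source.
HEADER_RE = re.compile(r"^>?(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry>\S+)")
GN_RE = re.compile(r"\bGN=(?P<gene>\S+)")


def parse_uniprot_header(header):
    """Return accession, entry name and GN field (or None) from a UniProt-style header."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"not a UniProt-style header: {header!r}")
    gn = GN_RE.search(header)
    return {
        "accession": match.group("accession"),
        "entry": match.group("entry"),
        "gene": gn.group("gene") if gn else None,
    }


header = (
    ">tr|F5NV06|F5NV06_SHIFL MliC domain-containing protein "
    "OS=Shigella flexneri K-227 OX=766147 GN=SFK227_1958 PE=4 SV=1"
)
info = parse_uniprot_header(header)
print(info)
```

The `gene` value it extracts is the same `SFK227_1958` that the log reports for the GN field, so the header parsing itself does not look like the failing step.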
Reproducible Steps:
- Create an input file containing only the sequence above.
- Call `ncfp` on that input file, e.g. with `ncfp --debug -l test.log -b 1 --keepcache test.fasta test_ncfp me@my.email`
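The first step can be scripted so that the input file is byte-for-byte reproducible. This is a minimal sketch that assumes the file name `test.fasta` from the example command; it only writes the file and does not invoke `ncfp`.

```python
from pathlib import Path

# Sequence copied verbatim from the Description above.
FASTA = """\
>tr|F5NV06|F5NV06_SHIFL MliC domain-containing protein OS=Shigella flexneri K-227 OX=766147 GN=SFK227_1958 PE=4 SV=1
MKKLLIIILPVLLSGCSAFNQLVERMQTDTLEYQCDEKPLTVKLNNPCQEVSFVYDNQLL
HLKQGLSASGARYSDGIYVFWSKGEEATVYKRDRIVLNNCQLQNPQR
"""

# File name matches the example command; change it if you call ncfp differently.
infile = Path("test.fasta")
infile.write_text(FASTA)

# Sanity check: the file should contain exactly one FASTA record.
n_records = sum(line.startswith(">") for line in FASTA.splitlines())
print(f"wrote {infile} with {n_records} record(s)")
```

After running it, the `ncfp` command from the step above can be called on `test.fasta` unchanged.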
`ncfp` Version:
Commit 0f70697
Python Version:
Python 3.8
Operating System:
macOS
Leighton Pritchard commented
It may be relevant that, locally, the tests fail with warnings like:
[...]
[WARNING] [ncbi_cds_from_protein.scripts.ncfp]: No record found for sequence input XP_004520832.1
[WARNING] [ncbi_cds_from_protein.scripts.ncfp]: No record found for sequence input XP_004520832.1
[WARNING] [ncbi_cds_from_protein.scripts.ncfp]: No record found for sequence input XP_004520832.1
[WARNING] [ncbi_cds_from_protein.scripts.ncfp]: No record found for sequence input XP_004520832.1
[WARNING] [ncbi_cds_from_protein.scripts.ncfp]: No record found for sequence input XP_004520832.1
[...]
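Because the same warning repeats many times, it can help to condense a log into one count per input ID. The sketch below is a hypothetical stdlib-only helper, not part of `ncfp`. The sample lines are copied from the excerpts in this issue and cover both log prefixes seen above.

```python
import re
from collections import Counter

# Matches the warning text in both "[WARNING] [...]:" and "WARNING:...:" line styles.
WARNING_RE = re.compile(r"No record found for sequence input (?P<seqid>\S+)")


def unmatched_inputs(log_lines):
    """Count 'No record found' warnings per input sequence ID."""
    counts = Counter()
    for line in log_lines:
        match = WARNING_RE.search(line)
        if match:
            counts[match.group("seqid")] += 1
    return counts


# Sample lines taken from the two log excerpts in this issue.
log = """\
[WARNING] [ncbi_cds_from_protein.scripts.ncfp]: No record found for sequence input XP_004520832.1
[WARNING] [ncbi_cds_from_protein.scripts.ncfp]: No record found for sequence input XP_004520832.1
[WARNING] [ncbi_cds_from_protein.scripts.ncfp]: No record found for sequence input tr|F5NV06|F5NV06_SHIFL
WARNING:ncbi_cds_from_protein.scripts.ncfp:No record found for sequence input tr|F5NV06|F5NV06_SHIFL
""".splitlines()

counts = unmatched_inputs(log)
for seqid, n in counts.items():
    print(seqid, n)
```

To use it on a real run, pass the lines of the file given to `-l` (for example `test.log`) instead of the sample list.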
Leighton Pritchard commented
Issue closed with fix in 3a5eb88