When multiple possible CDS matches are found, sometimes the 'wrong one' is processed.
widdowquinn opened this issue
Leighton Pritchard commented
Summary:
Some queries are ambiguous in terms of the matching CDS (e.g. a gene name is provided, but no protein_id or other precise accession), so the "wrong one" can be recovered. This currently has two potential results:
- the query is skipped because the conceptual translation doesn't match the query
- an error is thrown because the Stockholm region of the query falls outside the coding sequence
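One way to avoid both outcomes would be to test every candidate CDS against the query before committing to one. The following is a minimal sketch only, not ncfp's actual code: `pick_matching_cds` is a hypothetical helper, the candidate translations are assumed to be plain strings, and the Stockholm coordinates are assumed to be 1-based and inclusive (consistent with the 25-residue query at 414-438 in the example further down).

```python
from typing import Optional, Sequence


def pick_matching_cds(
    query: str, start: int, end: int, translations: Sequence[str]
) -> Optional[int]:
    """Return the index of the first candidate translation that contains
    the query at the given Stockholm region, or None if no candidate fits.

    Hypothetical helper for illustration; start/end are assumed to be
    1-based, inclusive coordinates.
    """
    for idx, translation in enumerate(translations):
        # A candidate shorter than the region cannot be the right CDS.
        if start < 1 or end > len(translation):
            continue
        # Convert 1-based inclusive coordinates to a Python slice.
        if translation[start - 1:end] == query:
            return idx
    return None
```

A caller could treat a return value of `None` as "no unambiguous match" and report that to the user instead of processing the first candidate it finds.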
Reproducible Steps:
With
>tr|A0A127QBK9|A0A127QBK9_9BURK/414-438 [subseq from] Ammonium transporter OS=Collimonas pratensis OX=279113 GN=amt PE=3 SV=1
DVFGVHGVGGIMGALLTGVFAAPSL
saved as bad_sequence.fasta, run:
ncfp --unify_seqid --debug -l bad_sequence.log -s bad_sequence.fasta bad_sequence dev@null.me
Current Output:
Key: translation, Value: ['MPINIGNTAFMLLCSSLVMLMTPGLAFFYGGLVGRKNVLAIMMQSFISLGWTTVLWFAFGYSMCFGPSWHGIIGDPTYYAFLHGITLSSMYTGNDAGIPLIVHVAYQMMFAIITPALITGAFANRVTFKAYFLFLTGWLVFVYFPFVHMVWSPDGLFAKWGVLDYAGGIVVHNTAGFAALASVLYVGRRQKVELKPHNVPLIALGSGLLWFGWYGFNAGSEFRVDAVTASAFLNTDVAASFGAITWLFIEWFYHKKPKFIGLLTGGVAGLATITPAAGYVSLGTAAIIGICAGLICFYAVALKNRLGWDDALDVWGVHGVGGMAGTILLGVFASKAWNANGADGLLLGNTSFFFAQCGAVIISGIWAFAFTYGMLWLINLFTPVKVGAATQDRMDEDLHGEDAYLHA']
[DEBUG] [ncbi_cds_from_protein.sequences]: Trimming CDS to Stockholm coordinates: 414..438
Traceback (most recent call last):
  File "/Users/lpritc/opt/anaconda3/envs/ncfp_py310/bin/ncfp", line 33, in <module>
    sys.exit(load_entry_point('ncfp', 'console_scripts', 'ncfp')())
  File "/Users/lpritc/Development/ncfp/ncbi_cds_from_protein/scripts/ncfp.py", line 379, in run_main
    nt_sequences = extract_cds_features(seqrecords, cachepath, args)
  File "/Users/lpritc/Development/ncfp/ncbi_cds_from_protein/scripts/ncfp.py", line 192, in extract_cds_features
    ntseq, aaseq = extract_feature_cds(
  File "/Users/lpritc/Development/ncfp/ncbi_cds_from_protein/sequences.py", line 320, in extract_feature_cds
    if aaseq[-1] == "*":
  File "/Users/lpritc/opt/anaconda3/envs/ncfp_py310/lib/python3.10/site-packages/Bio/Seq.py", line 430, in __getitem__
    return chr(self._data[index])
IndexError: index out of range
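The IndexError is raised by `aaseq[-1]`, which suggests that the trimmed amino acid sequence is empty at that point. A minimal sketch of a guard follows. `strip_terminal_stop` is a hypothetical name rather than a function in ncfp, and it is written against plain strings, on the assumption that a Biopython `Seq` behaves the same way for `len()`, indexing, and slicing.

```python
def strip_terminal_stop(aaseq):
    """Remove a trailing stop symbol ("*") if present.

    Hypothetical helper: checks the length first so that an empty
    sequence is returned unchanged instead of raising IndexError.
    """
    if len(aaseq) > 0 and aaseq[-1] == "*":
        return aaseq[:-1]
    return aaseq
```

This only prevents the crash. An empty sequence after trimming still indicates that the wrong CDS was selected, so the caller would need to handle that case separately.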
Expected Output:
Either a graceful failure (a message or warning stating that the sequence could not be matched automatically), or the correct sequence is returned.
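As an illustration of the graceful failure, the region could be validated against the translation length before trimming, with a warning logged when it does not fit. This is a sketch under assumptions: `trim_to_stockholm` and the logger name are hypothetical and not taken from ncfp, the translation is a plain string, and the coordinates are assumed to be 1-based and inclusive.

```python
import logging
from typing import Optional

# Hypothetical logger name, for illustration only.
logger = logging.getLogger("ncfp_sketch")


def trim_to_stockholm(
    translation: str, start: int, end: int, seqid: str
) -> Optional[str]:
    """Return the start..end region of a translation, or None with a
    warning if the region does not lie inside the translation.

    start/end are assumed to be 1-based, inclusive coordinates.
    """
    if start < 1 or end < start or end > len(translation):
        logger.warning(
            "Could not match %s automatically: region %d..%d lies outside "
            "the %d-residue translation; skipping.",
            seqid, start, end, len(translation),
        )
        return None
    return translation[start - 1:end]
```

Returning `None` lets the caller skip the query and continue with the remaining sequences, rather than aborting the whole run.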
ncfp Version:
git HEAD
Python Version:
3.10
Operating System:
macOS