When multiple possible CDS matches are found, sometimes the 'wrong one' is processed.

Question

widdowquinn opened this issue 2 years ago · comments

Leighton Pritchard commented 2 years ago

Summary:

Some queries are ambiguous in terms of matching CDS (e.g. a gene name is provided, but no protein_id or other precise accession), and the "wrong one" can be recovered. This currently has two potential results:

- the query is skipped because the conceptual translation doesn't match the query
- an error is thrown because the Stockholm region of the query falls outside the coding sequence
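
One way to avoid processing the "wrong one" might be to check every candidate translation against the query before committing to it. The sketch below is not taken from `ncfp`: the function name `pick_matching_translation` and its arguments are hypothetical, plain strings stand in for sequence objects, and the Stockholm coordinates are assumed to be 1-based and inclusive.

```python
from typing import Optional, Sequence


def pick_matching_translation(
    query: str, start: int, end: int, translations: Sequence[str]
) -> Optional[int]:
    """Return the index of the first candidate whose region start..end equals query.

    start and end are assumed to be 1-based, inclusive coordinates.
    Returns None when no candidate matches, so the caller can skip the query.
    """
    for idx, translation in enumerate(translations):
        if end > len(translation):
            # The region falls outside this candidate, so it cannot be the match.
            continue
        if translation[start - 1:end] == query:
            return idx
    return None


# Toy data (not real sequences): only the second candidate covers positions 5..10
candidates = ["MKTAYIAK", "MKLVDVFGVHAA"]
best = pick_matching_translation("DVFGVH", 5, 10, candidates)
```

With a check like this, a candidate that is too short for the requested region is passed over instead of being trimmed, and a `None` result gives the caller a clear point at which to report that no match was found.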

Reproducible Steps:

With

>tr|A0A127QBK9|A0A127QBK9_9BURK/414-438 [subseq from] Ammonium transporter OS=Collimonas pratensis OX=279113 GN=amt PE=3 SV=1
DVFGVHGVGGIMGALLTGVFAAPSL

as bad_sequence.fasta

issue

ncfp --unify_seqid --debug -l bad_sequence.log -s bad_sequence.fasta bad_sequence dev@null.me

Current Output:

    Key: translation, Value: ['MPINIGNTAFMLLCSSLVMLMTPGLAFFYGGLVGRKNVLAIMMQSFISLGWTTVLWFAFGYSMCFGPSWHGIIGDPTYYAFLHGITLSSMYTGNDAGIPLIVHVAYQMMFAIITPALITGAFANRVTFKAYFLFLTGWLVFVYFPFVHMVWSPDGLFAKWGVLDYAGGIVVHNTAGFAALASVLYVGRRQKVELKPHNVPLIALGSGLLWFGWYGFNAGSEFRVDAVTASAFLNTDVAASFGAITWLFIEWFYHKKPKFIGLLTGGVAGLATITPAAGYVSLGTAAIIGICAGLICFYAVALKNRLGWDDALDVWGVHGVGGMAGTILLGVFASKAWNANGADGLLLGNTSFFFAQCGAVIISGIWAFAFTYGMLWLINLFTPVKVGAATQDRMDEDLHGEDAYLHA']

[DEBUG] [ncbi_cds_from_protein.sequences]: Trimming CDS to Stockholm coordinates: 414..438
Traceback (most recent call last):
  File "/Users/lpritc/opt/anaconda3/envs/ncfp_py310/bin/ncfp", line 33, in <module>
    sys.exit(load_entry_point('ncfp', 'console_scripts', 'ncfp')())
  File "/Users/lpritc/Development/ncfp/ncbi_cds_from_protein/scripts/ncfp.py", line 379, in run_main
    nt_sequences = extract_cds_features(seqrecords, cachepath, args)
  File "/Users/lpritc/Development/ncfp/ncbi_cds_from_protein/scripts/ncfp.py", line 192, in extract_cds_features
    ntseq, aaseq = extract_feature_cds(
  File "/Users/lpritc/Development/ncfp/ncbi_cds_from_protein/sequences.py", line 320, in extract_feature_cds
    if aaseq[-1] == "*":
  File "/Users/lpritc/opt/anaconda3/envs/ncfp_py310/lib/python3.10/site-packages/Bio/Seq.py", line 430, in __getitem__
    return chr(self._data[index])
IndexError: index out of range

Expected Output:

A graceful failure (message/warning) saying that the sequence couldn't be matched automatically, or the correct sequence being returned.
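
A minimal sketch of what the graceful failure could look like, assuming the `IndexError` arises because trimming to the Stockholm coordinates leaves an empty sequence. The helper `trim_to_region` is hypothetical and is not part of `ncfp`; it uses plain strings rather than Biopython objects, and assumes 1-based inclusive coordinates.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def trim_to_region(aaseq: str, start: int, end: int) -> Optional[str]:
    """Trim aaseq to the region start..end, or return None with a warning."""
    if start < 1 or end < start or end > len(aaseq):
        # Warn and skip rather than slicing past the end of the sequence
        logger.warning(
            "Region %d..%d lies outside the translation (length %d); skipping",
            start, end, len(aaseq),
        )
        return None
    trimmed = aaseq[start - 1:end]
    # endswith() is safe on an empty string, unlike indexing with [-1]
    if trimmed.endswith("*"):
        trimmed = trimmed[:-1]
    return trimmed


# Toy data (not real sequences)
inside = trim_to_region("MKLVDVFGVH*", 5, 11)
outside = trim_to_region("MKLV", 414, 438)
```

Here a region such as 414..438 on a shorter translation produces a warning and `None` rather than a traceback, leaving the caller free to skip the query or try another candidate.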

`ncfp` Version:

git HEAD

Python Version:

3.10

Operating System:

macOS
