widdowquinn / ncfp

Program and package that retrieves nucleotide coding sequences from NCBI that correspond to a set of input protein sequences.

Home Page: https://widdowquinn.github.io/ncfp/

When multiple possible CDS matches are found, sometimes the 'wrong one' is processed.

widdowquinn opened this issue

Summary:

Some queries are ambiguous in terms of the matching CDS (e.g. a gene name is provided, but no protein_id or other precise accession), and the "wrong one" can be recovered. This currently has two potential results:

  1. the query is skipped because the conceptual translation doesn't match the query
  2. an IndexError is raised because the Stockholm region of the query falls outside the coding sequence
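
One way to avoid picking the "wrong one" is to test every candidate CDS translation against the query and keep only a candidate that actually contains it. The sketch below is illustrative only: `pick_matching_translation` is a hypothetical helper, not part of ncfp, and it assumes the query and translations are available as plain strings and that gap characters in the query can simply be dropped.

```python
from typing import Optional, Sequence


def pick_matching_translation(
    query: str, translations: Sequence[str]
) -> Optional[int]:
    """Return the index of the first translation containing the query.

    Hypothetical helper: returns None when no candidate contains the
    query, so the caller can warn and skip instead of guessing.
    """
    # Assumption: alignment gaps ("-") in the query carry no residues.
    cleaned = query.replace("-", "").upper()
    if not cleaned:
        return None
    for idx, translation in enumerate(translations):
        if cleaned in translation.upper():
            return idx
    return None


if __name__ == "__main__":
    candidates = ["MAAAKL", "MKGVHLT"]
    print(pick_matching_translation("GVH", candidates))
    print(pick_matching_translation("WWW", candidates))
```

Returning `None` rather than the first hit leaves the decision to report or skip the query with the caller.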

Reproducible Steps:

Save

>tr|A0A127QBK9|A0A127QBK9_9BURK/414-438 [subseq from] Ammonium transporter OS=Collimonas pratensis OX=279113 GN=amt PE=3 SV=1
DVFGVHGVGGIMGALLTGVFAAPSL

as bad_sequence.fasta, then run:

ncfp --unify_seqid --debug -l bad_sequence.log -s bad_sequence.fasta bad_sequence dev@null.me

Current Output:

    Key: translation, Value: ['MPINIGNTAFMLLCSSLVMLMTPGLAFFYGGLVGRKNVLAIMMQSFISLGWTTVLWFAFGYSMCFGPSWHGIIGDPTYYAFLHGITLSSMYTGNDAGIPLIVHVAYQMMFAIITPALITGAFANRVTFKAYFLFLTGWLVFVYFPFVHMVWSPDGLFAKWGVLDYAGGIVVHNTAGFAALASVLYVGRRQKVELKPHNVPLIALGSGLLWFGWYGFNAGSEFRVDAVTASAFLNTDVAASFGAITWLFIEWFYHKKPKFIGLLTGGVAGLATITPAAGYVSLGTAAIIGICAGLICFYAVALKNRLGWDDALDVWGVHGVGGMAGTILLGVFASKAWNANGADGLLLGNTSFFFAQCGAVIISGIWAFAFTYGMLWLINLFTPVKVGAATQDRMDEDLHGEDAYLHA']

[DEBUG] [ncbi_cds_from_protein.sequences]: Trimming CDS to Stockholm coordinates: 414..438
Traceback (most recent call last):
  File "/Users/lpritc/opt/anaconda3/envs/ncfp_py310/bin/ncfp", line 33, in <module>
    sys.exit(load_entry_point('ncfp', 'console_scripts', 'ncfp')())
  File "/Users/lpritc/Development/ncfp/ncbi_cds_from_protein/scripts/ncfp.py", line 379, in run_main
    nt_sequences = extract_cds_features(seqrecords, cachepath, args)
  File "/Users/lpritc/Development/ncfp/ncbi_cds_from_protein/scripts/ncfp.py", line 192, in extract_cds_features
    ntseq, aaseq = extract_feature_cds(
  File "/Users/lpritc/Development/ncfp/ncbi_cds_from_protein/sequences.py", line 320, in extract_feature_cds
    if aaseq[-1] == "*":
  File "/Users/lpritc/opt/anaconda3/envs/ncfp_py310/lib/python3.10/site-packages/Bio/Seq.py", line 430, in __getitem__
    return chr(self._data[index])
IndexError: index out of range

Expected Output:

A graceful failure (message or warning) stating that the sequence couldn't be matched automatically, or the correct sequence being returned.
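
A minimal sketch of what a graceful failure might look like, assuming the IndexError in the traceback comes from indexing an empty sequence after trimming to coordinates that lie beyond the translation. `trim_to_region` is a hypothetical function, not ncfp's actual code, and it works on plain strings rather than Biopython `Seq` objects to stay self-contained.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def trim_to_region(aaseq: str, start: int, end: int) -> Optional[str]:
    """Trim a translation to 1-based, inclusive Stockholm coordinates.

    Hypothetical helper: returns None (with a warning) when the region
    does not fit inside the translation, instead of raising.
    """
    if start < 1 or end < start or end > len(aaseq):
        logger.warning(
            "Region %d..%d falls outside translation of length %d; "
            "sequence could not be matched automatically",
            start,
            end,
            len(aaseq),
        )
        return None
    trimmed = aaseq[start - 1 : end]
    # endswith() is safe on an empty string, unlike trimmed[-1].
    if trimmed.endswith("*"):
        trimmed = trimmed[:-1]
    return trimmed


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(trim_to_region("MKVLA*", 2, 4))
    print(trim_to_region("MKV", 414, 438))
```

The caller can then treat `None` as "skip this query and report it" rather than letting the exception end the whole run.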

ncfp Version:

git HEAD

Python Version:

3.10

Operating System:

macOS