widdowquinn / ncfp

Program and package that retrieves nucleotide coding sequences from NCBI that correspond to a set of input protein sequences.

Home Page: https://widdowquinn.github.io/ncfp/

Not identifying CDS and removed or suppressed by NCBI

tharis opened this issue

Summary:

ncfp finds a matching GenBank entry and tries to extract the CDS by locus tag. When that fails, it moves on to the GN field, but still does not find a CDS feature.

For some sequences, ncfp also reports that NCBI may have removed or suppressed the record.

Description:

Please describe the issue as clearly as possible, taking as much space as you need.

Reproducible Steps:

Using this command:

ncfp --unify_seqid -v -s -l out.log in.fasta out.out email -d caches/new_ncfp -c ncfp_cache

Attached are examples of sequences that produce each error:

CDS_not_identified.txt

removed_or_suppressed.txt

Current Output:

[WARNING] [ncbi_cds_from_protein.scripts.ncfp]: No record found for sequence input tr|F2YIC4|F2YIC4_METMG/66-85 - please check this sequence manually
[WARNING] [ncbi_cds_from_protein.scripts.ncfp]: This record may have been removed from the NCBI database, or suppressed by NCBI

[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Sequence sp|P54145|AMT1_CAEEL/31-51 matches GenBank entry FO080371.2
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Extracting CDS by locus tag with AA query ID: ('C05E11.4',)
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Did not find feature with locus tag ('C05E11.4',), trying GN field
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Searching for CDS: amt-1
[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Could not identify CDS feature for sp|P54145|AMT1_CAEEL/31-51
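
To make it easier to see which inputs fall into each failure mode, the log written by `-l out.log` can be triaged with a short script. This is only a sketch: it assumes the log messages have exactly the wording shown above, and the names `triage`, `accession`, and `SAMPLE_LOG` are hypothetical helpers, not part of ncfp.

```python
import re
from collections import defaultdict

# Patterns are based only on the message wording shown in this issue;
# they are an assumption and may not cover every ncfp log format.
NO_RECORD = re.compile(r"No record found for sequence input (\S+)")
NO_CDS = re.compile(r"Could not identify CDS feature for (\S+)")

# UniProt-style identifier (sp|ACC|NAME or tr|ACC|NAME) with an
# optional /start-end suffix, as seen in the input sequence IDs.
SEQ_ID = re.compile(
    r"^(?P<db>sp|tr)\|(?P<acc>[^|]+)\|(?P<name>[^/\s]+)"
    r"(?:/(?P<start>\d+)-(?P<end>\d+))?$"
)

# Two of the lines from the "Current Output" section above.
SAMPLE_LOG = [
    "[WARNING] [ncbi_cds_from_protein.scripts.ncfp]: No record found for "
    "sequence input tr|F2YIC4|F2YIC4_METMG/66-85 - please check this "
    "sequence manually",
    "[INFO] [ncbi_cds_from_protein.scripts.ncfp]: Could not identify CDS "
    "feature for sp|P54145|AMT1_CAEEL/31-51",
]


def triage(log_lines):
    """Group failing sequence IDs by failure type (hypothetical helper)."""
    failures = defaultdict(list)
    for line in log_lines:
        for label, pattern in (("no_record", NO_RECORD), ("no_cds", NO_CDS)):
            match = pattern.search(line)
            if match:
                failures[label].append(match.group(1))
    return dict(failures)


def accession(seq_id):
    """Return the UniProt accession from a sequence ID, or None."""
    match = SEQ_ID.match(seq_id)
    return match.group("acc") if match else None


if __name__ == "__main__":
    for label, seq_ids in triage(SAMPLE_LOG).items():
        for seq_id in seq_ids:
            print(label, seq_id, accession(seq_id))
```

In practice the lines would be read from the log file (for example `open("out.log")`) rather than from `SAMPLE_LOG`. The extracted accessions can then be checked manually against UniProt and NCBI.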

Expected Output:

I would expect ncfp to find the CDS feature, at least for the sequences where there is some sort of match. I am not really expecting anything for the suppressed sequences.

ncfp Version:

merged my version on 07-06-24

Python Version:

3.9.18

Operating System:

macOS