widdowquinn / ncfp

Program and package that retrieves nucleotide coding sequences from NCBI that correspond to a set of input protein sequences.

Home Page: https://widdowquinn.github.io/ncfp/

Use protein IDs in output

HobnobMancer opened this issue

Summary:

Provide an option to use the protein IDs of the query protein sequences in the FASTA file output.

Description:

Backtranslating nucleotide sequences onto aligned protein sequences (for example, using tools such as tcoffee) requires that each nucleotide sequence and its associated protein sequence be identifiable by a shared ID.

ncfp writes out the ID retrieved from the nucleotide record. Sometimes this matches the ID of the query protein sequence, and sometimes it does not. Using the ncfp output to backthread nucleotide sequences onto a protein MSA therefore requires additional parsing, to overwrite the IDs in the FASTA output with the IDs of the query protein sequences.
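To illustrate the additional parsing currently needed, here is a minimal sketch in plain Python. It is not part of ncfp: the function name `relabel_fasta` and the `id_map` dictionary are hypothetical, and it assumes the user already has a mapping from each nucleotide record ID to the matching query protein ID. The sequence line is a placeholder, not real data.

```python
from typing import Dict, Iterable, Iterator


def relabel_fasta(lines: Iterable[str], id_map: Dict[str, str]) -> Iterator[str]:
    """Yield FASTA lines, swapping each header ID that appears in id_map."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # The ID is the first space-delimited token after ">".
            seq_id, _, description = line[1:].partition(" ")
            # IDs with no entry in the mapping pass through unchanged.
            new_id = id_map.get(seq_id, seq_id)
            line = f">{new_id} {description}".rstrip()
        yield line


# Hypothetical mapping (nucleotide record ID -> query protein ID),
# built from the pair of IDs quoted in this issue.
id_map = {"AN2569.2": "EAA64674.1"}

# Stand-in for the ncfp FASTA output; the sequence is a placeholder.
ncfp_output = [">AN2569.2 coding sequence", "ATGGCTTAA"]

relabelled = list(relabel_fasta(ncfp_output, id_map))
print("\n".join(relabelled))
```

Building `id_map` is the awkward part of this workaround, since the link between the query protein and the retrieved nucleotide record is known to ncfp at retrieval time but is not recorded in the FASTA output.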

An option such as --use_protein_ids could be included, so that ncfp writes out the protein ID of the query protein for each nucleotide sequence written to the resulting FASTA file.

Current Output:

The ID of the nucleotide record retrieved from NCBI Entrez.

>AN2569.2 coding sequence

Expected Output:

The protein ID of the query protein sequence provided to ncfp.

>EAA64674.1 coding sequence

ncfp Version:

v0.2.0

Python Version:

v3.8.6

Operating System:

Ubuntu 20.04.2 LTS