Use protein IDs in output
HobnobMancer opened this issue · comments
Summary:
Provide an option to use the protein IDs of the query protein sequences in the FASTA file output.
Description:
Backtranslating nucleotide sequences onto aligned protein sequences (for example, using tools such as tcoffee) requires the nucleotide sequences and their associated protein sequences to be identifiable by sharing the same ID.
ncfp writes out the ID retrieved from the nucleotide record. Sometimes this is the same ID as the query protein sequence, and sometimes it is a different ID. Therefore, using the ncfp output for backthreading nucleotide sequences onto a protein MSA requires additional parsing of that output, to overwrite the IDs listed in the FASTA output with the IDs of the query protein sequences.
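To illustrate the additional parsing currently needed, here is a minimal sketch of such a relabelling step. It is not part of ncfp: the function name relabel_fasta, the id_map dictionary, and the short placeholder sequence are all hypothetical, and it assumes the user already has a mapping from each nucleotide record ID to its query protein ID.

```python
def relabel_fasta(lines, id_map):
    """Yield FASTA lines, replacing each header ID found in id_map.

    Hypothetical helper: id_map maps nucleotide record IDs to the
    query protein IDs and must be supplied by the user.
    """
    for line in lines:
        if line.startswith(">"):
            # Split the header into the ID (first token) and the description.
            seq_id, _, description = line[1:].rstrip("\n").partition(" ")
            new_id = id_map.get(seq_id, seq_id)  # unmapped IDs are left unchanged
            yield f">{new_id} {description}".rstrip() + "\n"
        else:
            # Sequence lines pass through untouched.
            yield line


# IDs taken from the example in this issue; the sequence is a placeholder.
id_map = {"AN2569.2": "EAA64674.1"}
ncfp_output = [">AN2569.2 coding sequence\n", "ATGGCC\n", "TAA\n"]
relabelled = list(relabel_fasta(ncfp_output, id_map))
print("".join(relabelled), end="")
```

Building id_map is the awkward part in practice, since the link between the query protein and the retrieved nucleotide record is known to ncfp at run time but is not recorded in the FASTA output.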
An option such as --use_protein_ids could be included, so that ncfp writes out the protein ID of the query protein for each nucleotide sequence written to the resulting FASTA file.
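As a rough sketch of how such a flag might be declared, assuming an argparse-based command line (this does not reflect ncfp's actual parser code; the parser object and help text below are illustrative only):

```python
import argparse

# Hypothetical stand-alone parser, used only to illustrate the proposed flag.
parser = argparse.ArgumentParser(prog="ncfp")
parser.add_argument(
    "--use_protein_ids",
    dest="use_protein_ids",
    action="store_true",
    default=False,
    help="write the query protein ID, not the nucleotide record ID, to FASTA headers",
)

args_default = parser.parse_args([])
args_enabled = parser.parse_args(["--use_protein_ids"])
print(args_default.use_protein_ids, args_enabled.use_protein_ids)
```

Making the flag opt-in would keep the current behaviour as the default, so existing workflows that rely on the nucleotide record IDs would be unaffected.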
Current Output:
The ID of the nucleotide record retrieved from NCBI Entrez.
>AN2569.2 coding sequence
Expected Output:
The protein ID of the query protein sequence provided to ncfp
>EAA64674.1 coding sequence
ncfp Version:
v0.2.0
Python Version:
v3.8.6
Operating System:
Ubuntu 20.04.2 LTS