gpertea / gffread

GFF/GTF utility providing format conversions, region filtering, FASTA sequence extraction and more

Output proteome file has unexpected sequences

alexvasilikop opened this issue

Hello,

I want to extract the translated CDS features (concatenated per gene) from the GFF, but some of the extracted sequences contain a "." character.
Is this expected?

e.g. see below:
$ /mnt/sda1/Alex/software/gffread-0.12.7.Linux_x86_64/gffread -C -g Schmidtea_mediterranea.assembly.fa -y Schmidtea_mediterranea.pep.fa --no-pseudo Schmidtea_mediterranea.no_iso.gff

One sequence in the fasta looks like this:

>SMEST011213001.1
MASLKDERSSAEHIRV.LETEAGEYDKLNEKLTDKGNNVKSPEPEISIQLKTSTTKEMKKKLREKINQEL
PSKNSDETEIYSRKSTMYEITRDEPEMRKQEPIYSSLKRNIQEMHSERKCNEEDLNEKKRNWKFGKENS

You can see there is a dot there in the first line.

Best and thanks for the help

Hello,

I have run into the same confusion. Is this a common problem? It seems this "." should be deleted or replaced with another character.

Best and thanks for the help

That period character represents a stop codon encountered in the translation. I know the "standard" is unfortunately to use the star (*) character instead, which seems rather inappropriate and misleading to my regex-biased mind :). A period means "end of sentence", so it seemed natural to use that character to depict the stop codon "translation".
Anyway, gffread has a -S option that forces the translated output to use * instead of . for stop codons, if you prefer that.
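
For a protein FASTA that has already been written with the default "." convention, something like the sketch below may help. Both functions are hypothetical helpers, not part of gffread: stops_to_star rewrites "." to "*" on sequence lines only (re-running gffread with -S is the cleaner route), and internal_stops lists record IDs where a stop character appears before the last residue, as in the SMEST011213001.1 example above. It assumes header lines start with ">" and that the ID is the first whitespace-delimited token. The last position is ignored so that a terminal stop, if one is written, is not reported.

```shell
# Hypothetical helpers (not part of gffread) for a protein FASTA produced by `gffread -y`.

# Replace '.' with '*' on sequence lines only; header lines (starting with '>') are untouched.
stops_to_star() {
  sed '/^>/!s/\./*/g' "$1"
}

# Print the ID of every record that has a stop character ('.' or '*')
# anywhere before its last residue, i.e. a stop inside the translated CDS.
internal_stops() {
  awk '
    function report() {
      if (id != "" && substr(seq, 1, length(seq) - 1) ~ /[.*]/) print id
    }
    /^>/ { report(); id = substr($1, 2); seq = ""; next }  # new record: flush the previous one
         { seq = seq $0 }                                  # join wrapped sequence lines
    END  { report() }                                      # flush the final record
  ' "$1"
}
```

Usage would look like `stops_to_star Schmidtea_mediterranea.pep.fa > converted.pep.fa` (the output name is arbitrary) and `internal_stops Schmidtea_mediterranea.pep.fa`. Any IDs printed by the second command point at gene models worth checking for a wrong frame, a pseudogene, or an annotation error.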