missing and inconsistent protein annotation usage

Question

git-jemiller opened this issue 5 years ago · comments

I'm trying to annotate a protein with its genomic coordinates using transvar and for most proteins it works fine, but sometimes nothing is returned except for the header of the output. How should I interpret this result? Or am I doing something wrong?

transvar panno --ensembl --idmap uniprot -i 'W5XKT8'
input	transcript	gene	strand	coordinates(gDNA/cDNA/protein)	region	info

Also, why do some proteins need their isoform to get any output and others do not?

Here's an example:


#returns output
transvar panno -i 'Q6N069-1' --uniprot --ensembl
input	transcript	gene	strand	coordinates(gDNA/cDNA/protein)	region	info
Q6N069-1	ENST00000379406 (protein_coding)	NAA16	+	chr13:g.41885341_41951166/c.1_2592/p.M1_I864	whole_transcript	promoter=chr13:41884341_41885341;#exons=20;cds=chr13:41885665_41949735

#no output
transvar panno -i 'Q6N069' --uniprot --ensembl
input	transcript	gene	strand	coordinates(gDNA/cDNA/protein)	region	info

#returns output without providing isoform number
transvar panno -i 'Q9H1K6' --uniprot --ensembl
input	transcript	gene	strand	coordinates(gDNA/cDNA/protein)	region	info
Q9H1K6	ENST00000267984 (protein_coding)	MESDC1	+	chr15:g.81293295_81296342/c.1_1086/p.M1_N362	whole_transcript	promoter=chr15:81292295_81293295;#exons=1;cds=chr15:81294613_81295698

Thanks!

Wanding Zhou - Bioinformatics · Answer 1 · Sun Aug 18 2019 02:38:48 GMT+0800 (China Standard Time)

Hi,

Sorry for the late response. TransVar uses the ID mapping from UniProt. More specifically, it comes from this file:

ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping.dat.gz

Therefore, if your identifier isn't linked to any transcript ID in this file, TransVar cannot locate the transcript definition. That's what happened to W5XKT8 and Q6N069. There also has to be a match between the transcript ID from the ID mapping file and the transcript definition used. You could also use a customized ID mapping if you know how to project UniProt IDs to transcript IDs (Ensembl, RefSeq, etc.). This is done by

transvar index --idmap <idmapping file> -o <output_idx>

The ID mapping file has two columns: the first is the UniProt ID, the second is the transcript ID. Once that is done, you could use something like

transvar panno --idmap <output_idx>

as usual.
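
To check why an accession returns only the header, and to prepare a two-column file for the index step, something like the following Python sketch might help. It is not part of TransVar. It assumes the UniProt file is tab-separated with three columns (accession, ID type, mapped ID) and that Ensembl transcript links carry the ID type label Ensembl_TRS. The function names, the output file name, and the tab delimiter for the two-column file are my own choices for illustration, so adjust them if your setup differs.

```python
import gzip
import sys
from typing import Iterable, Iterator, List, Tuple


def iter_transcript_links(
    lines: Iterable[str], id_type: str = "Ensembl_TRS"
) -> Iterator[Tuple[str, str]]:
    """Yield (uniprot_accession, transcript_id) pairs from idmapping.dat lines.

    Assumes three tab-separated columns: accession, ID type, mapped ID.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip lines that do not match the assumed layout
        accession, kind, mapped_id = fields
        if kind == id_type:
            yield accession, mapped_id


def linked_transcripts(
    lines: Iterable[str], accession: str, id_type: str = "Ensembl_TRS"
) -> List[str]:
    """Return the transcript IDs linked to exactly this accession string."""
    return [tx for acc, tx in iter_transcript_links(lines, id_type) if acc == accession]


def write_two_column_map(
    lines: Iterable[str], out_path: str, id_type: str = "Ensembl_TRS"
) -> int:
    """Write 'accession<TAB>transcript_id' rows and return how many were written."""
    count = 0
    with open(out_path, "w") as out:
        for acc, tx in iter_transcript_links(lines, id_type):
            out.write(f"{acc}\t{tx}\n")
            count += 1
    return count


def main(mapping_path: str, accession: str, out_path: str) -> None:
    # First pass: report what the mapping file links to the given accession.
    with gzip.open(mapping_path, "rt") as handle:
        print(accession, linked_transcripts(handle, accession))
    # Second pass: write the two-column file to pass to `transvar index --idmap`.
    with gzip.open(mapping_path, "rt") as handle:
        print("rows written:", write_two_column_map(handle, out_path))


# Runs only when called with arguments, e.g.
#   python check_idmap.py HUMAN_9606_idmapping.dat.gz Q6N069 custom_idmap.tsv
if __name__ == "__main__" and len(sys.argv) == 4:
    main(sys.argv[1], sys.argv[2], sys.argv[3])
```

An empty list for an accession means the file has no transcript link under that exact string, which is the situation described above for W5XKT8 and Q6N069. Trying the isoform-specific form (for example Q6N069-1) in the same lookup shows whether the link exists only under that accession. The transcript IDs written out still have to match the IDs in the transcript definition you use, including any version suffix.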

Let me know if you know a better way to map these IDs. Thanks!