zwdzwd / transvar

TransVar - multiway annotator for precision medicine

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Problem in EnsemblDB for gene_ids with lowercase characters and no gene name

manulera opened this issue · comments

Hello, first let me say I think transvar is great, and we are planning to use it in PomBase, the model organism database for fission yeast.

I think there is a bug / inconsistency that arises like this. Let's say we have a feature in a GTF file (real example from pombe genome), that has lowercase in the gene_id and does not have a gene name:

II	PomBase	gene	177469	180628	.	-	.	gene_id "SPBC1198.04c"; gene_biotype "protein_coding";

I can create a database and retreive the gene:

db = AnnoDB(args, read_config())
print('in',list(db.get_gene('SPBC1198.04c')))

However, because of this:

transvar/transvar/anno.py

Lines 193 to 195 in f7c17a8

q.tok = q.tok.upper()
genefound = False
for q.gene in db.get_gene(q.tok, args.strictversion):

Or this:

transvar/transvar/anno.py

Lines 153 to 155 in f7c17a8

q.tok = q.tok.upper()
genefound = False
for q.gene in db.get_gene(q.tok, args.strictversion):

It's impossible to use it for a protein annotation such as SPBC1198.04c:p.N3A, because the q.tok is made uppercase and it does not exist in db.

A possible fix for this particular case would be to do gene_id.upper() here, although I am not sure it would break something else.

if 'gene_name' in info:
g.name = info['gene_name'].upper()
else:
g.name = gene_id

Minimal example to show the problem is below, and the files to reproduce:

Then, to set up the transvar config:

# env vars
export TRANSVAR_CFG="$(pwd)/data/transvar.cfg"
export TRANSVAR_DOWNLOAD_DIR="$(pwd)/data/transvar_download"

transvar config -k reference -v "data/pombe_genome.fa" --refversion pombe_genome
transvar index --ensembl data/pombe_genome.gtf --reference data/pombe_genome.fa

Then run the code that gives the error (transvar_main_script below is bin/transvar from which I import the parser functions)

from transvar.annodb import AnnoDB
from transvar.config import read_config
import argparse
from transvar_main_script import parser_add_annotation, parser_add_mutation, parser_add_general
from transvar.anno import main_anno
from functools import partial

parser = argparse.ArgumentParser(description=__doc__)
subparsers = parser.add_subparsers()

p = subparsers.add_parser("panno", help='annotate protein element')
parser_add_annotation(p)
parser_add_mutation(p)
parser_add_general(p)
p.set_defaults(func=partial(main_anno, at='p'))

variant_type = 'panno'
variant_description = 'SPBC1198.04c:p.N3A'
args = parser.parse_args([variant_type, '-i', variant_description, '--ensembl', 'data/pombe_genome.gtf.transvardb', '--reference', 'data/pombe_genome.fa', '-v', '2'])

db = AnnoDB(args, read_config())

print('the gene is found in the db: ',list(db.get_gene('SPBC1198.04c')))

print('the gene is not found in the panno call')
args.func(args)

That solution is not great though, because then when printing the line, it will print the gene_id in uppercase, which is not ideal.

Instead, one could make get_gene try with and without uppercase?