Problem in EnsemblDB for gene_ids with lowercase characters and no gene name
manulera opened this issue · comments
Hello, first let me say I think transvar is great, and we are planning to use it in PomBase, the model organism database for fission yeast.
I think there is a bug / inconsistency that arises like this. Let's say we have a feature in a GTF file (real example from pombe genome), that has lowercase in the gene_id
and does not have a gene name
:
II PomBase gene 177469 180628 . - . gene_id "SPBC1198.04c"; gene_biotype "protein_coding";
I can create a database and retreive the gene:
db = AnnoDB(args, read_config())
print('in',list(db.get_gene('SPBC1198.04c')))
However, because of this:
Lines 193 to 195 in f7c17a8
Or this:
Lines 153 to 155 in f7c17a8
It's impossible to use it for a protein annotation such as SPBC1198.04c:p.N3A
, because the q.tok
is made uppercase and it does not exist in db
.
A possible fix for this particular case would be to do gene_id.upper()
here, although I am not sure it would break something else.
Lines 574 to 577 in f7c17a8
Minimal example to show the problem is below, and the files to reproduce:
data/pombe_genome.fa
: https://curation.pombase.org/dumps/latest_build/fasta/chromosomes/Schizosaccharomyces_pombe_all_chromosomes.fa.gz -o data/pombe_genome.fa.gzdata/pombe_genome.gtf
: pombe_genome.gtf.zip
Then, to set up the transvar config:
# env vars
export TRANSVAR_CFG="$(pwd)/data/transvar.cfg"
export TRANSVAR_DOWNLOAD_DIR="$(pwd)/data/transvar_download"
transvar config -k reference -v "data/pombe_genome.fa" --refversion pombe_genome
transvar index --ensembl data/pombe_genome.gtf --reference data/pombe_genome.fa
Then run the code that gives the error (transvar_main_script
below is bin/transvar from which I import the parser functions)
from transvar.annodb import AnnoDB
from transvar.config import read_config
import argparse
from transvar_main_script import parser_add_annotation, parser_add_mutation, parser_add_general
from transvar.anno import main_anno
from functools import partial
parser = argparse.ArgumentParser(description=__doc__)
subparsers = parser.add_subparsers()
p = subparsers.add_parser("panno", help='annotate protein element')
parser_add_annotation(p)
parser_add_mutation(p)
parser_add_general(p)
p.set_defaults(func=partial(main_anno, at='p'))
variant_type = 'panno'
variant_description = 'SPBC1198.04c:p.N3A'
args = parser.parse_args([variant_type, '-i', variant_description, '--ensembl', 'data/pombe_genome.gtf.transvardb', '--reference', 'data/pombe_genome.fa', '-v', '2'])
db = AnnoDB(args, read_config())
print('the gene is found in the db: ',list(db.get_gene('SPBC1198.04c')))
print('the gene is not found in the panno call')
args.func(args)
That solution is not great though, because then when printing the line, it will print the gene_id in uppercase, which is not ideal.
Instead, one could make get_gene try with and without uppercase?