daler / gffutils

GFF and GTF file manipulation and interconversion

Home Page:http://daler.github.io/gffutils

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Inconsistent Ordering of exon_number attribute in introns

mbiokyle29 opened this issue · comments

Not sure this is technically a bug, but something that could cause issues. I noticed an inconsistency with the ordering of values in the exon_number attribute in introns (generated from create_introns). For most introns, the exon_number field will contain a list like ["1", "2"]. The idea being that this intron occurs after exon 1 and before exon 2, and is therefore "intron 1". This holds for all introns until the intron spanning exon 9 -> exon 10. The sorting is done using strings (at least based on how its formatted in an NCBI gtf). "10" will sort before "9", and thus the result will be ["10", "9"].

Example:

# all BCRA2 exons from GCF_000001405.39_GRCh38.p13_genomic.gtf 
NC_000013.11	BestRefSeq	exon	32315508	32315667	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "1"; 
NC_000013.11	BestRefSeq	exon	32316422	32316527	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "2"; 
NC_000013.11	BestRefSeq	exon	32319077	32319325	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "3"; 
NC_000013.11	BestRefSeq	exon	32325076	32325184	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "4"; 
NC_000013.11	BestRefSeq	exon	32326101	32326150	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "5"; 
NC_000013.11	BestRefSeq	exon	32326242	32326282	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "6"; 
NC_000013.11	BestRefSeq	exon	32326499	32326613	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "7"; 
NC_000013.11	BestRefSeq	exon	32329443	32329492	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "8"; 
NC_000013.11	BestRefSeq	exon	32330919	32331030	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "9"; 
NC_000013.11	BestRefSeq	exon	32332272	32333387	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "10"; 
NC_000013.11	BestRefSeq	exon	32336265	32341196	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "11"; 
NC_000013.11	BestRefSeq	exon	32344558	32344653	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "12"; 
NC_000013.11	BestRefSeq	exon	32346827	32346896	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "13"; 
NC_000013.11	BestRefSeq	exon	32354861	32355288	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "14"; 
NC_000013.11	BestRefSeq	exon	32356428	32356609	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "15"; 
NC_000013.11	BestRefSeq	exon	32357742	32357929	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "16"; 
NC_000013.11	BestRefSeq	exon	32362523	32362693	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "17"; 
NC_000013.11	BestRefSeq	exon	32363179	32363533	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "18"; 
NC_000013.11	BestRefSeq	exon	32370402	32370557	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "19"; 
NC_000013.11	BestRefSeq	exon	32370956	32371100	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "20"; 
NC_000013.11	BestRefSeq	exon	32376670	32376791	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "21"; 
NC_000013.11	BestRefSeq	exon	32379317	32379515	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "22"; 
NC_000013.11	BestRefSeq	exon	32379750	32379913	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "23"; 
NC_000013.11	BestRefSeq	exon	32380007	32380145	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "24"; 
NC_000013.11	BestRefSeq	exon	32394689	32394933	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "25"; 
NC_000013.11	BestRefSeq	exon	32396898	32397044	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "26"; 
NC_000013.11	BestRefSeq	exon	32398162	32400268	.	+	.	gene_id "BRCA2"; transcript_id "NM_000059.4"; db_xref "Ensembl:ENST00000380152.8"; db_xref "GeneID:675"; gene "BRCA2"; product "BRCA2 DNA repair associated"; tag "MANE Select"; transcript_biotype "mRNA"; exon_number "27"; 

Code:

import gffutils

db_filename = 'brca2_bug.db'
feature_db = gffutils.create_db(
    'brca2.gtf',
    db_filename,
    merge_strategy='warning',
)

introns = [f for f in feature_db.create_introns()]
[i.attributes['exon_number'] for i in introns]
Out[12]:
[['1', '2'],
 ['2', '3'],
 ['3', '4'],
 ['4', '5'],
 ['5', '6'],
 ['6', '7'],
 ['7', '8'],
 ['8', '9'],
 ['10', '9'],  # <---- this is the issue
 ['10', '11'], 
 ['11', '12'],
 ['12', '13'],
 ['13', '14'],
 ['14', '15'],
 ['15', '16'],
 ['16', '17'],
 ['17', '18'],
 ['18', '19'],
 ['19', '20'],
 ['20', '21'],
 ['21', '22'],
 ['22', '23'],
 ['23', '24'],
 ['24', '25'],
 ['25', '26'],
 ['26', '27']]

I think the issue is here:

return dict((k, sorted(set(v))) for k, v in new_d.items())

Might be tricky to get this to work in all cases, but seems like you could check str.isdigit on each item in v. If all are true, cast to int for the sorting, then cast back to string. Perhaps enable this via a flag.

Thanks for the very clear writeup and example -- this is fixed in #192.

Now addressed in v0.11.