cumc / xqtl-protocol

Molecular QTL analysis protocol developed by ADSP Functional Genomics Consortium

Home Page: https://cumc.github.io/xqtl-protocol/

gtf files different positions

SamuelSun2000 opened this issue

While building a list of each gene's start and end positions, I spotted an issue:
In Homo_sapiens.GRCh38.103.chr.gtf (https://www.synapse.org/#!Synapse:syn36419237), I searched for the gene ENSG00000223764. The result is:

grep "ENSG00000223764" Homo_sapiens.GRCh38.103.chr.gtf 
1	havana	gene	916865	921016	.	-	.	gene_id "ENSG00000223764"; gene_version "2"; gene_name "LINC02593"; gene_source "havana"; gene_biotype "lncRNA";

However, in Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf (https://www.synapse.org/#!Synapse:syn36419586), the result is:

grep "ENSG00000223764" Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf
chr1	havana	gene	916870	919692	.	-	.	gene_id "ENSG00000223764"; transcript_id "ENSG00000223764"; gene_type "lncRNA"; gene_name "LINC02593"; transcript_type "lncRNA"; transcript_name "LINC02593"; gene_version "2"; gene_source "havana";

Here we can see that the gene's coordinates differ between the two files. Some transcripts/exons in Homo_sapiens.GRCh38.103.chr.gtf have the same positions as the reformatted record, but the gene record does not.

A total of 692 genes have the same issue, so the two GTF files differ slightly. Why does this happen, and will it affect our downstream results?
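
For anyone who wants to reproduce the comparison, here is a minimal sketch of one way to list the genes whose coordinates differ between two GTF files. The helper names `gene_coords` and `differing_genes` are made up for illustration and are not part of the protocol; the two input lines are shortened versions of the grep results above.

```python
import re

# Matches the gene_id attribute in the ninth GTF column.
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def gene_coords(gtf_lines):
    """Map gene_id -> (start, end) for rows whose feature column is 'gene'."""
    coords = {}
    for line in gtf_lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        match = GENE_ID_RE.search(fields[8])
        if match:
            coords[match.group(1)] = (int(fields[3]), int(fields[4]))
    return coords


def differing_genes(coords_a, coords_b):
    """Return {gene_id: (coords_a, coords_b)} for shared genes whose coordinates differ."""
    return {
        gene_id: (coords_a[gene_id], coords_b[gene_id])
        for gene_id in coords_a.keys() & coords_b.keys()
        if coords_a[gene_id] != coords_b[gene_id]
    }


# Shortened copies of the two gene records shown above.
original = [
    '1\thavana\tgene\t916865\t921016\t.\t-\t.\tgene_id "ENSG00000223764"; gene_version "2";'
]
collapsed = [
    'chr1\thavana\tgene\t916870\t919692\t.\t-\t.\tgene_id "ENSG00000223764"; transcript_id "ENSG00000223764";'
]

diff = differing_genes(gene_coords(original), gene_coords(collapsed))
print(diff)
```

With the real files, the same functions can be fed open file handles, e.g. `gene_coords(open("Homo_sapiens.GRCh38.103.chr.gtf"))`. The comparison keys on `gene_id` only, so the `1` versus `chr1` chromosome naming does not matter.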

@gaow @hsun3163

The collapse_only.gene.gtf is constructed by https://github.com/broadinstitute/gtex-pipeline/blob/master/gene_model/collapse_annotation.py in the GTEx pipeline. The idea is that it takes the union of all start/end intervals for a gene, then uses the outermost boundaries as the start/end of the new entry, as shown in the following code:

   for g in annot.genes:
       if g.id in new_coord_dict:
           start_pos = str(np.min([i[0] for i in new_coord_dict[g.id]]))
           end_pos = str(np.max([i[1] for i in new_coord_dict[g.id]]))
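
To make the boundary step concrete, here is a small standalone sketch of the same min/max idea. It uses plain Python `min`/`max` instead of NumPy, and it assumes `new_coord_dict` maps a gene ID to a list of (start, end) intervals, as the indexing in the excerpt suggests. The gene ID and intervals below are toy values, not taken from either GTF file.

```python
def outer_bounds(intervals):
    """Return (smallest start, largest end) over a list of (start, end) intervals."""
    start = min(s for s, _ in intervals)
    end = max(e for _, e in intervals)
    return start, end


# Toy input: one hypothetical gene with three overlapping intervals.
new_coord_dict = {"geneA": [(100, 200), (150, 400), (350, 380)]}

bounds = {gene_id: outer_bounds(iv) for gene_id, iv in new_coord_dict.items()}
print(bounds)
```

Note that the result depends only on the intervals that are present in the dictionary. If it holds just a subset of a gene's original intervals, the computed span can be narrower than the gene record in the source GTF.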

Thanks @hsun3163, that is correct. @SamuelSun2000, you can use the collapsed model, so you get the min(start).