gtf files different positions
SamuelSun2000 opened this issue · comments
While making a list of each gene's start and end positions, I spotted an issue:
In Homo_sapiens.GRCh38.103.chr.gtf (https://www.synapse.org/#!Synapse:syn36419237), I searched for the gene ENSG00000223764. The result is:
grep "ENSG00000223764" Homo_sapiens.GRCh38.103.chr.gtf
1 havana gene 916865 921016 . - . gene_id "ENSG00000223764"; gene_version "2"; gene_name "LINC02593"; gene_source "havana"; gene_biotype "lncRNA";
However, in Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf (https://www.synapse.org/#!Synapse:syn36419586), the result is:
grep "ENSG00000223764" Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf
chr1 havana gene 916870 919692 . - . gene_id "ENSG00000223764"; transcript_id "ENSG00000223764"; gene_type "lncRNA"; gene_name "LINC02593"; transcript_type "lncRNA"; transcript_name "LINC02593"; gene_version "2"; gene_source "havana";
Here we can see that the gene's start and end positions differ between the two files. Some transcripts/exons in Homo_sapiens.GRCh38.103.chr.gtf have the same positions as the reformatted entry, but the gene record itself does not.
A total of 692 genes have the same issue, meaning that the two GTF files differ slightly. Why does this happen, and will it impact our downstream results?
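For anyone who wants to reproduce the count, the comparison can be sketched roughly as below. This is a minimal sketch, not the script used for the 692 figure: the function names `gene_coords` and `differing_genes` are hypothetical, it assumes tab-separated GTF lines, and it matches genes by `gene_id` only, ignoring the `1` vs `chr1` naming difference.

```python
import re

# Pull the gene_id value out of the GTF attribute column.
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def gene_coords(lines):
    """Map gene_id -> (start, end) for every 'gene' feature in GTF lines."""
    coords = {}
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # skip blank, malformed, and non-gene records
        match = GENE_ID_RE.search(fields[8])
        if match:
            coords[match.group(1)] = (int(fields[3]), int(fields[4]))
    return coords


def differing_genes(lines_a, lines_b):
    """Return genes present in both inputs whose (start, end) differ."""
    a = gene_coords(lines_a)
    b = gene_coords(lines_b)
    return {g: (a[g], b[g]) for g in a.keys() & b.keys() if a[g] != b[g]}


# The two records quoted above, shortened to the attributes needed here.
original = [
    '1\thavana\tgene\t916865\t921016\t.\t-\t.\t'
    'gene_id "ENSG00000223764"; gene_version "2";\n'
]
collapsed = [
    'chr1\thavana\tgene\t916870\t919692\t.\t-\t.\t'
    'gene_id "ENSG00000223764"; transcript_id "ENSG00000223764";\n'
]
diff = differing_genes(original, collapsed)
print(diff)
```

Both functions accept any iterable of lines, so two open file handles can be passed in place of the lists; `len(diff)` then gives the number of genes whose coordinates disagree.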
The collapse_only.gene.gtf is built by https://github.com/broadinstitute/gtex-pipeline/blob/master/gene_model/collapse_annotation.py from the GTEx pipeline. The idea is that it takes the union of all start/end intervals for a gene, then uses the outermost boundaries as the start/end of the new entry, as shown in the following code:
for g in annot.genes:
    if g.id in new_coord_dict:
        start_pos = str(np.min([i[0] for i in new_coord_dict[g.id]]))
        end_pos = str(np.max([i[1] for i in new_coord_dict[g.id]]))
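To make the min/max step concrete, here is a small standalone sketch of the same idea. It is not taken from collapse_annotation.py: `collapsed_span` is a hypothetical helper, it uses the built-in `min`/`max` instead of NumPy, and the intervals are made-up values chosen only so that the result matches the collapsed coordinates quoted above.

```python
def collapsed_span(intervals):
    """Return the (start, end) span covering every (start, end) interval."""
    start = min(s for s, _ in intervals)  # leftmost start
    end = max(e for _, e in intervals)    # rightmost end
    return start, end


# Hypothetical intervals kept for one gene (illustrative values only).
kept = [(916870, 917520), (918050, 918800), (919100, 919692)]
span = collapsed_span(kept)
print(span)
```

Under this logic the new gene boundary can only be as wide as the intervals that feed into it. If the intervals kept for a gene do not reach the original gene start or end, the collapsed entry will be narrower than the gene record in the source GTF, which would be consistent with the difference reported for ENSG00000223764.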
Thanks @hsun3163, that is correct. @SamuelSun2000, you can use the collapsed model, so you get the min(start).